Understanding Multiscale CNNs with Brain Tumours

Dev Shah
11 min read · May 30, 2023

hey everyone, my name is Dev and I’m going to be reviewing and breaking down some of the most interesting + new topics/papers in AI and Machine Learning. Before jumping into the paper, I’d also like to introduce myself. I just finished my freshman year studying Computer Science at the University of Toronto. Outside of that, I’m currently a Machine Learning Researcher at an ML lab run by Dr. Pascal Tyrrell. This is also my first attempt at trying something like this, so I would love any and all feedback I can get; you can find me on LinkedIn here or leave a comment below :)

The paper that this post is based on is titled “Multiscale CNNs for Brain Tumour Segmentation and Diagnosis”. The main idea of a multiscale convolutional neural network is to analyze different parts of an image at different scales and combine that information to make a more accurate prediction. Overall, the multiscale CNN framework tackles the problem of accurately segmenting objects in images by extracting both local and global features. Traditional CNNs focus only on local features, which can lead to the loss of important global information. Now that I’ve given a brief overview, let’s jump right into the paper!

paper review.

The main focus of the paper was the application of multiscale CNNs to brain tumour segmentation and diagnosis. The images used are MRI scans from the well-known BraTS dataset, which is often used as a benchmark to evaluate how effective a model is. Before jumping into the shortcomings of previous machine learning approaches, let me give a quick overview of brain tumours.

What are brain tumours?

Brain Tumour MRIs!

Brain tumours are characterized by the uncontrolled proliferation of abnormal cells, forming solid masses in various regions of the brain. These tumours can be classified into two categories: malignant and benign. Malignant tumours encompass both primary tumours and metastatic tumours. Among adults, glioma tumours are particularly prevalent, and individuals diagnosed with high-grade gliomas typically have a life expectancy of no more than about two years.

How is it currently being treated?

At the moment, there are many options for how brain tumours are treated. The most common include surgery, radiation, and chemotherapy. Surgery is an attempt at resecting and curing the tumour, while radiation and chemotherapy are often used to simply slow down the tumour’s growth. This is where the application of multiscale CNNs comes in. Early diagnosis of a brain tumour is extremely important so that treatment can begin before it’s too late. Moreover, accurate localization and segmentation of the tumour tells doctors exactly where they need to plan the surgery.

What’s been done so far?

Extensive research spanning several decades has been dedicated to finding efficient approaches for brain tumour segmentation. However, despite these ongoing efforts, a flawless algorithm remains elusive. Many of the previously employed techniques relied on conventional machine learning methods or on segmentation approaches designed for other anatomical structures. Unfortunately, these methods often depended on manually crafted features and yielded suboptimal results when faced with diffuse and poorly contrasted tumour surroundings. Approaches based on hand-designed features require the computation of numerous features, primarily focusing on edge-related information, while overlooking the broader contextual understanding of the image. Consequently, these traditional methods encounter limitations when it comes to accurately detecting brain tumours in MRI images.

What’s the current problem?

In recent times, Convolutional Neural Networks (CNNs) have emerged as a promising approach in the field of brain tumour segmentation. CNNs have demonstrated their ability to learn hierarchical representations of complex features, yielding promising results in various domains such as the MNIST dataset and mitosis detection. Their application to segmentation problems has also shown some degree of success. However, a notable limitation of CNNs lies in their tendency to focus solely on local textural features, which can inadvertently lead to the exclusion of crucial global information.

CNN architecture!

While CNNs excel at capturing intricate patterns within localized regions, their reliance on local features often disregards the larger context present in the image. This limitation becomes particularly problematic when dealing with brain tumour segmentation, as tumours can exhibit intricate relationships and interactions with their surrounding tissues. The failure to consider these global cues can result in incomplete or inaccurate tumour delineation.

This is where the proposed Multiscale CNN model comes into play.

understanding multiscale CNNs.

Traditional CNNs, also known as single-scale CNNs, mainly focus on local features and, as explained previously, often struggle to capture global ones. The entire idea of multiscale CNNs is to extend the capabilities of traditional CNNs by incorporating multiple scales of information. By leveraging a range of receptive field sizes, multiscale CNNs are designed to capture both local and global features simultaneously. This enables them to better understand the context and relationships within an image or dataset.

The concept behind multiscale CNNs revolves around the notion that different features and patterns exist at various scales. While traditional CNNs excel at detecting local details, they may overlook larger structures or contextual information that can significantly impact the overall understanding of an image. Multiscale CNNs address this limitation by employing multiple layers with varying receptive field sizes, allowing them to analyze information at different levels of granularity.

By integrating these multiple scales, multiscale CNNs can capture both fine-grained details and broader contextual information. This ability enhances their capacity to recognize complex patterns, objects, and relationships within images or datasets. By considering a wider range of scales, these networks gain a more comprehensive understanding of the data, enabling them to make more informed predictions or classifications.

That was a very brief and basic breakdown of the general idea of Multiscale CNNs, now let’s jump into the very specifics!

drawbacks & limitations of traditional CNNs.

For those of you who may be new to AI, you may not be familiar with the MNIST dataset, but it’s one of the most well-known datasets in the AI space and is often used as a benchmark for image classification models. For a typical CNN, after the input layer, there are 2 main parts: feature extraction & classification.

Feature Extraction & Classification

Basic Framework!

This phase comprises a series of convolutional layers and pooling layers, working together to learn and extract hierarchical features from the input image. The convolutional layer operates by applying a set of filters to the input image, scanning across the entire image and generating a feature map. This convolution process enables the network to capture local patterns and spatial relationships within the image.

Following the convolutional layer, the pooling layer comes into play. Its primary function is to downsample the feature map by selecting either the maximum or the average value within each local region. This downsampling step reduces the dimensionality of the feature map while preserving significant information. This process of convolution and pooling is repeated multiple times in the network, allowing each subsequent layer to learn increasingly intricate and abstract features from the output of the previous layer. The network gradually gains the ability to recognize more complex patterns and hierarchical representations within the image.

Once the feature extraction phase is complete, the resulting features are passed to a fully connected layer. This layer performs classification or regression based on the learned features, using them as input for decision-making and prediction.
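
To make this concrete, here’s a minimal sketch of that feature extraction + classification pipeline in PyTorch. This isn’t code from the paper; the layer counts, channel numbers, and kernel sizes are illustrative assumptions chosen so the shapes work out for an MNIST-sized 28 * 28 input.

```python
import torch
import torch.nn as nn

# Minimal single-pathway CNN: convolution + pooling for feature
# extraction, followed by a fully connected layer for classification.
# All layer sizes here are illustrative, not taken from the paper.
class SimpleCNN(nn.Module):
    def __init__(self, in_channels=1, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3),  # filters scan the image -> feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                            # downsample, keep the strongest responses
            nn.Conv2d(16, 32, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 5 * 5, num_classes)  # sized for 28x28 inputs

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)   # one-dimensional feature vector per image
        return self.classifier(x)

# e.g. a batch of MNIST-sized images: (batch, channels, 28, 28)
logits = SimpleCNN()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```

The pattern is always the same: convolutions build local feature maps, pooling shrinks them, and a fully connected layer turns the flattened features into class scores.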

So what’s the problem?

The issue comes in when working with different sizes of patches as inputs; for example, a typical MNIST model takes 28 * 28 inputs, whereas models trained on ImageNet typically take 224 * 224. Take a look at these 2 figures below:

From figure 1, we can see that traditional CNNs work particularly well for feature detection and recognition. The same applies to figure 2: the network does well at identifying where the local features are, but struggles with global features. This allows CNNs to flourish in classification problems where the network only needs to look at one specific region. However, the network falls short when dealing with variable-region problems; specifically, problems where the object being identified may come in different shapes and appear in different regions. In the context of this paper, the goal is to identify and classify brain tumours, and tumours can appear in different sizes, shapes, and in different areas of the brain. Hence, multiscale CNNs were proposed in this paper.

architecture of multiscale CNNs.

The proposed model involves taking images at 3 different scales: an initial scale of 48 * 48 and 2 subsequent scales of 28 * 28 and 12 * 12. These aren’t just randomly chosen scales; they are chosen through an automatic selection of the proper image patch size. 1% of the training 2D slices are randomly selected, and image patches at different scales are extracted in order to find the top-3 patch sizes. Here’s a quick visual:

Visualization of image patch size selection.

These patch sizes won’t remain the same when the dataset changes. The same procedure is applied whenever the dataset changes, ensuring adaptability to different data distributions. By dynamically selecting the proper image patch sizes based on the dataset, the proposed model remains flexible and effective across various scenarios.
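
Once the patch sizes are fixed, the model needs patches of all three sizes centred on the same pixel of a 2D slice. Here’s a rough sketch of how that cropping could look; this is not the paper’s code, and both the helper name extract_multiscale_patches and the zero-padding used for pixels near the border are my own assumptions.

```python
import numpy as np

def extract_multiscale_patches(slice_2d, y, x, sizes=(48, 28, 12)):
    """Crop patches of several sizes, all centred on pixel (y, x).

    A sketch only: the paper's exact cropping/padding rules aren't
    spelled out in the post, so border handling here is an assumption.
    """
    patches = []
    for s in sizes:
        half = s // 2
        # Pad the slice so patches near the border keep the right size.
        padded = np.pad(slice_2d, half, mode="constant")
        yy, xx = y + half, x + half  # pixel coordinates in the padded slice
        patches.append(padded[yy - half: yy - half + s,
                              xx - half: xx - half + s])
    return patches

slice_2d = np.random.rand(240, 240)            # e.g. one BraTS-sized 2D slice
p48, p28, p12 = extract_multiscale_patches(slice_2d, y=120, x=100)
print(p48.shape, p28.shape, p12.shape)          # (48, 48) (28, 28) (12, 12)
```

Each pixel to be classified therefore gets a small, medium, and large view of its neighbourhood, which is exactly what feeds the three pathways described next.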

Now, before diving into the specifics of the model, let’s take a look at what the entire multiscale CNN model looks like:

Multi-scale CNN model!

As you can see above, there are 3 separate inputs, one for each of the scales as aforementioned. The proposed model employs a combination of different scales of features to predict the classification of each pixel. This fusion of local and global information allows for a comprehensive understanding of the context surrounding the pixel. By considering features at multiple scales, the model takes into account both the local neighborhood around the pixel and the broader global region. This holistic approach enhances the model’s ability to capture fine-grained details and intricate patterns within the local context, while also incorporating the broader context of the entire image.

By leveraging this combination of local and global features, the model achieves a more robust and accurate prediction for each pixel’s classification. The local features provide detailed information about the pixel’s immediate surroundings, enabling precise identification of local patterns and structures. Simultaneously, the global features offer a broader perspective, capturing the overall context and relationships between different regions in the image. This integration of multiple scales of features allows the model to effectively capture both local and global cues, resulting in a more comprehensive understanding of the image and improved prediction accuracy.

A closer look…

Let’s consider an example using a block size of 48 * 48. In this case, the authors design a seven-layer architecture, comprising an input layer, five convolutional layers (C1, C2, C3, C4, and C5), and one max pooling layer. These convolutional layers serve as fundamental components within the CNN framework, forming a hierarchical structure of features.

Each convolutional layer utilizes a filter kernel size of 11 * 11, 11 * 11, 11 * 11, 11 * 11, and 5 * 5, respectively. Starting with the first convolutional layer, C1, different modalities of MRI image patches are taken as inputs, generating 12 feature maps with a size of 38 * 38. The output of C1 then becomes the input for C2. The computation of a feature map can be expressed as:

F_s = Σ_c (W_sc * X_c) + b_s   (1)

In Equation (1), F_s is the s-th output feature map, X_c represents the input channel (corresponding to each modality), W_sc is the kernel specific to that channel, * denotes the convolution operation, and b_s denotes the bias term. After passing through the five convolutional layers, a max pooling operation is performed. This operation selects the maximum feature value within subwindows of each feature map, effectively reducing the size of the corresponding feature map. This subsampling introduces the property of invariance to local translations.

The weights of the three pathways within the network are learned separately and then combined to create a three-pathway network. The success of the proposed multiscale CNNs can be attributed to these data-driven and task-specific dense feature extractors. By leveraging these architectural elements, the proposed model is capable of extracting rich and discriminative features from the input data, enabling effective representation learning for the given task.
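
As a sanity check on the dimensions above, here’s a rough PyTorch sketch of the 48 * 48 pathway. The kernel sizes (11, 11, 11, 11, 5) and C1’s 12 feature maps of size 38 * 38 come from the description above; the 4 input modalities, the number of maps in the later layers, and the pooling window are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the 48x48 pathway: five convolutional layers (kernel sizes
# 11, 11, 11, 11, 5) followed by a max pooling step. Channel counts
# after C1 are illustrative assumptions, not the paper's values.
pathway_48 = nn.Sequential(
    nn.Conv2d(4, 12, kernel_size=11),   # C1: 48x48 -> 38x38, 12 maps (from the paper)
    nn.ReLU(),
    nn.Conv2d(12, 24, kernel_size=11),  # C2: 38x38 -> 28x28
    nn.ReLU(),
    nn.Conv2d(24, 48, kernel_size=11),  # C3: 28x28 -> 18x18
    nn.ReLU(),
    nn.Conv2d(48, 96, kernel_size=11),  # C4: 18x18 -> 8x8
    nn.ReLU(),
    nn.Conv2d(96, 128, kernel_size=5),  # C5: 8x8 -> 4x4
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),        # subsample the 4x4 maps -> 2x2
)

x = torch.randn(1, 4, 48, 48)           # one multi-modal 48x48 patch
print(pathway_48(x).shape)              # torch.Size([1, 128, 2, 2])
```

Each 11 * 11 convolution shaves 10 pixels off each spatial dimension (48 → 38 → 28 → 18 → 8), the final 5 * 5 convolution brings it down to 4 * 4, and max pooling then subsamples those maps.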

So what next?

After extracting task-specific hierarchical features from different pathways, the outputs of these pathways are combined as input to a fully connected layer. This layer is responsible for the final classification of the central pixel in the input image patch. The learned hierarchical features from all three pathways are arranged in a one-dimensional format and utilized collectively for patch classification.

Here’s a breakdown of the procedure:

  • For the first channel (with an image patch size of 48 * 48), after the max pooling step the output consists of 1024 features, each with a shape of 4 * 1.
  • For the second channel (with an image patch size of 28 * 28), the output consists of 72 features, each of size 4 * 4.
  • In the third channel (with an image patch size of 12 * 12), the output consists of 16 features, each of size 4 * 4.

To combine these outputs, which differ in both feature size and number, a new one-dimensional vector is created. This vector contains all the features from the three channels and has a size of 5504, the sum of the flattened features from each channel (1024 × 4 + 72 × 16 + 16 × 16 = 5504). This entire model was then applied to the brain tumour dataset of MRI images.
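
To show how that fusion step might look in code, here’s a minimal three-pathway sketch in PyTorch. The placeholder pathways, the tiny_pathway helper, and the choice of 5 output classes are my own assumptions; only the overall pattern (flatten each pathway, concatenate, classify with a fully connected layer) mirrors the description above, where the real concatenated vector has length 5504.

```python
import torch
import torch.nn as nn

class MultiscaleCNN(nn.Module):
    """Three-pathway fusion sketch: flatten each pathway's features,
    concatenate them into one vector, and classify the centre pixel.
    Pathway architectures here are placeholders, not the paper's."""

    def __init__(self, pathways, fused_dim, num_classes=5):
        super().__init__()
        self.pathways = nn.ModuleList(pathways)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, patches):  # patches: one tensor per scale
        feats = [torch.flatten(p(x), 1) for p, x in zip(self.pathways, patches)]
        fused = torch.cat(feats, dim=1)   # local + global features side by side
        return self.classifier(fused)

# Tiny placeholder pathways, one per patch size (48, 28, 12).
def tiny_pathway():
    return nn.Sequential(nn.Conv2d(4, 8, kernel_size=5), nn.ReLU(),
                         nn.AdaptiveMaxPool2d(4))  # 8 maps of 4x4 each

model = MultiscaleCNN([tiny_pathway() for _ in range(3)],
                      fused_dim=3 * 8 * 4 * 4)
patches = [torch.randn(1, 4, s, s) for s in (48, 28, 12)]
print(model(patches).shape)  # torch.Size([1, 5])
```

Because each pathway is its own module, the per-pathway weights can be learned separately and then combined, matching the training strategy described earlier.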

final thoughts.

In the paper, the authors used the well-known BraTS dataset as a benchmark to assess how effective their model was. The 2 best-performing algorithms were selected for comparison. On average, the performance of traditional one-pathway CNNs with patch sizes of 48 * 48, 28 * 28, and 12 * 12 consistently lags behind the top two methods. However, among these sizes, CNNs with a patch size of 28 * 28 still outperform Bauer 2012, indicating that this patch size is a suitable choice. Although the mean score of the multiscale CNNs is slightly lower than that of the best method (0.81 compared to 0.82), the approach demonstrates comparable stability (variance of 0.099 versus 0.095).

When compared to the second-best method, Menze, the method achieves higher accuracy (0.81 versus 0.78), but is relatively less stable (0.099 versus 0.06). This discrepancy may arise from the absence of a specific preprocessing step before training the CNNs.

Overall, while the method may not surpass the top-performing approach in terms of mean score, it maintains competitive stability. The results highlight the importance of considering proper preprocessing steps to further enhance the accuracy and stability of multiscale CNNs.

That marks the end of this post, it was my first attempt at something like this, so I would greatly appreciate any and all feedback! The main goal of this publication is to transform my research into content for the public, enabling others to learn from it. If you found this helpful, please consider subscribing below! Looking forward to seeing you in the next one :)

Thank you for taking the time to read my post, if you found it valuable and would love to chat more, feel free to reach out to me on LinkedIn or check out my Personal Website!
