build a multiscale CNN with me! a step-by-step guide to implementing a multiscale CNN!
In recent years, the field of computer vision has been taken by storm by Convolutional Neural Networks (CNNs) and, more recently, Vision Transformers (P.S. I wrote an article about Vision Transformers, read it here). But alongside that, there have been many developments in Convolutional Neural Networks that allow these models to capture more complex and nuanced features in images.
One such development is the Multiscale Convolutional Neural Network (Multiscale CNN). This innovative approach addresses a fundamental limitation of traditional CNNs: their struggle to simultaneously capture both fine-grained details and global context in images. While traditional CNNs excel at detecting local features, they often fall short when it comes to understanding the broader context or identifying objects that appear at various scales within an image.
The idea of a Multiscale CNN actually came from this paper called “Multiscale CNNs for Brain Tumour Segmentation and Diagnosis”. The main idea of multiscale convolutional neural networks is to analyze different parts of the image at different scales and combine that information to make a more accurate prediction. Overall, the multiscale CNN framework solves the problem of accurately segmenting objects in images by extracting both local and global features. Traditional CNNs only focus on local features, which can lead to the loss of important global information. Now that we have a rough understanding of the motivation behind using multiscale CNNs, let’s understand how they work and implement them :)
understanding multiscale CNNs.
Traditional CNNs are also known as single scale CNNs and they mainly focus on local features and sometimes struggle to capture global features as explained previously. The entire idea of multiscale CNNs is to extend the capabilities of traditional CNNs by incorporating multiple scales of information. By leveraging a range of receptive field sizes, multiscale CNNs are designed to capture both local and global features simultaneously. This enables them to better understand the context and relationships within an image or data.
The concept behind multiscale CNNs revolves around the notion that different features and patterns exist at various scales. While traditional CNNs excel at detecting local details, they may overlook larger structures or contextual information that can significantly impact the overall understanding of an image. Multiscale CNNs address this limitation by employing multiple layers with varying receptive field sizes, allowing them to analyze information at different levels of granularity.
By integrating these multiple scales, multiscale CNNs can capture both fine-grained details and broader contextual information. This ability enhances their capacity to recognize complex patterns, objects, and relationships within images or datasets. By considering a wider range of scales, these networks gain a more comprehensive understanding of the data, enabling them to make more informed predictions or classifications.
That was a very brief and basic breakdown of the general idea of Multiscale CNNs, now let’s jump into the very specifics!
drawbacks & limitations of traditional CNNs.
For those of you who may be new to AI, you may not be familiar with the MNIST dataset, but it’s one of the most famous datasets in the AI space and often used as a benchmark for many image classification models. For a typical CNN, after the input layer, there are 2 main parts: feature extraction & classification.
Feature Extraction & Classification
This phase comprises a series of convolutional layers and pooling layers, working together to learn and extract hierarchical features from the input image. The convolutional layer operates by applying a set of filters to the input image, scanning across the entire image and generating a feature map. This convolution process enables the network to capture local patterns and spatial relationships within the image.
Following the convolutional layer, the pooling layer comes into play. Its primary function is to downsample the feature map by either selecting the maximum or average value within each local region. This downsampling step reduces the dimensionality of the feature map while preserving significant information. This process of convolution and pooling is repeated multiple times in the network, allowing each subsequent layer to learn increasingly intricate and abstract features from the output of the previous layer. The network gradually gains the ability to recognize more complex patterns and hierarchical representations within the image. Once the feature extraction phase is complete, the resulting features are passed to a fully connected layer. This layer performs classification or regression tasks based on the learned features, using them as input for decision-making and prediction.
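To make that concrete, here’s a tiny PyTorch snippet showing a single convolution + pooling step and how the shapes change (the channel counts and kernel size here are arbitrary picks for illustration):

```python
import torch
import torch.nn as nn

# Toy example: one convolution + max pooling step on a 28 x 28 grayscale image
x = torch.randn(1, 1, 28, 28)  # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

features = conv(x)             # -> (1, 8, 28, 28): 8 feature maps of local patterns
downsampled = pool(features)   # -> (1, 8, 14, 14): spatial size halved, key info kept
print(features.shape, downsampled.shape)
```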
So what’s the problem?
The issue comes in when working with different sizes of patches as inputs; for example, models trained on the MNIST dataset usually take 28 * 28 inputs, whereas models trained on ImageNet typically take 224 * 224 inputs. Take a look at these 2 figures below:
From figure 1, we can see that traditional CNNs work particularly well on feature detection and recognition. The same applies to figure 2: the network does well at identifying where the local features are, but struggles with global features. This allows CNNs to flourish on classification problems where the network only needs to classify one specific region. However, the network falls short when dealing with variable-region problems, specifically problems where the object it is trying to identify may come in different shapes and appear in different regions. In the context of this paper, the authors are trying to identify and classify brain tumours, and tumours can appear in different sizes, shapes and in different areas of the brain as well. This is why multiscale CNNs were proposed for this particular paper.
architecture of multiscale CNNs.
The proposed model involves taking images at 3 different scales, the initial scale being 48 * 48 and the 2 subsequent scales being 28 * 28 and 12 * 12. These aren’t just randomly chosen scales; they are chosen through an automatic selection of the proper image patch size. 1% of the training 2D slice data is randomly selected and different scales of image patches are extracted in order to get the top-3 patch sizes. Here’s a quick visual:
These patches won’t remain consistent when the datasets change. This same procedure would be applied whenever the dataset changes, ensuring adaptability to different data distributions. By dynamically selecting the proper image patch sizes based on the dataset, the proposed model remains flexible and effective across various scenarios.
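To give a rough feel for that procedure, here’s an illustrative sketch in Python. The sampling and top-3 selection follow the description above, but the variance-based scoring is purely a placeholder, not the paper’s actual selection criterion:

```python
import random
import numpy as np

def select_patch_sizes(slices, candidate_sizes=(12, 20, 28, 36, 48),
                       top_k=3, sample_frac=0.01):
    """Illustrative sketch only: sample ~1% of the training 2D slices, extract a
    patch of each candidate size from every sampled slice, score each size, and
    keep the top-k sizes. The variance-based score is a placeholder criterion."""
    sampled = random.sample(list(slices), max(1, int(len(slices) * sample_frac)))
    scores = {}
    for size in candidate_sizes:
        patch_scores = []
        for s in sampled:
            h, w = s.shape[:2]
            if h < size or w < size:
                continue
            y, x = random.randint(0, h - size), random.randint(0, w - size)
            patch = s[y:y + size, x:x + size]
            patch_scores.append(patch.var())   # placeholder scoring criterion
        scores[size] = float(np.mean(patch_scores)) if patch_scores else 0.0
    # Keep the top-k candidate patch sizes by score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```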
Now before diving into the specifics of the model, let’s take a look at what the entire multiscale CNN model looks like
As you can see above, there are 3 separate inputs, one for each of the scales mentioned earlier. The proposed model employs a combination of different scales of features to predict the classification of each pixel. This fusion of local and global information allows for a comprehensive understanding of the context surrounding the pixel. By considering features at multiple scales, the model takes into account both the local neighborhood around the pixel and the broader global region. This holistic approach enhances the model’s ability to capture fine-grained details and intricate patterns within the local context, while also incorporating the broader context of the entire image.
By leveraging this combination of local and global features, the model achieves a more robust and accurate prediction for each pixel’s classification. The local features provide detailed information about the pixel’s immediate surroundings, enabling precise identification of local patterns and structures. Simultaneously, the global features offer a broader perspective, capturing the overall context and relationships between different regions in the image. This integration of multiple scales of features allows the model to effectively capture both local and global cues, resulting in a more comprehensive understanding of the image and improved prediction accuracy.
Let’s take a closer look at an example
Let’s consider an example using a block size of 48 * 48. In this case, we design a seven-layer architecture, comprising an input layer, five convolutional layers (C1, C2, C3, C4, and C5), and one max pooling layer. These convolutional layers serve as fundamental components within the CNN framework, forming a hierarchical structure of features.
The five convolutional layers use filter kernels of size 11 * 11, 11 * 11, 11 * 11, 11 * 11, and 5 * 5, respectively. Starting with the first convolutional layer, C1, the different modalities of MRI image patches are taken as inputs, generating 12 feature maps with a size of 38 * 38. The output of C1 then becomes the input for C2. The computation of a feature map can be expressed as:
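Written out using the terms defined just below, Equation (1) takes roughly this form:

$$F_s = \sum_{c} W_{sc} * X_c + b_s$$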
In Equation (1), X_c represents the input channel (corresponding to each modality), W_sc is the kernel specific to that channel, * denotes the convolution operation, and b_s denotes the bias term. After passing through the five convolutional layers, a max pooling operation is performed. This operation selects the maximum feature value within subwindows of each feature map, effectively reducing the size of the corresponding feature map. This subsampling introduces the property of invariance to local translations.
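To see how those kernel sizes shrink a 48 * 48 patch down to 4 * 4, here’s a small PyTorch sketch of the convolutional stack. Only C1’s 12 feature maps are stated above, so the later channel counts (and the assumption of 4 MRI modalities as input channels) are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of the 48 x 48 pathway's convolutional stack (no padding), showing how
# the kernel sizes reduce the spatial dimensions at each layer.
pathway = nn.Sequential(
    nn.Conv2d(4, 12, kernel_size=11),   # C1: 48x48 -> 38x38, 12 feature maps
    nn.Conv2d(12, 24, kernel_size=11),  # C2: 38x38 -> 28x28 (channels illustrative)
    nn.Conv2d(24, 48, kernel_size=11),  # C3: 28x28 -> 18x18
    nn.Conv2d(48, 96, kernel_size=11),  # C4: 18x18 -> 8x8
    nn.Conv2d(96, 96, kernel_size=5),   # C5: 8x8 -> 4x4
)

x = torch.randn(1, 4, 48, 48)           # one patch, 4 modality channels assumed
print(pathway(x).shape)                 # torch.Size([1, 96, 4, 4])
```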
The weights of the three pathways within the network are learned separately and then combined to create a three-pathway network. The success of our proposed multiscale CNNs can be attributed to these data-driven and task-specific dense feature extractors. By leveraging these architectural elements, the proposed model is capable of extracting rich and discriminative features from the input data, enabling effective representation learning for the given task.
So what next?
After extracting task-specific hierarchical features from different pathways, the outputs of these pathways are combined as input to a fully connected layer. This layer is responsible for the final classification of the central pixel in the input image patch. The learned hierarchical features from all three pathways are arranged in a one-dimensional format and utilized collectively for patch classification.
Here’s a breakdown of the procedure:
- For the first channel (with an image patch size of 48 * 48), after the max pooling step the resulting output consists of 1024 features, each with a shape of 4 * 1.
- For the second channel (with an image patch size of 28 * 28), the output consists of 72 features of size 4 * 4.
- For the third channel (with an image patch size of 12 * 12), the output consists of 16 features of size 4 * 4.
To combine these outputs, which differ in both feature size and number, a new one-dimensional vector is created. This vector contains all the features from the three channels and has a size of 5504, which is the sum of the features from each channel. And this is the entire architecture from a theoretical perspective. Now that we understand how a multiscale CNN works, let’s dive into the code implementation of it!
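As a quick sanity check of that arithmetic (4096 + 1152 + 256 = 5504), here’s a tiny PyTorch snippet where random tensors stand in for the three pathway outputs:

```python
import torch

# Placeholder tensors with the shapes described above (batch size of 1)
path1 = torch.randn(1, 1024, 4, 1)   # 48x48 pathway: 1024 features of shape 4x1
path2 = torch.randn(1, 72, 4, 4)     # 28x28 pathway: 72 features of shape 4x4
path3 = torch.randn(1, 16, 4, 4)     # 12x12 pathway: 16 features of shape 4x4

# Flatten each pathway and concatenate into a single one-dimensional vector
combined = torch.cat([path1.flatten(1), path2.flatten(1), path3.flatten(1)], dim=1)
print(combined.shape)  # torch.Size([1, 5504]) -> 4096 + 1152 + 256
```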
implementing a multiscale CNN
The typical architecture of a traditional CNN looks a little like this:
- Input Layer: This is where the raw image data is fed into the network. The images are typically represented as a 3D matrix with dimensions corresponding to the height, width, and the number of color channels (e.g., RGB).
- Convolutional Layer: This layer applies a set of convolutional filters to the input image. Each filter slides over the input data, performing element-wise multiplications and summing the results, which helps in detecting spatial hierarchies in the data, such as edges, textures, or patterns. The result is a set of feature maps.
- Activation Function (ReLU): After each convolutional layer, an activation function, typically the Rectified Linear Unit (ReLU), is applied. ReLU introduces non-linearity into the model, enabling it to learn more complex patterns.
- Pooling Layer: This layer reduces the spatial dimensions (height and width) of the feature maps, usually by applying a max pooling operation. Pooling helps in reducing the computational load and the number of parameters, and also makes the network invariant to small translations of the input.
- Stacking Convolutional and Pooling Layers: Multiple convolutional and pooling layers are stacked to create a deep network. Each layer extracts higher-level features from the input data, progressively capturing more complex and abstract representations.
- Fully Connected Layer: After several convolutional and pooling layers, the 3D feature maps are flattened into a 1D vector, which is then fed into one or more fully connected (dense) layers. These layers learn to combine the features extracted by the convolutional layers to make predictions.
- Output Layer: The final fully connected layer outputs the predictions of the network. For classification tasks, this layer typically uses a softmax activation function to produce a probability distribution over the possible classes.
And if we were to look at this in code, this is roughly what it would look like. We would define our convolutional and pooling layers, stack them together, and have a couple of fully connected layers before producing the final output.
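Here’s a minimal sketch of that traditional, single-scale architecture in PyTorch; the layer sizes and the 28 * 28 input assumption (e.g. MNIST) are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """A minimal single-scale CNN sketch: stacked conv + pool blocks
    followed by a couple of fully connected layers."""

    def __init__(self, in_channels=1, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(32 * 7 * 7, 128)   # assumes 28 x 28 inputs
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 28x28 -> 14x14
        x = self.pool(F.relu(self.conv2(x)))  # 14x14 -> 7x7
        x = x.flatten(1)                      # flatten feature maps to a vector
        x = F.relu(self.fc1(x))
        return self.fc2(x)                    # class scores (softmax applied in the loss)
```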
Building off of this idea, multiscale CNNs introduce an important modification to the traditional CNN architecture. The key concept behind multiscale CNNs is to process the input image at multiple scales or resolutions simultaneously. This approach allows the network to capture both fine-grained details and global context, which can be particularly useful for tasks that require understanding of both local and global features.
Here’s how a multiscale CNN typically extends the traditional CNN architecture:
- Multiple Input Streams: Instead of a single input layer, a multiscale CNN often has multiple input streams, each processing the image at a different scale. For example: Stream 1: Original resolution image, Stream 2: Downsampled version of the image (e.g., 1/2 resolution), Stream 3: Further downsampled version (e.g., 1/4 resolution)
- Parallel Convolutional Layers: Each input stream is processed by its own set of convolutional layers. These parallel streams allow the network to extract features at different scales simultaneously.
- Scale-specific Feature Extraction: The convolutional layers in each stream can be tailored to extract features appropriate for that particular scale. For instance, finer scales might focus on detailed textures, while coarser scales might capture overall shapes and layouts.
- Feature Fusion: At some point in the network, features from different scales are combined. This can be done in various ways: Upsampling lower resolution features to match the highest resolution. Using skip connections to combine features from different scales. Employing attention mechanisms to weight the importance of features from different scales
- Continued Processing: After fusion, the combined features may undergo further convolutional and pooling operations to integrate information across scales.
- Output Layers: Similar to traditional CNNs, the final stages typically involve fully connected layers leading to the output predictions.
In code, a basic multiscale CNN might look something like this (using PyTorch):
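Here’s a minimal sketch of such a model. The layer names (conv1_*, conv2_*, conv3_*, conv_combined, fc1, fc2) mirror the walkthrough below, but the channel counts, kernel sizes, and number of classes are illustrative choices rather than values from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleCNN(nn.Module):
    """Sketch of a three-path multiscale CNN: one input, three scales."""

    def __init__(self, in_channels=3, num_classes=10):
        super().__init__()
        # Path 1: full-resolution input
        self.conv1_1 = nn.Conv2d(in_channels, 32, kernel_size=3, padding=1)
        self.conv1_2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # Path 2: half-resolution input
        self.conv2_1 = nn.Conv2d(in_channels, 32, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # Path 3: quarter-resolution input
        self.conv3_1 = nn.Conv2d(in_channels, 32, kernel_size=3, padding=1)
        self.conv3_2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # Layers applied after the three paths are fused
        self.conv_combined = nn.Conv2d(64 * 3, 128, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):
        # Create half-scale and quarter-scale versions of the input
        x_half = F.interpolate(x, scale_factor=0.5, mode='bilinear', align_corners=False)
        x_quarter = F.interpolate(x, scale_factor=0.25, mode='bilinear', align_corners=False)

        # Process each scale through its own convolutional path
        f1 = F.relu(self.conv1_2(F.relu(self.conv1_1(x))))
        f2 = F.relu(self.conv2_2(F.relu(self.conv2_1(x_half))))
        f3 = F.relu(self.conv3_2(F.relu(self.conv3_1(x_quarter))))

        # Upsample the lower-resolution features to the full-resolution size
        f2 = F.interpolate(f2, size=f1.shape[2:], mode='bilinear', align_corners=False)
        f3 = F.interpolate(f3, size=f1.shape[2:], mode='bilinear', align_corners=False)

        # Fuse the three scales along the channel dimension
        fused = torch.cat([f1, f2, f3], dim=1)

        # Integrate information across scales, then pool to a fixed size
        out = F.relu(self.conv_combined(fused))
        out = F.adaptive_avg_pool2d(out, 1).flatten(1)

        # Classification head
        out = F.relu(self.fc1(out))
        return self.fc2(out)
```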
The key idea occurs in the convolutional layers. Instead of creating separate input streams for different scales, we use a single input tensor and leverage PyTorch’s interpolation functionality to dynamically create and process different scales of the image within the network. In the forward method of our MultiscaleCNN class, we start with the original high-resolution input. We then use F.interpolate() to create half-scale and quarter-scale versions of this input. Each scale is processed through its own set of convolutional layers (conv1_*, conv2_*, and conv3_* respectively). This allows each path to extract features appropriate to its scale: fine details from the high-resolution path, intermediate features from the half-scale path, and more global, abstract features from the quarter-scale path.
After processing, we use F.interpolate() again, this time to upsample the features from the lower resolution paths to match the size of the high-resolution features. This upsampling allows us to combine features from all scales using torch.cat(). The resulting concatenated feature tensor then undergoes further convolution and pooling operations (conv_combined and adaptive_avg_pool2d) to integrate information across scales. Finally, the fully connected layers (fc1 and fc2) produce the output predictions. This approach enables our network to adaptively focus on both fine details and global context, leading to more robust and informed predictions.
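To tie it together, here’s how you might run a quick smoke test of the sketch above (the input size and class count are arbitrary):

```python
# Quick smoke test of the MultiscaleCNN sketch: a batch of 4 RGB images, 64 x 64
model = MultiscaleCNN(in_channels=3, num_classes=10)
dummy = torch.randn(4, 3, 64, 64)
logits = model(dummy)
print(logits.shape)  # torch.Size([4, 10])
```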
That marks the end of understanding how multiscale CNNs work and how we can implement them in Python. I hope that reading this article added value to you and provided a clear understanding of how to use a multiscale CNN and the inner workings of the different scales of convolutional layers. That was a long article, but if you’ve read till the end, thank you so much for taking the time to read my article and I hope you walked away with more knowledge than you came in with :)
If you have any questions regarding this article or just want to connect, you can find me on LinkedIn or my personal website :)