Breaking Down Contrastive Learning with a Hierarchical Twist

Dev Shah
14 min read · May 30, 2023

hey everyone, my name is Dev and I’m going to be reviewing and breaking down some of the most interesting + new topics/papers in AI and Machine Learning. Before jumping into the paper, I’d also like to introduce myself. I just finished my freshman year studying Computer Science at the University of Toronto. Outside of that, I’m currently a Machine Learning Researcher at an ML lab run by Dr. Pascal Tyrrell. This is also my first attempt at trying something like this, so I would love any and all feedback I can get; you can find me on LinkedIn here.

The paper that this post is based on is named “HiCo: Hierarchical Contrastive Learning for Ultrasound Video Model Pretraining.” The main purpose of this paper is to propose a new contrastive learning framework that outperforms the vanilla contrastive learning methods that have been proposed previously. Now that I’ve given a brief overview, let’s jump into the thick of the paper!

paper review.

The focus of this post is going to be on the application of Contrastive Learning to a very specific subset of biomedical imaging: ultrasound. Ultrasounds are widely used in medical diagnostics to visualize internal body structures and monitor the health of organs such as the heart, liver, and kidneys. They provide real-time images by emitting high-frequency sound waves and capturing the echoes as they bounce back from different tissues.

In the past, machine learning models have been trained on large datasets of ultrasound images, but the one thing that has been common among those datasets is that they are large + well-labeled. The process of annotating ultrasound pictures + videos is very expensive and time-consuming. This has limited the availability of labeled ultrasound data, making it challenging to train accurate and robust machine learning models. As such, this was the motivation behind creating a model that can operate well on datasets that aren’t well-labeled and might not be as large.

So what’s been done in the past?

In the past, there’s been a lot of talk about playing around with Deep Neural Networks and pretraining, but most recently, pretraining combined with fine-tuning has turned out to be pretty successful. The reason for this is that it can transfer knowledge learned on large amounts of unlabeled data very effectively to downstream tasks.

Downstream task: a task that depends on the output of a previous task.

We’ve gotten this far in, but I have yet to address the elephant in the room: Contrastive Learning. This is exactly where it comes in; the entire idea of combining pretraining with fine-tuning is built on what contrastive learning can do. Before I describe how exactly that works, let me explain what contrastive learning is in a nutshell.

Contrastive Learning

It’s a machine learning technique that aims to learn representations by maximizing the similarity between similar samples and minimizing the similarity between dissimilar samples.

Examples of CL.

That might be a little complex for those unfamiliar with machine learning, so let’s take a look at an example. Imagine you have a dataset of various animal images, but there are no labels indicating the specific species of each animal. With Contrastive Learning, you can train a model to learn representations of these animals without relying on explicit labels.

In the Contrastive Learning framework, the model learns by comparing and contrasting pairs of images. The goal is to maximize the similarity between images that belong to the same category (positive pairs) and minimize the similarity between images from different categories (negative pairs).

To apply Contrastive Learning, the model takes in two images as input: an anchor image and a randomly chosen image from the dataset. The model then generates embeddings, which are numerical representations of these images that capture their essential features. The embeddings are designed in such a way that similar images are mapped to nearby points in the embedding space.

To illustrate, let’s say we have an anchor image of a lion and a positive pair image of another lion. These two images belong to the same category (positive pair) and should have similar embeddings. The model will try to minimize the distance between their embeddings in the embedding space.

On the other hand, if we randomly choose an image of a zebra as the negative pair, the model aims to maximize the distance between their embeddings. This ensures that images from different categories have dissimilar representations.

Through iterative training on a large volume of image pairs, the model gradually learns to capture the underlying patterns and structures that differentiate one animal species from another. After training, the learned representations can be used for various downstream tasks. For example, the model can be fine-tuned on a smaller labeled dataset of lion, zebra, and other animal images to perform tasks like image classification, object detection, or even unsupervised clustering.

That, in a nutshell, is what contrastive learning is and how it works (a very dumbed down version, excluding the complicated math 😅).
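For the curious, here’s a small taste of that math: a minimal sketch of the classic margin-based contrastive loss (the kind used in early siamese-network work, not HiCo’s exact loss). The embeddings and the margin value here are made up purely for illustration:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, other, is_positive, margin=1.0):
    """Classic margin-based contrastive loss on a batch of embedding pairs.
    Positives are pulled together; negatives are pushed apart up to `margin`."""
    dist = F.pairwise_distance(anchor, other)      # Euclidean distance per pair
    if is_positive:
        return (dist ** 2).mean()                  # lion vs. lion: shrink distance
    return (F.relu(margin - dist) ** 2).mean()     # lion vs. zebra: grow distance

# Toy usage with random vectors standing in for model-produced embeddings:
lion_a, lion_b, zebra = (torch.randn(1, 128) for _ in range(3))
print(contrastive_loss(lion_a, lion_b, is_positive=True))   # want this small
print(contrastive_loss(lion_a, zebra, is_positive=False))   # pushes apart
```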

Now let’s take a look at how that can be applied to Ultrasound Imaging.

Typical Contrastive Learning, also known as Vanilla Contrastive Learning, has previously been applied to learn ultrasound video representations. These studies showed promising results; however, the existing CL methods for US videos typically take the output of a specific layer in a Deep Neural Network and use it for contrast. This means that the model compares the features extracted from that particular layer to understand similarities and differences between different parts of the videos.

However, this approach has its limitations. By focusing on a single layer, the model may miss out on important interactions and connections between different levels of information in the videos. These interactions can be crucial for capturing complex patterns and improving the transferability of pre-trained models.

To address this issue, a new Contrastive Learning approach has been proposed; it enables multi-level information interaction, making the model more effective. The name of this approach is Hierarchical Contrastive Learning.

understanding hierarchical contrastive learning.

In order to get around the problem that I described above, a Hierarchical CL method was proposed. The entire idea of this was to perform not only peer-level checks, but also cross-level checks. As described earlier, the purpose of this is to ensure that the model isn’t missing out on important interactions and connections between different levels of information in the videos. In other words, this means aligning features within the same level (peer-level alignment) and aligning features across different levels (cross-level alignment). By doing so, the model can capture a more comprehensive understanding of the data.

Moreover, medical images from different classes or lesions may have significant local similarities, which can make them more challenging than natural images. To handle this, a label smoothing strategy is used, which involves designing a batch-based softened objective function during the pretraining process. This helps prevent the model from becoming over-confident and reduces the negative impact caused by local similarities between different classes.

Here’s a visual to illustrate the difference between the 2:

Now that I’ve given a high-level overview of this, let’s jump right into the specifics of how this proposed model works.

The overall framework of the HiCo CL model consists of the 2 types of semantic alignment mentioned previously, cross-level & peer-level, along with a softened objective function. In simple terms, when working with medical images, there can be similarities between images from different classes. Traditional classification methods assume that different classes are completely different from each other. However, in medical imaging, some parts of the images may look similar across different classes, like certain tissues and organs that are not related to the diseases being studied.

To handle this, a softened objective function is used. This function helps address the problem of local similarities by making the classification labels smoother. Instead of using hard labels that strictly assign an image to a single class, the softened objective function introduces a parameter called alpha. This parameter controls the degree of smoothing applied to the labels.

By using a softened objective function, the model becomes more flexible in dealing with the local similarities between different classes in medical images. It allows for a more nuanced understanding of the data and helps improve the model’s performance in distinguishing between various classes, even when there are similarities in certain parts of the images.
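To make this concrete, here is a minimal sketch of standard label smoothing with an alpha parameter. HiCo’s batch-based softened objective builds on this idea but differs in its details, so treat this as the vanilla version only:

```python
import torch
import torch.nn.functional as F

def softened_cross_entropy(logits, targets, alpha=0.1):
    """Cross-entropy with smoothed labels: the true class keeps 1 - alpha of
    the probability mass; alpha is spread over the remaining classes."""
    num_classes = logits.size(-1)
    soft = torch.full_like(logits, alpha / (num_classes - 1))
    soft.scatter_(-1, targets.unsqueeze(-1), 1.0 - alpha)
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

logits = torch.randn(4, 10)             # 4 samples, 10 classes
targets = torch.tensor([0, 3, 3, 7])    # hard labels to be softened
print(softened_cross_entropy(logits, targets, alpha=0.1))
```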

Here is a diagram of the HiCo Contrastive Learning framework. There are 2 distinct parts to it, and I’ll explain each one in further detail.

Step 1: Extract 2 images from each ultrasound video as a positive sample pair. For a refresher, a positive sample pair = 2 similar images.

Step 2: The ResNet-FPN pre-trained model is used to obtain local, medium, and global features, which are passed through 3 projection heads (h, m & g) to produce the corresponding embeddings. After this, the entire network is optimized by minimizing the peer-level losses, the cross-level losses, and the softened CE loss.

Now those were some complex words that might not mean much right now, so let me break this down.

basic concept of vanilla contrastive learning.

Vanilla Contrastive Learning is a method for training a deep neural network to recognize patterns in images. The method involves learning a global feature encoder and a projection head that map an image into a feature vector. A global feature encoder is a component of a deep neural network that maps an input image into a feature vector in a high-dimensional space. Its purpose is to extract meaningful and discriminative features from the input image that can be used for downstream tasks such as classification, object detection, or segmentation.

The projection heads are also very important to understand. Projection heads are components of a deep neural network that map the output of a feature extractor into a different space where similarity between images can be measured. In the context of contrastive learning, they are used to obtain embeddings that are optimized to be similar for similar images and dissimilar for dissimilar ones. The projection heads are learned through backpropagation: during training, the feature extractor and projection heads are jointly optimized to minimize a contrastive loss function. This is achieved by adjusting the weights of the projection heads so that they map similar images to nearby points in embedding space while pushing dissimilar images further apart.
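As a rough illustration, here is a toy feature extractor + projection-head pairing; every layer size here is made up, and a real encoder would be far deeper:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                        # image -> feature vector
    nn.Conv2d(3, 64, kernel_size=7, stride=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
projection_head = nn.Sequential(                # features -> embedding space
    nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64),
)

images = torch.randn(8, 3, 96, 96)              # a batch of 8 fake images
embeddings = projection_head(encoder(images))   # what the contrastive loss sees
print(embeddings.shape)                         # torch.Size([8, 64])
```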

That was a lot of information, but in a nutshell, once we have the feature vectors, they are then used to compute the Information Noise-Contrastive Estimation (InfoNCE) loss. This encourages similar images to be closer together in feature space, while dissimilar images are pushed further apart.

InfoNCE (Information Noise-Contrastive Estimation) loss is a standard loss function used in contrastive learning for computer vision tasks. It is evaluated on the feature representations of images extracted from a backbone network, such as ResNet. The purpose of InfoNCE loss is to maximize the similarity between positive pairs of images and minimize the similarity between negative pairs. This is achieved by computing the dot product between pairs of feature vectors and applying a softmax function to obtain probabilities that measure the similarity between the pairs. The loss then maximizes the log probability assigned to each positive pair relative to the negatives in the batch. InfoNCE loss has been shown to be effective in improving feature representations and performance on downstream tasks such as image classification or object detection.
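Here is a minimal SimCLR-style InfoNCE sketch. The temperature value and the batch layout (positives on the diagonal) are common conventions, not necessarily HiCo’s exact variant:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """SimCLR-style InfoNCE: z1[i] and z2[i] are embeddings of the same video
    (positive pair); every other row in the batch acts as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # all pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)        # positives on the diagonal

loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
```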

Now that we have a rough understanding of the vanilla contrastive learning method, let me break down the fundamentals of this framework/pipeline.

understanding the pipeline + framework.

The backbone of the HiCo CL model is built on ResNet-FPN. For some context, the ResNet (Residual Network) architecture is a widely used deep neural network that has been shown to achieve state-of-the-art performance on a variety of computer vision tasks, including image classification, object detection, and segmentation. The purpose of using ResNet as the backbone in HiCo for US video model pretraining is to extract meaningful and discriminative features from ultrasound videos that can be used for downstream tasks.

The ResNet-FPN backbone is used to extract local, medium, and global embeddings from ultrasound videos. These embeddings are then fed into three separate projection heads to obtain feature representations that can be used for contrastive learning. Specifically, the ResNet-FPN backbone is used to extract features at different scales from the input ultrasound videos. The local features capture fine-grained details such as texture and shape, while the medium and global features capture more abstract information such as object shapes or spatial relationships between objects.

The obtained features are passed through three separate projection heads, as mentioned earlier. These projection heads produce the local, medium, and global embeddings. The purpose of using multiple projection heads is to capture different levels of abstraction in the feature space. The local, medium, and global embeddings are then used to compute peer-level and cross-level semantic alignment losses, which encourage similarity between videos that belong to the same class or category. By optimizing these losses during training, the model learns transferable feature representations that can be used for downstream tasks such as classification or segmentation.
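For a feel of what multi-scale feature extraction looks like, here is a sketch using torchvision’s off-the-shelf ResNet-FPN helper (this assumes a recent torchvision version, and HiCo’s actual backbone configuration may differ from these defaults):

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone(backbone_name="resnet18", weights=None)

frames = torch.randn(2, 3, 224, 224)   # two frames sampled from one video
features = backbone(frames)            # dict of multi-scale feature maps
for name, fmap in features.items():
    print(name, tuple(fmap.shape))     # finer maps ~ local, coarser ~ global
```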

breaking down projection heads.

Rough visual of Projection Heads.

The local projection head (h) in the HiCo method is responsible for obtaining local embeddings of an image. These embeddings capture fine-grained details about specific regions or objects within the image. The local embeddings are obtained by applying a 3x3 convolutional layer to the feature maps extracted from the backbone network (ResNet-FPN) at a certain level of abstraction. The output of this convolutional layer is then passed through a 1x1 convolutional layer to reduce its dimensionality and obtain the final local embedding. By capturing fine-grained details, the local embeddings can help improve the performance of the model on tasks that require recognizing subtle differences between objects or regions within an image.

The medium projection head (m) is used to obtain medium embeddings from ultrasound videos. These embeddings capture medium-grained information of the original images, such as object shapes or spatial relationships between objects. The medium embeddings are obtained by passing the medium-grained features extracted by the ResNet-FPN backbone through a 1-layer MLP. This MLP consists of a linear layer followed by a non-linear activation function such as ReLU. The purpose of the medium projection head is to capture information that is complementary to both fine-grained and global information.

The global projection head (g) plays a crucial role in deriving global embeddings from ultrasound videos. These embeddings effectively encapsulate essential details from the original images, including overall image content and scene context. To obtain the global embeddings, the global features extracted by the ResNet-FPN backbone are passed through a 1-layer MLP. This MLP comprises a linear layer followed by a non-linear activation function like ReLU. The primary objective of the global projection head is to capture information that remains consistent despite local variations within the image.
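Putting the three descriptions together, here is a sketch of what the heads might look like. The class name, channel counts, and embedding dimension are all illustrative, not taken from the paper’s code:

```python
import torch
import torch.nn as nn

class HiCoHeads(nn.Module):
    """Sketch of the three projection heads described above."""
    def __init__(self, channels=256, dim=128):
        super().__init__()
        # h: a 3x3 conv followed by a 1x1 conv over the fine-grained feature map
        self.h = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, dim, kernel_size=1),
        )
        # m and g: 1-layer MLPs (linear + ReLU) on pooled feature vectors
        self.m = nn.Sequential(nn.Linear(channels, dim), nn.ReLU())
        self.g = nn.Sequential(nn.Linear(channels, dim), nn.ReLU())

    def forward(self, f_local, f_medium, f_global):
        z_local = self.h(f_local).mean(dim=(2, 3))  # pool the conv output
        z_medium = self.m(f_medium)
        z_global = self.g(f_global)
        return z_local, z_medium, z_global

heads = HiCoHeads()
z_l, z_m, z_g = heads(torch.randn(2, 256, 28, 28),
                      torch.randn(2, 256),
                      torch.randn(2, 256))
```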

peer-level semantic alignment.

Peer-level semantic alignment is a technique used in HiCo (hierarchical contrastive learning) to make ultrasound video models better at recognizing similarities between images of the same class. It helps the model learn useful features by comparing pairs of images at different levels of detail.

When applying peer-level semantic alignment, the model calculates three different contrastive losses: local CL loss, medium CL loss, and global CL loss. These losses measure how similar pairs of embeddings (numerical representations of images) are at the same level of detail.

The local contrast refers to fine-grained information captured by the model, such as texture or small-scale patterns. The local projection head is used to obtain local embeddings from ultrasound videos. The medium contrast refers to medium-grained information captured by the model, such as object shapes or spatial relationships between objects. The medium projection head is used to obtain medium embeddings from ultrasound videos. The global contrast refers to high-level information captured by the model, such as overall image content or scene context. The global projection head is used to obtain global embeddings from ultrasound videos.

By calculating these contrastive losses at each level, HiCo encourages the model to learn features that capture both specific details and broader patterns in ultrasound videos.
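In code, peer-level alignment might reduce to something like the following, reusing the info_nce function sketched earlier; the random tensors stand in for embeddings of the two views of each video:

```python
import torch

# z_*_1 / z_*_2 stand in for the two views' embeddings at each level;
# info_nce is the function sketched in the InfoNCE section above.
z_local_1, z_local_2 = torch.randn(8, 128), torch.randn(8, 128)
z_medium_1, z_medium_2 = torch.randn(8, 128), torch.randn(8, 128)
z_global_1, z_global_2 = torch.randn(8, 128), torch.randn(8, 128)

loss_peer = (
    info_nce(z_local_1, z_local_2)        # local CL loss
    + info_nce(z_medium_1, z_medium_2)    # medium CL loss
    + info_nce(z_global_1, z_global_2)    # global CL loss
)
```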

During training, these contrastive losses are combined with cross-level semantic alignment losses (such as global-medium CL loss and global-local CL loss) and a softened cross-entropy loss. The model’s projection heads (parts of the network that transform the embeddings) are adjusted using backpropagation to minimize the overall objective function.

cross-level semantic alignment.

Cross-level semantic alignment is a technique used in HiCo (hierarchical contrastive learning) to help ultrasound video models understand the similarities between different levels of detail. It aligns local and medium features with global features to improve learning efficiency and enhance feature representation.

During training, cross-level semantic alignment is achieved by calculating two types of contrastive losses: the global-medium CL loss and the global-local CL loss. The global-medium CL loss measures how similar pairs of medium-level features are to their corresponding global-level features. Similarly, the global-local CL loss measures the similarity between pairs of local-level features and their corresponding global-level features.

By calculating these contrastive losses, HiCo encourages the model to learn feature representations that capture both detailed and high-level information in ultrasound videos. Aligning local and medium features with global features ensures that the model understands both small-scale details like texture and larger-scale semantics like object shapes or spatial relationships.

During training, these contrastive losses are combined with peer-level semantic alignment losses (local CL loss, medium CL loss, and global CL loss) and a softened cross-entropy loss. The model’s projection heads (which transform the features) are adjusted using backpropagation to minimize this overall objective function.
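Continuing the earlier sketches, cross-level alignment and the overall objective might look like this. Pairing the first view’s global embedding with the second view’s medium/local embeddings, and weighting all terms equally, are assumptions for illustration rather than the paper’s exact recipe:

```python
# Reuses info_nce, softened_cross_entropy, and the z_* / logits / targets
# variables defined in the earlier sketches.
loss_cross = (
    info_nce(z_global_1, z_medium_2)      # global-medium CL loss
    + info_nce(z_global_1, z_local_2)     # global-local CL loss
)
total_loss = loss_peer + loss_cross + softened_cross_entropy(logits, targets)
```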

Putting all that together is how the model works as a whole: taking the basic concepts of vanilla contrastive learning and combining them with peer-level and cross-level semantic alignment.

final thoughts.

In the paper, there was an experiment that tested the HiCo (Hierarchical Contrastive Learning) model for analyzing ultrasound videos. The goal was to see if this method could help the computer understand ultrasound images better and perform different tasks accurately.

The experiment focused on five different tasks related to ultrasound diagnosis, such as detecting the fetal head, identifying standard planes, and taking fetal measurements. They compared the performance of HiCo with other advanced methods that are currently used.

The results of the experiment were quite promising. HiCo outperformed the other methods in all five tasks. On average, it improved the accuracy of the model by 3.5% compared to the other techniques. This means that HiCo was better at correctly identifying and analyzing ultrasound images.

To better understand the reasons behind HiCo’s success, the authors conducted additional studies. These studies looked at the individual components of HiCo to see how they influenced the model’s performance. The results showed that both the cross-level semantic alignment and peer-level semantic alignment played important roles in achieving high accuracy.

Some results.

In simpler terms, the experiment showed that the HiCo method performed better than other methods in analyzing ultrasound videos. It improved the accuracy of the model in tasks like detecting fetal head and identifying standard planes. The authors also found that specific techniques within HiCo, like aligning different levels of information, were crucial for its success.

That marks the end of this post! If you found this helpful, please consider subscribing below! Looking forward to seeing you in the next one :)

Thank you for taking the time to read my post, if you found it valuable and would love to chat more, feel free to reach out to me on LinkedIn or check out my Personal Website!
