reconstructing music from brain waves!? meet Brain2Music

Dev Shah
10 min read · Aug 26, 2023

Growing up, one of the first things that captured my heart was music; it not only holds immense significance in my life but also carries a universal importance. Music, with its ethereal ability to transcend language barriers and cultural boundaries, serves as a testament to the shared emotions and experiences that connect humanity. It’s a medium that allows individuals to express their innermost thoughts, joys, sorrows, and aspirations, reaching across diverse cultures to create a sense of unity.

Recently, the unique and abstract world of music has been pushed a bit further, specifically by neuroscientific studies that delve into how music is represented within our brains. There has been a lot of research on human brain activity; these studies, conducted using functional magnetic resonance imaging (fMRI), offer a fascinating glimpse into the inner workings of our minds as we engage with melodies.

A paper titled “Brain2Music: Reconstructing Music from Human Brain Activity” was recently published, authored by Timo I. Denk, Yu Takagi, Takuya Matsuyama, Andrea Agostinelli, Tomoya Nakai, Christian Frank, and Shinji Nishimoto. This paper delves into reconstructing music from brain activity scans with MusicLM (I’ll dive deeper into this later). The music is reconstructed from fMRI scans by predicting high-level, semantically structured music embeddings and then using a deep neural network (DNN) to generate music from those features. Now that I’ve given you a brief overview, let’s jump into this paper!

paper review.

Before jumping straight into the model architecture and how this DNN works, let’s take a step back and understand the previous work in this domain that has allowed this new model to flourish. Within the music x ML space, there are two main areas of prior work that have allowed this study to come to life: music generation models & fMRI audio encoding + decoding.

music generation models.

Generating music has been really challenging because it is important that the music generated is high-quality and has long-term consistency. This endeavor has given rise to diverse approaches, all aiming to master the intricate interplay of high-quality audio and sustained consistency. One pioneering effort introduced a meticulously structured hierarchy of temporal resolutions. This complex framework, guided by transformer models, orchestrates music generation with remarkable temporal integrity. However, while this approach achieves high coherence, it occasionally introduces perceptible artifacts. (For reference, the model being talked about here is PerceiverAR).

PerceiverAR Model.

An alternative approach involves the utilization of auto-regressive and diffusion-based models, which play a pivotal role in advancing synthesis quality across both music and broader audio creation. A previous study, AudioLM, introduced the concept of autoregressively modeling a hierarchical tokenization scheme. This approach seamlessly integrates semantic and acoustic discrete audio representations. Expanding on this, MusicLM combines the AudioLM framework with a collaborative music/text embedding model. This integration empowers the generation of high-fidelity music, driven by detailed text descriptions.

That sounds a bit complex, so let me give an analogy to make it a little easier.

Imagine the world of audio synthesis as a symphony of creativity, where different techniques play the roles of virtuoso musicians. Among them, auto-regressive and diffusion-based models step onto the stage as master composers, refining the quality of music and broader audio generation.

Think of AudioLM as a skilled painter crafting a canvas masterpiece. They layer colors to build depth, combining broad strokes (semantic elements) with intricate details (acoustic elements). This fusion creates a harmonious artwork, much like musical notes forming a symphony. Enter MusicLM, the curator who envisions a harmonious collaboration. It takes the canvas of AudioLM’s masterpiece and pairs it with a complementary piece — a descriptive text, akin to poetry. This blend enriches the experience, much like combining art and literature, adding layers of meaning and depth.

In this analogy, the models, like skilled musicians, harmonize their efforts. They create multi-layered compositions, evoking emotions and deepening our understanding, similar to how a symphony resonates with audiences, leaving a lasting impact.

Now let’s get back to music generation models!

Within the Brain2Music framework, MusicLM is the music generator used, though the methodology can adapt to any music generator. The one precondition is that the generator must be able to accommodate conditioning on a dense embedding; in other words, it needs to be capable of adjusting the music it creates based on a detailed vector of information that acts like a set of instructions guiding the generation.

fMRI audio decoding & encoding.

A significant objective within the field of neuroscience is comprehending how brain activity relates to our sensory and cognitive experiences. To achieve this, scientists create encoding models to precisely outline which aspects of these experiences (like colors, motion, and sounds) correspond to specific brain activity patterns. On the flip side, they also develop decoding models that can deduce the content of an experience based on distinct patterns of brain activity.

Recent progress has led to noteworthy discoveries. Researchers have found similarities between the internal representations of deep learning models and those within the brain across different sensory and cognitive aspects. This revelation has contributed to understanding brain functions through developing encoding models based on these representations, interpreting them in relation to brain functions, and even reconstructing experienced content (such as visual images) from brain activity.

Turning to the exploration of auditory brain functions, scientists have created encoding models employing deep learning techniques to process auditory inputs. Additionally, they’ve engaged in studies to reconstruct perceived sounds from brain activity. However, these studies have mostly focused on general sounds, including voices and everyday sounds. Interestingly, there hasn’t been any instance of constructing encoding models using the internal representations of text-to-music generative models or reconstructing musical experiences from brain activity, focusing specifically on the distinctive features of music.

Now that we have a solid understanding of the previous work, let’s jump into how this model actually works!

understanding model architecture.

MuLan & MusicLM

MuLan is a model that combines text and music embeddings. It’s made up of two parts: one for text (MuLan_text) and another for music (MuLan_music). The text part uses a BERT model pre-trained on lots of text, and the music part uses a variant of the ResNet-50 architecture. MuLan’s goal during training is to make sure that the embeddings it generates for music and text are similar for related examples. For instance, the embedding for a rock song should be similar to the embedding for a text description of rock music, but different from the embedding of a calm violin solo. In this article, whenever a MuLan embedding is mentioned, I’m referring to the music tower’s embedding by default.
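
To make the idea of a shared embedding space concrete, here is a minimal sketch in NumPy. The encoders themselves are not reproduced; the embeddings below are random stand-ins (128-dimensional purely for illustration), and the only point is that matched music/text pairs should score higher under cosine similarity than mismatched ones.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
rock_audio_emb  = rng.normal(size=128)                          # stand-in for MuLan_music("rock song")
rock_text_emb   = rock_audio_emb + 0.1 * rng.normal(size=128)   # stand-in for MuLan_text("energetic rock")
violin_text_emb = rng.normal(size=128)                          # stand-in for MuLan_text("calm violin solo")

print(cosine_similarity(rock_audio_emb, rock_text_emb))    # high: a matched pair
print(cosine_similarity(rock_audio_emb, violin_text_emb))  # low: a mismatched pair
```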

On the other hand, MusicLM is a model that generates music based on certain conditions. These conditions could be things like text, other music, or melody. In this process, MusicLM uses a MuLan embedding that we compute from an fMRI response to guide the generation. Imagine MusicLM like a two-step process: first, it translates a MuLan embedding into a sequence of special tokens. These tokens are extracted from another model called w2v-BERT. Then, in the second step, MusicLM transforms these tokens and the MuLan embedding into acoustic tokens. These acoustic tokens come from another model called SoundStream. These tokens are then turned back into audio using a SoundStream decoder. Similar to MuLan, all of these steps are carried out using Transformer models, which are a type of technology that helps computers understand patterns and relationships in data.
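
Here is a rough sketch of that two-stage flow, assuming nothing about the real implementation: the three functions below are placeholder stand-ins (random token IDs and silent audio), and the vocabulary sizes and token counts are made up. The sketch only tries to convey the order in which the pieces are wired together.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_stage(mulan_emb: np.ndarray) -> np.ndarray:
    """Stage 1 (placeholder): MuLan embedding -> sequence of w2v-BERT-style semantic tokens."""
    return rng.integers(0, 1024, size=150)            # illustrative vocabulary and length

def acoustic_stage(semantic_tokens: np.ndarray, mulan_emb: np.ndarray) -> np.ndarray:
    """Stage 2 (placeholder): semantic tokens + MuLan embedding -> SoundStream-style acoustic tokens."""
    return rng.integers(0, 1024, size=600)

def soundstream_decode(acoustic_tokens: np.ndarray) -> np.ndarray:
    """Placeholder for the SoundStream decoder that turns acoustic tokens into a waveform."""
    return np.zeros(24000 * 15)                       # "15 seconds" of silence at 24 kHz

mulan_emb = rng.normal(size=128)                      # e.g., an embedding predicted from fMRI
audio = soundstream_decode(acoustic_stage(semantic_stage(mulan_emb), mulan_emb))
```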

decoding process.

When we talk about decoding, we’re talking about trying to recreate the original thing a person experienced by looking at their brain activity records. It’s like putting together clues to understand what the person saw or heard. This process has two parts:

  1. Predicting the music qualities from the brain activity data.
  2. Getting or making music based on those predicted qualities.

music embedding prediction from fMRI data.

Predicting musical information from brain scans involves looking at the brain’s response to different stimuli. Imagine we have recorded brain activity data for five people while they listened to 15-second music clips. This data has three dimensions: n is the number of clips, s is the number of fMRI scans for each clip, and d_fmri is the number of voxels (the small 3D units the brain volume is divided into).

Each person’s brain size slightly influences the number of voxels (d_fmri); for one person, it’s around 60,000.

Our goal is to predict qualities of the music these people heard. We call these qualities “music embeddings.” For each clip, we have r embeddings that capture details about the music’s features; r depends on the type of embedding and the temporal resolution at which it is computed.

To make everything match up in time, we align the brain activity data (R) with the music embeddings (T) by averaging the brain scans over the window for which each music feature was calculated. For instance, to predict the music qualities from 0s to 10s, we use the average of five brain scans (0–1.5s, 1.5–3s, and so on).
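
A minimal NumPy sketch of that alignment step, with toy data and an assumed scan interval of 1.5 seconds; the paper’s exact windowing (for example, how the hemodynamic delay is handled) may differ from this simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
TR = 1.5                                              # seconds between consecutive fMRI scans
scans = rng.normal(size=(10, 60000))                  # toy data: 10 scans x ~60k voxels
scan_onsets = np.arange(scans.shape[0]) * TR          # 0.0, 1.5, 3.0, ...

# Average the scans whose onsets fall inside the window covered by one music embedding.
window_start, window_end = 0.0, 10.0
in_window = (scan_onsets >= window_start) & (scan_onsets < window_end)
averaged_response = scans[in_window].mean(axis=0)     # shape: (60000,), one row of R
```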

This gives us pairs of brain responses and music features. We split and organize this data into training and test sets, then use L2-regularized linear regression (ridge regression) to learn a mapping from the brain data to the music features. However, this mapping isn’t the same for everyone, since each person’s brain is unique, so we fit the regression separately for each person.
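
As a sketch (toy data, shrunken dimensions, and an arbitrary regularization strength, with scikit-learn’s Ridge as a stand-in for whatever solver the authors used), fitting one subject’s decoder might look like this:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy data for ONE subject: rows of R (time-averaged fMRI responses) paired with rows
# of T (music embeddings). Real dimensions are ~60,000 voxels and, e.g., 128-d MuLan.
R_train = rng.normal(size=(300, 5000))     # fMRI responses (5,000 voxels to keep it small)
T_train = rng.normal(size=(300, 128))      # target music embeddings

# L2-regularized linear regression from brain responses to music embeddings.
# A separate decoder is fit for every subject, since brains differ in size and layout.
decoder = Ridge(alpha=100.0).fit(R_train, T_train)

R_test = rng.normal(size=(20, 5000))
predicted_embeddings = decoder.predict(R_test)        # shape: (20, 128)
```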

We also look at specific areas of the brain, called “regions of interest” (ROIs), which are groups of voxels. From a set of 150 ROIs, we pick the top 6 that are most strongly related to the music features. These ROIs can differ in size; on average, they contain about 258.6 voxels. Although the exact locations may vary from person to person, the selected ROIs mainly cover brain areas related to hearing.
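
A rough illustration of that selection step; every ROI here gets a single made-up relevance score (imagine it as the mean decoding correlation of its voxels), and the ROI-to-voxel mapping is random, since this post doesn’t reproduce the paper’s exact criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rois = 150
roi_scores = rng.uniform(size=n_rois)                 # made-up relevance score per ROI
top_rois = np.argsort(roi_scores)[::-1][:6]           # indices of the top 6 ROIs

# Hypothetical ROI -> voxel-index mapping; real ROIs come from a brain atlas.
roi_to_voxels = {roi: rng.integers(0, 60000, size=int(rng.integers(50, 500)))
                 for roi in range(n_rois)}
selected_voxels = np.unique(np.concatenate([roi_to_voxels[r] for r in top_rois]))
print(len(selected_voxels), "voxels selected from the top ROIs")
```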

For each 15-second music clip, we predict multiple music features (depending on the kind of features we’re looking at).

now let’s look at music retrieval and music reconstruction.

We’re looking at two ways to recreate the original music based on the predictions we made. One method is retrieving similar music from a collection, and the other is generating new music using the MusicLM model.

For the retrieval approach, we calculate MuLan embeddings for the first 15 seconds of each music clip in the Free Music Archive (FMA). This archive contains a diverse range of music tracks from various genres. We find the audio clip that has embeddings closest to the predicted ones using cosine similarity. So, it’s like finding a similar music piece from a library.
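
A small sketch of that retrieval step with random stand-in embeddings (the real candidate set is the FMA clips, and the embedding dimension of 128 here is just for illustration):

```python
import numpy as np

def retrieve_closest_clip(predicted_emb: np.ndarray, library_embeddings: np.ndarray) -> int:
    """Return the index of the library clip whose embedding is closest (by cosine
    similarity) to the embedding predicted from fMRI."""
    sims = library_embeddings @ predicted_emb / (
        np.linalg.norm(library_embeddings, axis=1) * np.linalg.norm(predicted_emb) + 1e-9
    )
    return int(np.argmax(sims))

# Toy usage: 1,000 candidate clips with 128-d embeddings standing in for the FMA library.
rng = np.random.default_rng(0)
library = rng.normal(size=(1000, 128))
best_clip = retrieve_closest_clip(rng.normal(size=128), library)
```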

On the other hand, for the generation approach, we use the predicted embeddings to guide the MusicLM model in creating new music. We average the predicted embeddings along the time dimension and use this information to make the model generate music. This method is powerful because it can potentially create a wide range of music, even ones not seen during training. However, it might not always perfectly match the provided predicted embeddings.
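
And a sketch of the generation path, where `generate_music` is only a placeholder for MusicLM’s embedding-conditioned interface; the one real step shown is averaging the predicted embeddings along the time dimension.

```python
import numpy as np

def generate_from_prediction(predicted_embeddings: np.ndarray, generate_music) -> np.ndarray:
    """Average the predicted embeddings along the time dimension, then hand the single
    vector to an embedding-conditioned generator (MusicLM in the paper)."""
    conditioning = predicted_embeddings.mean(axis=0)
    return generate_music(conditioning)

# Placeholder generator: returns "15 seconds" of silence at 24 kHz.
fake_musiclm = lambda emb: np.zeros(24000 * 15)
audio = generate_from_prediction(np.random.default_rng(0).normal(size=(10, 128)), fake_musiclm)
```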

Each approach has its pros and cons. The retrieval method is limited by the available music collection, so it might not capture all the details of the original music. On the other hand, the generative model can theoretically create various types of music, but it might not always precisely match the given predicted embeddings.

encoding: whole-brain voxel-wise modeling

To understand the internal representations of MusicLM, we look at how they relate to recorded brain activity. Specifically, we create models that predict fMRI signals using different music embeddings from MusicLM: embeddings derived from audio (MuLan_music and w2v-BERT-avg) and those derived from text (MuLan_text).

First, we build encoding models that predict brain activity from the audio-based embeddings, comparing MuLan_music and w2v-BERT-avg to see how each is represented in the brain.

Then, we build models using both the audio-based MuLan_music and the text-based MuLan_text embeddings to predict fMRI signals, which helps us understand the differences between these two types of embeddings. MuLan_text embeddings are particularly interesting because they capture high-level information from text captions describing the music.

We also conduct a control analysis to see if MuLan_music embeddings contain more information than just the genre of the music. To do this, we compare the prediction performance of the MuLan_music model with models that use one-hot vectors representing music genres.
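
For concreteness, a one-hot genre control feature could be built like this; the genre labels below are hypothetical and only stand in for whatever annotations the underlying dataset provides.

```python
import numpy as np

# Hypothetical genre labels; they only stand in for the dataset's real annotations.
genres = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]
clip_genres = ["rock", "jazz", "rock", "pop"]                  # one label per clip

one_hot = np.zeros((len(clip_genres), len(genres)))
for i, g in enumerate(clip_genres):
    one_hot[i, genres.index(g)] = 1.0

# These vectors replace the MuLan_music features in a control encoding model; if the
# MuLan_music model still predicts fMRI signals better, the embedding carries
# information beyond genre alone.
```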

The training data is prepared in the same way as we did for decoding. We estimate model weights from training data using L2-regularized linear regression and apply them to test data. The regularization parameters are adjusted during training through five-fold cross-validation. For evaluation, we use Pearson’s correlation coefficients between predicted and actual fMRI signals. We measure statistical significance by comparing the estimated correlations with a null distribution of correlations from independent random vectors. We consider correlations with a significance level of P < 0.05 and correct for multiple comparisons using the FDR procedure. When comparing MuLan_text and one-hot music genre vectors, we adjust the sampling rate to match that of MuLan_music.
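
Putting the fitting and evaluation together in a toy sketch (random data, shrunken sizes, scikit-learn’s RidgeCV for the cross-validated regularization, and pearsonr’s parametric p-values standing in for the paper’s null-distribution test):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import RidgeCV
from statsmodels.stats.multitest import fdrcorrection

rng = np.random.default_rng(0)
n_train, n_test, emb_dim, n_voxels = 400, 60, 128, 2000       # toy sizes, not the paper's

X_train, X_test = rng.normal(size=(n_train, emb_dim)), rng.normal(size=(n_test, emb_dim))
Y_train, Y_test = rng.normal(size=(n_train, n_voxels)), rng.normal(size=(n_test, n_voxels))

# Voxel-wise encoding model: L2-regularized regression from music embeddings to fMRI
# signals, with the regularization strength chosen by five-fold cross-validation.
encoder = RidgeCV(alphas=np.logspace(-2, 4, 7), cv=5).fit(X_train, Y_train)
Y_pred = encoder.predict(X_test)

# Pearson correlation between predicted and measured signals for every voxel, then FDR
# correction across voxels. (The paper tests against a null distribution of correlations
# from random vectors; the parametric p-value here is only a stand-in for that.)
results = [stats.pearsonr(Y_pred[:, v], Y_test[:, v]) for v in range(n_voxels)]
r = np.array([res[0] for res in results])
p = np.array([res[1] for res in results])
significant, _ = fdrcorrection(p, alpha=0.05)
print(f"{significant.sum()} of {n_voxels} voxels significant (random data, so expect ~0)")
```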

Putting all these pieces together is how the model works. The overall framework, Brain2Music, integrates various elements to predict, retrieve, and generate music based on brain activity and music embeddings. It incorporates encoding models to bridge the gap between brain signals and music features, allowing us to decode the music that corresponds to those brain signals. Additionally, it leverages embeddings derived from both audio and text data, enabling it to understand and manipulate different aspects of music. Through a combination of prediction, retrieval, and generation, Brain2Music offers a comprehensive approach to exploring the connection between brain activity and music perception, shedding light on the intricate relationship between our minds and the melodies we experience.

If you’ve made it this far, thank you for reading this article and I hope it added some value to your life 😁 — Dev

If you have any questions regarding this article or just want to connect, you can find me on LinkedIn or my personal website :)
