audio embeddings for music recommendation?
Streaming platforms (Spotify, Apple Music, etc.) need to be able to recommend new songs to their users. The better the recommendations, the better the listening experience.
There are many ways these platforms can build their recommendation systems. Modern systems will blend different recommendation methods together into a hybrid structure.
Think about when you first joined Spotify: you were asked which genres you like, and based on your selections Spotify recommended some songs. Recommendations based on song metadata like this are referred to as content-based filtering. Collaborative filtering is also used: it groups together users who behave similarly and transfers suggestions between them.
(Diagram generated by the author using OpenAI’s image generation tools)
The two methods above lean heavily on user behaviour and metadata. Another method, increasingly used by large streaming services, is deep learning: songs are represented in a learned, high-dimensional embedding space that captures rhythm, timbre, texture, and production style. Similarity between songs can then be computed cheaply, which scales better than classical collaborative filtering when a platform has hundreds of millions of users and tens of millions of tracks.
Through the rise of LLMs, word and phrase embeddings have become mainstream and are relatively well understood. But how do embeddings work for songs, and what problem do they solve? The remainder of this post covers how audio becomes a model input, which architectural choices encode musical features, how contrastive learning shapes the geometry of the embedding space, and how a song recommender built on these embeddings might work in practice.
How does Audio become an input into a neural network?
Raw audio files like MP3s are fundamentally waveforms – rapidly varying time series. Learning directly from waveforms is possible, but it is typically data-hungry and computationally expensive. Instead, we can convert .mp3 files into mel-spectrograms, which are much better suited as inputs to a neural network.
A mel-spectrogram represents an audio file’s frequency content over time, adapted to how humans perceive sound. It is a 2D representation where the x-axis corresponds to time, the y-axis corresponds to mel-scaled frequency bands, and each value is the log-scaled energy in that band at that time.

(Diagram generated by the author using OpenAI’s image generation tools)
The colours and shapes we see on a mel-spectrogram carry meaningful musical information. Brighter colours indicate higher energy at that frequency and time; darker colours indicate lower energy. Thin horizontal bands indicate sustained pitches and often correspond to held notes (vocals, strings, synth pads). Tall vertical streaks indicate energy across many frequencies at once, concentrated in time; these often correspond to percussive events such as snares and claps.
Now we can start to think about how convolutional neural networks can learn to recognise features of these audio representations. At this point, the key challenge becomes: how do we train a model to recognise that two short audio excerpts belong to the same song without labels?
Chunking and Contrastive Learning
Before we jump into the architecture of the CNN that we have used, we will take some time to talk about how we load the spectrogram data into the network, and how we set up the loss function of the network without labels.
At a very high level, we feed the spectrograms into the CNN, lots of matrix multiplication happens inside, and we are left with a 128-dimensional vector – a latent representation of the physical features of that audio. But how do we set up the batching and the loss so that the network can learn which songs are similar?
Let’s start with the batching. We have a dataset of songs (from the FMA small dataset) that we have converted into spectrograms. We make use of the tensorflow.keras.utils.Sequence class to randomly select 8 songs from the dataset. We then randomly “chunk” each spectrogram to select a 128 x 129 rectangle which represents a small portion of each song, as depicted below.

(Diagram generated by the author using OpenAI’s image generation tools)
This means that every batch we feed into the network is of the shape (8, 128, 129, 1) (batch size, mel frequency dimension, time chunk, channel dimension). By feeding chunks of songs instead of whole songs, the model will see different parts of the same songs across training epochs. This prevents the model from overfitting to a specific moment in each track. Using short samples from each song encourages the network to learn local musical texture (timbre, rhythmic density) rather than long-range structure.
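The batching described above can be sketched as a small `Sequence` subclass. This is an illustrative reconstruction, not the author's exact loader: `ChunkSequence` and its arguments are my own names, and `spectrograms` is assumed to be a list of `(128, time)` NumPy arrays.

```python
import numpy as np
import tensorflow as tf

class ChunkSequence(tf.keras.utils.Sequence):
    """Yields batches of random 128 x 129 chunks from a list of spectrograms."""

    def __init__(self, spectrograms, batch_size=8, chunk_len=129):
        super().__init__()
        self.specs = spectrograms
        self.batch_size = batch_size
        self.chunk_len = chunk_len

    def __len__(self):
        return len(self.specs) // self.batch_size

    def __getitem__(self, idx):
        # Randomly pick batch_size songs, then a random time window from each
        picks = np.random.choice(len(self.specs), self.batch_size, replace=False)
        batch = []
        for i in picks:
            spec = self.specs[i]
            start = np.random.randint(0, spec.shape[1] - self.chunk_len + 1)
            batch.append(spec[:, start:start + self.chunk_len])
        # Shape: (batch, mel bands, time chunk, channel)
        return np.stack(batch)[..., np.newaxis].astype(np.float32)
```

Each call to `__getitem__` produces a `(8, 128, 129, 1)` array, and because the window start is re-drawn every time, the model sees different parts of each song across epochs.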
Next, we make use of a contrastive learning objective. Contrastive loss was introduced in 2005 by Chopra et al. to learn an embedding space where similar pairs (positive pairs) have a low Euclidean distance, and dissimilar pairs (negative pairs) are separated by at least a certain margin. We’re using a similar concept by making use of InfoNCE loss.
We create two stochastic “views” of each batch. What this really means is that we create two augmentations of the batch, each with random, normally distributed noise added. This is done simply, with the following function:
@tf.function
def augment(x):
    """Add tiny time-frequency noise."""
    noise = tf.random.normal(shape=tf.shape(x), mean=0.0, stddev=0.05)
    # clip back to the usual mel dB range of -80 to 0
    return tf.clip_by_value(x + noise, -80.0, 0.0)
Embeddings of the same audio sample should be more similar to each other than to embeddings of any other sample in the batch.
So for a batch of size 8, we compute the similarity of every embedding from the first view and every embedding from the second view, resulting in an 8×8 similarity matrix.
We define the two L2-normalised augmented batches as \[z_i, z_j \in \mathbb{R}^{N \times d} \]
Each row of the two batches (a 128-D embedding, in our case) is L2-normalised, that is,
\[ \Vert z_i^{(k)} \Vert_2 = 1 \]
We can then compute the similarity of every embedding from the first view with every embedding from the second view, resulting in an N×N similarity matrix. This matrix is defined as:
\[ S = \frac{1}{\tau} z_i z_j^T \]
Where every element of S is the similarity between the embedding of song k and embedding of song l across both augmentations. This can be defined element-wise as:
\[
S_{kl} = \frac{1}{\tau} \langle z_i^{(k)}, z_j^{(l)} \rangle
= \frac{1}{\tau} \cos(z_i^{(k)}, z_j^{(l)})
\]
Where τ is a temperature parameter that controls how sharply the softmax distinguishes positives from negatives. The diagonal entries (similarities between chunks of the same song) are the positive pairs, and the off-diagonal entries are the negative pairs.
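A small NumPy illustration (not the training code) makes this structure concrete. Here the two views are simulated by adding small noise to the same base vectors, so the diagonal of the resulting 8×8 matrix dominates each row:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

N, d, tau = 8, 128, 0.1
base = rng.normal(size=(N, d))
z_i = l2_normalize(base + 0.05 * rng.normal(size=(N, d)))  # view 1
z_j = l2_normalize(base + 0.05 * rng.normal(size=(N, d)))  # view 2

# (8, 8) similarity matrix; S[k, l] compares song k's view-1 chunk
# with song l's view-2 chunk. The diagonal holds the positive pairs.
S = (z_i @ z_j.T) / tau
```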
Then for each row k of the similarity matrix, we compute:
\[
\ell_k = -\log
\frac{\exp(S_{kk})}{\sum_{l=1}^{N} \exp(S_{kl})}
\]
This is a softmax cross-entropy loss: the numerator is the exponentiated similarity of the positive pair, and the denominator sums the exponentiated similarities across the row.
Finally we average the loss over the batch, giving us the full loss objective:
\[
L = -\frac{1}{N}
\sum_{k=1}^{N}
\log
\frac{
\exp\left(
\frac{1}{\tau}
\langle z_i^{(k)}, z_j^{(k)} \rangle
\right)
}{
\sum_{l=1}^{N}
\exp\left(
\frac{1}{\tau}
\langle z_i^{(k)}, z_j^{(l)} \rangle
\right)
}
\]
Minimising the contrastive loss encourages the model to assign the highest similarity to matching augmented views while suppressing similarity to all other samples in the batch. This simultaneously pulls representations of the same audio closer together and pushes representations of different audio further apart, shaping a structured embedding space without requiring explicit labels.
This loss function is neatly captured by the following Python function:
def contrastive_loss(z_i, z_j, temperature=0.1):
    """
    Compute InfoNCE loss between two batches of embeddings.
    z_i, z_j: (batch_size, embedding_dim)
    """
    z_i = tf.math.l2_normalize(z_i, axis=1)
    z_j = tf.math.l2_normalize(z_j, axis=1)
    logits = tf.matmul(z_i, z_j, transpose_b=True) / temperature
    labels = tf.range(tf.shape(logits)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    return tf.reduce_mean(loss)
Now that we have built some intuition for how we load batches into the model and how minimising the loss function clusters similar sounds together, we can dive into the structure of the CNN.
A simple CNN architecture
We have chosen a fairly simple convolutional neural network architecture for this task. CNNs originated with Yann LeCun and colleagues, who created LeNet for handwritten digit recognition. CNNs excel at learning from images, and we have converted each song into an image-like format that suits them.
The first convolution layer applies 32 small filters across the spectrogram. At this point, the network is mostly learning very local patterns: things like short bursts of energy, harmonic lines, or sudden changes that often correspond to note onsets or percussion. Batch normalization keeps the activations well-behaved during training, and max pooling reduces the resolution slightly so the model doesn’t overreact to tiny shifts in time or frequency.
The second block increases the number of filters to 64 and starts combining those low-level patterns into more meaningful structures. Here, the network begins to pick up on broader textures, repeating rhythmic patterns, and consistent timbral features. Pooling again compresses the representation while keeping the most important activations.
By the third convolution layer, the model is working with 128 channels. These feature maps tend to reflect higher-level aspects of the sound, such as overall spectral balance or instrument-like textures. At this stage, the exact position of a feature matters less than whether it appears at all.

(Diagram generated by the author using OpenAI’s image generation tools)
Global average pooling removes the remaining time–frequency structure by averaging each feature map down to a single value. This forces the network to summarize what patterns are present in the chunk, rather than where they occur, and produces a fixed-size vector regardless of input length.
A dense layer then maps this summary into a 128-dimensional embedding. This is the space where similarity is learned: chunks that sound alike should end up close together, while dissimilar sounds are pushed apart.
Finally, the embedding is L2-normalized so that all vectors lie on the unit sphere. This makes cosine similarity easy to compute and keeps distances in the embedding space consistent during contrastive training.
At a high level, this model learns about music in much the same way that a convolutional neural network learns about images. Instead of pixels arranged by height and width, the input here is a mel-spectrogram arranged by frequency and time.
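Putting the blocks together, the architecture described above can be sketched in Keras roughly as follows. The kernel sizes and layer options here are illustrative assumptions rather than the author's exact configuration:

```python
import tensorflow as tf

def build_encoder(embedding_dim=128):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(128, 129, 1)),
        # Block 1: local patterns (note onsets, harmonic lines)
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D(2),
        # Block 2: broader textures and rhythmic patterns
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D(2),
        # Block 3: higher-level spectral balance and timbral features
        tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.BatchNormalization(),
        # Summarise *what* patterns are present, not *where* they occur
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(embedding_dim),
        # Project onto the unit sphere so cosine similarity is well-behaved
        tf.keras.layers.Lambda(lambda z: tf.math.l2_normalize(z, axis=1)),
    ])
```

Given a `(batch, 128, 129, 1)` input of spectrogram chunks, this returns a `(batch, 128)` matrix of unit-norm embeddings, which is exactly what the contrastive loss above consumes.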
How do we know the model is any good?
Everything we’ve talked about so far has been quite abstract. How do we actually know that the mel-spectrogram representations, the model architecture and the contrastive learning have done a decent job at creating meaningful embeddings?
One common way of understanding the embedding space is to project it into a lower-dimensional space that humans can actually visualise. This technique is called dimensionality reduction, and it is useful for making sense of high-dimensional data.

(Image by author)
Two techniques we can use are PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding). PCA is a linear method that preserves global structure, making it useful for understanding the overall shape and major directions of variation in an embedding space. t-SNE is a non-linear method that prioritises local neighbourhood relationships, which makes it better for revealing small clusters of similar points but less reliable for interpreting global distances. As a result, PCA is better for assessing whether an embedding space is coherent overall, while t-SNE is better for checking whether similar items tend to group together locally.
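Both projections are a few lines with scikit-learn. This is a generic sketch: `embeddings` here is a random stand-in for the model's real outputs, and the perplexity value is just a common default.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128))  # stand-in for real song embeddings

# PCA: linear projection preserving the main global directions of variation
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# t-SNE: non-linear projection emphasising local neighbourhood structure
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
```

The two 2-D arrays can then be scattered and coloured by genre label to produce plots like those shown here.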
As mentioned above, I trained this CNN using the FMA small dataset, which contains genre labels for each song. When we visualise the embedding space, we can group genres together which helps us make some statements about the quality of the embedding space.
The two-dimensional projections give different but complementary views of the learned embedding space. Neither plot shows perfectly separated genre clusters, which is expected and actually desirable for a music similarity model.
In the PCA projection, genres are heavily mixed and form a smooth, continuous shape rather than distinct groups. This suggests that the embeddings capture gradual differences in musical characteristics such as timbre and rhythm, rather than memorising genre labels. Because PCA preserves global structure, this indicates that the embedding space is coherent and organised in a meaningful way.
The t-SNE projection focuses on local relationships. Here, tracks from the same genre are more likely to appear near each other, forming small, loose clusters. At the same time, there is still significant overlap between genres, reflecting the fact that many songs share characteristics across genre boundaries.

(Image by author)
Overall, these visualisations suggest that the embeddings work well for similarity-based tasks. PCA shows that the space is globally well-structured, while t-SNE shows that locally similar songs tend to group together — both of which are important properties for a music recommendation system. To further evaluate the quality of the embeddings we could also look at recommendation-related evaluation metrics, like NDCG and recall@k.
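A metric like recall@k is straightforward to compute over the embedding space. The sketch below is one possible formulation, using genre labels as a proxy for relevance; the function name and this exact definition are my own, not part of the original project.

```python
import numpy as np

def recall_at_k(embeddings, labels, k=5):
    """Fraction of queries whose k nearest neighbours (by cosine similarity)
    contain at least one item with the same label, excluding the query itself."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T
    np.fill_diagonal(sims, -np.inf)          # never retrieve the query itself
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar items
    hits = [labels[i] in labels[topk[i]] for i in range(len(labels))]
    return float(np.mean(hits))
```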
Turning the project into a usable music recommendation app
Lastly we will spend some time talking about how we can actually turn this trained model into something usable. To illustrate how a CNN like this might be used in practice, I have created a very simple song recommender web app. This app takes an uploaded MP3 file, computes its embedding and returns a list of the most similar tracks based on cosine similarity. Rather than treating the model in isolation, I designed the pipeline end-to-end: audio preprocessing, spectrogram generation, embedding inference, similarity search, and result presentation. This mirrors how such a system would be used in practice, where models must operate reliably on unseen inputs rather than curated datasets.
The embeddings from the FMA small dataset are precomputed and stored offline, allowing recommendations to be generated quickly using cosine similarity rather than running the model repeatedly. Chunk-level embeddings are aggregated into a single song-level representation, ensuring consistent behaviour for tracks of different lengths.
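The serving-time logic reduces to a few NumPy operations. This is a minimal sketch under the assumptions above (unit-norm, precomputed embeddings); the function names and `track_ids` are illustrative placeholders, not the app's actual code.

```python
import numpy as np

def aggregate_chunks(chunk_embeddings):
    """Mean-pool chunk-level embeddings into one song-level vector, re-normalised."""
    v = np.mean(chunk_embeddings, axis=0)
    return v / np.linalg.norm(v)

def top_k_similar(query, song_embeddings, track_ids, k=5):
    """Return the k most similar tracks to a unit-norm query embedding."""
    sims = song_embeddings @ query           # cosine similarity for unit vectors
    order = np.argsort(-sims)[:k]
    return [(track_ids[i], float(sims[i])) for i in order]
```

Because the catalogue embeddings are precomputed, each recommendation request costs only one matrix-vector product and a sort, with no model inference beyond embedding the uploaded track.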
The final result is a lightweight web application that demonstrates how a learned representation can be integrated into a real recommendation workflow.
This is a very simple representation of how embeddings could be used in an actual recommendation system, but it doesn’t capture the whole picture. Modern recommendation systems will combine both audio embeddings and collaborative filtering, as mentioned at the start of this article.
Audio embeddings capture what things sound like; collaborative filtering captures who likes what. Combining the two, along with additional ranking models, yields a hybrid system that balances acoustic similarity and personal taste.
Data sources and Images
This project uses the FMA Small dataset, a publicly available subset of the Free Music Archive (FMA) dataset introduced by Defferrard et al. The dataset consists of short music clips released under Creative Commons licenses and is widely used for academic research in music information retrieval.
All schematic diagrams in this article were generated by the author using AI-assisted image generation tools and are used in accordance with the tools’ terms, which permit commercial use. The images were created from original prompts and do not reference copyrighted works, fictional characters, or real individuals.