In my previous posts, I’ve talked a lot about Retrieval-Augmented Generation (RAG). In particular, I’ve covered the basics of the RAG methodology, as well as a bunch of relevant concepts, like chunking, embeddings, reranking, and retrieval evaluation.
The traditional RAG methodology is so useful because it allows us to search a large knowledge base for relevant pieces of text based on their meaning, rather than on exact wording. In this way, it lets us apply the power of AI to our own custom documents. Ironically, as useful as this similarity search is, it sometimes fails to retrieve pieces of text that exactly match the user’s prompt. More specifically, when searching a large knowledge base, specific keywords (such as technical terms or names) may get lost, and relevant chunks may not be retrieved even when the user’s query contains the exact same words.
Happily, this issue can be easily tackled by utilising an older keyword-based searching technique, like BM25 (Best Matching 25). Then, by combining the results of the similarity search and BM25 search, we can essentially get the best of both worlds and significantly improve the results of our RAG pipeline.
. . .
In information retrieval systems, BM25 is a ranking function used to evaluate how relevant a document is to a search query. Unlike similarity search, BM25 evaluates the document’s relevance to the user’s query, not based on the semantic meaning of the document, but rather on the actual words it contains. More specifically, BM25 is a bag-of-words (BoW) model, meaning that it doesn’t take into account the order of the words in a document (from which the semantic meaning emerges), but rather the frequency with which each word appears in the document.
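As a quick illustration of the bag-of-words view, two sentences with very different meanings but the same words produce identical term counts (a minimal sketch using Python’s standard library):

```python
from collections import Counter

# Two sentences with very different meanings but exactly the same words
a = "the dog bit the man".split()
b = "the man bit the dog".split()

# A bag-of-words model only sees term frequencies, so both look identical
print(Counter(a) == Counter(b))  # → True
```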
The BM25 score for a given query q containing terms t and a document d can be (not so) easily calculated as follows:

$$BM25(q, d) = \sum_{t \in q} IDF(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{avg(dl)}\right)}$$

😿
Since this expression can be a bit overwhelming, let’s take a step back and look at it bit by bit.
. . .
Starting simple with TF-IDF
The basic underlying concept of BM25 is TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF is a fundamental information retrieval concept aiming to measure how important a word is in a specific document in a knowledge base. In other words, it measures how many documents of the knowledge base a term appears in, which in turn expresses how specific and informative that term is about a particular document. The rarer a term is in the knowledge base, the more informative it is considered to be for a specific document.
In particular, for a document d in a knowledge base and a term t, the Term Frequency TF(t,d) can be defined as follows:
$$TF(t,d) = \frac{f(t,d)}{|d|}$$

where f(t,d) is the number of times the term t appears in document d, and |d| is the length of the document (its total number of terms),
and Inverse Document Frequency IDF(t) can be defined as follows:
$$IDF(t) = \log\left(\frac{N}{n(t)}\right)$$

where N is the total number of documents in the knowledge base, and n(t) is the number of documents that contain the term t.
Then, the TF-IDF score can be calculated as the product of TF and IDF as follows:
$$\text{TF-IDF}(t,d) = TF(t,d) \cdot IDF(t)$$
. . .
Let’s do a quick example to get a better grip of TF-IDF. Let’s assume a tiny knowledge base containing three movies with the following descriptions:
- “A sci-fi thriller about time travel and a dangerous adventure across alternate realities.”
- “A romantic drama about two strangers who fall in love during unexpected time travel.”
- “A sci-fi adventure featuring an alien explorer forced to travel across galaxies.”
After removing the stopwords, we can consider the following terms in each document:
- document 1: sci-fi, thriller, time, travel, dangerous, adventure, alternate, realities
- size of document 1, |d1| = 8
- document 2: romantic, drama, two, strangers, fall, love, unexpected, time, travel
- size of document 2, |d2| = 9
- document 3: sci-fi, adventure, featuring, alien, explorer, forced, travel, galaxies
- size of document 3, |d3| = 8
- total documents in knowledge base N = 3
We can then calculate the f(t,d) for each term in each document:
(Since these descriptions are short, every term appears exactly once in its document, so f(t,d) = 1 for each term listed under a document and 0 otherwise.)
Next, for each term, we also calculate the Document Frequency and the Inverse Document Frequency:
| term(s) | DF | IDF = ln(N / DF) |
|---|---|---|
| travel | 3 | ln(3/3) = 0 |
| sci-fi, time, adventure | 2 | ln(3/2) ≈ 0.41 |
| all remaining terms | 1 | ln(3/1) ≈ 1.10 |
And then finally we calculate the TF-IDF score of each term.
| document | travel | sci-fi / time / adventure | terms unique to the document |
|---|---|---|---|
| d1 (length 8) | 0 | ≈ 0.05 | ≈ 0.14 |
| d2 (length 9) | 0 | ≈ 0.05 (time only) | ≈ 0.12 |
| d3 (length 8) | 0 | ≈ 0.05 | ≈ 0.14 |
So, what do we get from this? Let’s take a look, for example, at the TF-IDF scores of document 1. The word ‘travel’ is not informative at all, since it is included in all documents of the knowledge base. On the flip side, words like ‘thriller’ and ‘dangerous’ are very informative specifically for document 1, since they are included only in it.
In this way, TF-IDF score provides a simple and straightforward way to identify and quantify the importance of the terms in each document of a knowledge base. To put it differently, the higher the total score of the terms in a document, the rarer the information in this document is in comparison to the information contained in all other documents in the knowledge base.
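To make this concrete, here’s a small script that reproduces the TF-IDF scores for the toy movie knowledge base above (a minimal sketch; I’m assuming the natural logarithm, though some implementations use base 10):

```python
import math

# Toy knowledge base: terms per document after stopword removal
docs = {
    "d1": ["sci-fi", "thriller", "time", "travel", "dangerous",
           "adventure", "alternate", "realities"],
    "d2": ["romantic", "drama", "two", "strangers", "fall",
           "love", "unexpected", "time", "travel"],
    "d3": ["sci-fi", "adventure", "featuring", "alien",
           "explorer", "forced", "travel", "galaxies"],
}
N = len(docs)

def tf(term, doc):
    # Term frequency: occurrences of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log of (N / number of docs containing the term)
    n_t = sum(term in doc for doc in docs.values())
    return math.log(N / n_t)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

d1 = docs["d1"]
print(round(tf_idf("travel", d1), 3))    # → 0.0   (appears in every document)
print(round(tf_idf("thriller", d1), 3))  # → 0.137 (unique to document 1)
```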
. . .
Understanding BM25 score
In BM25, we utilise the TF-IDF concept in order to quantify how informative (how rare or important) each document in a knowledge base is with respect to a specific query. To do this, for the BM25 calculation, we only take into account the terms of each document that are contained in the user’s query, and perform a calculation somewhat similar to TF-IDF.
BM25 uses the TF-IDF concept, but with a few mathematical tweaks in order to improve two main weaknesses of TF-IDF.
. . .
The first pain point of TF-IDF is that TF is linear in the number of times a term t appears in a document d, f(t,d):

$$TF(t,d) = \frac{1}{|d|} \cdot f(t,d)$$
This means that TF grows without bound as a term t appears more and more times in a document d, which, as you may imagine, can be problematic for large documents, where a term may appear over and over again without being correspondingly more important.
A simple way to resolve this is to use a saturation curve instead of a linear function: the output still increases with the input, but approaches a maximum limit asymptotically, unlike the linear function, where the output grows forever:
$$y = \frac{x}{x + k}$$
Thus, we can try to rewrite TF in this form as follows, introducing a parameter k₁, which allows for control of the frequency scaling. In this way, the parameter k₁ allows for introducing diminishing returns: the 1st occurrence of the term t in a document has a big impact on the TF score, whereas the 20th appearance only adds a small extra gain.
$$TF(t,d) = \frac{f(t,d)}{f(t,d) + k_1}$$
Nonetheless, this would result in values in the range 0 to 1. We can tweak this a bit more and add a (k₁ + 1) factor in the numerator, so that the resulting values of TF are comparable with the initial definition of TF used in TF-IDF.
$$TF(t,d) = \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1}$$
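This saturation behaviour is easy to see numerically (a minimal sketch; the value k₁ = 1.5 is an illustrative choice):

```python
def tf_bm25(f, k1=1.5):
    # Saturating TF: grows with the raw count f but never exceeds k1 + 1
    return f * (k1 + 1) / (f + k1)

print(tf_bm25(1))             # → 1.0
print(round(tf_bm25(2), 3))   # → 1.429
print(round(tf_bm25(20), 3))  # → 2.326 (approaching the asymptote k1 + 1 = 2.5)
```

The jump from 0 to 1 occurrence adds a full point; the jump from 19 to 20 occurrences adds almost nothing.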
. . .
So far, so good, but one critical piece of information is still missing from this expression: the document size |d| that was included in the initial calculation of TF. Before adding |d| back, though, we need to alter it a little, since raw document length is the second pain point of the initial TF-IDF expression. More specifically, a knowledge base is going to contain documents of variable lengths |d|, so the scores of documents with different lengths are not directly comparable. BM25 resolves this by normalizing |d|. That is, instead of |d|, the following expression is used:
$$1 - b + b \cdot \frac{|d|}{avg(dl)}$$
where avg(dl) is the average document length of the documents in the knowledge base. Additionally, b is a parameter in [0,1] that controls the length normalization, with b = 0 corresponding to no normalization and b = 1 corresponding to complete normalization.
So, adding the normalised expression of |d|, we get the fancier version of TF used in BM25. This will be as follows:
$$TF_{BM25}(t,d) = \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{avg(dl)}\right)}$$
In practice, typical parameter values are k₁ ≈ 1.2 to 2.0 and b ≈ 0.75.
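Here is a small sketch of this length-normalised TF, showing how the same raw count f scores differently in a short versus a long document (the document lengths and avgdl below are illustrative):

```python
def tf_bm25(f, doc_len, avgdl, k1=1.5, b=0.75):
    # BM25 term frequency: saturation via k1, length normalisation via b
    norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * norm)

# Same raw count f = 2, average document length avgdl = 100:
short = tf_bm25(2, doc_len=50, avgdl=100)   # shorter-than-average document
long_ = tf_bm25(2, doc_len=300, avgdl=100)  # much longer document
print(short > long_)  # → True: the long document is penalised
```

With b = 0 the `norm` factor collapses to 1 and both documents would score identically, which is exactly the "no normalization" case described above.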
. . .
BM25 also uses a slightly altered expression for the IDF calculation, as follows:
$$IDF(t) = \log\left(\frac{N - n(t) + 0.5}{n(t) + 0.5}\right)$$
This expression is derived by asking a better question. In the initial IDF calculation, we ask:
“How rare is the term?”
Instead, when trying to calculate the IDF for BM25, we ask:
“How much more likely is this term in relevant documents than in non-relevant documents?”
The probability of a document containing the term t, in a knowledge base of N documents, can be expressed as:
$$P(d \text{ contains } t) = \frac{n(t)}{N}$$
We can then express the odds of a document containing a term t versus not containing it as:
$$\frac{n(t)/N}{(N - n(t))/N} = \frac{n(t)}{N - n(t)}$$
And then taking the inverse, we end up with:
$$\frac{N - n(t)}{n(t)}$$
Similarly to the typical IDF, we take the log of this expression to compress the extreme values. A small correction known as Robertson–Sparck Jones smoothing (adding 0.5 to both the numerator and the denominator) is also applied, and in this way we finally get the IDF expression used in BM25.
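The resulting IDF can be sketched as follows. Note that this classic form can go slightly negative for terms that appear in most documents, which is why some implementations (e.g. Lucene) additionally add 1 inside the log:

```python
import math

def idf_bm25(N, n_t):
    # BM25 IDF with the 0.5 Robertson-Sparck Jones smoothing terms;
    # N = total documents, n_t = documents containing the term
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

# In a knowledge base of 1000 documents:
print(round(idf_bm25(1000, 5), 2))    # → 5.2: a rare term is very informative
print(round(idf_bm25(1000, 900), 2))  # → -2.19: a near-ubiquitous term
```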
. . .
Ultimately, we can calculate the BM25 score for a specific document d for a given query q that contains multiple terms t.
$$BM25(q, d) = \sum_{t \in q} IDF(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{avg(dl)}\right)}$$
In this way, we can score the documents available in a knowledge base based on their relevance to a specific query, and then retrieve the most relevant documents.
All this is just to say that the BM25 score is something like the much more easily understood TF-IDF score, but a bit more sophisticated. As a result, BM25 is very popular for performing keyword searches, and it is what we use for the keyword search in our RAG system.
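Putting the TF and IDF pieces together, a minimal BM25 scorer over the earlier movie knowledge base might look like this (a sketch only — real systems use an inverted index for efficiency, and here I use the Lucene-style +1 inside the log to keep IDF non-negative):

```python
import math

# Tokenised movie descriptions from the earlier example (stopwords removed)
docs = [
    ["sci-fi", "thriller", "time", "travel", "dangerous",
     "adventure", "alternate", "realities"],
    ["romantic", "drama", "two", "strangers", "fall",
     "love", "unexpected", "time", "travel"],
    ["sci-fi", "adventure", "featuring", "alien",
     "explorer", "forced", "travel", "galaxies"],
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N

def idf(term):
    # BM25 IDF; the +1 inside the log (as in Lucene) keeps it non-negative
    n_t = sum(term in d for d in docs)
    return math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)

def bm25(query, doc, k1=1.5, b=0.75):
    # Sum the saturated, length-normalised TF times IDF over the query terms
    score = 0.0
    for term in query:
        f = doc.count(term)
        norm = 1 - b + b * len(doc) / avgdl
        score += idf(term) * f * (k1 + 1) / (f + k1 * norm)
    return score

query = ["sci-fi", "alien"]
ranked = sorted(range(N), key=lambda i: bm25(query, docs[i]), reverse=True)
print(ranked)  # → [2, 0, 1]: document 3 contains both query terms
```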
RAG with Hybrid Search
So, now that we have an idea about how BM25 works and scores the various documents in a knowledge base based on the frequency of keywords, we can further take a look at how BM25 scores are incorporated in a traditional RAG pipeline.
As discussed in several of my previous posts, a very simple RAG pipeline would look something like this:
*(Pipeline diagram: documents are split into chunks, embedded, and stored in a vector store; the user’s query is embedded, a similarity search retrieves the most semantically similar chunks, and these are passed to the LLM as context.)*
Such a pipeline uses a similarity score (like cosine similarity) of embeddings in order to search for, find, and retrieve chunks that are semantically similar to the user’s query. While similarity search is very useful, it can sometimes miss exact matches. Thus, by incorporating a keyword search, on top of the similarity search in the RAG pipeline, we can identify relevant chunks more effectively and comprehensively. This would alter our landscape as follows:
*(Pipeline diagram: in addition to the embeddings stored in the vector store, a BM25 index is built over the chunks; the user’s query is matched against both, and the two sets of retrieved chunks are combined before being passed to the LLM.)*
For each text chunk, apart from the embedding, we now also calculate a BM25 index, allowing for quick calculation of respective BM25 scores on various user queries. In this way, for each user query, we can identify the chunks with the highest BM25 scores – that is, the chunks that contain the most rare, most informative terms with respect to the user’s query in comparison to all other chunks in the knowledge base.
Notice how now we match the user’s query both with the embeddings in the vector store (semantic search) and the BM25 index (keyword search). Different chunks are retrieved based on the semantic search and the keyword search – then the retrieved chunks are combined, deduplicated, and ranked using rank fusion.
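A common choice for that last fusion step is Reciprocal Rank Fusion (RRF), which needs only the rank of each chunk in each result list. Here is a minimal sketch (the chunk IDs and the customary constant k = 60 are illustrative):

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per chunk
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first (duplicates merge by chunk id)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["c3", "c1", "c7"]    # top chunks from similarity search
keyword  = ["c1", "c9", "c3"]    # top chunks from BM25 search
print(rrf([semantic, keyword]))  # → ['c1', 'c3', 'c9', 'c7']
```

Chunks retrieved by both searches ("c1" and "c3") accumulate score from both lists and rise to the top, while chunks found by only one search trail behind.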
. . .
On my mind
Integrating BM25 keyword search into a RAG pipeline allows us to get the best of both worlds: the semantic understanding of embeddings and the precision of exact keyword matching. By combining these approaches, we can retrieve the most relevant chunks even from a larger knowledge base more reliably, ensuring that critical terms, technical phrases, or names are not overlooked. In this way, we can significantly improve the effectiveness of our retrieval process and ensure that no important relevant information is left behind.