Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost

Contents

1. What a reranker actually is
1.1 The cost/precision gradient
1.2 The funnel
1.3 Bi-encoder vs cross-encoder mechanically
2. The cost-perf gradient, tested on the same cases
2.1 Literal-token trap (Article 2, section 1.6)
2.2 Synonym recovery with a hard lexical distractor (Article 2, section 1.2)
2.3 Topical proximity vs answer relevance (Article 2, section 2.3)
2.4 Signal dilution in long context (Article 2, section 2.4)
2.5 The yes/no question (Article 2, section 2.6)
3. Where the cross-encoder still breaks
3.1 Negation, still invisible
3.2 Exact identifiers and internal acronyms
3.3 Listing, the canonical failure mode
3.4 Out-of-domain vocabulary
4. Where rerankers actually justify their cost
5. Conclusion
6. Further reading

Two situations.

Scene 1. A team building a RAG system over a few hundred contracts has read Article 2. Embeddings break on negation, on exact identifiers, on the gap between a question and its answer. The team’s first reflex is the one the literature suggests: add a reranker. Cross-encoder, smaller than an LLM, smarter than cosine, slot it between embeddings and the LLM. They wire in bge-reranker-base, send it the top-100 from the embedding stage, keep the top-10. A few queries that were broken yesterday seem to work today. The team is encouraged.

Scene 2. Two weeks in, the same operational pattern from Article 2 returns. The user asks “list every clause that mentions termination” and the system returns the three “most relevant” ones, exactly three, ranked. The contract has eleven. The user asks “what’s the cancellation rule for non-employees?” The reranker has never seen the company’s term non-employee labor, and ranks an unrelated paragraph on top. The user asks “is there a clause that does NOT mention indemnification?” Same negation failure as before; the cross-encoder doesn’t see logical complementation any more than the embedding did. Latency, meanwhile, is now in the hundreds of milliseconds. The cross-encoder runs at query time on every candidate, and there’s no way to precompute it. Worse: when they run side-by-side comparisons against text-embedding-3-large without the reranker, the embedding alone often matches or beats ada-002 + bge-reranker-base.

The classical retrieval funnel looks the same as it did in Article 2. Cheap embedding similarity at the bottom narrows millions of candidates to thousands. An optional cross-encoder reranker in the middle narrows the thousands to dozens. The chat-completion LLM on top reads the dozens. The reranker is the layer that sits between those two extremes on the cost-and-quality ladder. Knowing what each stage really does is what makes the funnel work; expecting magic from any single stage is how teams lose six months. This article tests the cost-perf gradient empirically: four embedding models from 2014 to 2024, plus three off-the-shelf cross-encoder rerankers, scored side by side on the cases Article 2 catalogued. The result is more surprising than the funnel suggests.

The seven models tested, with the place where each model’s author declares the license:

GloVe-avg (2014, 300-dim word vectors): Apache 2.0, declared on the HuggingFace model card.
all-MiniLM-L6-v2 (2021, 22M params, 384-dim): Apache 2.0, declared on the HuggingFace model card.
text-embedding-ada-002 (OpenAI 2022, 1536-dim): proprietary; OpenAI Terms of Use.
text-embedding-3-large (OpenAI 2024, 3072-dim): proprietary; OpenAI Terms of Use.
bge-reranker-base (BAAI 2023, 278M params): MIT license, declared on the HuggingFace model card.
bge-reranker-large (BAAI 2023, 560M params): MIT license, declared on the HuggingFace model card.
cross-encoder/ms-marco-MiniLM-L-12-v2 (historical baseline): Apache 2.0, declared on the HuggingFace model card.

from numpy import dot
from numpy.linalg import norm
from openai import OpenAI
from sentence_transformers import CrossEncoder

# Bi-encoder (the embedding stage from Article 2).
# Each text becomes a vector INDEPENDENTLY. Cosine in vector space.
client = OpenAI()
def cosine_score(query, passage):
    # Two separate API calls: the model never sees query and passage together.
    v_q = client.embeddings.create(input=query,   model="text-embedding-ada-002").data[0].embedding
    v_p = client.embeddings.create(input=passage, model="text-embedding-ada-002").data[0].embedding
    return dot(v_q, v_p) / (norm(v_q) * norm(v_p))

# Cross-encoder reranker.
# Query and passage are TOKENIZED TOGETHER and attended over jointly.
# One forward pass per (query, passage) pair. Returns a single relevance score.
reranker = CrossEncoder("BAAI/bge-reranker-base")
def rerank_score(query, passage):
    # predict() takes a list of pairs and returns one score per pair.
    return reranker.predict([(query, passage)])[0]
This article is one piece of the broader Enterprise Document Intelligence Vol. 1 series, which builds enterprise RAG brick by brick from a baseline pipeline to corpus-scale architecture.

1. What a reranker actually is

Before the empirical tests, the architectural picture. Two reasons it matters: the reranker is a real engineering object with real costs, and the editorial position the series defends only makes sense once the classical role is on the table.

1.1 The cost/precision gradient

Three stages, ordered by cost per query:

Bi-encoder embedding similarity. A precomputed vector per document. At query time the model encodes the query once and runs cosine similarity against the index. Milliseconds for millions of candidates. Cheap and approximate.
Cross-encoder reranker. Query and passage are tokenised together and passed through a transformer that attends across both. The output is a single relevance score per pair. Cannot be precomputed because the query is part of the input. Tens of milliseconds per pair. Mid-cost, mid-precision.
Chat-completion LLM. Reads a small candidate set and produces a structured answer. Hundreds of milliseconds, dollars per million tokens. Most expensive, most accurate.

Each stage is justified by what it can do cheaper than the next stage above. Embeddings can’t do everything an LLM can, but they can score a million candidates in the time the LLM reads ten. Rerankers can’t do everything an LLM can, but they can rank a thousand candidates in the time the LLM reads twenty. That is the textbook story. Section 2 tests it on real query shapes. The gradient turns out to be flatter, and sometimes inverted, compared to what the funnel suggests.
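As a back-of-the-envelope check on the middle rung, the sketch below estimates how long the reranker stage takes for one query. The 20 ms per forward pass and the batch size of 32 are placeholder assumptions, not measurements from this article, and the model assumes every batch costs the same fixed time, which real hardware only approximates.

```python
import math

def rerank_latency_ms(n_candidates: int, ms_per_batch: float, batch_size: int = 1) -> float:
    """Estimated wall-clock time to score every (query, passage) pair.

    Simplification: batches run one after another and each batch costs
    a fixed ms_per_batch regardless of how full it is.
    """
    n_batches = math.ceil(n_candidates / batch_size)
    return n_batches * ms_per_batch

# Hypothetical figures: 100 candidates handed over by the embedding stage.
unbatched = rerank_latency_ms(100, ms_per_batch=20.0)               # 100 passes
batched = rerank_latency_ms(100, ms_per_batch=20.0, batch_size=32)  # 4 passes
print(unbatched, batched)
```

Under these assumptions the cost scales with the number of candidates the reranker inherits, which is why the size of the upstream pool is the first number to pin down before adding the layer.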

1.2 The funnel

The architectural picture is a funnel. The corpus has, say, 200,000 pages. The embedding stage scores them all and returns the top 100. The reranker scores the 100 and returns the top 10. The LLM reads the 10 and produces an answer. Each arrow narrows the candidate pool by an order of magnitude or more, and each stage is justified by the cost-versus-quality trade with its neighbours.

*Cost grows top to bottom; candidate count shrinks; each stage hands a smaller set on – Image by author*

This funnel logic means the reranker is interesting only when the upstream stage produces a large pool. If you already retrieve top-5 from a well-scoped pipeline, there is no funnel to narrow: the reranker re-orders five candidates the LLM will read anyway. The reranker’s value grows with the size of the candidate pool it inherits.
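The funnel can be sketched as two successive top-k cuts with pluggable scorers. The scorers here are toy stand-ins (word overlap for the cheap stage, overlap density for the rerank stage), and the function names, corpus and cut sizes are illustrative assumptions rather than the series’ pipeline.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]

def top_k(query: str, passages: List[str], scorer: Scorer, k: int) -> List[str]:
    """Score every passage against the query and keep the k best."""
    return sorted(passages, key=lambda p: scorer(query, p), reverse=True)[:k]

def funnel(query: str, corpus: List[str], cheap: Scorer, rerank: Scorer,
           k_retrieve: int = 100, k_rerank: int = 10) -> List[str]:
    """Two narrowing stages; the returned list is what the LLM would read."""
    candidates = top_k(query, corpus, cheap, k_retrieve)
    return top_k(query, candidates, rerank, k_rerank)

def overlap(query: str, passage: str) -> float:
    """Toy cheap scorer: number of query words present in the passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))

def overlap_density(query: str, passage: str) -> float:
    """Toy rerank scorer: overlap normalised by passage length."""
    return overlap(query, passage) / max(len(passage.split()), 1)

corpus = [
    "termination clause for suppliers",
    "termination of employment requires notice",
    "payment terms",
    "holiday schedule",
    "the clause on termination is described at length in this long paragraph",
]
query = "termination clause"

narrowed = funnel(query, corpus, overlap, overlap_density, k_retrieve=3, k_rerank=2)
# With no narrowing (k_retrieve == k_rerank) the second stage can only re-order.
reordered = funnel(query, corpus, overlap, overlap_density, k_retrieve=2, k_rerank=2)
print(narrowed)
print(reordered)
```

When `k_retrieve` equals `k_rerank`, the second stage returns exactly the set it was given, which is the top-5 situation described above.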

On paper, the funnel is elegant: three mathematically distinct scorers, each tuned to its rung of the cost-versus-quality ladder, each justified by the trade with its neighbours. In practice, the elegance does not transfer to the people the system is built for. A business expert who opens an audit log sees three different scores per page, each on a different scale, each produced by a model they do not understand and cannot reproduce. The system becomes harder to explain than the documents it is supposed to answer questions on. The editorial position the series defends (developed in section 4) is not that the funnel is wrong on paper. It is that the architectural moves the experts can audit (expert vocabulary, structure-aware retrieval, classify-before-retrieve, specific pipelines per question type) buy more trust per dollar than stacking statistically distinct scorers does.

1.3 Bi-encoder vs cross-encoder mechanically

The mechanical difference matters for what each can model. A bi-encoder (the embedding model from Article 2) encodes the query and the passage independently, then compares vectors. The two never see each other inside the model. Whatever interaction matters between them (does this passage answer this question) has to survive the projection into a fixed-dimensional vector for each side.

A cross-encoder tokenises query and passage together, separated by a special token, and runs them through a transformer that attends across both sides. Every token in the passage can attend to every token in the query. The model can directly score “the second token of the query is a negation; the third token of the passage means the opposite”. In principle this gives a cross-encoder access to fine-grained interactions a bi-encoder cannot represent.

In principle. The training data and objective decide what it actually learns to score.
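One practical consequence of the mechanical difference is where the forward passes happen. The sketch below counts them with toy stand-ins: `encode` plays the bi-encoder (a bag-of-words counter, not a real embedding) and `cross_score` plays the cross-encoder (a shared-word count, not a real transformer). Both functions and the sample texts are assumptions for illustration.

```python
import math
from collections import Counter

calls = {"encode": 0, "cross": 0}  # forward passes per stand-in model

def encode(text):
    """Bi-encoder stand-in: one text in, one vector out, blind to the other side."""
    calls["encode"] += 1
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cross_score(query, passage):
    """Cross-encoder stand-in: needs both texts in the same input.

    A BERT-style model would read "[CLS] query [SEP] passage [SEP]";
    this toy only counts shared words.
    """
    calls["cross"] += 1
    return float(len(set(query.lower().split()) & set(passage.lower().split())))

passages = [
    "the contract was signed in march",
    "termination requires ninety days notice",
    "payment is due within thirty days",
]
queries = ["who signed the contract", "termination notice period"]

index = [encode(p) for p in passages]   # offline: one pass per passage, done once
offline_passes = calls["encode"]
calls["encode"] = 0                     # from here on, count query-time work only

for q in queries:
    q_vec = encode(q)                                     # one pass per query
    bi_scores = [cosine(q_vec, p_vec) for p_vec in index]
    cross_scores = [cross_score(q, p) for p in passages]  # one pass per pair

print(offline_passes, calls)
```

The bi-encoder pays one pass per query at query time; the cross-encoder pays one pass per (query, passage) pair, and none of that work can be moved offline.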

2. The cost-perf gradient, tested on the same cases

The textbook funnel sells a clean cost-perf gradient: weak embeddings at the bottom, strong embeddings in the middle, cross-encoder rerankers on top. Each step costs more, each step is supposed to score more accurately. The honest test is to take the cases Article 2 catalogued and run them across the whole gradient: four embedding models from GloVe-avg (2014) to text-embedding-3-large (2024), plus three off-the-shelf cross-encoder rerankers (bge-base, bge-large, ms-marco-MiniLM-L-12-v2). Seven columns per figure. Read each row horizontally and the gradient either holds, breaks, or sometimes inverts.

Three things to watch as you scan each figure:

– Does the TARGET row’s #1 win migrate from left to right (the gradient holds, bigger model = better)?
– Does the TARGET get stuck at #2-#3 across all seven columns (no learned scorer catches the shape)?
– Or does a smaller, cheaper model rank the TARGET higher than the big rerankers (the gradient inverts)?

All three patterns appear below.
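For readers who want to reproduce this kind of grid reading on their own cases, here is a minimal sketch of the bookkeeping. `target_rank` turns one column of scores into the TARGET’s rank, and `gradient_pattern` applies one possible reading of the three patterns above; that reading, the function names, and every number in the example are assumptions for illustration, not results from the article.

```python
def target_rank(scores, target="TARGET"):
    """1-based rank of the target among candidates, highest score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(target) + 1

def gradient_pattern(ranks):
    """Label one figure from the TARGET rank in each column.

    `ranks` is ordered from the cheapest scorer to the most expensive.
    "stuck"   : no column puts the target at #1
    "holds"   : once some column reaches #1, every costlier column keeps it
    "inverts" : a cheaper column reaches #1 and a costlier one loses it
    """
    if 1 not in ranks:
        return "stuck"
    first_win = ranks.index(1)
    if all(r == 1 for r in ranks[first_win:]):
        return "holds"
    return "inverts"

# Made-up scores for a single column.
column = {"TARGET": 0.41, "trap": 0.57, "decoy": 0.12}
print(target_rank(column))

# Made-up rank rows, cheapest column first.
print(gradient_pattern([3, 2, 1, 1]))
print(gradient_pattern([3, 2, 3, 2]))
print(gradient_pattern([2, 1, 2, 2]))
```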

2.1 Literal-token trap (Article 2, section 1.6)

Query hot dog, candidates: a food paraphrase (TARGET, zero shared tokens), the lexical trap the dog basked in the hot sun, and an unrelated decoy. In Article 2, ada-002 fell for the trap; only text-embedding-3-large recovered.

The result on the seven-column grid is striking: 3-large is still the only model that flips the trap to #2 and lifts the paraphrase to #1. None of the three rerankers do. Stacking bge-large on top of ada-002 does not buy you what 3-large already gives you for free at the embedding stage. If the budget is “either upgrade the embedding or add a reranker,” this case argues for upgrading the embedding.

*Query `hot dog`. Each column’s #1 row shows whether the scorer picked paraphrase or trap – Image by author*
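To see why the trap is a trap, it helps to look at what a purely lexical scorer does with these candidates. The sketch below uses Jaccard token overlap, a deliberately naive scorer that is not one of the seven models tested. The trap sentence is the one quoted above; the paraphrase and decoy wordings are hypothetical, since the article does not reproduce them.

```python
import re

def tokens(text):
    """Lowercased set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Shared tokens divided by all distinct tokens across both texts."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

query = "hot dog"
candidates = {
    "paraphrase": "a grilled sausage served in a sliced bun",  # hypothetical wording
    "trap": "the dog basked in the hot sun",                   # quoted in the text above
    "decoy": "quarterly revenue grew by ten percent",          # hypothetical wording
}

ranking = sorted(candidates, key=lambda name: jaccard(query, candidates[name]), reverse=True)
print(ranking[0], jaccard(query, candidates["trap"]), jaccard(query, candidates["paraphrase"]))
```

A scorer that ranks the paraphrase first has to get its signal from somewhere other than shared tokens, which is the behaviour the figure is probing.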

2.2 Synonym recovery with a hard lexical distractor (Article 2, section 1.2)

Query is green card needed. The right answer (Permanent resident card is required for this process.) shares zero tokens with the query but is the strict synonym. The trap (Green colored cards are popular in stationery stores.) shares THREE tokens (green, card, cards) and is semantically unrelated. This is the canonical “synonym vs lexical overlap” test.

The grid surfaces an inversion of the cost-perf claim. MiniLM, ada-002, 3-large and bge-base all rank the synonym TARGET #1. Then bge-large and ms-marco-MiniLM-L-12-v2 fall back to the lexical trap at #1, as if the bigger / MS-MARCO-trained models have a stronger lexical bias. Two of the three rerankers actively make this worse than bge-base does. A team that auto-stacks the biggest available reranker on every query loses ground here that they would have kept by sticking with the small one, or by skipping the reranker entirely.

*Synonym TARGET shares zero tokens; trap shares three. Each scorer rewards meaning or token overlap – Image by author*

2.3 Topical proximity vs answer relevance (Article 2, section 2.3)

User question: “Who signed the contract?” The corpus has one passage describing how contracts must be signed (procedural, dense in signed/signature/representative), and one passage that is the actual signature (Signed: John Smith, Marketing Director, dated 2025-03-15). On every embedding model in Article 2, the procedural passage outranked the actual signature. This is the kind of question-answer mismatch cross-encoders are trained on (MS-MARCO is roughly this shape repeated millions of times).

The grid says something the textbook doesn’t predict. MiniLM is the only model, embedding or reranker, that promotes the actual signature line to #1. Every other column, including the three cross-encoder rerankers explicitly trained on this kind of pair, leaves the procedural passage at #1 and the signature at #2. A 22M-parameter free embedding beats six other layers on the canonical reranker test. The cost-perf gradient does not just flatten here; it inverts.

*Procedural passage shares more tokens; signature line is the answer. Topical proximity vs answer-ness – Image by author*

2.4 Signal dilution in long context (Article 2, section 2.4)

The same answer sentence, presented twice: once alone, once buried inside a 70-word policy paragraph. A topical decoy (talking densely about deductibles, never giving the answer) and an unrelated paragraph round out the candidates. In Article 2 every embedding model picked the short answer alone, but lost the buried-answer paragraph to the topical decoy: the surrounding noise diluted the signal of the answer sentence.

This is the one shape where the rerankers earn their cost. bge-large, bge-base and ms-marco-MiniLM all rank the short answer #1 with the buried-answer paragraph #2. They recover the buried answer to second place, where ada-002 and MiniLM had it third or worse. 3-large already gets there at the embedding stage. So the picture is: on signal dilution, either pay for 3-large at the embedding stage, or stack a free reranker on top of a cheaper embedding. Both paths work. This is the cleanest case in the article for the cross-encoder layer.

*Same answer alone vs buried in a 70-word paragraph against a topical decoy – Image by author*
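The dilution effect can be reproduced with nothing more than word-count vectors. In the sketch below a bag-of-words cosine stands in for an embedding model (a strong simplification), and the query, answer and filler sentences are hypothetical. The filler shares no words with the query, so the dot product stays the same while the passage vector gets longer.

```python
import math
import re
from collections import Counter

def bow(text):
    """Word-count vector; a crude stand-in for a pooled embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical texts, not the article's test candidates.
query = "what is the deductible amount"
answer = "The deductible is 500 dollars per claim."
filler_before = "Policyholders should review coverage annually and contact their broker about renewals."
filler_after = "Claims must be filed within thirty days using form number twelve."
buried = f"{filler_before} {answer} {filler_after}"

alone_score = cosine(bow(query), bow(answer))
buried_score = cosine(bow(query), bow(buried))
print(round(alone_score, 3), round(buried_score, 3))
```

The answer sentence is identical in both passages; only the denominator of the cosine changes. That is the mechanism a scorer has to overcome to keep the buried paragraph above a topical decoy.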

2.5 The yes/no question (Article 2, section 2.6)

Article 2’s deepest case: the actual answer (Yes, it is needed.) to a yes/no question, against a literal copy of the query keywords (Permanent resident card) and a longer mix. On every embedding model, the literal-keyword copy beat the answer. The whole reason cross-encoders exist as a layer is that they are trained on query-answer pairs where the answer rarely repeats the query.

The grid mostly confirms: the literal copy Permanent resident card is #1 on every column. The TARGET (Yes, it is needed.) is #3 or #4 across all the embeddings and the BGE rerankers. The one column that promotes the actual answer is ms-marco-MiniLM-L-12-v2. It puts Yes, it is needed. at #2, ahead of A green card may be required. and the No answer. A small win, on a yes/no shape that nothing else handles. Worth knowing the MS-MARCO-trained reranker has this specific behavior; not enough to design a pipeline around.

*Yes/no answer is TARGET; literal copy of query is the trap. Does the scorer rank answer above echo – Image by author*

Read the columns horizontally and the cost-perf gradient is mostly flat. On 2.1 the only winner is 3-large (a 2024 embedding, no reranker required). On 2.3 the only winner is MiniLM (a 22M-param free embedding from 2021). On 2.2 two of the three rerankers are worse than the smaller models. Only 2.4 (signal dilution) shows a clean reranker win. Stacking a free off-the-shelf reranker on top of a cheaper embedding does not buy reliable lift over swapping the embedding for a stronger one; on some shapes it actively hurts.

This matches a pattern engineering teams discover the hard way: the marginal dollar is better spent on the embedding stage (or, as the rest of the series argues, on upstream architecture: expert keywords, classify-before-retrieve) than on a reranker. The classical funnel sells “embeddings cheap, rerankers more accurate” as a clean ladder. On these query shapes there is no ladder. Section 3 is the harder side: cases that don’t move regardless of which scorer you use.

3. Where the cross-encoder still breaks

Four failure modes that survive the cross-encoder layer regardless of size or family. The architectural job, which the rest of the series is about, is to recognise these cases at the question-parsing stage and route them through pipelines that don’t rely on similarity scoring at all.

3.1 Negation, still invisible

Article 2 ran the negation test on four embedding models: query “What is NOT a city?”, candidates Paris, New York, City, Table. Every model ranked Table (the only correct answer) at the bottom. The negation token carried no signal. Does any cross-encoder pick up the inversion?

*`Table` is the correct answer for negation. Does each scorer pick it or a city – Image by author*

Cross-encoders are trained on (query, relevant_passage) pairs from web search and MS-MARCO. Almost no training pair has the shape “the relevant passage is the complement of the query’s topic”. The model learned to score topical alignment, and a NOT in the query barely shifts that. The fix is at question-parsing time: detect the negation, invert the retrieval (Article 6).
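A minimal sketch of that parse-time fix follows, under heavy assumptions: the cue list, the function names, and the sample clauses are all hypothetical, and the term to exclude is passed in by hand rather than extracted from the question. It is not the implementation developed in Article 6.

```python
import re

# Hypothetical cue list; a production parser would need far more care.
NEGATION = re.compile(r"\b(?:not|no|without|except|never)\b|n['’]t", re.IGNORECASE)

def has_negation(question):
    """True when the question contains an explicit negation cue."""
    return bool(NEGATION.search(question))

def clauses_without(term, clauses):
    """Inverted retrieval: every clause that does NOT mention the term."""
    needle = term.lower()
    return [c for c in clauses if needle not in c.lower()]

clauses = [
    "Clause 4: The supplier provides indemnification for third-party claims.",
    "Clause 5: Invoices are payable within thirty days.",
    "Clause 6: Either party may terminate with ninety days notice.",
]

question = "is there a clause that does NOT mention indemnification?"
if has_negation(question):
    result = clauses_without("indemnification", clauses)
else:
    result = []  # would fall through to ordinary similarity retrieval
print(result)
```

The complement is computed by set logic over the clauses, so no similarity score is asked to understand the word NOT.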

3.2 Exact identifiers and internal acronyms

Contract reference numbers, internal product codes, acronyms that exist only inside the company. The intuition is that learned similarity will confuse ZRX-2025-A with the close-by ZRX-2024-B. Let’s see.

*Two contracts with one-character identifier difference. Every scorer except GloVe ranks the right one – Image by author*

The figure is a useful lesson in test design as much as in retrieval. With only three candidates and the right contract appearing verbatim in the candidate text, every modern scorer disambiguates correctly. MiniLM, both OpenAI embeddings, and all three rerankers put ZRX-2025-A at #1. Only GloVe gets confused. The real failure mode for identifiers is at scale: a corpus with hundreds of contracts whose surrounding text follows a templated pattern (Contract <ID> covers <line of business> up to <amount>), where the identifier is the only discriminating feature. There the embedding’s literal-token signal becomes a tiny fraction of the cosine, and the close-by IDs blur. Production-scale identifier disambiguation belongs in BM25 or an exact-match index (Article 6, section 2.2 via concept_keywords_df), not in similarity. The 3-candidate test here just shows that embeddings are not blind to identifiers when the field is small.
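The exact-match route can be sketched in a few lines. The identifier pattern below is inferred from the two example IDs and is an assumption, as are the function names and the filled-in template text; this is a plain dictionary index, not the series’ concept_keywords_df mechanism.

```python
import re
from collections import defaultdict

# Assumed identifier shape, inferred from ZRX-2025-A and ZRX-2024-B.
ID_PATTERN = re.compile(r"\b[A-Z]{3}-\d{4}-[A-Z]\b")

def build_id_index(docs):
    """Map each identifier to the positions of the documents that contain it."""
    index = defaultdict(list)
    for position, text in enumerate(docs):
        for identifier in ID_PATTERN.findall(text):
            index[identifier].append(position)
    return index

def lookup(question, index, docs):
    """Return documents whose identifier appears verbatim in the question."""
    hits = []
    for identifier in ID_PATTERN.findall(question):
        hits.extend(docs[i] for i in index.get(identifier, []))
    return hits

# Templated documents that differ only by identifier (hypothetical values).
docs = [
    "Contract ZRX-2025-A covers marine cargo up to 2,000,000 EUR",
    "Contract ZRX-2024-B covers marine cargo up to 2,000,000 EUR",
]
index = build_id_index(docs)
found = lookup("What does ZRX-2025-A cover?", index, docs)
print(found)
```

Because the lookup is an equality test on the identifier, two documents with otherwise identical text cannot blur into each other, however many templated contracts the corpus holds.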

3.3 Listing, the canonical failure mode

The reranker’s job is to rank candidates. A listing question wants all of them. Every scorer will dutifully order the eleven termination clauses from most to least relevant; the top-k cut discards the ones it ranked lowest, and the user, who asked for the complete set, gets a partial answer.

*Eleven termination clauses, every scorer. Score gradient is real but top-k silently discards real matches – Image by author*

The fix is listing aggregation (Article 12), not a reranker. A listing question is parsed as a list_all intent at the question-parsing stage and routed to a pipeline that returns every matching item, not the top-k by score.
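The routing idea can be sketched as follows. The cue phrases, the `retrieve` function, and the synthetic clauses are hypothetical, and the substring match stands in for whatever matching the real list_all pipeline uses; Article 12 develops the actual aggregation.

```python
# Hypothetical cue phrases for a list_all intent.
LISTING_CUES = ("list every", "list all", "every clause", "all clauses")

def is_listing(question):
    q = question.lower()
    return any(cue in q for cue in LISTING_CUES)

def retrieve(question, clauses, term, k=3):
    """Return every match for a listing question, the top-k otherwise."""
    hits = [c for c in clauses if term in c.lower()]
    if is_listing(question):
        return hits                      # complete set, no cut
    ranked = sorted(hits, key=lambda c: c.lower().count(term), reverse=True)
    return ranked[:k]                    # ranked retrieval keeps only k

# Eleven synthetic termination clauses plus two unrelated ones.
clauses = [f"Clause {i}: termination conditions, part {i}." for i in range(1, 12)]
clauses += ["Clause 12: payment schedule.", "Clause 13: governing law."]

listed = retrieve("list every clause that mentions termination", clauses, "termination")
ranked = retrieve("what is the termination notice period?", clauses, "termination")
print(len(listed), len(ranked))
```

The same eleven matches exist in both calls; the only difference is whether a top-k cut is applied, which is a routing decision and not something a better scorer can fix.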

3.4 Out-of-domain vocabulary

Every model on the grid carries the inductive bias of its training corpus. The OpenAI embeddings and the BGE rerankers are trained on broad web/retrieval data; ms-marco-MiniLM-L-12-v2 on MS-MARCO. Specialised vocabularies (medical, legal, financial, regulatory) sit outside those distributions. Fine-tuning the reranker on domain data fixes much of this. But fine-tuning is a project, not a free upgrade. Off-the-shelf, no scorer on the grid bridges to the company term.

*Query `contractor overtime` vs company term `non-employee labor`. Every scorer ranks TARGET at #3 – Image by author*

Universal failure across the seven columns. The TARGET sits at #3 on every model; Contractors are paid on a per-project basis (the surface lexical match) wins at #1. Neither the largest embedding nor the largest reranker bridges contractor → non-employee labor. This is exactly the problem the series’s concept_keywords_df is built to solve. The expert curates the mapping contractor → non-employee labor, overtime → beyond 40h/week, and the retrieval stage uses those keywords directly. The reranker would need fine-tuning on the company’s contracts to learn the same mapping the expert just typed.
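Here is a minimal sketch of how a curated mapping changes the outcome. A plain dictionary stands in for concept_keywords_df, whose real structure is not shown in this article; the two mappings and the surface-match sentence come from the text above, while the TARGET wording, the decoy, and the function names are hypothetical.

```python
# Stand-in for the curated table: user vocabulary -> document vocabulary.
CONCEPT_KEYWORDS = {
    "contractor": ["non-employee labor"],
    "overtime": ["beyond 40h/week"],
}

def expand_query(query, mapping):
    """Each query word, followed by any expert-curated document terms for it."""
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(mapping.get(word, []))
    return terms

def keyword_retrieve(query, passages, mapping):
    """Rank passages by how many expanded terms they contain; drop non-matches."""
    terms = expand_query(query, mapping)
    scored = [(sum(term in p.lower() for term in terms), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored if score > 0]

target = "Non-employee labor working beyond 40h/week requires written approval."  # hypothetical
surface = "Contractors are paid on a per-project basis"                           # from the text
decoy = "The cafeteria is open from noon."                                        # hypothetical
passages = [surface, decoy, target]

with_mapping = keyword_retrieve("contractor overtime", passages, CONCEPT_KEYWORDS)
without_mapping = keyword_retrieve("contractor overtime", passages, {})
print(with_mapping[0])
print(without_mapping)
```

Without the mapping the TARGET is not retrieved at all; with it, the TARGET matches on two curated terms and outranks the surface match. Every step is a lookup an expert can read and correct.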

4. Where rerankers actually justify their cost

The position of the series, stated plainly:

Cross-encoder rerankers are a fallback for narrow cases, not the primary stage of an enterprise pipeline. They are worth their cost when the candidate pool is large (top-100,000 from a vector store), the upstream is generic cosine, and there is no time to build a curated pipeline. They add little when the upstream is already small, already-scoped, and already structured.

In production enterprise RAG, three architectural moves make the reranker’s value smaller than the literature suggests.

Question parsing routes the query to a specific pipeline. A listing question runs through list_all aggregation (Article 12), not through ranked retrieval. A filtering question runs through metadata filters (Article 18), not through similarity scoring. A negation question is detected and inverted at question-parsing time (Article 6). The reranker’s input is therefore a small, already-scoped candidate set produced by a structurally appropriate pipeline, not a top-100 dump from a generic vector store.

Classify-before-retrieve shrinks the candidate pool. Article 15 develops the classification step that tags each document with topic, type, and date metadata. At query time, metadata filters reduce the candidate corpus from 200,000 documents to maybe 800. The reranker (if it runs at all) runs on a pool small enough that a domain expert could review it in fifteen minutes. There is no top-100,000 funnel left to manage.
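The filtering step can be sketched as a plain predicate over document metadata. The `Doc` fields mirror the topic, type and date tags mentioned above, but the class, the `prefilter` function, and the sample documents are hypothetical and are not taken from Article 15.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Doc:
    text: str
    topic: str
    doc_type: str
    dated: date

def prefilter(docs, topic=None, doc_type=None, dated_from=None):
    """Keep only documents matching every filter that was supplied."""
    kept = []
    for d in docs:
        if topic is not None and d.topic != topic:
            continue
        if doc_type is not None and d.doc_type != doc_type:
            continue
        if dated_from is not None and d.dated < dated_from:
            continue
        kept.append(d)
    return kept

docs = [
    Doc("Supplier agreement, termination terms", "termination", "contract", date(2025, 3, 15)),
    Doc("Older supplier agreement", "termination", "contract", date(2019, 6, 1)),
    Doc("Memo about termination policy", "termination", "memo", date(2025, 1, 10)),
    Doc("Invoice for consulting", "billing", "invoice", date(2025, 2, 2)),
]

pool = prefilter(docs, topic="termination", doc_type="contract", dated_from=date(2024, 1, 1))
print(len(docs), "->", len(pool))
```

Any scorer that runs afterwards, reranker included, only sees `pool`, so its cost and its error surface shrink with the filter rather than with model size.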

Expert keywords replace probabilistic ranking on the cases that matter. Article 6 builds the concept_keywords_df table that maps user vocabulary to document vocabulary. The mapping is curated; it is auditable; it is exactly the work that a reranker is supposed to do probabilistically. Where the keyword dictionary covers the case, ranking is replaced by structured retrieval and the reranker’s value drops further.

The legitimate large-corpus case (thousands to hundreds of thousands of documents in a vector store, single ad-hoc question, no time to build a curated pipeline) is real, and the series acknowledges it in Articles 15-20 (corpus scale). Even there, the preferred move is classify-and-filter first; the reranker comes in to disambiguate the residual pool.

The bottom line for the reader: rerankers are useful. They have a real place in the literature. The cost/precision gradient is real, and the funnel is the engineering reality of any production retrieval architecture. The series explains them and uses them where they earn their cost. But the architectural choices the series defends (expert vocabulary, structure-aware retrieval, classify-before-retrieve, specific pipelines for specific question-types) push the reranker into a narrow corner rather than the default. Article 9 returns to method combination at the retrieval layer; Articles 15-20 develop the corpus-scale case.

5. Conclusion

The reranker question is one slice of a larger framing: Enterprise Document Intelligence Volume 1 builds enterprise RAG brick by brick, with the upstream bricks (question parsing, classify-before-retrieve, expert keywords) doing the work the reranker is usually asked to do.

The textbook funnel sells a clean cost-perf gradient: cheap embeddings at the bottom, a more expressive cross-encoder reranker above, then the LLM. Stacking the reranker on top of weak retrieval is supposed to fix what the embedding misses.

The seven-column grid says otherwise. On four of the five “expected reranker wins” from Article 2, the cross-encoder columns either match the embedding or do worse. Only signal dilution (a buried answer in a long paragraph) is a clean reranker win. On the literal-token trap, the canonical answer-vs-procedural test, and the synonym-vs-distractor case, a strong embedding (text-embedding-3-large) or even a small free one (MiniLM) often beats off-the-shelf rerankers. Negation, exact identifiers (at small candidate count), out-of-domain vocabulary, listing: none of them move regardless of which scorer you use.

The series’s editorial position survives the data, and is reinforced by it: rerankers are a fallback for one specific shape (signal dilution in long context), not the primary stage. The marginal dollar buys more lift at the embedding stage than the reranker stage on these query shapes. The architectural moves that make rerankers mostly redundant (question parsing, classify-before-retrieve, expert keywords, specific pipelines for specific intents) are what the rest of the series builds. Article 3 makes the wider case (RAG is not machine learning). Articles 6 and 7 build the upstream bricks. Article 9 returns to method combination at the retrieval layer. Articles 15-20 develop the corpus-scale case where rerankers might genuinely justify their place.

6. Further reading

Nogueira & Cho, Passage Re-ranking with BERT, 2019 (arXiv:1901.04085). The seminal cross-encoder reranker paper; sets up the architecture the bge-reranker family inherits.
Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020 (arXiv:2004.12832). The late-interaction alternative: it keeps token-level query-passage matching while passage representations are still precomputed, at close to bi-encoder cost.
Xiao et al., C-Pack / BGE Reranker family, 2023 (arXiv:2309.07597). The BAAI release notes for the rerankers used in this article (bge-reranker-base, bge-reranker-large).
Pradeep et al., RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!, 2023 (arXiv:2312.02724). LLM-as-reranker alternative; relevant once frontier model costs drop further.