Anchor Detection for RAG: Parallel Detectors, Then One LLM Call at the End

This article belongs to the retrieval brick of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation. It is the second of the brick’s three parts. The previous part, Article 7A (retrieval as filtering), set the mental model; this one builds the machine: parallel anchor detectors, keyword always, embeddings alongside, and one LLM call at the end.

Where this article sits in the series: Article 7 (retrieval), the anchor-detection part, inside Part II (the four bricks) – Image by author

Retrieval in an enterprise RAG system is filtering on two structured tables (line_df and toc_df), and every candidate carries an anchor (where the match lands) plus a context (what gets expanded for generation). That mental model is the subject of Article 7A (retrieval as filtering). This article zooms into how the anchors are produced: a three-stage pipeline that runs keyword detection and embeddings in parallel, aggregates the hits to a structural unit, and ends with a single LLM call that ranks the candidates with reasons.

The user types “How is attention computed?” on the Transformer paper. Six candidate pages match attention. The right one mentions softmax, query, key, d_k together, and sits in the section the TOC calls “Scaled Dot-Product Attention”. Two retrievers — keyword and embedding — both spot the candidate set. Neither alone can tell which page actually answers the question. A third step has to read the candidates side by side, with the section each one sits in, and pick the right one with a reason the auditor can read months later.

The three-stage pipeline that follows runs on three principles:

  • Keywords run always. Keyword detection is free. There’s no scenario where you wouldn’t want its signal. It runs on both line_df and toc_df from the first millisecond.
  • Embeddings run in parallel and are optional. When vocabulary mismatch is expected or the question is conceptual, embeddings catch what keyword matching misses. With pre-computed indices, the similarity lookup takes microseconds; the remaining query-time cost is embedding the question itself. Skip them when the keyword signal is already clean.
  • One LLM call at the end. No mid-pipeline LLM “TOC reasoning” step. The arbiter at stage 3 sees the TOC, the keyword hits, the embedding hits, and the structural attachment of each candidate, in a single call. It does the reasoning over the TOC implicitly as part of ranking.

This article walks the detectors on each table (Section 2 on toc_df, Section 3 on line_df), then the combinations across both tables (Section 4). The arbiter call itself, the decision tree, and the output JSON live in Article 7C (the LLM arbiter and the retrieval output JSON).

Throughout this article we work on a single document, Attention Is All You Need (Vaswani et al. 2017, 15 pages; arXiv non-exclusive distribution license, declared on the arXiv abstract page). It carries a clean native TOC in the PDF outline (22 entries, 3 levels deep), and the content is familiar territory for any engineer touching RAG: encoder, decoder, attention, queries, keys, values. That keeps the focus on the retrieval methods rather than on parsing a domain-specific corpus. This article also assumes the document carries its own TOC; recovering one from raw text is left to follow-up work.

Every method in this article starts from line_df and toc_df – Image by author

1. The anchor-detection pipeline

Anchor detection runs in three stages. Stage 1 runs keyword detection and embedding similarity in parallel on line_df and toc_df. Stage 2 aggregates the hits to a structural unit (section via toc_df if available, otherwise page or chunk). Stage 3 hands the aggregated units to a single LLM call that ranks them and writes its reasoning per pick.

Keyword detection is the always-on baseline. It matches rows whose text contains the question’s keywords, with co-occurrence boosts when several keywords land in the same line or page. Cheap, deterministic, auditable. There’s no reason not to run it: it costs nothing, and when it hits cleanly, it gives the LLM strong signal at stage 3.

Embeddings run in parallel as an optional second signal. They are useful when vocabulary mismatch is expected (the question says “prime”, the document says “montant annuel”), or when the question is conceptual rather than lexical. If you’ve pre-computed the embeddings, the marginal cost at query time is one embedding of the question plus a similarity lookup that takes microseconds. If not, you can skip embeddings entirely on questions where the keyword signal is already clean.
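A minimal sketch of stage 1 under stated assumptions: the two detector functions below (`keyword_detector`, `embedding_detector`) are hypothetical stubs that only fix an interface, and the thread pool is one possible way to run them side by side, not necessarily how the series implements it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical detectors: each takes the question and returns a list of hits.
# Real ones would scan line_df / toc_df; these stubs only fix the interface.
def keyword_detector(question: str) -> list[dict]:
    return [{"source": "keyword", "section_id": "9", "score": 3.0}]

def embedding_detector(question: str) -> list[dict]:
    return [{"source": "embedding", "section_id": "9", "score": 0.81}]

def run_detectors(question: str, use_embeddings: bool = True) -> list[dict]:
    """Run the keyword detector always, and the embedding detector when enabled."""
    detectors = [keyword_detector]
    if use_embeddings:
        detectors.append(embedding_detector)
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        futures = [pool.submit(d, question) for d in detectors]
        # Collect in submission order so the output is deterministic.
        return [hit for f in futures for hit in f.result()]

hits = run_detectors("How is attention computed?")
```

Turning `use_embeddings` off leaves the keyword path untouched, which mirrors the "keywords always, embeddings optional" rule.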

The LLM at the end sees everything: keyword hits, embedding hits, the structural unit each candidate belongs to. It ranks the units once, with reasons. Two design consequences of putting the LLM at the end rather than mid-pipeline:

  • The LLM does the reasoning over the TOC implicitly. Asked “what happens if we exit early?” against a document whose TOC has Termination and Penalties (and no Exit section), the LLM picks both at ranking time. There’s no separate “TOC reasoning” LLM step earlier in the pipeline; the arbiter does that work as part of its single call.
  • The LLM resolves subtle title matches. If the question is about “the premium” but the relevant section is titled “Summary of the contract”, no keyword will match the title. The LLM, given the keyword hits in the body lines and the structural attachment to that section, will still pick it.

Three stages: detection (parallel) → aggregate → one LLM call – Image by author
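Stage 2 is described above but not shown in code. The sketch below is one way to collapse hits into structural units with pandas, assuming a hypothetical hits table with `section_id`, `detector`, and `score` columns; the column names and the sample rows are illustrative.

```python
import pandas as pd

# Hypothetical stage-1 output: one row per hit, tagged with its detector.
hits = pd.DataFrame([
    {"section_id": "9", "detector": "keyword", "score": 3.0},
    {"section_id": "9", "detector": "keyword", "score": 2.0},
    {"section_id": "9", "detector": "embedding", "score": 0.81},
    {"section_id": "4", "detector": "embedding", "score": 0.74},
])

def aggregate_hits(hits: pd.DataFrame, unit: str = "section_id") -> pd.DataFrame:
    """Collapse line-level hits into one row per structural unit."""
    per_detector = (
        hits.groupby([unit, "detector"])["score"]
        .agg(n_hits="count", best_score="max")
        .reset_index()
    )
    # One column pair per detector so the arbiter sees which signal fired where.
    wide = per_detector.pivot(index=unit, columns="detector",
                              values=["n_hits", "best_score"])
    wide.columns = [f"{det}_{stat}" for stat, det in wide.columns]
    return wide.reset_index()

units = aggregate_hits(hits)
print(units)
```

Keeping the detectors in separate columns, rather than summing them into one score, leaves the weighing to the arbiter. Passing `unit="page_num"` would give the page-level fallback when no TOC is available.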

The rest of this article walks the detection methods (Section 2 on toc_df, Section 3 on line_df, Section 4 on how the two tables collaborate). Article 7C (the LLM arbiter) is where that final call lives: the single call that turns aggregated candidates into a ranked answer.

2. Filtering on toc_df

Two detectors run on the TOC: keyword match (always, free) and embedding match (optional, parallel). Both are pure scoring, no LLM at this stage. The cognitive work (picking the right sections from a question like “what happens if we exit early?” when the relevant section is titled “Termination”) happens later, in the arbiter call. The arbiter sees the TOC and the keyword / embedding hits in a single LLM call.

We do show a standalone reason_on_toc function below as a pedagogical aside: it isolates what the arbiter does internally when it reasons over the TOC. In production you can either run it as a separate call (extra LLM cost, useful for debugging) or fold it into the arbiter (one LLM call total, the preferred default).

2.1 What the arbiter reasons about

The toc_df is small enough to pass in its entirety to an LLM. The arbiter (developed in Article 7C, the LLM arbiter) exploits this: it reads the whole TOC and reasons about which sections answer the question. The standalone reason_on_toc function below isolates the same logic as a separate call, useful when you want to inspect or debug the TOC reasoning step on its own.

Why this matters. The LLM understands semantics, but more importantly it understands implications. “What happens if we exit early?” does not share vocabulary with “Termination”, but the LLM identifies that exiting a contract is what termination means. “How does the insurer handle a flood?” does not share vocabulary with “Claims procedure”, but the LLM identifies that handling damage is the claims process. “Are there fees for changing the coverage?” may match both “Coverage modification” and “Schedule of fees”, and the LLM picks both, with reasoning that explains why. A subtle case in production: a question about “the premium” lands on a section titled “Summary of the contract”. No keyword matches, but the LLM, given the body lines that mention premium amounts attached to that section, will still pick it.

An embedding model captures “exit early ≈ termination” through similarity, but it cannot capture “exit early implies penalties”. That is reasoning, not similarity.

The cost is one mid-tier LLM call (a few thousand tokens for a typical TOC), a few hundred milliseconds of latency. When folded into the arbiter, it costs nothing extra: the arbiter would see the TOC anyway. The method is infeasible on line_df: passing 12,000 lines of content to an LLM and asking it to “pick the relevant ones” is too expensive, too slow, too unreliable. The TOC’s small size is what unlocks this method.

import pandas as pd
from pydantic import BaseModel

# `client` and `model_chat` are assumed to be defined earlier in the notebook.


class SectionSelection(BaseModel):
    section_ids: list[str]
    reasoning: str


def reason_on_toc(question: str, toc_df: pd.DataFrame) -> SectionSelection:
    """Pass the full TOC to an LLM, ask which sections are relevant, with reasoning.

    The prompt uses [id=N] markers so the LLM returns our internal section_id, not
    the title's leading number (e.g. "5.2") which would not match line_df.
    """
    # One TOC entry per line, prefixed with its internal id.
    toc_text = "\n".join(
        f"[id={row.section_id}] {row.title} (level {row.level}, pp. {row.start_page}-{row.end_page})"
        for row in toc_df.itertuples()
    )
    prompt = (
        "Given this question and the document's table of contents, "
        "identify which sections most likely contain the answer. "
        "Consider implications and related concepts, not just keyword overlap.\n\n"
        "IMPORTANT: return the value inside the [id=...] brackets -- just the bare integer, "
        "e.g. \"9\" not \"id=9\" and not \"5.2\".\n\n"
        f"Question: {question}\n\nTable of contents:\n{toc_text}"
    )
    # Structured output: the response is parsed straight into SectionSelection.
    return client.responses.parse(
        model=model_chat,
        input=prompt,
        text_format=SectionSelection,
    ).output_parsed


# A reader's question about the paper.
selection = reason_on_toc(
    "How does the Transformer handle long-range dependencies between words?",
    toc_df,
)
print("Picked sections:", selection.section_ids)
print("Reasoning:", selection.reasoning)

On the Transformer paper, the LLM picks sections ['4', '11'] for the question “How does the Transformer handle long-range dependencies between words?”. Its verbatim reasoning: “The question about how the Transformer handles long-range dependencies between words is best addressed in sections that discuss the attention mechanism (Section 4) and the reasoning behind using self-attention (Section 11). The attention mechanism is key to managing long-range dependencies, while Section 11 likely provides insights into its necessity and effectiveness.”

That paragraph is exactly what makes the choice auditable. A keyword method would have returned a list of sections without telling you why. The LLM, by writing its reason inline, hands the audit trail to you for free.

2.2 Title keyword match (default detector)

Match the parsed question’s keywords against section titles. A section whose title contains the keyword is almost certainly the section you want. This is the default detector on toc_df: cheap, deterministic, always-on. There’s no reason not to run it; its hits feed straight into the arbiter.

When it suffices on its own: When the question’s vocabulary is unambiguous and matches the document’s vocabulary directly. “What does the warranty section say?”: the title “Warranty” matches. The arbiter sees one clean hit and confirms it.

When it isn’t enough: Generic titles (“Article 1”, “Section 2.1”), vocabulary mismatch (“exit early” vs “Termination”), or subtle titles where the relevant section’s title doesn’t mention the question’s terms (“prime” vs “Summary of the contract”). The arbiter handles those. Keyword match still runs and contributes whatever it can; the arbiter takes it from there.

def match_titles(toc_df: pd.DataFrame, keywords: list[str]) -> pd.DataFrame:
    """Return toc_df rows whose title contains any of the keywords (case-insensitive)."""
    keywords_lower = [kw.lower() for kw in keywords]
    mask = toc_df["title"].str.lower().apply(
        lambda t: any(kw in t for kw in keywords_lower)
    )
    return toc_df[mask]


# A natural search for "attention" -- matches five sections (one parent + four subsections).
match_titles(toc_df, ["attention"])

Running it on the paper’s TOC with the keyword attention:

One keyword, five clean hits, no LLM call, no embedding – Image by author

2.3 Title embedding match (optional parallel signal)

Embed each section title once at ingestion, then at query time embed the question and find the closest titles by cosine similarity. Runs in parallel with keyword match, contributes a second detection signal to the arbiter.

Where it helps: Vocabulary mismatch cases where the title doesn’t share words with the question (“exit early” vs “Termination”). Cosine similarity catches the semantic proximity even when keyword matching doesn’t. With pre-computed title embeddings, the marginal query-time cost is one embedding of the question plus a similarity lookup that takes microseconds.

Where it adds little: When the keyword signal is already clean and the arbiter would have picked the right section anyway. You can skip embeddings on questions where keyword hits look strong; the arbiter still gets enough signal to decide.

import numpy as np

# `get_embedding` and `client` are assumed to be defined earlier in the notebook.


def embed_match_titles(query: str, toc_df_with_embeddings: pd.DataFrame, top_k: int = 3):
    """Find titles closest to the query by cosine similarity."""
    query_vec = np.asarray(get_embedding(query, client=client))
    scored = []
    for row in toc_df_with_embeddings.itertuples():
        title_vec = np.asarray(row.embedding)
        # Cosine similarity: dot product divided by the product of the norms.
        sim = float(np.dot(query_vec, title_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(title_vec)))
        scored.append((row.section_id, row.title, sim))
    return sorted(scored, key=lambda x: -x[2])[:top_k]


# Embed every title and rank by similarity to a vague reader query.
toc_df_emb = toc_df.copy()
toc_df_emb["embedding"] = toc_df_emb["title"].apply(lambda t: get_embedding(t, client=client))
embed_match_titles("where do they explain self-attention?", toc_df_emb, top_k=3)

Ranked on the query “where do they explain self-attention?” against the paper’s section titles:

Embedding over titles ranks the right subsection on top, with a narrow gap – Image by author

2.4 Summary

The default flow on toc_df: run title keyword match (always, free) and title embedding match (parallel, optional) as detectors. Let the arbiter do the reasoning when it ranks the aggregated candidates. The standalone reason_on_toc function shown above isolates that reasoning step for debugging or for pipelines that prefer two LLM calls over one.

3. Filtering on line_df

The line_df is where the actual answer text lives. You cannot afford to run an LLM over tens of thousands of lines, but the LLM’s content-understanding work is not absent: it is moved upstream, into the question parsing brick. Two methods run directly on line_df: keyword matching (carrying the expanded vocabulary from the parsed question) and embedding similarity.

Content keyword match: The standard method on line_df, with several enhancements that turn naive keyword search into something enterprise-grade.

Use the parsed question’s keywords, with weights. The question parsing brick produced a Keyword list from three sources, each carrying its own weight at retrieval time:

  • Direct extraction from the user’s wording (premium).
  • LLM expansion for synonyms and variants (prime, cotisation).
  • Expert dictionary entries with regex patterns and disambiguators (premium near €<amount>).

This is where the LLM contribution to content filtering appears. The expanded keywords and the expert dictionary patterns came from LLM-driven question parsing. The keyword filter on line_df is therefore much smarter than naive lexical search; it carries the LLM’s understanding of the question vocabulary as a precomputed signal, not as a per-query LLM call.
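To make the three sources concrete, here is a small sketch of a weighted keyword list. The `Keyword` dataclass, its fields, the weights, and the patterns are all illustrative assumptions; the real structure comes from the question parsing brick and may differ.

```python
import re
from dataclasses import dataclass

@dataclass
class Keyword:
    pattern: str   # regex applied to each line
    source: str    # "direct", "expansion", or "dictionary"
    weight: float  # illustrative value, not taken from the series

# Hypothetical keyword list for a question about the premium.
keywords = [
    Keyword(r"\bpremium\b", "direct", 1.0),
    Keyword(r"\b(?:prime|cotisation)\b", "expansion", 0.7),
    Keyword(r"\bpremium\b.{0,40}€\s?\d", "dictionary", 2.0),  # premium near an amount
]

def weighted_keyword_score(text: str, keywords: list[Keyword]) -> float:
    """Sum the weights of every keyword pattern found in the line."""
    return sum(kw.weight for kw in keywords
               if re.search(kw.pattern, text, re.IGNORECASE))

recap = weighted_keyword_score("Annual premium: € 125,000", keywords)
prose = weighted_keyword_score("The premium is the price of cover.", keywords)
print(recap, prose)
```

The recap line fires both the direct keyword and the dictionary pattern, so it outscores prose that only mentions the word.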

3.1 Boost on co-occurrence

A line that contains keywords from two semantic groups (the topic and a value-shaped term) is far more likely to be the answer than a line that contains only one. “premium” alone matches the section header, the definitions, and the explanatory prose. “premium” near a number near “€” matches the recap line that states the amount.

The demo uses the Attention Is All You Need paper (Vaswani et al. 2017, arXiv preprint), with the question “How is attention computed?”. The primary group is the attention vocabulary; the secondary group is “any line that looks like a formula” (math tokens: softmax, query, key, d_k). Lines hitting BOTH groups are almost certainly the formula definitions, not the prose around them.

import re


def co_occurrence_score(text: str, primary: list[str], secondary: list[str]) -> int:
    """Score a line by how many keywords from each semantic group it contains.

    Returns 0 if either group is missing -- the point is co-occurrence, not frequency.
    """
    p_hits = sum(1 for kw in primary if re.search(kw, text, re.IGNORECASE))
    s_hits = sum(1 for kw in secondary if re.search(kw, text, re.IGNORECASE))
    if p_hits == 0 or s_hits == 0:
        return 0
    return p_hits + s_hits


# Question: "How is attention computed?"
# Raw strings with single backslashes, so \b is a regex word boundary.
primary = [r"\battention\b"]
secondary = [r"\bsoftmax\b", r"\bquer(?:y|ies)\b", r"\bkeys?\b", r"\bd_?k\b"]
scored = line_df.assign(score=line_df["text"].apply(
    lambda t: co_occurrence_score(t, primary, secondary)
))
top = scored[scored["score"] > 0].sort_values("score", ascending=False).head(8)
top[["page_num", "line_num", "section_id", "score", "text"]]

Run on the “How is attention computed?” question with attention as the primary group and formula tokens (softmax, query, key, d_k) as the secondary group:

Co-occurrence collapses 200+ mentions to the few lines carrying the formula – Image by author

Regex for high-value patterns: Some answer shapes are too specific for keyword matching but trivial for regex: monetary amounts, ISO dates, policy codes, clause numbers. When the expected-shape pattern from the parsed question says “the answer is a date” or “the answer is a monetary amount”, retrieval boosts lines whose text matches the corresponding regex. The Attention paper has no monetary values, so we demonstrate on a few synthetic strings.

Each shape fires only on its target; cost is one regex per pattern per line – Image by author
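A sketch of what such a shape catalog could look like on synthetic strings. The three patterns are deliberately simple assumptions; a production catalog would come from the expert dictionary and would need broader coverage (locales, currencies, date formats).

```python
import re

# Illustrative shape catalog; the patterns are assumptions, not the series' catalog.
SHAPES = {
    "money": re.compile(r"(?:[$€£]\s?\d[\d,]*(?:\.\d+)?|\d[\d,]*(?:\.\d+)?\s?(?:EUR|USD))"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "clause": re.compile(r"\b(?:Article|Section|Clause)\s+\d+(?:\.\d+)*\b"),
}

def detect_shapes(text: str) -> list[str]:
    """Return the names of every shape whose regex fires on the line."""
    return [name for name, rx in SHAPES.items() if rx.search(text)]

samples = [
    "Annual premium: $125,000.",
    "Effective date: 2024-03-01",
    "See Section 4.2 for exclusions.",
    "Attention is all you need.",
]
for s in samples:
    print(detect_shapes(s), "|", s)
```

At retrieval time, only the shape flagged by question parsing would be applied, and a hit would add a boost to the line's keyword score.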

Lexicon match for enumerated entities. Other answer shapes are not regex-shaped, they are enumerated: the answer is one of a finite, knowable list. Country names, currency codes, ISO language codes, named contract parties, product references. This is the regex pattern’s twin: instead of a regex, you carry a dictionary that lists every valid value with its variants. When the question’s expected-shape says “the answer is a country”, retrieval boosts lines containing any term from the country lexicon.

Like the regex catalog, the lexicon comes from the expert dictionary built at question parsing time. The work happens once: the expert lists the entities and their variants ("France": ["France", "FR", "FRA", "République française", "Hexagone"]). The retrieval engine then pays a cheap lookup per line.

The two cases together cover most “the answer has a known shape” situations: regex when the shape is a syntactic pattern (number plus currency, ISO date format, code prefix), lexicon when the shape is membership in a closed set (country, currency, language, contract party, product code). Both fire only when question parsing flags the expected shape, so they cost nothing on questions where they are not relevant.

Lexicon resolves variants (Deutschland, U.S., Great Britain) to canonical names – Image by author
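The lookup can be sketched as follows. The France entry is the one quoted above; the other entries, the function names, and the choice to match case-sensitively (so that short codes like "FR" or "DE" do not fire on ordinary words) are assumptions for illustration.

```python
import re

# Illustrative lexicon: canonical name -> known variants (entries beyond France are assumed).
COUNTRY_LEXICON = {
    "France": ["France", "FR", "FRA", "République française", "Hexagone"],
    "Germany": ["Germany", "Deutschland", "DE"],
    "United States": ["United States", "U.S.", "USA"],
    "United Kingdom": ["United Kingdom", "Great Britain", "UK"],
}

def compile_lexicon(lexicon: dict[str, list[str]]) -> dict[str, re.Pattern]:
    """Build one regex per canonical entity, longest variant first."""
    compiled = {}
    for canonical, variants in lexicon.items():
        alts = "|".join(re.escape(v) for v in sorted(variants, key=len, reverse=True))
        # Lookarounds instead of \b so variants ending in "." still match.
        compiled[canonical] = re.compile(rf"(?<!\w)(?:{alts})(?!\w)")
    return compiled

def match_lexicon(text: str, compiled: dict[str, re.Pattern]) -> list[str]:
    """Return the canonical names of every entity mentioned in the line."""
    return [name for name, rx in compiled.items() if rx.search(text)]

compiled = compile_lexicon(COUNTRY_LEXICON)
found = match_lexicon("Offices in Deutschland, the U.S. and Great Britain.", compiled)
print(found)
```

Compiling once at ingestion keeps the per-line cost to one regex search per entity.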

3.2 BM25 and TF-IDF

A more advanced variant uses TF-IDF or BM25. BM25 (Best Match 25) is the classical keyword-scoring formula that weights terms by their information content and normalizes for document length: common terms like “the” carry little signal, rare terms like “L131-1” are highly discriminative. In the RAG community, BM25 is often presented as the sparse complement to dense embeddings, the “hybrid retrieval” recipe.

In enterprise RAG, BM25 underperforms keyword filtering with business weighting. Here is the pattern, illustrated on three domains.

Example 1, insurance: “What is the annual premium?” Page A is a recap line “Annual premium: $125,000.” (one “premium”). Page B is a six-paragraph explanatory section on the concept of premium (ten “premium”, no amount). Page A is the answer; BM25 favors Page B.

Example 2, software docs: “What is the rate limit for the search endpoint?” Page A is a reference row “/search: 100 requests/minute” (one “rate”, one “limit”). Page B is a tutorial on rate-limiting concepts (twelve “rate”, nine “limit”). Page A is the answer; BM25 favors Page B.

Example 3, legal contracts: “What is the notice period for termination?” Page A is the termination clause “upon thirty (30) days’ written notice” (one of each). Page B is a Definitions section that uses “termination” and “notice” five or more times while defining adjacent concepts. Page A is the answer; BM25 favors Page B.

The shape these examples share is the dominant shape of enterprise documents. The right answer mentions the keyword once, alongside a specific value (amount, number, duration). The wrong answer mentions the keyword many times in explanatory or definitional prose, with no specific value. BM25 rewards term frequency (with saturation and length normalization), so on pages of comparable length it favors the wrong answer.
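The frequency effect can be checked on the term-frequency part of the standard BM25 formula. This is a sketch that omits the IDF factor and uses the common defaults k1 = 1.5 and b = 0.75; the equal page lengths are an assumption chosen to isolate the effect of repetition.

```python
def bm25_tf_component(tf: int, doc_len: int, avg_doc_len: float,
                      k1: float = 1.5, b: float = 0.75) -> float:
    """Term-frequency part of BM25 for one term (the IDF factor is omitted)."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Two pages of equal length (an assumption that isolates the frequency effect).
recap_page = bm25_tf_component(tf=1, doc_len=400, avg_doc_len=400)   # one "premium"
prose_page = bm25_tf_component(tf=10, doc_len=400, avg_doc_len=400)  # ten "premium"
print(recap_page, prose_page)
```

Under these assumptions the page with ten mentions scores roughly twice the page with one, even though the score saturates below k1 + 1. When the units differ a lot in length, length normalization shifts the numbers, so the gap depends on how the corpus is split into units.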

What works instead:

  • Co-occurrence boost rewards lines where multiple semantic groups are present (the topic plus a value group). Page A in each example pairs the topic with a value; Page B has only the topic.
  • High-value patterns boost lines containing the expected answer shape, whether the shape is a syntactic pattern (a monetary regex catches Example 1, a numeric regex catches Example 2, a duration regex catches Example 3) or a membership lookup (the country lexicon flags lines mentioning “France”, “FR”, or “République française”).
  • Expert dictionary weighting lets a domain expert mark “premium”, “rate limit”, “notice period” as value-required concepts that only score highly when paired with a value pattern.

These three signals are explicit business knowledge, not statistical heuristics. BM25’s IDF formula encodes none of them, regardless of how much corpus data it sees. The enterprise corpus is also too narrow for IDF to discriminate well: a single contract, or a homogeneous collection, lacks the cross-document variance BM25 was built to exploit.
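One way to encode a "value-required" concept is to gate the score on a value pattern, as sketched below. The dictionary entries, the function name, and the two score levels are hypothetical; only the idea (a bare mention scores low, a mention paired with its value shape scores high) comes from the text above.

```python
import re

# Hypothetical expert-dictionary entries: a concept plus the value shape it requires.
VALUE_REQUIRED = {
    "premium": re.compile(r"[$€£]\s?\d[\d,]*"),
    "notice": re.compile(r"\b\d+\)?\s+days\b", re.IGNORECASE),
}

def value_required_score(text: str, concept: str,
                         base: float = 1.0, paired: float = 5.0) -> float:
    """Score a bare concept mention low, and a mention paired with its value shape high."""
    if not re.search(rf"\b{re.escape(concept)}\b", text, re.IGNORECASE):
        return 0.0
    return paired if VALUE_REQUIRED[concept].search(text) else base

page_a = value_required_score("Annual premium: $125,000.", "premium")
page_b = value_required_score("The premium reflects the insurer's view of risk.", "premium")
clause = value_required_score("upon thirty (30) days’ written notice", "notice")
print(page_a, page_b, clause)
```

However many times the explanatory page repeats "premium", none of its lines reaches the paired score without an amount on the same line.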

Practical recommendation: If you already have a BM25 index, keep it as a cheap baseline. If you do not, do not add one for the sake of “hybrid retrieval”. Invest the same engineering effort in the expert dictionary, the co-occurrence patterns, and the regex catalog. You will get more accuracy per hour invested.

3.3 Chunk embedding match

The standard vector-search method: chunk the line_df into pieces of 200 to 500 tokens, embed each chunk once at ingestion, embed the query at retrieval time, return the chunks with highest cosine similarity.

When it wins.

  • Vocabulary mismatch where rewrites help. “Exit early” embeds close to “early termination provisions” once the parsed-question rewrites are in play. Keyword match misses; embedding catches.
  • Conceptual or fuzzy questions. “Is the liability cap reasonable?” carries no specific keyword. Embedding can expose passages with the right conceptual shape.
  • Documents without a TOC and without specialized vocabulary. Memos, emails, articles. No structure to navigate; no expert dictionary. Embedding similarity is the most useful signal.

Three principles when this is your method.

  1. Multiple queries, not one. Use the rewrites from question parsing.
  2. Max similarity across rewrites, not average. A chunk that matches one rewrite well beats a chunk that matches all rewrites mediocrely.
  3. It is one signal, not the whole pipeline. Combine with content keyword and TOC methods (Section 4).
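The second principle is easy to see on toy vectors. The sketch below uses 2-D arrays as stand-ins for real embeddings (an assumption for readability) and compares the max and the mean over rewrites.

```python
import numpy as np

def rewrite_similarities(rewrite_vecs: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity of every chunk to every rewrite, shape (n_chunks, n_rewrites)."""
    r = rewrite_vecs / np.linalg.norm(rewrite_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ r.T

# Toy 2-D vectors standing in for real embeddings (illustrative only).
rewrites = np.array([[1.0, 0.0], [0.0, 1.0]])
chunks = np.array([
    [1.0, 0.0],   # matches rewrite 0 perfectly, rewrite 1 not at all
    [1.0, 1.0],   # matches both rewrites moderately
])
sims = rewrite_similarities(rewrites, chunks)
max_scores = sims.max(axis=1)    # best rewrite per chunk
mean_scores = sims.mean(axis=1)  # average over rewrites
print(max_scores, mean_scores)
```

With max, the chunk that nails one rewrite ranks first; with mean, the chunk that matches both only moderately overtakes it.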

Note on the LLM on line_df: You will sometimes see proposals to run an LLM-based filter directly on line_df content, for example scoring each chunk by passing it through an LLM with the question. This works for very small documents (a few hundred lines) but does not scale: at 10,000 lines, a per-line LLM call is prohibitive in cost and latency. The series handles content-level LLM understanding via question parsing (rewrites and dictionary) rather than per-query LLM calls on content. The LLM is involved in shaping the keywords before retrieval, not in scoring lines during retrieval.

3.4 Line-level cosine to explain (and re-rank) a page-level hit

A persistent complaint against page-level embedding retrieval is that the score is opaque. When retrieve_pages_by_similarity returns page 9 with cosine 0.7843 and page 6 with cosine 0.7728, there is no way to say which lines on each page drove the score, and therefore no way to defend the ranking to an auditor. The complaint is the one the minimal RAG pipeline raised on the same paper and the same question.

The fix is the same embedding trick applied at finer granularity. Pick the top-K pages from page-level retrieval. For each one, embed every line on the page, run cosine against the question, and look at the top lines. The result is a per-line explanation of the page-level signal: which lines carry vocabulary close to the question.
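A sketch of the per-page explanation step, assuming the line embeddings and the question embedding are already available as arrays. The function name `explain_page` and the toy 2-D vectors are illustrative; real vectors would come from the same embedding model used for page-level retrieval.

```python
import numpy as np

def explain_page(question_vec, line_vecs, lines: list[str], top_n: int = 3):
    """Rank the lines of one page by cosine similarity to the question."""
    q = np.asarray(question_vec, dtype=float)
    m = np.asarray(line_vecs, dtype=float)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:top_n]
    return [(lines[i], float(sims[i])) for i in order]

# Toy page: three lines with invented 2-D vectors.
lines = ["heading line", "body line", "table row"]
line_vecs = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
question_vec = [1.0, 0.0]
top = explain_page(question_vec, line_vecs, lines, top_n=2)
for text, sim in top:
    print(f"{sim:.3f}  {text}")
```

Run over the top-K pages, the returned pairs are what a heatmap colors: each line with its own similarity to the question.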

Page 9 (page-level winner): the signal scatters across Table 3 rows, no compact hotspot – Image by author
Page 6 (the actual answer): a dense deep-red cluster on the section heading and the equations – Image by author

Four things become visible immediately on the heatmap.

First, the line-level view shows what each page carries, and where on the page the signal sits. Page 9, the page-level winner on cosine, shows a scatter of orange tints across its Table 3 rows: lines that compare “sinusoidal positional encoding” with “learned positional embeddings”, related content, just not the section that defines positional encoding. Page 6, ranked third by page-level cosine, shows a dense deep-red cluster on its section heading “Positional Encoding” and the equations that follow. The top line on the page scores 0.8902, higher than any line on page 9.

Second, the page-level ranking is a poor aggregator. Page 6 contains the section that answers the question, but the page-level score averages over 53 lines (the encoding section plus surrounding content), and the mean is pulled down by the rest. Page 9 has a flatter distribution, so its mean stays higher even though its peak is lower. Switching the page score from mean(line_sims) to max(line_sims) (one line of code) flips the ranking and puts page 6 first on this question.

Third, the line-level view restores the anchor/context framing from Article 7A (retrieval as filtering). Each top line is an anchor (the actual phrase that scored high, auditable to the reader); the page is the context that gets passed downstream to generation. The same detect_then_extract pattern, with embedding instead of keyword.

Fourth, and unsurprisingly when one reads the actual top lines: the top-scoring lines on both pages are the ones that physically contain positional encoding or positional embedding. On page 6, the top three are the section heading itself (Positional Encoding) and two body lines that name the term (“as the embeddings, so that the two can be summed”, “where pos is the position and i is the dimension”). On page 9, the two strongest are Table 3 row (E) lines that mention the variation (“sinusoidal positional encoding with learned positional embeddings”, “positional embedding instead of sinusoids”). The embedding is not surfacing some deep semantic property here. It is finding the literal keyword, with the rest of each line’s words moving the score by a few percentage points around the keyword baseline. When the question carries an obvious key term, line-level embedding collapses to keyword matching with extra steps. A bare str.contains("positional") filter on line_df would have shown the same top lines, in microseconds, with no embedding model in the loop.

The shape of the score distribution adds a second, useful signal: on page 6 the top five lines decay smoothly (0.890, 0.866, 0.843, 0.802, 0.798), a dense plateau that signals “this section discusses the topic”; on page 9 the distribution shows two outliers (0.826, 0.823) then a five-point drop to 0.771, a sparse pattern that signals “the page mentions the topic in passing”. The density of above-threshold lines, not just the max, separates a defining section from a passing mention.
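The three page-level summaries discussed above (mean, max, density) fit in a few lines. In this sketch the top values are the ones quoted in the text; the tails (0.60 and 0.74), the line count for page 9, and the 0.79 threshold are invented to reproduce the described shapes.

```python
import numpy as np

# Top values are quoted from the text; the tails are invented for illustration.
line_sims = {
    "page_6": np.array([0.890, 0.866, 0.843, 0.802, 0.798] + [0.60] * 48),  # 53 lines
    "page_9": np.array([0.826, 0.823, 0.771] + [0.74] * 30),
}

def summarize(sims: np.ndarray, threshold: float = 0.79) -> dict:
    """Mean, max, and number of lines at or above the threshold for one page."""
    return {
        "mean": float(sims.mean()),
        "max": float(sims.max()),
        "n_above": int((sims >= threshold).sum()),
    }

summary = {page: summarize(sims) for page, sims in line_sims.items()}
for page, stats in summary.items():
    print(page, stats)
```

With these numbers, mean ranks page 9 first while max and the above-threshold count both rank page 6 first, which matches the flip described above. The threshold is a tuning choice and would need calibration on real score distributions.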

What this does not buy: an embedding score still cannot tell you whether the matched line is the definition you wanted or a passing mention in a comparison table. The line-level view makes the ranking inspectable; it does not make it precise. For precision, when the question carries a high-signal token (positional encoding, liability cap, effective date), the keyword and regex methods from Section 3 remain the right tools. A keyword match on “positional encoding” would have flagged both page 6 and page 9 directly, with no embedding step at all. Line-level cosine is the audit overlay that makes a page-level embedding pipeline defendable when embeddings are the right method to begin with, and Article 7C (the LLM arbiter) develops when that is, and when it is not.

4. Combining the two tables

The pipeline of Section 1 runs detectors on both tables in parallel and hands everything to the arbiter at the end. The combinations below are three concrete ways the detection layer can cross-pollinate line_df and toc_df before the arbiter sees the candidates: each makes the arbiter’s job easier by handing it pre-scoped, pre-boosted candidates rather than raw matches.

4.1 Reason-then-match (two-LLM-call alternative)

A two-stage pipeline that uses an extra LLM call up front: the LLM reads the TOC, picks the relevant sections, returns a short list of section IDs, then keyword retrieval runs only on the lines within those sections.

When this is worth the extra call: A 100-page contract has 50 sections; the LLM picks 2 to 3 in one call; keyword retrieval then operates on a few hundred lines instead of the full 15,000. The trade-off versus the single-arbiter pattern: you pay two LLM calls instead of one, but the second-stage keyword search runs over a much smaller pool, which matters when the pool is huge (think: a 500-page regulatory filing). For documents in the typical enterprise range (10 to 100 pages), the single-arbiter pattern is enough; the arbiter sees the whole TOC and the keyword hits at the same time and does the section reasoning as part of its single call.

Two-stage flow: LLM picks sections from toc_df, keywords score lines inside – Image by author

This pattern is a pre-arbiter narrowing trick. The dispatcher reaches for it on very large documents where the single-arbiter pattern would feed too many lines to the LLM; on normal-sized enterprise documents, the dispatcher folds the TOC reasoning into the arbiter itself.

def reason_then_match(question: str, primary_kw: list[str], secondary_kw: list[str],
                      line_df: pd.DataFrame, toc_df: pd.DataFrame, top_n: int = 5):
    """Stage 1: LLM picks sections from toc_df. Stage 2: keyword scoring within those sections.

    Stage 2 filters by page range (not by section_id) so that nested sub-sections sharing pages
    with their parent are all included -- line_df.section_id collapses when a page hosts multiple
    sections, but the page-range join recovers everything inside the picked scope.
    """
    selection = reason_on_toc(question, toc_df)
    relevant_ids = set(selection.section_ids)
    print(f"Stage 1 picked sections: {sorted(relevant_ids)}")
    print(f"  reasoning: {selection.reasoning[:200]}")
    relevant_pages = set()
    for sid in relevant_ids:
        rows = toc_df[toc_df["section_id"] == sid]
        for _, sec in rows.iterrows():
            relevant_pages.update(range(int(sec["start_page"]), int(sec["end_page"]) + 1))
    candidates = line_df[line_df["page_num"].isin(relevant_pages)].copy()
    candidates["score"] = candidates["text"].apply(
        lambda t: co_occurrence_score(t, primary_kw, secondary_kw)
    )
    top = candidates[candidates["score"] > 0].sort_values("score", ascending=False).head(top_n)
    return top

# A reader's question: where's the actual formula? Not the prose about attention.
result = reason_then_match(
    "Where do they define the actual formula for scaled dot-product attention?",
    primary_kw=primary,
    secondary_kw=secondary,
    line_df=line_df, toc_df=toc_df,
)
result[["page_num", "line_num", "section_id", "score", "text"]]

Run on the question “Where do they define the actual formula for scaled dot-product attention?”, with stage 1’s picked sections and stage 2’s top-scoring lines side by side:

The two-stage trace makes the result auditable end to end – Image by author
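The docstring’s point about filtering by page range rather than by section_id is easier to see on a toy example. The sketch below is illustrative only: the rows are invented, pages_for_sections is a hypothetical helper that mirrors the loop inside reason_then_match, and it assumes line_df labels every line on a shared page with the parent section.

```python
import pandas as pd

# Invented rows; only the column names follow the tables used in this article.
toc_df = pd.DataFrame([
    {"section_id": "3.2", "title": "Attention", "start_page": 3, "end_page": 5},
    {"section_id": "3.2.1", "title": "Scaled Dot-Product Attention", "start_page": 4, "end_page": 4},
])
# Page 4 hosts both 3.2 and 3.2.1, but every line carries the parent id "3.2".
line_df = pd.DataFrame([
    {"page_num": 3, "line_num": 1, "section_id": "3.2", "text": "intro line"},
    {"page_num": 4, "line_num": 1, "section_id": "3.2", "text": "formula line"},
    {"page_num": 4, "line_num": 2, "section_id": "3.2", "text": "scaling line"},
    {"page_num": 5, "line_num": 1, "section_id": "3.2", "text": "later line"},
])


def pages_for_sections(toc_df: pd.DataFrame, section_ids: set) -> set:
    """Expand the picked sections into the set of pages they cover."""
    pages = set()
    for sec in toc_df[toc_df["section_id"].isin(section_ids)].itertuples():
        pages.update(range(int(sec.start_page), int(sec.end_page) + 1))
    return pages


picked = {"3.2.1"}  # what stage 1 might return
by_id = line_df[line_df["section_id"].isin(picked)]
by_page = line_df[line_df["page_num"].isin(pages_for_sections(toc_df, picked))]
print(len(by_id), len(by_page))
```

Filtering on section_id finds nothing because no line carries the sub-section’s id, while the page-range join recovers both lines on page 4. The price is that lines from the parent section on the same page come along too, which the keyword scoring in stage 2 is expected to sort out.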

4.2 Section-weighted match

A line that matches the keyword and lives in a section whose title also matches is far more likely to be the answer than a line that matches the keyword in an unrelated section. This is the zero-LLM alternative when you want a cheap section-aware pre-score before the arbiter call. Both signals (title match and content match) are pure keyword operations. Quality is below the full arbiter pattern on hard cases but adequate on most easy ones, and you can use it as a pre-filter when the candidate pool is too big to send to the LLM in one go.

def section_weighted_match(question: str, line_df: pd.DataFrame, toc_df: pd.DataFrame,
                          primary_kw: list[str], secondary_kw: list[str],
                          boost: float = 1.5, top_n: int = 8):
    """Score lines by keyword co-occurrence, boosted when the line's section title also matches."""
    # Strip surrounding punctuation so "paper?" is matched as "paper"; keep words longer than 3 chars.
    words = [w.strip(".,;:!?\"'()") for w in question.lower().split()]
    title_keywords = [w for w in words if len(w) > 3]
    relevant_section_ids = set(match_titles(toc_df, title_keywords)["section_id"])
    scored = []
    for row in line_df.itertuples():
        line_score = co_occurrence_score(row.text, primary_kw, secondary_kw)
        if line_score == 0:
            continue  # no keyword signal: the section boost alone never promotes a line
        section_boost = boost if row.section_id in relevant_section_ids else 1.0
        scored.append((row.line_num, row.page_num, row.section_id, line_score * section_boost, row.text))
    columns = ["line_num", "page_num", "section_id", "score", "text"]
    return pd.DataFrame(scored, columns=columns).sort_values("score", ascending=False).head(top_n)

section_weighted_match(
    "How is attention actually computed in this paper?",
    line_df, toc_df,
    primary_kw=primary, secondary_kw=secondary,
)

Run on the same attention question, with a 1.5x boost when the line’s section title also matches:

Lines in on-topic sections get a 1.5x boost; cheap, deterministic – Image by author
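To see what the 1.5x boost does to a ranking, here is a toy sketch in plain Python. The hits, scores, and section ids are invented; Hit and apply_section_boost are hypothetical names, and on_topic_sections stands in for the set that match_titles would return.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    line_num: int
    section_id: str
    raw_score: float  # stand-in for the co-occurrence score


def apply_section_boost(hits, on_topic_sections, boost=1.5):
    """Multiply each raw score by `boost` when the hit's section is on-topic, then rank."""
    boosted = [
        (h, h.raw_score * (boost if h.section_id in on_topic_sections else 1.0))
        for h in hits
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


hits = [
    Hit(line_num=120, section_id="5.1", raw_score=4.0),   # strong match, off-topic section
    Hit(line_num=48, section_id="3.2.1", raw_score=3.0),  # weaker match, on-topic section
    Hit(line_num=51, section_id="3.2.1", raw_score=2.0),
]
ranked = apply_section_boost(hits, on_topic_sections={"3.2.1"})
for hit, score in ranked:
    print(hit.line_num, hit.section_id, score)
```

The boost only flips two lines when the ratio of their raw scores is below the boost factor: 4.0 against 3.0 flips (3.0 x 1.5 = 4.5), while 4.0 against 2.0 does not (2.0 x 1.5 = 3.0). That keeps the section signal a tie-breaker rather than an override.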

4.3 Hybrid embedding

For documents where embeddings are the appropriate base method (vocabulary mismatch, conceptual questions), the section-weighting principle still applies: embed-search pages, then boost pages in sections whose titles match the question. This recovers the structural awareness that pure embedding search lacks. Even when embeddings are the right base method, knowing which section a chunk belongs to and whether that section is on-topic improves precision substantially.

def hybrid_embedding(question: str, page_df: pd.DataFrame, line_df: pd.DataFrame,
                     toc_df: pd.DataFrame, top_k: int = 5, boost: float = 1.3):
    """Embed-search pages, boost those in sections whose title matches question vocabulary."""
    # Strip surrounding punctuation so "paper?" is matched as "paper"; keep words longer than 3 chars.
    words = [w.strip(".,;:!?\"'()") for w in question.lower().split()]
    title_keywords = [w for w in words if len(w) > 3]
    relevant_section_ids = set(match_titles(toc_df, title_keywords)["section_id"])
    # Map each page to its dominant section: the section_id carried by most of its lines.
    page_to_section = (
        line_df.dropna(subset=["section_id"])
        .groupby("page_num")["section_id"]
        .agg(lambda s: s.value_counts().index[0])
        .to_dict()
    )
    # Over-fetch (2x top_k) so that boosted pages can climb into the final top_k.
    # Note: `client` is not a parameter; it is assumed to be the embedding client defined at module level.
    retrieved, _ = retrieve_pages_by_similarity(page_df, line_df, question, top_k=top_k * 2, client=client)
    scored = retrieved.copy()
    scored["section_id"] = scored["page_num"].map(page_to_section)
    scored["boosted_sim"] = scored.apply(
        lambda r: r["similarity"] * (boost if r["section_id"] in relevant_section_ids else 1.0),
        axis=1,
    )
    return scored.sort_values("boosted_sim", ascending=False).head(top_k)[["page_num", "section_id", "similarity", "boosted_sim", "text"]]

hybrid_embedding("How is attention actually computed in this paper?", page_df, line_df, toc_df, top_k=5)

Run on the same attention question, with the section-title boost applied to the embedding cosine score:

Naive top-k plus a 1.3x boost when the page lives in an on-topic section – Image by author
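Two steps in hybrid_embedding carry most of the logic: mapping each page to its dominant section, and multiplying the cosine score by the boost. The sketch below replays both in plain Python on invented data; dominant_section_per_page and boost_pages are hypothetical helpers, and the similarity values are made up rather than produced by an embedding model.

```python
from collections import Counter, defaultdict


def dominant_section_per_page(lines):
    """Map each page to the section_id that labels most of its lines."""
    counts = defaultdict(Counter)
    for page_num, section_id in lines:
        if section_id is not None:  # mirrors dropna(subset=["section_id"])
            counts[page_num][section_id] += 1
    return {page: c.most_common(1)[0][0] for page, c in counts.items()}


def boost_pages(similarities, page_to_section, on_topic, boost=1.3):
    """Rank pages by similarity, multiplied by `boost` when the page's section is on-topic."""
    boosted = {
        page: sim * (boost if page_to_section.get(page) in on_topic else 1.0)
        for page, sim in similarities.items()
    }
    return sorted(boosted.items(), key=lambda kv: kv[1], reverse=True)


# Invented (page_num, section_id) pairs: page 4 is mostly 3.2.1, page 9 is all 5.1.
lines = [(4, "3.2"), (4, "3.2.1"), (4, "3.2.1"), (9, "5.1"), (9, "5.1"), (9, None)]
similarities = {9: 0.80, 4: 0.70}  # invented cosine scores

page_to_section = dominant_section_per_page(lines)
ranked = boost_pages(similarities, page_to_section, on_topic={"3.2.1"})
print(page_to_section)
print(ranked)
```

Page 9 wins on raw similarity, but page 4 overtakes it once the boost applies (0.70 x 1.3 is about 0.91). The majority vote is a simplification: a page split evenly between two sections gets only one label, so a finer-grained unit than the page may be worth considering when sections are short.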

4.4 Why combinations win

Single detectors miss too much on their own:

  • Title keyword match misses when titles do not share vocabulary with the question.
  • Content keyword match misses when the same word appears in many sections, only one of which is relevant.
  • Chunk embedding match misses when the answer is in a specific section that embeddings cannot isolate.

The combinations above add cross-table signal to each candidate (its section, the title’s keyword overlap, an embedding boost). The arbiter that Article 7C (the LLM arbiter) develops is what unifies them: it sees keyword hits, embedding hits, structural attachment, and ranks once with reasons. The combinations make the arbiter’s job easier; they don’t replace it.

The LLM can see that a line with a high co-occurrence score, “the premium for the optional add-on is 200€”, is less relevant than a lower-scoring line, “l’assurance comprend une prime annuelle de 4 500 €” (“the insurance includes an annual premium of €4,500”), because the question was about the main premium.

Stripped of the marketing label (“agentic RAG”), the idea is simple: use a small LLM call to make a judgment after the deterministic methods have done their work. We develop the combination logic in a follow-up integrated pipeline.

A note on cross-encoder rerankers (Cohere Rerank, bge-reranker, monoT5, and the broad family that re-scores top-k candidates with a learned relevance model). The position of the series is that rerankers are a remediation for weak upstream retrieval, not a default stage of strong retrieval. When the upstream stage already incorporates expert keywords, TOC reasoning, metadata filters, and structural scope selection (the work this article is about), the right passage already sits near the top, and a cross-encoder gives a small lift at a real latency and inference cost. When the upstream is generic cosine similarity over an undifferentiated vector store, a reranker closes a large gap, but the right architectural response is to harden the upstream stage so the gap doesn’t open in the first place. Reranking remains useful in narrow cases: latency-tolerant scenarios, and pipelines where the upstream is cheap to compute and hard to improve. Several public benchmarks (BEIR, MTEB) show large gains from reranking on top of vector-only retrieval; on hybrid retrieval that already incorporates BM25 and metadata filters, the marginal gain is much smaller. That difference in marginal gain is the basis of the editorial position.

The four methods, side by side: Same question, four retrievers. Watch each single method fall short, and the combination win.

Four methods side by side; reason-then-match wins on the formula question – Image by author

Detect-then-extract: The canonical “anchor on a TOC title, extract from the section body” combination. The anchor is a toc_df row found by keyword on the title; the context is the full body of that section pulled from line_df. This is the function Article 7A referenced when introducing the two-scope framing.

def detect_then_extract(toc_df: pd.DataFrame, line_df: pd.DataFrame, keywords: list[str]):
    """Anchor on toc_df titles, then extract the section body by page range from line_df."""
    matched = match_titles(toc_df, keywords)
    if matched.empty:
        return None, None
    section = matched.iloc[0]
    body = line_df[
        (line_df["page_num"] >= section["start_page"])
        & (line_df["page_num"] <= section["end_page"])
    ]
    return section, body

# A user asks "where do they describe training?" -- the anchor is one short title, the context is the entire training section body.
section, body = detect_then_extract(toc_df, line_df, ["Training"])
print("=== ANCHOR (toc_df, scope: title, a few words) ===")
print(f"  matched: section_id={section['section_id']}  title='{section['title']}'")
print(f"  page range: {section['start_page']} to {section['end_page']}")
print()
print(f"=== CONTEXT EXTRACTION (line_df, scope: full section body, {len(body)} lines) ===")
for _, ln in body.head(8).iterrows():
    print(f"  p{ln['page_num']:>2} l{ln['line_num']:>3}: {ln['text'][:90]}")

Run with the keyword Training against the Attention paper’s TOC: the matched TOC row is one line, the extracted section body is dozens of lines:

Two scopes in one run: anchor is one TOC row, context is the section body – Image by author

5. Conclusion

Anchor detection runs in three stages, with one LLM call at the end.

  • Stage 1. Detection (parallel). Keyword detection on line_df and toc_df always runs (it is free and auditable). Embeddings run in parallel as an optional second signal, useful for vocabulary mismatch and conceptual questions.
  • Stage 2. Aggregate. Hits are grouped into a structural unit (section via toc_df if available, otherwise page or chunk).
  • Stage 3. One LLM call. The arbiter sees the TOC, the keyword hits, the embedding hits, and the structural attachment, all in a single call. It does the TOC reasoning and the final ranking together.
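Stage 2 is the least visible of the three, so a minimal sketch may help. It assumes each detector returns a DataFrame with a page_num column and that toc_df carries start_page and end_page; aggregate_hits is a hypothetical name, the rows are invented, and the toy TOC has no nested sections (with nesting, a hit would attach to every section whose range covers its page).

```python
import pandas as pd


def aggregate_hits(keyword_hits: pd.DataFrame, embedding_hits: pd.DataFrame,
                   toc_df: pd.DataFrame) -> pd.DataFrame:
    """Group detector hits into one row per section, recording which detectors fired."""
    hits = pd.concat(
        [keyword_hits.assign(detector="keyword"),
         embedding_hits.assign(detector="embedding")],
        ignore_index=True,
    )
    # Attach each hit to the section(s) whose page range contains it.
    joined = hits.merge(toc_df, how="cross")
    joined = joined[joined["page_num"].between(joined["start_page"], joined["end_page"])]
    return (
        joined.groupby(["section_id", "title"], as_index=False)
        .agg(
            n_hits=("page_num", "count"),
            detectors=("detector", lambda s: "+".join(sorted(set(s)))),
        )
        .sort_values("n_hits", ascending=False)
        .reset_index(drop=True)
    )


# Invented rows for illustration.
toc_df = pd.DataFrame([
    {"section_id": "3.2.1", "title": "Scaled Dot-Product Attention", "start_page": 4, "end_page": 4},
    {"section_id": "5.1", "title": "Training Data and Batching", "start_page": 7, "end_page": 8},
])
keyword_hits = pd.DataFrame({"page_num": [4, 4, 7], "score": [3.0, 2.0, 1.0]})
embedding_hits = pd.DataFrame({"page_num": [4], "score": [0.82]})

units = aggregate_hits(keyword_hits, embedding_hits, toc_df)
print(units)
```

The sketch counts hits and records which detectors agree instead of summing scores, since co-occurrence scores and cosine similarities are not on the same scale. A section flagged by both detectors is the kind of candidate the arbiter in stage 3 would see first.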

Three composition patterns (reason-then-match, section-weighted match, hybrid embedding) cross-pollinate line_df and toc_df before the arbiter sees the candidates, making its job easier on large or hard documents.

What the LLM at the end actually does with those candidates, what it returns, and how that becomes a defensible JSON for generation is the subject of Article 7C (the LLM arbiter and the retrieval output JSON): the structured brief handed to the arbiter, the per-candidate roles (primary / supporting / tangential / discarded) with reasons, the audit trail, the dispatcher that picks which detectors to run per question, the “not found” path, and the unified RetrievalResult contract.

This article is part of the Enterprise Document Intelligence series. The minimal RAG pipeline shows anchor detection in use end-to-end on a real PDF.

6. Sources and further reading

Anchor detection combines keyword scoring (always on, auditable) with optional embedding similarity, handing both to an LLM arbiter at the end. The references below cover the detectors.

Same direction as the article:

  • Robertson & Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, FnT IR 2009. Canonical BM25 reference; the article’s claim that BM25 measures frequency where business retrieval needs co-occurrence rests on this.
  • Thakur et al., BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models, NeurIPS 2021 (arXiv:2104.08663). Empirical support for the article’s claim that BM25 is a strong baseline that dense retrievers don’t always beat.
  • Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering (DPR), EMNLP 2020 (arXiv:2004.04906). Dense retrieval as the production default; useful contrast with this article’s keyword-first stance on enterprise corpora.
