The Untaught Lessons of RAG Question Parsing: Structure Before You Search

Contents

The naive baseline this article pushes back on
Lesson 1 – A relational schema, symmetric to the document side
Lesson 2 – A schema, not branching code
Lesson 3 – Two briefs, one per downstream brick
Lesson 4 – The expert dictionary that beats embeddings
Lesson 5 – Four compound-question patterns, none silent
Lesson 6 – Deterministic dispatcher, not LLM-decides
Across sectors and professions
Where these lessons land in the series
Sources and further reading

This article is a companion to Enterprise Document Intelligence, the series whose philosophy is laid out in Amplify the Expert.

It zooms in on brick 2 (question parsing) of the four-brick architecture and surfaces the lessons most tutorials skip.

Most RAG tutorials skip question parsing. The user’s string goes straight to retrieval, cosine similarity picks the top-k chunks, and the model gets handed whatever came back. We do not do that, for one reason: a user question is not a search query. Treat it as one and you get silent partial answers, and in production that is where a lot of RAG quietly breaks.

Where this article sits in the series: brick 2 (question parsing) highlighted – Image by author

📓 Runnable companion notebooks are on GitHub: doc-intel/notebooks-vol1.

*The public companion-code repo at doc-intel/notebooks-vol1 – Image by author*

The naive baseline this article pushes back on

*The architectural contrast: a string sent verbatim vs a typed question_df row with two briefs – Image by author*

The naive pipeline embeds the user string and asks the vector store for the top-k most similar chunks. Nothing in that setup knows the question had two parts, or that the user wanted an exact value and not a paragraph. So we spend one extra brick on the question itself: a row in question_df with five typed columns (keywords, scope, shape, decomposition, clarification) plus satellite tables, and two derived briefs (RetrievalQuery for the retrieval brick, GenerationBrief for the generation brick).

*From a user string to a typed row with five columns, then to two briefs each downstream brick can act on – Image by author*

The anatomy diagram shows the five core columns, but a production question_df carries two more that decide how wide a window retrieval will pass to generation. The context discipline is measured in lines (not characters, too noisy; not pages, too coarse). The table below shows three sample rows: one factual lookup, one yes/no boolean, one listing question. Each row sizes its context window differently, by reading the answer shape and the decomposition pattern.

*One row per ask, seven typed columns, the context window sized in lines around the detected anchor – Image by author*

The two emerald columns are the cheap discipline most pipelines never write down. A factual lookup (premium amount, effective date, deductible value) needs almost no surrounding context: one or two lines before the anchor for verifiability, a few after for the punctuation tail. A listing question needs zero context before and a long forward window because the list extends down the section. The parser fills lines_before_anchor and lines_after_anchor from the parsed shape and decomposition; retrieval respects them; no magic top-k cutoff travels through the pipeline.
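The window sizing described above can be sketched in a few lines. This is a minimal illustration, not the series' code: the sizes in `WINDOW_BY_SHAPE`, the `ContextWindow` class, and the helpers `size_window` and `slice_lines` are all assumptions made for the example, and the sketch reads only the answer shape (the decomposition pattern is left out for brevity).

```python
from dataclasses import dataclass

# Hypothetical window sizes, in lines. The article gives no exact numbers;
# these only mirror the qualitative rule (lookups small, listings forward-heavy).
WINDOW_BY_SHAPE = {
    "amount": (2, 3),   # factual lookup: a little context on each side
    "boolean": (2, 5),  # yes/no: assumed slightly longer forward window
    "list": (0, 40),    # listing: nothing before, long forward window
}
DEFAULT_WINDOW = (2, 5)  # assumed fallback for shapes not listed above


@dataclass(frozen=True)
class ContextWindow:
    lines_before_anchor: int
    lines_after_anchor: int


def size_window(shape: str) -> ContextWindow:
    """Pick the window from the parsed answer shape."""
    before, after = WINDOW_BY_SHAPE.get(shape, DEFAULT_WINDOW)
    return ContextWindow(before, after)


def slice_lines(lines: list[str], anchor: int, window: ContextWindow) -> list[str]:
    """Return the lines around the anchor, clipped to the document bounds."""
    start = max(0, anchor - window.lines_before_anchor)
    stop = min(len(lines), anchor + window.lines_after_anchor + 1)
    return lines[start:stop]
```

With this shape, retrieval never needs a global top-k: it calls `slice_lines(document_lines, anchor, size_window(shape))` and passes exactly that span to generation.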

Below are the six untaught lessons that hold the brick together.

Lesson 1 – A relational schema, symmetric to the document side

The literature has “query understanding” and “query rewriting”, but both treat the question as a string turned into another string. Modeling it as a row in question_df plus satellite tables is not how people usually frame it. What makes it click is the symmetry with the document side (line_df, toc_df, span_df): both sides are relational, both join, and retrieval becomes a filter across them.

Why it matters. Most production pipelines store the question as a single string inside the LLM prompt template. There is no notion of “the question has a shape”, “the question has a scope”, “the question has a decomposition”. When the team needs a new capability (handle negation, handle compound questions, handle ranges), the only place to add it is the prompt template. Six months in, the prompt carries sixty lines of special-case clauses, none of which the audit can trace. Structuring the question once at the parser boundary, the way parsing structures the document at its boundary, removes that rot at its source.

Concrete contrast. The user asks “What is the premium amount and the renewal deadline?”. The naive baseline embeds that string and ranks chunks. The series fills one row of question_df: keywords ["premium", "amount", "renewal", "deadline"], scope "contract", shape (Amount, Date), decomposition "independent" (two sub-questions). Now retrieval has a row to filter line_df against, and generation has a typed shape to fill.

A second example. A legal counsel asks “Does the indemnification clause survive termination, and if so, for how long?”. The naive way passes the whole string to the LLM; the answer often comes back with a yes-or-no on the survival and silently skips the duration. The series fills question_df with shape (Boolean, Duration), decomposition "conditional" (the duration is only meaningful if survival is True), and the downstream bricks know exactly which sub-question is gated by which.
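To make "retrieval becomes a filter" concrete, here is a minimal sketch in plain Python. The `QuestionRow` fields follow the four columns named in the premium example; everything else is assumed for illustration: the list of dicts standing in for line_df, its column names (`doc_type`, `line_no`, `text`), the sample lines, and the `filter_lines` helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionRow:
    keywords: tuple[str, ...]
    scope: str
    shape: tuple[str, ...]
    decomposition: str


# Stand-in for line_df: one dict per document line.
# Column names and sample text are invented for this sketch.
line_df = [
    {"doc_type": "contract", "line_no": 12, "text": "Annual premium amount: $1,488."},
    {"doc_type": "contract", "line_no": 40, "text": "Renewal deadline: 30 days before expiry."},
    {"doc_type": "brochure", "line_no": 3, "text": "Our premium service renews every year."},
]


def filter_lines(question: QuestionRow, lines: list[dict]) -> list[dict]:
    """Keep the lines that are in scope and mention at least one keyword."""
    hits = []
    for row in lines:
        text = row["text"].lower()
        in_scope = row["doc_type"] == question.scope
        if in_scope and any(keyword in text for keyword in question.keywords):
            hits.append(row)
    return hits


question = QuestionRow(
    keywords=("premium", "amount", "renewal", "deadline"),
    scope="contract",
    shape=("Amount", "Date"),
    decomposition="independent",
)
matches = filter_lines(question, line_df)
```

The brochure line mentions "premium" but falls outside the scope, so the filter drops it. A bare similarity search over the raw string has no column to make that distinction with.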

→ Article 6A: Parse the question before you search walks through the whole parser end to end.

Lesson 2 – A schema, not branching code

Most RAG codebases grow the question-handling logic as branching code, gated by if intent == "..." chains that ossify over months. We grow the brick as a schema instead: a new capability is a column added to question_df, edited by the expert, not a new code path. The cost of a new feature stays linear in the number of columns, not quadratic in branch combinations.

Concrete contrast. Add “negation handling” to the brick. Naive way: new branch in the prompt assembly code, plus tests, plus an integration test for the regression. Series way: add a negation_present column (boolean), add a row in the dictionary of negation tokens, document downstream behaviour, and the dispatcher reads that column where it needs to.
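A minimal sketch of the "column, not branch" idea, under stated assumptions: the token list in `NEGATION_TOKENS`, the `COLUMN_EXTRACTORS` registry, and the helper names are hypothetical and only illustrate how a new capability can arrive as one more entry in a table and not as a new code path.

```python
import re
from typing import Callable

# Hypothetical negation dictionary; in the article's framing the expert edits it.
NEGATION_TOKENS = frozenset({"not", "no", "never", "without", "except", "excluding"})


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; deliberately simple for the sketch."""
    return re.findall(r"[a-z']+", text.lower())


def extract_negation(text: str) -> bool:
    """True when any token of the question is in the negation dictionary."""
    return any(token in NEGATION_TOKENS for token in tokenize(text))


# One entry per column. Adding a capability means adding a line here,
# not threading a new branch through the prompt-assembly code.
COLUMN_EXTRACTORS: dict[str, Callable[[str], object]] = {
    "negation_present": extract_negation,
}


def parse_question(text: str) -> dict:
    """Build one question row by running every registered column extractor."""
    row: dict = {"text": text}
    for column, extractor in COLUMN_EXTRACTORS.items():
        row[column] = extractor(text)
    return row
```

Calling `parse_question("What is the premium in dollars, not euros?")` yields a row whose `negation_present` column is `True`; downstream bricks read the column where they need it and ignore it otherwise.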

→ Article 6B: Five fields RAG should extract from any question builds the five columns one by one.

Lesson 3 – Two briefs, one per downstream brick

The default is one prompt that carries everything, where retrieval has to ignore the generation-only fields and generation has to re-parse the retrieval fields. We split them: the retrieval brick receives only what it can act on (keywords, scope, structural hints), and the generation brick receives only what it needs (intent, output shape, exclusions). Each downstream brick reads a brief sized to its job, not the whole question.

Concrete contrast. For “What is the premium amount in dollars, not euros?” the retrieval brief is keywords ["premium", "amount"] plus scope "contract". The generation brief is shape "Amount(value, currency='USD')" plus exclusions ["EUR"]. Retrieval does not need to know about the currency exclusion; generation does not need to re-extract the keywords.
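The split can be sketched with two small dataclasses. The class names RetrievalQuery and GenerationBrief come from the article; the exact fields, the dict used as a parsed row, and the `derive_briefs` helper are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalQuery:
    """What the retrieval brick can act on (fields assumed for the sketch)."""
    keywords: tuple[str, ...]
    scope: str


@dataclass(frozen=True)
class GenerationBrief:
    """What the generation brick needs (fields assumed for the sketch)."""
    shape: str
    exclusions: tuple[str, ...]


def derive_briefs(row: dict) -> tuple[RetrievalQuery, GenerationBrief]:
    """Project one parsed question row onto the two downstream briefs."""
    retrieval = RetrievalQuery(keywords=tuple(row["keywords"]), scope=row["scope"])
    generation = GenerationBrief(shape=row["shape"], exclusions=tuple(row["exclusions"]))
    return retrieval, generation


# Parsed row for "What is the premium amount in dollars, not euros?"
row = {
    "keywords": ["premium", "amount"],
    "scope": "contract",
    "shape": "Amount(value, currency='USD')",
    "exclusions": ["EUR"],
}
retrieval_brief, generation_brief = derive_briefs(row)
```

Because each brief is its own type, the separation is enforced by construction: `retrieval_brief` has no `exclusions` attribute to misuse, and `generation_brief` carries no keywords to re-parse.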

→ Article 6A splits the question into two briefs, and Article 6B extracts the columns.

Lesson 4 – The expert dictionary that beats embeddings

The standard story sells embeddings as the way to handle synonyms: a user types “premium”, the model “knows” it relates to “monthly contribution”. In practice concept_keywords_df maps the user’s word to the document’s word before any search, for a fraction of the cost and none of the drift. The expert maintains the dictionary as a wiki; the embedding model has no opinion on which alias is canonical in your corpus.

Concrete contrast. User types “How much do I pay each month?”. Naive baseline embeds it, cosine returns generic “payment” pages. Series checks concept_keywords_df first: "pay each month" maps to ["premium", "monthly contribution", "monthly installment"] for this insurance corpus. Retrieval runs keyword search on those three terms; the actual line (“premium of $124 / month”) lights up immediately.
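A minimal sketch of the dictionary lookup, assuming a plain dict as a stand-in for concept_keywords_df. The mapping is the one quoted above; the two sample lines and the helpers `expand_keywords` and `keyword_search` are invented for the illustration.

```python
# Stand-in for concept_keywords_df: user phrase -> corpus terms.
concept_keywords = {
    "pay each month": ["premium", "monthly contribution", "monthly installment"],
}

# Two invented document lines for the demonstration.
lines = [
    "Payment methods accepted: card, transfer.",
    "Premium of $124 / month, payable by direct debit.",
]


def expand_keywords(question: str, dictionary: dict[str, list[str]]) -> list[str]:
    """Map the user's phrasing to the document's vocabulary before any search."""
    text = question.lower()
    terms: list[str] = []
    for phrase, corpus_terms in dictionary.items():
        if phrase in text:
            for term in corpus_terms:
                if term not in terms:  # keep order, drop duplicates
                    terms.append(term)
    return terms


def keyword_search(candidates: list[str], terms: list[str]) -> list[str]:
    """Return the lines that contain at least one expanded term."""
    return [line for line in candidates if any(t in line.lower() for t in terms)]


terms = expand_keywords("How much do I pay each month?", concept_keywords)
hits = keyword_search(lines, terms)
```

The generic "payment methods" line never matches, because none of the three corpus terms appears in it; only the premium line comes back.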

→ Article 6B: Five fields RAG should extract from any question explains the concept_keywords_df mechanism.

Lesson 5 – Four compound-question patterns, none silent

A two-part question (“amount and deadline”) is typically answered for one part and silently dropped for the other. The series names the four patterns (independent, sequential, unified, conditional) and forces the parser to mark which one applies. The pipeline then decomposes (and runs the parts in parallel), chains (and feeds part A into part B), or refuses to answer the half it could not cover. No silent partial answer.

Concrete contrast. User asks “What is the deductible if the claim exceeds the cap, and what is the cap?”, a sequential compound. Naive RAG sends both as one string; the LLM answers about the cap and forgets the deductible-conditional clause. Series sees decomposition = "sequential", parses out part A (cap?) and part B (deductible if claim > cap?), runs them in order, and ships an answer for each with its own citation, or marks one as not-found if it really is.
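One way to make the "none silent" rule mechanical is to turn the decomposition label into an explicit plan and to report every step, answered or not. The sketch below is an assumption-heavy illustration: the `plan` and `collect` helpers, the step dictionaries, and the reading of "unified" as "keep the parts together as one ask" are not taken from the series.

```python
from enum import Enum


class Decomposition(Enum):
    INDEPENDENT = "independent"
    SEQUENTIAL = "sequential"
    UNIFIED = "unified"
    CONDITIONAL = "conditional"


def plan(decomposition: str, parts: list[str]) -> list[dict]:
    """Turn a decomposition label into ordered steps with explicit dependencies."""
    pattern = Decomposition(decomposition)  # unknown label -> ValueError, never silent
    if pattern is Decomposition.UNIFIED:
        # Assumed meaning: the parts only make sense together, so keep one ask.
        return [{"ask": " ".join(parts), "depends_on": None, "gated": False}]
    if pattern is Decomposition.INDEPENDENT:
        return [{"ask": part, "depends_on": None, "gated": False} for part in parts]
    # Sequential and conditional: each step depends on the previous one.
    # Conditional steps are additionally gated on the previous answer.
    gated = pattern is Decomposition.CONDITIONAL
    return [
        {"ask": part, "depends_on": index - 1 if index > 0 else None,
         "gated": gated and index > 0}
        for index, part in enumerate(parts)
    ]


def collect(steps: list[dict], answers: dict[str, str]) -> list[dict]:
    """Report every step; a missing answer is marked, not dropped."""
    return [
        {"ask": step["ask"], "answer": answers.get(step["ask"], "not_found")}
        for step in steps
    ]


steps = plan("sequential", [
    "What is the cap?",
    "What is the deductible if the claim exceeds the cap?",
])
report = collect(steps, {"What is the cap?": "answer for part A"})
```

Here part B has no answer, so the report carries an explicit `"not_found"` entry for it. The caller sees two rows for a two-part question every time.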

→ Article 6B: Five fields RAG should extract from any question lays out the four compound patterns.

Lesson 6 – Deterministic dispatcher, not LLM-decides

The agentic reflex says: let the LLM pick which retrievers, schemas, and prompt fragments to activate per call. We catalogue three approaches: user-explicit (the form drives the activations), deterministic-dispatcher (rules in code map question features to activations), and LLM-decides (the model plans itself). The first two stay. We drop the third for enterprise, because a system that re-plans itself every call cannot be audited the same way twice.

Concrete contrast. The same compliance question runs twice through the system. With deterministic-dispatcher, the audit log shows the same dispatch path both times: decide.py line 47 fired, route = "factual_lookup", retrieval methods ["keyword", "toc"] activated, generation schema AmountWithEvidence. With LLM-decides, the audit log shows two different reasoning traces, and you cannot guarantee the same routing tomorrow. The first is auditable. The second is not.
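A deterministic dispatcher can be as small as an ordered rule table where the first match wins. The sketch below reuses the route, retrieval methods, and schema name from the example above; the rule ids, the listing rule, its `ItemListWithEvidence` schema, and the refusal fallback are hypothetical additions for illustration, not the contents of decide.py.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    rule_id: str
    matches: Callable[[dict], bool]
    route: str
    retrieval_methods: tuple[str, ...]
    generation_schema: str


# Ordered rule table: evaluation order is fixed, so the outcome is reproducible.
RULES = (
    Rule("R1", lambda q: q.get("shape") == "list",
         "listing", ("toc",), "ItemListWithEvidence"),
    Rule("R2", lambda q: q.get("shape") == "amount",
         "factual_lookup", ("keyword", "toc"), "AmountWithEvidence"),
)


def dispatch(question: dict) -> dict:
    """Return the routing decision plus the id of the rule that fired."""
    for rule in RULES:
        if rule.matches(question):
            return {
                "rule_id": rule.rule_id,
                "route": rule.route,
                "retrieval_methods": list(rule.retrieval_methods),
                "generation_schema": rule.generation_schema,
            }
    # Assumed fallback: no rule matched, so refuse explicitly.
    return {"rule_id": "fallback", "route": "refuse",
            "retrieval_methods": [], "generation_schema": "Refusal"}


question = {"shape": "amount", "scope": "contract"}
first_run = dispatch(question)
second_run = dispatch(question)
```

Both runs return the same dictionary, including the `rule_id`, which is the piece an audit log would record to show why a route was taken.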

→ Article 6C: One parsed RAG question, four decisions covers the dispatcher pattern.

The six lessons share one move: take a step the mainstream playbook treats as inline string processing, and make it a typed brick instead. Once the question is a row with columns, the rest of the pipeline gets to filter, type-check, and dispatch in ways a flat string never could. The deep-dives (6A, 6B, 6C, 6bis) ship runnable code on real corpora; this piece is the catalogue that points at them.

A note on intent detection. Vol.1 stays minimal on intents: the dispatcher recognises a baseline set (factual lookup, listing, quick summary read from parsing_summary.summary, deep summary from TOC + first lines, cross-reference resolution, out-of-corpus refusal), enough to dispatch the most common enterprise PDF questions correctly. The full intent taxonomy lands in Volume 2 (translation, summarisation across documents, comparison, redaction, proofreading), where the intent × format matrix produces dozens of dispatch paths on top of the four-brick spine. Vol.1 keeps the spine clean; Vol.2 builds the matrix.

Across sectors and professions

The brick treats every domain the same way: extract typed columns from the question, derive the two briefs. The expert dictionary inside concept_keywords_df is sector-specific; the schema and the dispatch logic are universal. Five sectors below, one parsing pattern, the same five columns.

*The brick treats every domain the same way; only the expert dictionary changes – Image by author*

What changes from row to row is the expert dictionary. An insurance broker’s concept_keywords_df maps “pay each month” to ["premium", "monthly contribution", "monthly installment"]; the medical equivalent maps “blood thinner” to ["anticoagulant", "warfarin", "heparin", "DOAC"]; the financial equivalent maps “top line” to ["revenue", "net revenue", "GAAP revenue"]. The brick’s columns, dispatch, and audit trail stay identical.

Where these lessons land in the series

The numbered articles develop each lesson in code, with runnable notebooks:

Article 6A (question parsing: thesis) makes the case that a string is not a query and shows the relational shape.
Article 6B (question parsing: extraction) walks the five families of columns (keywords, scope, shape, decomposition, clarification) that fill question_df.
Article 6C (question parsing: dispatch) develops the dispatcher that turns a parsed question into routing decisions.
Article 6bis (clarification loop) handles the case where the question is too vague to route and the system asks one focused clarification.

Sources and further reading

The book and article literature on query understanding is shaped by consumer search (Elastic, Google) and does not transfer cleanly to a small enterprise corpus where the expert vocabulary is the asset. The series instead rebuilds question parsing as a relational shape on top of the structured document side.

Parse the question before you search (Article 6A). The published thesis of question parsing.
Five fields RAG should extract from any question (Article 6B). The column-by-column extraction in code.
One parsed RAG question, four decisions (Article 6C). The dispatcher pattern that turns the parsed columns into routing decisions.
When RAG users ask vague questions (Article 6bis). The clarification loop that learns the default after one ask.

Earlier in the series: