From Regex to Vision Models: Which RAG Technique Fits Which Problem

Contents

1. Two axes: document complexity and question control
1.1 Document axis: from a fixed template to a vision model
1.2 Question axis: from a fixed prompt to a multi-turn chatbot
1.3 From case to technique zone
2. The techniques per case, and what isn’t a technique
2.1 Pick the simplest technique that works
2.2 Long context isn’t a way out
2.3 Fancy techniques are usually keyword work in disguise
2.4 Letting the LLM pick the case
3. Locate your case, in practice
3.1 Position the system around the expert who exists
3.2 The diagnostic questions
3.3 Common enterprise cases on the grid
4. Conclusion
5. Sources and further reading

Most RAG problems don’t deserve the classic playbook. Article 3 said there is no THE RAG technique. You still have to pick one. This article is the diagnostic that tells you which.

Most teams building RAG systems reach for the same playbook: parse the document into chunks, embed every chunk, drop them in a vector store, embed the question, retrieve the top-k by cosine similarity, hand the result to an LLM. Call it the classic RAG playbook. Every tutorial teaches it. Every demo runs on it.
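The retrieval step of that playbook fits in a few lines. The sketch below is a toy illustration, not production code: the three-dimensional vectors and the chunk names are made up, and a real system would get its vectors from an embedding model and store them in a vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(question_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the question."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine(question_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (hypothetical values, for illustration only).
chunk_vecs = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
question_vec = [0.9, 0.1, 0.0]

retrieved = top_k(question_vec, chunk_vecs, k=2)  # chunks handed to the LLM
```

Everything in the playbook hangs on that ranking: whatever does not make the top-k never reaches the LLM.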

The actual problems vary much more than the playbook suggests. Three real cases, at three different extremes.

Templated, high-volume documents. Insurance certificates, KYC forms, regulatory filings, monthly brokerage statements. The same software writes the same layout on every document. A hundred lines of regex extract the fields in microseconds. The classic playbook runs here too but it pays an LLM to do what the layout gave you for free.

Same shape across industries: payroll stubs, bank statements, lab test reports, tax filings, compliance attestations, supplier invoices from one ERP. Wherever one piece of software writes every document, the layout is a contract.
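What "the layout is a contract" looks like in code, as a minimal sketch. The certificate text, the field labels, and the value formats below are assumptions invented for the example; a real template would dictate its own patterns.

```python
import re

# Hypothetical certificate layout (labels and formats are assumptions).
CERTIFICATE_TEXT = """\
CERTIFICATE OF INSURANCE
Policy Number: PN-2024-001234
Effective Date: 2024-03-01
Insured: Acme Logistics SARL
"""

# One anchored pattern per field; MULTILINE lets ^ and $ match at line boundaries.
FIELD_PATTERNS = {
    "policy_number": re.compile(r"^Policy Number:\s*(PN-\d{4}-\d{6})\s*$", re.MULTILINE),
    "effective_date": re.compile(r"^Effective Date:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE),
    "insured": re.compile(r"^Insured:\s*(.+?)\s*$", re.MULTILINE),
}

def extract_fields(text):
    """Return {field: value}; a None value signals that the template drifted."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result

fields = extract_fields(CERTIFICATE_TEXT)
```

A `None` in the result is useful in itself: it tells you the document did not honour the contract, which is the signal to route it elsewhere.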

Sarcasm in customer-service transcripts. “Find every sarcastic remark in this month’s call recordings.” Standard sentiment scoring (anger, frustration, joy) is largely solved by a sentiment lexicon: unacceptable, ridiculous, frustrated all flag clearly. Sarcasm is the canonical exception. “Oh, fantastic service, only had to wait 45 minutes” scores positive on every lexicon, and the embedding clusters it with the sincere version because the surface words are nearly the same. The only honest method is an LLM that reads each call in full and judges the gap between what is said and what is meant.

Same shape across functions: HR exit interviews looking for hidden frustration, internal-chat archives looking for cultural red flags before an M&A close, earnings-call transcripts looking for places the CFO hedged, sales-call recordings looking for promises the contract did not authorise. Tone and intent, no anchor in the text.
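The lexicon failure is easy to reproduce. The sketch below uses a six-word lexicon with made-up weights, far smaller than any real sentiment lexicon, and only shows the mechanism: the sincere and the sarcastic sentence get the same score because they share the same scored word.

```python
import re

# Tiny illustrative lexicon (hypothetical weights).
LEXICON = {
    "fantastic": 1, "great": 1, "helpful": 1,
    "unacceptable": -1, "ridiculous": -1, "frustrated": -1,
}

def lexicon_score(utterance):
    """Sum the lexicon weights of the words found in the utterance."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return sum(LEXICON.get(word, 0) for word in words)

sincere = "Fantastic service, thank you for the quick answer."
sarcastic = "Oh, fantastic service, only had to wait 45 minutes."
angry = "This is unacceptable, I am frustrated."

scores = {
    "sincere": lexicon_score(sincere),
    "sarcastic": lexicon_score(sarcastic),
    "angry": lexicon_score(angry),
}
```

The angry call is flagged; the sarcastic one is indistinguishable from the sincere one. No amount of lexicon tuning fixes that, since the signal is not in the words.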

Engineering schematics (a different axis altogether). Drawings, slides where data lives in the chart, technical specs with embedded images. Pure-text RAG returns the caption and misses the schematic. Vision models fit here, and only here.

Same shape: architectural blueprints, scanned handwritten records, slide decks where data lives in the chart, lab notebook pages, medical imaging reports. Wherever the meaning lives in the pixels.

The classic playbook is overkill on templated documents (regex would do), dimensionally wrong on call transcripts (no anchor exists), and modality-blind on schematics (vision is required). It fits a middle band of problems and ships as if it covered everything. That middle band is real and Section 3.3 catalogues it; the cost of mismatch on the rest is what this article exists to prevent.

This article is the diagnostic. Three steps, in order.

Identify the two axes: RAG problems aren’t a single problem. They sit on a picture with two axes: how structured your documents are, and how controlled your questions are. Each combination calls for a different stack.
Identify the techniques per region: Each region of the picture has its own stack: regex, section retrieval, hybrid retrieval (lexical search + embedding similarity), vision, SQL aggregation. A third axis (the agentic dimension, section 2.4) sits on top of these and decides how much runtime control the LLM gets. The catalog later in the article maps each region to its technique zone.
Locate your own case: Where do your documents sit on the complexity axis? Where do your questions sit on the control axis? The intersection points to a region, and to the techniques that fit it.

You’re not here to build everything. You’re here to find where you sit, then read the parts of the series that match. Most readers will skip half of it.

A note before the article gets technical. Most enterprise RAG comes in two shapes: extracting fields from templated documents (the regex case in the opener), or answering free-form questions on heterogeneous documents like contracts and reports (where the rest of the series spends most of its time). Conversational transcripts are a real third shape, common in customer service, HR, and compliance; sarcasm is the hardest question they raise. Pure vision content (schematics, slide decks) and corpus-scale questions (Part IV) come up less often. You may meet one or two of these. The grid below lets you locate your case on sight.

This diagnostic is one piece of a larger framing: Enterprise Document Intelligence Volume 1 builds enterprise RAG brick by brick, and the regions of the grid this article maps point to the articles in the series where each technique gets built.

1. Two axes: document complexity and question control

Every problem we’ll meet in this series sits somewhere on two axes:

Document complexity: How redundant is the structure across your documents? Can a parser address fields by position, by heading, or do you need a model that sees the page?
Question control: Who frames the question? An engineer writing a fixed prompt, or a user typing freely into a chat box, possibly with no idea what to ask?

These two axes are almost independent. The one coupling: a fixed-template document (Tier 1, below) usually forces engineer-templated questions (Tier A), since the user never types a question. Outside that corner, any document tier can pair with any question tier.

1.1 Document axis: from a fixed template to a vision model

Volume 1 stays inside the PDF scope. Multi-format documents (Word, Excel, PowerPoint, mail) are Volume 2’s territory; everything below describes one PDF at a time.

Documents vary in structural redundancy: how much of their layout is shared across the corpus. Five tiers cover most enterprise situations.

*five tiers of document complexity, with the technique that fits each – Image by author*

Tier 1: Fixed template: Every document has the same structure, the same fields in the same place, often produced by the same software: insurance certificates from a single broker, KYC forms, tax filings, internal compliance attestations. The structure is so predictable that you can address fields by their coordinates on the page. Technique: regex or coordinate-based extraction, no model.

Tier 2: Family of templates: Documents follow a recognizable pattern with variations (different vendor, different software, different year): invoices across suppliers, leases across landlords, employment contracts across companies in the same legal framework. Technique: a regex per template plus a few-shot LLM as fallback when the template drifts.

Tier 3: Heterogeneous structured: Each document has its own structure (sections, headings, tables of contents) but the structures don’t repeat across documents: custom legal contracts, technical manuals from different vendors, financial reports. Technique: parse the structure, retrieve via the document’s own table of contents.

Tier 4: Unstructured / OCR’d: Scanned PDFs, photos of paper, emails, free-form notes: the text is there but the layout is degraded or absent. Technique: OCR with confidence scoring, then hybrid retrieval (lexical + embeddings) over the noisy text.

Tier 5: Visually rich: Documents where the meaning lives in the visuals: schematics, dense data tables embedded as images, slide decks with charts, engineering drawings. A pure-text parse loses the answer. Technique: a vision-capable model on the page image, often combined with text-side RAG.

The further down this axis you sit, the more you pay per document. The right move is to push every problem as far up as honest analysis allows. A team that decides their corpus is “too complex for regex” without checking the structural redundancy is choosing the expensive answer by default.

1.2 Question axis: from a fixed prompt to a multi-turn chatbot

The question axis is the one most teams skip. Two questions can look identical syntactically yet require completely different stacks. The dimension that matters is who controls the question and how much.

*four tiers of question control, from a fixed engineer prompt to a free user query with clarification – Image by author*

Tier A: Engineer-templated: The question is a parameter of the system: “Extract the effective date.”, “What is the policy number?”. The engineer wrote the prompt, calibrated it, tested it on a thousand documents. The user, if any, doesn’t even type a question. Technique: field extraction, structured output, no question-parsing step needed.

Tier B: User fills slots: The question is a template with user-supplied values: “Show me the clause about {topic} in this contract.” The user picks the topic from a list, or types a tag. The shape of the query is fixed, only one slot varies. Technique: section retrieval, lookup against a known taxonomy.

Tier C: Free user query, one-shot: The user types whatever they want, the system answers in one go: “Why does this contract differ from last year’s?”. This is the classic chat-with-your-document setup, where the pipeline must parse the question, decide what to retrieve, and answer. Technique: single-document RAG with question parsing.

Tier D: Free query plus clarification: Same as C, but the system can ask the user back when the question is ambiguous: “Which page do you mean? Did you mean the sub-tenant or the main tenant?” This is what real chatbots do, and it dramatically widens the range of questions a system can serve. Technique: question parsing plus a clarification loop.

A small example to make the clarification idea concrete. Imagine a user asks: “What is the deductible?” on a single insurance contract that mentions deductibles in three sections (home, auto, travel coverage). A naive pipeline retrieves something plausible and returns a confident wrong answer. A system that can ask back (“Which coverage: home, auto, or travel?”) fixes the problem at the source.
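The deductible example can be sketched as follows. This is one possible shape for the ask-back decision, not the series' implementation: the retrieval hits, section names, pages, and amounts are all invented, and the ambiguity test (hits spread over several sections, none named in the question) is an assumption.

```python
# Hypothetical retrieval hits for a question about the deductible.
HITS = [
    {"section": "Home coverage", "page": 4, "text": "The deductible is 500 EUR per claim."},
    {"section": "Auto coverage", "page": 9, "text": "The deductible is 300 EUR per claim."},
    {"section": "Travel coverage", "page": 14, "text": "The deductible is 100 EUR per claim."},
]

def answer_or_clarify(question, hits):
    """Answer when one section is in play; otherwise ask the user which one."""
    if not hits:
        return {"type": "no_answer"}
    sections = sorted({hit["section"] for hit in hits})
    # If the question names exactly one coverage, restrict to that section.
    mentioned = [s for s in sections if s.split()[0].lower() in question.lower()]
    if len(mentioned) == 1:
        sections = mentioned
    if len(sections) == 1:
        hit = next(h for h in hits if h["section"] == sections[0])
        return {"type": "answer", "text": hit["text"], "page": hit["page"]}
    options = ", ".join(s.split()[0].lower() for s in sections)
    return {"type": "clarification", "text": f"Which coverage: {options}?"}

ambiguous = answer_or_clarify("What is the deductible?", HITS)
precise = answer_or_clarify("What is the auto deductible?", HITS)
```

The ambiguous question yields a clarification request; the precise one yields an answer with its page. The substring match on the coverage name is deliberately naive and would need a proper question parser in practice.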

This pushes a constraint upstream into parsing. To detect that the user mentioned “page 3” or “the second appendix”, your parser must have preserved page numbers, section indices, and heading text as metadata on every chunk. The page number sounds trivial when you look at any single document, but it is the simplest example of a parsing decision that the question side depends on. Article 5 covers this in detail.
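A minimal sketch of that dependency, under assumptions: the `Chunk` fields and the `filter_by_page_reference` helper are hypothetical names for illustration, not the data model Article 5 builds. The filter only works because the parser kept the page number on every chunk.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    page: int           # 1-based page number in the PDF
    section_index: str  # for example "4.2"
    heading: str

# Detects references such as "page 3" in the user's question.
PAGE_REF = re.compile(r"\bpage\s+(\d+)\b", re.IGNORECASE)

def filter_by_page_reference(question, chunks):
    """If the question names a page, keep only the chunks from that page."""
    match = PAGE_REF.search(question)
    if match is None:
        return chunks
    page = int(match.group(1))
    return [chunk for chunk in chunks if chunk.page == page]

chunks = [
    Chunk("Rent is due on the first of each month.", 2, "2.1", "Payment terms"),
    Chunk("The tenant may sublet with written consent.", 3, "4.2", "Subletting"),
]

scoped = filter_by_page_reference("What does page 3 say about subletting?", chunks)
```

If the parser had dropped `page`, no amount of cleverness on the question side could recover it.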

Corpus scale is a separate question, not a tier on this axis. “How many PDFs are in your corpus, and are they homogeneous or heterogeneous?” is a data-side concern, picked up by section 3.2 of the diagnostic and developed in Part IV (Articles 14-17). Mixing it into the question axis blurs two different things, so it stays out.

1.3 From case to technique zone

Cross the two axes and every single-PDF RAG problem lands somewhere on the picture. Each region calls for a different stack. Most teams build for one or two regions and pretend the rest don’t exist. The grid below is a thinking tool, not a strict taxonomy: real problems often sit between two cases, and the boundaries between zones are fuzzy on purpose.

*each case (a document tier × question tier) maps to the simplest technique that fits – Image by author*

The top-left corner (rows 1-2, columns A-B) is deterministic territory. Fixed templates, controlled questions. No LLM is needed for the field extraction itself; the LLM appears at most as a fallback when the template drifts. Most enterprise document workflows fall here, and most of them are over-engineered. The insurance-broker case from the opening is the canonical example: an LLM stack at sixty thousand euros a year when a hundred-line regex would do.

The middle band (rows 2-4, columns C-D) is single-document RAG. The chat-with-your-PDF use case every vendor demo shows. It is real, it is hard, and the rest of the series spends most of its time here. Chunking (splitting the document into searchable units), retrieval (picking the right ones), reranking (a precision pass on the shortlist), and evaluation (knowing it works) all matter when the document is heterogeneous and the question is open.
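Hybrid retrieval, as used in this band, has to merge a lexical ranking and an embedding ranking into one shortlist. The article does not say which fusion method the series uses; the sketch below uses reciprocal rank fusion, one common choice, with the conventional constant k=60. The chunk ids and both rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the chunk's fused score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

lexical_ranking = ["c7", "c2", "c9"]    # hypothetical keyword-search output
embedding_ranking = ["c2", "c4", "c7"]  # hypothetical embedding-search output

fused = reciprocal_rank_fusion([lexical_ranking, embedding_ranking])
```

Chunks that both retrievers agree on rise to the top, which is the point of going hybrid: each side covers the other's blind spots.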

The bottom row (row 5, all columns) is vision territory. Charts, schematics, dense tables. A text parser loses the answer regardless of how clever the retrieval is. Vision models fit here, and only here. Article 10 discusses when the vision step is worth its cost and when it isn’t.

Corpus-scale cases sit off the grid, since the grid is one PDF at a time. When the question targets many PDFs at once (“find every supplier contract with a liability cap below one million”), the diagnostic routes to Part IV (Articles 14-17): classification at ingestion, structured fields, SQL on the structured side, RAG on the residual unstructured questions.
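Once the fields are extracted at ingestion, the liability-cap question becomes a plain SQL filter. The sketch below assumes a hypothetical `contracts` table with invented rows; the schema that Part IV builds may look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts ("
    "contract_id TEXT PRIMARY KEY, doc_type TEXT, liability_cap REAL)"
)
# Hypothetical rows, as if produced by field extraction at ingestion.
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?)",
    [
        ("C-001", "supplier", 500_000),
        ("C-002", "supplier", 2_000_000),
        ("C-003", "lease", 250_000),
        ("C-004", "supplier", None),  # cap not extracted: a residual case
    ],
)

rows = conn.execute(
    "SELECT contract_id FROM contracts "
    "WHERE doc_type = ? AND liability_cap < ? "
    "ORDER BY contract_id",
    ("supplier", 1_000_000),
).fetchall()
below_cap = [row[0] for row in rows]
conn.close()
```

The contract with a missing cap is silently excluded by the comparison, so a real pipeline would list such rows separately and send them to the unstructured side.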

The grid isn’t a recipe. It’s a sanity check. Locate your problem, look at the technique zone, and ask whether the system you’re building matches. If you’re building deeper than the case calls for, you’re paying for nothing. If you’re building shallower, you’ll discover the gap in production.

2. The techniques per case, and what isn’t a technique

Once you’ve placed your problem on the grid, you know roughly which family of techniques applies. The rest of the series develops each technique in detail.

*each card is one technique with its dedicated article; read the ones that match your case, skip the rest – Image by author*

The deterministic family (regex, section anchors that locate a heading by name, coordinate-based extraction that pulls a field from a fixed bounding box on the page) doesn’t have its own article. It’s the baseline: every engineer reading this series should already know how to write a regex. The point of including it on the map is to remind you that it’s an option. When the structure of your input is fixed, it’s the option.

The single-document RAG family is what Parts II and III of the series are about. Layout-aware parsing (Article 5), question parsing and calibration (Article 6), retrieval as scope selection (Article 7), generation as controlled execution (Article 8), hybrid retrieval and TOC routing (Article 9), adaptive parsing including vision (Article 10), cross-references (Article 11), listing and synthesis (Article 12), composite pipelines with feedback loops (Article 13). Each of these is a technique you’ll reach for in the central band of the grid.

The corpus-scale family is Part IV. The corpus problem (Article 14), preparing a queryable corpus from a folder of PDFs (Article 15), the corpus ontology (Article 16), querying with SQL filter first and retrieval second (Article 17). These come in when you go from one PDF to a corpus of PDFs.

If your problem is in the top-left corner of the grid, you can read up to Article 5 (parsing), then skip ahead to Article 15 (preparing a queryable corpus). If your problem sits in the middle band, you’ll need Parts II and III. If your problem is corpus-scale, you’ll need Part IV on top of the foundation. The map tells you which.

2.1 Pick the simplest technique that works

The instinct of every engineering team is to build the most powerful pipeline they can justify. That instinct is wrong here. The right instinct is to pick the least powerful technique that solves the actual problem. Three reasons:

Cost: At two million docs a year, a regex on a VM is a rounding error; an LLM per document is sixty thousand euros.
Latency: Microseconds vs seconds, the difference between “feels instant” and “feels like waiting”.
Reliability: A regex either matches or it doesn’t and the engineer can read the rule; an LLM produces answers that are sometimes subtly wrong with failure modes harder to detect, which disqualifies it for audit-grade extraction.
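The cost figures above imply a per-document price, worth writing down once. The two inputs come from the text; nothing else is assumed.

```python
DOCS_PER_YEAR = 2_000_000
LLM_COST_PER_YEAR_EUR = 60_000

# Per-document LLM cost implied by the two figures above.
llm_cost_per_doc = LLM_COST_PER_YEAR_EUR / DOCS_PER_YEAR

print(f"Implied LLM cost per document: {llm_cost_per_doc:.2f} EUR")
```

Three cents per document sounds harmless on one invoice. It is the multiplication by volume that turns it into a budget line.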

Most production document workflows land on a hybrid: a deterministic core handling the bulk cleanly, with an LLM fallback for the cases where the format breaks. That hybrid is almost always the right shape, and almost never what teams build first.
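A minimal sketch of that hybrid shape, under assumptions: the field label and date format are invented, and `fake_llm_fallback` is a stand-in for a real model call, not an actual API.

```python
import re

# Hypothetical rule for one field of one template.
DATE_PATTERN = re.compile(r"Effective Date:\s*(\d{4}-\d{2}-\d{2})")

def extract_effective_date(text, llm_fallback):
    """Try the deterministic rule first; call the fallback only when it fails."""
    match = DATE_PATTERN.search(text)
    if match:
        return {"value": match.group(1), "source": "regex"}
    return {"value": llm_fallback(text), "source": "llm_fallback"}

def fake_llm_fallback(text):
    """Stand-in for an LLM call (hypothetical); None means 'needs human review'."""
    return None

clean = extract_effective_date("Effective Date: 2024-03-01", fake_llm_fallback)
drifted = extract_effective_date("Date of effect: 1 March 2024", fake_llm_fallback)
```

Recording the `source` on every value keeps the audit trail honest: you always know which answers came from a readable rule and which came from the model.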

2.2 Long context isn’t a way out

Every few months someone announces that “RAG is dead” because context windows just got bigger. The argument: dump the whole document in the prompt and let the model figure it out.

This works for one document and one user. It doesn’t work in production for four reasons:

Wasteful: A typical question doesn’t need the whole document. The effective date of a contract sits on one page; sending the other thirty-nine pays for tokens that won’t be used.
Misses information: Language models tend to use what sits at the start and end of a long context more reliably than what sits in the middle (Liu et al., Lost in the Middle), so the relevant page can go unused even when it’s in the prompt.
Doesn’t scale: Real use cases involve many documents. No context window will ever hold a corporate archive; at any meaningful scale you have to choose what to send, and that choice is retrieval.
No grounded answer: Without explicit retrieval and citation, you can’t tell which part of the document the answer came from, you can’t verify it, you can’t audit it. For any enterprise use case where the answer needs to be traceable, that’s disqualifying.

Long contexts are useful as a tool, especially for single-document deep analysis. They’re not a substitute for retrieval. Anyone telling you otherwise is selling something.

2.3 Fancy techniques are usually keyword work in disguise

Techniques sold as “advanced” often turn out to be keyword work in another form, and often the wrong form. HyDE (Hypothetical Document Embeddings, Gao et al., 2022) is the clearest example. The protocol asks an LLM to write the hypothetical document that would answer the query, then retrieves against the embedding of that hypothetical. The pitch is that the hypothetical carries the vocabulary a real answer would use, widening the cosine margin.

The companion notebook tests this on the Attention paper: ask why multi-head attention, let HyDE generate its passage, compare against the actual vocabulary of section 3.2.2. The two lists overlap on exactly one phrase, the section title. HyDE writes ML-textbook vocabulary (semantic relationships, contextual dependencies, parallel processing, attention patterns); the paper writes operational vocabulary (attention layers, encoder-decoder attention, different positions, linear transformations).
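The comparison itself is a set intersection. The sketch below is not the companion notebook's code; it only uses the phrases quoted above, and it assumes the shared section title is "multi-head attention" (the title of section 3.2.2 in the paper).

```python
# Phrases quoted in the text, plus the shared section title (assumed wording).
hyde_terms = {
    "multi-head attention",
    "semantic relationships",
    "contextual dependencies",
    "parallel processing",
    "attention patterns",
}
paper_terms = {
    "multi-head attention",
    "attention layers",
    "encoder-decoder attention",
    "different positions",
    "linear transformations",
}

overlap = hyde_terms & paper_terms      # vocabulary both sides share
hyde_only = hyde_terms - paper_terms    # vocabulary HyDE invented

print(f"Shared phrases: {sorted(overlap)}")
print(f"HyDE-only phrases: {len(hyde_only)}")
```

On these short lists, four of HyDE's five phrases never appear on the paper's side, so they add nothing a keyword search could match.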

HyDE understood the question. It never read the document. In enterprise the keywords exist somewhere on the page and the domain expert who has read the page knows them. HyDE pays per query to invent vocabulary that often does not even land on the page. The expert dictionary (Article 6), a curated list of the corpus’s actual vocabulary built once with the domain expert, gets the same job done at a fraction of the cost, reused across every future question.
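For contrast, an expert dictionary can be as plain as a lookup table. The entries below are hypothetical, and `expand_query` is an illustrative helper, not the structure Article 6 defines; in practice the domain expert supplies the terms from the corpus.

```python
# Hypothetical entries: user phrasing mapped to the corpus's actual vocabulary.
EXPERT_DICTIONARY = {
    "deductible": ["deductible", "excess", "retention"],
    "start date": ["effective date", "commencement date"],
}

def expand_query(question, dictionary):
    """Return the corpus terms to search for, based on phrases in the question."""
    lowered = question.lower()
    terms = []
    for user_phrase, corpus_terms in dictionary.items():
        if user_phrase in lowered:
            terms.extend(corpus_terms)
    return terms

search_terms = expand_query("What is the start date of the policy?", EXPERT_DICTIONARY)
```

The lookup costs nothing per query, and every term it returns is one that exists on the page, because someone who read the page put it there.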

2.4 Letting the LLM pick the case

Each combination of document tier and question tier is an elementary case, with one matching technique. In Volume 1, the engineer picks the case at compile-time and ships the technique. The dispatcher (Article 13) encodes the team’s routing wisdom in Python; the LLM critiques outputs inside fixed loops; every brick is auditable. That is enough for the vast majority of enterprise RAG.
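Picking the case at compile-time can be as simple as a lookup written by the engineer. The sketch below encodes only the three zones that section 1.3 names explicitly; it is not the dispatcher of Article 13, and the label returned for cells the text leaves open is an assumption.

```python
def technique_zone(doc_tier, question_tier):
    """Map a case (document tier 1-5, question tier A-D) to a zone of the grid."""
    if doc_tier not in range(1, 6) or question_tier not in ("A", "B", "C", "D"):
        raise ValueError("doc_tier must be 1-5 and question_tier one of A-D")
    if doc_tier == 5:
        return "vision"  # bottom row, all columns
    if doc_tier in (1, 2) and question_tier in ("A", "B"):
        return "deterministic"  # top-left corner
    if doc_tier in (2, 3, 4) and question_tier in ("C", "D"):
        return "single-document RAG"  # middle band
    # Cells the text does not assign to a zone (assumed label).
    return "between zones: decide with the domain expert"

zone = technique_zone(3, "C")
```

The function is boring on purpose. The routing is readable, testable, and fixed before any document arrives, which is what makes every brick auditable.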

A natural extension has the LLM itself pick the case at runtime, looking at the question, classifying it into a case, and choosing the technique to apply. That is what the industry calls agentic RAG in 2026. Volume 3 (Agentic Bricks) builds that runtime-pick layer on top of the bricks Volume 1 produces. The shift is about who decides when, not about the bricks themselves: agentic stacks still reach for the same parsing, retrieval, and generation primitives that Volume 1 audits and tests.

3. Locate your case, in practice

3.1 Position the system around the expert who exists

The diagnostic below needs one input most teams skip: who is the user of this system?

For almost all enterprise RAG, the answer is the expert who already knows the documents. Not an open-domain user typing any question. Not a curious browser exploring a public archive. The lawyer reading a contract. The underwriter checking a quote. The compliance officer auditing a clause. Someone who has read documents like these for years, and who knows the vocabulary, the cases where one term means two things, and the failure modes to watch for.

The job of the system is then clear: amplify that expert, not replace them. Codify their vocabulary, their disambiguations, their year-by-year heuristics. Let the pipeline handle the volume; let the expert stay the source of truth.

This matters before the grid, because it changes which cases are realistic. A team that says “anyone can ask anything across the whole archive” is choosing the bottom-right case by default: open question, mixed corpus, the hardest one. A team that says “our underwriter checks a known field on a known document type” is choosing the top-left, often regex territory.

The framing is rarely a property of the documents or the questions. It is a choice the team makes. Most teams inherit it from consumer chatbots without noticing. First, position the system around the expert who is already there. Then read the case on the grid the answer points to.

3.2 The diagnostic questions

Before writing any code, work through these questions. Out loud, in front of a whiteboard, with the domain experts in the room.

About the documents: How alike are they across the corpus? Native text or OCR? How many PDFs do you have, and are they homogeneous or heterogeneous? (This is where corpus-scale concerns enter the diagnostic; they route to Part IV.) Static or daily ingestion? Where on the document axis do they sit?

About the questions: Who frames them? An engineer at design time, or a user at run time? Is the system one-shot or can it ask back for clarification? Is the answer always in one document, or distributed across several? What does no answer mean: acceptable, or unacceptable? Where on the question axis do they sit?

About the constraints: Does the answer need to be traceable to the source? How precise (best-effort, or audit-grade: every citation traceable to a source line, every answer replayable)? What’s the cost budget per document? Sometimes the difference between regex and LLM is the difference between profitable and not.

The answers point you to a case on the grid. The case points you to a technique zone. The technique zone points you to the articles in the rest of the series you’ll need.

3.3 Common enterprise cases on the grid

A handful of patterns show up repeatedly in real engagements. Most readers will recognize themselves in one of these.

Field extraction from a fixed-template form: Think insurance certificates from one broker, KYC forms from one bank, tax filings from one administration: the same software writes the same layout on every page. Case: doc tier 1, question A, top-left corner. Stack: regex on coordinate-addressable fields, with an LLM fallback for the rare drift. The classic playbook is overkill here, and that’s the most common mistake we meet in real projects.

Field extraction across template variants: Think invoices across hundreds of suppliers, leases across landlords, employment contracts across companies in the same legal framework: every document follows one of a handful of recognizable patterns. Case: doc tier 2, question A or B. Stack: a regex per recognized template, plus a few-shot LLM extraction when the document doesn’t match anything in the registry. Classification before extraction.

Q&A on a long custom contract: Each contract is structured differently, sections vary, ten-page glossaries don’t repeat. The user asks free-form questions about the contract in front of them. Case: doc tier 3, question C or D, middle band. Stack: full single-document RAG with TOC routing, hybrid retrieval, schema-driven generation. This is where the four bricks of the series each carry their own weight.

Reading a slide deck or a schematic: Think engineering drawings, financial decks where data lives in the chart, technical specs with embedded images: pure-text parsing loses the answer outright. Case: doc tier 5, any question column, bottom row. Stack: vision-capable model on the page image, combined with text-side RAG for the prose around the visuals.

Off the grid – corpus territory: “Find every supplier contract with a liability cap below one million” on hundreds or thousands of contracts. The single-PDF grid stops being the right frame; the question targets the corpus, not one document. Stack: field extraction at ingestion, structured fields stored in a database, SQL on the structured side, RAG only as a fallback for the residual unstructured questions. Articles 14-17 (Part IV) develop this.

Off the grid – no structure to anchor on: a novel, an intent classification, sarcasm detection. The document has no structure, the vocabulary has no characteristic terms, and the question requires understanding tone or intent rather than locating a passage. Stack: an LLM that scans the whole text paragraph by paragraph, deciding what to flag. Not a RAG problem in Volume 1’s sense; section 2.4 hints at where this kind of runtime decision-making belongs (Volume 3).

If your case doesn’t quite match any of these, walk the diagnostic in section 3.2 and the result will tell you which of the patterns above is closest.

4. Conclusion

Run the diagnostic on your own corpus before writing code, ideally with the domain experts in the room. The output is the list of articles in the rest of the series you need to read, and the list you can skip. Teams that get RAG to ship in production are the ones that located their problem on the grid first. Teams still tuning six months in are usually the ones that started building before they did.

The next article opens Part II with the first brick: document parsing. Everything lost there cannot be recovered later, no matter how clever the retrieval.

5. Sources and further reading

The two-axis grid is a map of where each approach fits across document complexity and question control on a single PDF. The long-context-doesn’t-replace-retrieval claim the grid leans on is grounded by Liu et al. (Lost in the Middle, TACL 2024) and Lee et al. (long-context benchmark, 2024). The vision row maps to Faysse et al. (ColPali, 2024). The HyDE demo uses the technique from Gao et al. (HyDE, 2022). The agentic extension hinted at in section 2.4 (the LLM picking the case at runtime) is the direction Volume 3 develops on top of the bricks built here.

Same direction as the article:

Liu et al., Lost in the Middle: How Language Models Use Long Contexts, TACL 2024 (arXiv:2307.03172). Models systematically miss information mid-input. Supports the claim that long context is not a way out.
Lee et al., Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?, 2024 (arXiv:2406.13121). Concrete data on where long-context replaces retrieval and where it breaks.
Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Vision-language retrieval on the page image itself. Anchors the visual row of the grid.
Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), 2022 (arXiv:2212.10496). The hypothetical-document-embedding technique tested in section 2.3.

Different angle, different context:

Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023 (arXiv:2210.03629). Founding paper of the LLM-picks-tools-at-runtime line. Volume 3 develops this on top of the bricks Volume 1 builds.
Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS 2023 (arXiv:2302.04761). Same direction as ReAct.
Gao et al., Retrieval-Augmented Generation for Large Language Models: A Survey, 2024 (arXiv:2312.10997). RAG survey; treats RAG as one paradigm with shared concerns (retriever quality, generator faithfulness).