Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section

Last updated: 2026/06/21 at 4:46 PM

Contents

1. Two halves: read the entries, then find their real pages
2. Three cases, by ascending cost
3. Follow the links
4. Read the printed contents page, then find its real pages
4.1 Detecting and reading the contents page
4.2 The label is not the page
5. The LLM disposes, it does not detect
6. One uniform toc_df, whatever fired
7. How well does it work?
Conclusion

This is the document parsing companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. It extends Article 5 (document parsing) on one table: toc_df, the document’s section structure, which Article 5 fills from the PDF’s native outline (PyMuPDF’s doc.get_toc()) when there is one. This part covers the case where there isn’t one: reconstructing that structure from what the document still shows on the page.

*Where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), reconstructing the table of contents when the PDF ships none. Image by author*

Open NIST FIPS 202, the SHA-3 standard (a US Government work, public domain, see the NIST copyright statement), and turn to page seven. There is a clean table of contents: section titles on the left, page numbers on the right. Now open the same file in any PDF viewer and look at the bookmarks pane. Empty. The contents page is ink on a page, not structure the machine can use. The author wrote a perfectly good table of contents, and the file shipped without exposing it.

Article 5 (document parsing) and Article 5B (the relational data model) leaned on doc.get_toc(), the PDF’s native outline, to fill toc_df. It is exact when it exists. It often does not. Plenty of real documents (papers exported straight from LaTeX, contracts printed to PDF, government standards) carry a printed contents page but no outline. For those, toc_df comes back empty, even though the document is telling you its structure in plain sight on page seven.

That structure is not a nicety. Retrieval scopes by section (Article 7). The chunker cuts on heading boundaries (Article 5B). Summarization walks the document section by section. Every one of those steps reads toc_df. When it is empty, retrieval falls back to scanning every page, the chunker splits on blind page breaks, and the answer loses the document’s own structure. So the question this article answers is narrow and practical: when the file ships no outline but prints a contents page, how do you turn that page back into a toc_df?

One thing up front, because it is easy to conflate. This is about documents that have a contents page. A document with no contents page at all (a paper that just opens with “1. Introduction”, a five-page memo, an export that stripped every heading) is a different problem. Recovering a skeleton from the body of an unstructured document is summarization, a separate intent that builds the map from the chunks rather than reading one off a page. Here we only ever read a contents page the document already has.

1. Two halves: read the entries, then find their real pages

It helps to separate two things a contents page gives you. The first is a list of sections with titles and a hierarchy: what the document is about, in what order. The second is a map from each section to where it physically starts in the file. The native outline hands you both for free. Reading a printed contents page hands you the first directly, but the second only as printed labels, which are not physical pages. The two halves have different failure modes, so the rest of this article keeps them separate: first read the entries, then align them to physical pages.

In: a PDF whose doc.get_toc() returns nothing but that prints a contents page. Out: a toc_df with the same shape Article 5B defined (level, title, start_page, end_page, breadcrumb), so everything downstream keeps working unchanged.

The contents page comes in two flavours, and they cost different amounts to read.

2. Three cases, by ascending cost

*The cascade tries each case in turn and stops at the first that yields a usable TOC. Image by author.*

Each case has a detection step and an extraction step, and falls through to the next when it fails or returns too little.

Case 1, native outline. Handled in Article 5 by build_toc_df. Free, exact, hierarchical. When it works there is nothing to do. We recap it only to set the cost baseline.
Case 2, contents page with links. No outline, but an early page lists titles as hyperlinks pointing inside the file. The link target is the physical page, so this case skips the alignment problem entirely.
Case 3, contents page without links. A page that looks like a printed contents (titles, dot leaders, right-aligned page numbers) but carries no links. The page numbers it prints are labels in the document’s own numbering, not physical pages, so this case needs the alignment step.

All of this lives in a module of its own, separate from the native path so Article 5 stays readable. The entry point is reconstruct_toc_df.

3. Follow the links

Case 2 is the lucky one. Some documents have no outline but do ship a clickable contents page. The NIST Cybersecurity Framework is one: page two lists every section as a hyperlink that jumps into the document. PyMuPDF exposes those links per page, and each internal link carries its target page directly.

In: the PDF (links are not in line_df, so this reader opens the file). Out: entries with a title and the physical target page, already resolved.

The detection is a density check: a page with five or more internal links is a navigation page, not a body page with the odd footnote link. The extraction joins each link’s clickable rectangle back to the text under it, then strips the leaders and the trailing page label.

import fitz   # PyMuPDF

def extract_toc_from_links(pdf_path, min_links=5):
    """The contents page is the page carrying the most internal links."""
    best = []
    with fitz.open(pdf_path) as doc:                      # close the file when done
        for page in doc:
            entries = []
            for link in page.get_links():
                if link["kind"] != fitz.LINK_GOTO:        # internal jump only
                    continue
                # text_under_rect and clean are helpers not shown here
                label = clean(text_under_rect(page, link["from"]))
                if label:
                    entries.append({"title": label,
                                    "start_page": link["page"] + 1,  # 0-based target -> 1-based page
                                    "level": 1})
            if len(entries) >= min_links and len(entries) > len(best):
                best = entries                            # richest link page wins
    return best

Run it on the Framework and the recovered contents are clean:

*Every title resolved to a real page, no LLM, no guesswork. Image by author*

Put the detector’s output next to the page it read and you can check it by eye. The Framework’s contents page lists each section, then a List of Figures and a List of Tables; the detector recovers all three groups, titles and target pages matching line for line.

*Left, the document’s own contents page; right, what the detector returns. Image by author*

This is the case to hope for. It is deterministic, it is exact, and the page mapping is solved by the document itself. The catch is that most documents that lack a native outline also lack clickable links, which takes us to the harder case.
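The link reader calls two helpers that the article does not show. The sketch below is one way they could look, not the module's actual code. `words_under_rect` is a hypothetical stand-in for text_under_rect: it takes the word list rather than the page so that it runs without a PDF, and it assumes tuples shaped like the output of PyMuPDF's `page.get_text("words")`, that is `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The patterns in `clean` are likewise an assumption about what "strips the leaders and the trailing page label" means in practice.

```python
import re

# A leader run ("....", "…", mixed with spaces) followed by a short page label.
_LEADER_AND_LABEL = re.compile(r"\s*[.…][\s.…]+\d{1,3}\s*$")
# A page label separated from the title by a gap of two or more spaces.
_GAP_AND_LABEL = re.compile(r"\s{2,}\d{1,3}\s*$")


def words_under_rect(words, rect):
    """Join the words whose centre lies inside rect, in reading order.

    words: tuples (x0, y0, x1, y1, word, block_no, line_no, word_no)
    rect:  (x0, y0, x1, y1), for example tuple(link["from"])
    """
    rx0, ry0, rx1, ry1 = rect
    inside = [w for w in words
              if rx0 <= (w[0] + w[2]) / 2 <= rx1
              and ry0 <= (w[1] + w[3]) / 2 <= ry1]
    inside.sort(key=lambda w: (w[5], w[6], w[7]))   # block, line, word order
    return " ".join(w[4] for w in inside)


def clean(text):
    """Strip dot leaders and the trailing page label from a contents line."""
    text = text.replace("\n", " ").strip()
    text = _LEADER_AND_LABEL.sub("", text)   # "Title ..... 12" -> "Title"
    text = _GAP_AND_LABEL.sub("", text)      # "Title     12"   -> "Title"
    return " ".join(text.split())            # collapse leftover whitespace
```

Testing the word centre rather than the whole box is a deliberate choice: link rectangles are often a point or two tighter than the glyphs they cover, and a strict containment test would drop words at the edges.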

4. Read the printed contents page, then find its real pages

Case 3 is the common one: a printed table of contents with no links behind it. The page is headed “Contents” or “Table of contents” and carries a column of titles and a column of page numbers, often joined by dot leaders. FIPS 202 has exactly this. A human reads it at a glance. Parsing it has two distinct steps, and the second is the one people skip.

4.1 Detecting and reading the contents page

First, find the contents page. The signal that actually separates a contents page from prose is dot-leader density: several lines of the shape Some title .......... 42. A keyword like “contents” raises confidence but is not required, and on its own is a weak signal (a sentence can say “table of contents”). The reader works on line_df alone, so it is engine-agnostic.

In: line_df. Out: entries with a title and a displayed_page, the page number as printed on the line.

import re
# "Introduction ......... 12"             "Introduction       12"
# \s* absorbs the gap before the leaders, so the title does not end in " ."
DOTTED   = re.compile(r"^(.*?\S)\s*[.…](?:[.…\s]){2,}(\d{1,3})$")
TRAILING = re.compile(r"^(.{2,70}?\S)\s{2,}(\d{1,3})$")

def extract_toc_from_contents(line_df):
    entries = []
    for page in find_contents_pages(line_df):    # pages dense in dot leaders
        for line in lines_of(line_df, page):
            m = DOTTED.match(line) or TRAILING.match(line)
            if m:
                title, label = m.group(1).strip(), int(m.group(2))
                entries.append({"title": title,
                                "displayed_page": label,      # printed label
                                "level": infer_level(title)}) # "2.3.1" -> 3
    return entries
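The reader above leans on find_contents_pages and infer_level without showing them. A minimal sketch of both follows, under stated assumptions: the input is an iterable of `(page, text)` pairs rather than line_df itself (with a pandas frame this could be `zip(line_df["page"], line_df["text"])`, if those column names apply), and the threshold of five leader lines is a guess at what "several lines" means, not a value taken from the article.

```python
import re
from collections import Counter

# A line ending in a run of leaders followed by a short page label.
LEADER_LINE = re.compile(r"[.…][.…\s]{2,}\d{1,3}\s*$")
# A leading section number such as "2", "2.3" or "2.3.1".
SECTION_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s")


def find_contents_pages(lines, min_leader_lines=5):
    """Return the pages that hold at least min_leader_lines dot-leader lines."""
    counts = Counter(page for page, text in lines if LEADER_LINE.search(text))
    return sorted(page for page, n in counts.items() if n >= min_leader_lines)


def infer_level(title):
    """Depth from the numbering prefix: "2.3.1 Padding" -> 3, unnumbered -> 1."""
    m = SECTION_NUMBER.match(title)
    return m.group(1).count(".") + 1 if m else 1
```

Counting leader lines per page, rather than looking for the word “contents”, matches the signal the article describes: a body page that happens to contain one dotted line stays below the threshold.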

4.2 The label is not the page

Here is the subtlety. The contents page says Introduction .... 1. Page 1 of the file is the cover, not the introduction. Front matter (a cover, a foreword and the contents page itself) sits in front of the body, so the printed label and the physical page live in different numbering spaces. Open the file to the physical page that the label names and you land several pages early, every time.

So a printed page number is a label, and it goes into displayed_page. Mapping it to the physical start_page is a second step. The cheap version assumes one constant offset: physical = displayed + shift. To find the shift, sample a handful of titles and try every plausible offset, keeping the one under which the most titles actually appear on their shifted page.

def infer_page_shift(line_df, entries, max_shift=40):
    """Best constant offset: physical_page = displayed_label + shift."""
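    # normalise the page text the same way as the titles, or the match never fires
    page_text = {p: norm(text_of(line_df, p)) for p in pages(line_df)}
    sample = [(e["displayed_page"], norm(e["title"])) for e in entries][:20]
    best_shift, best_score = 0, 0          # no hit at all -> keep shift 0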
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(1 for label, title in sample
                   if title in page_text.get(label + shift, ""))
        if hits > best_score:              # most titles land where predicted
            best_score, best_shift = hits, shift
    return best_shift

*Printed labels 1, 2, 4, 7 map to physical pages 4, 5, 7, 10 once the front-matter shift is found. Image by author*
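To make the search concrete, here is a self-contained variant of the same idea run on the toy numbers from the figure. It is a sketch, not the module's function: it takes a plain dict of page texts instead of line_df, lower-casing stands in for norm, and the page texts are invented for illustration.

```python
def infer_shift(page_text, entries, max_shift=40):
    """Constant offset under which the most titles appear on label + shift."""
    best_shift, best_hits = 0, 0
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(1 for label, title in entries
                   if title.lower() in page_text.get(label + shift, "").lower())
        if hits > best_hits:               # strictly better, so 0 hits keeps shift 0
            best_shift, best_hits = shift, hits
    return best_shift


# Invented ten-page file: three pages of front matter, then the body.
page_text = {
    1: "Cover",
    2: "Foreword",
    3: "Contents  Introduction .... 1  Scope .... 2  Definitions .... 4  Method .... 7",
    4: "1 Introduction  This standard specifies ...",
    5: "2 Scope ...",
    6: "(continued)",
    7: "3 Definitions ...",
    8: "(continued)",
    9: "(continued)",
    10: "4 Method ...",
}
entries = [(1, "Introduction"), (2, "Scope"), (4, "Definitions"), (7, "Method")]

shift = infer_shift(page_text, entries)
physical = [label + shift for label, _ in entries]
print(shift, physical)
```

The contents page itself contains every title, but each candidate shift sends at most one entry to it, so it can contribute only a single hit. The true shift is the only one where all four titles land.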

The same thing happens on a real document. FIPS 202 prints its contents page on physical pages 7 and 8, and its body numbering starts well after the front matter. Run the detection and the alignment on it and the inferred shift comes out at +8: the introduction the contents page calls page 1 actually starts on physical page 9.

*Eight pages of front matter, so every printed label lands eight pages later in the file. Image by author*

Side by side with the page it read, the two columns are the whole point. The label column reproduces what the contents page prints; the page column is where each section actually begins in the file.

*Left, the document’s own contents page; right, what the detector returns, label and physical page. Image by author*

A constant shift covers the common case. When numbering restarts partway through (an appendix that resets to 1, inserted plates), the offset is not constant, and the fallback is content matching: locate each title’s real page by fuzzy-matching its text against the body, keeping the pages monotonically non-decreasing. align_to_physical_pages runs the shift first and falls back to content matching, so Case 3 hands the same physical start_page downstream as Case 2.
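The content-matching fallback can be sketched with the standard library alone. This is an illustration of the idea under assumptions, not the article's implementation: `match_titles_to_pages`, the `page_lines` dict of page number to list of line strings, the `start_page` parameter and the 0.85 similarity threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def _norm(s):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(s.lower().split())


def match_titles_to_pages(titles, page_lines, start_page=1, min_ratio=0.85):
    """Find, for each title in order, the first page at or after the previous
    match that has a line resembling it. Unmatched titles map to None."""
    pages = sorted(page_lines)
    floor = start_page                      # pages never go backwards
    result = []
    for title in titles:
        target, found = _norm(title), None
        for page in pages:
            if page < floor:
                continue
            if any(SequenceMatcher(None, target, _norm(line)).ratio() >= min_ratio
                   for line in page_lines[page]):
                found = page
                break
        result.append(found)
        if found is not None:
            floor = found                   # enforce monotonic non-decreasing pages
    return result
```

Passing the first page after the contents page as `start_page` keeps the contents page from matching its own entries. The moving floor is what stops a repeated heading such as “Summary” from being pulled back to an earlier occurrence.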

When the printed contents page is too irregular for the patterns (a two-column layout, titles that wrap, leaders rendered as ragged whitespace), the LLM extractor takes over with a typed schema, reading the first pages and returning the same entry shape. That is a tool of last resort for this case, not the default, because a clean printed contents page is cheap to read and the LLM is not. The LLM here still only reads the contents page; it never invents a structure for a document that has none.

5. The LLM disposes, it does not detect

Both detection methods are heuristics, and heuristics make mistakes: a link rectangle that swept up two titles, a contents line the patterns split wrong, a numbering that looks off. The reflex with an LLM is to hand it the whole document and ask for a TOC. That is the most expensive and least auditable option. The better division of labour is the inverse: the heuristic proposes a TOC, and the LLM only checks whether it holds together.

from pydantic import BaseModel

class TocCoherenceVerdict(BaseModel):       # typed structured output
    is_coherent: bool
    issues: list[str]

SYSTEM = ("A heuristic already proposed this TOC. Do NOT detect structure. "
          "Judge only: is the numbering consistent (no unexplained skips), "
          "are the page numbers non-decreasing, does the hierarchy form a "
          "sensible tree?")

def check_toc_coherence(toc_df):
    view = "\n".join(f"[{r.start_page}] {'  ' * (r.level - 1)}{r.title}"
                     for r in toc_df.itertuples())
    return llm_parse(input=[{"role": "system", "content": SYSTEM},
                            {"role": "user", "content": view}],
                     text_format=TocCoherenceVerdict, label="toc.coherence")

This is faster, cheaper, and more auditable than full-LLM extraction, and it degrades gracefully: if the LLM is unavailable, the heuristic TOC is still usable with a confidence penalty.
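One way to picture the graceful degradation is a thin wrapper around the checker. Everything in this sketch is hypothetical: the function name, the base confidence and both penalties are illustrative values, not numbers from the article, and `checker` stands for any callable that returns an object with is_coherent and issues (such as check_toc_coherence) or raises when the LLM cannot be reached.

```python
from types import SimpleNamespace


def toc_confidence(toc_view, checker, base=0.9,
                   unchecked_penalty=0.2, issue_penalty=0.1):
    """Return (confidence, notes) for a heuristic TOC.

    The heuristic TOC is always kept; the checker only moves its confidence.
    """
    try:
        verdict = checker(toc_view)
    except Exception as exc:                 # LLM down, timeout, quota...
        note = f"coherence check skipped: {exc}"
        return round(base - unchecked_penalty, 2), [note]
    if verdict.is_coherent:
        return base, []
    penalty = issue_penalty * len(verdict.issues)
    return round(max(0.0, base - penalty), 2), list(verdict.issues)


# Two stand-in checkers to exercise both paths.
def offline(_view):
    raise ConnectionError("LLM unavailable")


def flags_a_skip(_view):
    return SimpleNamespace(is_coherent=False, issues=["section 3 is missing"])


conf_offline, notes_offline = toc_confidence("[9] 1 Introduction", offline)
conf_flagged, notes_flagged = toc_confidence("[9] 1 Introduction", flags_a_skip)
```

The caller always gets a TOC back; only the confidence and the notes change.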

6. One uniform toc_df, whatever fired

The point of the cascade is that downstream code never learns which case ran. Whether the TOC came from links, a printed contents page or the LLM, it leaves through the same canonicaliser and arrives as the toc_df Article 5B defined, with two columns added: displayed_page (the printed label, for audit) and source (which method fired).

DETECTORS = {"links":         extract_toc_from_links,     # Case 2, reads the PDF
             "contents_text": extract_toc_from_contents,  # Case 3, reads line_df
             "llm":           extract_toc_by_llm}         # hard layout

def reconstruct_toc_df(pdf_path):
    line_df = load_line_df(pdf_path)    # assumed helper name: the Article 5 parse
    for method in ("links", "contents_text", "llm"):    # ascending cost
        # only the contents-text reader is defined on line_df; the others get the path
        detector_input = line_df if method == "contents_text" else pdf_path
        entries = DETECTORS[method](detector_input)
        if not entries:
            continue                                     # fall through
        toc_df = canonicalize(entries, source=method)   # one shape out
        if method == "contents_text":
            toc_df = align_to_physical_pages(toc_df, line_df)  # label -> page
        return toc_df
    return empty_toc_df()       # no contents page -> summarization's job

Calling it is one import and one line. The returned frame is the same toc_df Article 5B defined, plus the two added columns: displayed_page, the printed label, and source, which records which case fired.

# NIST FIPS 202 prints a contents page but ships no native outline:
# Case 3 fires (contents_text), the label-to-page alignment runs, source="contents_text".

toc_df = reconstruct_toc_df("data/nist/NIST.FIPS.202.pdf")

toc_df.head()              # title, level, start_page, end_page, displayed_page, source
toc_df["source"].iloc[0]   # "links" | "contents_text" | "llm"  -- which case fired

Run it across the two worked examples and the cascade routes each to the cheapest method that works, while the caller sees one toc_df every time.

*Links for the linked contents page, text patterns for the printed one. Image by author*
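The canonicaliser is where end_page and breadcrumb get filled in. The sketch below shows one plausible way to derive them, using plain dicts instead of a DataFrame. It is not Article 5B's definition: `canonicalize_entries` and its `last_page` parameter are hypothetical, the entries are assumed to carry a physical start_page already, and the rule that a section ends just before the next entry at the same or a shallower level is an assumption.

```python
def canonicalize_entries(entries, source, last_page):
    """Fill end_page and breadcrumb, and tag each row with the method that fired."""
    rows, trail = [], []                     # trail: titles of the current ancestors
    for i, entry in enumerate(entries):
        level = entry["level"]
        trail = trail[:level - 1] + [entry["title"]]
        # First later entry at the same or a shallower level closes this section.
        next_start = next((n["start_page"] for n in entries[i + 1:]
                           if n["level"] <= level), None)
        if next_start is None:
            end_page = last_page
        else:
            end_page = max(entry["start_page"], next_start - 1)
        rows.append({"level": level,
                     "title": entry["title"],
                     "start_page": entry["start_page"],
                     "end_page": end_page,
                     "breadcrumb": " > ".join(trail),
                     "displayed_page": entry.get("displayed_page"),
                     "source": source})
    return rows


entries = [
    {"title": "1 Introduction", "level": 1, "start_page": 9,  "displayed_page": 1},
    {"title": "1.1 Scope",      "level": 2, "start_page": 9,  "displayed_page": 1},
    {"title": "1.2 Terms",      "level": 2, "start_page": 10, "displayed_page": 2},
    {"title": "2 Method",       "level": 1, "start_page": 12, "displayed_page": 4},
]
rows = canonicalize_entries(entries, source="contents_text", last_page=20)
```

The `max` guard covers two sections that start on the same page: the earlier one then ends on its own start page instead of one page before it.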

7. How well does it work?

It is worth checking the reconstruction against ground truth. Take documents that do carry a native outline, hide it, run the contents-page methods, and score the result against the native TOC. scripts/eval_toc_vs_native.py does this and reports three numbers: recall (the share of native entries recovered), precision (the share of reconstructed entries that are real), and the share of matched entries whose start page lands within one page of the native one.

*The link reader is near-exact (the link target is authoritative); the text-pattern reader is softer, because reading a printed page and aligning labels is genuinely harder. Image by author*

The link case is near-exact because the link target is authoritative; the text case is softer because reading a printed page and aligning labels is genuinely harder. Notice the link reader’s recall swings with the document (86% on SP 800-30r1, 45% on SP 800-207, where many entries are not links), while its precision stays high: what it does recover, it places correctly. Neither method is magic, and the coherence check is there to catch the misses.
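The three scores are simple to compute once both TOCs are in hand. The function below is a sketch of the metrics as described, not the contents of scripts/eval_toc_vs_native.py. Matching titles on a case-insensitive and whitespace-insensitive comparison is an assumption, and duplicate titles collapse to one entry in this simplified version.

```python
def score_toc(native, rebuilt, page_tolerance=1):
    """Score a reconstructed TOC against the native one.

    native, rebuilt: lists of (title, start_page) pairs.
    """
    def key(title):
        return " ".join(title.lower().split())

    native_pages = {key(t): p for t, p in native}
    rebuilt_pages = {key(t): p for t, p in rebuilt}
    matched = native_pages.keys() & rebuilt_pages.keys()

    recall = len(matched) / len(native_pages) if native_pages else 0.0
    precision = len(matched) / len(rebuilt_pages) if rebuilt_pages else 0.0
    close = sum(1 for k in matched
                if abs(native_pages[k] - rebuilt_pages[k]) <= page_tolerance)
    page_hit_rate = close / len(matched) if matched else 0.0
    return {"recall": recall, "precision": precision,
            "page_within_tolerance": page_hit_rate}


# Invented example: three of four native entries recovered, one spurious entry.
native = [("1 Introduction", 9), ("2 Glossary", 12), ("3 Functions", 15), ("4 Modes", 22)]
rebuilt = [("1 introduction", 9), ("2 Glossary", 13), ("3 Functions", 19), ("Table 1", 30)]
scores = score_toc(native, rebuilt)
```

Keeping page accuracy conditional on a title match separates the two failure modes the article distinguishes: entries the reader missed lower recall, and entries it found but placed on the wrong page lower the page score.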

Conclusion

A PDF without a native outline is not a dead end as long as it prints its own contents page. Case 1 reads the outline the file ships. Case 2 follows clickable links and gets the physical page for free. Case 3 reads the printed contents page, then does the step most people skip, mapping the printed label to the real page. The cascade tries them cheapest first and stops at the first that works, the LLM checks coherence rather than doing the detection, and everything leaves as the same toc_df. A document that prints no contents page at all is a different problem, summarization, which builds the structure from the body. Article 7 (retrieval) picks that toc_df back up to scope answers by section.

Earlier in the series:

Document Intelligence: series intro. What the series builds, brick by brick, and in what order.
Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown terms, negation, term-vs-answer relevance), and how to use it anyway.
Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder adds over bi-encoder embeddings, measured, and when it is worth the latency.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
From regex to vision models: which RAG technique fits which problem. Two axes, document complexity and question control, that pick the technique for each case.
10 common RAG mistakes we keep seeing in production. Ten production mistakes, organized brick by brick, with the fix for each.
Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document’s nature, signals, and summary.
Stop returning flat text from a PDF: the relational shape RAG needs. The second half of the parsing brick: the relational tables every downstream brick reads.
