Among the most important enterprise AI use cases today, document comparison ranks alongside conversational chatbots. Organizations spend a vast number of person-hours comparing contracts, policies, technical specifications, legal petitions, research papers and more to identify differences, risks, revisions and semantic inconsistencies.
However, document comparison is far more complex than traditional text diffing. For one, these tools are meant to be effective assistants to legal and commercial professionals, scientists and others, who expect analysis with the depth and language of a junior professional in the domain.
An even harder problem is that meaning in enterprise documents usually isn’t contained in isolated chunks. It is embedded within sections, hierarchies, clause groupings and relationships, and these may be scattered across a document spanning well over 100 pages. For example, a credit agreement may define collateral limitations in one section, list exceptions to them several pages later, and describe enforcement rights under a completely different article. If another agreement is compared against this one using criteria such as “collateral structure, security interests, and lien requirements,” the system must identify, retrieve, and synthesize all of these structurally scattered sections before any meaningful comparison can occur.
The Proxy-Pointer architecture, with its structure-aware yet low-cost retrieval pipeline that preserves document hierarchy during retrieval and comparison, is ideally suited for this task. Using a combination of hierarchical breadcrumb embeddings and a lightweight LLM re-ranker, it precisely extracts semantically aligned regions across documents before comparative reasoning begins.
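To make the idea of a hierarchical breadcrumb embedding concrete, here is a minimal illustrative sketch (the helper and the sample clause are hypothetical, not the repository's actual API): each section is embedded together with its full heading path, so the resulting vector carries the section's position in the document hierarchy, not just its local text.

```python
def breadcrumb_text(path, body):
    """Prefix a section body with its full heading path so the embedding
    captures where the section sits in the document hierarchy."""
    return " > ".join(path) + "\n\n" + body

# A clause buried deep inside a long credit agreement (illustrative only)
path = ["ARTICLE VII Negative Covenants", "Section 7.01 Liens", "(b) Permitted Exceptions"]
body = "Liens securing obligations otherwise permitted under this Agreement ..."

print(breadcrumb_text(path, body))
# ARTICLE VII Negative Covenants > Section 7.01 Liens > (b) Permitted Exceptions
#
# Liens securing obligations otherwise permitted under this Agreement ...
```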
In this article, I am sharing the design and real-world results of a versatile document comparator capable of analyzing both highly complex financial Credit Agreements and academic research papers. As you will notice in the architecture described in the next section, the core comparison engine is separated from the upstream document processing and downstream report formatting and generation, enabling the system to be easily adapted to any new document domain (such as insurance policies, medical guidelines, or tax codes). All that is required is an upstream extraction pipeline to structure the input for hierarchical tree generation, and a downstream update to the LLM’s analytical persona and report formatter—leaving the core multi-stage retrieval and comparison pipeline entirely untouched.
I am also adding the full code to my existing open-source Proxy-Pointer GitHub repository, along with a 5-minute quickstart.
Document Comparator Architecture
Here is an overview of the logical architecture. The LLM used is gemini-3-flash along with gemini-embedding-001 (dimension: 1536) for vector embeddings.
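For reference, here is a minimal sketch of how breadcrumb-prefixed section text could be embedded with gemini-embedding-001 at 1,536 dimensions, assuming the google-genai Python SDK (the client setup and the helper name are illustrative assumptions, not the repository's exact config.py):

```python
from google import genai
from google.genai import types

client = genai.Client()  # expects the Gemini API key in the environment

def embed_sections(texts, dim=1536):
    """Embed a batch of breadcrumb-prefixed section texts."""
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=texts,
        config=types.EmbedContentConfig(output_dimensionality=dim),
    )
    return [e.values for e in result.embeddings]
```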
Architectural Tiers
Upstream Extraction Layer
Converts any incoming raw document structure into a standardized, machine-readable hierarchy.
Programs Involved
- extract_pdf_to_md.py: Handles upstream ingestion, converting PDFs into clean, hierarchically formatted Markdown.
- build_doc_index.py: Parses Markdown headers, filters administrative noise, and builds the hierarchical JSON structure map (_structure.json).
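As a rough illustration of what the structure map built by build_doc_index.py might look like, here is a sketch that parses Markdown headers into a nested section tree (the field names are assumptions; the real _structure.json may differ):

```python
import json
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)")

def build_structure(md_text):
    """Parse Markdown headers into a nested section tree.
    Body text is attached to the most recently opened section."""
    root = {"title": "ROOT", "level": 0, "body": "", "children": []}
    stack = [root]
    for line in md_text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            level, title = len(m.group(1)), m.group(2).strip()
            node = {"title": title, "level": level, "body": "", "children": []}
            while stack[-1]["level"] >= level:   # pop until we find the parent
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["body"] += line + "\n"
    return root

md = "# ARTICLE VII\n## Section 7.01 Liens\nNo Lien shall ...\n## Section 7.02 Dispositions\n..."
print(json.dumps(build_structure(md), indent=2))
```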
Core Comparison Engine
Coordinates semantic search over hierarchical document nodes.
Programs Involved
- criteria_validator.py: Dynamically detects the doc_type (e.g., Academic vs. Legal) and performs an initial feasibility check on the user's comparison criteria, to ascertain whether the criteria are relevant for the identified document type.
- section_selector.py: Implements Stage 1 Proxy-Pointer retrieval. It identifies and extracts the most relevant sections of Document 1 based on the user's criteria, using FAISS semantic search and an LLM re-ranker.
- cross_retriever.py: Implements Stage 2 Proxy-Pointer retrieval. It performs a targeted semantic search within Document 2's vector space using the context of the selected Document 1 sections (pairing the Doc 1 section content with the user's criteria as the query). The Proxy-Pointer pipeline is extremely accurate in identifying the correct semantically analogous sections for comparison. A simplified sketch of both retrieval stages follows this list.
- section_comparator.py: Coordinates pairwise evaluations of matching sections, passing them to the LLM to analyze alignments and discrepancies.
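Here is the simplified retrieval sketch referenced above. It is illustrative only: the index handling, stand-in embeddings, and function names are assumptions, and the LLM re-ranker between the stages is omitted. Stage 1 searches Document 1's FAISS index with the user's criteria; Stage 2 re-queries Document 2's index using each retained Document 1 section paired with the criteria.

```python
import numpy as np
import faiss

DIM = 1536

def build_index(vectors):
    """Inner-product FAISS index over (already normalized) section embeddings."""
    index = faiss.IndexFlatIP(DIM)
    index.add(np.asarray(vectors, dtype="float32"))
    return index

def top_k(index, query_vec, k=3):
    """Return (section_id, score) pairs for the k nearest sections."""
    scores, ids = index.search(np.asarray([query_vec], dtype="float32"), k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))

# Stand-in embeddings; in the real pipeline these come from gemini-embedding-001.
rng = np.random.default_rng(0)
doc1_vecs, doc2_vecs = rng.random((40, DIM)), rng.random((55, DIM))
doc1_index, doc2_index = build_index(doc1_vecs), build_index(doc2_vecs)

criteria = "collateral structure, security interests, and lien requirements"
criteria_vec = rng.random(DIM)                 # stand-in for embed(criteria)

# Stage 1: candidate Doc 1 sections for the criteria; an LLM re-ranker
# (not shown) then keeps only the genuinely relevant ones.
doc1_hits = top_k(doc1_index, criteria_vec, k=5)

# Stage 2: re-query Doc 2 with each retained Doc 1 section paired with
# the criteria, so matches are anchored to both clause text and intent.
for sec_id, _ in doc1_hits:
    query_vec = doc1_vecs[sec_id]              # stand-in for embed(section_text + criteria)
    doc2_hits = top_k(doc2_index, query_vec, k=3)
    print(f"Doc1 section {sec_id} -> Doc2 candidates {[i for i, _ in doc2_hits]}")
```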
Downstream Presentation Layer
Tailors the analytical output to the target audience and formats the final visualization.
Programs Involved
- build_comparison_prompt (in criteria_validator.py): Assigns the appropriate persona (e.g., Experienced Academic Researcher or Senior Legal Counsel) based on the detected doc_type. A rough sketch of this persona injection follows this list.
- report_builder.py: Renders the final comparison report side by side using professional CSS colors and highly readable layout formatting. The report can also be downloaded as a Markdown file.
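And here is the rough persona-injection sketch referenced above (the personas and template wording are illustrative assumptions, not the exact prompt used in criteria_validator.py):

```python
PERSONAS = {
    "legal": "You are a Senior Legal Counsel specializing in syndicated credit agreements.",
    "academic": "You are an Experienced Academic Researcher in this paper's field.",
}

def build_comparison_prompt(doc_type, criteria, doc1_section, doc2_section):
    """Assemble a pairwise comparison prompt for the detected doc_type."""
    return (
        f"{PERSONAS[doc_type]}\n\n"
        f"Comparison criteria: {criteria}\n\n"
        f"--- Document 1 section ---\n{doc1_section}\n\n"
        f"--- Document 2 section ---\n{doc2_section}\n\n"
        "Analyze alignments and discrepancies. Report the functional Role, "
        "a Discrepancy Rating, and the Risk Direction from Document 1's perspective."
    )

print(build_comparison_prompt(
    "legal",
    "events of default, lender remedies, and cure periods",
    "Section 8.01 Events of Default ...",
    "Section 9.1 Events of Default ...",
))
```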
Dataset Used
For the prototype, two publicly available Credit Agreements are used: Emerson (136 pages) and Texas Roadhouse (190 pages). These have been deliberately selected because they have different structures and belong to different industries. Emerson is a utility provider, and its agreement reads like a sovereign corporate treasury document based on credit agency ratings, while Texas Roadhouse's agreement is highly customized, built specifically around restaurant leases, multi-entity subsidiary structures, and dynamic leverage ratios.
In addition, I added a feature to compare research papers, for which I selected VectorFusion and VectorPainter, the papers used in my article on Multimodal Answers RAG. Both are papers in the highly specialized field of text-to-vector graphics generation. While they share an identical technical foundation, using differentiable rendering (such as DiffVG) to optimize Scalable Vector Graphics (SVG) paths via diffusion models, they differ significantly in their methodological execution. This narrow, shared-domain relationship makes for a difficult test case of the comparison engine's ability to look past surface-level similarities and instead evaluate subtle architectural variations, as we shall see in the next section.
Comparison of Credit Agreements
I ran several different queries with a diverse set of criteria; the detailed reports are fully included in the repository, and a snapshot is shared below. The Streamlit UI accepts two documents (either in .pdf or .md format) as input, with the comparison performed strictly from the perspective of Document 1. For example, if Document 1 is Emerson and Document 2 is Texas Roadhouse, the final comparison is framed around Emerson.
There are three steps to the process. First, the system selects all sections from the Emerson agreement that are relevant to the user's criteria. For each selected section, it finds up to three comparative sections in Texas Roadhouse, and then performs a side-by-side analysis. Along with the detailed analysis, the system provides a functional Role, a Discrepancy Rating, and a Risk Direction (or Methodological Tradeoff for academic papers).
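Put together, the per-pair output can be modeled roughly like the record below (a hypothetical schema for illustration; the repository's actual field names may differ):

```python
from dataclasses import dataclass

@dataclass
class SectionComparison:
    doc1_section: str           # breadcrumb path of the Document 1 section
    doc2_section: str           # breadcrumb path of the matched Document 2 section
    role: str                   # functional Role, e.g. "Negative pledge / lien covenant"
    discrepancy_rating: str     # e.g. "Low" / "Moderate" / "High"
    risk_direction: str         # or "Methodological Tradeoff" for academic papers
    analysis: str               # the LLM's side-by-side narrative

# One report is a list of SectionComparison records: every relevant Doc 1
# section paired with up to three Doc 2 counterparts.
```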
In the following four cases, Document 1 is Emerson, Document 2 is Texas Roadhouse.
Criteria 1: collateral structure, security interests, guarantees, and lien requirements


Criteria 2: events of default, lender remedies, acceleration rights, and cure periods


Criteria 3: financial covenants, leverage ratio requirements, and borrower compliance obligations


Criteria 4a: representations and warranties, material adverse effect clauses, and disclosure obligations


For edge case testing, here is the above “warranties” criteria with the documents switched. In the following, Document 1 is Texas Roadhouse and Document 2 is Emerson.
Criteria 4b: representations and warranties, material adverse effect clauses, and disclosure obligations


Analysis of Credit Agreement Comparison
What the above results show is that Proxy-Pointer is not just matching clauses by keywords or incomplete chunks; it is reading them through the persona of a legal analyst, someone who understands how credit works across these highly diverse industries: one an investment-grade utility, the other a midsize restaurant chain. For instance, it identifies the economic and legal consequences hidden beneath superficially similar language, such as structural subordination risk inside a negative pledge, enterprise-value preservation inside disposition covenants, or litigation exposure inside disclosure representations.
Another observation is that the analysis remained directionally consistent when the documents were flipped. It did not anchor itself to Emerson as Document 1, but instead re-evaluated the agreements from the Texas Roadhouse perspective. It correctly identified which agreement placed more restrictions on the borrower, which gave lenders greater control during defaults, which was more vulnerable to assets being moved out of reach, and which required the company to disclose more information. None of these are explicitly written in either agreement. They become evident to a legal analyst only when multiple clauses, exceptions, thresholds, and definitions are read together. The result feels less like a simple clause comparison and more like an understanding of how risk and control are shared between the borrower and the lender.
Research Paper Comparison
For the VectorFusion and VectorPainter papers, I ran the comparison using the following criteria: “Compare how each paper approaches style control and primitive initialization in vector graphics synthesis. Specifically, analyze how VectorFusion uses path reinitialization and raster sample initialization versus how VectorPainter extracts and rearranges vectorized strokes from a reference image using stroke imitation learning and style-preserving losses.”
Here is one comparison:


The analysis shows a deep, domain-intensive comparison that a researcher could use to understand both papers without reading them in their entirety. Proxy-Pointer moves beyond surface-level architecture matching and identifies the deeper design philosophy behind both papers. It correctly recognizes that VectorFusion treats SVG generation as a dynamic optimization problem with continuous path reinitialization, while VectorPainter approaches it as a style-guided synthesis problem focused on artistic consistency and learned stroke history. What was also quite interesting was that it could connect ideas spread across completely different sections of the papers and weigh the underlying limitations against each other. This demonstrates a fine-grained analysis of two systems that share the same narrow domain but work quite differently.
Open-Source Repository
Proxy-Pointer is fully open-source (MIT License) and can be accessed at the Proxy-Pointer GitHub repository. The Document Comparator is being added to the repo alongside the existing Text-Only and Multimodal Answering bots.
A 5-minute quickstart will enable you to test quickly with available data.
DocComparator/
├── src/
│ ├── comparison/
│ │ ├── cross_retriever.py # Stage 2 PP Retrieval (Doc 2)
│ │ ├── section_comparator.py # Pairwise LLM evaluation engine
│ │ └── section_selector.py # Stage 1 PP Retrieval (Doc 1)
│ ├── extraction/
│ │ └── extract_pdf_to_md.py # LlamaParse PDF ingestion & formatting
│ ├── indexing/
│ │ └── build_doc_index.py # Skeleton tree & FAISS vector builder
│ ├── report/
│ │ └── report_builder.py # Markdown report generation logic
│ ├── validation/
│ │ └── criteria_validator.py # Persona injection & criteria feasibility
│ └── config.py # Core configurations and model definitions
├── data/ # Unified Data Hub
│ └── uploads/ # Raw PDFs and test documents
├── results/ # Artifact reports for the test cases tried
└── app.py # Streamlit Comparator UI
Conclusion
Document comparison using a Chunk-Embed-Match approach is not likely to give good results. In a complex enterprise document such as contract terms and conditions, semantic meaning is encapsulated within sections and subsections of dense text. Each of these sections can run to pages in length and sit inside a very long document. For effective comparison and analysis, sections, definitions, exceptions, and structural relationships need to be retrieved together, because they only make sense when read together.
Proxy-Pointer, with its accurate two-step retrieval pipeline, is ideal for this task. As the results above show, even with a budget LLM such as gemini-flash, one can compare agreements or research papers while preserving the underlying intent and trade-offs hidden across structurally disparate sections.
The 3-tier architecture of the Document Comparator can scale to other domains with no change to the comparison engine itself. This enables structure-aware retrieval to generalize better than a custom-built tool that works only for a specific type of document. Organizations can adapt this to their specific industries and use cases, with minimal incremental engineering effort.
Clone the repo. Try your own documents. Let me know your thoughts.
Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI
All research papers used in this article are available at VectorFusion and VectorPainter with CC-BY license. The credit agreements are publicly available at SEC.gov. Code and benchmark results are open-source under the MIT License. Images used in this article are generated using Google Gemini.