In my article on Solving Entity and Relationship Sprawl in Knowledge Graphs, I discussed how the Proxy-Pointer architecture can optimize the search for the right entities and relations. That, however, is only the second part of a larger problem in graph ingestion. The bigger, and far more expensive, step is identifying those entities (NER) and relations in the first place.
Knowledge Graphs are built to answer complex aggregation and multi-hop queries across entities and relationships over similar documents — vendor contracts, compliance manuals, credit agreements, global terms and conditions, etc. These documents are routinely over 100 pages long with dense text exceeding 500k characters. Enterprises frequently ingest thousands of similar contracts from the same suppliers and customers.
To do that, each of these documents is passed through a powerful LLM for NER and relations extraction, burning millions of tokens even before the actual graph ingestion can happen. The process has to be repeated sometimes, since long-context extraction often suffers from reduced recall consistency and increased extraction variance.
However, the crucial fact is that legal documents such as contracts have a very similar structure across organizations, and even across industries. They are also packed with dense boilerplate text, schedules, exhibits, etc., most of which are of little value for NER yet still have to be seen by an LLM.
But what if we could exploit this structural predictability? What if we could predict the value of a section before we ever send it to the LLM, drastically cutting ingestion costs by strategically ignoring the noise?
In this article, we will explore a novel approach to minimizing the content seen by the LLM. By leveraging the structural concepts of Proxy-Pointer RAG and introducing a predictive metric called Graphability Indexing, we can selectively bypass low-yield sections of dense documents. I am illustrating this using three massive, real-world corporate Credit Agreements—Emerson, AT&T, and Texas Roadhouse — to demonstrate how this methodology can slash extraction costs, as compared against full-document extraction pipelines, without sacrificing the integrity of the resulting Knowledge Graph.
Quick Recap: What is Proxy-Pointer?
Proxy-Pointer is a structure-aware RAG technique that delivers surgical precision over complex documents such as annual reports, credit agreements, etc. at the cost of standard Vector RAG. Standard vector RAG splits documents into blind chunks, embeds them, and retrieves the top-K by cosine similarity. Even with overlap and semantic chunking, this is not a reliable method for relationship extraction in enterprise KGs, because chunks fragment the context of a document and make extraction prone to hallucination.
Instead, Proxy-Pointer treats a document as a tree of self-contained semantic blocks (sections). Context is encapsulated within each section, which makes sections good candidates for relations extraction. Also, an LLM is much more likely to accurately identify the entities and relations from a section in a single pass than from a full 100-page document, making repeated scans unnecessary.
Technically, Proxy-Pointer leverages five zero-cost engineering techniques for RAG — a skeleton structure tree of the document, breadcrumb injection, structure-guided chunking, noise filtering, and pointer-based context. We will be leveraging some of these concepts along with a few new ones here. You can refer to the article here for more on Proxy-Pointer.
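To make the tree idea concrete, here is a minimal sketch of a skeleton tree with breadcrumbs and line-number pointers. The `SectionNode` class, the `breadcrumbs` and `section_text` helpers, and the toy document are all hypothetical names and data introduced for illustration; they are not the Proxy-Pointer package's API.

```python
from dataclasses import dataclass, field


@dataclass
class SectionNode:
    """One node of a hypothetical skeleton tree: a titled section and its line span."""
    node_id: str
    title: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    children: list["SectionNode"] = field(default_factory=list)


def breadcrumbs(root: SectionNode) -> dict[str, str]:
    """Map each node_id to its ancestor path, joined with ' > '."""
    paths: dict[str, str] = {}

    def walk(node: SectionNode, trail: list[str]) -> None:
        trail = trail + [node.title]
        paths[node.node_id] = " > ".join(trail)
        for child in node.children:
            walk(child, trail)

    walk(root, [])
    return paths


def section_text(lines: list[str], node: SectionNode) -> str:
    """Pointer-based access: slice a section out of the document by line numbers."""
    return "\n".join(lines[node.start_line - 1 : node.end_line])


# Toy document and tree (made-up content, for illustration only).
doc_lines = [
    "CREDIT AGREEMENT",
    "ARTICLE I DEFINITIONS",
    "Section 1.01 Definitions",
    '"Agent" means the administrative agent.',
    "Section 1.02 Accounting Terms",
    "Accounting terms follow GAAP.",
]
definitions = SectionNode("0002", "Section 1.01 Definitions", 3, 4)
accounting = SectionNode("0003", "Section 1.02 Accounting Terms", 5, 6)
article_1 = SectionNode("0001", "ARTICLE I DEFINITIONS", 2, 6, [definitions, accounting])
root = SectionNode("0000", "CREDIT AGREEMENT", 1, 6, [article_1])

crumbs = breadcrumbs(root)
print(crumbs["0003"])
print(section_text(doc_lines, definitions))
```

The point of the design is that only titles and line spans are held in the tree; the section body is fetched on demand, so deciding what to read never requires reading the text itself.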
Existing methods for NER optimization
Before we look at the Proxy-Pointer approach, let's look at some of the existing optimization approaches adopted by organizations.
- Traditional NLP / Pre-Trained Models (e.g., spaCy): A common first approach is to use lightweight, traditional NLP pipelines like spaCy along with an LLM in a funnel approach. These models are extremely fast and cheap, pre-trained to recognize standard entities (Persons, Organizations, Locations, Dates), and are used to scan a document for entity hotspot regions. The hotspots are then scanned by an LLM in a focused manner. However, entity density does not necessarily correlate with relation density. For instance, administrative boilerplate like ‘Notices’ or trailing ‘Exhibits’ might be packed with standard entities (names, addresses, dates) without containing any structural legal relationships. These models also struggle with bespoke corporate entities (like Adjusted Term SOFR or Swing Line Loans) and are not suitable for extracting the complex, nested relationships required for a highly constrained legal Knowledge Graph. In addition, continual fine-tuning of these models to achieve the necessary accuracy requires a lot of manual annotation effort and compute cost.
- LLM Pre-Scanning (Smaller Router Models): Another approach is to use a smaller, cheaper LLM to quickly pre-scan chunks and decide if they contain valuable relationships, before sending only the high-value chunks to a large reasoning model for deep extraction. While cheaper per token, this still forces a model to read every word of a 500k-character document, and large parts of the document end up being scanned twice.
Proxy-Pointer Approach
As mentioned earlier, Proxy-Pointer leverages the following properties of knowledge graphs:
- Graphs are built for a domain or functional area, and therefore store similar document content. A procurement graph will ingest multiple supplier contracts (and often many contracts from the same supplier), while a finance graph will have many lender and credit documents, compliance documents, etc.
- The documents share a similar baseline structure: sections, schedules, exhibits, etc. Only a fraction of the content is needed for meaningful entity and relation extraction. The challenge is to identify that content.
We use this predictability for the following steps:
- Build and deploy a baseline Graphability index: Start with a baseline index for a document type (e.g. Credit Agreements). Sections are classified into very high, high, medium, low and very low graphability. The graphability rating is driven by Relational Density, meaning the volume of actionable business connections (edges) relative to the size of the section, rather than by raw entity counts (nodes). This avoids entity-dense but generic sections like Notices or Exhibits being classified as high. Based on this methodology, "Payment of Obligations" is classified as very high graphability, whereas "Duties of Agent" or "Governing Law" are classified as low-yield sections. However, there is an important exception. While most sections are evaluated on relational density, ontological foundations like ‘Subsidiaries’ are anchored as ‘Very High’ because their few edges define the critical corporate hierarchy that the rest of the contract’s rules inherit. This preserves the index’s value as a business heatmap rather than a purely technical one based on entity or relations density.
- Structure tree creation: We create a structure tree of a document which lists the hierarchy of sections as nodes, along with each section title.
- Enrich and Adjust: We walk the tree, not the text. We use the first few documents to refine and harden the index. Extract each section content based on line numbers. Use the section title to find the predicted yield index. Next, the LLM scans all the sections of the document and based on the extracted relations and entities, makes an actual assessment of the yield index for every section. Where the predicted and actual ratings do not match, these are flagged for human review (e.g., actual classification says “Low” but the predicted rating from the index is “Medium”). Based on human SME input, the classifications in the index are adjusted.
- Route and Bypass: Following the above process, we would be able to derive an enriched graphability index after a few documents. From then on, high-yield sections (Very High, High, Medium) are sent to the LLM for deep NER extraction. Low and Very Low sections are safely bypassed.
- New Sections: Every document will have a few sections not found in the index, which will be flagged as Coverage Gaps. These are mandatorily scanned for NER to avoid missing relevant relations. Upon human review, the ones deemed generic and frequently occurring can be added to the index, while bespoke ones such as "Benchmark Replacement Setting" can be ignored.
- Achieve stabilization: After just a few iterations, we expect prediction mismatches to drop to near zero, and the volume of “New Sections” to stabilize at no more than 20-25% (representing highly bespoke or administrative clauses), allowing the system to confidently process massive document corpora with the right balance of rigor and efficiency.
The graphability index should be maintained for each document type and could possibly even be specific to individual large suppliers and partners from whom we may be ingesting hundreds of similar documents in a year.
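The "Route and Bypass" and "New Sections" steps above reduce to a small decision rule. The sketch below is one way to express it; the `Route` enum, the tier strings, and `route_section` are illustrative names assumed here, not part of any published API.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    EXTRACT = "send to the LLM for deep extraction"
    BYPASS = "skip extraction"
    EXTRACT_AND_REVIEW = "coverage gap: extract, then queue for human review"


# Tiers sent to the LLM versus bypassed, as described in the steps above.
HIGH_YIELD = {"very_high", "high", "medium"}
LOW_YIELD = {"low", "very_low"}


def route_section(predicted_tier: Optional[str]) -> Route:
    """Decide what to do with a section; None means its title is not in the index."""
    if predicted_tier is None:
        return Route.EXTRACT_AND_REVIEW
    if predicted_tier in HIGH_YIELD:
        return Route.EXTRACT
    if predicted_tier in LOW_YIELD:
        return Route.BYPASS
    raise ValueError(f"unknown tier: {predicted_tier!r}")


# Example predictions; "Ratable Advances" stands in for a title missing from the index.
predictions = {
    "Payment of Obligations": "very_high",
    "Governing Law": "very_low",
    "Ratable Advances": None,
}
decisions = {title: route_section(tier) for title, tier in predictions.items()}
for title, decision in decisions.items():
    print(f"{title}: {decision.name}")
```

Treating an unknown title as "extract and review" rather than "bypass" is the conservative choice: a gap costs tokens once, whereas a wrongly skipped section silently loses relations.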
Let's see this in action with an experiment.
The Experimental Setup
To validate this hypothesis, I set up an experiment using three massive, publicly available corporate Credit Agreements that I have previously used in my article on efficient Contract Comparison using Proxy-Pointer. As you can see, they are all from different companies (and industries), so the documents do not share an identical structure and format.
- Emerson Electric Co. (~228,000 characters)
- AT&T Inc. (~214,000 characters)
- Texas Roadhouse, Inc. (TRoadhouse) (~434,000 characters)
Baseline Graphability Index
Our goal is to build and iteratively validate a predictive Graphability Index. We start with a foundational baseline index mapping common credit agreement sections to their expected relational density:
{
  "document_type": "credit_agreement",
  "very_high_graphability": [
    "Litigation",
    "Environmental Matters",
    "Subsidiaries",
    "Payment of Obligations",
    "Maintenance of Property",
    "Mergers and Sales of Assets",
    "Commitment Schedule",
    "Sanctions and Anti-Corruption",
    "Designation of Subsidiary Borrowers",
    "Definitions",
    "Events of Default",
    "Successors and Assigns"
  ],
  "high_graphability": [
    "Company Guarantee",
    "The Facility",
    "Facility Letters of Credit",
    "Corporate Existence and Power",
    "Corporate Authorization",
    "Financial Information",
    "Compliance with Laws",
    "Use of Proceeds",
    "Arranger and Syndication Agent",
    "Eurocurrency Payment Offices",
    "Defaulting Lenders"
  ],
  "medium_graphability": [
    "Swing Line Loans",
    "Competitive Bid Advances",
    "Credit Extensions",
    "Designation of a Subsidiary Borrower",
    "Successor Agent",
    "Funding Indemnification",
    "Acceleration and Collateral Accounts",
    "Collateral"
  ],
  "low_graphability": [
    "Accounting Terms",
    "Interest Rate Changes",
    "Method of Payment",
    "Telephonic Notices",
    "Market Disruption",
    "Judgment Currency",
    "Change in Circumstances",
    "Confidentiality"
  ],
  "very_low_graphability": [
    "No Waivers",
    "Counterparts and Integration",
    "Governing Law",
    "Waiver of Jury Trial",
    "No Fiduciary Duty",
    "Service of Process",
    "Miscellaneous",
    "Electronic Communications",
    "Exhibit",
    "Table of Contents"
  ]
}
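As a rough illustration of how such an index might be consulted, the sketch below inverts a trimmed copy of it into a title-to-tier lookup and matches section headers against it. The matching rule (case-insensitive substring, longest title wins) is an assumption made for this example; the article does not specify how headers are matched. `build_lookup` and `predict_tier` are hypothetical helper names.

```python
import json

# A trimmed copy of the baseline index, kept short for the example.
INDEX_JSON = """
{
  "document_type": "credit_agreement",
  "very_high_graphability": ["Definitions", "Payment of Obligations", "Subsidiaries"],
  "high_graphability": ["The Facility", "Use of Proceeds"],
  "medium_graphability": ["Swing Line Loans", "Collateral"],
  "low_graphability": ["Accounting Terms", "Method of Payment"],
  "very_low_graphability": ["Governing Law", "Exhibit", "Table of Contents"]
}
"""

SUFFIX = "_graphability"


def build_lookup(index: dict) -> dict[str, str]:
    """Invert the tier -> titles mapping into lowercase title -> tier."""
    lookup: dict[str, str] = {}
    for key, titles in index.items():
        if key.endswith(SUFFIX):  # skips metadata keys such as document_type
            tier = key[: -len(SUFFIX)]
            for title in titles:
                lookup[title.lower()] = tier
    return lookup


def predict_tier(header: str, lookup: dict[str, str]):
    """Return (tier, matched_title), or (None, None) when the header is a coverage gap.

    Assumed matching rule: case-insensitive substring; the longest index title wins.
    """
    header_lc = header.lower()
    hits = [title for title in lookup if title in header_lc]
    if not hits:
        return None, None
    best = max(hits, key=len)
    return lookup[best], best


lookup = build_lookup(json.loads(INDEX_JSON))
for header in [
    "Section 1.01 Definitions",
    "Section 1.02 Accounting Terms and Determinations",
    "Section 2.02 Ratable Advances",
]:
    print(header, "->", predict_tier(header, lookup))
```

Plain substring matching is deliberately naive. A production matcher would likely need normalization for plurals and abbreviations, which is left out here.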
We execute this in three phases. First, we run the Emerson agreement to calculate the initial savings. Any generic uncovered sections (deltas) discovered in Emerson are baked back into the index. We then run the enriched index against AT&T, add any final edge cases to the index if required, and use the refined index against the massive TRoadhouse agreement to measure the ultimate reduction. The goal is that by the time we scan the TRoadhouse agreement, we should see significantly fewer mismatches than in the previous two as the index stabilizes.
Evaluation Criteria
For each section, we compare the graphability predicted by the index with the actual rating assessed by the LLM based on the relations and entities found. In our report, we categorize the results into three buckets:
- Perfect Alignment: The index accurately predicted the section’s graphability rating.
- Minor Deviations: The index predicted a yield (e.g., Medium) that slightly differed from the actual assessment (e.g., Low).
- Coverage Gaps / New Sections: The section was unique to the document and did not yet exist in our predictive index.
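The three buckets can be expressed as a small comparison function. This is a sketch under assumptions: the tier strings, `bucket`, and `tier_gap` are names invented here. The first five sample rows are taken from the Emerson section-wise table in Phase 1, and the last row is hypothetical, added only to show a deviation.

```python
from collections import Counter
from typing import Optional

# Tiers in ascending order, so the distance between two ratings can be measured.
TIERS = ["very_low", "low", "medium", "high", "very_high"]


def tier_gap(predicted: str, actual: str) -> int:
    """Number of tier steps between the predicted and the actual rating."""
    return abs(TIERS.index(predicted) - TIERS.index(actual))


def bucket(predicted: Optional[str], actual: str) -> str:
    """Assign one section to a reporting bucket."""
    if predicted is None:  # title not present in the index
        return "coverage_gap"
    if predicted == actual:
        return "perfect_alignment"
    return "deviation"  # flagged for human (SME) review


# (section, predicted tier from the index, actual tier from the LLM assessment)
rows = [
    ("Definitions", "very_high", "very_high"),
    ("Accounting Terms and Determinations", "low", "low"),
    ("Types of Advances", None, "low"),
    ("The Facility", "high", "high"),
    ("Ratable Advances", None, "very_high"),
    ("Hypothetical clause", "medium", "low"),  # made-up row, not from the documents
]

summary = Counter(bucket(pred, act) for _, pred, act in rows)
review_queue = [
    (name, tier_gap(pred, act))
    for name, pred, act in rows
    if bucket(pred, act) == "deviation"
]
print(dict(summary))
print(review_queue)
```

Recording the size of the gap alongside the deviation lets a reviewer look at two-tier misses before one-tier ones.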
Results & Iterative Enrichment
Let's begin with Phase 1: Emerson.
Phase 1: Emerson Credit Agreement (Testing the Baseline)
We ran the 95 sections of this agreement against our baseline index. In this initial run, 66 out of 95 sections (69.5%) matched perfectly. The index accurately mapped standard provisions, such as “Mergers and Sales of Assets,” as highly graphable, while correctly identifying “Accounting Terms” and standard boilerplate Exhibits as low-yield. There were no mismatches between actual and predicted ratings from the index.
However, 29 sections (~30%) were not found in the index and were therefore flagged as New Sections, i.e. Coverage Gaps. Upon review, many were highly bespoke administrative clauses (e.g., “Ratable advances”, “Notification of advances”) and were therefore correctly left as gaps, but several generic sections (like “Types of Advances”, “Compliance with ERISA”, and “Interest Payment Dates; Interest and Fee Basis”) should be added to the index. Based on their assessed actual yield, I added these specific clauses to the “Medium” and “Low” tiers of the graphability index, enriching the baseline for the next phase.
The most important outcome is that even with this raw baseline index, 36,880 characters of text in “Low” and “Very Low” yield sections were correctly predicted as noise. Not routing those sections to the LLM would have reduced the LLM processing payload by 16.10%.
The match quality and yield prediction efficiency are summarized as follows:
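The enrichment step can be sketched as a pure function that returns an updated copy of the index. `enrich_index` is a hypothetical helper, and the miniature baseline is trimmed for brevity. The only approved entry shown is "Types of Advances", placed in the Low tier because the section-wise table lists its Actual Rating as Low; the tiers chosen for the other added clauses are not stated here, so they are left out rather than guessed.

```python
import copy


def enrich_index(index: dict, approved: dict[str, str]) -> dict:
    """Return a copy of the index with SME-approved titles added to their tiers.

    `approved` maps a section title to an existing tier key in the index.
    Titles already present in any tier (case-insensitive) are left alone.
    """
    updated = copy.deepcopy(index)
    existing = {
        title.lower()
        for titles in updated.values()
        if isinstance(titles, list)
        for title in titles
    }
    for title, tier_key in approved.items():
        if not isinstance(updated.get(tier_key), list):
            raise KeyError(f"unknown tier key: {tier_key!r}")
        if title.lower() not in existing:
            updated[tier_key].append(title)
            existing.add(title.lower())
    return updated


# Miniature baseline (trimmed from the full index for the example).
baseline = {
    "document_type": "credit_agreement",
    "medium_graphability": ["Swing Line Loans"],
    "low_graphability": ["Accounting Terms"],
}
approved = {"Types of Advances": "low_graphability"}

enriched = enrich_index(baseline, approved)
print(enriched["low_graphability"])
```

Returning a copy instead of mutating in place keeps each phase's index reproducible, which matters when results from different phases are compared.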
| Matched Ratings | Number of Sections | Total Characters | % of Total Document |
|---|---|---|---|
| Very High | 13 | 61,360 | 26.79% |
| High | 13 | 83,040 | 36.26% |
| Medium | 17 | 27,840 | 12.16% |
| Low | 15 | 12,800 | 5.59% |
| Very Low | 8 | 24,080 | 10.51% |
| Mismatched Rating | 0 | 0 | 0.00% |
| New Section | 29 | 19,920 | 8.70% |
| TOTAL | 95 | 229,040 | 100.00% |
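The payload reduction follows directly from the table: it is the share of characters in the matched Low and Very Low rows. A short sketch of that arithmetic, using the Emerson character counts above (the dictionary keys and the `payload_reduction` helper are illustrative names):

```python
# Character counts per row of the Emerson summary table.
emerson = {
    "very_high": 61_360,
    "high": 83_040,
    "medium": 27_840,
    "low": 12_800,
    "very_low": 24_080,
    "mismatched": 0,
    "new_section": 19_920,
}

# Only sections the index predicted as Low or Very Low are bypassed;
# new sections are still scanned, so they do not count as savings.
BYPASSED = ("low", "very_low")


def payload_reduction(chars_by_row: dict[str, int]) -> tuple[int, int, float]:
    """Return (bypassed characters, total characters, bypassed share)."""
    total = sum(chars_by_row.values())
    skipped = sum(chars_by_row[row] for row in BYPASSED)
    return skipped, total, skipped / total


skipped, total, share = payload_reduction(emerson)
print(f"{skipped:,} of {total:,} characters bypassed ({share:.2%})")
# prints: 36,880 of 229,040 characters bypassed (16.10%)
```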
Following are a few rows from the base table of section-wise comparison:
| Node ID | Section Header | Approx. Chars | Entities (Est.) | Relations (Est.) | Actual Rating | Predicted Rating (Index Match) | Match Quality |
|---|---|---|---|---|---|---|---|
| 0002 | Section 1.01 Definitions | 44,400 | 252 | 402 | Very High | Very High (Definitions) | 🟢 |
| 0003 | Section 1.02 Accounting Terms and Determinations | 320 | 4 | 4 | Low | Low (Accounting Terms) | 🟢 |
| 0004 | Section 1.03 Types of Advances | 800 | 19 | 2 | Low | New Section | ⚪ |
| 0006 | Section 2.01 The Facility | 2,320 | 27 | 21 | High | High (The Facility) | 🟢 |
| 0007 | Section 2.02 Ratable Advances | 3,840 | 56 | 19 | Very High | New Section | ⚪ |
Finally, here are a few extraction examples:
- **Company Guarantee (Very High)**:
  - *Entities*: Guarantor, Agent, Obligations
  - *Relations*: [Guarantor]-(guarantees)->[Obligations], [Guarantor]-(indemnifies)->[Agent]
- **Mergers and Sales of Assets (Very High)**:
  - *Entities*: Borrower, Assets, Buyer
  - *Relations*: [Borrower]-(sells)->[Assets], [Borrower]-(merges_with)->[Buyer]
- **Ratable Advances (Very High)**:
  - *Entities*: Advance, Lender, Borrower
  - *Relations*: [Lender]-(makes)->[Advance], [Borrower]-(receives)->[Advance]
- **Method of Payment (Low)**:
  - *Entities*: Agent, Accounts, Funds
  - *Relations*: None (purely administrative procedural instructions with minimal active relational edges)
Phase 2: AT&T Credit Agreement (Refinement)
Next, we deployed the enriched index against the AT&T Credit Agreement. The document contained 77 sections spanning roughly 214,000 characters.
The results showed improvement. 55 out of 77 sections (71.4%) achieved Perfect Alignment, which is nearly identical to Emerson’s rate. In addition, there were 4 mismatched sections, where the actual and predicted graphability ratings did not agree. This is only about 5% of sections, so I did not adjust the index for them, to avoid overfitting to individual documents. Only 18 sections (23.4%) resulted in Coverage Gaps, an improvement over Emerson’s ~30%. All were adjudged to be bespoke or procedural noise from a KG point of view: computation of time periods, extension of termination date, subordination, etc. These are low or very low yield sections from an NER perspective and should be added to the index so that the LLM does not scan them in a new document. However, to check the robustness of the experiment, I did not add them, in order to see how the existing index performs against the TRoadhouse document.
The potential LLM savings more than doubled relative to Emerson. Because the index confidently identified large regions of the document as low-yield (e.g., interest rate determination and increased costs, besides the Table of Contents and trailing Exhibits), the system flagged 72,763 characters as not worth scanning. By following this index in production, a 33.94% reduction in processing load could be achieved, while still extracting every high-value relational edge in the document.
The match quality and yield prediction efficiency are summarized as follows:
| Matched Ratings | Number of Sections | Total Characters | % of Total Document |
|---|---|---|---|
| Very High | 5 | 53,520 | 24.96% |
| High | 9 | 41,840 | 19.51% |
| Medium | 15 | 20,000 | 9.33% |
| Low | 12 | 10,960 | 5.11% |
| Very Low | 14 | 61,803 | 28.83% |
| Mismatched Rating | 4 | 4,880 | 2.28% |
| New Section | 18 | 21,397 | 9.98% |
| TOTAL | 77 | 214,400 | 100.00% |
A few of the rows from the section rating analysis table are as follows:
| Node ID | Section Header | Approx. Chars | Entities (Est.) | Relations (Est.) | Actual Rating | Predicted Rating (Index Match) | Match Quality |
|---|---|---|---|---|---|---|---|
| 0017 | SECTION 2.12. Payments and Computations | 1,520 | 21 | 5 | Low | Low (Payments and Computations) | 🟢 |
| 0018 | SECTION 2.13. Taxes | 3,360 | 14 | 10 | Medium | Medium (Taxes) | 🟢 |
| 0019 | SECTION 2.14. Sharing of Payments, Etc. | 800 | 8 | 6 | Low | Low (Sharing of Payments) | 🟢 |
| 0020 | SECTION 2.15. Evidence of Debt | 640 | 10 | 2 | Low | Low (Evidence of Debt) | 🟢 |
| 0021 | SECTION 2.16. Use of Proceeds | 320 | 8 | 4 | High | High (Use of Proceeds) | 🟢 |
| 0022 | SECTION 2.17. Increase in the Aggregate Commitments | 2,800 | 22 | 9 | Medium | New Section | ⚪ |
| 0023 | SECTION 2.18. Extension of Termination Date | 3,120 | 20 | 25 | Medium | New Section | ⚪ |
| 0024 | SECTION 2.20. Replacement of Lenders | 1,920 | 19 | 12 | Medium | Medium (Replacement of Lenders) | 🟢 |
| 0025 | SECTION 2.21. Benchmark Replacement Setting | 12,560 | 61 | 31 | High | High (Benchmark Replacement Setting) | 🟢 |
And here are a few extraction examples:
- **Certain Defined Terms (Very High)**:
  - *Entities*: Base Rate, Margin, SOFR
  - *Relations*: IS_A, PART_OF, CONTROLS, ROLE_OF, REFERENCES (Definitions form the ontology backbone, creating canonical entity normalization and robust semantic inheritance)
- **Conditions Precedent (Medium)**:
  - *Entities*: Closing Date, Certificates, Approvals
  - *Relations*: [Lender]-(requires)->[Certificates], [Agent]-(receives)->[Approvals]
- **Accounting Terms; Interpretive Provisions (Low)**:
  - *Entities*: GAAP, Accounting Principles
  - *Relations*: None (purely administrative and interpretive provisions with minimal active relational edges)
Phase 3: TRoadhouse Credit Agreement (The Final Test)
Although we used just the first document to enrich the graphability index, let’s test the TRoadhouse credit agreement and see the outcome. Before we do that, it is pertinent to consider a few differences, not just between the documents, but also in domain and industry. Emerson and AT&T are very large, blue-chip industrial and telecom companies, whereas Texas Roadhouse is a midsize restaurant chain. The agreements of Emerson and AT&T read like sovereign corporate treasury documents based on credit agency ratings, while Texas Roadhouse’s agreement is highly customized, built specifically around restaurant leases. In terms of size, at 434,000 characters, this document is almost the size of the previous two combined, with over 100 sections in the structure tree. In other words, if the graphability index performs well here, it strongly supports the premise that document structure can be an accurate predictor of entity and relations yield.
And here are the results. The index performed exceptionally well. 81 out of 102 sections (79.4%) matched the index perfectly. There were no sections where the actual rating did not match the predicted one. The index correctly categorized crucial sections like “Letters of Credit” and standard “Affirmative/Negative Covenants” as high yield, which should trigger full extraction. The remaining 21 sections (20.6%), classified as Coverage Gaps, were a mix of low-yield administrative clauses (e.g., Rounding, Erroneous Payments) and procedural noise (e.g., Divisions, Commitments, etc.).
However, the true impact was in the payload efficiency. Besides the Exhibits, several low-yield sections were identified, such as accounting terms, rounding, administrative agent, and miscellaneous. The Schedules were analyzed based on their individual value. While a few schedules such as Liens and Investments matched the index rating of High, others such as Existing LCs were classified as gaps.
Together, the matched Low and Very Low sections account for 164,520 of 434,640 characters, a net saving of about 38% if the predictions are followed and those sections are bypassed entirely. This affirms the viability of the approach.
Here is the yield processing efficiency table:
| Matched Ratings | Number of Sections | Total Characters | % of Total Document |
|---|---|---|---|
| Very High | 11 | 128,840 | 29.64% |
| High | 12 | 30,320 | 6.98% |
| Medium | 20 | 25,000 | 5.75% |
| Low | 17 | 9,520 | 2.19% |
| Very Low | 21 | 155,000 | 35.66% |
| Mismatched Rating | 0 | 0 | 0.00% |
| New Section | 21 | 85,960 | 19.78% |
| TOTAL | 102 | 434,640 | 100.00% |
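To see the stabilization trend across all three phases side by side, the section counts from the three summary tables can be converted into rates. This is a small sketch of that calculation; the dictionary layout and the `rates` helper are illustrative choices.

```python
# Section counts per phase, copied from the three summary tables.
phases = {
    "Emerson": {"matched": 66, "mismatched": 0, "new": 29},
    "AT&T": {"matched": 55, "mismatched": 4, "new": 18},
    "Texas Roadhouse": {"matched": 81, "mismatched": 0, "new": 21},
}


def rates(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw section counts into shares of all sections in the document."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


trend = {doc: rates(counts) for doc, counts in phases.items()}
for doc, r in trend.items():
    print(
        f"{doc:16s} aligned={r['matched']:.1%}  "
        f"mismatched={r['mismatched']:.1%}  gaps={r['new']:.1%}"
    )
```

Tracking these three rates per ingested document gives a simple signal for when the index has stopped changing and human review can be scaled back.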
A few examples of section ratings are as follows:
| Node ID | Section Header | Approx. Chars | Entities (Est.) | Relations (Est.) | Actual Rating | Predicted Rating (Index Match) | Match Quality |
|---|---|---|---|---|---|---|---|
| 0104 | 7.14 Financial Covenants | 720 | 12 | 1 | Very High | Very High (Financial Covenant) | 🟢 |
| 0105 | 8.01 Events of Default | 3,200 | 30 | 21 | Medium | Medium (Events of Default) | 🟢 |
| 0108 | Article 9: ADMINISTRATIVE AGENT (Aggregated) | 4,880 | 2 | 0 | Low | Low (Duties of Agent) | 🟢 |
| 0119 | Article 10: MISCELLANEOUS (Aggregated) | 18,000 | 2 | 0 | Very Low | Very Low (Miscellaneous) | 🟢 |
| 0144 | Schedule 2.01A Commitments | 4,000 | 2 | 0 | Very High | Very High (Commitment Schedule) | 🟢 |
| 0145 | Schedule 2.01B L/C Commitments | 2,000 | 2 | 0 | Very Low | New Section | ⚪ |
| 0146 | Schedule 2.03 Existing L/Cs | 3,000 | 3 | 0 | Very Low | New Section | ⚪ |
| 0147 | Schedule 5.01 Jurisdictions | 6,000 | 2 | 0 | Very Low | New Section | ⚪ |
| 0159 | Schedule 5.06 Litigation | 5,000 | 2 | 5 | Very High | Very High (Litigation) | 🟢 |
| 0161 | Schedule 5.09 Environmental | 8,000 | 2 | 5 | Very High | Very High (Environmental Matters) | 🟢 |
| 0163 | Schedule 5.13 Subsidiaries | 40,000 | 2 | 5 | Very High | Very High (Subsidiaries) | 🟢 |
And finally, a few examples of extraction:
- **Financial Covenants (Very High)**:
  - *Entities*: Borrower, Leverage Ratio, Fixed Charge Coverage Ratio
  - *Relations*: [Borrower]-(maintains)->[Leverage Ratio]
- **Investments & Liens (High)**:
  - *Entities*: Borrower, Lien, Property, Permitted Investments
  - *Relations*: [Borrower]-(grants)->[Lien], [Borrower]-(makes)->[Permitted Investments]
- **Defined Terms (Very High)**:
  - *Entities*: Adjusted Term SOFR, Base Rate, Defaulting Lender
  - *Relations*: IS_A, PART_OF, CONTROLS, ROLE_OF, REFERENCES (Definitions form the ontology backbone, creating canonical entity normalization and robust semantic inheritance)
Conclusion
Knowledge Graph pipelines today are fundamentally inefficient. We force expensive LLMs to scan entire enterprise corpuses even though only a fraction of those documents contain meaningful relational intelligence.
This article demonstrated that document structure itself can serve as a strong predictor of graph extraction yield.
By combining Proxy-Pointer’s structural understanding with Graphability Indexing, we can shift KG ingestion from brute-force semantic scanning to targeted structural routing. Instead of repeatedly processing entire 500k-character agreements, the system learns which regions of a document family consistently produce valuable entities and relationships — and which are largely boilerplate noise. We can simply ignore the noise altogether, without using workarounds such as a smaller LLM to reduce costs.
Across three large real-world credit agreements from different industries, the index stabilized rapidly after only a few iterations and consistently achieved major payload reductions while preserving high-value relational extraction.
More importantly, this points to re-aligning our view of the extraction architecture. Instead of treating documents as flat text streams, Proxy-Pointer treats them as structured semantic trees capable of predicting where meaningful knowledge is likely to exist before extraction even begins.
As enterprise GraphRAG systems scale across millions of contracts, filings, policies, and agreements, this type of structure-aware ingestion may help in making large-scale Knowledge Graph construction operationally sustainable.
Open-Source Repository
Proxy-Pointer is fully open-source (MIT License) and can be accessed at the Proxy-Pointer GitHub repository. You can install it with a single pip command.
Clone the repo. Try your own documents. Let me know your thoughts.
Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI
The credit agreements used here are publicly available at SEC.gov. Code and benchmark results are open-source under the MIT License. Images used in this article are generated using Google Gemini.