Proxy-Pointer RAG: Eliminating Wasteful Entity & Relations Extraction in Knowledge Graphs

Contents

Quick Recap: What is Proxy-Pointer?
Existing methods for NER optimization
Proxy-Pointer Approach
The Experimental Setup
Baseline Graphability Index
Evaluation Criteria
Results & Iterative Enrichment
Phase 1: Emerson Credit Agreement (Testing the Baseline)
Phase 2: AT&T Credit Agreement (Refinement)
Phase 3: TRoadhouse Credit Agreement (The Final Test)
Conclusion
Open-Source Repository

In my article on Solving Entity and Relationship Sprawl in Knowledge Graphs, I discussed how the Proxy-Pointer architecture can optimize the search for the right entities and relations. That, however, is only the second part of a larger problem in graph ingestion. The bigger, and far more expensive, step is identifying those entities (NER) and relations in the first place.

Knowledge Graphs are built to answer complex aggregation and multi-hop queries across entities and relationships over similar documents — vendor contracts, compliance manuals, credit agreements, global terms and conditions, etc. These documents are routinely over 100 pages long with dense text exceeding 500k characters. Enterprises frequently ingest thousands of similar contracts from the same suppliers and customers.

To do that, each of these documents is passed through a powerful LLM for NER and relations extraction, burning millions of tokens even before the actual graph ingestion can happen. The process has to be repeated sometimes, since long-context extraction often suffers from reduced recall consistency and increased extraction variance.

However, the crucial fact is that legal documents such as contracts have a very similar structure across organizations, and even across industries. They are also packed with dense boilerplate text, schedules, exhibits, etc., most of which is of little value for NER, yet still has to be read by an LLM.

But what if we could exploit this structural predictability? What if we could predict the value of a section before we ever send it to the LLM, drastically cutting ingestion costs by strategically ignoring the noise?

In this article, we will explore a novel approach to minimizing the content seen by the LLM. By leveraging the structural concepts of Proxy-Pointer RAG and introducing a predictive metric called Graphability Indexing, we can selectively bypass low-yield sections of dense documents. I am illustrating this using three massive, real-world corporate Credit Agreements—Emerson, AT&T, and Texas Roadhouse — to demonstrate how this methodology can slash extraction costs, as compared against full-document extraction pipelines, without sacrificing the integrity of the resulting Knowledge Graph.

Quick Recap: What is Proxy-Pointer?

Proxy-Pointer is a structure-aware RAG technique that delivers surgical precision over complex documents such as annual reports, credit agreements, etc. at the cost of standard Vector RAG. Standard vector RAG splits documents into blind chunks, embeds them, and retrieves the top-K by cosine similarity. Even with overlap and semantic chunking, this is not a reliable method for relationship extraction in enterprise KGs, because chunks fragment the context of a document and make extraction prone to hallucination.

Instead, Proxy-Pointer treats a document as a tree of self-contained semantic blocks (sections). Context is encapsulated within each section, which makes sections good candidates for relations extraction. Also, an LLM is much more likely to accurately identify the entities and relations from a section in a single pass than from a full 100-page document, making repeated scans unnecessary.

Technically, Proxy-Pointer leverages five zero-cost engineering techniques for RAG — a skeleton structure tree of the document, breadcrumb injection, structure-guided chunking, noise filtering, and pointer-based context. We will be leveraging some of these concepts along with a few new ones here. You can refer to the article here for more on Proxy-Pointer.

Existing methods for NER optimization

Before we look at the Proxy-Pointer approach, let's look at some of the existing optimization approaches adopted by organizations.

Traditional NLP / Pre-Trained Models (e.g., spaCy): A common first approach is to pair a lightweight, traditional NLP pipeline such as spaCy with an LLM in a funnel. These models are extremely fast and cheap, are pre-trained to recognize standard entities (Persons, Organizations, Locations, Dates), and are used to scan a document for entity hotspot regions. The hotspots are then scanned by an LLM in a focused manner. However, entity density does not necessarily correlate with relation density. For instance, administrative boilerplate like ‘Notices’ or trailing ‘Exhibits’ might be packed with standard entities (names, addresses, dates) without containing any structural legal relationships.
These models also struggle with bespoke corporate entities (like Adjusted Term SOFR or Swing Line Loans) and are not suitable for extracting the complex, nested relationships required for a highly constrained legal Knowledge Graph. In addition, continually fine-tuning them to reach the necessary accuracy requires a lot of manual annotation effort and compute cost.
LLM Pre-Scanning (Smaller Router Models): Another approach is to use a smaller, cheaper LLM to quickly pre-scan chunks and decide whether they contain valuable relationships, before sending only the high-value chunks to a large reasoning model for deep extraction. While this is cheaper per token, we are still forcing a model to read every word of a 500k-character document. The high-value parts are then read a second time by the larger model, which makes this a wasteful double scan of large parts of the document.

Proxy-Pointer Approach

As mentioned earlier, Proxy-Pointer leverages the following properties of knowledge graphs:

Graphs are built for a domain or functional area, and therefore store similar document content. A procurement graph will ingest multiple supplier contracts (and often many contracts from the same supplier), while a finance graph will have many lender and credit documents, compliance documents, etc.
The documents share a similar baseline structure: sections, schedules, exhibits, etc. Only a fraction of the content is needed for meaningful entity and relation extraction. The challenge is to identify that content.

We use this predictability for the following steps:

Build and deploy a baseline Graphability index: Start with a baseline index for a document type (e.g. Credit Agreements). Sections are classified into very high, high, medium, low and very low graphability. The graphability rating is driven by Relational Density—the volume of actionable business connections (edges) relative to the size of the section—rather than raw entity counts (nodes). This avoids entity dense but generic sections like Notices or Exhibits being classified as high. Based on this methodology, payment of obligations is classified as very high graphability whereas Duties of Agent or Governing law are classified as low yield sections. However, there is an important exception. While most sections are evaluated on relational density, ontological foundations like ‘Subsidiaries’ are anchored as ‘Very High’ because their few edges define the critical corporate hierarchy that the rest of the contract’s rules inherit. This preserves the index’s value as a business heatmap rather than a purely technical one based on entity or relations density.
Structure tree creation: We create a structure tree of a document which lists the hierarchy of sections as nodes, along with section title.
Enrich and Adjust: We walk the tree, not the text. We use the first few documents to refine and harden the index. For each section, we extract the content by its line numbers and use the section title to look up the predicted yield rating in the index. Next, the LLM scans every section of the document and, based on the entities and relations it extracts, assigns an actual yield rating to each one. Where the predicted and actual ratings do not match (e.g., the actual rating is “Low” but the index predicts “Medium”), the section is flagged for human review. Based on human SME input, the classifications in the index are adjusted.
Route and Bypass: Following the above process, we would be able to derive an enriched graphability index after a few documents. From then on, high-yield sections (Very High, High, Medium) are sent to the LLM for deep NER extraction. Low and Very Low sections are safely bypassed.
New Sections: Every document will have a few sections not found in the index, which are flagged as Coverage Gaps. These are always scanned for NER, to avoid missing relevant relations. Upon human review, the generic, frequently occurring ones can be added to the index, while bespoke ones such as Benchmark Replacement Setting can be left out.
Achieve stabilization: After just a few iterations, we expect prediction mismatches to drop to near zero, and the volume of “New Sections” to stabilize at no more than 20-25% (representing highly bespoke or administrative clauses), allowing the system to confidently process massive document corpora with the right balance of rigor and efficiency.
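
To make the "walk the tree, not the text" idea concrete, here is a minimal sketch of the Route and Bypass step. It is an illustration under my own assumptions, not the code from the repository: the `SectionNode` shape, the `route` and `section_text` helpers, and the exact-title lookup are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

# Tiers that are sent to the LLM; everything else found in the index is bypassed.
SEND_TIERS = {"very_high", "high", "medium"}


@dataclass
class SectionNode:
    """One node of the structure tree (hypothetical shape for this sketch)."""
    node_id: str
    title: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    children: list = field(default_factory=list)


def walk(node):
    """Yield every node in document order (depth-first)."""
    yield node
    for child in node.children:
        yield from walk(child)


def route(root, title_to_tier):
    """Split leaf sections into extract / bypass / coverage-gap lists."""
    plan = {"extract": [], "bypass": [], "coverage_gap": []}
    for node in walk(root):
        if node.children:
            continue  # assumption: only leaf sections carry their own text
        tier = title_to_tier.get(node.title)
        if tier is None:
            plan["coverage_gap"].append(node.node_id)  # unknown title: always scanned
        elif tier in SEND_TIERS:
            plan["extract"].append(node.node_id)
        else:
            plan["bypass"].append(node.node_id)
    return plan


def section_text(lines, node):
    """Slice a section out of the document by its line range."""
    return "\n".join(lines[node.start_line - 1:node.end_line])


# Toy document: three sections of two lines each.
lines = ["Definitions", "...", "Governing Law", "...", "Ratable Advances", "..."]
root = SectionNode("0000", "Credit Agreement", 1, 6, [
    SectionNode("0001", "Definitions", 1, 2),
    SectionNode("0002", "Governing Law", 3, 4),
    SectionNode("0003", "Ratable Advances", 5, 6),
])
index = {"Definitions": "very_high", "Governing Law": "very_low"}
plan = route(root, index)
print(plan)
```

In this toy run only the "extract" and "coverage_gap" sections would ever be sliced out with `section_text` and sent to the LLM; the bypassed section is never read.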

The graphability index should be maintained for each document type and could possibly even be specific to individual large suppliers and partners from whom we may be ingesting hundreds of similar documents in a year.

Let's see this in action with an experiment.

The Experimental Setup

To validate this hypothesis, I set up an experiment using three massive, publicly available corporate Credit Agreements that I have previously used in my article on efficient Contract Comparison using Proxy-Pointer. As you can see, they are all from different companies (and industries), so the documents do not share an identical structure and format.

Emerson Electric Co. (~228,000 characters)
AT&T Inc. (~214,000 characters)
Texas Roadhouse, Inc. (TRoadhouse) (~434,000 characters)

Baseline Graphability Index

Our goal is to build and iteratively validate a predictive Graphability Index. We start with a foundational baseline index mapping common credit agreement sections to their expected relational density:

{
  "document_type": "credit_agreement",
  "very_high_graphability": [
    "Litigation",
    "Environmental Matters",
    "Subsidiaries",
    "Payment of Obligations",
    "Maintenance of Property",
    "Mergers and Sales of Assets",
    "Commitment Schedule",
    "Sanctions and Anti-Corruption",
    "Designation of Subsidiary Borrowers",
    "Definitions",
    "Events of Default",
    "Successors and Assigns"
  ],
  "high_graphability": [
    "Company Guarantee",
    "The Facility",
    "Facility Letters of Credit",
    "Corporate Existence and Power",
    "Corporate Authorization",
    "Financial Information",
    "Compliance with Laws",
    "Use of Proceeds",
    "Arranger and Syndication Agent",
    "Eurocurrency Payment Offices",
    "Defaulting Lenders"
  ],
  "medium_graphability": [
    "Swing Line Loans",
    "Competitive Bid Advances",
    "Credit Extensions",
    "Designation of a Subsidiary Borrower",
    "Successor Agent",
    "Funding Indemnification",
    "Acceleration and Collateral Accounts",
    "Collateral"
  ],
  "low_graphability": [
    "Accounting Terms",
    "Interest Rate Changes",
    "Method of Payment",
    "Telephonic Notices",
    "Market Disruption",
    "Judgment Currency",
    "Change in Circumstances",
    "Confidentiality"
  ],
  "very_low_graphability": [
    "No Waivers",
    "Counterparts and Integration",
    "Governing Law",
    "Waiver of Jury Trial",
    "No Fiduciary Duty",
    "Service of Process",
    "Miscellaneous",
    "Electronic Communications",
    "Exhibit",
    "Table of Contents"
  ]
}
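
As a rough sketch of how such an index could be queried, the snippet below flattens a small excerpt of the JSON above into a title-to-tier lookup and predicts a rating from a section header. The matching rule (strip the section numbering, then take the longest index title contained in the header) is my assumption for illustration; the function names are hypothetical.

```python
import re

# Excerpt of the baseline index above; a real run would load the full JSON.
BASELINE = {
    "document_type": "credit_agreement",
    "very_high_graphability": ["Definitions", "Subsidiaries", "Events of Default"],
    "high_graphability": ["The Facility", "Use of Proceeds"],
    "low_graphability": ["Accounting Terms", "Method of Payment"],
    "very_low_graphability": ["Governing Law", "Exhibit"],
}


def invert(index):
    """Flatten {tier: [titles]} into {lowercased title: tier}."""
    return {
        title.lower(): tier.replace("_graphability", "")
        for tier, titles in index.items()
        if isinstance(titles, list)  # skips the "document_type" string
        for title in titles
    }


def strip_numbering(header):
    """Drop a leading 'Section 1.02' or '7.14' style prefix (assumed format)."""
    return re.sub(r"^(section\s+)?\d+(\.\d+)*\.?\s*", "", header.strip(), flags=re.I)


def predict(header, title_to_tier):
    """Return the predicted tier, or None when the section is a coverage gap."""
    cleaned = strip_numbering(header).lower()
    hits = [title for title in title_to_tier if title in cleaned]
    # Assumption: the longest matching index title is the most specific one.
    return title_to_tier[max(hits, key=len)] if hits else None


lookup = invert(BASELINE)
print(predict("Section 1.01 Definitions", lookup))
print(predict("Section 1.02 Accounting Terms and Determinations", lookup))
print(predict("Section 1.03 Types of Advances", lookup))
```

Substring matching is deliberately loose so that a header like "Accounting Terms and Determinations" still finds "Accounting Terms"; a production matcher would likely need stricter rules to avoid false hits.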

We will run the experiment in three phases. First, we run the Emerson agreement to calculate the initial savings. Any generic uncovered sections (deltas) discovered in Emerson are baked back into the index. We then run the enriched index against AT&T, add any final edge cases to the index if required, and use the fully refined index against the massive TRoadhouse agreement to measure the ultimate reduction. The goal is that by the time we scan the TRoadhouse agreement, we should see significantly fewer mismatches than in the previous two, as the index stabilizes.

Evaluation Criteria

For each section, we will compare the graphability predicted by the index with the actual rating assessed by the LLM based on the relations and entities found. In our report, we will categorize the results into three buckets:

Perfect Alignment: The index accurately predicted the section’s graphability rating.

Minor Deviations: The index predicted a yield (e.g., Medium) that slightly differed from the actual assessment (e.g., Low).

Coverage Gaps / New Sections: The section was unique to the document and did not yet exist in our predictive index.
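
The three buckets can be expressed as a small comparison function. This is a sketch under my own assumptions: the tier ordering list and the function names are hypothetical, and `None` stands for a section title that is not in the index.

```python
# Tiers ordered from lowest to highest yield, so distance between them is measurable.
TIERS = ["very_low", "low", "medium", "high", "very_high"]


def match_quality(predicted, actual):
    """Classify one section into the three report buckets."""
    if predicted is None:
        return "coverage_gap"       # title not found in the index
    if predicted == actual:
        return "perfect_alignment"
    return "minor_deviation"        # flagged for human review


def tier_distance(predicted, actual):
    """How many tiers apart the predicted and actual ratings are."""
    return abs(TIERS.index(predicted) - TIERS.index(actual))


print(match_quality("medium", "low"), tier_distance("medium", "low"))
print(match_quality(None, "very_high"))
```

The distance is not part of the bucket definition; it is only there to help a reviewer prioritize, since a Very High section predicted as Low is far more costly than a one-tier slip.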

Results & Iterative Enrichment

Let's begin with Phase 1: Emerson.

Phase 1: Emerson Credit Agreement (Testing the Baseline)

We ran the 95 sections of this agreement against our baseline index. In this initial run, 66 out of 95 sections (69.5%) matched perfectly. The index accurately mapped standard provisions, such as “Mergers and Sales of Assets,” as highly graphable, while correctly identifying “Accounting Terms” and standard boilerplate Exhibits as low-yield. There were no mismatches between the actual ratings and the ratings predicted by the index.

However, 29 sections (~30%) were not found in the index and were therefore marked as New Section, i.e., Coverage Gaps. Upon review, many were highly bespoke administrative clauses (e.g., “Ratable advances”, “Notification of advances”) and were correctly left as gaps, but several generic sections (like “Types of Advances”, “Compliance with ERISA”, and “Interest Payment Dates; Interest and Fee Basis”) should be added to the index. Based on their assessed actual yield, I added these specific clauses to the “Medium” and “Low” tiers of the graphability index, enriching the baseline for the next phase.
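
A minimal sketch of this enrichment step is shown below. The `enrich` helper is a hypothetical name, and the tier given to “Compliance with ERISA” is purely illustrative (only the “Low” rating of “Types of Advances” appears in the section table further down).

```python
import copy


def enrich(index, approved):
    """Return a new index with SME-approved titles added to their tiers."""
    updated = copy.deepcopy(index)  # keep the previous index version untouched
    for title, tier in approved.items():
        # Keep exactly one rating per title: drop it from any other tier first.
        for titles in updated.values():
            if isinstance(titles, list) and title in titles:
                titles.remove(title)
        updated.setdefault(f"{tier}_graphability", []).append(title)
    return updated


baseline = {
    "document_type": "credit_agreement",
    "medium_graphability": ["Swing Line Loans"],
    "low_graphability": ["Accounting Terms"],
}
approved = {
    "Types of Advances": "low",         # actual rating from the Phase 1 table
    "Compliance with ERISA": "medium",  # illustrative tier, assumed for this sketch
}
enriched = enrich(baseline, approved)
print(enriched["low_graphability"])
print(enriched["medium_graphability"])
```

Returning a copy rather than mutating in place makes it easy to diff index versions between phases and to roll back an adjustment that turns out to overfit one document.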

The most important outcome is that even with this raw baseline index, 36,880 characters of text in the “Low” and “Very Low” tiers were correctly predicted as noise. Had those sections not been routed to the LLM, the processing payload would have dropped by 16.10%.

The match quality and yield prediction efficiency are summarized below:

Matched Ratings	Number of Sections	Total Characters	% of Total Document
Very High	13	61,360	26.79%
High	13	83,040	36.26%
Medium	17	27,840	12.16%
Low	15	12,800	5.59%
Very Low	8	24,080	10.51%
Mismatched Rating	0	0	0.00%
New Section	29	19,920	8.70%
TOTAL	95	229,040	100.00%
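
The payload saving follows directly from this table: it is the share of characters in the Low and Very Low rows. Here is a short worked check, with the rows copied from the table above (the function name is just an illustration).

```python
# (tier, sections, characters) copied from the Emerson table above.
EMERSON = [
    ("Very High", 13, 61_360),
    ("High", 13, 83_040),
    ("Medium", 17, 27_840),
    ("Low", 15, 12_800),
    ("Very Low", 8, 24_080),
    ("Mismatched Rating", 0, 0),
    ("New Section", 29, 19_920),
]
BYPASS = {"Low", "Very Low"}


def bypass_savings(rows):
    """Return (bypassed characters, total characters, saving in percent)."""
    total = sum(chars for _, _, chars in rows)
    skipped = sum(chars for tier, _, chars in rows if tier in BYPASS)
    return skipped, total, round(100 * skipped / total, 2)


skipped, total, pct = bypass_savings(EMERSON)
print(skipped, total, pct)  # 36880 229040 16.1
```

Note that New Sections count toward the scanned payload, since coverage gaps are always sent to the LLM.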

Here are a few rows from the base table of the section-wise comparison:

Node ID	Section Header	Approx. Chars	Entities (Est.)	Relations (Est.)	Actual Rating	Predicted Rating (Index Match)	Match Quality
0002	Section 1.01 Definitions	44,400	252	402	Very High	Very High (Definitions)	🟢
0003	Section 1.02 Accounting Terms and Determinations	320	4	4	Low	Low (Accounting Terms)	🟢
0004	Section 1.03 Types of Advances	800	19	2	Low	New Section	⚪
0006	Section 2.01 The Facility	2,320	27	21	High	High (The Facility)	🟢
0007	Section 2.02 Ratable Advances	3,840	56	19	Very High	New Section	⚪

Finally, here are a few extraction examples:

- **Company Guarantee (Very High)**:
  - *Entities*: Guarantor, Agent, Obligations
  - *Relations*: [Guarantor]-(guarantees)->[Obligations], [Guarantor]-(indemnifies)->[Agent]
- **Mergers and Sales of Assets (Very High)**:
  - *Entities*: Borrower, Assets, Buyer
  - *Relations*: [Borrower]-(sells)->[Assets], [Borrower]-(merges_with)->[Buyer]
- **Ratable Advances (Very High)**:
  - *Entities*: Advance, Lender, Borrower
  - *Relations*: [Lender]-(makes)->[Advance], [Borrower]-(receives)->[Advance]
- **Method of Payment (Low)**:
  - *Entities*: Agent, Accounts, Funds
  - *Relations*: None (purely administrative procedural instructions with minimal active relational edges)

Phase 2: AT&T Credit Agreement (Refinement)

Next, we deployed the enriched index against the AT&T Credit Agreement. The document contained 77 sections spanning roughly 214,000 characters.

The results held up and improved on coverage. 55 out of 77 sections (71.4%) achieved Perfect Alignment, which is nearly identical to Emerson’s rate. In addition, there were 4 mismatched sections, where the actual and predicted graphability ratings did not agree. This is only about 5% of the sections, so I did not adjust the index for them, to avoid overfitting to individual documents. Only 18 sections (23.4%) were Coverage Gaps, an improvement on Emerson’s roughly 30%. All were judged to be bespoke or procedural noise from a KG point of view: computation of time periods, extension of termination date, subordination, etc. These are low or very low yield sections from an NER perspective and should be added to the index to prevent the LLM from scanning them in a new document. However, to check the robustness of the experiment, I did not add them, so that we can see how the existing index performs against the TRoadhouse document.

The potential LLM savings compounded dramatically. Because the index confidently identified large regions of the document as low-yield (e.g., interest rate determination and increased costs, besides the Table of Contents and trailing Exhibits), the system flagged 72,763 characters as not worth scanning. By following this index in production, a 33.94% reduction in processing load could be achieved, while still extracting every high-value relational edge in the document.

The match quality and yield prediction efficiency are summarized below:

Matched Ratings	Number of Sections	Total Characters	% of Total Document
Very High	5	53,520	24.96%
High	9	41,840	19.51%
Medium	15	20,000	9.33%
Low	12	10,960	5.11%
Very Low	14	61,803	28.83%
Mismatched Rating	4	4,880	2.28%
New Section	18	21,397	9.98%
TOTAL	77	214,400	100.00%

A few rows from the section rating analysis table are as follows:

Node ID	Section Header	Approx. Chars	Entities (Est.)	Relations (Est.)	Actual Rating	Predicted Rating (Index Match)	Match Quality
0017	SECTION 2.12. Payments and Computations	1,520	21	5	Low	Low (Payments and Computations)	🟢
0018	SECTION 2.13. Taxes	3,360	14	10	Medium	Medium (Taxes)	🟢
0019	SECTION 2.14. Sharing of Payments, Etc.	800	8	6	Low	Low (Sharing of Payments)	🟢
0020	SECTION 2.15. Evidence of Debt	640	10	2	Low	Low (Evidence of Debt)	🟢
0021	SECTION 2.16. Use of Proceeds	320	8	4	High	High (Use of Proceeds)	🟢
0022	SECTION 2.17. Increase in the Aggregate Commitments	2,800	22	9	Medium	New Section	⚪
0023	SECTION 2.18. Extension of Termination Date	3,120	20	25	Medium	New Section	⚪
0024	SECTION 2.20. Replacement of Lenders	1,920	19	12	Medium	Medium (Replacement of Lenders)	🟢
0025	SECTION 2.21. Benchmark Replacement Setting	12,560	61	31	High	High (Benchmark Replacement Setting)	🟢

And here are a few extraction examples:

- **Certain Defined Terms (Very High)**:
  - *Entities*: Base Rate, Margin, SOFR
  - *Relations*: IS_A, PART_OF, CONTROLS, ROLE_OF, REFERENCES (Definitions form the ontology backbone, creating canonical entity normalization and robust semantic inheritance)
- **Conditions Precedent (Medium)**:
  - *Entities*: Closing Date, Certificates, Approvals
  - *Relations*: [Lender]-(requires)->[Certificates], [Agent]-(receives)->[Approvals]
- **Accounting Terms; Interpretive Provisions (Low)**:
  - *Entities*: GAAP, Accounting Principles
  - *Relations*: None (purely administrative and interpretive provisions with minimal active relational edges)

Phase 3: TRoadhouse Credit Agreement (The Final Test)

Although we used just the first document to enrich the graphability index, let’s test the TRoadhouse credit agreement and see the outcome. Before we do that, it is pertinent to consider a few differences, not just between the documents, but also in domain and industry. Emerson and AT&T are very large, blue-chip industrial and telecom companies, whereas Texas Roadhouse is a midsize restaurant chain. The agreements of Emerson and AT&T read like sovereign corporate treasury documents based on credit agency ratings, while Texas Roadhouse’s agreement is highly customized, built specifically around restaurant leases. In terms of size, at 434,000 characters, this document is almost the size of the previous two combined, with over 100 sections in the structure tree. In other words, if the graphability index performs well here, it will strongly support the premise that document structure is an accurate predictor of entity and relations yield.

And here are the results. The index performed exceptionally well. 81 out of 102 sections (79.4%) matched the index perfectly. There were no sections where the actual rating differed from the predicted one. The index correctly categorized crucial sections like “Letters of Credit” and standard “Affirmative/Negative Covenants” as high yield, which should trigger full extraction. The remaining 21 sections (20.6%), classified as Coverage Gaps, were a mix of low-yield administrative clauses (e.g., Rounding, Erroneous Payments) and procedural noise (e.g., Divisions, Commitments, etc.).

However, the true impact was in the payload efficiency. Besides the Exhibits, several low-yield sections were identified, such as accounting terms, rounding, administrative agent, and miscellaneous. The Schedules were analyzed on their individual value: while a few schedules such as Liens and Investments matched the index rating of High, others such as Existing LCs were classified as gaps.

Together, the Low and Very Low sections account for 37.85% of the document, so following the predictions and bypassing those sections entirely yields a net saving of roughly 38%. This affirms the viability of the approach.

Here is the yield processing efficiency table:

Matched Ratings	Number of Sections	Total Characters	% of Total Document
Very High	11	128,840	29.64%
High	12	30,320	6.98%
Medium	20	25,000	5.75%
Low	17	9,520	2.19%
Very Low	21	155,000	35.66%
Mismatched Rating	0	0	0.00%
New Section	21	85,960	19.78%
TOTAL	102	434,640	100.00%
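
To see how the index behaved across the three phases, the section counts from the three summary tables can be turned into rates side by side. This is only a worked recomputation of the numbers already reported; the dictionary layout and function name are my own.

```python
# Section counts taken from the three summary tables in this article.
PHASES = {
    "Emerson": {"matched": 66, "mismatched": 0, "new": 29},
    "AT&T": {"matched": 55, "mismatched": 4, "new": 18},
    "TRoadhouse": {"matched": 81, "mismatched": 0, "new": 21},
}


def rates(counts):
    """Convert section counts into percentages of all sections in the document."""
    total = sum(counts.values())
    return {bucket: round(100 * n / total, 1) for bucket, n in counts.items()}


for name, counts in PHASES.items():
    print(name, sum(counts.values()), rates(counts))
```

Read together, the alignment rate rises from 69.5% to 71.4% to 79.4%, while the share of new sections falls from 30.5% to 23.4% to 20.6%, which is the stabilization pattern the approach relies on.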

A few examples of section ratings are as follows:

Node ID	Section Header	Approx. Chars	Entities (Est.)	Relations (Est.)	Actual Rating	Predicted Rating (Index Match)	Match Quality
0104	7.14 Financial Covenants	720	12	1	Very High	Very High (Financial Covenant)	🟢
0105	8.01 Events of Default	3,200	30	21	Medium	Medium (Events of Default)	🟢
0108	Article 9: ADMINISTRATIVE AGENT (Aggregated)	4,880	2	0	Low	Low (Duties of Agent)	🟢
0119	Article 10: MISCELLANEOUS (Aggregated)	18,000	2	0	Very Low	Very Low (Miscellaneous)	🟢
0144	Schedule 2.01A Commitments	4,000	2	0	Very High	Very High (Commitment Schedule)	🟢
0145	Schedule 2.01B L/C Commitments	2,000	2	0	Very Low	New Section	⚪
0146	Schedule 2.03 Existing L/Cs	3,000	3	0	Very Low	New Section	⚪
0147	Schedule 5.01 Jurisdictions	6,000	2	0	Very Low	New Section	⚪
0159	Schedule 5.06 Litigation	5,000	2	5	Very High	Very High (Litigation)	🟢
0161	Schedule 5.09 Environmental	8,000	2	5	Very High	Very High (Environmental Matters)	🟢
0163	Schedule 5.13 Subsidiaries	40,000	2	5	Very High	Very High (Subsidiaries)	🟢

And finally a few examples of extraction:

- **Financial Covenants (Very High)**:
  - *Entities*: Borrower, Leverage Ratio, Fixed Charge Coverage Ratio
  - *Relations*: [Borrower]-(maintains)->[Leverage Ratio]
- **Investments & Liens (High)**:
  - *Entities*: Borrower, Lien, Property, Permitted Investments
  - *Relations*: [Borrower]-(grants)->[Lien], [Borrower]-(makes)->[Permitted Investments]
- **Defined Terms (Very High)**:
  - *Entities*: Adjusted Term SOFR, Base Rate, Defaulting Lender
  - *Relations*: IS_A, PART_OF, CONTROLS, ROLE_OF, REFERENCES (Definitions form the ontology backbone, creating canonical entity normalization and robust semantic inheritance)

Conclusion

Knowledge Graph pipelines today are fundamentally inefficient. We force expensive LLMs to scan entire enterprise corpuses even though only a fraction of those documents contain meaningful relational intelligence.

This article demonstrated that document structure itself can serve as a strong predictor of graph extraction yield.

By combining Proxy-Pointer’s structural understanding with Graphability Indexing, we can shift KG ingestion from brute-force semantic scanning to targeted structural routing. Instead of repeatedly processing entire 500k-character agreements, the system learns which regions of a document family consistently produce valuable entities and relationships — and which are largely boilerplate noise. We can simply ignore the noise altogether, without using workarounds such as a smaller LLM to reduce costs.

Across three large real-world credit agreements from different industries, the index stabilized rapidly after only a few iterations and consistently achieved major payload reductions while preserving high-value relational extraction.

More importantly, this points to re-aligning our view of the extraction architecture. Instead of treating documents as flat text streams, Proxy-Pointer treats them as structured semantic trees capable of predicting where meaningful knowledge is likely to exist before extraction even begins.

As enterprise GraphRAG systems scale across millions of contracts, filings, policies, and agreements, this type of structure-aware ingestion may help in making large-scale Knowledge Graph construction operationally sustainable.

Open-Source Repository

Proxy-Pointer is fully open-source (MIT License) and can be accessed at the Proxy-Pointer GitHub repository. You can install it with a single pip command.

Clone the repo. Try your own documents. Let me know your thoughts.

Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

The credit agreements used here are publicly available at SEC.gov. Code and benchmark results are open-source under the MIT License. Images used in this article are generated using Google Gemini.