Proxy-Pointer RAG: Solving Entity and Relationship Sprawl in Large Knowledge Graphs

Contents

Quick Recap: What is Proxy-Pointer?
How Knowledge Graphs Handle Reconciliation
So, how does Proxy-Pointer solve this?
1. Building the Entity Profile (The Query Builder)
2. The Vector Hit is Just a “Pointer”
3. LLM-Driven Strict Reconciliation
What about the variations not in top-k?
Test Results
Bridging Aliases (The “Sony” and “Valve” Tests)
Semantic Localization – 1 (The “Pensando” Test)
Semantic Localization – 2 (The “Product Family” Test)
Conclusion
Further Reading

Knowledge graphs have become the most widely used business semantic layer, providing a unified view of an organization’s suppliers, contracts, products, partners, and more. As a result, they grow organically over time and become very large, with millions of nodes (entities) and many times more edges (relations).

Even with governance controls and ontologies in place, adherence across the different pipelines feeding data into the graph is often inconsistent. New business rules emerge, naming conventions change, and older regions of the graph are frequently left untouched due to the sheer complexity and computational cost of upgrading them.

All of this makes a large graph increasingly difficult to maintain. One of the biggest operational problems occurs at the ingestion layer. For every new document that needs to be added, the same questions come up again and again:

Does Sony Corp already exist in the graph? And if so, by what name?
Is the “Sony Corp” listed in this new document the same entity as “Sony Interactive Entertainment” already present in the graph? Or do they hold different relationships to our organization, thereby requiring a distinct, new node?
What relationships already exist? Semantic ambiguities (supplies, provides, is contracted for) make reconciliation increasingly difficult at scale.

In the absence of an effective tool that can narrow the search space, ingestion pipelines are forced to execute expensive global graph searches to scan for variations, which degrades performance and incurs large computational costs.

What if there were a scalable, low-cost, and fast way to scan the thousands of historical documents already ingested into the graph and determine the likely entities and relations before querying the knowledge graph? Even better, what if the context gathered this way could be used for semantic localization: telling the pipeline exactly which region of the graph to update, rather than forcing it to traverse the whole thing?

The obvious choice for this pre-filtering step is a vector index. However, traditional Retrieval-Augmented Generation (RAG) is unsuitable for this task. Standard vector chunking fragments a document into isolated snippets with no common structural narrative. While a chunk may contain an entity name, it strips away the surrounding context needed to accurately extract the relationships between companies, products, people, places, and so on.

That is where the Proxy-Pointer architecture comes in.

In this article, I will demonstrate a novel approach to quickly and reliably extract entities and relationships from historical documents. By using vector matches as “pointers” to retrieve intact structural sections of a document, we can shift the burden of entity reconciliation away from the expensive Knowledge Graph, and onto a significantly faster, cheaper, and more accurate vector retrieval pipeline.

Quick Recap: What is Proxy-Pointer?

Standard vector RAG splits documents into blind chunks, embeds them, and retrieves the top-K by cosine similarity. The synthesizer LLM sees fragmented, context-less text — and frequently hallucinates or misses the answer entirely.

Proxy-Pointer fixes this with five zero-cost engineering techniques:

Skeleton Tree — Parse Markdown headings into a hierarchical tree (pure Python, no LLM needed)
Breadcrumb Injection — Prepend the full structural path (AMD > Financial Statements > Cash Flows) to every chunk before embedding
Structure-Guided Chunking — Split text within section boundaries, never across them
Noise Filtering — Remove distracting sections (TOC, glossary, executive summaries) from the index
Pointer-Based Context — Use retrieved chunks as pointers to load the full, unbroken document section for the synthesizer

The result: every chunk knows where it lives in the document, and the synthesizer sees complete sections — not fragments.
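To make the first three techniques concrete, here is a minimal sketch in plain Python. It is not the actual Proxy-Pointer implementation: the function names, the `section`/`text` chunk fields, and the fixed-size character splitter are assumptions chosen for illustration, and it does not handle `#` lines inside fenced code blocks.

```python
import re

# ATX-style Markdown heading: 1 to 6 '#' characters followed by the title
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def breadcrumb_sections(markdown, doc_title):
    """Split Markdown at headings; return (breadcrumb, body) per section."""
    sections = []
    stack = []   # (level, title) pairs forming the current heading path
    body = []    # lines collected for the section being read

    def flush():
        text = "\n".join(body).strip()
        if text:
            path = [doc_title] + [title for _, title in stack]
            sections.append((" > ".join(path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any heading at the same or a deeper level
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections

def chunks_for_embedding(sections, max_chars=500):
    """Chunk inside one section at a time and prefix each chunk with its breadcrumb."""
    chunks = []
    for crumb, text in sections:
        for start in range(0, len(text), max_chars):
            piece = text[start:start + max_chars]
            chunks.append({"section": crumb, "text": f"{crumb}\n{piece}"})
    return chunks

# Hypothetical miniature document, for illustration only
sample = """# Financial Statements
Intro text.
## Cash Flows
Net cash provided by operating activities.
# Risk Factors
Demand may vary.
"""
sections = breadcrumb_sections(sample, "AMD")
for crumb, _ in sections:
    print(crumb)
```

Because each chunk records the breadcrumb of the section it came from, that value can later serve as the pointer back to the full section.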

How Knowledge Graphs Handle Reconciliation

While it is clear why traditional vector databases are not suitable for reconciliation, it is worth examining how knowledge graphs tackle this problem. Almost all enterprise graph databases can perform semantic similarity matching over nodes and relationships. In addition, graph databases deploy a variety of tools: ontology matching, alias tables, fuzzy matching, and graph neural networks (GNNs). But perhaps the most well-known and widely used technique is embedding similarity.

In a modern graph, nodes and edges carry vector embeddings. A node embedding includes not only the node name (e.g., Sony Corp) but also its metadata (tags such as industry) and its local topology (neighboring nodes and relations). In principle, this allows the system to identify nodes that are semantically close even when names differ. For example, a graph search for: Sony + gaming ecosystem + supplier may retrieve nodes such as PlayStation ecosystem, Sony Corp or Sony Interactive Entertainment.

However, this approach breaks down at enterprise scale. As semantically similar entities proliferate, whether by design or due to messy historical data, it becomes increasingly difficult to predict which specific entity node is the correct target for the new relationship we are trying to ingest.

Consider this single sentence: “AMD partnered with Sony for PlayStation semi-custom SoCs”. It contains not only entity identity (AMD, Sony, PlayStation) but also relationship semantics (partnered with), platform context (PlayStation), and business role (semi-custom SoCs). Implicitly, this sentence maps to multiple distinct relationships: AMD is the chip designer/supplier, Sony is the platform owner/customer, and the interaction is hardware-oriented.

In a large knowledge graph, such diverse relationships are not stored close together; they are likely to be distributed across several nodes and complex edge paths. Yet in the source document, they all appear in the single section where this sentence occurs. As a result, deciding which of the “Sony-related” nodes is the correct anchor for the new ingestion becomes a computationally expensive challenge.

So, how does Proxy-Pointer solve this?

The answer is that while a traditional vector database is used as a store for an unordered bag of chunks, Proxy-Pointer treats it as a structural index. Using a two-step pipeline that bridges the gap between exact match and semantic search, we can address reconciliation well enough to control entity and relation sprawl.

For this test, I downloaded the publicly available AMD 10-K filings for 2020 and 2021 and embedded them using the Proxy-Pointer tree-based indexing. Each of these documents is over 120 pages long, and together they yield ~1000 chunks. This forms our historical dataset, which we assume is already ingested and present in the knowledge graph. I then took the AMD 2022 10-K as the document to be ingested and created four test queries around entities detected there.

Here is how this works in practice:

1. Building the Entity Profile (The Query Builder)

When an ingestion pipeline processes a new document (e.g., a 2022 filing), an upstream LLM first extracts the entities and then creates an “Entity Profile” for each one. Instead of just extracting the name “Sony”, it collates the key facts and business context surrounding that entity in the new document.

For example: "Sony is referenced in AMD's 2022 filing as the owner of the PlayStation registered trademark. In the Gaming segment, AMD states that the Sony PlayStation 5 and Microsoft Xbox Series S and X game consoles feature AMD's RDNA graphics architecture. Under Semi-Custom Products, AMD states that it developed the semi-custom SoC products that power both the Sony PlayStation 5 as well as the Microsoft Xbox Series S and X game consoles. AMD's semi-custom SoC revenue is dependent on consumer market conditions, including the success of the Sony PlayStation 5 game console systems worldwide."

Our Query Builder automatically converts this profile into a multi-track vector search strategy. It generates one query for the pure entity name (Sub-Query 1: “Sony”), designed to find any section that explicitly mentions the word “Sony” or one of its variants. Next, it parses the profile to generate several questions that test whether the same or a similar relation exists between “Sony” and another entity. In this case, that produces the following two queries:

Sub-Query 2: "Sony owns the PlayStation trademark. Does the same or a similar relation exist?"
Sub-Query 3: "Sony utilizes AMD's RDNA graphics architecture in the PlayStation 5. Does the same or a similar relation exist?"
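The sub-query construction can be sketched roughly as follows. This assumes the upstream LLM has already decomposed the Entity Profile into short relation statements; `build_sub_queries` and its parameters are hypothetical names for illustration, not the article's actual Query Builder.

```python
def build_sub_queries(entity_name, relation_statements):
    """Return one name-only query plus one question per relation statement."""
    suffix = "Does the same or a similar relation exist?"
    queries = [entity_name]  # Sub-Query 1: the raw entity name
    for statement in relation_statements:
        statement = statement.strip().rstrip(".")
        if statement:
            queries.append(f"{statement}. {suffix}")
    return queries

# Relation statements are assumed to come from the upstream LLM
queries = build_sub_queries(
    "Sony",
    [
        "Sony owns the PlayStation trademark",
        "Sony utilizes AMD's RDNA graphics architecture in the PlayStation 5",
    ],
)
for query in queries:
    print(query)
```

Each query is then sent to the vector index separately, so the name-only track and the relation tracks can surface different sections.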

By sending both the raw entity name and the decomposed relationship questions, Proxy-Pointer casts a customized “semantic net”, raising the probability that the Reconciler receives every relevant section needed to verify the node from different perspectives before ingestion into the Knowledge Graph. However, finding the right chunks in the vector database is only half the battle. This is where traditional RAG fails, and where the Proxy-Pointer architecture makes the leap.

2. The Vector Hit is Just a “Pointer”

In traditional RAG, the vector database would return a fragmented 500-character chunk—perhaps just the sentence mentioning PlayStation. Proxy-Pointer ignores the text of the chunk entirely. Instead, it uses the chunk’s metadata as a “pointer” to retrieve the entire, structurally intact document section (from heading to next heading).

This is what lets the LLM Reconciler see the full semantic context needed to derive relations between entities, e.g., Sony is the owner of PlayStation.
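A minimal sketch of this pointer step is shown below. It assumes each vector hit is a dict carrying a `section_id` and a `score`, and that full section texts live in a lookup keyed by `section_id`. Both shapes, and the `resolve_pointers` name, are assumptions for illustration; the section texts are placeholders.

```python
def resolve_pointers(hits_per_query, section_store, top_k=5):
    """Map chunk hits from all sub-queries to unique full sections, best score first."""
    best = {}
    for hits in hits_per_query:
        for hit in hits:
            sid = hit["section_id"]
            # Keep the best score seen for each section; the chunk text is ignored
            if sid not in best or hit["score"] > best[sid]:
                best[sid] = hit["score"]
    ranked = sorted(best, key=lambda sid: best[sid], reverse=True)[:top_k]
    return [{"section_id": sid, "text": section_store[sid]} for sid in ranked]

# Placeholder section texts; a real store would hold heading-to-heading content
section_store = {
    "AMD_2021_10K > Additional Information": "<full section text>",
    "AMD_2021_10K > The Enterprise, Embedded and Semi-Custom Markets": "<full section text>",
}
hits_per_query = [
    [
        {"section_id": "AMD_2021_10K > Additional Information", "score": 0.82},
        {"section_id": "AMD_2021_10K > The Enterprise, Embedded and Semi-Custom Markets", "score": 0.61},
    ],
    [{"section_id": "AMD_2021_10K > Additional Information", "score": 0.90}],
]
context_sections = resolve_pointers(hits_per_query, section_store)
print([s["section_id"] for s in context_sections])
```

Deduplicating by section rather than by chunk means several hits in the same section cost only one slot in the Reconciler's context.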

3. LLM-Driven Strict Reconciliation

We take the unique top-k sections returned by all the queries and create the context for the Reconciler LLM. The Reconciler is prompted to output all the different variations of the entity name as well as relations it can see with other entities. Because the LLM can read the entire section, it doesn’t have to guess relationships based on isolated sentences.
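As a rough illustration, assembling the Reconciler context and checking its reply might look like the sketch below. The prompt wording and function names are assumptions, not the author's actual prompt; the required keys mirror the fields visible in the responses reproduced under Test Results. No model is called here.

```python
import json

# Field names taken from the Reconciler outputs shown in this article
REQUIRED_KEYS = (
    "entity_name",
    "candidate_matches",
    "candidate_relationships",
    "graph_neighborhood",
    "summary",
    "sources",
)

def build_reconciler_prompt(entity_profile, sections):
    """Concatenate the full retrieved sections into one evidence block."""
    evidence = "\n\n".join(f"[{s['section_id']}]\n{s['text']}" for s in sections)
    return (
        "Reconcile the entity described below against the historical evidence.\n"
        "List every name variation and every relationship you can support.\n\n"
        f"Entity profile:\n{entity_profile}\n\n"
        f"Evidence sections:\n{evidence}\n\n"
        "Return JSON with keys: " + ", ".join(REQUIRED_KEYS) + "."
    )

def parse_reconciler_response(raw):
    """Parse the model's JSON reply and reject it if any expected key is absent."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Reconciler response is missing keys: {missing}")
    return data

prompt = build_reconciler_prompt(
    "Sony is referenced in AMD's 2022 filing as the owner of the PlayStation registered trademark.",
    [{"section_id": "AMD_2021_10K > Additional Information", "text": "<full section text>"}],
)
print(prompt)
```

Validating the reply before it reaches the ingestion step keeps a malformed response from silently creating or skipping nodes.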

What about the variations not in top-k?

A valid architectural question at this point: what happens if the entity is mentioned across hundreds of historical documents? Since our vector search retrieves only the `top-k` (e.g., 3 to 7) sections, won’t we miss vital historical context?

The answer is that we don’t need exhaustive coverage. The goal of the Proxy-Pointer filter pipeline is not an exhaustive analysis of the historical documents; it is “Semantic Localization” for graph ingestion. By retrieving just a few highly relevant, full-context sections, the Reconciler LLM captures a critical mass of entity aliases and business relationships, which helps the graph query pinpoint the region where it should look to find and merge this entity and relation.

As we will see in the next section, for our “Sony” example we are able to find the canonical legal entity “Sony Interactive Entertainment, Inc.” In other cases, the output may point to broader regions, such as gaming systems or AMD, which can be used as anchors to localize the search to those graph neighborhoods.
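One way to turn a Reconciler response into graph anchors is sketched below. It assumes the response has the `candidate_matches` and `graph_neighborhood` fields shown in the test outputs; the `localization_anchors` name and the `merge`/`create_near` labels are hypothetical, and how the anchors are then used in a graph query is left to the ingestion pipeline.

```python
def localization_anchors(response):
    """Pick the names an ingestion pipeline could use to scope its graph lookup."""
    matches = [m["name"] for m in response.get("candidate_matches", [])]
    if matches:
        # The entity already exists under one of these names: merge there
        return {"action": "merge", "anchors": matches}
    # No match: fall back to related entities as neighborhood anchors
    neighbors = [n["related_entity"] for n in response.get("graph_neighborhood", [])]
    return {"action": "create_near", "anchors": neighbors}

# Trimmed versions of two responses from the tests in this article
sony = {
    "candidate_matches": [{"name": "Sony Interactive Entertainment, Inc."}],
    "graph_neighborhood": [{"related_entity": "PlayStation 5"}],
}
pensando = {
    "candidate_matches": [],
    "graph_neighborhood": [
        {"related_entity": "Data center"},
        {"related_entity": "Networking"},
        {"related_entity": "AMD"},
    ],
}
print(localization_anchors(sony))
print(localization_anchors(pensando))
```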

Test Results

I ran this architecture against several tricky ingestion scenarios. In each case, the pipeline either matched the existing entity or localized the graph neighborhood for a new one.

Bridging Aliases (The “Sony” and “Valve” Tests)

When run with the “Sony” profile (Sub-Query 1 from the previous section), the Reconciler read the retrieved historical sections and accurately identified the formal legal entity already in the graph: “Sony Interactive Entertainment, Inc.” It also resolved the relationship “owns trademark PlayStation” (Sub-Query 2), confirming that it already exists in the historical disclosures and preventing duplicate edge creation.

More importantly, it also found indirect evidence for Sub-Query 3 (Sony utilizes AMD's RDNA graphics architecture) in the graph_neighborhood. From the historical 2020 and 2021 documents, it inferred that AMD leverages its core graphics IP to design the semi-custom SoCs that power the PlayStation 5, which is itself built on the RDNA 2 architecture. This complex multi-hop relationship is therefore already accounted for, which prevents the creation of several redundant edges.

Here is the full response:

{
  "entity_name": "Sony",
  "candidate_matches": [
    {
      "name": "Sony Interactive Entertainment, Inc.",
      "confidence": "HIGH",
      "sources": [
        "AMD_2020_10K > Pending Acquisition > Additional Information",
        "AMD_2021_10K > Additional Information"
      ],
      "reasoning": "The evidence explicitly identifies 'PlayStation' as a registered trademark of 'Sony Interactive Entertainment, Inc.', which is the formal legal entity for the Sony gaming division referenced in the 2022 context."
    }
  ],
  "candidate_relationships": [
    {
      "relationship": "owns trademark",
      "target_entity": "PlayStation",
      "confidence": "HIGH",
      "evidence_snippet": "PlayStation is a registered trademark or trademark of Sony Interactive Entertainment, Inc."
    },
    {
      "relationship": "partner/customer",
      "target_entity": "AMD",
      "confidence": "HIGH",
      "evidence_snippet": "We developed the semi-custom SoC products that power both the Sony PlayStation®5 as well as the Microsoft®Xbox Series X™ and Microsoft® Xbox Series S™ game consoles."
    }
  ],
  "graph_neighborhood": [
    {
      "related_entity": "PlayStation 5",
      "relationship_to_target": "Product owned by the target entity",
      "evidence_snippet": "We have developed RDNA™ 2, a high performing and power efficient graphics architecture, which is the foundation for next-generation PC gaming graphics, the PlayStation 5 and Xbox Series S and X consoles."
    },
    {
      "related_entity": "Semi-Custom Products",
      "relationship_to_target": "Business segment where AMD and the target entity collaborate",
      "evidence_snippet": "We leverage our core IP, including our graphics and processing technologies to develop semi-custom solutions... We developed the semi-custom SoC products that power both the Sony PlayStation®5"
    }
  ],
  "summary": "The target entity 'Sony' is identified in the 2020-2021 evidence primarily through its subsidiary 'Sony Interactive Entertainment, Inc.', which holds the trademark for the PlayStation brand. The evidence confirms a direct business relationship where AMD provides semi-custom SoC products for the Sony PlayStation 5 console, consistent with the 2022 context.",
  "sources": [
    "AMD_2020_10K > Pending Acquisition > Additional Information",
    "AMD_2021_10K > Additional Information"
  ]
}

When queried with “Valve”, it extracted two distinct variations, “Valve” and “Valve Corporation”, by analyzing both a product partnership (the Steam Deck) and a trademark disclosure in the same pass. As with the Sony query, the candidate_relationships and graph_neighborhood indicate that the relationships in the query are already represented in the graph.

Here are the full input query and the response:

Query: Valve is referenced in AMD's 2022 filing in the context of semi-custom gaming products. AMD states that it partnered with Valve to create a semi-custom APU optimized for handheld gaming to power the Steam Deck. Steam and the Steam logo are identified as trademarks and/or registered trademarks of Valve. The partnership falls within AMD's Gaming segment, which primarily includes discrete GPUs, semi-custom SoC products, and development services.

{
  "entity_name": "Valve",
  "candidate_matches": [
    {
      "name": "Valve",
      "confidence": "HIGH",
      "sources": [
        "AMD_2021_10K > The Enterprise, Embedded and Semi-Custom Markets"
      ],
      "reasoning": "The entity is explicitly named in the 2021 10-K as a partner for the Steam Deck."
    },
    {
      "name": "Valve Corporation",
      "confidence": "HIGH",
      "sources": [
        "AMD_2021_10K > Additional Information"
      ],
      "reasoning": "Full legal name provided in the trademark disclosures."
    }
  ],
  "candidate_relationships": [
    {
      "relationship": "partnered with",
      "target_entity": "AMD",
      "confidence": "HIGH",
      "evidence_snippet": "We also recently partnered with Valve to create a custom APU optimized for handheld gaming to power the Steam Deck™."
    },
    {
      "relationship": "owns trademark",
      "target_entity": "Steam",
      "confidence": "HIGH",
      "evidence_snippet": "Steam and the Steam logo are trademarks and/or registered trademarks of Valve Corporation in the United States and/or other countries."
    }
  ],
  "graph_neighborhood": [
    {
      "related_entity": "Steam Deck",
      "relationship_to_target": "Product developed by the target entity (Valve) in partnership with AMD.",
      "evidence_snippet": "We also recently partnered with Valve to create a custom APU optimized for handheld gaming to power the Steam Deck™."
    },
    {
      "related_entity": "Semi-Custom",
      "relationship_to_target": "Business segment under which the partnership between AMD and Valve is categorized.",
      "evidence_snippet": "We leverage our core IP, including our graphics and processing technologies to develop semi-custom solutions."
    }
  ],
  "summary": "The entity 'Valve' (and its full name 'Valve Corporation') is present in the 2021 evidence. It is identified as a partner of AMD in the 'Semi-Custom' business segment, specifically regarding the development of a custom APU for the 'Steam Deck' product. The evidence also confirms Valve's ownership of the 'Steam' trademark.",
  "sources": [
    "AMD_2021_10K > The Enterprise, Embedded and Semi-Custom Markets",
    "AMD_2021_10K > Additional Information"
  ]
}

Semantic Localization – 1 (The “Pensando” Test)

This is where the architecture truly shines. We asked the pipeline to reconcile “Pensando Systems” (a company AMD acquired in 2022) against the 2020-2021 corpus. As expected, it correctly identified this as a new node not yet present in the graph. More importantly, based on the Pensando entity profile, it identified entities from the AMD documents (Data center, Networking, and AMD) to which Pensando is likely related. This localizes the graph neighborhood to search, thereby streamlining ingestion. Here are the query and response:

Query: Pensando Systems, Inc. was acquired by AMD in May 2022 for approximately $1.9 billion. With this acquisition, AMD offers high-performance DPUs and next-generation data center solutions. The AMD Pensando DPUs are P4 programmable and designed to help offload data center infrastructure services from the CPU. Coupled with a comprehensive software stack, they help enable cloud and enterprise customers to optimize performance for network, storage, and security services at cloud scale. The DPUs are designed for minimal latency, jitter, and power impact, and can help large Infrastructure as a Service (IaaS) cloud providers improve hosted virtualized and bare-metal workload performance. The DPUs power the Aruba CX 10000 top-of-rack network switch.

{
  "entity_name": "Pensando Systems",
  "candidate_matches": [],
  "candidate_relationships": [],
  "graph_neighborhood": [
    {
      "related_entity": "Data center",
      "relationship_to_target": "The target entity (Pensando Systems) provides DPUs designed to offload infrastructure services from the CPU in this environment.",
      "evidence_snippet": "Today’s data centers require new technologies and configuration models to meet the demand driven by the growing amount of data that needs to be stored, accessed, analyzed and managed."
    },
    {
      "related_entity": "Networking",
      "relationship_to_target": "The target entity specializes in network services; this is a key focus area for AMD's embedded products.",
      "evidence_snippet": "Embedded products address computing needs in enterprise-class telecommunications, networking, security, storage systems and thin clients"
    },
    {
      "related_entity": "AMD",
      "relationship_to_target": "AMD is the parent company that acquired the target entity in 2022.",
      "evidence_snippet": "Advanced Micro Devices, Inc. (AMD) was incorporated under the laws of Delaware on May 1, 1969"
    },
    {
      "related_entity": "Infrastructure as a Service (IaaS)",
      "relationship_to_target": "The target entity's DPUs are designed to improve performance for IaaS cloud providers.",
      "evidence_snippet": "Servers are also used in cloud computing, which is a computing model where data, applications and services are delivered over the internet or an intranet"
    }
  ],
  "summary": "Pensando Systems does not appear in the 2020-2021 evidence, which is consistent with the fact that the acquisition occurred in 2022. The graph neighborhood confirms that AMD was already heavily invested in the 'Enterprise, Embedded and Semi-Custom' markets, specifically focusing on networking, security, and data center infrastructure, which aligns with the strategic rationale for the later acquisition of Pensando Systems.",
  "sources": []
}

Semantic Localization – 2 (The “Product Family” Test)

When asked about “AMD EPYC 9004 Series”, the pipeline correctly inferred that although this entity is not present in the graph, it belongs to the product series “AMD EPYC Series”, and that its predecessor “AMD EPYC 7003 Series”, powered by the “Zen 3” architecture, is likely to be present within the “Data Center” portfolio. Finding at least one of these entities in the existing graph gives the pipeline a place to attach the new node. Here are the query and response:

Query: AMD EPYC 9004 Series processors are part of AMD's data center computing portfolio, listed under the AMD EPYC Series alongside the AMD EPYC 7003 Series and AMD EPYC 7002 Series. The 4th Gen AMD EPYC 9004 Series processors are built on the "Zen 4" core and are designed to deliver leadership performance and energy efficiency across multiple market segments and workloads. They appear in the Data Center segment, which primarily includes server CPUs, GPUs, DPUs, FPGAs, and Adaptive SoC products. The Data Center segment leverages technology to address computational and visual data processing needs, where different combinations of CPUs, GPUs, DPUs, FPGAs, and Adaptive SoCs enable the optimization of performance and power for a diverse set of workloads.

{
  "entity_name": "AMD EPYC 9004 Series",
  "candidate_matches": [],
  "candidate_relationships": [],
  "graph_neighborhood": [
    {
      "related_entity": "AMD EPYC Series",
      "relationship_to_target": "The target entity is a member of this product series.",
      "evidence_snippet": "Our microprocessors for server platforms currently include the AMD EPYC™ Series processors."
    },
    {
      "related_entity": "AMD EPYC 7003 Series",
      "relationship_to_target": "The target entity is the successor to this product line.",
      "evidence_snippet": "We launched our 3rd Gen AMD EPYC processors, the AMD EPYC 7003 Series CPUs, in March 2021."
    },
    {
      "related_entity": "Zen 3",
      "relationship_to_target": "The target entity uses Zen 4, while the evidence highlights the predecessor Zen 3 architecture.",
      "evidence_snippet": "Our new AMD EPYC 7003 Series processors are powered by our “Zen 3” core architecture"
    },
    {
      "related_entity": "Data Center",
      "relationship_to_target": "The target entity is part of the Data Center computing portfolio.",
      "evidence_snippet": "We leverage our technology to address the computational and visual data processing needs in the data center market where we design CPUs, GPUs, and software for HPC, cloud gaming, and cloud and enterprise customers."
    }
  ],
  "summary": "The 'AMD EPYC 9004 Series' does not appear in the 2020-2021 evidence, which is expected as the series was released after this period. The evidence corpus contains the 'AMD EPYC Series' and the 'AMD EPYC 7003 Series' (3rd Gen), which serve as the immediate predecessors and the relevant product family context for the target entity.",
  "sources": []
}

Conclusion

As enterprise knowledge graphs grow to hundreds of millions of nodes, reconciliation increasingly becomes the dominant ingestion bottleneck. Determining whether an entity already exists, under what name, and in which operational context often requires expensive global graph exploration across fragmented relationships and inconsistent semantics.

Proxy-Pointer approaches this problem differently.

Instead of forcing the graph to perform large-scale semantic reconstruction from isolated nodes and edges, it retrieves intact contextual evidence directly from historical documents — where entities, relationships, business roles, and ecosystem context naturally exist together.

In effect, Proxy-Pointer shifts much of the reconciliation burden away from the Knowledge Graph and onto a significantly faster and cheaper vector retrieval pipeline. The graph no longer has to globally infer meaning first. It can focus on what it does best: localized structural traversal, persistence, and governance.

Proxy-Pointer does not replace the Knowledge Graph. It helps the graph understand where to look.