Building Knowledge Graphs with LLM Graph Transformer | by Tomaz Bratanic

Contents

Tool-based extraction Prompt-based extraction

The LLM Graph Transformer operates in two distinct modes, each designed to generate graphs from documents using an LLM in different scenarios.

Tool-Based Mode (Default): When the LLM supports structured output or function calling, this mode leverages the LLM’s built-in with_structured_outputto use tools. The tool specification defines the output format, ensuring that entities and relationships are extracted in a structured, predefined manner. This is depicted on the left side of the image, where code for the Node and Relationship classes is shown.
Prompt-Based Mode (Fallback): In situations where the LLM doesn’t support tools or function calls, the LLM Graph Transformer falls back to a purely prompt-driven approach. This mode uses few-shot prompting to define the output format, guiding the LLM to extract entities and relationships in a text-based manner. The results are then parsed through a custom function, which converts the LLM’s output into a JSON format. This JSON is used to populate nodes and relationships, just as in the tool-based mode, but here the LLM is guided entirely by prompting rather than structured tools. This is shown on the right side of the image, where an example prompt and resulting JSON output are provided.

These two modes ensure that the LLM Graph Transformer is adaptable to different LLMs, allowing it to build graphs either directly using tools or by parsing output from a text-based prompt.

Note that you can use prompt-based extraction even with models that support tools/functions by setting the attribute ignore_tools_usage=True.

Tool-based extraction

We initially chose a tool-based approach for extraction since it minimized the need for extensive prompt engineering and custom parsing functions. In LangChain, the with_structured_output method allows you to extract information using tools or functions, with output defined either through a JSON structure or a Pydantic object. Personally, I find Pydantic objects clearer, so we opted for that.

We start by defining a Node class.

class Node(BaseNode):
id: str = Field(..., description="Name or human-readable unique identifier")
label: str = Field(..., description=f"Available options are {enum_values}")
properties: Optional[List[Property]]

Each node has an id, a label, and optional properties. For brevity, I haven’t included full descriptions here. Describing ids as human-readable unique identifier is important since some LLMs tend to understand ID properties in more traditional way like random strings or incremental integers. Instead we want the name of entities to be used as id property. We also limit the available label types by simply listing them in the labeldescription. Additionally, LLMs like OpenAI’s, support an enum parameter, which we also use.

Next, we take a look at the Relationship class

class Relationship(BaseRelationship):
source_node_id: str
source_node_label: str = Field(..., description=f"Available options are {enum_values}")
target_node_id: str
target_node_label: str = Field(..., description=f"Available options are {enum_values}")
type: str = Field(..., description=f"Available options are {enum_values}")
properties: Optional[List[Property]]

This is the second iteration of the Relationship class. Initially, we used a nested Node object for the source and target nodes, but we quickly found that nested objects reduced the accuracy and quality of the extraction process. So, we decided to flatten the source and target nodes into separate fields—for example, source_node_id and source_node_label, along with target_node_id and target_node_label. Additionally, we define the allowed values in the descriptions for node labels and relationship types to ensure the LLMs adhere to the specified graph schema.

The tool-based extraction approach enables us to define properties for both nodes and relationships. Below is the class we used to define them.

class Property(BaseModel):
"""A single property consisting of key and value"""
key: str = Field(..., description=f"Available options are {enum_values}")
value: str

Each Property is defined as a key-value pair. While this approach is flexible, it has its limitations. For instance, we can’t provide a unique description for each property, nor can we specify certain properties as mandatory while others optional, so all properties are defined as optional. Additionally, properties aren’t defined individually for each node or relationship type but are instead shared across all of them.

We’ve also implemented a detailed system prompt to help guide the extraction. In my experience, though, the function and argument descriptions tend to have a greater impact than the system message.

Unfortunately, at the moment, there is no simple way to customize function or argument descriptions in LLM Graph Transformer.

Prompt-based extraction

Since only a few commercial LLMs and LLaMA 3 support native tools, we implemented a fallback for models without tool support. You can also set ignore_tool_usage=True to switch to a prompt-based approach even when using a model that supports tools.

Most of the prompt engineering and examples for the prompt-based approach were contributed by Geraldus Wilsen.

With the prompt-based approach, we have to define the output structure directly in the prompt. You can find the whole prompt here. In this blog post, we’ll just do a high-level overview. We start by defining the system prompt.

You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph. Your task is to identify the entities and relations specified in the user prompt from a given text and produce the output in JSON format. This output should be a list of JSON objects, with each object containing the following keys:- **"head"**: The text of the extracted entity, which must match one of the types specified in the user prompt.
- **"head_type"**: The type of the extracted head entity, selected from the specified list of types.
- **"relation"**: The type of relation between the "head" and the "tail," chosen from the list of allowed relations.
- **"tail"**: The text of the entity representing the tail of the relation.
- **"tail_type"**: The type of the tail entity, also selected from the provided list of types.
Extract as many entities and relationships as possible. 
**Entity Consistency**: Ensure consistency in entity representation. If an entity, like "John Doe," appears multiple times in the text under different names or pronouns (e.g., "Joe," "he"), use the most complete identifier consistently. This consistency is essential for creating a coherent and easily understandable knowledge graph.
**Important Notes**:
- Do not add any extra explanations or text.

In the prompt-based approach, a key difference is that we ask the LLM to extract only relationships, not individual nodes. This means we won’t have any isolated nodes, unlike with the tool-based approach. Additionally, because models lacking native tool support typically perform worse, we do not allow extraction any properties — whether for nodes or relationships, to keep the extraction output simpler.

Next, we add a couple of few-shot examples to the model.

examples = [
{
"text": (
"Adam is a software engineer in Microsoft since 2009, "
"and last year he got an award as the Best Talent"
),
"head": "Adam",
"head_type": "Person",
"relation": "WORKS_FOR",
"tail": "Microsoft",
"tail_type": "Company",
},
{
"text": (
"Adam is a software engineer in Microsoft since 2009, "
"and last year he got an award as the Best Talent"
),
"head": "Adam",
"head_type": "Person",
"relation": "HAS_AWARD",
"tail": "Best Talent",
"tail_type": "Award",
},
...
]

In this approach, there’s currently no support for adding custom few-shot examples or extra instructions. The only way to customize is by modifying the entire prompt through the promptattribute. Expanding customization options is something we’re actively considering.

Next, we’ll take a look at defining the graph schema.

When using the LLM Graph Transformer for information extraction, defining a graph schema is essential for guiding the model to build meaningful and structured knowledge representations. A well-defined graph schema specifies the types of nodes and relationships to be extracted, along with any attributes associated with each. This schema serves as a blueprint, ensuring that the LLM consistently extracts relevant information in a way that aligns with the desired knowledge graph structure.

In this blog post, we’ll use the opening paragraph of Marie Curie’s Wikipedia page for testing with an added sentence at the end about Robin Williams.

from langchain_core.documents import Documenttext = """
Marie Curie, 7 November 1867 – 4 July 1934, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
Also, Robin Williams.
"""
documents = [Document(page_content=text)]

We’ll also be using GPT-4o in all examples.

from langchain_openai import ChatOpenAI
import getpass
import osos.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI api key")
llm = ChatOpenAI(model='gpt-4o')

To start, let’s examine how the extraction process works without defining any graph schema.

from langchain_experimental.graph_transformers import LLMGraphTransformerno_schema = LLMGraphTransformer(llm=llm)

Now we can process the documents using the aconvert_to_graph_documents function, which is asynchronous. Using async with LLM extraction is recommended, as it allows for parallel processing of multiple documents. This approach can significantly reduce wait times and improve throughput, especially when dealing with multiple documents.

data = await no_schema.aconvert_to_graph_documents(documents)

The response from the LLM Graph Transformer will be a graph document, which has the following structure:

[
GraphDocument(
nodes=[
Node(id="Marie Curie", type="Person", properties={}),
Node(id="Pierre Curie", type="Person", properties={}),
Node(id="Nobel Prize", type="Award", properties={}),
Node(id="University Of Paris", type="Organization", properties={}),
Node(id="Robin Williams", type="Person", properties={}),
],
relationships=[
Relationship(
source=Node(id="Marie Curie", type="Person", properties={}),
target=Node(id="Nobel Prize", type="Award", properties={}),
type="WON",
properties={},
),
Relationship(
source=Node(id="Marie Curie", type="Person", properties={}),
target=Node(id="Nobel Prize", type="Award", properties={}),
type="WON",
properties={},
),
Relationship(
source=Node(id="Marie Curie", type="Person", properties={}),
target=Node(
id="University Of Paris", type="Organization", properties={}
),
type="PROFESSOR",
properties={},
),
Relationship(
source=Node(id="Pierre Curie", type="Person", properties={}),
target=Node(id="Nobel Prize", type="Award", properties={}),
type="WON",
properties={},
),
],
source=Document(
metadata={"id": "de3c93515e135ac0e47ca82a4f9b82d8"},
page_content="\nMarie Curie, 7 November 1867 – 4 July 1934, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.\nShe was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.\nHer husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.\nShe was, in 1906, the first woman to become a professor at the University of Paris.\nAlso, Robin Williams!\n",
),
)
]

The graph document describes extracted nodes and relationships . Additionally, the source document of the extraction is added under the source key.

We can use the Neo4j Browser to visualize the outputs, providing a clearer and more intuitive understanding of the data.