Entity Type Prediction with Relational Graph Convolutional Network (PyTorch) | by Tiddo Loos



In this chapter, the Python setup is discussed. The entire code, including the setup and run commands, can be found on GitHub.

Graph triple storage

Let’s first dive into the data structure of a graph and its triple stores. A common file type to store graph data is the ‘N-triple’ format with the file extension ‘.nt’. Figure 1 displays an example graph file (example.nt) and Figure 2 is the visualization of the graph data.

Figure 1: example.nt
Figure 2: visualization of example.nt

For the sake of clarity in the visualization of example.nt, the rdf:type relation is indicated with a dotted line. In Figure 2 we see that Tarantino has two type labels, while Kill Bill and Pulp Fiction each have only one. This will be important later, when choosing the activation and loss functions.
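To make the N-triple format concrete, a file like example.nt could contain triples along these lines. The URIs are illustrative, not the actual contents of the file:

```
<http://example.org/Tarantino> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .
<http://example.org/Tarantino> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Director> .
<http://example.org/Tarantino> <http://example.org/directed> <http://example.org/Kill_Bill> .
<http://example.org/Tarantino> <http://example.org/directed> <http://example.org/Pulp_Fiction> .
<http://example.org/Kill_Bill> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Movie> .
<http://example.org/Pulp_Fiction> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Movie> .
```

Each line is one triple of the form `<subject> <predicate> <object> .`, terminated by a space and a dot.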

Storing nodes, relations and node labels

To create and store the important graph information, the Graph class in graph.py was created.

import torch

from collections import defaultdict
from torch import Tensor

class Graph:

    RDF_TYPE = '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>'

    def __init__(self) -> None:
        self.graph_triples: list = None
        self.node_types: dict = defaultdict(set)
        self.enum_nodes: dict = None
        self.enum_relations: dict = None
        self.enum_classes: list = None
        self.edge_index: Tensor = None
        self.edge_type: Tensor = None

The rdf:type relation is hard-coded so that it can later be removed from the relation set. Furthermore, variables are created to store important graph information. To use the graph data, we need to parse the '.nt' file and store its contents. There are libraries, such as RDFLib, that can help with this and offer other graph functionalities. However, I found that RDFLib does not scale well to larger graphs. Therefore, new code was created to parse the file. To read and store the RDF triples from a '.nt' file, the function below was added to the Graph class.

def get_graph_triples(self, file_path: str) -> None:
    with open(file_path, 'r') as file:
        self.graph_triples = file.read().splitlines()

The above function stores a list of strings in self.graph_triples: [ ‘<entity> <predicate> <entity> .’,…,‘<entity> <predicate> <entity> .’]. The next step is to store all distinct graph nodes and predicates and to store the node labels.

def init_graph(self, file_path: str) -> None:
    '''initialize graph object by creating and storing important graph variables'''

    # give the command to store the graph triples
    self.get_graph_triples(file_path)

    # variables to store all entities and predicates
    subjects = set()
    predicates = set()
    objects = set()

    # object that can later be printed to get insight into class (im)balance
    class_count = defaultdict(int)

    # loop over each graph triple and split 2 times on space: ' '
    for triple in self.graph_triples:
        triple_list = triple[:-2].split(' ', maxsplit=2)

        # skip triple if there is a blank line in the .nt file
        if triple_list != ['']:
            s, p, o = triple_list[0].lower(), triple_list[1].lower(), triple_list[2].lower()

            # add nodes and predicates
            subjects.add(s)
            predicates.add(p)
            objects.add(o)

            # check if subject is a valid entity and check if predicate is rdf:type
            if str(s).split('#')[0] != 'http://swrc.ontoware.org/ontology' \
                    and str(p) == self.RDF_TYPE.lower():
                class_count[str(o)] += 1
                self.node_types[s].add(o)

    # create a list with all nodes and then enumerate the nodes
    nodes = list(subjects.union(objects))
    self.enum_nodes = {node: i for i, node in enumerate(sorted(nodes))}

    # remove the rdf:type relation since we would like to predict the types,
    # then enumerate the remaining relations and save as dict
    predicates.remove(self.RDF_TYPE)
    self.enum_relations = {rel: i for i, rel in enumerate(sorted(predicates))}

    # enumerate classes
    self.enum_classes = {lab: i for i, lab in enumerate(class_count.keys())}

    # if you want to: print the class occurrence dict to get insight into class (im)balance
    # print(class_count)

In self.node_types the label(s) for each node are stored. The value for each node is the set of labels. Later this dictionary is used to vectorize node labels. Now, let’s look at the loop over self.graph_triples. We create a triple_list with triple[:-2].split(‘ ‘, maxsplit=2). In triple_list we now have: [‘<entity>’, ‘<predicate>’, ‘<entity>’]. The subject, predicate and object are stored in the designated subjects, predicates and objects sets. Then, if the subject was a valid entity with an rdf:type predicate and type label, the node and its label are added with self.node_types[s].add(o).
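The slicing-and-splitting step can be checked on a single line. The URI below is made up for illustration:

```python
# illustrative triple; the URIs are not from the actual dataset
triple = '<http://example.org/tarantino> <http://example.org/directed> <http://example.org/kill_bill> .'

# strip the trailing ' .' and split on the first two spaces only,
# so that objects containing spaces stay intact
s, p, o = triple[:-2].split(' ', maxsplit=2)

print(s)  # <http://example.org/tarantino>
print(p)  # <http://example.org/directed>
print(o)  # <http://example.org/kill_bill>
```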

From the subjects, predicates and objects sets, the dictionaries self.enum_nodes and self.enum_relations are created, which store nodes and predicates as keys respectively. In these dictionaries the keys are enumerated with integers, and each integer is stored as the value for its key. The rdf:type relation is removed from the predicates set before storing the numbered relations in self.enum_relations. This is done because we do not want our model to train on the rdf:type relation. Otherwise, the node embeddings would be influenced by the type labels during each node update, which is prohibited as it would leak label information into the prediction task.

Creating edge_index and edge_type

With the stored graph nodes and relations we can create the edge_index and edge_type tensors. The edge_index is a tensor that indicates which nodes are connected. The edge_type tensor stores by which relation the nodes are connected. Important to note: to allow the model to pass messages in two directions, the edge_index and edge_type also include the inverse of each edge [4][5]. This enables updating each node representation through both incoming and outgoing edges. The code to create the edge_index and edge_type is displayed below.

def create_edge_data(self) -> None:
    '''create edge_index and edge_type'''

    edge_list: list = []

    for triple in self.graph_triples:
        triple_list = triple[:-2].split(' ', maxsplit=2)
        if triple_list != ['']:
            s, p, o = triple_list[0].lower(), triple_list[1].lower(), triple_list[2].lower()

            # if p is RDF_TYPE, it is not stored
            if self.enum_relations.get(p) is not None:

                # create edge list and also add the inverse of each edge
                src, dst, rel = self.enum_nodes[s], self.enum_nodes[o], self.enum_relations[p]
                edge_list.append([src, dst, 2 * rel])
                edge_list.append([dst, src, 2 * rel + 1])

    edges = torch.tensor(edge_list, dtype=torch.long).t()  # shape (3, 2 * number_of_edges - number_of_rdf:type_edges)
    self.edge_index = edges[:2]
    self.edge_type = edges[2]

In the code above, we start by looping over the graph triples as before. Then we check whether the predicate p can be found. If not, the predicate is the rdf:type predicate, which was not stored, so the triple is not included in the edge data. If the predicate is stored in self.enum_relations, the corresponding integers for the subject, predicate and object are assigned to src, dst and rel respectively. The edges and inverse edges are added to edge_list. A distinct integer for each non-inverse relation is created with 2*rel; for the inverse edge, the distinct integer is created with 2*rel+1.
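On a minimal two-node graph, the construction above produces the following tensors. The enumeration dicts here are hypothetical stand-ins for the attributes of Graph:

```python
import torch

# hypothetical enumerations for a two-node, one-relation graph
enum_nodes = {'<a>': 0, '<b>': 1}
enum_relations = {'<knows>': 0}

edge_list = []
src, dst, rel = enum_nodes['<a>'], enum_nodes['<b>'], enum_relations['<knows>']
edge_list.append([src, dst, 2 * rel])      # forward edge, relation id 0
edge_list.append([dst, src, 2 * rel + 1])  # inverse edge, relation id 1

edges = torch.tensor(edge_list, dtype=torch.long).t()
edge_index = edges[:2]  # row 0: source nodes, row 1: target nodes
edge_type = edges[2]    # one relation id per edge

print(edge_index.tolist())  # [[0, 1], [1, 0]]
print(edge_type.tolist())   # [0, 1]
```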

Create training data

Below the class TrainingData of trainingdata.py is displayed. This class creates and stores training, validation and test data for the entity type prediction task.

import torch

from dataclasses import dataclass
from sklearn.model_selection import train_test_split
from torch import Tensor

from graph import Graph

@dataclass
class TrainingData:
    '''class to create and store training data'''
    x_train: Tensor = None
    y_train: Tensor = None
    x_val: Tensor = None
    y_val: Tensor = None
    x_test: Tensor = None
    y_test: Tensor = None

def create_training_data(self, graph: Graph) -> None:
    train_indices: list = []
    train_labels: list = []

    for node, types in graph.node_types.items():
        # create list with zeros
        labels = [0 for _ in range(len(graph.enum_classes.keys()))]
        for t in types:
            # assign 1.0 to the correct index with the class number
            labels[graph.enum_classes[t]] = 1.0
        train_indices.append(graph.enum_nodes[node])
        train_labels.append(labels)

    # create the train, validation and test splits
    x_train, x_test, y_train, y_test = train_test_split(train_indices,
                                                        train_labels,
                                                        test_size=0.2,
                                                        random_state=1,
                                                        shuffle=True)
    x_train, x_val, y_train, y_val = train_test_split(x_train,
                                                      y_train,
                                                      test_size=0.25,
                                                      random_state=1,
                                                      shuffle=True)

    self.x_train = torch.tensor(x_train)
    self.x_test = torch.tensor(x_test)
    self.x_val = torch.tensor(x_val)
    self.y_val = torch.tensor(y_val)
    self.y_train = torch.tensor(y_train)
    self.y_test = torch.tensor(y_test)

To create the training data, train_test_split from sklearn.model_selection is used. Important to note is that only the indices of nodes that actually have an entity type are included in the training data. This matters when interpreting the overall performance of the model, as untyped nodes are never evaluated.
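The label vectorization loop can be illustrated in isolation. The enum_classes and node_types dicts below are hypothetical stand-ins for the attributes of Graph:

```python
# hypothetical class enumeration and node labels, mirroring
# graph.enum_classes and graph.node_types
enum_classes = {'<person>': 0, '<director>': 1, '<movie>': 2}
node_types = {
    '<tarantino>': {'<person>', '<director>'},
    '<kill_bill>': {'<movie>'},
}

train_labels = []
for node, types in node_types.items():
    labels = [0.0] * len(enum_classes)
    for t in types:
        labels[enum_classes[t]] = 1.0  # multi-hot: one slot per type class
    train_labels.append(labels)

print(train_labels)  # [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Each node thus gets a multi-hot vector, with a 1.0 for every type label it carries.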

RGCNConv

In model.py a model setup is proposed with layers from PyTorch and PyTorch Geometric. Below, a copy of the code is included:

import torch

from torch import nn
from torch import Tensor, LongTensor
from torch_geometric.nn import RGCNConv

class RGCNModel(nn.Module):
    def __init__(self, num_nodes: int,
                 emb_dim: int,
                 hidden_l: int,
                 num_rels: int,
                 num_classes: int) -> None:

        super(RGCNModel, self).__init__()
        self.embedding = nn.Embedding(num_nodes, emb_dim)
        self.rgcn1 = RGCNConv(in_channels=emb_dim,
                              out_channels=hidden_l,
                              num_relations=num_rels,
                              num_bases=None)
        self.rgcn2 = RGCNConv(in_channels=hidden_l,
                              out_channels=num_classes,
                              num_relations=num_rels,
                              num_bases=None)

        # initialize weights
        nn.init.kaiming_uniform_(self.rgcn1.weight, mode='fan_out', nonlinearity='relu')
        nn.init.kaiming_uniform_(self.rgcn2.weight, mode='fan_out', nonlinearity='relu')

    def forward(self, edge_index: LongTensor, edge_type: LongTensor) -> Tensor:
        x = self.rgcn1(self.embedding.weight, edge_index, edge_type)
        x = torch.relu(x)
        x = self.rgcn2(x, edge_index, edge_type)
        x = torch.sigmoid(x)
        return x

Besides the RGCNConv layers of PyTorch, the nn.Embedding layer is utilized. This layer creates an embedding tensor with a gradient. As the embedding tensor contains a gradient, it will be updated in backpropagation.
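A quick check, with arbitrary dimensions, shows that the embedding weight is indeed a learnable parameter:

```python
import torch
from torch import nn

# arbitrary dimensions for illustration: 4 nodes, embedding size 3
emb = nn.Embedding(num_embeddings=4, embedding_dim=3)

# the weight matrix is an nn.Parameter, so it participates in backpropagation
print(emb.weight.shape)          # torch.Size([4, 3])
print(emb.weight.requires_grad)  # True
```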

Two layers of R-GCN with a ReLU activation in between are used. This setup is proposed in the literature [4][5]. As explained earlier, stacking two layers allows for node updates that take the node representations over two hops into account. The output of the first R-GCN layer contains an updated representation for each node, based on its adjacent nodes. Because this output is passed to the second layer, the node update in the second layer incorporates representations that were already updated in the first. Therefore, each node is updated with information from two hops.

In the forward pass, the Sigmoid activation is used over the output of the second R-GCN layer, because entities can have multiple type labels (multi-label classification). Each type class should be predicted for separately. When multiple labels can apply, the Sigmoid activation is desired, as we want to make a prediction for each label independently. If we only needed to predict the single most likely label, Softmax would be the better option.
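The difference can be made concrete with a small sketch: for the same (made-up) logits, Sigmoid can flag several labels at once, whereas Softmax distributes a single unit of probability mass over all classes:

```python
import torch

# made-up logits for three type classes
logits = torch.tensor([2.0, 1.5, -3.0])

# sigmoid: each class gets an independent probability; several can exceed 0.5
sigmoid_probs = torch.sigmoid(logits)

# softmax: probabilities compete and sum to 1; suited to single-label tasks
softmax_probs = torch.softmax(logits, dim=0)

print((sigmoid_probs > 0.5).sum().item())    # 2 -> two labels predicted
print(round(softmax_probs.sum().item(), 4))  # 1.0
```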

Train the R-GCN model

To train the R-GCN model, the ModelTrainer class was created in train.py. __init__ stores the model and training parameters. Furthermore, the functions train_model() and compute_f1() are part of the class:

import torch

from sklearn.metrics import f1_score
from torch import nn, Tensor
from typing import List, Tuple

from graph import Graph
from trainingdata import TrainingData
from model import RGCNModel
from plot import plot_results

class ModelTrainer:
    def __init__(self,
                 model: nn.Module,
                 epochs: int,
                 lr: float,
                 weight_d: float) -> None:

        self.model = model
        self.epochs = epochs
        self.lr = lr
        self.weight_d = weight_d

    def compute_f1(self, graph: Graph, x: Tensor, y_true: Tensor) -> float:
        '''evaluate the model with the F1 samples metric'''
        pred = self.model(graph.edge_index, graph.edge_type)
        pred = torch.round(pred)
        y_pred = pred[x]
        # the f1_score function does not accept a torch tensor with a gradient
        y_pred = y_pred.detach().numpy()
        f1_s = f1_score(y_true, y_pred, average='samples', zero_division=0)
        return f1_s

    def train_model(self, graph: Graph, training_data: TrainingData) -> Tuple[List[float], List[float]]:
        '''loop to train the pytorch R-GCN model'''

        optimizer = torch.optim.Adam(self.model.parameters(), lr=self.lr, weight_decay=self.weight_d)
        loss_f = nn.BCELoss()

        f1_ss: list = []
        losses: list = []

        for epoch in range(self.epochs):

            # evaluate the model
            self.model.eval()
            f1_s = self.compute_f1(graph, training_data.x_val, training_data.y_val)
            f1_ss.append(f1_s)

            # train the model
            self.model.train()
            optimizer.zero_grad()
            out = self.model(graph.edge_index, graph.edge_type)
            output = loss_f(out[training_data.x_train], training_data.y_train)
            output.backward()
            optimizer.step()
            l = output.item()
            losses.append(l)

            # every tenth epoch print the loss and F1 score
            if epoch % 10 == 0:
                print(f'Epoch: {epoch}, Loss: {l:.4f}\n',
                      f'F1 score on validation set: {f1_s:.2f}')

        return losses, f1_ss

Let’s discuss some important aspects of train_model(). For calculating the loss, the Binary Cross Entropy Loss (BCELoss) is used. BCELoss is a suitable loss for multi-label classification combined with a Sigmoid activation on the output layer, as it calculates the loss for each predicted label against its true label separately. It therefore treats each output unit of the model independently. This is desired, as a node can have multiple entity types (Figure 2: Tarantino is a person and a director). However, if the graph only contained nodes with one entity type, Softmax with a Categorical Cross Entropy Loss would be the better choice.
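A minimal sketch of BCELoss on a single multi-hot target (the numbers are made up): each of the three output units contributes its own binary cross entropy term, and the terms are averaged.

```python
import torch
from torch import nn

loss_f = nn.BCELoss()

# made-up predicted probabilities (after a sigmoid) and multi-hot ground truth
pred = torch.tensor([[0.9, 0.8, 0.1]])
target = torch.tensor([[1.0, 1.0, 0.0]])

# average of -[y*log(p) + (1-y)*log(1-p)] over all three output units
loss = loss_f(pred, target)
print(loss.item() < 0.2)  # True: predictions are close to the labels
```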

Another important aspect is the evaluation of the prediction performance. The F1 score is a suitable metric, as there are multiple classes to predict and they may occur in an imbalanced fashion. Imbalanced data means that some classes are represented more often than others, which can result in a skewed performance picture if only a few type classes are predicted well. It is therefore desirable to include precision and recall in the performance evaluation, which the F1 score does. The f1_score() of sklearn.metrics is used with average='samples': the F1 score is calculated per node over its set of true and predicted labels, and these per-sample scores are then averaged over all nodes, resulting in the samples-F1 score.
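A small sketch of the metric, using made-up multi-hot predictions for three nodes:

```python
import numpy as np
from sklearn.metrics import f1_score

# made-up multi-hot ground truth and rounded predictions for three nodes
y_true = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])

# average='samples': F1 per node over its label set, then averaged across nodes
# node 1: F1 = 2/3; nodes 2 and 3: F1 = 1.0; mean = 8/9
f1_s = f1_score(y_true, y_pred, average='samples', zero_division=0)
print(round(f1_s, 2))  # 0.89
```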

Start training

In the data folder on GitHub, there are an example graph (example.nt) and a larger graph called AIFB[7] (AIFB.nt). This dataset, amongst others, is often used in research[5][6] on R-GCNs. To start training the model, the following code is included in train.py:

if __name__ == '__main__':
    file_path = './data/AIFB.nt'  # adjust to use another dataset
    graph = Graph()
    graph.init_graph(file_path)
    graph.create_edge_data()
    graph.print_graph_statistics()

    training_data = TrainingData()
    training_data.create_training_data(graph)

    # training parameters
    emb_dim = 50
    hidden_l = 16
    epochs = 51
    lr = 0.01
    weight_d = 0.00005

    model = RGCNModel(len(graph.enum_nodes.keys()),
                      emb_dim,
                      hidden_l,
                      2 * len(graph.enum_relations.keys()) + 1,  # remember the inverse relations in the edge data
                      len(graph.enum_classes.keys()))

    trainer = ModelTrainer(model, epochs, lr, weight_d)
    losses, f1_ss = trainer.train_model(graph, training_data)

    plot_results(epochs, losses, title='BCELoss on training set during epochs', y_label='Loss')
    plot_results(epochs, f1_ss, title='F1 score on validation set during epochs', y_label='F1 samples')

    # evaluate the model on the test set and print the result
    f1_s_test = trainer.compute_f1(graph, training_data.x_test, training_data.y_test)
    print(f'F1 score on test set = {f1_s_test}')

To set up an environment and run the code, I refer to the README in the repository on GitHub. Running the code will yield two plots: one with the BCELoss on the training set and one with the F1 score on the validation set.

Figure 3: BCELoss during training epochs
Figure 4: samples-F1 score on the validation set during training epochs

If you have any comments or questions, please get in touch!
