4.1 Experimental Setup
The setup for both methods is described below:
Graph R-CNN: Faster R-CNN with a VGG16 backbone, implemented in PyTorch, is used for object detection. The RePN is implemented as a multi-layer perceptron that estimates a relatedness score using two projection functions, one for the subject and one for the object of a candidate relation. Two aGCN layers are used: one at the feature level, whose output is passed to a second layer at the semantic level. Training proceeds in two stages: first only the object detector is trained, and then the whole model is trained jointly.
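The RePN scoring step described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the two MLP projection functions are reduced to single linear maps, and all dimensions and weights are made up for the example. The core idea, projecting subject and object features separately and taking the sigmoid of their dot product as a relatedness score, follows the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_proj = 512, 64  # assumed feature / projection dimensions (illustrative)

# Stand-ins for the two learned projection functions (MLPs in the paper):
W_subj = rng.standard_normal((d_in, d_proj)) * 0.01  # subject projection
W_obj = rng.standard_normal((d_in, d_proj)) * 0.01   # object projection

def relatedness(x_i, x_j):
    """Score how likely a relation exists from subject i to object j."""
    s = x_i @ W_subj                         # project subject feature
    o = x_j @ W_obj                          # project object feature
    return 1.0 / (1.0 + np.exp(-(s @ o)))   # sigmoid of the dot product

x_i = rng.standard_normal(d_in)  # subject region feature
x_j = rng.standard_normal(d_in)  # object region feature
score = relatedness(x_i, x_j)    # a value in (0, 1)
```

In the full model, these scores are computed for all object pairs and used to prune unlikely relation candidates before the aGCN layers.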
MotifNet: Input images are zero-padded to a size of 592×592 before being fed into the bounding box detector. All LSTM layers use highway connections. Two and four alternating highway LSTM layers are used for the object context and the edge context, respectively. The bounding box regions can be ordered in several ways: by central x-coordinate, by maximum non-background prediction, by bounding box size, or by random shuffling.
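One of the ordering schemes above, sorting regions left to right by central x-coordinate, can be sketched in a few lines. This is an illustrative snippet, not MotifNet's code; the (x1, y1, x2, y2) box format is an assumption.

```python
def order_by_center_x(boxes):
    """Sort boxes (x1, y1, x2, y2) by the x-coordinate of their centre."""
    return sorted(boxes, key=lambda b: (b[0] + b[2]) / 2.0)

# Three example boxes with centre x-coordinates 70, 10, and 35:
boxes = [(50, 10, 90, 60), (0, 0, 20, 20), (30, 5, 40, 50)]
ordered = order_by_center_x(boxes)
print(ordered)  # [(0, 0, 20, 20), (30, 5, 40, 50), (50, 10, 90, 60)]
```

The resulting sequence determines the order in which region features are fed to the context LSTMs.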
The main challenge is to compare the models within a common dataset framework, as different approaches use different data preprocessing, splits, and evaluation protocols. However, both discussed approaches, Graph R-CNN and MotifNet, use the publicly available data-processing scheme and split from [7]. This Visual Genome dataset [4] contains 150 object classes and 50 relation classes.
Visual Genome Dataset [4] in a nutshell:
Human Annotated Images
More than 100,000 images
150 Object Classes
50 Relation Classes
Each image contains on average 11.5 objects and 6.2 relationships in its scene graph
4.2 Experimental Results
Quantitative Comparison: Both methods evaluate their models using the recall metric. Table 1 compares the two methods across different quantitative indicators: (1) Predicate Classification (PredCls) measures the performance of recognizing the relation between objects; (2) Phrase Classification (PhrCls), called scene graph classification in [9], measures the ability to predict the categories of both objects and relations; (3) Scene Graph Generation (SGGen), called scene graph detection in [9], measures the performance of combining detected objects with the relations detected among them. In [8], the latter metric is extended to a comprehensive SGGen (SGGen+) that accounts for near-miss scenarios: if, for example, a man is detected as a boy, the detection is technically a failure, but if all relations involving this object are detected correctly, the result should qualitatively be considered a success, which increases the SGGen metric value.
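The recall metric underlying all three indicators can be sketched as follows. This is a simplified, hypothetical version: a ground-truth triplet counts as recalled if it appears among the model's top-K predicted (subject, predicate, object) triplets. Real evaluations additionally match bounding boxes by IoU, which is omitted here.

```python
def recall_at_k(predicted, ground_truth, k):
    """Fraction of ground-truth triplets found in the top-k predictions.

    `predicted` is assumed to be sorted by confidence, highest first.
    """
    top_k = set(predicted[:k])
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)

# Illustrative ground truth and ranked predictions for one image:
gt = [("man", "wearing", "shirt"), ("man", "riding", "horse")]
preds = [("man", "wearing", "shirt"),
         ("horse", "on", "grass"),
         ("man", "riding", "horse"),
         ("man", "near", "horse")]

print(recall_at_k(preds, gt, 2))  # 0.5: one of two GT triplets in the top 2
print(recall_at_k(preds, gt, 3))  # 1.0: both GT triplets in the top 3
```

PredCls, PhrCls, and SGGen differ only in how much of the triplet (boxes, labels, predicates) the model must predict itself versus being given as ground truth.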
According to Table 1, MotifNet [9] performs comparatively better when objects, edges, and relation labels are analyzed separately. However, generating the entire graph for a given image is more accurate with the second approach, Graph R-CNN [8]. The results also indicate that the comprehensive output metric provides a better analysis of a scene graph model.
Qualitative Comparison: In the neural motifs work [9], qualitative results are considered separately. For instance, detecting the relation edge wearing as wears falls under the category of failed detection, which shows that the model of [9] performs better than the metric values alone suggest. On the other hand, [8] builds this insight directly into the comprehensive SGGen (SGGen+) metric, which already takes such not-quite-failed detections into account.