Graph & Geometric ML in 2024: Where We Are and What’s Next (Part I — Theory & Architectures) | by Michael Galkin | Jan, 2024



Michael Bronstein (Oxford), Francesco Di Giovanni (Oxford), İsmail İlkan Ceylan (Oxford), Chris Morris (RWTH Aachen)

Message Passing Neural Networks & Graph Transformers

Graph Transformers are a relatively recent trend in graph ML, trying to extend the successes of Transformers from sequences to graphs. As far as traditional expressivity results go, these architectures do not offer any particular advantages. In fact, it is arguable that most of their expressivity benefits (see, e.g., Kreuzer et al.) come from powerful structural encodings rather than the architecture itself, and such encodings can, in principle, also be used with MPNNs.

In a recent paper, Cai et al. investigate the connection between MPNNs and (graph) Transformers, showing that an MPNN with a virtual node — an auxiliary node that is connected to all other nodes in a specific way — can simulate a (graph) Transformer. This simulation is non-uniform, i.e., the size and structure of the neural networks may depend on the size of the input graphs. Interestingly, once we restrict our attention to linear Transformers (e.g., Performer), there is a uniform result: there exists a single MPNN with a virtual node that can approximate a linear Transformer such as Performer on inputs of any size.

Figure from Cai et al.: (a) MPNN with a virtual node, (b) a Transformer.
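To make the virtual-node construction concrete, here is a minimal numpy sketch (this is not Cai et al.'s actual simulation argument; `add_virtual_node` and `mpnn_layer` are hypothetical helpers): even when two nodes lie in different connected components, two rounds of message passing through the virtual node let them exchange information.

```python
import numpy as np

def add_virtual_node(A):
    """Append an auxiliary node connected to every other node (hypothetical helper)."""
    n = A.shape[0]
    A_vn = np.zeros((n + 1, n + 1))
    A_vn[:n, :n] = A
    A_vn[:n, n] = A_vn[n, :n] = 1  # virtual node <-> every original node
    return A_vn

def mpnn_layer(A, H, W):
    """One message-passing step: aggregate neighbours, then transform."""
    return np.tanh(A @ H @ W)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)  # edge 0-1; node 2 is isolated
A_vn = add_virtual_node(A)

H = rng.standard_normal((4, 8))   # 3 original nodes + 1 virtual node
W = rng.standard_normal((8, 8))
H = mpnn_layer(A_vn, mpnn_layer(A_vn, H, W), W)
# In the original graph, node 2 never hears from node 0; with the virtual
# node there is a length-2 path 0 -> vn -> 2, so two layers suffice.
```

The virtual node thus gives an MPNN the same "everyone talks to everyone" reach that attention provides in a Transformer, at the cost of funnelling all global communication through a single vector.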

This is related to ongoing discussions on whether graph Transformer architectures offer advantages for capturing long-range dependencies compared to MPNNs. A common practice is to compare graph Transformers against MPNNs that include a global computation component through virtual nodes. Cai et al. empirically show that MPNNs with virtual nodes can surpass the performance of graph Transformers on the Long-Range Graph Benchmark (LRGB, Dwivedi et al.). Moreover, Tönshoff et al. re-evaluated the MPNN baselines on LRGB and found that the previously reported performance gap in favor of graph Transformers was overestimated due to suboptimal hyperparameter choices, essentially closing the gap between MPNNs and graph Transformers.

Figure from Lim et al.: SignNet pipeline.

It is also well-known that common Laplacian positional encodings (e.g., LapPE) are not invariant to sign flips and basis changes of the eigenvectors. The lack of invariance makes it easier to obtain (non-uniform) universality results, but, as a consequence, these models do not compute graph invariants. This has motivated a body of work this year, including the study of sign- and basis-invariant networks (Lim et al., 2023a) and sign-equivariant networks (Lim et al., 2023b). These findings suggest that more research is necessary to theoretically ground the claims commonly found in the literature comparing MPNNs and graph Transformers.
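The sign ambiguity is easy to see numerically. Below is a small, self-contained demonstration (assuming the combinatorial Laplacian and a LapPE-style encoding built from the smallest non-trivial eigenvectors): on a 4-cycle, flipping the sign of an eigenvector yields an equally valid encoding, and the repeated eigenvalue additionally leaves a free choice of basis.

```python
import numpy as np

# Adjacency of a 4-cycle 0-1-2-3-0
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A            # combinatorial Laplacian
eigvals, eigvecs = np.linalg.eigh(L) # eigenvalues 0, 2, 2, 4

# LapPE-style encoding: the two smallest non-trivial eigenvectors.
# Both belong to the repeated eigenvalue 2, so the basis of this
# eigenspace is itself arbitrary, on top of the sign ambiguity.
pe = eigvecs[:, 1:3]
pe_flipped = pe * np.array([-1, 1])  # flip one eigenvector's sign

# pe and pe_flipped are equally valid eigenvector choices, yet a model
# consuming the raw coordinates sees two different inputs. A sign-invariant
# network in the spirit of SignNet would instead consume f(v) + f(-v).
```

This is exactly the non-canonicity that sign/basis-invariant and sign-equivariant architectures are designed to remove.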

Graph components, biconnectivity, and planarity

Figure originally by Zyqqh at Wikipedia.

Zhang et al. (2023a) bring the study of graph biconnectivity to the attention of the graph ML community, presenting many results relative to different biconnectivity metrics. They show that standard MPNNs cannot detect graph biconnectivity, unlike many existing higher-order models (i.e., those that can match the power of 2-FWL). On the other hand, Graphormers with certain distance encodings and subgraph GNNs such as ESAN can detect graph biconnectivity.

Figure from Dimitrov et al. (2023): LHS shows the graph decompositions (A-C) and RHS shows the associated encoders (D-F) and the update equation (G).

Dimitrov et al. (2023) rely on graph decompositions to develop dedicated architectures for learning with planar graphs. The idea is to align with a variation of the classical Hopcroft & Tarjan algorithm for planar isomorphism testing. Dimitrov et al. (2023) first decompose the graph into its biconnected and triconnected components, and afterwards learn representations for nodes, cut nodes, biconnected components, and triconnected components. This is achieved using the classical structures of Block-Cut Trees and SPQR Trees, which can be computed in linear time. The resulting framework is called PlanE and contains architectures such as BasePlanE. BasePlanE computes isomorphism-complete graph invariants and hence can distinguish any pair of non-isomorphic planar graphs. The key contribution of this work is to design architectures for efficiently learning complete invariants of planar graphs while remaining practically scalable. It is worth noting that 3-FWL is known to be complete on planar graphs (Kiefer et al., 2019), but this algorithm is not scalable.
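The first step of such a decomposition, finding the cut (articulation) nodes that separate biconnected components, runs in linear time with a standard depth-first search. Here is a minimal sketch (a Tarjan-style lowpoint computation, not the PlanE pipeline itself; the adjacency-dict input format is an assumption for this example):

```python
def articulation_points(adj):
    """Cut vertices of an undirected graph via Tarjan's linear-time DFS.
    adj: dict mapping each node to a list of its neighbours."""
    disc, low, cuts = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:                      # back edge: update lowpoint
                low[u] = min(low[u], disc[v])
            else:                              # tree edge: recurse
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # no back edge from v's subtree climbs above u => u is a cut node
                if parent is not None and low[v] >= disc[u]:
                    cuts.add(u)
        if parent is None and children > 1:    # root is a cut node iff >1 child
            cuts.add(u)

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return cuts

# Two triangles sharing vertex 2: the triangles are the biconnected
# components and vertex 2 is the unique cut node gluing them together.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
```

The Block-Cut Tree used by PlanE is then the bipartite tree whose nodes are these cut vertices and the biconnected components they separate.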

Aggregation functions: A uniform expressiveness study

It was broadly argued that different aggregation functions have their place, but this had not been rigorously proven. In fact, in the non-uniform setup, sum aggregation with MLPs yields an injective mapping and as a result subsumes other aggregation functions (Xu et al., 2020), which builds on earlier results (Zaheer et al., 2017). The situation is different in the uniform setup, where one fixed model is required to work on all graphs. Rosenbluth et al. (2023) show that sum aggregation does not always subsume other aggregations in the uniform setup. If, for example, we consider an unbounded feature domain, sum aggregation networks cannot even approximate mean aggregation networks. Interestingly, even for the positive results, where sum aggregation is shown to approximate other aggregations, the presented constructions generally require a large number of layers (growing with the inverse of the approximation error).
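A toy illustration of the intuition (not the formal argument of Rosenbluth et al.): in the uniform setup, one fixed readout of the sum must cope with aggregates that grow without bound as the multiset grows, while the mean target stays fixed, and the sum alone does not even determine the mean.

```python
def sum_agg(xs):
    return sum(xs)

def mean_agg(xs):
    return sum(xs) / len(xs)

# Constant-valued multisets of growing size over an unbounded domain:
multisets = [[3.0] * n for n in (1, 10, 100, 1000)]
sums = [sum_agg(m) for m in multisets]    # grows without bound
means = [mean_agg(m) for m in multisets]  # one fixed target value

# Two multisets with identical sums but different means — a readout that
# only sees the sum (without cardinality) cannot recover the mean:
a, b = [6.0], [3.0, 3.0]
```

Real sum-aggregation MPNNs do see cardinality implicitly through degrees, which is why approximation is sometimes possible; the point of the uniform analysis is that even then the required constructions need depth growing with the inverse of the approximation error.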

Convergence and zero-one laws of GNNs on random graphs

GNNs can in principle be applied to graphs of any size following training. This makes an asymptotic analysis in the size of the input graphs very appealing. Previous studies of the asymptotic behaviour of GNNs have focused on convergence to theoretical limit networks (Keriven et al., 2020) and their stability under the perturbation of large graphs (Levie et al., 2021).

In a recent study, Adam-Day et al. (2023) proved a zero-one law for binary GNN classifiers. The question being tackled is the following: how do binary GNN classifiers behave as we draw Erdős-Rényi graphs of increasing size with random node features? The main finding is that the probability that such graphs are mapped to a particular output by a class of GNN classifiers tends either to zero or to one. That is, the model eventually maps either all graphs to zero or all graphs to one. This result applies to GCNs as well as to GNNs with sum and mean aggregation.

The principal import of this result is that it establishes a novel uniform upper bound on the expressive power of GNNs: any property of graphs which can be uniformly expressed by these GNN architectures must obey a zero-one law. An example of a simple property which does not asymptotically tend to zero or one is that of having an even number of nodes.
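The flavor of the result can be simulated in a few lines. The sketch below is illustrative, not from the paper (the classifier, the threshold 0.4, and all names are invented): one fixed mean-aggregation classifier is applied to Erdős-Rényi graphs G(n, 1/2) with uniform random features, and as n grows the neighbourhood averages concentrate around 0.5, so the acceptance probability drifts to one.

```python
import numpy as np

rng = np.random.default_rng(42)

def fixed_gnn_classifier(A, x, threshold=0.4):
    """A fixed binary classifier: one mean-aggregation layer + mean readout.
    All parameters are frozen — the same model is applied at every graph size."""
    deg = np.maximum(A.sum(1), 1)
    h = (A @ x) / deg                 # mean over neighbours
    return float(h.mean() > threshold)

def erdos_renyi(n, p, rng):
    upper = rng.random((n, n)) < p
    A = np.triu(upper, 1)
    return (A + A.T).astype(float)    # symmetric, zero diagonal

outputs = []
for n in (50, 200, 800):
    preds = [fixed_gnn_classifier(erdos_renyi(n, 0.5, rng), rng.random(n))
             for _ in range(20)]
    outputs.append(sum(preds) / len(preds))
# Neighbourhood means concentrate around E[x] = 0.5 > 0.4, so virtually
# every sampled graph is accepted — the "one" side of the zero-one law.
```

By contrast, the parity of the number of nodes oscillates forever and never converges, which is exactly why no such GNN classifier can express it uniformly.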

The descriptive complexity of GNNs

Grohe (2023) recently analysed the descriptive complexity of GNNs in terms of Boolean circuit complexity. The circuit complexity class of interest is TC0: the class of all languages decided by Boolean circuits of constant depth and polynomial size, using only AND, OR, NOT, and threshold (or majority) gates. Grohe (2023) proves that the graph functions computable by a polynomial-size, bounded-depth family of GNNs lie in TC0. Furthermore, if the GNNs are allowed to use random node initialization and global readout, as in Abboud et al. (2020), then there is a matching lower bound: they can compute exactly the functions expressible in TC0. This establishes an upper bound on the power of GNNs with random node features, by requiring the class of models to be of bounded depth (fixed number of layers) and polynomial size. While this result is still non-uniform, it improves on Abboud et al. (2020), whose construction can be worst-case exponential.
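For readers unfamiliar with TC0, here is a minimal sketch of its gate set: a threshold gate fires when at least t of its inputs are 1, which subsumes AND, OR, and MAJORITY, and circuits compose such gates at constant depth. The `pairs_agree` circuit is an invented toy example, not taken from Grohe (2023).

```python
def threshold_gate(bits, t):
    """Fires (outputs 1) iff at least t of the input bits are 1."""
    return int(sum(bits) >= t)

# AND, OR and MAJORITY are all threshold gates with different t:
def AND(bits):
    return threshold_gate(bits, len(bits))

def OR(bits):
    return threshold_gate(bits, 1)

def MAJ(bits):
    return threshold_gate(bits, (len(bits) + 1) // 2)

# A depth-2 toy circuit: "a majority of the input pairs agree".
# (Python's `==` stands in for a constant-size AND/OR/NOT gadget per pair.)
def pairs_agree(xs, ys):
    agree = [int(a == b) for a, b in zip(xs, ys)]  # layer 1: agreement bits
    return MAJ(agree)                              # layer 2: one threshold gate
```

MAJORITY is the canonical example separating TC0 from the weaker class AC0 (which allows only AND/OR/NOT), which is why threshold gates are the natural circuit counterpart of the summations inside GNN layers.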

A fine-grained expressivity study of GNNs

Numerous recent works have analyzed the expressive power of MPNNs, primarily utilizing combinatorial techniques such as the 1-WL for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary and gives no insight into the degree of similarity between two given graphs. Böker et al. (2023) address this issue by deriving continuous extensions of both 1-WL and MPNNs to graphons. Concretely, they show that the continuous variant of 1-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and how difficult they are to separate. They provide a theoretical framework for graph and graphon similarity, combining various topological variants of classical characterizations of the 1-WL. In particular, they characterize the expressive power of MPNNs in terms of the tree distance, a graph distance based on the concept of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the 1-WL and MPNNs on graphons. Interestingly, they also validate their theoretical findings by showing that randomly initialized MPNNs, without any training, achieve performance competitive with their trained counterparts.

Expressiveness results for Subgraph GNNs

Subgraph-based GNNs were already a big trend in 2022 (Bevilacqua et al., 2022, Qian et al., 2022). This year, Zhang et al. (2023b) established more fine-grained expressivity results for such architectures. The paper investigates subgraph GNNs via the so-called Subgraph Weisfeiler-Leman Tests (SWL). Through this, they show a complete hierarchy of SWL with strictly growing expressivity. Concretely, they define equivalence classes for SWL-type algorithms and show that almost all existing subgraph GNNs fall into one of them. Moreover, the so-called SSWL achieves the maximal expressive power. Interestingly, they also relate SWL to several existing expressive GNN architectures. For example, they show that SWL has the same expressivity as the local versions of 2-WL (Morris et al., 2020). In addition to theory, they also show that SWL-type architectures achieve good empirical results.

The expressive power of architectures such as RGCN and CompGCN for link prediction on knowledge graphs has been studied by Barceló et al. (2022). This year, Huang et al. (2023) generalized these results to characterize the expressive power of various other model architectures.

Figure from Huang et al. (2023): The figure compares the respective mode of operations in R-MPNNs and C-MPNNs.

Huang et al. (2023) introduced the framework of conditional message passing networks (C-MPNNs), which includes architectures such as NBFNet. Classical relational message passing networks (R-MPNNs) are unary encoders (i.e., encoding graph nodes) and rely on a binary decoder for the task of link prediction (Zhang, 2021). C-MPNNs, on the other hand, serve as binary encoders (i.e., encoding pairs of graph nodes) and, as a result, are more suitable for the inherently binary task of link prediction. C-MPNNs are shown to align with a relational Weisfeiler-Leman algorithm that can be seen as a local approximation of 2-WL. These findings explain the superior performance of NBFNet and related models over, e.g., RGCN. Huang et al. (2023) also present uniform expressiveness results in terms of precise logical characterizations for the class of binary functions captured by C-MPNNs.
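The core idea of conditioning can be sketched in a few lines. The following toy, parameter-free caricature is not the actual C-MPNN/NBFNet update (which learns relation-specific messages); it only shows the mechanism: initializing an indicator at the source node and propagating it makes the final representation of v depend on the pair (source, v), here collapsing to a decayed distance encoding.

```python
import numpy as np

def conditional_encoding(A, source, layers=3, decay=0.5):
    """Toy pair encoder: condition on `source` via an indicator initialisation,
    then propagate a damped max-aggregated signal for a few layers.
    Result: h[v] = decay ** dist(source, v) for nodes within `layers` hops."""
    n = A.shape[0]
    h = np.zeros(n)
    h[source] = 1.0                    # the "conditioning" step
    for _ in range(layers):
        neigh_best = np.array([np.max(h[A[v] > 0], initial=0.0)
                               for v in range(n)])
        h = np.maximum(h, decay * neigh_best)
    return h

# Path graph 0-1-2-3
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1

h_from_0 = conditional_encoding(A, source=0)  # encodes the pairs (0, v)
h_from_3 = conditional_encoding(A, source=3)  # encodes the pairs (3, v)
# A unary (R-MPNN-style) encoder would produce the same node features in both
# runs; here the representations change with the source, as a binary encoder
# for link prediction should.
```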

Over-squashing and expressivity

Over-squashing is a phenomenon originally described by Alon & Yahav in 2021 as the compression of exponentially-growing receptive fields into fixed-size vectors. Subsequent research (Topping et al., 2022, Di Giovanni et al., 2023, Black et al., 2023, Nguyen et al., 2023) has characterised over-squashing through sensitivity analysis, proving that the dependence of the output features on hidden representations from earlier layers is impaired by topological properties such as negative curvature or large commute time. Since the graph topology plays a crucial role in the formation of bottlenecks, graph rewiring — a paradigm shift elevating the graph connectivity to a design factor in GNNs — has been proposed as a key strategy for alleviating over-squashing (if you are interested, see the Section on Exotic Message Passing below).

For the given graph, the MPNN learns stronger mixing (tight springs) for the node pairs (v, u) and (u, w), since their commute time is small, while the pairs (u, q) and (u, z), with high commute time, have weak mixing (loose springs). Source: Di Giovanni et al., 2023

Over-squashing is an obstruction to expressive power, since it causes GNNs to falter in tasks with long-range interactions. To formally study this, Di Giovanni et al., 2023 introduce a new metric of expressivity, referred to as "mixing", which encodes the joint and nonlinear dependence of a graph function on pairs of nodes' features: for a GNN to approximate a function with large mixing, a necessary condition is allowing "strong" message exchange between the relevant nodes. Hence, they propose measuring over-squashing through the mixing of a GNN prediction, and prove that the depth required by a GNN to induce enough mixing, as required by the task, grows with the commute time — typically much worse than the shortest-path distance. The results show how over-squashing hinders the expressivity of GNNs of "practical" size, and validate that it arises from the misalignment between the task (requiring strong mixing between nodes i and j) and the topology (inducing large commute time between i and j).

The “mixing” of a function pertains to the exchange of information between nodes, whatever this information is, and not to its capacity to separate node representations. In fact, these results also hold for GNNs more powerful than the 1-WL test. The analysis in Di Giovanni et al., (2023) offers an alternative approach for studying the expressivity of GNNs, which easily extends to equivariant GNNs in 3D space and their ability to model interactions between nodes.
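Commute time, the quantity governing these depth lower bounds, is cheap to compute on small graphs from the Moore-Penrose pseudoinverse of the Laplacian, via the standard identity C(u, v) = 2|E| · R_eff(u, v), where R_eff is the effective resistance. A short numpy sketch:

```python
import numpy as np

def commute_time(A):
    """Pairwise commute times of a connected simple graph from the
    pseudoinverse of its Laplacian: C(u, v) = 2|E| * R_eff(u, v)."""
    L = np.diag(A.sum(1)) - A
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    R_eff = d[:, None] + d[None, :] - 2 * Lp  # effective resistances
    return A.sum() * R_eff                    # A.sum() = 2|E|

# Path graph 0-1-2-3 (3 edges): adjacent nodes have commute time
# 2*3*1 = 6, while the endpoints have 2*3*3 = 18 — commute time grows
# much faster than the shortest-path distance as graphs stretch out.
P = np.zeros((4, 4))
for i in range(3):
    P[i, i + 1] = P[i + 1, i] = 1
C = commute_time(P)
```

On bottlenecked topologies this quantity blows up even between nodes at moderate shortest-path distance, which is exactly the regime where the mixing analysis predicts that practical-depth GNNs fail.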

Generalization and extrapolation capabilities of GNNs

The expressive power of MPNNs has received a lot of attention in recent years through its connection to the WL test. While this connection has led to significant advances in understanding and enhancing MPNNs' expressive power (Morris et al, 2023a), it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. Surprisingly, only a few notable contributions study MPNNs' generalization behavior, e.g., Garg et al. (2020), Kriege et al. (2018), Liao et al. (2021), Maskey et al. (2022), Scarselli et al. (2018). However, these approaches express MPNNs' generalization ability using only classical graph parameters, e.g., maximum degree, number of vertices, or number of edges, which cannot fully capture the complex structure of real-world graphs. Further, most approaches study generalization in the non-uniform regime, i.e., assuming that the MPNNs operate on graphs of a pre-specified order.

Figure from Morris et al. (2023b): Overview of the generalization capabilities of MPNNs and their link to the 1-WL.

Hence, Morris et al. (2023b) showed a tight connection between the expressive power of the 1-WL and generalization performance. They investigate the influence of graph structure and the parameters' encoding lengths on MPNNs' generalization by tightly connecting the 1-WL's expressivity and MPNNs' Vapnik-Chervonenkis (VC) dimension. To that end, they show several results.

1️⃣ First, in the non-uniform regime, they show that MPNNs’ VC dimension depends tightly on the number of equivalence classes computed by the 1-WL over a set of graphs. In addition, their results easily extend to the k-WL and many recent expressive MPNN extensions.

2️⃣ In the uniform regime, i.e., when graphs can have arbitrary order, they show that MPNNs' VC dimension is lower- and upper-bounded in terms of the largest bitlength of their weights. In both the uniform and non-uniform regimes, MPNNs' VC dimension depends logarithmically on the number of colors computed by the 1-WL and polynomially on the number of parameters. Moreover, they also empirically show that their theoretical findings hold in practice to some extent.
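The quantity these bounds are phrased in — the number of 1-WL color classes over a set of graphs — can be computed with a few lines of color refinement. The sketch below also recovers the classic failure case: a 6-cycle versus two disjoint triangles, which 1-WL (and hence any MPNN) cannot distinguish.

```python
from collections import Counter

def wl_colors(adj, rounds=3):
    """1-WL colour refinement. adj: dict node -> list of neighbours.
    Returns the multiset (histogram) of colours after `rounds` iterations."""
    colors = {v: 0 for v in adj}                     # uniform initial colour
    for _ in range(rounds):
        # New signature: own colour + sorted multiset of neighbour colours.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in adj}
    return Counter(colors.values())

# Both graphs are 2-regular, so every node keeps the same colour forever:
C6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_C3 = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

Fewer distinct colors over a dataset means fewer 1-WL equivalence classes, and via the VC bounds above, a smaller capacity — linking a purely combinatorial quantity to generalization.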

🔮 Predictions time!

Christopher Morris (RWTH Aachen)

“I believe that there is a pressing need for a better and more practical theory of generalization of GNNs.” — Christopher Morris (RWTH Aachen)

➡️ For example, we need to understand how graph structure and various architectural parameters influence generalization. Moreover, the dynamics of SGD for training GNNs are currently understudied and not well understood, and more work will study this.

İsmail İlkan Ceylan (Oxford)

“I hope to see more expressivity results in the uniform setting, where we fix the parameters of a neural network and examine its capabilities.” — İsmail İlkan Ceylan (Oxford)

➡️ In this case, we can identify a better connection to generalization, because if a property cannot be expressed uniformly then the model cannot generalise to larger graph sizes.

➡️ This year, we may also see expressiveness studies that target graph regression or graph generation, which remain under-explored. There are good reasons to hope for learning algorithms which are isomorphism-complete on larger graph classes, strictly generalizing the results for planar graphs.

➡️ It is also time to develop a theory for learning with fully relational data (i.e., knowledge hypergraphs), which will unlock applications in relational databases!

Francesco Di Giovanni (Oxford)

In terms of future theoretical developments of GNNs, I can see two directions that deserve attention.

“There is very little understanding of the dynamics of the weights of a GNN under gradient flow (or SGD); assessing the impact of the graph topology on the evolution of the weights is key to addressing questions about generalisation and hardness of a task.” — Francesco Di Giovanni (Oxford)

➡️ Second, I believe it would be valuable to develop alternative paradigms of expressivity, which more directly focus on approximation power (of graph functions and their derivatives) and identify precisely the tasks which are hard to learn. The latter direction could also be particularly meaningful for characterising the power of equivariant GNNs in 3D space, where measurements of expressivity might need to be decoupled from the 2D case in order to be better aligned with tasks coming from the scientific domain.

At the end: a fun fact about where WL went in 2023

Portraits: Ihor Gorsky

Predictions from the 2023 post

(1) More efforts on creating time- and memory-efficient subgraph GNNs.
❌ not really

(2) Better understanding of generalization of GNNs
✅ yes, see the subsections on oversquashing and generalization

(3) Weisfeiler and Leman visit 10 new places!
❌ (4 so far) Grammatical, indifferent, measurement modeling, paths
