Static and Dynamic Attention: Implications for Graph Neural Networks | by Hunjae Timothy Lee

Contents

Graph Attention Network (GAT)Static Attention

Graph Attention Network (GAT)

Graph Attention Network (GAT), as introduced in [1], closely follows the work from [3] in its attention setup. GAT formulation also holds many similarities to the now infamous transformer paper [4], with both papers having been published months away from each other.

Attention in graphs is used to rank or weigh the relative importance of every neighboring node (keys) with respect to each source node (query). These attention scores are calculated for every node feature in the graph and its respective neighbors. Node features, denoted by

go through a linear transformation with a weight matrix denoted

before attention mechanism is applied. With linearly transformed node features, raw attention score is calculated as shown in Equation (1). To calculate normalized attention scores, softmax function is used as shown in Equation (2) similar to attention calculation in [4].

In the paper (GAT), the attention mechanism Attention( ) used is a single-layer feedforward neural network parameterized by a followed by a LeakyReLU non-linearity, as shown in Equation (3). The || symbol denotes concatenation along the feature dimension. Note: multi-head attention formulation is intentionally skipped in this article as it holds no relevance to attention formulation itself. Both GAT and GATv2 leverage multi-headed attention in their implementations.

As it can be seen, the learnable attention parameter a is introduced as a linear combination to the transformed node features Wh. As elaborated in the upcoming sections, this setup is known as static attention and is the main limiting factor of GAT, though for reasons that are not immediately obvious.

Static Attention

Consider the following graph below where node h1 is the query node with the following neighbors (keys) {h2, h3, h4, h5}.

Calculating the raw attention score between the query node and h2 following the GAT formulation is shown in Equation (4).

As mentioned earlier, learnable attention parameter a is combined linearly with the concatenated query and key nodes. This means that the contributions of a with respect to Wh1 and Wh2 are linearly separable as a = [a1 || a2]. Using a1 and a2, Equation (4) can be restructured as the following (Equation (5)).

Calculating the raw attention scores for the rest of the neighborhood with respect to the query node h1, a pattern begins to emerge.

From Equation (6), it can be seen that the query term

is repeated each time in the calculation of attention scores e. This means that while the query term is technically included in the attention calculation, it essentially affects all neighbors equally and does not affect their relative ordering. Only the key terms

determine the relative order of attention scores with respect to each other.

This type of attention is called static attention by [2]. This design means that the ranking of neighbors’ importance

is determined globally across all nodes independent of the specific query nodes. This limitation prevents GAT from capturing locally nuanced relationships where different nodes might prioritize different subsets of neighbors. As stated in [2], “[static attention] cannot model situations where different keys have different relevance to different queries”.