the groundwork for foundation models, which allow us to take pretrained models off the shelf and apply them to a variety of tasks. However, there is a common artifact found in transformer models that can have detrimental impacts on specific tasks and scenarios. Not understanding these pitfalls could cause your project to substantially underperform or fail. For example, the DINOv2 GitHub page offers models pretrained with and without registers. A table of metrics suggests that registers, which were introduced to fix this artifact, do not help the model in a meaningful way. And why add complexity if there isn’t an increase in accuracy?
However, the metrics shown on the DINOv2 page are only for ImageNet classification, which is known not to be impacted by these artifacts. If you use a DINOv2 ViT model without registers for object detection (for example with LOST), your performance would likely be substantially worse.
Using pretrained ViT models without understanding when high-norm artifacts matter could cause your project to fail.
Since these artifacts were identified, the research community has developed several methods to address them. The latest solutions require little to no retraining and introduce zero additional test-time latency. These phenomena are not unique to ViTs; they also occur in LLMs. In fact, one of the NeurIPS 2025 papers reviewed here proposes a general solution to these “attention sink” artifacts by modifying the self-attention architecture. The modified architecture is shown to be beneficial in multiple ways and is already being incorporated into the latest Qwen model, Qwen3-Next.
This article provides a comprehensive guide to:
- Transformer registers.
- The high-norm artifacts (or attention sinks) they address.
- The latest research-driven solutions for mitigating these artifacts.
1. Discovery of the Artifacts in ViTs with DINOv2
While ViTs have been pivotal in ushering in the era of foundation models for computer vision, they suffer from a persistent anomaly: the emergence of high-norm spikes [1]. These artifacts appear across both supervised and self-supervised training regimes, with the original DINO being a notable exception. In Figure 1, this is demonstrated on ViT-Base models trained with different algorithms, ranging from self-supervised (DINO/DINOv2, MAE) and weakly supervised (CLIP) to supervised (DeiT-III).
These artifacts exhibit four key characteristics:
- High Norm: The L2 norm of artifact tokens can be 2–10 times larger than the average token norm, depending on the training method (see the sketch after this list for a quick way to check a checkpoint).
- Sparsity: They constitute a small fraction of total tokens (approximately 2%) and form a distinct mode in the norm distribution (e.g. Figs. 3 and 4 in Darcet et al. 2024 [1]).
- Patch Localization: They predominantly appear in low-information background areas or image corners.
- Layer Localization: They appear primarily in the middle-to-late layers of ViTs.
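To check whether a particular checkpoint exhibits these outliers, you can inspect patch-token norms directly. Below is a minimal sketch (not from the papers) using the public DINOv2 torch.hub entry points; the dinov2_vitb14/dinov2_vitb14_reg model names and the x_norm_patchtokens output key are assumptions to verify against the current repository.

```python
import torch

# Sketch: compare output patch-token norms for DINOv2 ViT-B/14 checkpoints
# trained without and with registers (model names assumed from torch.hub).
no_reg = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
with_reg = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg").eval()

# Use a real, ImageNet-normalized image here; random noise only hints at the effect.
x = torch.randn(1, 3, 518, 518)

with torch.no_grad():
    for name, model in [("no registers", no_reg), ("with registers", with_reg)]:
        feats = model.forward_features(x)  # returns a dict; key name assumed below
        norms = feats["x_norm_patchtokens"].norm(dim=-1)  # [1, num_patches]
        outliers = (norms > 2 * norms.median()).sum().item()
        print(f"{name}: median norm {norms.median().item():.1f}, "
              f"max norm {norms.max().item():.1f}, tokens > 2x median: {outliers}")
```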
The Impact of High-Norm Artifacts
The impact on accuracy varies by task. We measure this impact by observing how much performance improves after applying the fixes discussed in later sections. A summary of results from Jiang et al. (2025) [2] is provided below:
| Impact | Task | Mitigation Result |
|---|---|---|
| 😐 | ImageNet Classification | No significant impact |
| 😃 | Unsupervised Object Discovery (LOST) | Substantial improvement (20%) on DINOv2 ViT-L/14 |
| 😊 | Zero-shot Segmentation | +5 mIoU for OpenCLIP ViT-B/14, but not DINOv2 |
| 😊 | Depth Estimation | Marginal improvement with test-time registers (lower RMSE) |
The Cause: Two Hypotheses
Why do these models generate high-norm artifacts? Two primary, non-contradictory hypotheses exist:
- Global Processing: Large models learn to identify redundant tokens and repurpose them as “storage slots” to process and retrieve global information.
- The Mechanistic Hypothesis: The artifacts are a byproduct of the Softmax function, which forces attention weights to sum to 1.
In SoftMax-based attention, the weights that a given query \( q_i \) assigns across all keys must sum to 1:
$$\sum_{j} \operatorname{softmax}\!\left(\frac{q_i K^\top}{\sqrt{d}}\right)_{j} = 1$$
where \( d \) is the per-head dimension. Even when a query token \( i \) has no meaningful relationship with any key token \( j \), the SoftMax operation still forces it to distribute its “attention mass”. This mass often gets dumped into specific low-information background tokens, which then become high-norm sinks.
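A toy snippet (not from the paper) makes the constraint concrete: even when a query matches every key equally poorly, SoftMax still hands out a full unit of attention mass.

```python
import torch

# One query scoring eight unrelated keys equally badly: the weights are
# forced to sum to 1, so the mass is spread uniformly rather than dropped.
scores = torch.full((1, 8), -4.0)
weights = scores.softmax(dim=-1)
print(weights)        # tensor([[0.125, 0.125, ..., 0.125]])
print(weights.sum())  # tensor(1.)
```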
The attention weights are computed separately for each attention head. To really understand the attention-sink issue, let’s step through the attention code. The self-attention diagrams are also reproduced in Figure 2 for reference.

You can see an example of the code in Facebook Research’s DeiT GitHub repo:

```python
class Attention(nn.Module):
    # ...
    def forward(self, x):
        # B: batch size
        # N: sequence length (number of tokens)
        # C: embedding dimension (num_heads * head_dim)
        B, N, C = x.shape
        # self.qkv is a Linear layer (with bias) that triples the size of the
        # tensor - computing Q = X @ W_Q, K = X @ W_K, V = X @ W_V in one matmul
        qkv = self.qkv(x).reshape(
            B, N,
            3,  # holds Q, K, and V - this dimension is permuted to index 0
            self.num_heads,
            C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        q = q * self.scale  # scale = head_dim ** -0.5 (scaled dot-product)
        attn = (q @ k.transpose(-2, -1))  # attn: [B, num_heads, N, N]
        attn = attn.softmax(dim=-1)  # each row sums to 1 - where the artifact arises
        attn = self.attn_drop(attn)  # optional dropout during training
        # The next line does the matrix multiply AND concatenates the heads
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)  # output projection (another Linear layer)
        x = self.proj_drop(x)  # optional dropout during training
        return x
```
In ViTs, which lack explicit “global” tokens (other than the [CLS] token), the model repurposes background patches as “attention sinks” or “trash cans”. These tokens aggregate global information, their norm magnitude swells, and their original local semantic meaning is lost.
2. The Register Solution: Vision Transformers Need Registers (2024)

The team behind DINOv2 discovered these high-norm artifacts and proposed adding “register” tokens (Darcet et al. 2024 [1]). Registers are extra learned tokens, similar to the [CLS] token: they carry no positional embeddings, and their corresponding output tokens are simply discarded rather than used in any loss. The major downside of this method is that it requires retraining the model. This limitation spurred the search for post-hoc solutions that could fix existing models.
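A minimal sketch of the idea is below. This is not the DINOv2 implementation; the class, its dimensions, and the empty block stack are illustrative placeholders.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Sketch of the register mechanism from Darcet et al. [1]: extra learned
    tokens join the sequence, attend like any other token, and their outputs
    are simply discarded."""

    def __init__(self, embed_dim: int = 768, num_registers: int = 4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Registers: learned like [CLS], no positional embedding, and no loss
        # attached to their outputs.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        self.blocks = nn.Sequential()  # transformer blocks would go here

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [B, N, C], already patch-embedded with positional encoding
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        reg = self.registers.expand(B, -1, -1)
        x = self.blocks(torch.cat([cls, reg, patch_tokens], dim=1))
        # Drop the register outputs; keep [CLS] and the (now cleaner) patch tokens.
        num_reg = self.registers.shape[1]
        return torch.cat([x[:, :1], x[:, 1 + num_reg:]], dim=1)
```

The registers give the model dedicated “storage slots” for global information, so it no longer needs to hijack background patches for that role.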
3. The Denoising Solution: Denoising Vision Transformers (2024)
Yang et al. (2024) [4] proposed Denoising Vision Transformers (DVT) to clean output tokens post-hoc. While DVT is synergistic with registers, it introduces a significant bottleneck, adding approximately 100 seconds of latency per 518×518 image, making it impractical for real-time applications.
Contributions:
- DVT improves performance on a variety of tasks, and the authors show that it is synergistic with adding registers.
- The paper also deepens our understanding of the cause, showing that positional embeddings are an underlying contributor to the high-norm artifacts.
However:
- Adds a large latency per image (around 100 seconds for 518×518 images)
4. The Distillation Solution: Self-Distilled Registers (2025)
The approach by Chen et al. (2025) [5] uses a teacher-student paradigm in which only the register tokens and a small subset of weights are trained. The teacher is the original ViT, kept frozen; high-norm artifacts are removed from its signal by averaging predictions over augmented views of the image (random offsets and flips), which allows the artifacts to be averaged out. The student is initialized from the same ViT, but learnable register tokens are added and a small subset of the weights is finetuned.
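A sketch of the training signal is below, under simplifying assumptions: `teacher` and `student` are callables returning patch tokens of shape [B, N, C], the augmentation set is reduced to pixel shifts and horizontal flips, and the re-alignment of augmented features back to the original patch grid (which the actual method needs) is omitted.

```python
import torch
import torch.nn.functional as F

def denoised_teacher_targets(teacher, image, num_views: int = 8):
    # Average frozen-teacher patch features over randomly shifted/flipped views
    # so that position-dependent high-norm artifacts tend to average out.
    # NOTE: mapping features back to the original patch grid is omitted here.
    feats = []
    for _ in range(num_views):
        dy = int(torch.randint(-8, 9, (1,)))
        dx = int(torch.randint(-8, 9, (1,)))
        view = torch.roll(image, shifts=(dy, dx), dims=(-2, -1))
        if torch.rand(()) < 0.5:
            view = torch.flip(view, dims=[-1])
        with torch.no_grad():
            feats.append(teacher(view))  # [B, N, C] patch tokens
    return torch.stack(feats).mean(dim=0)

def distill_step(student, teacher, image, optimizer):
    # The student carries extra register tokens; only those tokens and a small
    # subset of weights are trainable. Here we simply match the cleaned teacher.
    target = denoised_teacher_targets(teacher, image)
    loss = F.mse_loss(student(image), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```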
Contributions:
- Orders of magnitude less compute than training with registers from scratch.
- No additional test-time latency.
5. The Mechanistic Solution: Test-Time Registers (2025)
Jiang et al. (2025) [2] introduce a method to perform “surgery” on trained models to add registers without retraining. They discovered that artifacts are generated by a sparse set of specific “Register Neurons” within the MLP layers (roughly 0.02% of all neurons). By rerouting the values from these internal MLP neurons to new register tokens, they matched the performance of fully trained register models at zero retraining cost.
They find the following properties of the artifact-causing neurons (or “Register Neurons”):
- Sparsity: Roughly 0.02% of neurons are responsible for the vast majority of artifact energy.
- Causality: The position of the outliers can be moved by modifying the activation pattern of the register neurons.
They show, using linear probes, that these register neurons aggregate global information: i.e., they test whether the register neurons can be used for classification on ImageNet and CIFAR-10/100. Although the final outputs of the registers are ignored, the register tokens inside the network still carry global information that the network can use. The authors also show that setting the register neurons to zero substantially reduces the network’s performance from 70.2% to 55.6%, suggesting that the networks are actively using the artifacts to store information, not merely producing them as a side effect of SoftMax.
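The zero-ablation experiment can be mimicked with PyTorch forward hooks. The sketch below is hypothetical: `register_neurons` maps each MLP hidden layer to the indices of its artifact-producing units, and identifying those indices is the detection step described in the paper, not shown here.

```python
import torch

def zero_register_neurons(register_neurons):
    """Clamp the activations of suspected 'register neurons' to zero.
    register_neurons: {mlp_hidden_module: [neuron indices]} (found elsewhere)."""
    handles = []
    for module, neuron_idx in register_neurons.items():
        idx = torch.tensor(neuron_idx)

        def hook(mod, inputs, output, idx=idx):
            output[..., idx] = 0.0  # silence the artifact-producing units
            return output

        handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle to restore the model
```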
6. Relationship between ViT High-Norm Artifacts and LLM Attention Sinks
A phenomenon similar to the ViT high-norm artifacts, called attention sinks, was found in LLMs in the StreamingLLM paper (Xiao et al., ICLR 2024 [6]). While extending LLMs to streaming, infinite-length sequences, the authors noticed that accuracy dropped significantly once the starting tokens no longer fit into the sliding attention window. These initial tokens, they discovered, tend to accumulate over half of the total attention score. The drop in accuracy was recovered by keeping the \( K \) and \( V \) values of the initial 1-4 tokens while sliding the window over the remaining tokens (a minimal sketch of this eviction rule follows below). They propose that the initial tokens are used as attention sinks because of the sequential nature of autoregressive language modeling: they are visible to all later tokens, while later tokens are only visible to the tokens that come after them. This is in contrast with ViTs, where each patch token is visible to every other patch token. Unlike in ViTs, attention sinks in LLMs tended not to be seen as a problem.
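Here is that eviction rule as a sketch, assuming a per-layer KV cache shaped [batch, heads, seq_len, head_dim]; the real implementation also reassigns positions inside the cache, which is omitted here.

```python
import torch

def evict_kv(keys, values, num_sink: int = 4, window: int = 1024):
    """StreamingLLM-style eviction [6]: always keep the K/V of the first
    `num_sink` tokens (the attention sinks) plus a sliding window of the
    most recent tokens; drop everything in between."""
    T = keys.shape[2]
    if T <= num_sink + window:
        return keys, values
    keep = torch.cat([
        torch.arange(num_sink),       # the attention-sink tokens
        torch.arange(T - window, T),  # the most recent tokens
    ])
    return keys[:, :, keep], values[:, :, keep]
```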
Attention sinks in LLMs were thought to serve as anchors that do not aggregate global information, unlike the ViT artifacts; however, more recent research, “Attentional Sinks and Compression Valleys” (Queipo-de-Llano et al. 2025 [7]), finds that these sinks do in fact contain global information. This suggests that the general solution discussed in the next section might also apply to ViTs, even though it had not been tested on them at the time of this writing.
7. Removing the Artifacts with Sigmoidal Gating: Gated Attention (2025)

One way to address the SoftMax problem at its source might be to replace it with a sigmoid. Gu et al. (2025) [8] showed that replacing SoftMax with an (unnormalized) sigmoid can indeed eliminate the attention sink at the first token, as shown in Figure 4. While preliminary results show some potential improvement in validation loss, it remains unclear what downstream impact this has on LLM performance, and the study lacks the more extensive experiments of the next paper.
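A sketch of the replacement is below, under simplifying assumptions: no causal mask and no bias or temperature term, both of which the actual study varies.

```python
import torch

def sigmoid_attention(q, k, v):
    """Unnormalized sigmoid attention: each weight is squashed independently,
    so rows no longer sum to 1 and no token is forced to absorb leftover mass.
    q, k, v: [B, heads, N, head_dim]."""
    d = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5  # [B, heads, N, N]
    weights = torch.sigmoid(scores)                # no row-wise normalization
    return weights @ v
```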

Qiu et al. took a different approach in their Gated Attention NeurIPS 2025 paper [9]: they left the SoftMax attention untouched, but added a gating step after the tokens from all the heads are concatenated, as shown in Figure 5 (a minimal sketch follows the list below). They find that adding gating removes the high-norm artifacts, even though the SoftMax would still create such artifacts inside the standard scaled dot-product attention (SDPA) prior to the gating. The benefits of Gated Attention go beyond fixing the attention-sink artifact, offering:
- Improved training stability
- Elimination of training loss spikes
- Support for larger learning rates and batch sizes
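Below is a minimal sketch of this output gating, with assumptions: an elementwise sigmoid gate computed from the layer input, no causal mask, and details that only loosely follow the paper’s ablations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSDPAOutput(nn.Module):
    """Sketch of SoftMax attention followed by a sigmoid output gate: the
    SDPA itself is untouched, but the concatenated head outputs are gated
    before the output projection, which suppresses sink-like artifacts."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)  # gate values computed from the layer input
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: [B, heads, N, head_dim]
        out = F.scaled_dot_product_attention(q, k, v)  # standard SoftMax inside
        out = out.transpose(1, 2).reshape(B, N, C)     # concatenate the heads
        out = out * torch.sigmoid(self.gate(x))        # elementwise gating
        return self.proj(out)
```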
They use this Gated Attention in their new Qwen3-Next model, although they also replace some of the self-attention layers with Gated DeltaNet. This could be a sign that we are moving away from single elegant solutions, like stacks of identical self-attention modules, and towards a collection of heuristics that together get the best performance. In some ways, this resembles the brain, with its wide variety of neuron types, neurotransmitters, and receptors. Larger architecture changes could punctuate this equilibrium of progress and require much of that heuristic tuning to be redone.
8. Conclusion
Since the distant past of 2024, when the high-norm artifacts of ViTs and the attention sinks of LLMs were identified, the research community has developed many solutions and made considerable progress in understanding these artifacts. The two phenomena are more similar than initially thought: in both cases, SoftMax causes attention to concentrate on a few tokens, which are then used (implicitly or explicitly) as registers that store global information. Once learned, removing these registers can hurt performance. Test-time registers move the high-norm artifacts (implicit registers) into explicit register tokens, cleansing the patch tokens of the artifacts. You can also prevent the registers from forming in the first place, either by replacing SoftMax with a sigmoid or by applying a sigmoid gate after the SoftMax (the latter still allows high-norm artifacts within the SDPA, but they are removed before they reach the output tokens).
In many cases, these artifacts don’t cause any issues, such as in global tasks like classification for ViTs and most LLM tasks. They do negatively impact dense ViT tasks, especially when a single token or a few tokens can have an outsized effect, as in object detection. The fixes at least don’t make performance worse, although the fixes for LLMs, such as sigmoid attention and gated attention, haven’t been used as widely, and sigmoid attention in particular might be more difficult to train. Embracing the artifact, by preserving the KV values of the initial tokens, currently seems to be the most mature solution for streaming LLMs [6].
Comparison of Mitigation Strategies
The best mitigation strategy depends on whether you already have a trained model or plan to train from scratch.
| Method | Training Cost | Mechanism | Latency | Applied To |
|---|---|---|---|---|
| Trained Registers [1] | High (Full) | Add Learned Tokens | None | ViTs |
| Denoising ViTs [4] | Medium | Signal Decomposition | Very High | ViTs |
| Self-Distilled Registers [5] | Low (Fine-tune) | Distillation | None | ViTs |
| Test-Time Registers [2] | Zero | Neuron Shifting | None | ViTs |
| StreamingLLM [6] | Zero | KV Cache Preservation | None | LLMs |
| Sigmoid or ELU+1 Attention [8] | High (Full) | Replace SoftMax | None | LLMs |
| Gated Attention [9] | High (Full) | Add Sigmoid Gating | Minimal | LLMs |
Bibliography
1. Darcet, T., et al. “Vision Transformers Need Registers.” (2024).
2. Jiang, N., et al. “Vision Transformers Don’t Need Trained Registers.” (2025).
3. Vaswani, A., et al. “Attention Is All You Need.” (2017).
4. Yang et al. “Denoising Vision Transformers.” (2024).
5. Chen, Y., et al. “Vision Transformers with Self-Distilled Registers.” NeurIPS (2025).
6. Xiao et al. “Efficient Streaming Language Models with Attention Sinks.” ICLR (2024).
7. Queipo-de-Llano et al. “Attentional Sinks and Compression Valleys.” (2025).
8. Gu et al. “When Attention Sink Emerges in Language Models: An Empirical View.” ICLR (2025).
9. Qiu, Z., et al. “Gated Attention for Large Language Models.” NeurIPS (2025).