Intro
This article is about how to examine and manipulate an LLM’s neural network. This is the topic of mechanistic interpretability research, and it can answer many exciting questions.
Remember: An LLM is a deep artificial neural network, made up of neurons and weights that determine how strongly those neurons are connected. What makes such a network arrive at its conclusions? How thoroughly does it actually consider and analyze the information it processes?
These sorts of questions have been investigated in a vast number of publications at least since deep neural networks started showing promise. To be clear, mechanistic interpretability existed before LLMs did, and was already an exciting aspect of Explainable AI research with earlier deep neural networks. For instance, identifying the salient features that trigger a CNN to arrive at a given object classification or vehicle steering direction can help us understand how trustworthy and reliable the network is in safety-critical situations.
But with LLMs, the topic really took off, and became much more interesting. Are the human-like cognitive abilities of LLMs real or fake? How does information travel through the neural network? Is there hidden knowledge inside an LLM?
In this post, you will find:
- A refresher on LLM architecture
- An introduction to interpretability methods
- Use cases
- A discussion of past research
In a follow-up article, we will look at Python code to apply some of these skills, visualize the activations of the neural network and more.
Refresher: The design of an LLM
For the purposes of this article, we need a basic understanding of the spots in the neural network that are worth hooking into in order to extract potentially useful information. Therefore, this section is a quick reminder of the components of an LLM.
LLMs use a sequence of input tokens to predict the next token.
Tokenizer: Initially, sentences are segmented into tokens. The goal of the token vocabulary is to turn frequently used sub-words into single tokens. Each token has a unique ID.
However, tokens can be confusing and messy since they provide an inaccurate representation of many things, including numbers and individual characters. Asking an LLM to calculate or to count letters is a pretty unfair thing to do. (With specialized embedding schemes, their performance can improve [1].)
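To make this concrete, here is a minimal tokenization sketch. It assumes the Hugging Face transformers package and the publicly available GPT-2 tokenizer; any tokenizer behaves similarly, although the vocabulary and IDs differ.

```python
# Minimal tokenizer sketch (assumes the Hugging Face `transformers` package
# and the GPT-2 tokenizer; any tokenizer works similarly).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Interpretability lets us look inside an LLM."
token_ids = tokenizer.encode(text)                    # integer token IDs
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # the sub-word pieces

print(tokens)      # sub-word pieces; rare words split into several tokens
print(token_ids)   # the corresponding IDs from the vocabulary
```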
Embedding: A look-up table assigns each token ID to an embedding vector of a given dimensionality. The look-up table is learned (i.e., derived during neural network training) and tends to place co-occurring tokens closer together in the embedding space. The dimensionality of the embedding vectors is an important trade-off between an LLM’s capabilities and the computing effort. Since the order of the tokens would otherwise not be apparent in subsequent steps, positional information is also encoded; rotary positional encoding, for example, rotates vectors by angles (sines and cosines) derived from the token position. The embedding vectors of all input tokens form the matrix that the LLM processes: the initial hidden states. As the LLM operates on this matrix, which moves through the layers as the residual stream (also referred to as the hidden state or representation space), it works in latent space.
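The following toy sketch illustrates the embedding look-up and the shape of the initial residual stream. The dimensions and token IDs are arbitrary placeholders, and the learned absolute position embedding merely stands in for whichever positional scheme a real model uses.

```python
# Toy embedding look-up (illustrative sizes, not a real model's).
import torch

vocab_size, d_model = 50_000, 768                     # assumed toy dimensions
embedding = torch.nn.Embedding(vocab_size, d_model)   # learned look-up table

token_ids = torch.tensor([[15496, 11, 995]])   # (batch=1, seq_len=3), arbitrary IDs
hidden_states = embedding(token_ids)           # (1, 3, 768): the initial residual stream

# Positional information is added (here: a learned absolute position embedding
# for simplicity; rotary encodings instead rotate vectors inside attention).
pos_embedding = torch.nn.Embedding(1024, d_model)
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)   # (1, 3)
hidden_states = hidden_states + pos_embedding(positions)
print(hidden_states.shape)  # torch.Size([1, 3, 768])
```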
Modalities other than text: LLMs can work with modalities other than text. In these cases, the tokenizer and embedding are changed to accommodate different modalities, such as sound or images.
Transformer blocks: A number of transformer blocks (dozens) refine the residual stream, adding context and additional meaning. Each transformer layer consists of an attention component [2] and an MLP component. These components are fed the normalized hidden state. The output is then added to the residual stream.
- Attention: Multiple attention heads (also dozens) add weighted information from source tokens to destination tokens (in the residual stream). Each attention head’s “nature” is parametrized by three learned matrices WQ, WK, WV, which essentially determine what the head specializes in. Queries, keys and values are calculated by multiplying these matrices with the hidden states of all tokens. The attention weights are then computed for each destination token from the softmax of the scaled dot products of its query vector with the key vectors of the source tokens. Each attention weight describes the strength of the relationship between a source and a destination for the given specialization of the attention head. Finally, the head outputs a weighted sum of the source tokens’ value vectors, and all heads’ outputs are concatenated and passed through a learned output projection WO (see the code sketch after this list).
- MLP: A fully connected feedforward network. This linear-nonlinear-linear operation is applied independently at each position. MLP networks typically contain a large share of the parameters in an LLM.
MLP networks store much of the knowledge. Later layers tend to contain more semantic and less shallow knowledge [3]. This is relevant when deciding where to probe or intervene. (With some effort, these knowledge representations can be modified in a trained LLM through weight modification [4] or residual stream intervention [5].)
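To make the attention and MLP computations above concrete, here is a minimal sketch of a pre-norm transformer block in PyTorch. The class name and dimensions are made up for illustration; real architectures add many details (biases, dropout, different norms, rotary encodings, and so on).

```python
# Minimal pre-norm transformer block sketch, illustrating the WQ/WK/WV/WO
# projections, the causal attention weights, the MLP, and the residual
# additions described above. Toy dimensions; real models differ in details.
import math
import torch
import torch.nn as nn


class ToyTransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_mlp=3072):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # Learned projections: queries, keys, values, output.
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)
        # MLP: linear -> nonlinearity -> linear.
        self.mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.GELU(),
                                 nn.Linear(d_mlp, d_model))

    def attention(self, x):
        b, t, d = x.shape
        split = lambda m: m.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.W_Q(x)), split(self.W_K(x)), split(self.W_V(x))
        # Scaled dot products between destination queries and source keys.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # Causal mask: a destination token may only attend to earlier sources.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        weights = scores.softmax(dim=-1)             # attention weights
        out = weights @ v                            # weighted sum of values
        out = out.transpose(1, 2).reshape(b, t, d)   # concatenate heads
        return self.W_O(out)                         # output projection

    def forward(self, resid):
        resid = resid + self.attention(self.ln1(resid))  # add to residual stream
        resid = resid + self.mlp(self.ln2(resid))        # add to residual stream
        return resid


x = torch.randn(1, 5, 768)             # (batch, seq_len, d_model) residual stream
print(ToyTransformerBlock()(x).shape)  # torch.Size([1, 5, 768])
```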
Unembedding: The final residual stream values are normalized and linearly mapped back to the vocabulary size to produce the logits for each input token position. Typically, we only need the prediction for the token following the last input token, so we use that one. The softmax function converts the logits for the final position into a probability distribution. One option is then selected from this distribution (e.g., the most likely or a sampling-based option) as the next predicted token.
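A minimal sketch of this unembedding step, with random tensors standing in for the trained parameters:

```python
# Unembedding sketch: map the final residual stream back to vocabulary logits
# and turn the last position into a next-token distribution. Toy tensors;
# `W_U` and the layer norm stand in for a real model's learned parameters.
import torch

vocab_size, d_model, seq_len = 50_000, 768, 5
final_resid = torch.randn(1, seq_len, d_model)    # output of the last block
ln_final = torch.nn.LayerNorm(d_model)
W_U = torch.randn(d_model, vocab_size)            # unembedding matrix

logits = ln_final(final_resid) @ W_U              # (1, seq_len, vocab_size)
probs = logits[0, -1].softmax(dim=-1)             # distribution for the next token

next_token = torch.argmax(probs)                  # greedy choice, or ...
sampled = torch.multinomial(probs, num_samples=1) # ... sample from the distribution
```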
If you wish to learn more about how LLMs work and gain additional intuition, Stephen McAleese’s [6] explanation is excellent.
Now that we have looked at the architecture, the question to ask is: What do the intermediate states of the residual stream mean? How do they relate to the LLM’s output? Why does this work?
Introduction to interpretability methods
Let’s take a look at our toolbox. Which components will help us answer our questions, and which methods can we apply to analyze them? Our options include:
- Neurons:
We could observe the activation of individual neurons.
- Attention:
We could observe the output of individual attention heads in each layer.
We could observe the queries, keys, values and attention weights of each attention head for each position and layer.
We could observe the concatenated outputs of all attention heads in each layer.
- MLP:
We could observe the MLP output in each layer.
We could observe the neural activations inside of the MLP networks.
We could observe the LayerNorm mean/variance to track scale, saturation and outliers.
- Residual stream:
We could observe the residual stream at each position, in each layer (a hook-based recording sketch follows this list).
We could unembed the residual stream in intermediate layers, to observe what would happen if we stopped there — earlier layers often yield more shallow predictions. (This is a useful diagnostic, but not fully reliable — the unembedding mapping was trained for the final layer.)
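Most of these observations boil down to registering hooks at the right modules. Here is a minimal recording sketch, assuming GPT-2 via Hugging Face transformers, where the transformer blocks live in model.h and each block returns its hidden state as the first element of a tuple; other models expose their layers under different attribute names.

```python
# Recording the residual stream with forward hooks (minimal sketch).
# Assumes GPT-2 from Hugging Face `transformers`: blocks live in `model.h`,
# and each block returns the hidden state first in a tuple.
import torch
from transformers import AutoTokenizer, GPT2Model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

residual_stream = {}  # layer index -> (batch, seq_len, d_model) tensor

def make_hook(layer_idx):
    def hook(module, inputs, outputs):
        residual_stream[layer_idx] = outputs[0].detach()
    return hook

handles = [block.register_forward_hook(make_hook(i))
           for i, block in enumerate(model.h)]

with torch.no_grad():
    ids = tokenizer("The Eiffel Tower is in", return_tensors="pt")
    model(**ids)

for h in handles:
    h.remove()

print(len(residual_stream), residual_stream[0].shape)  # one tensor per layer
```

(Hugging Face models can also return all layer outputs directly via output_hidden_states=True; explicit hooks, however, generalize to any intermediate value, such as attention weights or MLP activations.)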
We can also derive additional information:
- Linear probes and classifiers: We can build a system that classifies the recorded residual stream into one group or another, or measures some feature within it (a minimal probe sketch follows this list).
- Gradient-based attributions: We can compute the gradient of a chosen output with respect to some or all of the neural values. The gradient magnitude indicates how sensitive the prediction is to changes in those values.
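For example, a linear probe can be trained on recorded activations with a few lines of scikit-learn. The activations and labels below are random placeholders; in practice, X might hold a middle layer’s residual stream at the final token for each prompt, and y the property of interest.

```python
# Linear probe sketch: classify recorded residual-stream vectors with logistic
# regression. `X` and `y` are random placeholders standing in for recorded
# activations (e.g. from the hook sketch above) and binary labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))    # placeholder activations (200 prompts, d_model=768)
y = rng.integers(0, 2, size=200)   # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~0.5 on random data

# The learned weight vector is itself a direction in the residual stream that
# can be inspected, or reused later as a steering direction.
direction = probe.coef_[0]
```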
All of this can be done while a given, static LLM runs an inference on a given prompt or while we actively intervene:
- Comparison of multiple inferences: We can swap, retrain, or otherwise modify the LLM, or have it process different prompts, and record the aforementioned information for each run.
- Ablation: We can zero out neurons, heads, MLP blocks or vectors in the residual stream and watch how it affects behavior. For example, this allows us to measure the contribution of a head, neuron or pathway to token prediction.
- Steering: We can actively steer the LLM by replacing or otherwise modifying activations in the residual stream.
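A minimal intervention sketch along these lines, again assuming GPT-2 through Hugging Face transformers: one hook zeroes a layer’s MLP output (ablation), another adds a vector to a middle layer’s residual stream (steering). The layer indices, scaling factor, and random steering vector are placeholders; a real steering vector could come from, for example, the difference of mean activations on two contrastive prompt sets, or a probe direction.

```python
# Intervention sketch: ablate one layer's MLP and steer a middle layer's
# residual stream via forward hooks on GPT-2 (same structural assumptions as
# the recording sketch above). Indices and the steering vector are placeholders.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def ablate_mlp(module, inputs, outputs):
    # GPT-2's MLP module returns a plain tensor; replacing it with zeros
    # removes this layer's MLP contribution to the residual stream.
    return torch.zeros_like(outputs)

steering_vector = torch.randn(768)   # placeholder; e.g. a contrastive-prompt difference
def steer(module, inputs, outputs):
    hidden = outputs[0] + 4.0 * steering_vector   # nudge every position
    return (hidden,) + outputs[1:]                # keep the rest of the block's outputs

handles = [
    model.transformer.h[3].mlp.register_forward_hook(ablate_mlp),  # ablation
    model.transformer.h[6].register_forward_hook(steer),           # steering
]

ids = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=8, do_sample=False)
print(tokenizer.decode(out[0]))

for h in handles:
    h.remove()
```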
Use cases
The interpretability methods discussed represent a vast arsenal that can be applied to many different use cases.
- Model performance improvement or behavior steering through activation steering: For instance, in addition to a system prompt, a model’s behavior can be steered towards a certain trait or focus dynamically, without changing the model.
- Explainability: Methods such as steering vectors, sparse autoencoders, and circuit tracing can be used to understand what the model does and why based on its activations.
- Safety: Detecting and discouraging undesirable features during training, implementing run-time supervision to interrupt a model that is deviating, or detecting new or risky capabilities.
- Drift detection: During model development, it is important to understand when a newly trained model is behaving differently and to what extent.
- Training improvement: Understanding how individual aspects of the model’s behavior contribute to its overall performance helps optimize model development. For example, unnecessary Chain-of-Thought steps can be discouraged during training, which leads to smaller, faster, or potentially more powerful models.
- Scientific and linguistic learnings: Use the models as an object to study to better understand AI, language acquisition and cognition.
LLM interpretability research
The field of interpretability has steadily developed over the last few years, answering exciting questions along the way. Just three years ago, it was unclear whether or not the learnings outlined below would manifest. This is a brief history of key insights:
- In-context learning and pattern understanding: During LLM training, some attention heads gain the capability to collaborate as pattern identifiers, greatly enhancing an LLM’s in-context learning capabilities [7]. Thus, some aspects of LLMs represent algorithms that enable capabilities applicable outside the space of the training data.
- World understanding: Do LLMs memorize all of their answers, or do they understand the content in order to form an internal mental model before answering? This topic has been heavily debated, and the first convincing evidence that LLMs create an internal world model was published at the end of 2022. To demonstrate this, the researchers recovered the board state of the game Othello from the residual stream [8, 9]. Many more indications followed swiftly. Space and time neurons were identified [10].
- Memorization or generalization: Do LLMs simply regurgitate what they have seen before, or do they reason for themselves? The evidence here was somewhat unclear [11]. Intuitively, smaller LLMs form smaller world models (i.e., in 2023, the evidence for generalization was less convincing than in 2025). Newer benchmarks [12, 13] aim to limit contamination with material that may be within a model’s training data, and focus specifically on the generalization capability. LLM performance on these benchmarks is still substantial.
LLMs develop deeper generalization abilities for some concepts during their training. To quantify this, indicators from interpretability methods were used [14].
- Superposition: Properly trained neural networks compress knowledge and algorithms into approximations. Because there are more features than there are dimensions to indicate them, this results in so-called superposition, where polysemantic neurons may contribute to multiple features of a model [15]. See Superposition: What Makes it Difficult to Explain Neural Network (Shuyang) for an explanation of this phenomenon. Basically, because neurons act in multiple functions, interpreting their activation can be ambiguous and difficult. This is a major reason why interpretability research focuses more on the residual stream than on the activation of individual, polysemantic neurons.
- Representation engineering: Beyond surface facts, such as board states, space, and time, it is possible to identify semantically meaningful vector directions within the residual stream [16]. Once a direction is identified, it can be examined or modified. This can be used to identify or influence hidden behaviors, among other things.
- Latent knowledge: Do LLMs possess internal knowledge that they keep to themselves? They do, and methods for discovering latent knowledge aim to extract it [17, 18]. If a model knows something that is not reflected in its prediction output, this is highly relevant to explainability and safety. Attempts have been made to audit such hidden objectives, which can be inserted into a model inadvertently or purposely, for research purposes [19].
- Steering: The residual stream can be manipulated by adding an activation vector to it, changing the model’s behavior in a targeted way [20]. To determine this steering vector, one can record the residual stream during two runs (inferences) with opposite prompts and subtract one from the other. For instance, this can turn the style of the generated output from happy to sad, or from safe to dangerous. The activation vector is usually injected into a middle layer of the neural network. Similarly, a steering vector can be used to measure how strongly a model responds in a given direction.
Steering methods have been used in attempts to reduce lies, hallucinations, and other undesirable tendencies of LLMs. However, this does not always work reliably. Efforts have been made to develop measures of how well a model can be guided toward a given concept [21].
- Chess: The board state of chess games, as well as the language model’s estimation of the opponent’s skill level, can also be recovered from the residual stream [22]. Modifying the vector representing the expected skill level was also used to improve the model’s performance in the game.
- Refusals: It was found that refusals could be prevented or elicited using steering vectors [23]. This suggests that some safety behaviors may be linearly accessible.
- Emotion: LLMs can derive emotional states from a given input text, which can be measured. The results are consistent and psychologically plausible in light of cognitive appraisal theory [24]. This is interesting because it suggests that LLMs can mirror many of our human tendencies in their world models.
- Features: As mentioned earlier, neurons in an LLM are not very helpful for understanding what is happening internally.
Initially, OpenAI tried to have GPT-4 guess which features the neurons respond to based on their activation in response to different example texts [25]. In 2023, Anthropic and others joined this major topic and applied sparse autoencoder networks to automate the interpretation of the residual stream [26, 27]. Their work enables the mapping of the residual stream into monosemantic features that describe an interpretable attribute of what is occurring. However, it was later shown that not all of these features are one-dimensionally linear [28].
The automation of feature analysis remains a topic of interest and research, with more work being done in this area [29].
Currently, Anthropic, Google, and others are actively contributing to Neuronpedia, a mecca for researchers studying interpretability.
- Hallucinations: LLMs often produce untrue statements, or “hallucinate.” Mechanistic interventions have been used to identify the causes of hallucinations and mitigate them [30, 31].
Features suitable for probing and influencing hallucinations have also been identified [32]. Accordingly, the model has some “self-knowledge” of when it is producing incorrect statements.
- Circuit tracing: In LLMs, circuit analysis, i.e., the analysis of the interaction of attention heads and MLPs, allows for the specific attribution of behaviors to such circuits [33, 34]. Using this method, researchers can determine not only where information is within the residual stream but also how the given model computed it. Efforts are ongoing to do this on a larger scale.
- Human brain comparisons and insights: Neural activity from humans has been compared to activations in OpenAI’s Whisper speech-to-text model [35]. Surprising similarities were found. However, this should not be overinterpreted; it may simply be a sign that LLMs have acquired effective strategies. Interpretability research allows such analyses to be performed in the first place.
- Self-referential first-person view and claims of consciousness: Interestingly, suppressing features associated with deception led to more claims of consciousness and deeper self-referential statements by LLMs [36]. Again, the results should not be overinterpreted, but they are interesting to consider as LLMs become more capable and challenge us more often.
This review demonstrated the power of causal interventions on internal activations. Rather than relying on correlational observations of a black-box system, the system can be dissected and analyzed.
Conclusion
Interpretability is an exciting research area that provides surprising insights into an LLM’s behavior and capabilities. It can even reveal interesting parallels to human cognition. Many (mostly narrow) behaviors of a given model can be explained, producing valuable insights. However, the sheer number of models, and of possible questions to ask about them, will likely prevent us from fully deciphering any large model, let alone all of them: the enormous time investment may simply not yield sufficient benefit. This is why the field is shifting toward automated analysis, in order to apply mechanistic insight systematically.
These methods are valuable additions to our toolbox in both industry and research, and all users of future AI systems may benefit from these incremental insights. They enable improvements in reliability, explainability, and safety.
Contact
This is a complex and extensive topic, and I am happy about pointers, comments and corrections. Feel free to send a message to jvm (at) taggedvision.com
References
- [1] McLeish, Sean, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, et al. 2024. “Transformers Can Do Arithmetic with the Right Embeddings.” Advances in Neural Information Processing Systems 37: 108012–41. doi:10.52202/079017-3430.
- [2] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NIPS 2017): 5999–6009.
- [3] Geva, Mor, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. “Transformer Feed-Forward Layers Are Key-Value Memories.” doi:10.48550/arXiv.2012.14913.
- [4] Meng, Kevin, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023. “Mass-Editing Memory in a Transformer.” doi:10.48550/arXiv.2210.07229.
- [5] Hernandez, Evan, Belinda Z Li, and Jacob Andreas. “Inspecting and Editing Knowledge Representations in Language Models.” https://github.com/evandez/REMEDI.
- [6] Stephen McAleese. 2025. “Understanding LLMs: Insights from Mechanistic Interpretability.” https://www.lesswrong.com/posts/XGHf7EY3CK4KorBpw/understanding-llms-insights-from-mechanistic
- [7] Olsson, et al., “In-context Learning and Induction Heads”, Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
- [8] Li, Kenneth, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. “Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task.” https://arxiv.org/abs/2210.13382v4.
- [9] Nanda, Neel, Andrew Lee, and Martin Wattenberg. 2023. “Emergent Linear Representations in World Models of Self-Supervised Sequence Models.” https://arxiv.org/abs/2309.00941v2
- [10] Gurnee, Wes, and Max Tegmark. 2023. “Language Models Represent Space and Time.” https://arxiv.org/abs/2310.02207v1.
- [11] Wu, Zhaofeng, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. 2023. “Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks.” https://arxiv.org/abs/2307.02477v1.
- [12] “An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems.” 2025. https://openreview.net/forum?id=Tos7ZSLujg
- [13] White, Colin, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, et al. 2025. “LiveBench: A Challenging, Contamination-Limited LLM Benchmark.” doi:10.48550/arXiv.2406.19314.
- [14] Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.” doi:10.48550/arXiv.2301.05217.
- [15] Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, et al. 2022. “Toy Models of Superposition.” https://arxiv.org/abs/2209.10652v1 (February 18, 2024).
- [16] Zou, Andy, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, et al. 2023. “Representation Engineering: A Top-Down Approach to AI Transparency.”
- [17] Burns, Collin, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. “Discovering Latent Knowledge in Language Models Without Supervision.”
- [18] Cywiński, Bartosz, Emil Ryd, Senthooran Rajamanoharan, and Neel Nanda. 2025. “Towards Eliciting Latent Knowledge from LLMs with Mechanistic Interpretability.” doi:10.48550/arXiv.2505.14352.
- [19] Marks, Samuel, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, et al. “Auditing Language Models for Hidden Objectives.”
- [20] Turner, Alexander Matt, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. “Activation Addition: Steering Language Models Without Optimization.” https://arxiv.org/abs/2308.10248v3.
- [21] Rütte, Dimitri von, Sotiris Anagnostidis, Gregor Bachmann, and Thomas Hofmann. 2024. “A Language Model’s Guide Through Latent Space.” doi:10.48550/arXiv.2402.14433.
- [22] Karvonen, Adam. “Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models.” https://github.com/adamkarvonen/chess.
- [23] Arditi, Andy, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. “Refusal in Language Models Is Mediated by a Single Direction.” doi:10.48550/arXiv.2406.11717.
- [24] Tak, Ala N., Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch. 2025. “Mechanistic Interpretability of Emotion Inference in Large Language Models.” doi:10.48550/arXiv.2502.05489.
- [25] Bills, Steven, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. “Language Models Can Explain Neurons in Language Models.” https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
- [26] “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- [27] Cunningham, Hoagy, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. “Sparse Autoencoders Find Highly Interpretable Features in Language Models.”
- [28] Engels, Joshua, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. 2025. “Not All Language Model Features Are One-Dimensionally Linear.” doi:10.48550/arXiv.2405.14860.
- [29] Shaham, Tamar Rott, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, and Antonio Torralba. 2025. “A Multimodal Automated Interpretability Agent.” doi:10.48550/arXiv.2404.14394.
- [30] Chen, Shiqi, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, and Junxian He. 2024. “In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation.” doi:10.48550/arXiv.2403.01548.
- [31] Yu, Lei, Meng Cao, Jackie CK Cheung, and Yue Dong. 2024. “Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations.” In Findings of the Association for Computational Linguistics: EMNLP 2024, eds. Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen. Miami, Florida, USA: Association for Computational Linguistics, 7943–56. doi:10.18653/v1/2024.findings-emnlp.466.
- [32] Ferrando, Javier, Oscar Obeso, Senthooran Rajamanoharan, and Neel Nanda. 2025. “Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models.”
- [33] Lindsey, et al., On the Biology of a Large Language Model (2025), Transformer Circuits
- [34] Wang, Kevin, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.” http://arxiv.org/abs/2211.00593.
- [35] “Deciphering Language Processing in the Human Brain through LLM Representations.” https://research.google/blog/deciphering-language-processing-in-the-human-brain-through-llm-representations/
- [36] Berg, Cameron, Diogo de Lucena, and Judd Rosenblatt. 2025. “Large Language Models Report Subjective Experience Under Self-Referential Processing.” doi:10.48550/arXiv.2510.24797.