1. Introduction
For the last decade, the AI industry has operated on one unspoken convention: that intelligence can only emerge at scale. We convinced ourselves that for models to truly mimic human reasoning, we needed bigger and deeper networks. Unsurprisingly, this led to stacking more transformer blocks on top of each other (Vaswani et al., 2017)5, adding billions of parameters, and training them across data centers that consume megawatts of power.
But is this race toward ever-bigger models blinding us to a far more efficient path? What if actual intelligence isn’t about the size of the model, but about how long you let it reason? Can a tiny network, given the freedom to iterate on its own solution, outsmart a model thousands of times its size?
2. The Fragility of the Giants
To understand why we need a new approach, we must first look at why our current reasoning models like GPT-4, Claude, and DeepSeek still struggle with complex logic.
These models are primarily trained on the Next-Token-Prediction (NTP) objective. They process the prompt through their billion-parameter layers to predict the next token in a sequence. Even when they use “Chain-of-Thought” (CoT) (Wei et al., 2022)4 to “reason” about a problem, they are still just predicting the next word, which, unfortunately, is not the same as thinking.
This approach has two flaws.
The first is brittleness. Because the model generates its answer token by token, a single mistake early in the reasoning can snowball into a completely different, and often wrong, answer. The model cannot stop, backtrack, and correct its internal logic before answering. It has to commit fully to the path it started on, often hallucinating confidently just to finish the sentence.
The second is that modern reasoning models lean on memorization over logical deduction. They perform well on seemingly new tasks because they have likely seen a similar problem somewhere in their enormous training data. But when faced with a genuinely novel problem—something the models have never seen before (like the ARC-AGI benchmark)—their massive parameter counts become useless. Existing models can adapt a known solution; they struggle to formulate one from scratch.
3. Tiny Recursive Models: Trading Space for Time
The Tiny Recursion Model (TRM) (Jolicoeur-Martineau, 2025)1 compresses reasoning into a compact, cyclic process. Traditional transformer networks (i.e., today’s LLMs) are feed-forward architectures: they map input to output in a single pass. TRM, by contrast, works like a recurrent machine built around a single small MLP module that improves its output iteratively. This lets it beat the best mainstream reasoning models while being less than 7M parameters in size.
To understand how this network solves problems this efficiently, let’s walk through the architecture from input to solution.
Figure: Visual illustration of the entire TRM training/inference loop.
3.1. The Setup: The “Trinity” of State
In standard LLMs, the only “state” is the KV cache of the conversation history. TRM, by contrast, maintains three distinct vectors that feed information into each other:
- The Immutable Question (x): The original problem (e.g., a Maze or a Sudoku grid), embedded into a vector space. This is never updated during training or inference.
- The Current Hypothesis (yt): The model’s current “best guess” at the answer. At step t=0, this is initialized randomly as a learnable parameter which gets updated alongside the model itself.
- The Latent Reasoning (zn): This vector contains the abstract “thoughts”, the intermediate logic the model uses to derive its answer. Like yt, it is also initialized as a random learnable parameter at the start.
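The three state vectors described above can be set up as a toy sketch. This is purely illustrative: plain numpy arrays stand in for learnable tensors, and the shapes are assumed (81 cells for a 9x9 Sudoku, 64-dimensional embeddings), not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 81, 64  # toy sizes: a 9x9 Sudoku flattened to 81 cells

# x: the immutable question, embedded into vector space; never updated
x = rng.normal(size=(seq_len, d_model))

# y and z begin as randomly initialized learnable parameters;
# plain arrays stand in for trainable tensors in this sketch
y = rng.normal(size=(seq_len, d_model))  # current hypothesis y_t
z = rng.normal(size=(seq_len, d_model))  # latent reasoning state z_n
```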
3.2. The Core Engine: The Single-Network Loop
At the heart of TRM is a single, tiny neural network, which is often just two layers deep. This network is not a “model-layer” in the traditional sense, but is more like a function that is called repeatedly.
The reasoning process is a nested loop with two distinct stages: Latent Reasoning and Answer Refinement.
Step A: Latent Reasoning (Updating zn)
First, the model only thinks. It takes the current state (the three vectors described above) and runs a recursive loop to update its internal understanding of the problem.
For a set number of sub-steps (n), the network updates its latent thought vector zn:

zi+1 = net(x, yt, zi)  for i = 0, 1, …, n−1

The model takes in all three inputs and runs them through the network to update its thought vector (this goes on for n steps).
Here, the network looks at the problem (x), its current best guess (yt), and its previous thought (zn). With this, the model can identify contradictions or logical leaps in its understanding, which it can then use to update zn. Note that the answer yt is not updated yet. The model is purely thinking/reasoning about the problem.
Step B: Answer Refinement (Updating yt)
Once the latent reasoning loop has run for n steps, the model projects these insights into its answer state, using the same network:

yt+1 = net(yt, zn)

To refine its answer state, the model ingests only the thought vector and the current answer state.
The model translates its reasoning (zn) into a tangible prediction (yt). This new answer then becomes the input for the next cycle of reasoning, and the whole process repeats for T total steps.
Step C: The Cycle Continues
After every n steps of thought refinement, one answer-refinement step runs, and this whole cycle repeats T times. This creates a powerful feedback loop in which the model refines its own output over multiple iterations. The new answer (yt+1) might reveal information that all preceding steps missed (e.g., “filling this Sudoku cell reveals that the 5 must go here”). The model feeds this new answer back into Step A and continues refining its thoughts until the entire Sudoku grid is filled in.
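The nested loop of Steps A, B, and C can be sketched in a few lines. This is a toy stand-in under stated assumptions: a tiny two-layer `net` with inputs combined additively for simplicity, and illustrative values of T and n (the actual TRM embeds and combines its inputs more carefully).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 81, 64, 128
W1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_hidden, d_model))

def net(*states):
    """Toy stand-in for TRM's single two-layer network.
    Inputs are combined additively here for simplicity."""
    h = sum(states)
    return np.tanh(h @ W1) @ W2

x = rng.normal(size=(seq_len, d_model))  # immutable question
y = rng.normal(size=(seq_len, d_model))  # current hypothesis
z = rng.normal(size=(seq_len, d_model))  # latent reasoning state

T, n = 3, 6
for _ in range(T):            # Step C: the cycle repeats T times
    for _ in range(n):        # Step A: n latent-reasoning updates
        z = net(x, y, z)      # z_{i+1} = net(x, y_t, z_i)
    y = net(y, z)             # Step B: one answer-refinement step
```

Note that the same `net` is called in both stages; only the inputs differ, which is what lets a single tiny network carry the whole process.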
3.3. The “Exit” Button: Simplified Adaptive Computation Time
Another major innovation of TRM is how it budgets the reasoning process. A simple problem might be solved in just two loops, while a hard one might require 50 or more, so hard-coding a fixed number of loops is restrictive and, hence, not ideal. The model should decide for itself whether it has already solved the problem or still needs more iterations.
TRM employs Adaptive Computation Time (ACT) to dynamically decide when to stop, based on the difficulty of the input problem.
TRM treats stopping as a simple binary classification problem, which is based on how confident the model is about its own current answer.
The Halting Probability (h):
At the end of every set of T answer-refinement steps, the model projects its internal answer state into a single scalar between 0 and 1, which represents the model’s confidence:

ht = σ(Linear(yt))

- ht: Halting probability.
- σ: Sigmoid activation that bounds the output between 0 and 1.
- Linear: Linear transformation applied to the answer vector.
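A minimal sketch of this halting head follows. It assumes the answer state is mean-pooled into a single vector before the linear projection, a pooling detail chosen here for illustration rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 81, 64
w_halt = rng.normal(scale=0.1, size=d_model)  # "Linear": one output unit
b_halt = 0.0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

y_t = rng.normal(size=(seq_len, d_model))          # current answer state
h_t = sigmoid(y_t.mean(axis=0) @ w_halt + b_halt)  # halting probability
```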
The Training Objective:
The model is trained with a Binary Cross-Entropy (BCE) loss. It learns to output 1 (stop) when its current answer yt matches the ground truth, and 0 (continue) when it doesn’t:

Losshalt = BCE(ht, ytrue), where ytrue = I(yt = ground truth)

- Losshalt: Loss value used to teach the model when to stop.
- I(•): Indicator function that outputs 1 if the statement inside is true, else 0.
- ytrue: Ground truth for whether the model should stop or not.
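This loss can be written directly from the definitions above. `halt_loss` is a hypothetical helper for illustration, with the indicator I(•) computed as an exact-match check against the ground truth.

```python
import numpy as np

def halt_loss(h_t, y_pred, y_target):
    """BCE between the halting probability and the indicator
    'current answer matches the ground truth'."""
    y_true = float(np.array_equal(y_pred, y_target))  # I(y_t == ground truth)
    eps = 1e-12                                       # numerical safety
    return -(y_true * np.log(h_t + eps)
             + (1.0 - y_true) * np.log(1.0 - h_t + eps))

grid = np.arange(81) % 9
loss_correct = halt_loss(0.9, grid, grid)      # confident and correct: small
loss_wrong = halt_loss(0.9, grid, grid + 1)    # confident but wrong: large
```

Being confident (high ht) is only rewarded when the answer is actually right, which is exactly the behavior the stopping rule needs.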
Inference:
When the model runs on a new problem, it checks this probability ht after every loop (i.e., every n × T steps).
- If ht > threshold: The model is confident enough. It hits the “Exit Button” and returns the current answer yt as the final answer.
- If ht ≤ threshold: The model is still unsure. It feeds yt and zn back into the TRM loop for further deliberation and refinement.
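Putting the pieces together, inference under this scheme might look like the following sketch. `trm_cycle` and `halting_prob` are simplified stand-ins for the real modules, and the threshold and cycle cap are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 81, 64
x = rng.normal(size=(seq_len, d_model))  # immutable question
y = rng.normal(size=(seq_len, d_model))  # current hypothesis
z = rng.normal(size=(seq_len, d_model))  # latent reasoning state
W = rng.normal(scale=0.1, size=(d_model, d_model))
w_halt = rng.normal(scale=0.1, size=d_model)

def trm_cycle(x, y, z, n=6):
    """One outer cycle: n latent updates, then one answer update."""
    for _ in range(n):
        z = np.tanh((x + y + z) @ W)
    y = np.tanh((y + z) @ W)
    return y, z

def halting_prob(y):
    return 1.0 / (1.0 + np.exp(-(y.mean(axis=0) @ w_halt)))

threshold, max_cycles = 0.5, 16
for cycle in range(max_cycles):       # cap compute even if never confident
    y, z = trm_cycle(x, y, z)
    if halting_prob(y) > threshold:   # confident: hit the "Exit Button"
        break
```

The cap on cycles matters in practice: without it, an uncertain model could loop forever on an unsolvable input.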
This mechanism allows TRM to be computationally efficient. It achieves high accuracy not by being big, but by being persistent—allocating its compute budget exactly where it is needed.
4. The Results
To truly test the limits of TRM, it was benchmarked on some of the hardest logical datasets available, such as Sudoku-Extreme and the ARC-AGI challenge (Chollet, 2019)3.
1. The Sudoku-Extreme Benchmark
The first test was on the Sudoku-Extreme benchmark, which is a dataset of specially curated hard Sudoku puzzles that require deep logical deduction and the ability to backtrack on steps that the model later realizes were wrong.
The results run contrary to convention. TRM, with a mere 5 million parameters, achieved an accuracy of 87.4% on the dataset.
To put this in perspective:
- Today’s standard reasoning LLMs like Claude 3.7, GPT o3-mini, and DeepSeek R1 could not complete any Sudoku problem from the entire dataset, resulting in a 0% accuracy across the board (Wang et al., 2025)2.
- The previous state-of-the-art recursive model (HRM) used 27 million parameters (over 5x larger) and achieved 55.0% accuracy.
- By simply removing the complex hierarchy-based architecture of HRMs and focusing on a single recursive loop, TRM improved accuracy by over 30 percentage points while also reducing the parameter count.

Table: Ablation results on Sudoku-Extreme.
- T & n: Number of cycles of answer and thought refinement, respectively.
- w/ ACT: With the Adaptive Computation Time module, the model performs slightly worse.
- w/ separate fH, fL: Separate networks used for thought and answer refinement.
- w/ 4 layers, n=3: Doubled the architectural depth of the recursive module, but halved the number of recursions.
- w/ self-attention: Recursive module based on attention blocks instead of MLP.
2. The “Capacity Trap”: Why Deeper Was Worse
Perhaps the most counterintuitive insight that the authors found in their approach was what happened when they tried to make TRM “better” by doubling its parameter count.
When they increased the network depth from 2 layers to 4 layers, performance didn’t go up; instead, it crashed.
- 2-Layer TRM: 87.4% Accuracy on Sudoku.
- 4-Layer TRM: 79.5% Accuracy on Sudoku.
In the world of LLMs, adding more layers and making the model deeper has been the default way to increase intelligence. But for recursive reasoning on small datasets (TRM was trained on only ~1,000 examples), extra layers can become a liability as they allow the model excess capacity to memorize patterns instead of deducing them, leading to overfitting.
This validates the paper’s core hypothesis: that depth in time beats depth in space. It can be far more effective to have a smaller model think for a long time than to have a larger model think for a short amount of time. The model doesn’t need more capacity to memorize; it just needs more time and an efficient medium to reason in.
3. The ARC-AGI Challenge: Humiliating the Giants
The Abstraction and Reasoning Corpus (ARC-AGI) is widely considered to be one of the hardest benchmarks to test pattern recognition and logical reasoning in AI models. It essentially tests fluid intelligence, which is the ability to learn new abstract rules of a system from just a few examples. This is where most modern-day LLMs typically fail.
The results here are even more shocking. TRM, trained with only 7 million parameters, achieved 44.6% accuracy on ARC-AGI-1.
Compare this to the giants of the industry:
- DeepSeek R1 (671 Billion Parameters): 15.8% accuracy.
- Claude 3.7 (Unknown, likely hundreds of billions): 28.6% accuracy.
- Gemini 2.5 Pro: 37.0% accuracy.
A model that is roughly 0.001% the size of DeepSeek R1 outperformed it by nearly 3x. This is arguably the most efficient performance ever recorded on this benchmark. Only at the scale of Grok-4, with its 1.7T parameters, do we see performance that beats the recursive reasoning approaches of HRM and TRM.

5. Conclusion
For years, we have gauged AI progress by the number of zeros in the parameter count. The Tiny Recursion Model offers an alternative to this convention. It proves that a model does not need to be massive to be smart; it just needs the time to think effectively.
As we look toward AGI, the answer might not lie in building bigger data centers to incorporate trillion-parameter models. Instead, it might lie in building tiny, efficient models of logic that can ponder a problem for as long as they need—mimicking the very human act of stopping, thinking, and solving.
References
- Jolicoeur-Martineau, A. (2025). Less is More: Recursive Reasoning with Tiny Networks. arXiv.
- Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., & Yadkori, Y. A. (2025). Hierarchical Reasoning Model. arXiv.
- Chollet, F. (2019). On the Measure of Intelligence. arXiv.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv.