AlpamayoR1: Large Causal Reasoning Models for Autonomous Driving



Nvidia took the world of autonomous driving by storm with its new AlpamayoR1 architecture, which integrates a large Vision-Language Model as a causally grounded reasoning backbone. The release, accompanied by a new large-scale dataset and a photo-realistic driving simulator, already positions the company as one of the main players in the field in 2026.

In this article, we’ll break down the AlpamayoR1 architecture, its chain-of-causation reasoning, and the elaborate training procedure behind the model.

The Current State of Autonomous Driving

The release of AlpamayoR1 (AR1) sits within the current paradigm of End-to-End (E2E) architectures. E2E models aim to map raw sensory inputs (cameras, LiDAR, radar, …) directly to trajectories through a fully differentiable architecture optimising a unified objective.

An emerging trend in E2E involves leveraging the extensive world knowledge of large Vision-Language Models (VLMs) to tackle complex driving situations. This generally involves using VLMs as reasoning backbones to inform future trajectories or as expert teachers to provide supervisory signal to smaller student models.

The AR1 Architecture

AR1 is a prime example of the reasoning-VLM-as-a-backbone approach. Despite its massive size, the architecture is optimised for real-world deployment and runs at a latency of 99 ms (roughly 10 Hz) on a single Blackwell GPU, a rate generally considered the target for safety reasons. In this section, we’ll break down the architecture and its numerous innovations.

High-level overview of the AR1 architecture, source: [1]

Vision Encoder

AR1 uses both visual and textual inputs in the form of tokenised camera feeds and natural language instructions. For performance, it is crucial for the vision encoder to produce as few tokens as possible.

To this end, the authors used a Vision Transformer (ViT) [2] for single-image tokenisation. ViTs partition images into patches, which are embedded into a sequence of tokens and processed by a standard transformer encoder. Note that the integration of more efficient algorithms like Flex [3] for multi-video tokenisation is left for future work.
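To make the patch-tokenisation step concrete, here is a minimal sketch of a ViT-style tokenizer; the patch size, embedding dimension and class name are illustrative defaults, not values from the paper.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Illustrative ViT-style tokenizer: splits an image into patches
    and projects each patch to an embedding (one token per patch)."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768):
        super().__init__()
        # A strided convolution patchifies and embeds in a single operation.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> tokens: (batch, num_patches, embed_dim)
        x = self.proj(images)                 # (batch, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)

# A 224x224 frame with 16x16 patches yields 196 visual tokens.
tokens = PatchTokenizer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The fewer patches (and thus tokens) the encoder emits, the cheaper every subsequent attention layer in the backbone becomes, which is why token count is treated as a first-class performance knob.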

Vision Transformer architecture, source: [2]

Reasoning Backbone

The AR1 architecture is built around Cosmos-Reason, one of Nvidia’s VLMs trained specifically for embodied reasoning in Physical AI use cases. Its training set includes 3.7M general Visual Question-Answering (VQA) samples to improve the model’s physical common sense, complemented by 24.7K driving samples: video VQA annotated with DeepSeek-R1 reasoning traces that predict the next action.

Cosmos-Reason processes visual and text tokens along with the recent ego-history (past x-y positions and angle of the ego-vehicle) to output chain of causation reasoning traces to inform future trajectories.
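At the interface level, the backbone’s inputs can be pictured as follows; this is a hypothetical sketch whose class, field and function names are my own and only mirror the inputs and outputs described above.

```python
from dataclasses import dataclass
import torch

@dataclass
class DrivingContext:
    """Inputs consumed by the reasoning backbone (names are illustrative)."""
    camera_tokens: torch.Tensor   # (num_visual_tokens, d_model) from the vision encoder
    text_tokens: torch.Tensor     # tokenised natural-language instruction
    ego_history: torch.Tensor     # (T, 3) past x, y positions and heading angle

def build_prompt(ctx: DrivingContext, embed_text, embed_ego) -> torch.Tensor:
    """Concatenate the multimodal inputs into a single token sequence.
    The VLM then autoregressively continues this sequence with a
    chain-of-causation trace followed by action tokens."""
    return torch.cat([ctx.camera_tokens,
                      embed_text(ctx.text_tokens),
                      embed_ego(ctx.ego_history)], dim=0)
```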

Chain of Causation

A crucial limitation of language models lies in the inherent ambiguity of text labels in visual datasets. This includes vague descriptions lacking a causal structure. Models trained on such data exhibit a low correlation between their reasoning traces and predicted actions as well as causal confusion.

Driving datasets tend to include vague annotations with weak causal grounding, source: [1]

For an embodied agent like an autonomous car, strong causal reasoning abilities are essential. To circumvent these problems, the Nvidia team invested significant effort in creating a driving dataset with causally consistent annotations.

Specifically, the dataset contains 20-second clips extracted from real-world driving recordings in various environments and countries. Each clip contains 2 seconds of context leading to a driving decision (e.g. overtaking, yielding, passing an intersection, …) and its consequences. The causal structure of these scenarios is exposed by consistent textual annotations following a strict template.
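To make this more tangible, here is a hypothetical schema for a single clip; the field names and defaults are my own guesses based on the description above, not the paper’s actual annotation template.

```python
from dataclasses import dataclass, field

@dataclass
class CoCSample:
    """Hypothetical schema for one Chain of Causation clip (illustrative only)."""
    clip_id: str
    duration_s: float = 20.0          # full clip length
    context_window_s: float = 2.0     # context leading up to the decision
    decision: str = ""                # e.g. "yield", "overtake", "cross intersection"
    causal_factors: list[str] = field(default_factory=list)   # observed causes, e.g. "pedestrian entering crosswalk"
    reasoning_trace: str = ""         # templated chain-of-causation annotation
    expert_trajectory: list[tuple[float, float]] = field(default_factory=list)  # future x-y waypoints
```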

Annotation pipeline for the Chain of Causation dataset, source: [1]

The first 10% of the dataset is annotated by humans, while the remainder is annotated by state-of-the-art VLMs such as GPT-5 to scale the labeling process. Once again, significant effort goes into ensuring the consistency, quality and correctness of both the human and AI annotations.

Examples of chain of causation reasoning produced by AR1, source: [1]

Trajectory Decoder

The last step of the forward pass consists of decoding the reasoning traces into a 64-point trajectory. While trajectories are usually decoded as a sequence of waypoints (x-y coordinates), the Nvidia team found that using unicycle dynamics (i.e. generating a sequence of acceleration values and steering angles) produced more consistent results. In particular, it simplifies the learning task by preventing the model from predicting physically impossible trajectories (e.g. waypoint t+1 being unreachably far from waypoint t).
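To see why the unicycle parameterisation keeps trajectories physically plausible, here is a minimal sketch that integrates a sequence of controls into x-y waypoints; the time step, initial speed and function name are assumptions, and I use acceleration and curvature as controls (consistent with the action tokens discussed below), which may differ from the paper’s exact parameterisation.

```python
import numpy as np

def rollout_unicycle(accels, curvatures, dt=0.1, v0=5.0):
    """Integrate unicycle dynamics: each step updates speed, heading and position.
    Because positions come from integration, consecutive waypoints can never
    'teleport' the way freely predicted x-y coordinates can."""
    x, y, heading, v = 0.0, 0.0, 0.0, v0
    waypoints = []
    for a, k in zip(accels, curvatures):
        v = max(v + a * dt, 0.0)        # speed never goes negative
        heading += v * k * dt           # curvature = change in heading per unit arc length
        x += v * np.cos(heading) * dt
        y += v * np.sin(heading) * dt
        waypoints.append((x, y))
    return np.array(waypoints)

# 64 control steps -> 64 physically consistent waypoints (gentle constant left turn)
traj = rollout_unicycle(accels=np.zeros(64), curvatures=np.full(64, 0.02))
```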

Interestingly, the authors adopt a dual representation of the trajectory where the model auto-regressively generates discrete tokens during training and uses flow-matching to generate a continuous trajectory at inference time. The main reasons behind this design are as follows:

  1. Joint Action-Reasoning Token Space: Using discrete action tokens allows for a tighter coupling between reasoning traces and actions. When the model generates a reasoning trace, the next tokens in the sequence (accelerations and curvatures) are mathematically linked to that explanation, which prevents hallucinations.
  2. Facilitating RL Optimisation: Restricting the possible actions to a discrete set makes RL optimisation significantly easier. Indeed, sampling the correct token from a discrete vocabulary (e.g. ACCEL_NEG_2) is far easier to reward than providing a gradient for a continuous value like -2.145 m/s^2 (a toy quantisation sketch follows below). As we’ll see in the next section, this enables RL post-training, which is crucial to improve the model’s safety and consistency.
  3. Stronger Supervisory Signal: Using a cross-entropy loss on discrete tokens acts like a classification task and better captures the multi-modality (e.g. the distinct probability of turning left or right) than an MSE loss on coordinates.
  4. Flow Matching for Inference: While discrete tokens are great for learning, they typically result in jerky trajectories. Moreover, generating a sequence of 128 tokens auto-regressively is too slow for real-time inference. To address those limitations, the authors introduce an action expert: a smaller variant of the main architecture using the KV cache (which contains visual tokens, historical motions and reasoning traces) to decode a continuous trajectory in one pass using flow-matching diffusion. This is one of the main reasons why AR1 can run at such low latency.
Latency benchmark for several AR1 variants, generating trajectories via flow-matching saves close to 200ms at inference time. Source: [1]
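Below is a toy illustration of how a continuous control value could be quantised into a discrete action vocabulary such as the ACCEL_NEG_2 token mentioned above; the bin edges and token naming scheme are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Invented bin edges: accelerations from -4 to +4 m/s^2 in 0.5 m/s^2 steps.
ACCEL_BINS = np.arange(-4.0, 4.5, 0.5)

def accel_to_token(a_mps2: float) -> str:
    """Quantise a continuous acceleration to its nearest bin and name the token,
    e.g. -2.145 m/s^2 -> 'ACCEL_NEG_2_0'."""
    value = ACCEL_BINS[np.argmin(np.abs(ACCEL_BINS - a_mps2))]
    sign = "NEG" if value < 0 else "POS"
    return f"ACCEL_{sign}_{abs(value):.1f}".replace(".", "_")

print(accel_to_token(-2.145))  # ACCEL_NEG_2_0
```

With such a vocabulary, predicting an action becomes a classification problem over a finite token set, which is exactly what makes the cross-entropy supervision and the RL reward assignment described above tractable.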

Supervised Fine-Tuning and RL Post-Training

Multi-stage training pipeline for the Cosmos-Reason backbone and the AR1 architecture, source: [1]

To turn the VLM backbone into a performant driving policy, it undergoes supervised fine-tuning (SFT) on the Chain of Causation dataset. Specifically, it learns to reproduce the reasoning traces and associated ground-truth actions by maximising the log-likelihood of the action-reasoning sequence:

Supervised Fine-Tuning loss, made by the author
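In standard notation, such a maximum-likelihood objective can be written roughly as follows (my own rendering; the paper’s exact notation may differ), where o denotes the conditioning inputs (visual tokens, ego-history, instruction), z the chain-of-causation tokens and a the discrete action tokens:

```latex
\mathcal{L}_{\text{SFT}}(\theta)
  = -\,\mathbb{E}_{(o,\,y) \sim \mathcal{D}}
    \left[ \sum_{t=1}^{T} \log \pi_\theta\big(y_t \mid y_{<t},\, o\big) \right],
\qquad
y = (z_1, \dots, z_m,\, a_1, \dots, a_n)
```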

However, SFT on its own is not enough: VLMs notoriously suffer from discrepancies between their reasoning and their predicted actions. The static nature of open-loop datasets lets the model mimic reasoning traces, but the lack of environmental feedback prevents it from truly internalising causal relationships.

Fortunately, RL post-training helps alleviate these limitations by providing feedback on the model’s own rollouts. In this paper, the authors use RL for three main purposes:

  1. Improving reasoning quality: a large reasoning model (e.g. DeepSeek-R1) evaluates AR1’s reasoning traces for inconsistencies and hallucinations and assigns a discrete reward on a scale of 0 to 5 accordingly. While DeepSeek-R1 is not expected to generate high-quality driving reasoning itself, verifying AR1’s reasoning is significantly easier than producing it; this is known as the generation-verification gap.
  2. Enforcing reasoning-action consistency: the authors extract meta-actions (accelerate, steer, go straight, …) from the CoC dataset using rule-based systems. If those meta-actions correspond to those mentioned in the reasoning traces, the model receives an additional reward of 1, otherwise 0.
  3. Trajectory quality: a trajectory reward measures the L2 distance between the predicted and expert trajectories and penalises trajectories that lead to collisions or high-magnitude jerk. A toy sketch of how these three signals might be combined follows this list.
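As a rough illustration of how these three signals could be folded into a single scalar reward per rollout, consider the sketch below; the weights, normalisation and function names are assumptions, not taken from the paper.

```python
import numpy as np

def rollout_reward(reasoning_score: float,      # 0-5 grade from the LLM judge
                   meta_actions_match: bool,    # reasoning vs. executed meta-actions
                   pred_traj: np.ndarray,       # (N, 2) predicted x-y waypoints
                   expert_traj: np.ndarray,     # (N, 2) expert x-y waypoints
                   collided: bool,
                   jerk_penalty: float,
                   w=(1.0, 1.0, 1.0)) -> float:
    """Toy combination of the three reward signals (weights are made up)."""
    r_reasoning = reasoning_score / 5.0                           # normalise to [0, 1]
    r_consistency = 1.0 if meta_actions_match else 0.0
    l2 = float(np.linalg.norm(pred_traj - expert_traj, axis=1).mean())
    r_traj = -l2 - (10.0 if collided else 0.0) - jerk_penalty     # lower is worse
    return w[0] * r_reasoning + w[1] * r_consistency + w[2] * r_traj
```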

During post-training, AR1 generates multiple parallel rollouts and collects rewards r_i based on the three reward signals above. These rewards are then used to compute the GRPO loss [4]. GRPO computes the advantage of each rollout relative to the group average. This critic-free approach (as opposed to RL algorithms like PPO, which rely on a learned value function) stabilises training by rewarding reasoning paths that outperform their counterparts for the same input, rather than relying on an arbitrary absolute score.

GRPO loss, made by the author

All you need to understand about this objective is that it increases the probability (the log term) of rollouts whose advantage, i.e. their reward relative to the rest of the group, is high. To avoid losing the vision-language priors of the VLM and the driving knowledge obtained during SFT, the objective is regularised by a KL divergence between the current policy and the reference policy (the one obtained at the end of SFT).
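For completeness, the canonical GRPO objective from [4] has roughly the following form (a sketch; the exact formulation used in the paper and in the figure above may differ), where ρ_i is the importance ratio and A_i the group-normalised advantage of rollout i:

```latex
\mathcal{J}_{\text{GRPO}}(\theta)
  = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
      \min\!\Big( \rho_i A_i,\;
                  \operatorname{clip}\big(\rho_i,\, 1-\epsilon,\, 1+\epsilon\big)\, A_i \Big)
    \right]
  - \beta\, D_{\mathrm{KL}}\!\big( \pi_\theta \,\Vert\, \pi_{\text{ref}} \big),
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},
\quad
A_i = \frac{r_i - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})}
```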

Evaluation

The evaluation protocol includes four parts: open-loop trajectory prediction, closed-loop simulation, ablation studies and on-vehicle road tests. While the fact that AR1 was deployed in real-world scenarios is impressive, the open- and closed-loop results are somewhat opaque in my opinion, the main reason being that they were obtained on Nvidia’s own benchmarks (open-loop: the PhysicalAI-AV dataset; closed-loop: AlpaSim), released at the same time as the model. This means there are few external baselines to contextualise AR1’s performance.

For instance, the closed-loop results only feature AR1 and a non-reasoning baseline on 75 scenarios. While AR1 outperforms the baseline on all measured metrics, it often does so by only about one percentage point on average, and with a much larger variance than the baseline.

Closed-loop results for AR1 and a non-reasoning baseline, source: [1]

For this reason, I would advise taking these results with a grain of salt before other frontier architectures are evaluated in AlpaSim.

Conclusion

Despite the lack of contextualised results, AR1 and the accompanying datasets remain an impressive engineering achievement and a good indication of where autonomous driving is headed: end-to-end models inheriting world knowledge from massive VLMs trained on embodied tasks.

However, collecting the causally grounded datasets required for chain-of-causation reasoning demands significant investment and labeling effort, which limits reproducibility until these datasets are made public. In my next article, I’ll contrast the AR1 approach with another state-of-the-art model that dispenses with textual labels entirely and instead trains VLMs to act and reason in a latent space.

Thank you for reading this far!

If you found this article useful, please consider sharing it; it genuinely helps support the time and effort that goes into producing this work. As always, feel free to contact me if you have questions, thoughts, or ideas for follow-ups. If you’d like to support my independent research and writing, feel free to buy me a coffee 😉

Until next time! 👋

Sources
