Introduction
The story of Artificial Intelligence up until now has been defined by a simple, albeit expensive, rule: bigger is always better. As Large Language Models (LLMs) scale into the trillions of parameters, they exhibit reasoning capabilities that were unimaginable just a few years ago, and they keep improving.
However, this growth has run into a physical reality. The energy and hardware required to run these models are becoming unsustainable, to the point where companies like Google and Meta are exploring nuclear power just to meet their future energy demands (The Guardian) [2].
Bigger is NOT Always Better
To combat this issue, the industry has relied on compression and quantization techniques. In simple terms, this involves taking a model trained in high precision (16-bit) and rounding its weights down to a lower precision (such as 8-bit or 4-bit) for inference (Frantar et al., 2022) [3]. Even though this method works, it is still a makeshift solution to the larger problem: the model was never designed to be small in the first place.
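To make the idea concrete, here is a minimal PyTorch sketch of naive symmetric post-training quantization. The helper name and the rounding scheme are illustrative assumptions; real methods such as GPTQ are far more sophisticated about minimizing the resulting error.

```python
import torch

def quantize_int8_symmetric(w: torch.Tensor):
    """Illustrative post-training quantization: round FP weights to INT8.

    Purely a sketch of the general idea; real methods (e.g., GPTQ)
    control the quantization error far more carefully.
    """
    scale = w.abs().max() / 127.0                       # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale                                     # store INT8 weights plus one FP scale

w_fp = torch.randn(4, 4)                                # pretend these are trained FP16/FP32 weights
q, scale = quantize_int8_symmetric(w_fp)
w_approx = q.float() * scale                            # dequantized approximation used at inference
```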
But what if high precision isn’t actually necessary for high performance?
In a recent paper titled “The Era of 1-bit LLMs” (Ma et al., 2024) [1], researchers from Microsoft propose a completely different perspective on how LLMs are constructed. They introduce BitNet b1.58, an architecture that, instead of compressing a model after the fact, trains it under an extremely aggressive low-precision constraint from the get-go: every weight is forced to take one of only three possible values, {−1, 0, 1}. This article explores how such a severe restriction is possible, the mathematical innovations behind the approach, and whether this method could be a viable alternative to the expensive floating-point operations that are the de facto standard in modern AI.
The Architecture: Designing a 1-Bit Brain
To understand the innovation of BitNet b1.58, we must look at the basic operation of a layer in a standard neural network. In modern LLMs, the nn.Linear layer stores information in a weight matrix of high-precision floating-point numbers (e.g., FP16/FP32). BitNet replaces this with a specialized BitLinear layer, in which every weight is restricted to one of just three integer values.
1. Achieving Ternary Weights
The core constraint of BitNet b1.58 is that every single parameter in the weight matrix of the network must resolve to one of three integers: {−1,0,1}. Unlike Post-Training Quantization, which compresses a model after it has been trained, BitNet enforces this constraint during the training process itself.
The authors utilize an Absmean Quantization function to map continuous values to this ternary set. The process involves the following two steps: scaling and rounding.
- Scaling: The weight matrix is first normalized by its average absolute value γ. This ensures that the distribution of weights remains centered and consistent. The scaling factor is calculated as:

γ = (1 / nm) · Σᵢⱼ |Wᵢⱼ|

where n and m are the number of rows and columns of the matrix, and Wᵢⱼ is the parameter at the i-th row and j-th column.
- Rounding: The scaled values are then rounded to the nearest integer and clipped so that they fall strictly within the range [−1, 1]:

W̃ = RoundClip(W / (γ + ϵ), −1, 1), where RoundClip(x, a, b) = max(a, min(b, round(x)))

Here W is the original weight matrix and ϵ is a small value added to prevent zero-division errors.
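The absmean function translates almost directly into code. Below is a minimal PyTorch sketch of the two steps; the function name and tensor shapes are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def absmean_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Map a full-precision weight matrix to ternary values {-1, 0, 1}.

    Follows the absmean formula described above; an illustrative sketch,
    not the authors' reference code.
    """
    gamma = w.abs().mean()                                   # average absolute value of the matrix
    w_scaled = w / (gamma + eps)                             # scale so weights cluster around [-1, 1]
    w_ternary = torch.clamp(torch.round(w_scaled), -1, 1)    # round, then clip to {-1, 0, 1}
    return w_ternary, gamma

w = torch.randn(6, 6)
w_q, gamma = absmean_quantize(w)
print(torch.unique(w_q))                                     # tensor([-1., 0., 1.])
```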
2. The Training Paradox: How to Differentiate Integers
The most significant challenge the authors faced in designing the one-bit architecture was the training process. Standard optimization algorithms, such as Stochastic Gradient Descent (SGD) or Adam, rely on a continuous, differentiable loss landscape: they calculate the gradient of the loss function and adjust the weights by a tiny amount (e.g., 0.001) in the opposite direction.
This creates a paradox:
How do you “nudge” an integer to incorporate the changes suggested by the gradients?
For example: If a weight is 1 and the gradient suggests moving it by −0.001, the result is 0.999. If we enforce integer states only, this value snaps right back to 1, the model never updates, and hence, it never learns.
BitNet solves this using a Latent Weight architecture (Bengio et al., 2013) [5].
2.1 The Latent Weight Mechanism

Flowchart depicting how the authors decouple ternary and master weights to enable model training.
The model maintains two versions of all of its parameters during training:
- Master Weights (High-Precision): These are standard FP16/FP32 numbers that can capture small updates.
- Quantized Weights (Ternary): These are the discrete {−1, 0, 1} values derived from the Master Weights, used for the actual forward pass and inference.
2.2 The Forward Pass
During the forward pass, the master weights are first converted to ternary weights using the operations described above (scaling and rounding). The model then uses these ternary weights to generate its output. This ensures that the model’s predictions always reflect the constrained ternary weights it will ultimately be deployed with, rather than the full-precision master weights.
2.3 The Backward Pass and Update
During backpropagation, the gradients flow backward from the loss function. Because the rounding step is not differentiable, it is treated as an identity function during the backward pass (the straight-through estimator), letting the gradients pass through unchanged. These gradients are then applied to the Master Weights, not the Ternary Weights.
This allows the Master Weights to accumulate small changes over many training steps. For example, consider a Master Weight whose value is 0.4 (which corresponds to a 0 in the ternary set). After several updates, it might shift to 0.45, then 0.49. It still rounds to 0, so the model’s behavior doesn’t change yet. However, once it crosses the rounding threshold (e.g., reaching 0.51), it will then round to 1.
This mechanism allows the model to learn via standard gradient descent while still ensuring that the final trained model consists exclusively of the efficient ternary weights.
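A minimal PyTorch sketch of how such a layer could be wired together is shown below. The class name is hypothetical, activations are left in full precision for simplicity (the real BitLinear also quantizes activations and rescales the output), and the `w + (w_q - w).detach()` trick is a common way of expressing the straight-through estimator so that gradients bypass the non-differentiable rounding and reach the master weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Illustrative BitLinear-style layer: FP master weights, ternary forward pass."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Master weights: full-precision latent parameters updated by the optimizer.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        gamma = w.abs().mean()
        w_q = torch.clamp(torch.round(w / (gamma + 1e-5)), -1, 1)  # ternary weights
        # Straight-through estimator: the forward pass uses w_q, while the backward
        # pass treats the quantization as identity so gradients update the master weights.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste)

layer = TernaryLinear(8, 4)
out = layer(torch.randn(2, 8))
out.sum().backward()            # gradients land on layer.weight (the master weights)
```

During training an optimizer would keep updating `layer.weight`; at deployment time only the ternary weights and a single scale per matrix would need to be stored.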
3. Elimination of Matrix Multiplication
The most significant and immediate benefit of forcing weights into {−1,0,1} is the elimination of floating-point multiplication, which is the most expensive operation in modern deep learning hardware.

Eliminating floating-point numbers from the weight matrices removes the need for floating-point multiplications, the most expensive and most pervasive operation on GPUs.
In a standard Transformer (Vaswani et al., 2017) [4], the GPU must perform billions of Multiply-Accumulate (MAC) operations, where one floating-point number is multiplied by another. However, when one of the two operands is restricted to the ternary set, the multiplication disappears entirely:
- Multiplication by 1 is simply an addition (+x).
- Multiplication by −1 is simply a subtraction (−x).
- Multiplication by 0 avoids the computation entirely.
This architectural shift turns the core matrix multiplications from expensive floating-point multiply-accumulate operations into simple integer additions. This drastically reduces the energy footprint of the model, as integer addition is orders of magnitude cheaper to perform than floating-point multiplication.
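As a purely conceptual illustration, the toy function below computes a dot product against ternary weights using only additions and subtractions. Real BitNet kernels operate on packed low-bit representations and are vastly more efficient; this sketch only shows the arithmetic.

```python
def ternary_dot(x: list[float], w: list[int]) -> float:
    """Dot product with ternary weights {-1, 0, 1}: no multiplications needed."""
    acc = 0.0
    for xi, wi in zip(x, w):
        if wi == 1:        # multiplication by 1  -> just add
            acc += xi
        elif wi == -1:     # multiplication by -1 -> just subtract
            acc -= xi
        # wi == 0          -> skip the term entirely
    return acc

print(ternary_dot([0.5, -2.0, 3.0, 1.5], [1, -1, 0, 1]))   # 0.5 + 2.0 + 1.5 = 4.0
```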
Results: The Pareto Improvement
The primary objective of the BitNet b1.58 research was not just to create a smaller model, but to prove that extreme quantization does not have to come at the expense of intelligence. The authors compared their architecture against FP16 LLaMA models (Touvron et al., 2023) [6] on various downstream tasks and observed some interesting findings:
1. Performance Parity with Full-Precision Models
Perhaps the most crucial finding is that BitNet b1.58 can perform on par with standard FP16 models. When evaluated on zero-shot accuracy on benchmarks like ARC-Challenge, HellaSwag, and Winogrande, the b1.58 model demonstrated performance similar to that of the FP16 LLaMA models.
As evident from the table below, this parity begins to manifest strongly at the 3-billion-parameter mark. While the smaller models struggled slightly against the LLaMA baselines, BitNet b1.58 3B outperforms its LLaMA counterpart on average zero-shot accuracy. This lends credibility to the authors’ hypothesis that a ternary representation of the weight matrices is enough to capture the nuances and intricacies of language modeling, without the need for high-precision floating-point weights.

For the smaller models (700M and 1.3B), BitNet still lagged behind the standard LLaMA models, but for the 3B variant, BitNet’s performance is virtually identical, and even superior on some benchmarks.
2. Redefining Latency and Memory Footprint
By reducing the weight precision from 16 bits down to 1.58 bits, the memory footprint of the model drops dramatically. As shown below, BitNet b1.58 requires 3.55x less GPU memory than its LLaMA counterpart at the 3B parameter size. This reduction also alleviates the memory-bandwidth bottleneck, which is a primary constraint during LLM inference.
A smaller memory footprint translates directly into lower latency as well. The authors observed a 2.71x reduction in inference latency for the 3B model size. Furthermore, this latency gap between FP16 LLaMA and BitNet b1.58 widens as the models scale up: at 70B parameters, it grows to 4.10x. This indicates a very promising scaling trend, where the larger the model, the more it benefits from the BitNet architecture.

Latency and memory, plotted against model size. The gap between standard LLaMA and BitNet widens as model size increases, a sign of a favorable scaling trend.
3. Energy Consumption and Arithmetic Efficiency
Apart from the efficiency gains from reduced precision, we also get profound energy savings from the elimination of floating-point multiplications. With ternary weights, BitNet’s matrix computations rely on INT8 addition instead of FP16 multiplication, which sharply reduces arithmetic energy costs.
The authors applied an energy model to estimate the cost of operations on 7nm chips. They observed that as the model size scales up, BitNet becomes increasingly efficient. Since the nn.Linear layers (where the majority of the savings occur) constitute a larger percentage of the total computation in bigger models, the energy gap between standard LLaMA and BitNet grows with scale. For a 70B model, the end-to-end energy cost is more than 41x lower, addressing one of the most prominent environmental concerns about the deployment of large-scale AI models.

Plot of energy cost vs. model size. The combined effect of eliminating floating-point multiplications and aggressive quantization yields enormous energy savings.
4. Throughput Maximization
In real-world production environments, throughput (tokens generated per second) is often a more important metric than single-stream latency. BitNet’s smaller memory overhead allows much larger batch sizes on the same GPUs.
On two 80GB A100 GPUs, the authors found that they could run a BitNet b1.58 70B model with a batch size 11 times larger than what was possible with FP16 LLaMA 70B. This resulted in an 8.9x increase in overall throughput. This finding matters for serving infrastructure: 1-bit LLMs could serve nearly nine times as many users as current models on the same hardware. The applications range from real-time translation and autonomous vehicles to instant code generation and more.

BitNet b1.58 supports an 11x larger batch size than FP16 LLaMA and nearly 9x higher token-generation throughput.
Conclusion
As impressive as these results are, they represent a floor for 1-bit architectures, not a ceiling. It is important to note that the benchmarks and performance gains discussed above were measured on hardware (NVIDIA A100s) designed for floating-point multiplication. In other words, BitNet b1.58 is currently running on chips that are not optimized for the INT8 additions on which the entire architecture rests.
This implies that substantial efficiency gains remain unexplored. If BitNet can achieve an 8-9x speedup on suboptimal hardware, the potential gains on hardware designed specifically for integer addition, such as Groq’s LPUs, could be even more substantial.
This architecture also offers us a realistic pathway towards deploying large 70B+ parameter models, directly on local edge devices like mobile phones and laptops, without compromising intelligence.
References
[1] Ma, Shuming, et al. “The Era of 1-bit LLMs: All Large Language Models Are in 1.58 Bits.” arXiv.org, 27 Feb. 2024, arxiv.org/abs/2402.17764.
[2] The Guardian. “Meta Signs Deal With Nuclear Plant to Power AI and Datacenters for 20 Years,” 4 June 2025, www.theguardian.com/technology/2025/jun/03/meta-nuclear-power-ai.
[3] Frantar, Elias, et al. “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” arXiv.org, 31 Oct. 2022, arxiv.org/abs/2210.17323.
[4] Vaswani, Ashish, et al. “Attention Is All You Need.” arXiv.org, 12 June 2017, arxiv.org/abs/1706.03762.
[5] Bengio, Yoshua, et al. “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.” arXiv.org, 15 Aug. 2013, arxiv.org/abs/1308.3432.
[6] Touvron, Hugo, et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” arXiv.org, 18 July 2023, arxiv.org/abs/2307.09288.