2-bit VPTQ: 6.5x Smaller LLMs while Preserving 95% Accuracy



Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU


Recent developments in low-bit quantization for LLMs, like AQLM and AutoRound, are now showing acceptable levels of degradation in downstream tasks, especially for large models. That said, 2-bit quantization still introduces noticeable accuracy loss in most cases.

One promising algorithm for low-bit quantization is VPTQ (MIT license), proposed by Microsoft. It was introduced in October 2024 and has since shown excellent performance and efficiency in quantizing large models.

In this article, we will:

  1. Review the VPTQ quantization algorithm.
  2. Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B.
  3. Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.

Remarkably, 2-bit quantization with VPTQ achieves performance nearly comparable to the original 16-bit model on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, using less memory than a 16-bit 70B model!
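To see why a 2-bit 405B model can fit where a 16-bit 70B model cannot, a back-of-envelope calculation of weight memory is enough. This sketch counts weights only; it ignores activations, the KV cache, and the small overhead of VPTQ's codebooks and index tensors, so real footprints will be somewhat higher:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Llama 3.1 405B at 2 bits vs. a 70B model at 16 bits (bfloat16)
m405_2bit = weight_memory_gb(405e9, 2)    # ~101 GB
m70_16bit = weight_memory_gb(70e9, 16)    # 140 GB
print(f"405B @ 2-bit:  {m405_2bit:.0f} GB")
print(f"70B  @ 16-bit: {m70_16bit:.0f} GB")
```

Even with quantization overhead added back in, the 2-bit 405B model stays well under the 16-bit 70B footprint, which is what makes single-GPU (multi-GPU-node-free) inference plausible.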
