Introducing ft-Q: Improving Vector Compression with Feature-Level Quantization
by Michelangiolo Mazzeschi, November 2024




Pushing quantization to its limits by performing it at the feature level with ft-Quantization (ft-Q)

Note: to understand this article, knowledge of embeddings and basic quantization is required. The implementation of this algorithm has been released on GitHub and is fully open-source.

Since the dawn of LLMs, quantization has become one of the most popular memory-saving techniques for production-ready applications. Not long after, it was also adopted by vector databases, which use the same technology to compress not only models but also vectors for retrieval purposes.

In this article, I will showcase the limitations of the current quantization algorithms and propose a new quantization approach (ft-Q) to address them.

Quantization is a memory-saving algorithm that lets us store numbers (both in memory and on disk) using fewer bits. By default, when we store any number in memory, we use float32: the number is encoded using a combination of 32 bits (binary elements).

For example, the integer 40 is stored as follows in a 32-bit object:

storing the number 40 in a 32-bit object, image by Author

However, we could decide to store the same number using fewer bits (halving the memory usage), with a 16-bit object:

storing the number 40 in a 16-bit object, image by Author

By quantization, we mean storing data using a lower number of bits (e.g., 32 -> 16, or 32 -> 4); this operation is also known as casting. If we were to store 1 GB of numbers (by default stored as 32-bit objects) using 16-bit objects instead (hence, applying a quantization), the size of our data would be halved, resulting in 0.5 GB.
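As a minimal sketch of this idea (using numpy, with a hypothetical array of one million values), casting from float32 to float16 halves the memory footprint:

import numpy as np

# A hypothetical batch of one million numbers, stored with the default 32-bit precision.
values = np.random.rand(1_000_000).astype(np.float32)
print(values.nbytes)     # 4000000 bytes, ~4 MB

# Casting to 16-bit objects halves the memory usage.
values_16 = values.astype(np.float16)
print(values_16.nbytes)  # 2000000 bytes, ~2 MB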

Is there a catch to quantization?

Saving this amount of storage looks incredible (and we could keep cutting until we reach the minimum number of bits: 1 bit, also known as binary quantization, shrinking our database by 32 times, from 1 GB to 31.25 MB!), but as you might have already guessed, there is a catch.

Any number can only be stored up to the limits allowed by the possible combinations of bits. With 32 bits, you can represent a maximum of 2³² distinct values. That is enough headroom to encode decimals as well. For example, if we add a decimal to our initial number and store 40.12 in 32 bits, it is encoded by this combination of 1s and 0s:

01000010 00100000 01111010 11100001
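You can reproduce this bit pattern with a few lines of Python, using the standard struct module to reinterpret the float's raw bytes as an integer:

import struct

# Pack 40.12 as a big-endian 32-bit float, then read the same bytes back as an integer.
bits = struct.unpack(">I", struct.pack(">f", 40.12))[0]
print(f"{bits:032b}")  # 01000010001000000111101011100001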

With 32-bit storage (given its huge number of possible values) we can encode pretty much any number, including its decimal part (to clarify, if you are new to quantization: the integer and decimal parts are not stored separately; 40.12 is converted as a whole into a combination of 32 binary digits).

If we keep reducing the number of bits, the possible combinations shrink exponentially. For example, 4-bit storage has a limit of 2⁴ combinations: we can only represent 16 distinct values (which does not leave much room for decimals). With 1-bit storage, we can only represent two values: either a 1 or a 0.
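A quick sketch makes the exponential collapse concrete:

# The number of representable values shrinks rapidly as bits are removed.
for bits in (32, 16, 8, 4, 1):
    print(f"{bits}-bit storage: {2**bits} possible values")
# 32-bit storage: 4294967296 possible values
# 16-bit storage: 65536 possible values
# 8-bit storage: 256 possible values
# 4-bit storage: 16 possible values
# 1-bit storage: 2 possible values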

To put this into context, compressing our initial 32-bit numbers into binary code would force us to convert every number, such as 40.12, into either a 0 or a 1. In this scenario, the compression does not look very good.

We have seen how quantization results in information loss. So, how can we make use of it after all? When you look at the quantization of a single number (40.12 converted into 1), it seems no value can be derived from such an extreme level of quantization; there is simply too much loss.

However, when we apply this technique to a set of data such as vectors, the information loss is not as drastic as when applied to a single number. Vector search is a perfect example of where to apply quantization in a useful manner.

When we use an encoder such as all-MiniLM-L6-v2, each sample (originally in the form of raw text) is stored as a vector: a sequence of 384 numbers. Storing millions of such sequences quickly becomes prohibitive, and we can use quantization to shrink the original vectors by a substantial margin.
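As a sketch of the memory involved (assuming the sentence-transformers library, which hosts all-MiniLM-L6-v2):

from sentence_transformers import SentenceTransformer

# Each encoded sample becomes a float32 vector of 384 numbers.
model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("an example sentence")

print(vector.shape)   # (384,)
print(vector.nbytes)  # 1536 bytes per sample at 32 bits

At 1536 bytes per sample, ten million samples already occupy roughly 15 GB before any index overhead.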

Perhaps quantizing our vectors from 32-bit to 16-bit is not that big of a loss. But what about 4-bit or even binary quantization? Because our vectors are relatively large (384 numbers each), this high dimensionality lets us reach a higher level of compression without excessive retrieval loss.

4-bit quantization

The way we execute quantization is by looking at the distribution of our flattened vector data and mapping an equivalent interval onto a lower number of bits. My favorite example is 4-bit quantization. With this bit width, we can represent 2⁴ = 16 values. But, as explained, the numbers in our vectors are floating-point values with many decimal places:

array([ 2.43655406e-02, -4.33481708e-02, -1.89688837e-03, -3.76498550e-02,
-8.96364748e-02, 2.96154656e-02, -5.79943173e-02, 1.87652372e-02,
1.87771711e-02, 6.30387887e-02, -3.23972516e-02, -1.46128759e-02,
-3.39277312e-02, -7.04369228e-03, 3.87261175e-02, -5.02494797e-02,
...
-1.03239892e-02, 1.83096472e-02, -1.86534156e-03, 1.44851031e-02,
-6.21072948e-02, -4.46912572e-02, -1.57684386e-02, 8.28376040e-02,
-4.58770394e-02, 1.04658678e-01, 5.53084277e-02, -2.51113791e-02,
4.72703762e-02, -2.41811387e-03, -9.09169838e-02, 1.15215247e-02],
dtype=float32)

What we can do is map each of the numbers in the distribution onto an interval that spans [-8, 7] (16 possible values). To define the extremes of the interval, we can use the minimum and maximum values of the distribution we are quantizing.

4-bit quantization: the grey area is an interval of integers between [-8, 7]; don't confuse it with bits. Any number in this interval will later be converted into a 4-bit object, image by Author

For example, suppose the minimum/maximum of the distribution is [-0.2, 0.2]. This means that -0.2 will be converted to -8, and 0.2 to 7. Each number in the distribution will have a quantized equivalent in the interval (e.g., the first number, 0.02436554, is quantized to -1).

array([[-1, -3, -1, ...,  1, -2, -2],
[-6, -1, -2, ..., -2, -2, -3],
[ 0, -2, -4, ..., -1, 1, -2],
...,
[ 3, 0, -5, ..., -5, 7, 0],
[-4, -5, 3, ..., -2, -2, -2],
[-1, 0, -2, ..., -1, 1, -3]], dtype=int4)
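Here is a minimal sketch of this min/max mapping (the exact rounding scheme is my assumption, so individual values may differ slightly from the output above; numpy has no native int4, so int8 stands in for the 4-bit values):

import numpy as np

def quantize_4bit(vectors: np.ndarray) -> np.ndarray:
    """Map float values onto the 16 integers in [-8, 7], using the
    minimum/maximum of the flattened distribution as interval extremes."""
    lo, hi = vectors.min(), vectors.max()
    scaled = (vectors - lo) / (hi - lo)  # normalize to [0, 1]
    q = np.round(scaled * 15) - 8        # map to the 16 integers in [-8, 7]
    return q.astype(np.int8)             # int8 stands in for int4

vectors = np.random.randn(1000, 384).astype(np.float32) * 0.05
print(quantize_4bit(vectors))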

1-bit quantization

The same principle applies to binary quantization, but it is much simpler. The rule is the following: each number in the distribution that is < 0 becomes 0, and each number > 0 becomes 1.

1-bit quantization, image by Author
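A minimal sketch, with np.packbits used to actually store 8 of these bits per byte and realize the 32x saving:

import numpy as np

def quantize_1bit(vectors: np.ndarray) -> np.ndarray:
    """Each value > 0 becomes 1, everything else becomes 0;
    the bits are then packed 8 per byte for storage."""
    bits = (vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

vectors = np.random.randn(1000, 384).astype(np.float32)
print(quantize_1bit(vectors).shape)  # (1000, 48): 384 floats shrink to 48 bytes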

The principal issue with current quantization techniques is that they rest on the assumption that all our values come from a single distribution. That is why, when we use thresholds to define intervals (e.g., the minimum and maximum), we derive a single pair of thresholds from the totality of our data (which is modeled as a single distribution).
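To see why this assumption can hurt, here is a hypothetical sketch in which a single feature stretches the global interval, leaving every other feature squeezed into a few quantization buckets:

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical vectors: feature 0 spans a much wider range than the rest.
vectors = rng.normal(0.0, 0.05, size=(1000, 384)).astype(np.float32)
vectors[:, 0] *= 10

print(vectors.min(), vectors.max())              # global extremes, dominated by feature 0
print(vectors[:, 1].min(), vectors[:, 1].max())  # a typical feature occupies a fraction of that interval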
