Qdrant TurboQuant Explained: Is TurboQuant the Silver Bullet?

Editor


Vector quantization is usually framed as a tradeoff between memory and recall. The baseline is Float32, with high fidelity and high memory cost. The basic solution is scalar quantization, which reduces each value to fewer bits (around 4× compression) with a slight recall loss. Binary quantization pushes much harder, often reaching 32× compression, but retrieval results can become inconsistent because of the information loss. Product quantization can be more efficient, but it is harder to tune and operate in real production.

In early May of 2026, Qdrant released TurboQuant, a new quantization method, and claimed that “TurboQuant can reduce memory use without making retrieval quality too unstable”. TurboQuant sounds like the kind of feature vector search teams want.

However, I wondered whether TurboQuant still holds up when we test it across different dataset sizes. Does it give a real improvement over common quantization methods, or does its advantage depend on the data?

I ran experiments to compare it with more familiar quantization methods such as scalar and binary quantization. The goal was to understand where TurboQuant is useful, where it is risky, and whether it can be treated as a serious default option for vector search.

I believe that this will help engineers, ML practitioners, and vector database users understand where TurboQuant fits compared with more common quantization methods, especially when moving from experiments to production.

1. What is Quantization?

Every float32 number in a vector uses 4 bytes. As a result, a 1536-dimension embedding takes 6 KB per vector; at a million vectors, that is about 6 GB for the raw vectors alone.
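The arithmetic behind these numbers is easy to check. The short sketch below only multiplies dimensions by bytes; the function name is mine and not part of any library.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(dim: int, n_vectors: int = 1) -> int:
    """Bytes needed to store n_vectors float32 vectors with `dim` dimensions."""
    return dim * BYTES_PER_FLOAT32 * n_vectors


per_vector = raw_vector_bytes(1536)
per_million = raw_vector_bytes(1536, 1_000_000)

print(f"{per_vector / 1024:.1f} KiB per vector")          # 6.0 KiB per vector
print(f"{per_million / 1e9:.2f} GB per million vectors")  # 6.14 GB per million vectors
```

This counts the vectors only. The HNSW graph and payloads add to it.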

This is where quantization comes in. Quantization stores each number in a vector using fewer bits. The standard approach is scalar quantization. It starts by finding the min and max of each dimension. Then, that range is divided into 255 equal bins. Every value in the vector is rounded to the nearest bin, and the bin number is stored as a single byte instead of four.

The original Float32 embedding now becomes a uint8 embedding at 4x compression, meaning 4 times smaller in storage size.

Figure 1 below is a simple demonstration of this process on a 6D vector.

Figure 1: Scalar quantization process and comparison. The tiny error (quantization error) accumulates across all dimensions during dot product computation. Image by author.

The tiny error in the last row is called quantization error, and it accumulates across 6 dimensions of the vector during dot product computation. This is what makes similarity scores slightly wrong. 
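To make the process concrete, here is a minimal sketch of the scalar quantization described above. It is an illustration under my own assumptions (a made-up 6D toy dataset, plain min/max ranges), not Qdrant’s implementation.

```python
def fit_ranges(vectors):
    """Per-dimension (min, max) computed over the whole dataset."""
    dims = range(len(vectors[0]))
    return [(min(v[d] for v in vectors), max(v[d] for v in vectors)) for d in dims]


def quantize(vec, ranges, bins=255):
    """Map each float to the index of its nearest bin boundary (0..bins)."""
    codes = []
    for x, (lo, hi) in zip(vec, ranges):
        step = (hi - lo) / bins or 1.0  # guard against a constant dimension
        codes.append(max(0, min(bins, round((x - lo) / step))))
    return codes


def dequantize(codes, ranges, bins=255):
    """Recover approximate floats from the stored one-byte codes."""
    return [lo + c * (hi - lo) / bins for c, (lo, hi) in zip(codes, ranges)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 6D toy dataset
dataset = [
    [0.12, -0.80, 0.33, 0.05, -0.41, 0.90],
    [-0.55, 0.24, 0.71, -0.09, 0.38, -0.62],
    [0.47, 0.66, -0.28, 0.19, -0.73, 0.10],
]
ranges = fit_ranges(dataset)
codes = [quantize(v, ranges) for v in dataset]
restored = [dequantize(c, ranges) for c in codes]

exact = dot(dataset[0], dataset[1])
approx = dot(restored[0], restored[1])
print(codes[0])
print(f"dot product error: {abs(exact - approx):.6f}")
```

Each coordinate is off by at most half a bin, and those small errors add up inside the dot product.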

There are also more aggressive compression levels, such as 8x (4-bit), 16x (2-bit), or 32x (1-bit). The higher the compression, the smaller the vector and the larger the error relative to the original. You can see it in Figure 2 below, which demonstrates the error after transforming a Float32 number to different quantization spaces.

Figure 2: Different compression methods vs. original. Image by author.

The tradeoff between compression and recall (or memory and recall) is obvious. More compression results in lower recall.
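A quick way to see this trend is to quantize the same random values at several bit depths and compare the average error. The sketch below uses a plain uniform grid on [-1, 1], which is my simplification and not how any specific Qdrant quantizer places its levels.

```python
import random


def quantize_uniform(x, bits, lo=-1.0, hi=1.0):
    """Snap x to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    intervals = 2 ** bits - 1
    step = (hi - lo) / intervals
    code = max(0, min(intervals, round((x - lo) / step)))
    return lo + code * step


rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]

mean_abs_error = {}
for bits in (8, 4, 2, 1):
    errors = [abs(x - quantize_uniform(x, bits)) for x in values]
    mean_abs_error[bits] = sum(errors) / len(errors)
    print(f"{bits}-bit ({32 // bits}x compression): mean abs error = {mean_abs_error[bits]:.4f}")
```

The error grows every time a bit is removed, which is the per-value view of the recall loss discussed here.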


2. The Real Question is Not Compression Ratio

The real question is: what vector geometry remains after compression?

Traditional quantizers, in most cases, directly compress the vector. Scalar quantization applies the same fixed grid to every dimension, whether that dimension contains a useful signal or noise. Binary quantization keeps only the sign bit. Therefore, neither method first checks whether some dimensions carry more signal than others.

Qdrant 1.18 changes this pattern by integrating TurboQuant. Based on a Google Research algorithm presented at ICLR 2026, TurboQuant rotates the vector before compression. This random rotation spreads variance more evenly across dimensions, so each bit can preserve more useful information.

TurboQuant is not better because it uses fewer bits. It is better because it makes the vector easier to compress before spending those bits.

The key differences between TurboQuant and others are shown in Figure 3 below.

  • Scalar Quant forces one grid on all dimensions, like the same pair of shoes for everyone, regardless of their foot length. 
  • Binary Quant transforms values to 0 or 1 with the rules: Values ≥ 0 become 1; Values < 0 become 0. This is like cutting every shoe to only one choice: left or right, big or small, yes or no. It is extremely cheap, but it throws away almost all shape information, so the “fit” becomes very crude.
  • Product Quant learns per-subspace codebooks, like making a custom pair of shoes for each foot. It’s a great fit for everyone, but extremely costly.

TurboQuant makes all dimensions look alike first, then uses one well-designed codebook. This is the same as changing all the feet to the same size and having one pair of shoes for all.

Figure 3: Comparison of four quantization types — Scalar, Binary, Product, and TurboQuant. Image by author with help of ChatGPT.

3. TurboQuant in Short: Rotate First, Compress Second

Every vector in an embedding model has structure. 

A 1536-dimensional embedding might carry most of its useful signal in only a small subset of coordinates. The remaining dimensions often contribute much less, but they still appear in every vector, which adds noise and makes distance comparisons less reliable.

3.1 The TurboQuant Pipeline

The idea is simple. Before compressing, spin the vector through a random orthogonal rotation. That rotation does not change distances; it just redistributes energy so every dimension carries roughly the same amount of information. Then, a single precomputed codebook is applied to the rotated vectors, and it can handle all dimensions equally well. No per-dimension tuning needed. No training on your data.
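The claim that a rotation leaves distances untouched can be checked numerically. The sketch below builds a random orthogonal matrix with Gram-Schmidt on Gaussian rows, which is one generic construction I picked for illustration. The two vectors are made up, and this only checks the invariance, not how evenly the energy is spread.

```python
import math
import random


def random_rotation(dim, rng):
    """Random orthogonal matrix: Gram-Schmidt on Gaussian random rows."""
    rows = []
    for _ in range(dim):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:  # remove the components along earlier rows
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(42)
a = [3.0, -2.5, 0.1, 0.05, -0.02, 0.01, 0.03, -0.04]  # hypothetical "spiky" vectors
b = [2.0, 1.5, -0.2, 0.01, 0.04, -0.03, 0.02, 0.05]
R = random_rotation(len(a), rng)
ra, rb = rotate(R, a), rotate(R, b)

print(f"dot: {dot(a, b):.6f} -> {dot(ra, rb):.6f}")
print(f"L2 : {l2(a, b):.6f} -> {l2(ra, rb):.6f}")
```

Both numbers should match before and after rotation up to floating-point rounding, as long as the stored vector and the query go through the same matrix.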

Check Figure 4 below for a summary of the process.

Figure 4: TurboQuant’s pipeline — rotation makes the coordinates predictable before any bits are spent. Image by author with help of ChatGPT.

3.2 What Does Rotation Do to the Coordinates?

Figure 5: Before and after rotation with TurboQuant — energy is redistributed evenly across dimensions, distances unchanged. Image by author.

In Figure 5, before rotation, a few dimensions carry most of the energy. The rest carry much less signal and often more noise.

After rotation, every dimension carries roughly equal energy and an equal amount of information.

However, does redistributing energy this way really preserve the important information and keep the similarity to other vectors the same as with the original?

I ran a simple computation on two 4D vectors. Vector A was transformed using TurboQuant. At inference time, Vector B was rotated with the same matrix, and the cosine similarity was measured in that rotated space. The result was then compared with the cosine similarity between the original Vector A and the original Vector B.

3.3 Standard TurboQuant process

Figure 6: TurboQuant visualization. Image by author

In Figure 6, after applying TurboQuant to the original Vector A, the cosine similarity between the new Vector A and Vector B barely changes compared with the original pair. This suggests that the important geometry between vectors is preserved, so recall stays high.

3.4 How exactly does Qdrant apply TurboQuant in the Database?

Qdrant handles this in two separate processes:

3.4.1. Indexing process:

Figure 7: How to index a vector using TurboQuant on Qdrant. Image by author with help of ChatGPT.

The overview of Indexing Flow is visualized in Figure 7. Basically, the vector is processed as follows:

original vector → normalize/prepare depending on metric → pad if needed → Hadamard rotation → optional per-coordinate calibration: x → (x + shift) · scale → Lloyd-Max centroid assignment → packed TurboQuant codes
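The flow above can be sketched in a few lines of pure Python. Everything specific here is my assumption for illustration: a 2-bit setting, the textbook Lloyd-Max levels for a standard normal, a calibration of shift = 0 and scale = sqrt(n), random sign flips before the Hadamard transform, and a simple four-codes-per-byte packing. It is not Qdrant’s actual code or storage layout.

```python
import math
import random

# Lloyd-Max levels for a standard normal at 2 bits (textbook values)
CENTROIDS_2BIT = [-1.510, -0.4528, 0.4528, 1.510]


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out, n, h = list(vec), len(vec), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                out[j], out[j + h] = out[j] + out[j + h], out[j] - out[j + h]
        h *= 2
    return [x / math.sqrt(n) for x in out]


def encode(vec, signs):
    """normalize -> pad -> randomized Hadamard rotation -> centroid assignment."""
    n = len(signs)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    padded = [x / norm for x in vec] + [0.0] * (n - len(vec))
    rotated = fwht([s * x for s, x in zip(signs, padded)])
    scaled = [x * math.sqrt(n) for x in rotated]  # assumed calibration: shift=0, scale=sqrt(n)
    codes = [min(range(4), key=lambda k: abs(x - CENTROIDS_2BIT[k])) for x in scaled]
    return rotated, codes


def pack_2bit(codes):
    """Pack four 2-bit codes into each byte (hypothetical layout)."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for k, c in enumerate(codes[i:i + 4]):
            byte |= c << (2 * k)
        out.append(byte)
    return bytes(out)


rng = random.Random(7)
vector = [4.0, 0.3, -0.2, 0.1, 0.05, -0.1]  # hypothetical 6D input
dim = 1
while dim < len(vector):
    dim *= 2  # next power of two (8), so two zeros are padded
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]

rotated, codes = encode(vector, signs)
packed = pack_2bit(codes)
print(codes, packed.hex())
```

In this toy input, one coordinate holds almost all the energy. After the rotation no single coordinate dominates, which is what lets one shared codebook serve every dimension.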

For TurboQuant specifically, Qdrant stores the information listed in Table 1:

Table 1: What Qdrant stores for TurboQuant. Source: author

An important factor introduced by Qdrant is length renormalization, also called the scaling factor. It happens after quantization: Qdrant measures how much shorter the quantized reconstruction is than the original vector, stores that ratio as a per-vector scaling factor, and then applies it during scoring at query time.

The scaling factor = original_length / centroid_reconstruction_length

Why do we need Length Renormalization?

The observation after quantization is that the quantized vector points in the right direction but is too short.

Quantizing a vector always introduces a quantization error, and that error systematically shrinks the length of every vector. At query time, when you compute a dot product between a quantized vector and a rotated, encoded query, you are using a slightly-too-short vector, which gives a score that is consistently too low. Qdrant calls this the “recall-degrading bias”.

To fix this, Qdrant multiplies the factor back in during the scoring phase instead of modifying the stored vectors. This tactic is simple and effective.
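The shrinking effect and the correction are easy to reproduce on synthetic data. The sketch below applies the scaling factor formula from above to a made-up vector whose coordinates behave like already-rotated values. The 2-bit centroids and the way the query is generated are my assumptions, not Qdrant internals.

```python
import math
import random

CENTROIDS_2BIT = [-1.510, -0.4528, 0.4528, 1.510]  # Lloyd-Max levels for N(0, 1)


def reconstruct(vec):
    """Replace each coordinate with its nearest centroid."""
    return [min(CENTROIDS_2BIT, key=lambda c: abs(x - c)) for x in vec]


def length(vec):
    return math.sqrt(sum(x * x for x in vec))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(1)
# Stand-ins for an already-rotated database vector and a related query
stored = [rng.gauss(0.0, 1.0) for _ in range(1024)]
query = [0.8 * s + 0.6 * rng.gauss(0.0, 1.0) for s in stored]

recon = reconstruct(stored)
scale = length(stored) / length(recon)  # original_length / centroid_reconstruction_length

true_score = dot(query, stored)
raw_score = dot(query, recon)    # computed against the too-short reconstruction
fixed_score = raw_score * scale  # scaling factor applied at scoring time

print(f"length: {length(stored):.2f} -> {length(recon):.2f} (scale {scale:.4f})")
print(f"score : true {true_score:.1f}, raw {raw_score:.1f}, rescaled {fixed_score:.1f}")
```

In this toy setup the rescaled score lands closer to the true score than the raw one, and only one extra float per vector has to be stored.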

3.4.2. Query Time Process

Figure 8: How is query compared to Turbo quantized vectors on Qdrant? Image by author with help of ChatGPT.

Figure 8 shows the process of querying with the TurboQuant vector database.

The query is rotated and converted into a SIMD scoring representation, and Qdrant uses asymmetric scoring to compare that encoded query directly against the packed TurboQuant codes stored for database vectors.

After that, the score is multiplied by the stored scaling factor.


4. Which Method to Try First

Qdrant offers multiple choices for quantization, and TurboQuant also offers multiple bit-compression variants such as bits4, bits2, bits1.5, and bits1.

According to the Qdrant documentation, lower bit depths offer higher compression at the cost of accuracy.

Figure 9 offers some suggestions if you are still unsure which compression method to use.

Figure 9: Decision flowchart. Start at the top and follow your constraints. The green box is the recommended default starting point. Image by author, based on the Qdrant article at https://qdrant.tech/blog/qdrant-1.18.x/.

5. Getting Started: The First Experiment

Change only one config in the current Qdrant code to enable TurboQuant. Your existing collections remain untouched.

See the code snippet below for details.

from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

# New collection — one config change
client.create_collection(
   collection_name="my_collection",
   vectors_config=models.VectorParams(
       size=1536,
       distance=models.Distance.COSINE,
   ),
   quantization_config=models.TurboQuantization(
       turbo=models.TurboQuantQuantizationConfig(
           bits=models.TurboQuantBitSize.BITS4,
           always_ram=True,
       )
   ),
)

# Existing collection — patch without recreating vectors
client.update_collection(
   collection_name="existing_collection",
   quantization_config=models.TurboQuantization(
       turbo=models.TurboQuantQuantizationConfig(
           bits=models.TurboQuantBitSize.BITS4,
           always_ram=True,
       )
   ),
)

For more configuration options, see the Qdrant documentation for TurboQuant.


6. Benchmark: Does the theory hold?

To compare TurboQuant with Qdrant’s other quantizers on real embeddings, I ran tests at different dataset sizes (10K, 50K, and 100K vectors).

6.1 Why the DBpedia Dataset?

I chose the DBpedia embeddings dataset (License: CC-BY-SA 4.0 and GNU Free Documentation License) because it has a coordinate variance ratio of 233.5x, which makes it highly anisotropic. A few dimensions carry most of the signal; the rest carry noise. This is exactly the distribution where TurboQuant’s rotation should help most, and where scalar quantization’s fixed grid wastes the most bits.
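If you want to check how anisotropic your own embeddings are, a ratio like this is cheap to compute. The sketch below assumes the ratio means the largest per-dimension variance divided by the smallest, which is my reading of the term. The toy data is made up.

```python
from statistics import pvariance


def coordinate_variance_ratio(vectors):
    """Largest per-dimension variance divided by the smallest."""
    dims = range(len(vectors[0]))
    variances = [pvariance([v[d] for v in vectors]) for d in dims]
    return max(variances) / min(variances)


# Hypothetical 2D dataset: the first dimension swings 10x wider than the second
toy = [[1.0, 0.1], [-1.0, -0.1], [3.0, 0.3], [-3.0, -0.3]]
ratio = coordinate_variance_ratio(toy)
print(f"coordinate variance ratio: {ratio:.1f}x")
```

A ratio near 1 means the dimensions are already balanced. A large ratio means a few coordinates dominate, which is the case a rotation is meant to address.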

Please check the details of the test environment in the Appendix section, part 9.2.

6.2 Recall across scale

Details of the testing recall performance are in Figure 10.

Figure 10: Recall@10 at 50K and 100K vectors. Source: author

Four things jump out:

  • TQ recall remains largely stable as the dataset grows. Binary Quantization drops from 0.916 to 0.78 when the dataset size doubles, while the TurboQuant variants hold up much better. The rotation step helps each bit preserve more information, making TQ less sensitive to corpus growth.
  • Most TQ variants are close to Float32 and Scalar Quantization in recall. Except at the lowest bit depths (TQ 1-bit in particular), the TurboQuant results remain broadly comparable to the Float32 baseline and Scalar Quantization.
  • TQ 4-bit gives the best accuracy–compression tradeoff. It reaches recall close to Scalar Quantization while using roughly half the storage: 8× compression vs Scalar’s 4×. At 100K vectors, TQ 4-bit reaches 0.965 recall, only 1.5 points below Scalar’s 0.980. With rescoring, the gap disappears: 0.996 for TQ 4-bit vs 0.993 for Scalar.
  • Rescoring recovers much of the recall gap, even for aggressive compression (TQ 1-bit). TQ 1-bit improves significantly with rescoring. Binary Quantization with rescoring can work on smaller datasets, but its recall degrades faster as the dataset grows.
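For readers who want to reproduce this kind of measurement, here is the standard definition of Recall@k in code. The scores are made up for illustration, and this is a generic sketch rather than the exact script behind Figure 10.

```python
def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


exact_scores = [0.91, 0.15, 0.88, 0.42, 0.79, 0.05]   # hypothetical float32 scores
approx_scores = [0.90, 0.20, 0.70, 0.75, 0.80, 0.01]  # hypothetical quantized scores

exact_ids = top_k(exact_scores, k=3)    # [0, 2, 4]
approx_ids = top_k(approx_scores, k=3)  # [0, 4, 3]
print(f"recall@3 = {recall_at_k(approx_ids, exact_ids, k=3):.3f}")  # recall@3 = 0.667
```

In a real benchmark, the exact ids come from a brute-force float32 search, and the result is averaged over many queries.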

6.3 Latency Across Scale

Details of the testing latency performance are in Figure 11.

Figure 11: Median query latency at 50K and 100K vectors. Source: author

  • The latency story is clear: rescoring adds some cost, but not much. At 100K vectors, TQ 4-bit + rescore runs in 6.4 ms, faster than Float32 at 7.6 ms and close to Scalar Quantization at 6.8 ms.
  • Across TQ variants, rescoring increases latency but remains faster than the Float32 baseline.

6.4 Storage Footprint

Figure 12 below shows the testing storage size for each quantization method.

Figure 12: Storage size by method. Solid bars = quantized index in RAM. Hatched = original float32 on disk (rescore only). Source: author

  • TQ 1-bit has the same storage footprint as Binary Quantization: both use 18 MB, or around 32× compression.
  • TQ 2-bit and TQ 4-bit use more storage to preserve more information. TQ 2-bit roughly doubles the storage of TQ 1-bit, while TQ 4-bit increases it by about 4×. Even so, both are still smaller than Scalar Quantization; TQ 4-bit takes about half its size.
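These footprints follow almost directly from the bit depth. The sketch below counts only the packed codes and assumes 100K vectors with 1536 dimensions, which is my assumption about the test setup. Per-vector metadata and the HNSW graph are left out.

```python
FLOAT32_BITS = 32


def compression_ratio(bits_per_dim):
    """Compression relative to float32 storage."""
    return FLOAT32_BITS / bits_per_dim


def code_bytes(n_vectors, dim, bits_per_dim):
    """Bytes for the packed codes only (no per-vector metadata, no HNSW graph)."""
    return n_vectors * dim * bits_per_dim // 8


N, DIM = 100_000, 1536  # assumed corpus size and embedding dimension
methods = [("Float32", 32), ("Scalar (uint8)", 8), ("TQ 4-bit", 4),
           ("TQ 2-bit", 2), ("TQ 1-bit / Binary", 1)]
for name, bits in methods:
    size = code_bytes(N, DIM, bits)
    print(f"{name:18s} {size / 2**20:7.1f} MiB  {compression_ratio(bits):4.0f}x")
```

Under these assumptions the 1-bit row comes out at roughly 18 MiB, in line with the 18 MB shown in Figure 12.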

6.5 Index Building Time

Details of the testing index building time are in Figure 13.

Figure 13: Index build time includes HNSW construction, quantization, and calibration. Source: author

  • TQ is the fastest configuration at 64s for 50K vectors and 179s for 100K vectors, mostly because sign-bit extraction is cheap.
  • TQ 4-bit takes 57s / 224s, and TQ 1.5-bit takes 75s / 239s. Both are comparable to or faster than Float32 (110s / 289s). This suggests that rotation and codebook calibration add only a small indexing cost.
  • TQ 2-bit is the slowest configuration at 100K vectors (73s / 357s). This may be due to a less common bit-packing pattern or implementation-specific overhead. Even so, it still completes indexing for 100K vectors in under 6 minutes.

Indexing time is more environment-sensitive, so treat these numbers as directional rather than absolute. Results can vary depending on CPU, memory bandwidth, disk I/O, parallelism, and the overall machine load during the run.


7. What This Means in Practice

Overall, TurboQuant looks promising when we prioritize the balance of compression and stable retrieval quality. The results show that not all compressed formats behave the same as the dataset grows. Some methods lose recall quickly, while others stay much closer to the Float32 baseline.

  1. TQ 2-bit and TQ 4-bit keep recall relatively stable as the corpus grows, while Binary Quantization and TQ 1-bit drop more noticeably as the dataset gets larger. This suggests that TurboQuant’s rotation step helps preserve more useful information in each bit. As a result, the TQ 2-bit and TQ 4-bit variants are less sensitive to corpus growth.
  2. TQ 4-bit gives the best balance between recall and compression. It reaches recall close to Scalar Quantization with twice the compression (Scalar Quantization gives around 4× compression, while TQ 4-bit gives around 8×). This means TQ 4-bit needs roughly half the memory.
  3. TQ 1.5-bit with rescoring is the strongest option for extreme compression. It gives around 24× compression while keeping recall close to Float32 after rescoring. This is useful when storage is your major constraint, but the system still needs acceptable retrieval quality. Without rescoring, aggressive compression can lose too much information. With rescoring, much of that gap can be recovered.
  4. TQ with rescoring is the safer pattern when you need to balance latency and accuracy, which is in line with common practice. Rescoring does add some latency, but it is most effective at improving retrieval quality under extreme compression. This makes rescoring a reasonable tradeoff: it gives the system a way to use stronger compression without taking a large hit in retrieval quality.
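To show why rescoring helps, here is a small simulation of the pattern: search with coarse codes, keep more candidates than needed, then rerank them with the original vectors. The data is random, the 1-bit sign codes stand in for any aggressive quantizer, and the oversampling parameter is a name I chose for this sketch. It does not call Qdrant.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def search(query, vectors, codes, k, oversampling=1, rescore=False):
    """Coarse search on the codes, optionally reranked with the original vectors."""
    coarse = [dot(query, c) for c in codes]
    candidates = top_k(coarse, k * oversampling)
    if not rescore:
        return candidates[:k]
    return sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]


def recall(found, truth):
    return len(set(found) & set(truth)) / len(truth)


rng = random.Random(3)
DIM, N, K = 32, 500, 10
vectors = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
codes = [[1.0 if x >= 0 else -1.0 for x in v] for v in vectors]  # 1-bit sign codes
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(20)]

plain, rescored = [], []
for q in queries:
    truth = top_k([dot(q, v) for v in vectors], K)
    plain.append(recall(search(q, vectors, codes, K), truth))
    rescored.append(recall(search(q, vectors, codes, K, oversampling=3, rescore=True), truth))

recall_plain = sum(plain) / len(plain)
recall_rescored = sum(rescored) / len(rescored)
print(f"recall@{K}: {recall_plain:.3f} without rescoring, {recall_rescored:.3f} with rescoring")
```

Rescoring cannot lower recall in this setup, because the reranked candidates always include the coarse top-k. The cost is the extra exact dot products, which is where the added latency comes from.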

In short, TurboQuant is not only about reducing memory. TQ 4-bit is the most balanced option for general use. TQ 1.5-bit with rescoring is better when compression is the top priority. The effective pattern is to pair TurboQuant with rescoring.

Important: Do not treat these numbers as a production rule. They are a reference point for your own judgment. Measure the performance on your embeddings, your queries, your hardware, and your recall targets before migrating to production.


8. TurboQuant’s Limitations

Figure 14: Limitations of TurboQuant implementation on Qdrant. Image by author

TurboQuant improves the compression tradeoff. But it does not remove the tradeoff completely.

It is also still new. It was launched on May 11, 2026, so real production experience is still limited. The safe approach is simple: benchmark it first, then decide whether it should become your default.

I want to lay out some limitations that need to be considered. Figure 14 summarizes them.

The first limitation is maturity. Qdrant’s benchmark results look promising. But your data may behave differently. Your embedding model, query pattern, filters, and data distribution may not match the benchmark datasets. So TurboQuant should be treated as a strong option, not an automatic replacement.

TurboQuant may also be slower than Binary Quantization at the same storage size. This matters if your main goal is throughput or speed. If you care more about speed than recall, Binary Quantization is still the better choice. TurboQuant is more useful when you want better recall from a small memory budget.

There is also a calibration cost. TurboQuant needs a one-time calibration step for each segment. This usually takes seconds, not minutes. But it is still a cost. If your system creates many segments or rebuilds indexes often, this extra step should be considered.

Distance type is another limitation. TurboQuant works best with L2, dot product, and cosine similarity. Rotation preserves these distance relationships well. But it does not preserve L1 (Manhattan) distance in the same way. L1 distance can still work, but it needs full vector reconstruction for each comparison. That can make search slower. If Manhattan distance is important in your system, Scalar Quantization is the safer choice.
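A two-dimensional example is enough to see why Manhattan distance is the odd one out. The sketch below rotates two made-up points by 45 degrees and compares both distances before and after.

```python
import math


def rotate_2d(vec, angle):
    """Rotate a 2D point counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return [c * x - s * y, s * x + c * y]


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a, b = [1.0, 0.0], [0.0, 0.0]
ra, rb = rotate_2d(a, math.pi / 4), rotate_2d(b, math.pi / 4)

print(f"L2: {l2(a, b):.4f} -> {l2(ra, rb):.4f}")  # L2: 1.0000 -> 1.0000
print(f"L1: {l1(a, b):.4f} -> {l1(ra, rb):.4f}")  # L1: 1.0000 -> 1.4142
```

The L2 distance survives the rotation, while the L1 distance changes with the orientation of the axes. This is why L1 scoring cannot be done directly in the rotated space.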

As shown in the test results, TQ 1-bit is not a safe choice. TQ 1-bit gives very high compression, but recall can drop too much. The rotation step helps, but 1 bit per dimension is often too little. It cannot always preserve enough geometry at scale. Consider rescoring if TQ 1-bit does not give you the expected performance. Otherwise, TQ 1.5-bit looks like a more practical lower limit. It still gives strong compression, but it keeps recall more stable. For very aggressive compression, it is a safer choice than TQ 1-bit.

The main lesson is not “always use TurboQuant.” The main lesson is to test what matters for your own data. TurboQuant shifts the tradeoff in a better direction. It helps reduce recall loss before the bit budget is spent. But it does not make compression free. You still need to choose between memory, speed, recall, and distance behavior.

In short, TurboQuant is a strong new option. It is especially useful with rescoring and moderate bit settings. But it should not be used blindly. Benchmark it on your own embeddings first and measure it carefully before shifting into production.


9. Appendix:

Figure 15 below summarizes the four quantization options offered across popular vector databases, for your reference.

Qdrant is one of the first services to offer TurboQuant in the market.

Figure 15: Quantization support matrix across Qdrant, Pinecone, Weaviate, Milvus, and pgvector. Source: author

9.2 Test environment

  • Machine: Apple M3, 16 GB RAM, macOS 15.6.1
  • Testing database:
    • Qdrant v1.18.0, single-node Docker, no resource limits
    • HNSW with default parameters (m=16, ef_construct=100)
    • Distance: Cosine
  • Dataset:

10. Resources
