Even if you don’t know much about the inner workings of generative AI models, you probably know they need a lot of memory. Hence, it is currently almost impossible to buy a measly stick of RAM without getting fleeced. Google Research recently revealed TurboQuant, a compression algorithm that reduces the size of large language model (LLM) key-value caches while boosting speed and maintaining accuracy.
Google likens this cache to a 'digital cheat sheet' storing important information so it doesn’t have to be recomputed. As we say all the time, LLMs don't actually know anything; they can do a good impression of knowing things through vectors that map semantic meaning. High-dimensional vectors describing complex data use up a lot of memory and inflate key-value caches.
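To get a feel for why these caches balloon, here is a back-of-the-envelope sketch of how much memory a key-value cache occupies. The model dimensions below are illustrative assumptions, not figures from Google's work:

```python
# Rough KV-cache size for a hypothetical transformer: every layer stores one
# key vector and one value vector per attention head per token in the context.
# All parameters here are illustrative, not specific to any real model.

def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2 covers keys AND values; bytes_per_value=2 assumes fp16.
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value

# e.g. a 32-layer model with 32 heads of dimension 128 and a 4096-token
# context, cached at fp16:
size = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

Gigabytes of cache for a single long conversation, on top of the model weights themselves, is why shrinking this structure matters.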
To make models smaller and more efficient, developers employ quantization techniques that run computations at lower numerical precision. The drawback is that the outputs get worse: the quality of token estimation goes down. With TurboQuant, Google’s early results show an 8x performance increase and a 6x reduction in memory usage without losing quality.
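To see what "lower precision" means in practice, here is a minimal sketch of symmetric int8 quantization, the textbook version of the idea. This is an illustrative example, not TurboQuant's actual scheme:

```python
# Symmetric int8 quantization sketch: map floats into the range [-127, 127]
# using a single scale factor, then recover approximate values on the way out.
# Illustrative only -- not Google's TurboQuant algorithm.

def quantize_int8(xs):
    scale = max(abs(v) for v in xs) / 127.0  # largest magnitude maps to 127
    q = [max(-127, min(127, round(v / scale))) for v in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

xs = [0.1, -0.5, 0.9, -1.2]
q, scale = quantize_int8(xs)
recovered = dequantize(q, scale)
# Each value now fits in 1 byte instead of 4 (int8 vs float32),
# at the cost of a small rounding error per value.
```

The memory saving is exact (4x here), but every round-trip through the low-precision representation loses a little information, which is the quality trade-off the article describes.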
Applying TurboQuant involves a two-step process built on a system called PolarQuant. Vectors are usually stored as standard Cartesian (XYZ-style) coordinates, but PolarQuant converts them into polar coordinates, reducing each to two pieces of information: radius (core data strength) and direction (the data’s meaning).
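The coordinate change itself is ordinary trigonometry. Here is a sketch for a 2-D slice of a vector; the full system works on high-dimensional vectors, so treat this as a simplified illustration of the idea rather than PolarQuant's implementation:

```python
import math

# Cartesian -> polar conversion for a 2-D point: the same change of
# representation the article describes, shown in its simplest form.

def to_polar(x, y):
    r = math.hypot(x, y)      # radius: the overall magnitude ("data strength")
    theta = math.atan2(y, x)  # angle: the direction ("the data's meaning")
    return r, theta

def to_cartesian(r, theta):
    # The inverse transform: nothing is lost in the conversion itself.
    return r * math.cos(theta), r * math.sin(theta)

r, theta = to_polar(3.0, 4.0)   # r = 5.0
x, y = to_cartesian(r, theta)   # recovers (3.0, 4.0) up to rounding
```

Separating magnitude from direction is useful for compression because the two components can then be quantized independently, each with a precision suited to how much it affects the result.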







