From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for LLM KV cache compression. 5x compression at 3-bit with 99.5% attention fidelity. Language: Python · Stars: 14 · Forks: 3
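The repo's 5x headline tracks simple bit-width arithmetic, assuming an fp16 baseline (per-vector metadata such as stored norms would shave the ratio slightly):

```python
# Back-of-envelope for the 5x figure: KV cache entries stored at
# 3 bits per value instead of fp16's 16 bits per value.
bits_fp16, bits_quant = 16, 3
ratio = bits_fp16 / bits_quant
print(f"{ratio:.1f}x")  # → 5.3x, before accounting for any per-vector metadata
```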
Google Research released TurboQuant, a software-only algorithm suite that compresses LLM key-value (KV) caches, cutting VRAM use by about 6x on average and speeding up attention-logit computation by up to 8x, potentially halving inference costs for enterprises. TurboQuant combines PolarQuant — which maps vectors into polar coordinates after random rotation to remove per-block normalization metadata — with a 1-bit Quantized Johnson-Lindenstrauss (QJL) stage that corrects residual error, enabling training-free compression without notable accuracy loss. Google published the methods and papers openly and will present them at ICLR 2026 and AISTATS 2026, positioning TurboQuant as foundational plumbing for long-context, agentic AI while stirring market reaction among memory suppliers. This advances practical LLM efficiency on existing hardware.
Google Research unveiled TurboQuant, an AI-compression approach that can cut large language model key-value cache memory by up to 6x and speed up attention computation (reported gains of up to 8x) while maintaining accuracy. TurboQuant uses a two-step process: PolarQuant, which converts high-dimensional Cartesian vectors into a compact polar representation (radius + direction) to reduce storage and computation, and Quantized Johnson-Lindenstrauss (QJL), a 1-bit error-correction layer that maps residuals to ±1 values, preserving relational information and improving attention-score accuracy. Google tested TurboQuant on long-context benchmarks using models such as Gemma and Mistral, claiming substantial reductions in memory footprint and faster attention-logit computation without quality loss, benefiting LLM deployment, inference costs, and on-device/edge applications.
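The radius-plus-direction idea described above can be illustrated with a toy NumPy sketch: randomly rotate a vector, split it into 2-D blocks, and keep each block as a radius plus a coarsely quantized angle. The dimensions, bit width, and full-precision radius here are illustrative simplifications, not Google's implementation (the papers quantize the radius as well):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
x = rng.standard_normal(d)  # a stand-in for one KV-cache vector

# Random rotation spreads energy evenly across coordinates
# (QR of a Gaussian matrix yields a random orthogonal matrix).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
xr = Q @ x

# Split the rotated vector into 2-D blocks and store each block in polar
# form: a radius plus a uniformly quantized angle (5 bits, illustrative).
bits = 5
levels = 2 ** bits
blocks = xr.reshape(-1, 2)
r = np.linalg.norm(blocks, axis=1)                     # kept full-precision here
theta = np.arctan2(blocks[:, 1], blocks[:, 0])
code = np.round((theta + np.pi) / (2 * np.pi) * levels) % levels  # angle codes

# Decode: rebuild each block from (radius, quantized angle), undo the rotation.
theta_hat = code / levels * 2 * np.pi - np.pi
x_hat = Q.T @ np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1).ravel()

rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(rel_err)  # small relative error despite 5-bit angles
```

With 5-bit angles the worst-case angular error is half a quantization step (π/32 ≈ 0.098 rad), which bounds the relative reconstruction error per block; the rotation ensures no single coordinate dominates.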
Google Research unveiled TurboQuant, a set of quantization algorithms (to appear at ICLR 2026 and AISTATS 2026) that dramatically compress high-dimensional vectors for large language models and vector search without degrading accuracy. TurboQuant combines PolarQuant — which random-rotates vectors and applies high-quality per-block quantization — with Quantized Johnson-Lindenstrauss (QJL), a 1-bit residual correction that removes bias from the first stage. Together they cut key-value cache and vector-search memory use while preserving similarity and attention scores, addressing memory bottlenecks in retrieval, caching, and large-scale search applications. The techniques promise lower costs and faster similarity lookups for AI systems that rely on dense vector representations.
Google Research unveiled TurboQuant, a new quantization suite (to be presented at ICLR 2026) that dramatically compresses high-dimensional vectors for large language models and vector search without accuracy loss. TurboQuant combines PolarQuant — which random-rotates vectors then applies high-quality per-block quantization — with Quantized Johnson-Lindenstrauss (QJL), a 1-bit residual correction that removes bias from the primary quantization. The approach minimizes the usual memory overhead of block-specific constants and targets key-value cache and vector search bottlenecks, promising lower memory costs and faster similarity lookups in production. If validated at scale, TurboQuant could reduce inference footprint for retrieval, caching, and similarity workloads across AI systems.
Google Research unveiled TurboQuant, a new quantization suite (to be presented at ICLR 2026) that compresses high-dimensional vectors used by LLMs and vector search with minimal memory overhead and no measured accuracy loss. TurboQuant combines PolarQuant—a random-rotation plus blockwise quantizer that captures the main signal—and Quantized Johnson-Lindenstrauss (QJL), a 1-bit residual stage that removes bias and preserves pairwise distances. The approach targets key-value cache and vector search bottlenecks, lowering memory and latency for similarity lookups without storing per-block, full-precision constants. Early tests reported strong compression ratios while maintaining attention-score fidelity, promising cost and performance improvements for large-scale AI serving and retrieval systems.
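The 1-bit residual stage these summaries describe admits a compact demonstration: project with a random Gaussian matrix, keep only the sign bits of the projected key (plus one stored norm), and estimate inner products against a full-precision query via the identity E[sign(s·k)(s·q)] = sqrt(2/π)·⟨q, k/‖k‖⟩ for Gaussian s. This is a generic 1-bit Johnson-Lindenstrauss sketch under assumed dimensions, not the TurboQuant code itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096  # original dimension, projection dimension (large m for a clear demo)

# Random Gaussian projection (the "JL" part).
S = rng.standard_normal((m, d))

k = rng.standard_normal(d)  # a key vector to compress
q = rng.standard_normal(d)  # an uncompressed query

# Quantize the projected key to 1 bit per coordinate (the "Q" part):
# only m sign bits plus one scalar norm are stored per key.
k_bits = np.sign(S @ k)
k_norm = np.linalg.norm(k)

# Estimator: for Gaussian s, E[sign(s.k)(s.q)] = sqrt(2/pi) * <q, k/||k||>,
# so rescaling by sqrt(pi/2) * ||k|| / m recovers the inner product.
est = np.sqrt(np.pi / 2) / m * k_norm * (k_bits @ (S @ q))
print(est, q @ k)  # estimate vs. true inner product
```

The estimator is unbiased, and its variance shrinks as 1/m, which is why a pure sign sketch needs many projections; in TurboQuant's design the 1-bit stage only has to correct the residual left by the primary quantizer, so far less precision is needed.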