

Title: Google's New Paper Crashes Memory Stocks! 6x KV Cache Compression, Netizens: Silicon Valley Came True | BestBlogs.dev

URL Source: https://www.bestblogs.dev/article/50669f01

Published Time: 2026-03-26 03:03:26

Markdown Content:


Google's New Paper Crashes Memory Stocks! 6x KV Cache Compression, Netizens: Silicon Valley Came True

量子位 @梦晨

One Sentence Summary

Google Research releases the TurboQuant algorithm, achieving at least 6x lossless compression of the KV cache via techniques like polar quantization, significantly reducing memory requirements for LLM inference and boosting speed.

Summary

This article reports on a breakthrough paper from Google Research, set to be presented at ICLR 2026: the TurboQuant compression algorithm. Addressing the KV cache bottleneck in AI inference, the algorithm leverages two core innovations—PolarQuant (polar quantization) and QJL (Quantized JL Transform)—to achieve extremely high compression ratios of 3-bit or 4-bit with zero accuracy loss in tests on models like Gemma and Mistral. PolarQuant exploits the concentration of angular distribution in polar coordinates to eliminate the storage overhead of quantization constants, while QJL is used to correct residuals. Experiments show that this technology can boost computation speed by 8x on H100 GPUs. It has been hailed by the industry as Google's "DeepSeek moment," even triggering stock market volatility in the memory chip sector.

Main Points

* 1. TurboQuant achieves at least 6x lossless compression of the KV cache. By eliminating the "quantization constant" overhead required in traditional quantization methods, it maintains accuracy consistent with unquantized versions even at 3-bit quantization, solving the memory bottleneck for long-context inference.

* 2. PolarQuant (polar quantization) is the key to achieving zero additional overhead. By converting data from Euclidean space to polar coordinates (distance + angle), it was discovered that the angular distribution is highly concentrated and predictable, eliminating the need to store normalization constants and making the representation more compact.

* 3. The algorithm significantly boosts inference speed while drastically saving memory. On NVIDIA H100 GPUs, 4-bit TurboQuant calculates attention scores 8x faster than the 32-bit unquantized version, balancing both spatial and temporal efficiency.

* 4. The technology has broad applicability and requires no tuning for specific datasets. It performs excellently in vector search and various long-context benchmarks (such as "Needle In A Haystack"), and can be applied directly to existing open-source models without training or fine-tuning.

Metadata

AI Score

90

Website qbitai.com

Published At Today

Length 1335 words (about 6 min)


> 梦晨, reporting from 凹非寺
> 量子位 | WeChat official account QbitAI

How did the academic conference ICLR get tangled up with a plunge in Micron and Western Digital?

The two memory-chip giants saw their share prices slide sharply: no earnings blowup, no supply-chain break, just Google showing off a paper set to formally debut at ICLR 2026.


Google Research has introduced the TurboQuant compression algorithm, which compresses the KV cache, the most memory-hungry part of AI inference, by at least 6x with zero loss in accuracy.

The market's reading was blunt: long-context AI inference will no longer need that much memory, which is bearish for memory makers.


Netizens were quick to note: isn't this just Pied Piper from the US TV series Silicon Valley?


Pied Piper is the fictional startup in Silicon Valley, HBO's classic series that premiered in 2014; its core technology is precisely a "near-lossless extreme compression algorithm".

In 2026, a similar algorithm has actually come true in the real world.

To understand why TurboQuant matters, first understand the problem it solves.

When a large AI model runs inference, information it has already processed is kept temporarily in the KV cache so it can be reused quickly, instead of being recomputed from scratch every time.

The problem is that as context windows grow ever longer, memory consumption balloons. The KV cache is becoming one of the core bottlenecks of AI inference.
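To get a feel for the scale of the problem, here is a back-of-the-envelope calculator; the model shape below is an illustrative assumption on my part, not a figure from the paper:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    """Size of the KV cache: keys + values, for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 7B-class model: 32 layers, 8 KV heads of dim 128, fp16 values.
full = kv_cache_bytes(32, 8, 128, seq_len=128_000, bytes_per_value=2)
print(f"fp16 cache at 128k tokens: {full / 2**30:.1f} GiB")   # 15.6 GiB
print(f"after 6x compression:      {full / 6 / 2**30:.1f} GiB")  # 2.6 GiB
```

The cache grows linearly with context length, so at hundreds of thousands of tokens it can dwarf the activations themselves; that is the memory wall the article describes.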


The traditional remedy is vector quantization: squeezing high-precision data into a low-precision representation.

The awkward part is that most quantization methods themselves need to store extra "quantization constants", costing each number an additional 1 to 2 bits.
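For illustration, here is a typical per-block uniform quantizer of the kind the article alludes to; the block size and 4-bit width are my assumptions, and the stored per-block scale is exactly the kind of "quantization constant" TurboQuant claims to eliminate:

```python
import numpy as np

def quantize_block(x, bits=4):
    """Uniform per-block quantization: int codes plus one fp16 scale per block.
    The scale must be stored alongside the codes, which is the overhead."""
    scale = np.abs(x).max() / (2**(bits - 1) - 1)
    codes = np.clip(np.round(x / scale),
                    -(2**(bits - 1)), 2**(bits - 1) - 1).astype(np.int8)
    return codes, np.float16(scale)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
codes, scale = quantize_block(x, bits=4)
error = np.abs(codes.astype(np.float32) * scale - x).max()

# One 16-bit scale shared by a 64-value block adds 16/64 = 0.25 bits per value;
# common schemes use much smaller blocks, i.e. 1-2 extra bits per value.
print(f"max reconstruction error: {error:.3f}, overhead: {16/64} bits/value")
```

Shrinking the block improves accuracy but inflates the constant overhead, which is the tension the next two ideas resolve.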

TurboQuant drives this extra overhead down to zero with two changes.

PolarQuant (polar quantization):

Instead of describing data with conventional X, Y, Z coordinates, it switches to polar coordinates: "distance + angle".

The Google team found that after the conversion, the distribution of angles is highly concentrated and predictable, so there is simply no need to store extra normalization constants.

It's like compressing "walk 3 blocks east, then 4 blocks north" into "walk 5 blocks toward 37 degrees".

The information content is unchanged, the description is more compact, and the overhead of the coordinate system itself is gone.
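The blocks analogy is literally just a change of coordinates. A minimal sketch, using a compass-bearing convention so the numbers match the "37 degrees" in the example:

```python
import math

def to_polar(east, north):
    """Cartesian -> polar: (east, north) becomes (distance, bearing from north)."""
    return math.hypot(east, north), math.degrees(math.atan2(east, north))

def to_cartesian(dist, bearing_deg):
    """Inverse transform: the round trip loses no information."""
    b = math.radians(bearing_deg)
    return dist * math.sin(b), dist * math.cos(b)

dist, bearing = to_polar(3, 4)   # "3 blocks east, 4 blocks north"
print(f"{dist:.0f} blocks toward {bearing:.0f} degrees")  # 5 blocks toward 37 degrees
east, north = to_cartesian(dist, bearing)                 # round-trips to (3, 4)
```

The payoff in quantization terms: if the angles of real vectors cluster tightly, they can be coded with very few bits and no stored normalization constant.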

QJL (Quantized JL Transform):

It projects high-dimensional data and compresses the result into +1 or -1 sign bits, requiring no extra memory at all. TurboQuant uses it to wipe out the tiny residual error left after PolarQuant compression.
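The paper's exact QJL construction is not reproduced in this article, but the underlying idea, a random projection whose outputs are kept only as sign bits, can be sketched as a SimHash-style angle estimator; the dimensions, seed, and test vectors below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 1024                  # original dim, number of 1-bit projections
P = rng.standard_normal((m, d))   # random JL projection matrix

def sign_sketch(v):
    """Project, then keep only signs: m bits instead of d floats."""
    return np.sign(P @ v)

def est_angle(sa, sb):
    """For Gaussian projections, P(sign mismatch) = angle / pi,
    so the mismatch rate recovers the angle between the vectors."""
    return np.pi * np.mean(sa != sb)

a = rng.standard_normal(d)
b = a + 0.3 * rng.standard_normal(d)   # a nearby vector
true = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"true angle {true:.3f} rad, "
      f"sketch estimate {est_angle(sign_sketch(a), sign_sketch(b)):.3f} rad")
```

Because the sketch preserves angles approximately, it is well suited to correcting the small angular residual left over after a coarser first-stage quantizer.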


Combined, PolarQuant first spends most of the bit budget capturing the data's main information, and QJL then uses one more bit for residual correction.

The end result is 3-bit quantization with zero accuracy loss, and no training or fine-tuning required.

The Google team ran mainstream long-context benchmarks on open-source models such as Gemma and Mistral, covering question answering, code generation, summarization, and other tasks.

On the "Needle in a Haystack" task, TurboQuant earned a perfect score on every test while shrinking KV cache memory usage by at least 6x.

Used alone, PolarQuant is also nearly lossless.


The speed gains are just as striking. On NVIDIA H100 GPUs, 4-bit TurboQuant computes attention scores 8x faster than the 32-bit unquantized version. It doesn't just save memory; it is also faster.

In vector search, TurboQuant likewise beats the recall of the best existing quantization methods, with no per-dataset tuning and no reliance on inefficient large codebooks.


Cloudflare's CEO commented: "This is Google's DeepSeek moment."

In his view, DeepSeek proved that top-tier models can be trained with far fewer resources.

TurboQuant points the same way: with less memory, you can run inference of the same quality.


Google says that besides large models such as Gemini, TurboQuant can also greatly improve the efficiency of semantic search, making queries over Google-scale trillion-vector indexes faster and cheaper.

That said, TurboQuant is still only a lab result for now and has not been deployed at scale.

More importantly, it addresses the memory problem only at the inference stage; the AI training side is entirely unaffected.

Paper:

https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

Reference links:

[1]https://x.com/eastdakota/status/2036827179150168182?s=20


Key Quotes

* The TurboQuant compression algorithm compresses the KV cache, the most memory-intensive part of AI inference, by at least 6x with zero loss in accuracy.

* It's like compressing 'walk 3 blocks east, 4 blocks north' into 'walk 5 blocks at a 37-degree angle.' The information remains the same, the description is more compact, and it saves the overhead of the coordinate system itself.

* On NVIDIA H100 GPUs, 4-bit TurboQuant calculates attention scores 8x faster than the 32-bit unquantized version. It's not just memory-saving; it's also faster.

* Cloudflare CEO commented, 'This is Google's DeepSeek moment.'


Tags

TurboQuant

KV cache

Model Quantization

Google Research

LLM Inference


View original → Published: 2026-03-26 11:03:26 Indexed: 2026-03-26 14:01:02
