Kimi's New Architecture Impresses Elon Musk! 17-Year-Old High Schooler Rises to Fame as First Author ============================
量子位 (QbitAI) @克雷西
One Sentence Summary
The Kimi team has launched the Attention Residuals architecture, which introduces attention mechanisms into deep residual connections to achieve 'time-depth duality,' significantly boosting large model training efficiency while addressing information dilution in deep networks.
Summary
This article reports on a major breakthrough in model architecture by the Moonshot AI (Kimi) team: the Attention Residuals technology. This research addresses the 'dilution problem' in mainstream Transformer architectures' PreNorm residual connections (where equal weighted accumulation of all layer contributions makes early information difficult to retrieve) and the instability issues in deep network training. It proposes applying attention mechanisms from the time axis 'rotated 90 degrees' to the depth axis. By introducing learnable pseudo-query vectors, the current layer can selectively focus on and extract information from previous layers. To solve the computational complexity problem, the team developed the Block AttnRes block compression scheme, reducing complexity from O(L²) to O(L·B). Experiments show that this architecture achieved a 25% improvement in training efficiency on the Kimi Linear 48B model, with an inference latency increase of less than 2%. Additionally, the article highlights the legendary journey of the paper's 17-year-old co-first author, Chen Guangyu (Nathan Chen), from a hackathon enthusiast to a core AI researcher.
Main Points
* 1. Proposes the Attention Residuals mechanism to solve the information dilution and training instability problems of traditional residual connections.
Under the traditional PreNorm paradigm, residual contributions are accumulated with equal weights, leading to early information dilution in deep networks and unbounded growth of hidden state norms. The new mechanism selectively recalls information from previous layers through attention, enhancing information retrieval and stabilizing training.
* 2. Utilizes a 'time-depth duality' perspective to transfer the attention mechanism from the sequence dimension to the network depth dimension.
Researchers believe that network 'layers' are fundamentally isomorphic to RNN 'timesteps.' Since Attention replaced RNN's fixed recurrence, it should also replace fixed residual accumulation in the depth dimension, enabling intelligent aggregation of inter-layer information.
* 3. Achieves linear dimensionality reduction in inter-layer attention's computational complexity through the Block AttnRes optimization scheme.
To avoid the O(L²) computational explosion caused by full-layer attention, the team packaged consecutive layers into blocks and compressed them into summary vectors, reducing complexity to O(L·B) and ensuring engineering feasibility in large-scale models.
* 4. This architecture, as a 'drop-in replacement' solution, demonstrated extremely high efficiency in practical large-scale MoE model validation.
Tests on Kimi Linear 48B showed superior performance under the same computational budget, or a 20% reduction in computation for equivalent performance, with negligible impact on inference latency.
Metadata
AI Score
87
Website qbitai.com
Published At Today
Length 3172 words (about 13 min)
##### 克雷西, from 凹非寺
量子位 | Official account QbitAI
A 17-year-old high schooler, as co-first author on the Kimi team, has turned an idea proposed by Ilya into reality.
Ilya once made a prediction: take the LSTM network, which processes data in temporal order, and "rotate it 90 degrees", swapping the time axis for the model-depth axis, and you get today's residual networks.
The Kimi team reasoned that if LSTM along time corresponds to residuals along depth, then the attention mechanism that later superseded LSTM should be able to make the same move.
Their new Attention Residuals technique amounts to rotating the attention mechanism 90 degrees as well.
With this new method, a model computing the current layer can intelligently "look back," freely deciding which earlier layer's information to retrieve as needed.
The paper even drew Elon Musk's attention; he called it impressive.
Beyond Musk, the paper also got Karpathy thinking. He remarked bluntly that our understanding of Attention Is All You Need, the seminal Transformer paper, is still incomplete.
Validated on Kimi's own Kimi Linear 48B model (3B activated parameters), the new mechanism improves training efficiency by 25% while increasing inference latency by less than 2%.
The "Memory Burden" of Residual Connections -----------
First, a refresher on how residual connections work.
The traditional formulation: layer N's output = layer N's computation + layer N-1's output. Accumulating this way up the stack, every layer "remembers" the information from all preceding layers.
Here is the problem: under PreNorm, the dominant paradigm for large models, every layer's contribution to the residual stream is accumulated with equal weight.
It is like a person with too good a memory, storing every experience in the brain with identical weight. Contributions get progressively diluted, early information becomes hard to retrieve, and many layers can be pruned with only minor loss. This is known as the "PreNorm dilution problem."
Worse, the norm of the hidden state keeps growing with depth. The researchers found that in deep networks this unbounded growth destabilizes training.
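The dilution effect can be seen in a minimal numpy sketch (illustrative dimensions and random stand-ins for layer outputs, not the paper's setup): with plain equal-weight accumulation, the residual stream's norm grows with depth, while any single layer's relative share of the stream shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 100  # hidden width and depth, illustrative values only

h = rng.standard_normal(d)           # initial hidden state
layer0_out = rng.standard_normal(d)  # stand-in for the first layer's output F_0(h)
h = h + layer0_out                   # PreNorm-style residual: plain addition
for _ in range(1, L):
    h = h + rng.standard_normal(d)   # every later layer added with equal weight

# For roughly independent contributions, ||h|| grows on the order of sqrt(L),
# so layer 0's relative share of the stream keeps shrinking with depth.
print(float(np.linalg.norm(h)))
print(float(np.linalg.norm(layer0_out) / np.linalg.norm(h)))
```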
The Moonshot AI team took a different tack: since the problem is indiscriminate accumulation, let the network decide for itself what to recall.
#### "Selective Recall" with Attention
The team observed an interesting duality: a network's depth dimension and a sequence's time dimension are essentially isomorphic.
When a Transformer processes a sequence, attention lets the current position "selectively attend" to earlier positions. So along the depth dimension, why shouldn't the current layer "selectively attend" to earlier layers?
That is where Attention Residuals comes from:
* The current layer's learnable pseudo-query vector serves as the query
* The outputs of all previous layers serve as keys and values
* An attention mechanism performs the weighted aggregation
This way, the network learns which layers' information matters most for the current computation and attends to them more; irrelevant layers naturally receive lower weights.
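The mechanism just described can be sketched in a few lines of numpy. This is a minimal illustration assuming a single pseudo-query per layer and no key/value projections; the paper's exact parameterization is not given here, so treat those choices as assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_residual(prev_outputs, pseudo_query):
    """Replace the plain residual sum with attention over earlier layers.

    prev_outputs: (num_prev_layers, d) outputs of all earlier layers (keys and values)
    pseudo_query: (d,) learnable query vector owned by the current layer
    """
    d = pseudo_query.shape[0]
    scores = prev_outputs @ pseudo_query / np.sqrt(d)  # one score per earlier layer
    weights = softmax(scores)                          # learned, not equal, weights
    return weights @ prev_outputs, weights             # weighted recall of the past

rng = np.random.default_rng(0)
prev = rng.standard_normal((10, 32))  # outputs of 10 earlier layers, hidden width 32
q = rng.standard_normal(32)           # the current layer's pseudo-query
recalled, w = attention_residual(prev, q)
print(recalled.shape)
```

In a trained network, `pseudo_query` would be a learned parameter, so gradient descent decides which earlier layers each layer recalls.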
But this creates a new problem: the computation explodes.
In a 100-layer network where every layer runs a full attention residual over the preceding 99 layers, the complexity is O(L²), which is simply infeasible.
#### Block AttnRes: Block-wise Compression
The paper's solution is Block AttnRes.
The core idea is to pack consecutive layers into a block, compress the outputs within the block, and keep only a single "summary vector."
The procedure:
* Split the L-layer network into B blocks, each containing several layers
* At the end of each block, compress the block's information into a single vector
* Subsequent layers' attention then only needs to cover the block-level summaries plus the live layer outputs within the current block, rather than all L layers
This reduces the attention complexity from O(L²) to O(L·B), and in practice B can be kept small (the paper uses 8-16).
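The savings can be made concrete with a simple key-count tally. Note the exact bookkeeping below (one summary vector per completed block plus the live layers inside the current block) is my assumption about how Block AttnRes counts its keys, based on the description above.

```python
def full_attnres_cost(L):
    """Total key count when every layer attends to all previous layers: O(L^2)."""
    return sum(i for i in range(L))

def block_attnres_cost(L, num_blocks):
    """Total key count with one summary vector per completed block: O(L*B)."""
    block_size = L // num_blocks
    total = 0
    for i in range(L):
        finished_blocks = i // block_size  # summary vectors of completed blocks
        in_block = i % block_size          # live layer outputs in the current block
        total += finished_blocks + in_block
    return total

L = 96
print(full_attnres_cost(L))      # quadratic in depth: 4560
print(block_attnres_cost(L, 8))  # far smaller with B = 8 blocks: 864
```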
The team also implemented several engineering optimizations: cached pipeline communication, sequence-sharded prefill, KV-cache granularity tuning, and more.
#### Validation on Kimi Linear: a 1.25x Efficiency Gain
The theory holds up, but what really convinces is large-scale validation.
The team ran tests on its own Kimi Linear architecture, a large model using linear attention with 48B total parameters and 3B activated (MoE).
Under the same compute budget, Attention Residuals delivers better downstream performance. Put the other way around, reaching the same performance requires about 20% less training compute, i.e. a 1/(1 - 0.2) = 1.25x efficiency advantage.
On specific tasks, math reasoning (MATH, GSM8K) and code generation (HumanEval, MBPP) were on par or slightly better, and consistency on multilingual understanding also improved.
More importantly, Attention Residuals is a drop-in replacement: no other part of the network needs modification; simply swap out the residual connections.
The paper also offers an interesting perspective.
The team frames this work as an application of "time-depth duality."
In their view, the "layers" of a deep neural network and the "timesteps" of a recurrent neural network are, at bottom, both iterative processing of information.
Transformers succeeded because attention replaced the fixed recurrence of RNNs.
So along the depth dimension, shouldn't attention likewise replace the fixed residual?
More remarkably, one of the co-first authors of this paper that stunned Musk, Karpathy, and others is a 17-year-old high school student: 陈广宇 (Nathan Chen).
The other two co-first authors are 苏剑林, a key figure at Kimi and the creator of RoPE (Rotary Position Embedding), and 张宇, first author of Kimi Linear.
Attention Residuals is, of course, a team effort, but a high schooler appearing on such a team and sharing first authorship with two renowned researchers is striking enough.
a16z founder Marc Andreessen and a co-founder of Thinking Machines, among others, have followed his X account.
陈广宇, who only began learning about large models a year ago, started out at a hackathon in Beijing and made his way to Silicon Valley.
When he later returned to China, he chose to join Kimi.
董科含, a founding member of 奇绩创坛 (MiraclePlus, formerly YC China), which was involved in investing in Moonshot AI, has also published an autobiographical piece by 陈广宇 on his personal WeChat account.
Last February, at a hackathon for secondary school students in Beijing, 陈广宇 presented ThirdArm, an inventive concept for a "third mechanical assistive arm" for humans.
That project is also how he met hackathon judge 董科含, who went on to become his startup mentor.
At the time, 董科含 pressed him on whether he would commit to this technology long term, prompting him to rethink his career direction.
He was then selected for a highly exclusive youth program founded by 董科含, where he began meeting IOI (International Olympiad in Informatics) gold medalists and senior researchers.
He had previously tried running a Shopify cross-border e-commerce store and a short-video account, but on 董科含's advice he decided to pivot toward understanding the era's foundational technology.
Not even knowing what a Transformer was at the time, he built up his understanding under the guidance of DeepSeek researcher 袁境阳, using Gemini as an assistant, reading classic papers, and tracking open-source GitHub projects.
Once, after sharing his reflections on a blog post on Twitter, he got a reply from its author, and the post caught the attention of the CEO of a Silicon Valley AI startup.
That company was founded in late 2024 and closed an $8M seed round in early 2025, with backers connected to OpenAI and Anthropic.
After passing a timed, overnight experimental test, he received their offer.
Over the summer he went to San Francisco for a seven-week internship. In the first two weeks, he was responsible for defining and driving an exploratory project involving 144 H100 GPUs.
Under the CEO's direct mentorship, his work extended into operations: building the hiring system, producing technical content, and joining fundraising strategy discussions, including the chance to speak with early investor Vinod Khosla.
In Silicon Valley he kept up an intense work pace, building connections with NVIDIA engineers and startup founders over coffee chats. The experience led him to see research as a foundational capability in service of creating things.
After the internship, 陈广宇 returned to China and joined Moonshot AI last November.
What drew him in was exactly the line of efficient-attention work Kimi had long been doing, such as Flash Linear Attention.
In fact, it was the FLA project on GitHub that sparked his interest in machine learning and led to his invitation to join the Kimi team.
Following that thread, he kept digging deeper: from reading papers and studying implementations to working on Triton kernels and understanding why attention can be rewritten and accelerated this way.
At Moonshot AI, the path has come full circle:
he was first drawn in by foundational technology, and what he ended up working on is exactly its lowest-level, most central part.
Rather than a story of a "young prodigy on an unbroken winning streak," 陈广宇's experience reads more like a different kind of growth path:
first being struck by the era's most cutting-edge technology, then step by step grinding interest into ability, and bringing that ability into the heart of real large-model R&D.
Paper:
https://github.com/MoonshotAI/Attention-Residuals/
References:
[1]https://mp.weixin.qq.com/s/gRR99pEDWb5qsk2a2hwe2w
[2]https://nathanchen.me/public/About%20me.html
Key Quotes
* If LSTM in the temporal dimension corresponds to residuals in the depth dimension, then the 'attention mechanism' that later superseded LSTM can naturally do the same.
* The network can learn which layers' information is most important for the current computation and focus more on it; for irrelevant layers, the weights naturally decrease.
* Attention Residuals is a drop-in replacement; it does not require modifying other parts of the network, simply replacing the residual connections.
* The 'layers' of deep neural networks and the 'timesteps' of recurrent neural networks are essentially both iterative processing of information.
* Compared to a story of a 'young prodigy with an easy path to success,' Chen Guangyu's experience is more like another growth trajectory: first being struck by the era's cutting-edge technology, then gradually honing interest into ability.
Tags
Kimi
Attention Residuals
Model Architecture
Transformer
Residual Networks