Title: MSA (Memory Sparse Attention) Paper Analysis: A New Break...

URL Source: https://www.bestblogs.dev/status/2034780748738334743

Published Time: 2026-03-19 23:55:09

Markdown Content:

MSA (Memory Sparse Attention) Paper Analysis: A New Breakthrough in LLM Long Memory

### Berryxia.AI

@berryxia

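This work is still quite meaningful:

Decoupling memory from reasoning: memory capacity is no longer limited by the context window

Practicality: 100 million tokens run on just 2x A800, which is relatively accessible

Integration: retrieval and generation are trained end to end, avoiding the multi-pipeline complexity of RAG

Stability: accuracy stays at 94%+ even at 1M tokens

This is an important step toward AGI: long-term memory is key to general intelligence!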


#### 艾略特

@elliotchen100 · 1d ago

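The paper is out. It is called MSA, Memory Sparse Attention.

In one sentence, here is what it is:

It gives large models native ultra-long memory. Not bolted-on retrieval, not brute-force window expansion: "memory" is built directly into the attention mechanism and trained end to end.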

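Why did earlier approaches fall short?

RAG is essentially an "open-book exam". The model remembers nothing itself and relies entirely on flipping through its notes on the spot. How accurate the lookup is depends on retrieval quality, and how fast it is depends on data volume. Once the information is scattered across dozens of documents and requires cross-document reasoning, RAG is lost.

Linear attention and KV caches are essentially "compressed memory". Things do get remembered, but the more you compress, the blurrier it gets, and over long inputs information is dropped.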

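MSA takes a completely different approach:

→ No compression, no bolt-ons; instead the model learns to "focus on what matters"

At its core is a scalable sparse attention architecture with linear complexity. If memory grows 10x, compute cost grows about 10x rather than blowing up (full attention, being quadratic, would cost about 100x).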
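To make "focus on what matters" concrete, here is a minimal sketch of one common form of sparse attention: route the query to a few memory blocks, then attend only inside them. The routing rule (dot product with each block's mean key), the function name `sparse_memory_attention`, and the `top_k` parameter are all assumptions for illustration; the tweet does not describe MSA's actual selection mechanism.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_memory_attention(query, memory_blocks, top_k):
    """Attend only to the top_k memory blocks most relevant to the query.

    memory_blocks: list of blocks, each a list of (key, value) vector pairs.
    Hypothetical routing rule (not taken from the MSA paper): score each
    block by the dot product of the query with the block's mean key.
    """
    # 1) Routing pass: one score per block. A real system would precompute
    #    the mean keys once instead of recomputing them per query.
    block_scores = []
    for idx, block in enumerate(memory_blocks):
        dim = len(block[0][0])
        mean_key = [sum(k[d] for k, _ in block) / len(block) for d in range(dim)]
        block_scores.append((dot(query, mean_key), idx))
    selected = sorted(idx for _, idx in sorted(block_scores, reverse=True)[:top_k])

    # 2) Ordinary softmax attention, restricted to tokens in selected blocks.
    pairs = [kv for idx in selected for kv in memory_blocks[idx]]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k, _ in pairs])
    out_dim = len(pairs[0][1])
    output = [sum(w * v[d] for w, (_, v) in zip(weights, pairs)) for d in range(out_dim)]
    return output, selected

# Toy memory: three blocks of two (key, value) pairs each.
memory = [
    [([1.0, 0.0], [1.0, 0.0]), ([0.9, 0.1], [1.0, 0.0])],
    [([0.0, 1.0], [0.0, 1.0]), ([0.1, 0.9], [0.0, 1.0])],
    [([-1.0, 0.0], [0.5, 0.5]), ([-0.9, -0.1], [0.5, 0.5])],
]
output, selected = sparse_memory_attention([0.0, 2.0], memory, top_k=1)
print(selected, output)
```

With `top_k` fixed, the expensive attention step touches a constant number of tokens per query, so total work is dominated by the per-block routing pass, which grows linearly with the amount of memory.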

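→ The model knows "where this memory came from and when"

It uses a positional encoding called document-wise RoPE, which lets the model natively understand document boundaries and temporal order.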
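One plausible reading of "document-wise RoPE" is that position indices restart at every document boundary before the usual rotary embedding is applied. The sketch below assumes exactly that; the tweet does not give the formula, so `document_wise_positions` and the reset-to-zero rule are hypothetical, while `rope_rotate` follows the standard RoPE rotation.

```python
import math

def document_wise_positions(doc_lengths):
    """Return one position id per token, restarting at 0 for every document.

    Assumption for illustration: the paper may index positions differently.
    """
    return [pos for length in doc_lengths for pos in range(length)]

def rope_rotate(vec, position, base=10000.0):
    """Apply the standard rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 rotates by angle position * base^(-i/dim).
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Two documents of 3 and 2 tokens stored back to back in memory.
positions = document_wise_positions([3, 2])
print(positions)

# RoPE scores depend only on the position offset between query and key.
q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 2), rope_rotate(k, 0))
print(abs(score_a - score_b) < 1e-9)
```

Under this assumption, every document looks to the model as if it starts at position 0, so stored documents can be encoded independently of how many others sit in memory.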

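→ Fragmented information can still be strung together for reasoning

The Memory Interleaving mechanism lets the model perform multi-hop reasoning across memory fragments scattered in different places. It does not just find one relevant record; it links the clues into a chain.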
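The idea of "linking clues into a chain" can be illustrated with a toy multi-hop lookup in which each retrieved fragment widens the next query. This is only an analogy built on word overlap; it is not the Memory Interleaving mechanism itself, which the tweet does not specify, and `multi_hop_chain` is a hypothetical helper.

```python
def tokenize(text):
    return set(text.lower().replace(".", "").split())

def multi_hop_chain(question, memory, max_hops=3):
    """Toy multi-hop lookup: each retrieved fragment extends the next query."""
    query_terms = tokenize(question)
    chain, used = [], set()
    for _ in range(max_hops):
        best, best_overlap = None, 0
        for i, fragment in enumerate(memory):
            if i in used:
                continue
            overlap = len(query_terms & tokenize(fragment))
            if overlap > best_overlap:
                best, best_overlap = i, overlap
        if best is None:
            break  # nothing left that shares a term with what we know so far
        used.add(best)
        chain.append(memory[best])
        # Terms from the new fragment become clues for the next hop.
        query_terms |= tokenize(memory[best])
    return chain

memory = [
    "Alice manages the Orion project.",
    "The Orion project is hosted in Lisbon.",
    "Bob likes hiking.",
]
chain = multi_hop_chain("Where does Alice work", memory)
print(chain)
```

A single lookup on the question alone finds only the fragment about Alice; the second fragment becomes reachable only after the first hop contributes "Orion" and "project" to the query.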

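The results?

· Scaling from 16K to 100 million tokens, accuracy degrades by less than 9%

· A 4B-parameter MSA model beats top-tier 235B-class RAG systems on long-context benchmarks

· Two A800s are enough to run inference over 100 million tokens. This is not lab-only; it is a cost a startup can afford.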
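A quick back-of-the-envelope calculation shows why linear complexity matters at this scale. It assumes "16K" means 16,000 tokens (it could equally mean 16,384) and compares idealized linear and quadratic cost growth, ignoring constant factors and hardware effects.

```python
short_ctx = 16_000        # "16K", taken as 16,000 tokens (assumption)
long_ctx = 100_000_000    # 100 million tokens

growth = long_ctx / short_ctx          # how much longer the context gets
linear_cost_growth = growth            # cost proportional to n
quadratic_cost_growth = growth ** 2    # cost proportional to n^2 (full attention)

print(f"context growth:        {growth:,.0f}x")
print(f"linear cost growth:    {linear_cost_growth:,.0f}x")
print(f"quadratic cost growth: {quadratic_cost_growth:,.0f}x")
```

Under these assumptions the context grows 6,250x, so a linear-cost method pays 6,250x more compute while a quadratic one would pay roughly 39 million times more.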

Put plainly, large models used to be extremely smart geniuses with the memory of a goldfish. What MSA wants to do is make them truly "remember".

We have put it on GitHub. The algorithm team worked hard on this, so a star would be much appreciated. 🌟👀 github.com/EverMind-AI/MSA67q

Mar 19, 2026, 11:55 PM

One Sentence Summary

The MSA architecture achieves linear-complexity ultra-long memory through end-to-end training, maintaining high precision at 100M token context with minimal inference costs.

Summary

This tweet introduces the Memory Sparse Attention (MSA) paper, highlighting how it achieves native ultra-long memory for LLMs by integrating memory directly into the attention mechanism. Compared to RAG and traditional long-window approaches, MSA features linear computational complexity, supports multi-hop reasoning, and is highly cost-effective (running 100M tokens on just 2 A800s). The author presents it as a significant technical step toward the long-term memory that AGI would require.

