← Back to overview

Running Massive MoE Models Locally via SSD Streaming

📅 2026-03-24 12:08 · Simon Willison · AI · 3-minute read · 3,600 characters · Score: 88
Local LLM · MoE · MLX · MacBook · Kimi


URL Source: https://www.bestblogs.dev/status/2036294026438254783

Published Time: 2026-03-24 04:08:23


Running Massive MoE Models Locally via SSD Streaming


### Simon Willison

@simonw

Turns out you can run enormous Mixture-of-Experts models on Mac hardware without fitting the whole model in RAM, by streaming a subset of expert weights from SSD for each generated token - and people keep finding ways to run bigger models

Kimi 2.5 is 1T parameters, but only 32B are active per token, so it fits in 96GB
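The arithmetic behind that claim, as a back-of-envelope sketch. The quantization level isn't stated in the tweet; ~4 bits per parameter (0.5 bytes) is an assumption here, chosen only to illustrate why the full model can't be RAM-resident while the active slice easily is:

```python
# Back-of-envelope memory math for a 1T-param MoE.
# BYTES_PER_PARAM = 0.5 assumes ~4-bit quantization (not stated in the tweet).
TOTAL_PARAMS = 1_026_408_232_448   # Kimi K2.5 total, from the quoted tweet
ACTIVE_PARAMS = 32_000_000_000     # ~32B parameters routed per token
BYTES_PER_PARAM = 0.5              # assumed ~4-bit quantization
GIB = 1024 ** 3

total_gib = TOTAL_PARAMS * BYTES_PER_PARAM / GIB    # must live on SSD
active_gib = ACTIVE_PARAMS * BYTES_PER_PARAM / GIB  # must live in RAM

print(f"full model on SSD: {total_gib:.0f} GiB")   # far exceeds 96 GB RAM
print(f"active experts:    {active_gib:.0f} GiB")  # fits in 96 GB with room
```

At these assumptions the full model is roughly 478 GiB on disk while the per-token active set is about 15 GiB, which is why streaming the routed experts from SSD makes a 96 GB machine sufficient.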


#### seikixtc

@seikixtc · 7h ago

I got a 1T-parameter model running locally on my MacBook Pro.

LLM: Kimi K2.5

1,026,408,232,448 params (~1.026T)

Hardware: M2 Max MacBook Pro (2023) w/ 96GB unified memory

Running on MLX with a flash-style SSD streaming path + local patching.

This is an experimental setup and I haven’t optimized speed yet, but it’s stable enough that I’ve started testing it in an autoresearch-style loop.

#LocalAI #MLX #MoE


Mar 24, 2026, 4:08 AM

30 Replies · 78 Retweets · 919 Likes · 60.6K Views

One Sentence Summary

Simon Willison highlights a breakthrough technique for running massive Mixture-of-Experts models on consumer Mac hardware by streaming weights from SSD.

Summary

This tweet discusses a significant optimization technique for local LLM inference. By streaming a subset of expert weights from SSD instead of loading the entire model into RAM, users can run massive models, such as the 1T-parameter Kimi 2.5, on devices with limited memory like a MacBook Pro. This approach significantly lowers the hardware barrier for running state-of-the-art models.
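To make the mechanism concrete, here is a minimal sketch of per-token expert streaming. This is not the MLX implementation from the tweet; it is an illustrative toy in NumPy, assuming expert weights live in a memory-mapped file on SSD, a router picks top-k experts per token, and only those experts are read into RAM, with a small LRU cache for reuse:

```python
# Toy sketch of MoE expert streaming (illustrative, not the MLX code):
# expert weights stay on disk in a memory-mapped file; each forward pass
# reads only the router-selected top-k experts into RAM.
import os
import tempfile
from collections import OrderedDict
import numpy as np

N_EXPERTS, D_IN, D_OUT, TOP_K = 8, 16, 16, 2

# Write dummy expert weights to disk to stand in for the model file on SSD.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
rng = np.random.default_rng(0)
rng.standard_normal((N_EXPERTS, D_IN, D_OUT)).astype(np.float32).tofile(path)

# Memory-map the file: no expert is resident until its slice is read.
experts = np.memmap(path, dtype=np.float32,
                    shape=(N_EXPERTS, D_IN, D_OUT), mode="r")

cache = OrderedDict()  # expert id -> weights, LRU order
CACHE_CAP = 4          # keep at most 4 experts resident in RAM

def load_expert(idx):
    """Fetch one expert's weights, streaming from SSD on a cache miss."""
    if idx in cache:
        cache.move_to_end(idx)
        return cache[idx]
    w = np.array(experts[idx])      # the actual disk read happens here
    cache[idx] = w
    if len(cache) > CACHE_CAP:
        cache.popitem(last=False)   # evict the least-recently-used expert
    return w

def moe_forward(x, router_w):
    """One MoE layer step: route, stream top-k experts, mix their outputs."""
    logits = x @ router_w                  # (N_EXPERTS,) routing scores
    top = np.argsort(logits)[-TOP_K:]      # ids of the chosen experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    return sum(g * (x @ load_expert(i)) for g, i in zip(gates, top))

router_w = rng.standard_normal((D_IN, N_EXPERTS)).astype(np.float32)
x = rng.standard_normal(D_IN).astype(np.float32)
y = moe_forward(x, router_w)  # only TOP_K of the 8 experts were read
```

The key property is that RAM usage scales with the cache capacity and the active expert count, not with the total number of experts, which is what lets a 1T-parameter model run on a 96 GB machine.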

AI Score

88

Influence Score 159

Published At 2026-03-24

Language

English

Tags

LocalLLM

MoE

MLX

MacBook

Kimi


View original → Published: 2026-03-24 12:08:23 · Indexed: 2026-03-24 16:01:01
