
NVIDIA Releases Nemotron 3 VoiceChat, Advancing Open-Source Speech to Speech Models

📅 2026-03-18 00:36 · Matthew Berman · Artificial Intelligence · 3 min read · 2529 characters · Score: 86
NVIDIA, Nemotron 3 VoiceChat, Speech to Speech, Open Weights, AI, AI Models
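📌 One-sentence summary: Matthew Berman asks whether NVIDIA is making a major push for open-source AI, in response to its release of Nemotron 3 VoiceChat, an open weights speech to speech model with 12 billion parameters.

📝 Detailed summary: The tweet raises the question of NVIDIA's strategic push into open-source AI in light of the Nemotron 3 VoiceChat release. The quoted tweet describes Nemotron 3 VoiceChat in detail: an open weights speech to speech (S2S) model with roughly 12 billion parameters. It highlights the model's leading position among open weights full duplex S2S models in balancing conversational dynamics against speech reasoning, and provides key benchmark results.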

NVIDIA has released Nemotron 3 VoiceChat! A ~12B parameter Speech to Speech model that leads our open weights Conversational Dynamics vs. Speech Reasoning pareto frontier.

Understanding Speech to Speech model performance is multidimensional. Two key and distinct dimensions are raw intelligence and conversational dynamics: how well a model handles the natural rhythms of human conversation, such as turn-taking and interruptions.

Amongst full duplex open weights models, NVIDIA’s new Nemotron 3 VoiceChat, V1, leads in balancing these dimensions, setting itself apart from other models on the Conversational Dynamics vs. Speech Reasoning pareto frontier.

Key benchmarking results:

➤ Conversational Dynamics (Full Duplex Bench): Nemotron 3 VoiceChat (V1) scores 77.8%, second among open weights speech to speech models behind NVIDIA's own PersonaPlex (91.0%) and ahead of FLM-Audio (62.0%), Moshi (61.0%) and Freeze-Omni (58.7%)

➤ Speech Reasoning (Big Bench Audio): Nemotron 3 VoiceChat (V1) scores 29.2%, second among open weights speech to speech models behind Freeze-Omni (33.9%) and well ahead of PersonaPlex (12.6%), FLM-Audio (5.3%) and Moshi (1.7%)

➤ Pareto leader: While Freeze-Omni leads on speech reasoning and PersonaPlex leads on conversational dynamics, Nemotron 3 VoiceChat (V1) is the only open weights model that ranks in the top 2 on both, making it the clear leader on the pareto frontier between these two critical dimensions

➤ Larger than other open weights models but still relatively small compared to LLMs: Nemotron 3 VoiceChat (V1) has ~12B parameters, making it one of the larger open weights speech to speech models (NVIDIA's PersonaPlex is ~7B), though it is still relatively small compared to leading LLMs

➤ Context vs. proprietary models: While this release materially advances open weights performance, open weights speech to speech models still significantly underperform leading proprietary offerings. For comparison, proprietary models on our Big Bench Audio benchmark score substantially higher - Step-Audio R1.1 at 96%, Grok Voice Agent at 92%, Gemini 2.5 Flash (Thinking) at 92%, and Nova 2.0 Sonic at 87%. The gap between open weights and proprietary remains large in this modality.
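
The Pareto comparison in the results above can be checked from the listed scores alone. The following is a minimal Python sketch, not the benchmark authors' method. It assumes the five open weights models named above are the complete comparison set (the source may track others), and the helper names `dominates`, `pareto_frontier` and `rank` are illustrative, not taken from the source.

```python
# Scores copied from the results above:
# (Full Duplex Bench %, Big Bench Audio %). Higher is better on both axes.
# Assumption: these five models are the whole comparison set.
scores = {
    "Nemotron 3 VoiceChat (V1)": (77.8, 29.2),
    "PersonaPlex": (91.0, 12.6),
    "FLM-Audio": (62.0, 5.3),
    "Moshi": (61.0, 1.7),
    "Freeze-Omni": (58.7, 33.9),
}


def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(points):
    """Names of the points that no other point dominates, sorted alphabetically."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


def rank(points, axis):
    """Map each name to its 1-based rank on one axis (1 = highest score)."""
    ordered = sorted(points, key=lambda n: points[n][axis], reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


frontier = pareto_frontier(scores)
dynamics_rank = rank(scores, 0)
reasoning_rank = rank(scores, 1)

# A model's "worst rank" is the weaker of its two placements.
worst_rank = {n: max(dynamics_rank[n], reasoning_rank[n]) for n in scores}

for name in frontier:
    print(name, "dynamics rank:", dynamics_rank[name], "reasoning rank:", reasoning_rank[name])
```

Under those assumptions, three models are non-dominated: PersonaPlex, Nemotron 3 VoiceChat (V1) and Freeze-Omni. Nemotron 3 VoiceChat (V1) is the one whose weaker placement is best, since it ranks second on both benchmarks.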

As the capability and adoption of Speech to Speech models increase, we expect to expand our set of benchmarks to include elements such as tool-calling and multi-turn instruction following.


Published: 2026-03-18 00:36:37 · Collected: 2026-03-18 04:00:42
