
Does Your Model Really "Generalize"? RoboChallenge Table30 V2 Officially Released

📅 2026-03-24 16:33 · 量子位的朋友们 · Artificial Intelligence · 11 min · 12968 characters · Score: 87
Embodied AI · RoboChallenge · Table30 V2 · Generalization Evaluation · Zero-shot Learning

Does Your Model Really "Generalize"? RoboChallenge Table30 V2 Officially Released, Ushering in the Era of Generalization

量子位 @量子位的朋友们

One Sentence Summary

RoboChallenge Table30 V2 is officially released, providing a generalization evaluation standard for embodied AI models oriented towards large-scale, real-world native deployment through three major upgrades: tasks, evaluation, and systems.

Summary

The article details the release and core improvements of the embodied AI evaluation benchmark, RoboChallenge Table30 V2. Addressing the overfitting issues common in single-task fine-tuning of embodied AI models, Table30 V2 has been deeply reconstructed across three dimensions: tasks, evaluation, and systems. At the task level, it introduces high-difficulty scenarios such as soft object manipulation, tool usage, and bimanual collaboration. At the evaluation level, it shifts fully to a multi-task paradigm and introduces Zero-shot and Out-of-Distribution (OOD) generalization tests. At the system level, it achieves a 300% increase in throughput through hardware expansion. This benchmark will debut as the platform for the CVPR 2026 Workshop competition, aiming to drive embodied AI from single-point breakthroughs to full-dimensional generalization.

Main Points

* 1. Task complexity significantly increased, facing real-world challenges head-on.

Adds 18 high-difficulty tasks, including soft object manipulation, tool usage, and bimanual collaboration, testing the model's spatial reasoning, physical common sense, and high-precision synchronous control capabilities.

* 2. Evaluation protocol reconstructed, shifting from single-task fine-tuning to full-dimensional generalization.

Introduces Zero-shot testing and Out-of-Distribution (OOD) generalization matrices to prevent overfitting optimization for specific tasks, advocating for the development of a single model with general understanding capabilities.

* 3. Underlying infrastructure upgraded, significantly improving research iteration efficiency.

System throughput increased by 3x, and 'completion time' is introduced as a key performance metric, compelling researchers to optimize strategy execution efficiency rather than blindly stacking inference time.

Metadata

AI Score

87

Website qbitai.com

Published At Today

Length 1851 words (about 8 min)


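Embodied AI stands at a critical historical juncture. With the explosive growth of VLA and WMA models, the industry's core pain point is becoming increasingly clear: beyond single-task fine-tuning, do embodied AI models truly generalize, or have they merely overfit to individual tasks?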

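With this question in mind, RoboChallenge has officially released Table30 V2. Built as a large-scale, real-robot-native generalization benchmark for next-generation models, it has been restructured along three dimensions: task upgrades, evaluation upgrades, and system upgrades. It introduces more demanding soft-object, tool-use, and bimanual collaboration tasks, supports zero-shot and out-of-domain generalization testing, and delivers three times the system throughput of the previous edition. Table30 V2 gives embodied AI researchers worldwide a precise "generalization yardstick" and a fair, open real-robot arena, and its preview version will make its debut as part of the RoboChallenge CVPR 2026 Workshop competition.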

I. Task Upgrade: Confronting the Complexity and Dexterity of the Real World


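The ultimate goal of embodied AI is to handle the unpredictable, complex scenarios of the real world. The first core improvement in Table30 V2 is a rebuilt task set. On top of the existing set, we added 18 brand-new bimanual dexterous manipulation tasks which, together with 12 retained classic tasks, form a comprehensive evaluation suite of 30 high-difficulty tasks. Why raise task difficulty this way? Because a lack of generalization often stems from overfitting to "simple tasks".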

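* From rigid to soft: We introduced tasks that involve soft, continuum objects such as ropes and cloth. These objects can deform in unlimited ways and have no fixed state, which strongly challenges a model's spatial reasoning and adaptive control.
* Deep interaction between tools and space: The new tasks require the robot to use tools accurately and to understand complex tool-object spatial relationships. This tests not only precision control but also the model's physical common sense.
* The necessity of bimanual collaboration: Table30 V2 includes a large number of tasks that require tight coordination between two hands, demanding high-precision synchronized control in dynamically constrained environments.
* Hardware upgrade to verify cross-platform capability: We introduced DOS-W1, a new-generation mobile dual-arm manipulation platform (with triangular-tip grippers), alongside the classic Aloha system. This lowers the overall hardware barrier to entry, and the parallel dual-platform evaluation rigorously verifies model robustness across different hardware configurations.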

II. Evaluation Upgrade: From Single-Point Breakthroughs to Full-Dimensional Generalization


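In embodied AI, we often see models that perform well on a single task but fail as soon as the scene changes slightly. By restructuring its learning and testing protocols, Table30 V2 puts an end to the old pattern of "tuning parameters for the competition".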

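* A full shift to the "multi-task" paradigm: Table30 V2 will fully support multi-task models and rules out the "cheating-style" optimization of training a dedicated model for each task. We require participants to submit a single model with general understanding capabilities, which is consistent with the industry-wide trend toward general-purpose embodied foundation models.
* Introduction of zero-shot testing: This is the highlight of the upgrade. We introduced zero-shot tests at two levels, the object level and the environment level. Models must demonstrate genuine understanding, rather than mere memorization, when facing unseen objects, unseen backgrounds, and even dynamically changing disturbances (such as small adjustments to table height).
* A layered generalization matrix (In-Domain vs. Out-of-Domain): Evaluation is upgraded from traditional in-domain assessment to higher-order testing that includes out-of-domain (OOD) conditions. We test not only how the model performs under expected conditions but also push its limits by dynamically changing the test environment (for example, uncontrolled changes such as replacing the tabletop with a sofa). This is no longer a simple score comparison but a stress test of the model's "essence of intelligence".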
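As a rough illustration of how results from such a layered protocol might be summarized, the sketch below groups trial outcomes by evaluation split and reports a success rate per split plus the gap between in-domain and held-out performance. The split labels, the `Trial` record, and the task name are hypothetical and chosen only for illustration; the article does not describe the official scoring format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    task: str      # hypothetical task name
    split: str     # one of SPLITS below (assumed labels, not official)
    success: bool  # whether the rollout completed the task


# Assumed split labels mirroring the article's description of the protocol.
SPLITS = ("in_domain", "zero_shot_object", "zero_shot_environment", "out_of_domain")


def success_by_split(trials):
    """Return {split: success rate} for every split with at least one trial."""
    counts = defaultdict(lambda: [0, 0])  # split -> [successes, total]
    for t in trials:
        if t.split not in SPLITS:
            raise ValueError(f"unknown split: {t.split}")
        counts[t.split][0] += int(t.success)
        counts[t.split][1] += 1
    return {s: counts[s][0] / counts[s][1] for s in SPLITS if s in counts}


def generalization_gap(rates, held_out="out_of_domain"):
    """In-domain success rate minus the success rate on a held-out split."""
    return rates["in_domain"] - rates[held_out]


if __name__ == "__main__":
    demo = [
        Trial("fold_cloth", "in_domain", True),
        Trial("fold_cloth", "in_domain", True),
        Trial("fold_cloth", "out_of_domain", True),
        Trial("fold_cloth", "out_of_domain", False),
    ]
    rates = success_by_split(demo)
    print(rates, generalization_gap(rates))
```

A large positive gap would suggest that a model is memorizing the training conditions rather than generalizing, which is the failure mode the protocol is meant to expose.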

III. System Upgrade: 300% Higher Throughput for a Fast Feedback Loop

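For algorithm iteration, speed is a lifeline. So that researchers no longer waste valuable compute while waiting for evaluation results, we thoroughly expanded and optimized the underlying infrastructure.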

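* A 300% leap in throughput: By purchasing a large amount of additional mainstream robot hardware and optimizing the scheduling algorithm, we achieved a major increase in system throughput.
* Faster task preparation: We relaxed the strict pixel-level initial-state requirement in favor of a more realistic "rough alignment", which greatly reduces idle time during task setup so that research teams can get feedback in real time and iterate quickly.
* "Time consumption" as a performance metric: We added "Time to Complete" to the leaderboard as a key scoring dimension, pushing researchers to optimize the execution efficiency of their policies rather than blindly piling on inference time in exchange for success rate.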
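To make the role of completion time concrete, here is a minimal sketch of one way a leaderboard could combine the two signals: rank by success rate first and break ties by mean completion time. This ordering rule, the `Entry` fields, and the model names are assumptions made for illustration; the article states only that "Time to Complete" is a scoring dimension, not how it is weighted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str           # hypothetical model name
    success_rate: float  # fraction of successful trials, 0.0 to 1.0
    mean_time_s: float   # mean time to complete a task, in seconds


def rank(entries):
    """Sort by success rate (higher first), then by mean completion time (lower first).

    Assumed tie-breaking rule for illustration, not the official formula.
    """
    return sorted(entries, key=lambda e: (-e.success_rate, e.mean_time_s))


if __name__ == "__main__":
    board = [
        Entry("model_a", 0.80, 40.0),
        Entry("model_b", 0.80, 25.0),
        Entry("model_c", 0.90, 60.0),
    ]
    for place, entry in enumerate(rank(board), start=1):
        print(place, entry.model, entry.success_rate, entry.mean_time_s)
```

Under this rule, two policies with the same success rate are separated by how quickly they finish, so spending extra inference time brings no benefit unless it also raises the success rate.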

Call for Participation: The RoboChallenge CVPR 2026 Workshop Competition Is Now Open


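The release of Table30 V2 marks not only the birth of a benchmark but also a signpost for the future direction of embodied AI. We sincerely invite research teams, labs, and individual developers worldwide to take part in the RoboChallenge CVPR 2026 Workshop competition. The competition will serve as the debut of the Table30 V2 preview version, and participants will have the chance to validate their models on our real-robot cluster and compete alongside the world's leading embodied AI algorithms.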

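After the competition, we will keep the evaluation infrastructure open to researchers worldwide, so that every team with an idea can validate its model on real machines. Generalization is the next peak for embodied AI. Table30 V2 invites you to climb it together.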

For details, follow the official RoboChallenge channel: https://robochallenge.cn/competition

*This article is published by 量子位 with authorization; the views expressed are solely those of the original author.


Key Quotes

* The lack of generalization capability often stems from overfitting to 'simple tasks'.
* By reconstructing learning and testing protocols, Table30 V2 has completely ended the old model of 'tuning parameters just for the competition'.
* Generalization is the next peak for embodied AI. Table30 V2 invites you to climb it together.


Published: 2026-03-24 16:33:52 · Indexed: 2026-03-24 18:00:59
