Farewell, SWE-Bench! Cursor Just Released an AI Coding Benchmark That's Making Claude Cry
=========================================================================================

📅 2026-03-14 14:25 · Xifeng (西风) · Artificial Intelligence · 13 min read · 15,238 characters · Score: 82
CursorBench · AI Coding · SWE-Bench · Agent Evaluation · LLM


QbitAI @Xifeng

One Sentence Summary

Cursor launched its new AI coding benchmark, CursorBench, which uses real-world tasks and a hybrid online-offline evaluation system to reveal efficiency differences among top models in complex agent tasks.

Summary

This article details Cursor's new generation programming benchmark, CursorBench. Addressing the pain points of traditional benchmarks like SWE-Bench, such as unrealistic task types, simplistic scoring mechanisms, and data contamination, CursorBench adopts a task set based on real development data, incorporating designs like expanded task scope and deliberately vague descriptions. Its core innovation lies in a "hybrid online-offline" evaluation approach: offline, it measures model capabilities through standardized tasks, while online, it validates real user experience via A/B testing. Test results show that models like Claude 4.5 perform significantly worse on CursorBench compared to traditional benchmarks, proving the benchmark's greater reference value in differentiating top model performance, and marking a shift in AI coding evaluation from "outcome-oriented" to "efficiency and agent capability-oriented".

Main Points

* 1. Traditional programming benchmarks can no longer meet the current evaluation needs of AI agents. Benchmarks like SWE-Bench suffer from single-task focus, unreasonable scoring mechanisms, and severe data contamination issues, failing to reflect AI's performance in real-world complex tasks such as multi-file modifications and production log analysis.

* 2. CursorBench enhances evaluation differentiation and reference value through its "realism" design. The benchmark uses real developer requests from the Cursor platform, deliberately maintaining vague task descriptions and expanding task scope, thereby more realistically simulating actual human-AI collaboration scenarios.

* 3. A closed-loop feedback mechanism combines online A/B testing with offline evaluation. By observing metrics such as real user acceptance and rollback rates for AI-generated code, it keeps offline benchmark scores consistent with actual user experience, forming a virtuous cycle of model selection and validation.

* 4. CursorBench emphasizes "efficient execution" rather than merely "task completion". The new benchmark tests models under token constraints and in complex environments, leading to a significant drop in scores for top models like Claude, and revealing performance bottlenecks in real production environments.

Metadata

AI Score

82

Website qbitai.com

Published At Today

Length 2743 words (about 11 min)


> Yishui, reporting from Aofeisi
> QbitAI | Official account QbitAI

In the era of coding agents, Cursor, the field's top player, has raised the flag and released a new evaluation benchmark: CursorBench, built specifically to judge which model inside Cursor is the more capable "agent" (that is, which one executes complex tasks efficiently).

And guess what? Claude Haiku 4.5 and Claude Sonnet 4.5, once famous on SWE-Bench, both fell flat.

* Claude Haiku 4.5's score dropped from 73.3 to 29.4;
* Claude Sonnet 4.5's score dropped from 77.2 to 37.9.

And that neatly illustrates the difference between CursorBench and other coding benchmarks:

> SWE-Bench measures whether a program can solve a problem; CursorBench measures whether it can solve the problem efficiently, completing the task under real token constraints. That gap is exactly what ordinary benchmarks cannot capture.
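The "real token constraints" in that quote can be pictured as a hard budget on an agent run. Cursor has not published its harness, so everything below (function names, the shape of `agent_step`, the budget value) is a hypothetical sketch of the idea:

```python
def run_under_budget(agent_step, task, token_budget=100_000):
    """Drive a hypothetical agent loop, aborting once the cumulative
    token count exceeds the budget. `agent_step(task, history)` is
    assumed to return (action, tokens_used, done)."""
    used = 0
    history = []
    while True:
        action, tokens_used, done = agent_step(task, history)
        used += tokens_used
        if used > token_budget:
            # The task might be solvable eventually, but not efficiently:
            # under this scoring, "too expensive" counts as a failure.
            return {"solved": False, "reason": "budget_exceeded", "tokens": used}
        history.append(action)
        if done:
            return {"solved": True, "tokens": used}
```

Under a scheme like this, a model that wanders or burns tokens re-reading files fails even if it would eventually reach a fix, which is precisely the gap the quote describes.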


With "lobster" agents all the rage, everyone knows that evaluating AI today is about execution ability, and efficient execution at that.

And CursorBench arrives to fill exactly that gap.

But that raises the question: how exactly does CursorBench evaluate?

On that question, Cursor wrote a dedicated blog post.


Right off the bat, Cursor lays out the basic background: as AI coding assistants become more and more "agentic", many existing public benchmarks are no longer adequate.

There are three main problems.

First: the task types are unrealistic.

Take the best-known benchmark: SWE-Bench is mainly about fixing bugs from GitHub issues, a fairly narrow task type.

Terminal-Bench is no longer confined to code repositories, but it leans toward "puzzle-style" tasks, such as completing a series of challenges in a given environment; there, the AI is more like a contestant in a competition than a developer doing day-to-day work.

So, as Cursor puts it: "We found that these tasks don't match the coding work developers actually ask agents to do."

In real life, it's more common for developers to ask AI to modify multiple files, analyze production logs, run experiments, and so on; in short, work more complex than what the benchmarks test.

Second: the scoring mechanism is flawed.

Many public benchmarks assume that a problem has exactly one correct answer.

But in reality, a requirement can be implemented in multiple ways, with different code styles and architectural choices.

This tends to produce one of two outcomes: a correct solution gets marked wrong (a false negative), or ambiguity gets forcibly eliminated for the sake of gradability (an artificial constraint).

Either way, the benchmark fails to reflect reality.

Third: the widely acknowledged data-contamination problem.

This one needs little explanation: once a benchmark has been around long enough, later models are very likely to have scraped its data into their training sets.

So when scoring happens with the answers effectively leaked in advance, how much those results are worth is anyone's guess.


Facing these problems, Cursor came up with a brand-new hybrid "online + offline" evaluation scheme.

Offline is the CursorBench we've been talking about, and the process is relatively simple:

All models complete the same batch of standardized tasks; the system then scores them along dimensions such as correctness, code quality, efficiency, and interaction behavior, and each model ends up with an offline benchmark score.

The benefits of this standardized process are obvious: models can be compared from roughly the same starting line, tests are repeatable, and costs stay manageable.
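Cursor hasn't published how the per-dimension scores are combined into one offline number. A simple weighted average illustrates the idea; the dimension names come from the article, but the aggregation formula and equal-weight default are pure assumptions:

```python
def offline_score(dims, weights=None):
    """Combine per-dimension scores (0-100) into a single offline
    benchmark score. Dimensions follow the article (correctness,
    code quality, efficiency, interaction); the weighted-average
    formula and equal-weight default are assumptions, not Cursor's."""
    weights = weights or {k: 1.0 for k in dims}
    total_w = sum(weights[k] for k in dims)
    return sum(dims[k] * weights[k] for k in dims) / total_w

# A hypothetical model with strong correctness but poor efficiency:
model_a = {"correctness": 80, "code_quality": 70,
           "efficiency": 40, "interaction": 60}
```

The interesting design choice is that efficiency is a first-class dimension rather than a tiebreaker, so a model that solves everything slowly can still rank below one that solves slightly less but cheaply.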

At this point someone might object: isn't this the same as every other benchmark?

Not so fast. Here is CursorBench's secret weapon: the tasks are chosen differently.

That difference shows up in three dimensions.

First: the tasks are real.

Earlier benchmarks felt more like deliberately hunting for problems, scouring GitHub issues or puzzle sets; CursorBench's problems all come from Cursor's own platform.

Cursor has a tool called Cursor Blame that can trace which AI request a given piece of code was generated by.

That yields pairs of real data: a developer's request plus the code a model ultimately committed.
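Conceptually, each benchmark candidate is one such request-to-commit pair. Cursor hasn't published Cursor Blame's schema, so the record type and field names below are hypothetical, sketched only to show the shape of the data:

```python
from dataclasses import dataclass

@dataclass
class BlamePair:
    """One (developer request, committed code) pair traced by a
    Cursor-Blame-style tool; all field names are assumptions."""
    request: str          # the developer's original, possibly vague prompt
    model: str            # which model handled the request
    committed_diff: str   # the code change that actually landed
    files_touched: int    # rough size signal, usable for scale filtering

def to_benchmark_tasks(pairs, min_files=2):
    """Keep larger, multi-file pairs as candidate benchmark tasks,
    mirroring the article's emphasis on bigger, messier work."""
    return [p for p in pairs if p.files_touched >= min_files]
```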

These pairs make an excellent question bank for CursorBench. And Cursor adds:

> Many tasks come from our internal codebases and controlled sources, which lowers the risk that models saw them during training. We refresh the benchmark every few months to track how developers' use of agents changes.

Second: the tasks are larger in scale.

With so many people now using Cursor, CursorBench's tasks are noticeably bigger.

In the correctness evaluation, for example, problem size, measured either in lines of code or in average file count, has roughly doubled from the initial version to the current CursorBench-3. Cursor notes:

> Although lines of code is not a perfect proxy for difficulty, growth on that metric reflects how we've folded more challenging tasks into CursorBench, such as working in multi-workspace monorepo environments, digging through production logs, and running long-lived experiments.

Third: task descriptions are deliberately kept vague.

This one is easy to understand.

Task descriptions in public benchmarks are usually extremely detailed, but in reality people are often ambiguous when talking to AI.

So being too precise actually works against realism.


With these design choices in place, CursorBench became a benchmark genuinely designed around real development scenarios for the coding-agent era.

But that's not all. Exam problems alone aren't enough: plenty of AIs score high offline yet feel terrible the moment users actually get their hands on them.

So Cursor also built an online evaluation: directly observing how things go for real users.

They run A/B tests, comparing outcomes between one group of users on model A and another on model B.

Concretely, they track product metrics such as whether developers accept the AI-generated code, whether they keep asking follow-ups, whether they revert the changes, and whether the task actually gets done.
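The metrics listed above reduce, per A/B arm, to simple rates over tracked events. A minimal sketch, assuming hypothetical event field names (Cursor's real telemetry schema is not public):

```python
def ab_metrics(events):
    """Summarize one A/B arm. `events` is a list of dicts with boolean
    fields 'accepted', 'followed_up', 'reverted', 'completed'; these
    field names are assumptions, not Cursor's actual schema."""
    n = len(events)

    def rate(key):
        # Booleans sum as 0/1, so this is the fraction of True events.
        return sum(e[key] for e in events) / n

    return {
        "accept_rate": rate("accepted"),
        "follow_up_rate": rate("followed_up"),
        "revert_rate": rate("reverted"),
        "completion_rate": rate("completed"),
    }
```

Comparing these dicts across the model-A and model-B arms is what lets the online signal confirm or contradict the offline CursorBench ranking.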

Online and offline thus complement each other, even forming a virtuous cycle: offline, CursorBench quickly screens model capability; online tests then verify whether a model really is better; when discrepancies appear, the benchmark or the model gets adjusted.

And just like that, the flywheel spins up (doge).

So how do models actually fare on the new CursorBench?

Here's the final performance chart (closer to the top-right is better, meaning "highest performance at lowest cost"):

[Figure: performance vs. cost for each model on CursorBench; top-right is best]

Seeing the chart, netizens had plenty to say:

> Huh, didn't expect Claude Sonnet 4.5's "value for money" to be this low.


> And where did this Composer model (Cursor's in-house coding model) come from?


Anyway, one conclusion jumps out of Cursor's published results: CursorBench separates frontier models far more clearly.

That's only natural. Once a benchmark saturates, models can't pull apart; everyone scores high and everyone looks good.

But faced with something new and hard, the real gaps in capability show themselves.

Especially on a benchmark like CursorBench, with larger tasks and more complex environments, those gaps only widen further.

Just compare model scores on SWE-Bench and CursorBench (on the left everything clusters together; on the right, a staircase):

[Figure: model scores on SWE-Bench (tightly clustered) vs. CursorBench (staircase spread)]

Cursor also stressed one more point: CursorBench's rankings align better with real user experience.

Through the online experiments described above, they found that CursorBench's model rankings move in broadly the same direction as those online metrics.


Next, Cursor plans to start building the next generation of its evaluation suite:

> Although CursorBench-3's tasks run longer than those in public benchmarks, they can still be completed within a single session. We expect that over the next year the vast majority of development work will shift to long-running agents operating independently on their own machines, so we are planning to adapt CursorBench accordingly.

In short, the target is still agents, just agents that run for much longer.

References:

[1]https://x.com/cursor_ai/status/2032148125448610145

[2]https://cursor.com/cn/blog/cursorbench

[3]https://www.objectwire.org/technology/cursor


Key Quotes

* SWE-Bench measures whether a program can solve a problem, while CursorBench measures whether a program can solve a problem efficiently.
* Many tasks come from our internal codebase and controlled sources, thereby reducing the risk of models having seen these tasks during training.
* CursorBench's rankings are more consistent with real user experience.
* Once a benchmark becomes saturated, models often show little differentiation; however, when faced with new and difficult challenges, performance gaps naturally emerge.


View original → Published: 2026-03-14 14:25:48 · Indexed: 2026-03-14 16:00:53
