

Title: Critique of SWE-bench and New Model Evaluation Methods | ...

URL Source: https://www.bestblogs.dev/status/2032156178566758845

Published Time: 2026-03-12 18:06:03


Critique of SWE-bench and New Model Evaluation Methods
======================================================

### Lee Robinson

@leerob

SWE-bench results are really cooked, you can now trivially see which models have been contaminated and will spit out exact answers.

The best way to measure model quality seems to be both offline benches + online evals. Think we have landed on something pretty good here!


#### Cursor

@cursor_ai · 2h ago

We're sharing a new method for scoring models on agentic coding tasks.

Here's how models in Cursor compare on intelligence and efficiency:

[Tweet image: chart comparing models in Cursor on intelligence and efficiency]

74 Replies · 81 Retweets · 936 Likes · 71.4K Views

Mar 12, 2026, 6:06 PM

10 Replies · 1 Retweet · 97 Likes · 8,554 Views

Lee Robinson @leerob

One Sentence Summary

Lee Robinson discusses the contamination of SWE-bench results and advocates for a hybrid approach of offline benchmarks and online evaluations for AI coding models.

Summary

In response to Cursor AI's new scoring method for agentic coding tasks, Lee Robinson points out that traditional SWE-bench results are increasingly unreliable due to data contamination, where models memorize specific test cases. He suggests that the most effective way to measure the quality of AI coding models is through a combination of offline benchmarking and real-world online evaluations, validating the methodology being adopted by Cursor.
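Neither tweet specifies how contamination is actually detected. As an illustration only (all function and variable names here are hypothetical, not from Cursor or SWE-bench tooling), a naive check might compare a model's generated patch against the benchmark's published gold patch: suspiciously exact reproductions across many tasks suggest memorization rather than problem-solving.

```python
from difflib import SequenceMatcher

def contamination_score(model_patch: str, gold_patch: str) -> float:
    """Text similarity between a model's patch and the benchmark's gold patch.

    A ratio near 1.0 on many independent tasks hints that the model may be
    reproducing memorized answers rather than deriving its own fix.
    """
    return SequenceMatcher(None, model_patch, gold_patch).ratio()

def flag_suspicious(results, threshold=0.95):
    """Return task IDs whose output nearly matches the gold patch verbatim.

    results: iterable of (task_id, model_patch, gold_patch) tuples.
    """
    return [
        task_id
        for task_id, model_patch, gold_patch in results
        if contamination_score(model_patch, gold_patch) >= threshold
    ]
```

This only catches verbatim leakage; subtler contamination (paraphrased fixes, memorized test expectations) would need the kind of online, real-usage evaluation the tweet advocates.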

AI Score

86

Influence Score 15

Published At: Mar 12, 2026

Language

English

Tags

SWE-bench

Model Evaluation

Data Contamination

AI Coding

Cursor AI


Published: 2026-03-13 02:06:03 · Indexed: 2026-03-13 04:00:41
