

Title: Critique of SWE-bench and New Model Evaluation Methods | ...

URL Source: https://www.bestblogs.dev/status/2032156178566758845

Published Time: 2026-03-12 18:06:03


Critique of SWE-bench and New Model Evaluation Methods
======================================================

### Lee Robinson

@leerob

SWE-bench results are really cooked, you can now trivially see which models have been contaminated and will spit out exact answers.

The best way to measure model quality seems to be both offline benches + online evals. Think we have landed on something pretty good here!


#### Cursor

@cursor_ai · 2h ago

We're sharing a new method for scoring models on agentic coding tasks.

Here's how models in Cursor compare on intelligence and efficiency:

[Tweet image: chart comparing models in Cursor on intelligence and efficiency]

74 Replies · 81 Retweets · 936 Likes · 71.4K Views

Mar 12, 2026, 6:06 PM

10 Replies · 1 Retweet · 97 Likes · 8,554 Views

Lee Robinson @leerob

One Sentence Summary

Lee Robinson discusses the contamination of SWE-bench results and advocates for a hybrid approach of offline benchmarks and online evaluations for AI coding models.

Summary

In response to Cursor AI's new scoring method for agentic coding tasks, Lee Robinson points out that traditional SWE-bench results are increasingly unreliable due to data contamination, where models memorize specific test cases. He suggests that the most effective way to measure the quality of AI coding models is through a combination of offline benchmarking and real-world online evaluations, validating the methodology being adopted by Cursor.
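Neither tweet specifies how contamination is actually detected. As an illustration only (all function and variable names here are hypothetical, not from Cursor or SWE-bench tooling), a naive check might compare a model's generated patch against the benchmark's published gold patch: suspiciously exact reproductions across many tasks suggest memorization rather than problem-solving.

```python
from difflib import SequenceMatcher

def contamination_score(model_patch: str, gold_patch: str) -> float:
    """Text similarity between a model's patch and the benchmark's gold patch.

    A ratio near 1.0 on many independent tasks hints that the model may be
    reproducing memorized answers rather than deriving its own fix.
    """
    return SequenceMatcher(None, model_patch, gold_patch).ratio()

def flag_suspicious(results, threshold=0.95):
    """Return task IDs whose output nearly matches the gold patch verbatim.

    results: iterable of (task_id, model_patch, gold_patch) tuples.
    """
    return [
        task_id
        for task_id, model_patch, gold_patch in results
        if contamination_score(model_patch, gold_patch) >= threshold
    ]
```

This only catches verbatim leakage; subtler contamination (paraphrased fixes, memorized test expectations) would need the kind of online, real-usage evaluation the tweet advocates.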

AI Score

86

Influence Score 15

Published At: Mar 12, 2026

Language

English

Tags

SWE-bench

Model Evaluation

Data Contamination

AI Coding

Cursor AI


Published: 2026-03-13 02:06:03 · Indexed: 2026-03-13 04:00:41
