Title: Critique of SWE-bench and New Model Evaluation Methods | ...
URL Source: https://www.bestblogs.dev/status/2032156178566758845
Published Time: 2026-03-12 18:06:03
Critique of SWE-bench and New Model Evaluation Methods
======================================================

### Lee Robinson
@leerob
SWE-bench results are really cooked, you can now trivially see which models have been contaminated and will spit out exact answers.
The best way to measure model quality seems to be both offline benches + online evals. Think we have landed on something pretty good here!
#### Cursor
@cursor_ai · 2h ago
We're sharing a new method for scoring models on agentic coding tasks.
Here's how models in Cursor compare on intelligence and efficiency:
Mar 12, 2026, 6:06 PM
One Sentence Summary
Lee Robinson discusses the contamination of SWE-bench results and advocates for a hybrid approach of offline benchmarks and online evaluations for AI coding models.
Summary
In response to Cursor AI's new scoring method for agentic coding tasks, Lee Robinson points out that traditional SWE-bench results are increasingly unreliable due to data contamination, where models memorize specific test cases. He suggests that the most effective way to measure the quality of AI coding models is through a combination of offline benchmarking and real-world online evaluations, validating the methodology being adopted by Cursor.
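The contamination signal described above can be illustrated with a minimal sketch: independently written fixes rarely reproduce a benchmark's gold patch verbatim, so an exact (whitespace-normalized) match is a red flag that the model memorized the test case. The function and variable names below are hypothetical illustrations, not SWE-bench's or Cursor's actual API.

```python
def normalize(patch: str) -> str:
    """Strip leading/trailing whitespace per line so trivial
    formatting differences don't hide a verbatim match."""
    return "\n".join(line.strip() for line in patch.strip().splitlines())

def looks_contaminated(predicted_patch: str, gold_patch: str) -> bool:
    """Flag a suspiciously exact reproduction of the gold answer.

    An exact normalized match suggests the model is spitting out
    a memorized benchmark solution rather than solving the task.
    """
    return normalize(predicted_patch) == normalize(gold_patch)

# Example: a model emitting the gold patch verbatim is flagged.
gold = "-    return x\n+    return x + 1\n"
print(looks_contaminated("-    return x\n+    return x + 1", gold))  # True
print(looks_contaminated("-    return x\n+    return 1 + x", gold))  # False
```

In practice a robust check would also compare near-matches (e.g. edit distance or n-gram overlap against the gold patch), since contaminated models may reproduce answers with cosmetic variation.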
AI Score
86
Influence Score 15
Published At Today
Language
English
Tags
SWE-bench
Model Evaluation
Data Contamination
AI Coding
Cursor AI