
University of Wisconsin and MIT Research Finds Serious Flaws in Existing AI Coding Benchmarks

📅 2026-03-30 03:21 God of Prompt · Artificial Intelligence · 3 min read · 2592 words · Score: 86
AI coding · SlopCodeBench · software engineering · AI benchmarks · research
📌 One-sentence summary: The new research introduces SlopCodeBench, revealing that current AI coding benchmarks fail to measure software maintainability and often reward verbose, unmaintainable code.

📝 Detailed summary: This tweet thread summarizes a research paper from the University of Wisconsin and MIT that criticizes existing AI coding benchmarks. The study introduces "SlopCodeBench," a benchmark that abandons single-shot code-generation evaluation in favor of assessing models on iterative software development tasks. The results show that although models can pass existing tests, they frequently produce "slop" (verbose, unmaintainable code) and cannot cope with the complexity introduced by evolving software requirements.

📊 Article info: AI score: 86 · Source: G

BREAKING: Wisconsin and MIT just proved that every AI coding benchmark is measuring the wrong thing. > Pass rate stays high. The code becomes unmaintainable. They tested 11 models including Claude Opus 4.6 and GPT 5.4 on iterative tasks. Zero models solved a single problem end-to-end. Verbosity rose in 89.8% of trajectories. They named it "slop" and built a benchmark around it.

> Every existing coding benchmark evaluates a single shot against a complete spec. Write the code, pass the tests, done. But real software gets extended. Requirements change. Features get added. The agent has to build on its own prior code and nobody was measuring what happens when it does.

> SlopCodeBench forces exactly that. 20 problems, 93 checkpoints. At each step the agent receives an updated spec and must extend its own prior solution. No gold-standard code handed between turns. No fresh start. Bad early architectural decisions compound into every checkpoint that follows.
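The checkpoint protocol described above can be sketched as a small evaluation loop. This is a hypothetical illustration of the setup, not the paper's actual harness; the function names and the "strict solve" definition are assumptions.

```python
# Hypothetical sketch of a SlopCodeBench-style evaluation loop.
# Key property: the agent's own workspace is carried forward between
# checkpoints -- no gold-standard code is substituted, no fresh start.

def evaluate_problem(agent, checkpoints):
    """Run an agent through successive spec checkpoints.

    agent:       callable (spec, prior_workspace) -> new_workspace
    checkpoints: list of (spec, tests) pairs, where each test is a
                 predicate on the workspace
    Returns (strict_solve, per_checkpoint_results).
    """
    workspace = ""  # the agent's accumulated solution
    results = []
    for spec, tests in checkpoints:
        workspace = agent(spec, workspace)  # agent extends its prior code
        results.append(all(test(workspace) for test in tests))
    # "strict solve" assumed here to mean: every checkpoint passed end-to-end
    return all(results), results
```

Because early architectural decisions persist in `workspace`, a shortcut taken at checkpoint 1 shapes every checkpoint that follows, which is exactly the compounding effect the benchmark is designed to expose.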

> The concrete example makes it real. Opus 4.6 on circuit_eval: main() starts at 84 lines, cyclomatic complexity 29. By checkpoint 8 — 1,099 lines, complexity 285. Nine command branches each repeating the same argument-parsing scaffold that should have been extracted once. The code kept passing tests the entire time. Pass-rate benchmarks saw nothing.
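The repeated-scaffold pattern the thread describes can be made concrete with a toy sketch. This is a hypothetical illustration of the slop pattern, not Opus 4.6's actual output; the command names and dispatch logic are invented for the example.

```python
# "Slop" version: each command branch re-implements the same
# argument-parsing scaffold, inflating main() line count and
# cyclomatic complexity with every checkpoint.
def main_sloppy(argv):
    if argv[0] == "eval":
        if len(argv) < 2:
            raise SystemExit("usage: eval <file>")
        return ("eval", argv[1])
    elif argv[0] == "check":
        if len(argv) < 2:
            raise SystemExit("usage: check <file>")
        return ("check", argv[1])
    # ...more branches repeating the same scaffold...

# Maintainable version: the shared scaffold is extracted once, and
# adding a command at a later checkpoint is a one-entry change.
COMMANDS = {"eval", "check"}  # extended per checkpoint

def main_clean(argv):
    cmd = argv[0]
    if cmd not in COMMANDS or len(argv) < 2:
        raise SystemExit(f"usage: {cmd} <file>")
    return (cmd, argv[1])
```

Both versions pass the same behavioral tests, which is why pass-rate benchmarks cannot distinguish them.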

→ Erosion rises in 80% of all trajectories across all 11 models

→ Verbosity rises in 89.8% of trajectories

→ Agent erosion: 0.68. Human-maintained repositories: 0.31

→ Agent verbosity: 0.32. Human repositories: 0.11; even scikit-learn and scipy stay well below agent averages

→ Cost grows 2.9x from first to last checkpoint. The extra spending doesn't improve correctness.

→ Highest strict solve rate across all 11 models: 17.2%

> They tested "anti-slop" prompting, explicitly forbidding verbose patterns, defensive over-engineering, and unnecessary abstractions. It cut initial verbosity by 34.5% on GPT 5.4. The degradation slope stayed identical. Prompts shift the starting point. They don't change the rate of decay.

> The root cause is architectural. Agents optimize locally. Hardcoding logic at checkpoint 1 passes tests and costs less than building extensible abstractions. The benchmark penalizes that at checkpoint 3 when the spec changes but by then the damage is already in the workspace. Complexity concentrates. Verbosity accumulates. Tests keep passing.
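The local-optimization trap can be illustrated with a minimal sketch. This example is hypothetical (the gate-evaluation task and names are invented to echo the circuit_eval example): hardcoding satisfies checkpoint 1 cheaply, while the extensible design costs more up front but absorbs later spec changes.

```python
# Checkpoint 1 spec only requires AND gates. Hardcoding passes the
# tests and costs fewer tokens -- the locally optimal move.
def eval_gate_hardcoded(a, b):
    return a and b  # passes checkpoint 1; a dead end once gate types grow

# Extensible alternative: a gate registry that later checkpoints can
# add to without touching existing logic.
GATES = {"and": lambda a, b: a and b}

def eval_gate(kind, a, b):
    return GATES[kind](a, b)

# Checkpoint 3 adds a gate type: a one-line extension here, versus a
# rewrite of the hardcoded version.
GATES["xor"] = lambda a, b: a != b
```

Nothing in a per-checkpoint test suite penalizes the hardcoded version at checkpoint 1; the cost only surfaces once the spec evolves, by which point the shortcut is already entrenched in the workspace.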

> Pass-rate benchmarks miss this entirely because test suites cannot see structural decay.

The field has been measuring whether AI can write code. The real question is whether it can build software.

View original → Published: 2026-03-30 03:21:29 · Collected: 2026-03-30 06:00:46
