BREAKING: Wisconsin and MIT just showed that every AI coding benchmark is measuring the wrong thing.
> Pass rates stay high. The code becomes unmaintainable. They tested 11 models, including Claude Opus 4.6 and GPT 5.4, on iterative tasks. No model strictly solved more than 17.2% of problems end-to-end. Verbosity rose in 89.8% of trajectories. They named the failure mode "slop" and built a benchmark around it.
> Every existing coding benchmark evaluates a single shot against a complete spec. Write the code, pass the tests, done. But real software gets extended. Requirements change. Features get added. The agent has to build on its own prior code, and nobody was measuring what happens when it does.
> SlopCodeBench forces exactly that. 20 problems, 93 checkpoints. At each step the agent receives an updated spec and must extend its own prior solution. No gold-standard code handed between turns. No fresh start. Bad early architectural decisions compound into every checkpoint that follows.
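The loop SlopCodeBench enforces can be sketched roughly like this. All names below are illustrative assumptions, not the benchmark's actual implementation:

```python
# Minimal sketch of an iterative-checkpoint harness (hypothetical names,
# not the SlopCodeBench code).

def run_problem(agent, checkpoints):
    """Run one problem; the agent must extend its own prior workspace."""
    workspace = ""  # the agent's code, carried forward verbatim
    passed = []
    for spec, tests in checkpoints:
        # The agent sees the updated spec plus its OWN prior code --
        # no gold-standard solution is swapped in between turns.
        workspace = agent(spec, workspace)
        passed.append(all(t(workspace) for t in tests))
    # "Strict solve" = every checkpoint in the chain passes.
    return passed
```

A bad decision at checkpoint 1 lives inside `workspace` for every later call, which is exactly the regime single-shot benchmarks never exercise.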
> The concrete example makes it real. Opus 4.6 on circuit_eval: main() starts at 84 lines with cyclomatic complexity 29. By checkpoint 8 it has grown to 1,099 lines and complexity 285. Nine command branches, each repeating the same argument-parsing scaffold that should have been extracted once. The code kept passing tests the entire time. Pass-rate benchmarks saw nothing.
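A hypothetical reconstruction of that anti-pattern (not the model's actual output) next to the extraction it never made:

```python
# Slop version: every command branch repeats the same parsing scaffold.
def main_sloppy(argv):
    if argv[0] == "eval":
        if len(argv) < 2:
            raise SystemExit("usage: eval <file>")
        return ("eval", argv[1])
    elif argv[0] == "trace":
        if len(argv) < 2:
            raise SystemExit("usage: trace <file>")
        return ("trace", argv[1])
    # ...seven more branches, each duplicating the scaffold above

# The extraction that should have happened once: parse, then dispatch.
COMMANDS = {"eval", "trace"}  # ...plus the other seven

def main_clean(argv):
    cmd, *args = argv
    if cmd not in COMMANDS or not args:
        raise SystemExit(f"usage: {cmd} <file>")
    return (cmd, args[0])
```

Every duplicated branch contributes its own conditionals, which is how complexity can climb from 29 to 285 while every test keeps passing.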
→ Erosion rises in 80% of trajectories across all 11 models
→ Verbosity rises in 89.8% of trajectories
→ Agent erosion: 0.68. Human-maintained repositories: 0.31
→ Agent verbosity: 0.32. Human repositories: 0.11; even scikit-learn and scipy stay well below agent averages
→ Cost grows 2.9x from first to last checkpoint. The extra spending doesn't improve correctness.
→ Highest strict solve rate across all 11 models: 17.2%
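The post doesn't give the formal metric definitions, so treat this as an illustrative proxy only (my assumption, not the paper's metric): verbosity read as the excess fraction of agent code relative to a reference solution with the same behavior.

```python
def verbosity_proxy(agent_loc: int, reference_loc: int) -> float:
    """Fraction of the agent's code in excess of a reference solution.

    Illustrative proxy only -- not the paper's actual metric.
    """
    return max(0.0, (agent_loc - reference_loc) / agent_loc)

# Under this proxy, a score of 0.32 means roughly a third of the agent's
# code is excess: e.g. 100 agent lines against a 68-line reference.
```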
> They tested "anti-slop" prompting that explicitly forbids verbose patterns, defensive over-engineering, and unnecessary abstractions. It cut initial verbosity by 34.5% on GPT 5.4. The degradation slope stayed identical. Prompts shift the starting point. They don't change the rate of decay.
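"Shift the starting point, not the rate of decay" is an intercept-versus-slope claim. A toy fit over synthetic numbers (not the paper's data) makes the distinction concrete:

```python
# Two synthetic verbosity trajectories: same decay rate, lower start.
checkpoints = [1, 2, 3, 4, 5]
baseline = [0.29 + 0.05 * c for c in checkpoints]   # normal prompt
anti_slop = [0.19 + 0.05 * c for c in checkpoints]  # anti-slop prompt

def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# The anti-slop run starts lower, but both degrade at the same rate:
assert abs(slope(checkpoints, baseline) - slope(checkpoints, anti_slop)) < 1e-9
```

A prompt that only lowers the intercept buys a one-time head start; every extra checkpoint erodes it at the unchanged slope.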
> The root cause is architectural. Agents optimize locally. Hardcoding logic at checkpoint 1 passes tests and costs less than building extensible abstractions. The benchmark penalizes that at checkpoint 3, when the spec changes, but by then the damage is already in the workspace. Complexity concentrates. Verbosity accumulates. Tests keep passing.
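A toy version of that local optimum (hypothetical task, not taken from the benchmark): at checkpoint 1 the spec only requires AND gates, so hardcoding is the cheapest code that passes.

```python
# Checkpoint 1: the spec only has AND gates. Hardcoding passes the tests.
def eval_gate_v1(gate, a, b):
    return a and b  # ignores `gate` entirely -- locally optimal

# Checkpoint 3: the spec adds OR and XOR. The extensible structure the
# agent skipped at checkpoint 1 now has to be retrofitted:
GATES = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "xor": lambda a, b: a != b,
}

def eval_gate(gate, a, b):
    return GATES[gate](a, b)
```

The hardcoded version is correct and cheaper at checkpoint 1; only the later spec change reveals its cost, which is why pass rates alone never see it.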
> Pass-rate benchmarks miss this entirely because test suites cannot see structural decay.
The field has been measuring whether AI can write code. The real question is whether it can build software.