
Impact of Infrastructure Noise on Agentic Coding Evaluations

📅 2026-03-11 04:07 · Thariq · Artificial Intelligence · 3 min read · 3,503 characters · Score: 82
Tags: Agentic Coding · AI Benchmarks · Anthropic · LLM Evaluation · Infrastructure Noise

Title: Impact of Infrastructure Noise on Agentic Coding Evaluations

URL Source: https://www.bestblogs.dev/status/2031462032508334434

Published Time: 2026-03-10 20:07:45


Impact of Infrastructure Noise on Agentic Coding Evaluations
============================================================

### Thariq

@trq212

really good post on why agentic coding evals are so noisy and might vary by several percentage points between runs


#### Anthropic

@AnthropicAI

New on the Engineering Blog: Quantifying infrastructure noise in agentic coding evals.

Infrastructure configuration can swing agentic coding benchmarks by several percentage points—sometimes more than the leaderboard gap between top models.

Read more: anthropic.com/engineering/in…


Mar 10, 2026, 8:07 PM

23 Replies · 9 Retweets · 189 Likes · 25.2K Views

#### One Sentence Summary

Thariq highlights an Anthropic engineering study revealing how infrastructure configuration causes significant noise in agentic coding benchmarks.

#### Summary

This tweet draws attention to a critical technical post from Anthropic Engineering regarding the reliability of AI coding benchmarks. It explains that 'infrastructure noise' (the specific configuration of the environment where agents run) can cause evaluation scores to fluctuate by several percentage points. Crucially, this variance is often larger than the actual performance gaps between top-tier models on leaderboards, suggesting that current rankings may be less stable than they appear.
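The core claim (run-to-run noise can exceed the true gap between top models) is easy to see with a small simulation. The sketch below is purely illustrative and is not Anthropic's methodology: the model names, "true" scores, and the Gaussian noise level are assumptions chosen to match the article's "several percentage points" framing.

```python
import random

# Hypothetical setup: two models whose true benchmark scores differ by
# 1 point, each evaluated under per-run infrastructure noise (assumed
# Gaussian with a ~2-point standard deviation).
TRUE_SCORES = {"model_a": 72.0, "model_b": 71.0}  # 1-point true gap
NOISE_SD = 2.0  # assumed run-to-run infrastructure noise

def run_eval(true_score: float, rng: random.Random) -> float:
    """One benchmark run: the model's true score plus infrastructure noise."""
    return true_score + rng.gauss(0, NOISE_SD)

rng = random.Random(0)
runs = 1000
# Count how often a single noisy run ranks the truly weaker model first.
flips = sum(
    run_eval(TRUE_SCORES["model_b"], rng) > run_eval(TRUE_SCORES["model_a"], rng)
    for _ in range(runs)
)
print(f"model_b appears to beat model_a in {flips}/{runs} single-run comparisons")
```

Under these assumed numbers, the weaker model wins a large fraction of single-run comparisons, which is why a leaderboard built from one run per model can reorder between runs.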

#### Article Info

AI Score: 82
Influence Score: 38
Published At: 2026-03-11 04:07
Language: English
Tags: Agentic Coding, AI Benchmarks, Anthropic, LLM Evaluation, Infrastructure Noise


View original → Published: 2026-03-11 04:07:45 · Indexed: 2026-03-11 08:00:47
