NEW Stanford & MIT paper on Model Harnesses. Changing the harness around a fixed LLM can produce a 6x performance gap on the same benchmark.
What if we automated harness engineering itself?
The work introduces Meta-Harness, an agentic system that searches over harness code, exposing the full optimization history to the proposer through a filesystem.
The proposer reads source code, execution traces, and scores from all prior candidates, referencing over 20 past attempts per step.
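The filesystem-as-memory idea can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the proposer here is a simple mutation rule standing in for an LLM, the benchmark is a dummy scoring function, and all names (`propose`, `evaluate`, `search`, `candidate_*.json`) are hypothetical. The key point it shows is that each step reads the *full* history of prior candidates from disk rather than a compressed score summary.

```python
import json
import random
import tempfile
from pathlib import Path

def propose(history):
    """Stand-in for the LLM proposer: mutate the best prior candidate.
    (In the paper, an LLM would read source code, traces, and scores.)"""
    if not history:
        return {"temperature": 0.7}
    best = max(history, key=lambda c: c["score"])
    delta = random.uniform(-0.1, 0.1)
    return {"temperature": round(best["config"]["temperature"] + delta, 3)}

def evaluate(config):
    """Stand-in benchmark: score peaks when temperature hits 1.0."""
    return 1.0 - abs(config["temperature"] - 1.0)

def search(root: Path, steps: int = 20):
    """Search loop: every candidate (config + score) is persisted to the
    filesystem, and the proposer re-reads all of them each step."""
    root.mkdir(parents=True, exist_ok=True)
    for step in range(steps):
        # Full history, not a compressed summary, is visible to the proposer.
        history = [json.loads(p.read_text())
                   for p in sorted(root.glob("candidate_*.json"))]
        config = propose(history)
        score = evaluate(config)
        record = {"step": step, "config": config, "score": score}
        (root / f"candidate_{step:03d}.json").write_text(json.dumps(record))
    history = [json.loads(p.read_text())
               for p in sorted(root.glob("candidate_*.json"))]
    return max(history, key=lambda c: c["score"])

best = search(Path(tempfile.mkdtemp()) / "harness_history")
```

Because the first candidate (temperature 0.7) always lands on disk, the best score returned is at least 0.7; the design choice worth noting is that the loop's only shared state is the directory of candidate files, which is what makes the history inspectable by any proposer.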
On text classification, it improves over SOTA context management by 7.7 points while using 4x fewer tokens.
On agentic coding, it outperforms all hand-engineered baselines on TerminalBench-2, scoring 37.6% versus Claude Code's 27.5%.
This is a big deal! Here is why:
The harness around a model often matters as much as the model itself.
Meta-Harness shows that giving an optimizer rich access to prior experience, not just compressed scores, unlocks automated engineering that beats human-designed scaffolding.
Paper: arxiv.org/abs/2603.28052
Learn to build effective AI agents in our academy: academy.dair.ai