Title: Insights on Composer 2 Evaluation and Model Benchmarking ...
URL Source: https://www.bestblogs.dev/status/2035784734538613113
Published Time: 2026-03-22 18:24:38
Markdown Content:

Some early evals of Composer 2 are coming in! These seem to match the results we published in our blog post. But benchmarks are an imperfect measure.
For example, even though these results show Composer 2 closer to GPT-5.4 and ahead of Opus 4.6, that isn't universally true. Even in my own taste testing, I prefer the writing style of Opus.
I also use Opus for general writing critiques outside of coding, so maybe I just have more familiarity with that model. But I think it's important not to view benchmarks as absolutes; they're just one thing to consider before testing yourself.
The eval results show the improvements from our continued pretraining and RL work on top of the base model. It'd be cool to see Kimi run with the Cursor harness as well, to make the comparison as close as possible (the diff there is larger than I'd expect; it probably needs some prompt tweaks for Kimi).
We'll be sharing more details on the ML work behind Composer 2 here shortly, in addition to the Terminal-Bench 2.0 and SWE-bench multilingual results we published in the announcement.
I think you need to spend some real time with a model to understand its behaviors and quirks. For example, when we're dogfooding Composer internally, we flag any weird behaviors (bad markdown formatting, overly verbose summaries, etc.) to the ML team so we can fix and penalize those behaviors in RL.
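As an aside, here's a minimal sketch of what folding those flags into RL as reward penalties could look like. Everything here is hypothetical for illustration: the `PENALTIES` weights, the detector heuristics, and `shaped_reward` are assumptions of mine, not Cursor's actual training pipeline.

```python
# Hypothetical sketch: penalize dogfooding-flagged behaviors in the RL reward.
# Weights, detectors, and names are illustrative assumptions, not Cursor's setup.

FENCE = "`" * 3  # markdown code-fence marker, built up to avoid a literal fence here

PENALTIES = {
    "bad_markdown": 0.5,     # e.g. an unclosed code fence in the response
    "verbose_summary": 0.3,  # a summary that blows past its length budget
}

def detect_bad_markdown(text: str) -> bool:
    # An odd number of fence markers means a code block was never closed.
    return text.count(FENCE) % 2 == 1

def detect_verbose_summary(summary: str, max_words: int = 120) -> bool:
    return len(summary.split()) > max_words

def shaped_reward(base_reward: float, response: str, summary: str) -> float:
    """Subtract penalties for flagged behaviors from the task reward."""
    reward = base_reward
    if detect_bad_markdown(response):
        reward -= PENALTIES["bad_markdown"]
    if detect_verbose_summary(summary):
        reward -= PENALTIES["verbose_summary"]
    return reward

# A response that solved the task (reward 1.0) but left a code fence open
# gets docked, nudging the policy away from that quirk over training.
print(shaped_reward(1.0, "Here's the fix:\n" + FENCE + "python\nx = 1\n", "Fixed the bug."))
# -> 0.5
```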
Appreciate the Next.js team making their evals¹ more robust after feedback and including new models like GPT-5.4, as well as the Roboflow² team for publishing their results!
As more evals come out, I'll try to thread them here to track where Composer is good, and where it's not so good and we should fix going forward. For example, I'd expect it not to be as strong on non-coding benches.
[1]: nextjs.org/evals