Title: Qwen Team Acknowledges Community Tool-Calling Benchmark |...
URL Source: https://www.bestblogs.dev/status/2037092892876169556
Published Time: 2026-03-26 09:02:47
Markdown Content:
Qwen Team Acknowledges Community Tool-Calling Benchmark
### Qwen @Alibaba_Qwen
Big thanks to Steve for testing the entire Qwen3.5 family. Community feedback like this helps us get better. 🙏
#### steven.bnb
@stevibe · 20h ago
Which local models can actually handle tool calling?
I built a framework to find out.
15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking.
Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I included Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled too.
Only two models went all green: the 27B dense and the distilled 27B.
The 397B? Failed two tests. The 122B? Failed one. The 35B? Failed two.
The timed-out results, mostly on the smaller models, are cases where the model got stuck in a loop, repeating the same tool call until it hit the 30-second limit.
The test that exposed the most models: "Search for Iceland's population, then calculate 2% of it." Simple, but 35B, 122B, and 397B all used a rounded number from memory instead of the actual search result. They didn't trust their own tool output.
Small models hallucinate data.
Big models ignore data.
The 27B just threaded it through.
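
The two failure modes described above (repeating the same tool call, and computing from a remembered number instead of the tool result) can be illustrated with a small grading sketch. This is not stevibe's framework: the tool names `search` and `calculate`, the `ToolCall` shape, and the mocked population figure are all hypothetical, and the loop check counts consecutive identical calls rather than enforcing the 30-second wall-clock limit mentioned in the tweet.

```python
from dataclasses import dataclass

# Hypothetical mocked search result; the figure is illustrative, not real data.
MOCK_POPULATION = 391_810


@dataclass
class ToolCall:
    name: str   # hypothetical tool name, e.g. "search" or "calculate"
    args: dict  # arguments the model passed to the tool


def mock_search(query: str) -> dict:
    """Return a fixed response so every model sees the same tool output."""
    return {"query": query, "population": MOCK_POPULATION}


def is_looping(calls: list[ToolCall], max_repeats: int = 3) -> bool:
    """True if the same call (name and args) appears max_repeats times in a row."""
    run = 1
    for prev, cur in zip(calls, calls[1:]):
        same = (prev.name, prev.args) == (cur.name, cur.args)
        run = run + 1 if same else 1
        if run >= max_repeats:
            return True
    return False


def grade(calls: list[ToolCall]) -> str:
    """Classify a transcript of tool calls for the 'search, then 2%' scenario."""
    if is_looping(calls):
        return "loop"
    calc_calls = [c for c in calls if c.name == "calculate"]
    if not calc_calls:
        return "no_calculation"
    # Normalize digit separators before looking for the mocked value.
    expr = calc_calls[-1].args.get("expression", "")
    expr = expr.replace(",", "").replace("_", "")
    if str(MOCK_POPULATION) in expr:
        return "pass"
    return "ignored_tool_output"
```

Because the mocked value is deliberately not a round number, a model that substitutes a memorized figure such as `380000` produces an expression that does not contain the mocked value and is graded `ignored_tool_output`.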
Mar 26, 2026, 9:02 AM
7 Replies
1 Retweet
56 Likes
3,753 Views
One Sentence Summary
The Qwen team acknowledges a community-driven benchmark by stevibe that evaluates the tool-calling capabilities of the Qwen3.5 model family.
Summary
The Qwen official account acknowledges a detailed technical evaluation conducted by user stevibe. The quoted tweet benchmarks the Qwen3.5 model family (0.8B to 397B) on tool calling across 15 scenarios and 12 tools, using mocked responses at temperature 0. Only the 27B dense model and a distilled 27B variant passed every scenario; the 35B, 122B, and 397B models each failed one or two tests, and all three used a rounded number from memory instead of the search result in one scenario. The timed-out results, mostly on the smaller models, came from repeated tool-call loops. This interaction underscores the value of community-led model testing in identifying specific performance nuances.
AI Score
78
Influence Score 10
Published At Today
Language
English
Tags
Qwen3.5
ToolCalling
Benchmark
LLM
ModelEvaluation