Title: Qwen Team Acknowledges Community Tool-Calling Benchmark |...

URL Source: https://www.bestblogs.dev/status/2037092892876169556

Published Time: 2026-03-26 09:02:47

Markdown Content:


Qwen Team Acknowledges Community Tool-Calling Benchmark

### Qwen

@Alibaba_Qwen

Big thanks to Steve for testing the entire Qwen3.5 family. Community feedback like this helps us get better. 🙏


#### steven.bnb

@stevibe · 20h ago

Which local models can actually handle tool calling?

I built a framework to find out.

15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking.

Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I included Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled too.

Only two models went all green: the 27B dense and the distilled 27B.

The 397B? Failed two tests. The 122B? Failed one. The 35B? Failed two.

The timed-out results, mostly on the smaller models, are cases where the model got stuck in a loop, repeating the same tool call until it hit the 30-second limit.
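To illustrate the timeout behavior described above, here is a minimal sketch of a harness loop with a wall-clock limit. It is not the author's framework: `run_with_limits`, `next_call`, `max_repeats`, and the injectable `clock` are hypothetical names, and the early exit on repeated identical calls is an added assumption (the post only mentions the 30-second limit).

```python
import json
import time
from typing import Callable, Optional, Tuple

ToolCall = Tuple[str, dict]  # (tool name, arguments); hypothetical shape


def run_with_limits(
    next_call: Callable[[], Optional[ToolCall]],
    time_limit_s: float = 30.0,   # the limit mentioned in the post
    max_repeats: int = 3,         # assumption: not stated in the post
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Drive a model until it finishes; return 'done', 'loop', or 'timeout'.

    `next_call` stands in for one model step: it returns the next tool
    call, or None when the model produces a final answer.
    """
    start = clock()
    last_key = None
    repeats = 0
    while True:
        # Stop once the wall-clock budget is exhausted.
        if clock() - start > time_limit_s:
            return "timeout"
        call = next_call()
        if call is None:
            return "done"
        name, args = call
        # Canonicalize arguments so key order does not hide a repeat.
        key = (name, json.dumps(args, sort_keys=True))
        repeats = repeats + 1 if key == last_key else 1
        last_key = key
        # Flag consecutive identical calls instead of waiting for the timeout.
        if repeats >= max_repeats:
            return "loop"
```

Reporting "loop" separately from "timeout" would make it easier to tell a stuck model from a merely slow one, although the original results appear to group both under timeouts.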

The test that exposed the most models: "Search for Iceland's population, then calculate 2% of it." Simple, but 35B, 122B, and 397B all used a rounded number from memory instead of the actual search result. They didn't trust their own tool output.
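A check for this failure mode could look like the sketch below. It assumes the mocked search tool returns a distinctive population figure and that the harness can inspect the expression the model passes to a calculator tool. `MOCK_POPULATION`, `mock_search`, `numbers_in`, and `threads_search_result` are hypothetical names, and the population value is an arbitrary mock, not a real statistic and not taken from the original framework.

```python
import re
from typing import List

# Arbitrary mock value chosen so it is unlikely to match a figure
# the model remembers; not a real statistic.
MOCK_POPULATION = 372_899


def mock_search(query: str) -> str:
    """Mocked search tool: always returns the same canned result."""
    return f"Population of Iceland: {MOCK_POPULATION}"


def numbers_in(text: str) -> List[float]:
    """Extract numeric literals, ignoring thousands separators.

    Simplification: commas are stripped outright, so comma-separated
    argument lists are not handled.
    """
    cleaned = text.replace(",", "").replace("_", "")
    return [float(m) for m in re.findall(r"\d+(?:\.\d+)?", cleaned)]


def threads_search_result(calc_expression: str) -> bool:
    """True if the calculator input contains the exact mocked figure."""
    return float(MOCK_POPULATION) in numbers_in(calc_expression)
```

Under these assumptions, `threads_search_result("372899 * 0.02")` passes, while an expression built from a rounded, remembered figure such as `"380000 * 0.02"` fails. That is the distinction the post draws between threading the tool output through and ignoring it.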

Small models hallucinate data.

Big models ignore data.

The 27B just threaded it through.


Mar 26, 2026, 9:02 AM

7 Replies

1 Retweet

56 Likes

3,753 Views

One Sentence Summary

The Qwen team acknowledges a community-driven benchmark by stevibe that evaluates the tool-calling capabilities of the Qwen3.5 model family.

Summary

The official Qwen account acknowledges a detailed technical evaluation conducted by user stevibe. The quoted tweet benchmarks the Qwen3.5 model family (from 0.8B to 397B) on tool calling, using 15 scenarios, 12 tools, mocked responses, and temperature 0. Only the two 27B models (the dense model and a reasoning-distilled variant) passed every scenario; the 35B, 122B, and 397B models each failed one or two, and on one test all three substituted a rounded figure from memory for the actual search result. Smaller models mostly timed out after repeating the same tool call. The exchange underscores the value of community-led testing in surfacing specific performance issues.

AI Score

Influence Score 10

Published At Today

Language

English
