Qwen Team Acknowledges Community Tool-Calling Benchmark

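📅 2026-03-26 17:02 · Qwen · Artificial Intelligence · 4 min read · 4014 characters · Score: 78
Qwen3.5 · Tool Calling · Benchmark · LLM · Model Evaluation

📌 One-Sentence Summary: The Qwen team acknowledged a community-driven benchmark by `stevibe` that evaluates the tool-calling capabilities of the Qwen3.5 model family.

📝 Detailed Summary: The official Qwen account acknowledged a detailed technical evaluation by user `stevibe`. The quoted tweet rigorously benchmarks the Qwen3.5 model family (0.8B to 397B) on tool-calling tasks. The results show the 27B model outperforming larger variants in specific tool-use scenarios, while the larger models fell short on data adherence, using figures from memory instead of the values their tools returned. The exchange highlights the value of community-led testing for surfacing specific performance nuances.

📊 Article Info: AI Score: 78 · Source: Qwen (@Alibaba_Qwen)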

Title: Qwen Team Acknowledges Community Tool-Calling Benchmark |...

URL Source: https://www.bestblogs.dev/status/2037092892876169556

Published Time: 2026-03-26 09:02:47

Qwen Team Acknowledges Community Tool-Calling Benchmark

### Qwen

@Alibaba_Qwen

Big thanks to Steve for testing the entire Qwen3.5 family. Community feedback like this helps us get better. 🙏

#### steven.bnb

@stevibe · 20h ago

Which local models can actually handle tool calling?

I built a framework to find out.

15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking.
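A mocked-response setup like the one described could be sketched as follows. This is only an illustration, not the author's framework: the `MockToolbox` class, its fields, and the error payload are hypothetical names introduced here. The idea is that each tool returns a fixed, canned response and every call is recorded, so a run at temperature 0 can be graded deterministically.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class MockToolbox:
    """Hypothetical helper: serves canned tool responses and records each call."""

    responses: Dict[str, Any]
    calls: List[Tuple[str, Dict[str, Any]]] = field(default_factory=list)

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        # Record the call first so grading can inspect the full sequence.
        self.calls.append((name, arguments))
        if name not in self.responses:
            # Report unknown tools as data instead of raising, so the
            # transcript shows how the model reacts to the error.
            return {"error": f"unknown tool: {name}"}
        return self.responses[name]
```

Because the responses never change between runs, any difference in outcome comes from the model rather than from the tools.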

Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I included Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled too.

Only two models went all green: the 27B dense and the distilled 27B.

The 397B? Failed two tests. The 122B? Failed one. The 35B? Failed two.

The timed-out results, mostly on the smaller models, are cases where the model got stuck in a loop, repeating the same tool call until it hit the 30-second limit.
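A minimal sketch of how a harness might classify such stuck runs, assuming a hypothetical `next_tool_call` callable that returns the model's next tool call as a dict, or `None` once the model is finished. The 30-second limit comes from the post; the early `max_repeats` cutoff is an added assumption, since the post describes models simply running until the time limit.

```python
import json
import time
from typing import Any, Callable, Dict, Optional


def run_with_limits(
    next_tool_call: Callable[[], Optional[Dict[str, Any]]],
    time_limit_s: float = 30.0,
    max_repeats: int = 3,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Drive a tool-calling loop and return 'done', 'loop', or 'timeout'."""
    deadline = clock() + time_limit_s
    last_key = None
    repeats = 0
    while clock() < deadline:
        call = next_tool_call()
        if call is None:
            return "done"
        # Serialize with sorted keys so identical calls compare equal.
        key = json.dumps(call, sort_keys=True)
        repeats = repeats + 1 if key == last_key else 1
        last_key = key
        if repeats >= max_repeats:
            # Same call issued back to back: treat it as a stuck loop.
            return "loop"
    return "timeout"
```

Separating `loop` from `timeout` makes it easier to tell a model that repeats itself from one that is merely slow. The `clock` parameter exists so the time limit can be exercised in tests without waiting.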

The test that exposed the most models: "Search for Iceland's population, then calculate 2% of it." Simple, but 35B, 122B, and 397B all used a rounded number from memory instead of the actual search result. They didn't trust their own tool output.

Small models hallucinate data.

Big models ignore data.

The 27B just threaded it through.
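The "search, then calculate" failure described in the post can be checked mechanically. The sketch below assumes the mocked search tool returns a deliberately non-round placeholder number, and that the harness can see the expression the model passes to its calculator tool. `MOCK_POPULATION` and `used_tool_value` are hypothetical names, and the value is not a real population figure.

```python
import re

# Placeholder value for a mocked search result. It is deliberately NOT a real
# population figure and deliberately not round, so a remembered or rounded
# number cannot match it by accident.
MOCK_POPULATION = 391_234


def used_tool_value(calc_expression: str, tool_value: int) -> bool:
    """Return True if the exact tool-returned integer appears in the expression."""
    # Drop thousands separators such as "391,234" or "391_234".
    normalized = re.sub(r"(?<=\d)[,_](?=\d{3})", "", calc_expression)
    # Match the value only as a whole number, not as part of a longer one.
    pattern = rf"(?<![\d.]){tool_value}(?!\d)"
    return re.search(pattern, normalized) is not None
```

With this check, a calculator call of `"391,234 * 0.02"` passes, while `"390000 * 0.02"` fails because the model substituted a rounded figure for the tool output.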

Mar 26, 2026, 9:02 AM

7 Replies

1 Retweet

56 Likes

3,753 Views

Influence Score 10

Language

English

Tags

Qwen3.5

ToolCalling

Benchmark

LLM

ModelEvaluation

Published: 2026-03-26 17:02:47 · Indexed: 2026-03-26 18:00:21
