AI Models Can't Read Basic High School Textbook Diagrams: CMU's DIAGRAMMA Benchmark Reveals a Key Weakness

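📅 2026-03-21 19:18 God of Prompt Artificial Intelligence 2 min 1604 words Score: 78
AI benchmarks DIAGRAMMA Multimodal AI Diagram understanding CMU research
📌 One-sentence summary: CMU's DIAGRAMMA benchmark shows that mainstream AI models, including GPT-4o, Claude, and Gemini, all failed at reading scientific diagrams, with the best model scoring only 59.64%. 📝 Detailed summary: This tweet reports a key finding from CMU's DIAGRAMMA benchmark. The test evaluated 17 AI models on 1,058 scientific diagram questions spanning math, computer science, chemistry, and other fields. The results expose a key weakness in current AI systems: even for the most advanced models, such as Claude-3.5-Sonnet (59.64%), GPT-4o (57%), and Gemini-1.5-Pro (44%), performance on diagram understanding tasks is not…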

Is it possible that the AI models passing PhD-level exams can't read a basic diagram from a high school textbook? GPT-4o. Claude. Gemini. All of them tested.

These are the same models your doctors, lawyers, and engineers are using right now.

> CMU built a benchmark called DIAGRAMMA: 1,058 questions about scientific diagrams covering math, computer science, chemistry, and more. Questions that any high school or college student would handle without breaking a sweat.

> They tested 17 models. Every single one failed. The best score in the world, from Claude-3.5-Sonnet, was 59.64%. That means the most advanced AI on the planet gets about 4 in 10 diagram questions wrong.

> GPT-4o: 57%. Gemini-1.5-Pro: 44%. The open-source models most companies are quietly deploying: 35-42%.

> These aren't trick questions. They're asking models to count elements in a graph, identify angles in a triangle, read a project timeline. Things that require actually seeing and reasoning about what's in the image.

> Computer science diagrams (graphs, flowcharts, network structures) were the hardest category across every model. The thing AI supposedly excels at. Graphs. It can't read graphs.

→ Best score: Claude-3.5-Sonnet at 59.64%

→ GPT-4o: 57.28% on the same test

→ Smallest open-source models: 35-38%

→ Random guessing baseline: 25% (4 multiple choice options)

→ Cost to generate 100,000 training diagrams to fix this: under $400
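
The figures above can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the numbers quoted in the post (1,058 questions, 4 answer options, the reported accuracies, and the $400 ceiling for 100,000 diagrams). The function names are hypothetical, and the chance-adjusted score is a common normalization added here for illustration, not something the post reports.

```python
# Numbers quoted in the post; all function names below are illustrative.
TOTAL_QUESTIONS = 1058
NUM_CHOICES = 4
REPORTED_ACCURACY_PCT = {"Claude-3.5-Sonnet": 59.64, "GPT-4o": 57.28}
SYNTHETIC_DIAGRAMS = 100_000
COST_CEILING_USD = 400.0  # the post says "under $400", so this is an upper bound


def chance_baseline_pct(num_choices: int) -> float:
    """Expected accuracy from uniform random guessing."""
    return 100.0 / num_choices


def approx_correct(accuracy_pct: float, total: int) -> int:
    """Approximate number of correct answers implied by a percentage."""
    return round(accuracy_pct / 100.0 * total)


def chance_adjusted_pct(accuracy_pct: float, baseline_pct: float) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 100.

    Assumption: this normalization is not part of the post.
    """
    return (accuracy_pct - baseline_pct) / (100.0 - baseline_pct) * 100.0


def cost_per_diagram_usd(total_cost: float, count: int) -> float:
    """Upper bound on the cost of one synthetic diagram."""
    return total_cost / count


if __name__ == "__main__":
    baseline = chance_baseline_pct(NUM_CHOICES)
    for model, acc in REPORTED_ACCURACY_PCT.items():
        right = approx_correct(acc, TOTAL_QUESTIONS)
        wrong = TOTAL_QUESTIONS - right
        adjusted = chance_adjusted_pct(acc, baseline)
        print(f"{model}: ~{right} right, ~{wrong} wrong, "
              f"{adjusted:.1f}% above chance (rescaled)")
    unit_cost = cost_per_diagram_usd(COST_CEILING_USD, SYNTHETIC_DIAGRAMS)
    print(f"Cost per synthetic diagram: under ${unit_cost:.3f}")
```

Under these assumptions, the top score corresponds to roughly 631 of 1,058 questions answered correctly, and the synthetic training diagrams cost less than half a cent each.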

The fix CMU built, 100,000 synthetic training diagrams generated for less than $400, suggests the problem is solvable. The fact that nobody solved it before now suggests nobody was looking.

Published: 2026-03-21 19:18:48 Archived: 2026-03-22 00:00:28
