📅 2026-03-29 22:32 · Gary Marcus · Artificial Intelligence · 9 min read · Score: 82

The mirage of visual understanding in current frontier models

Marcus on AI · @Gary Marcus

One Sentence Summary

This article highlights a Stanford study revealing that frontier AI models often exhibit "mirage reasoning," generating detailed visual analyses even without image input, questioning their true visual understanding.

Summary

Gary Marcus discusses a recent Stanford research paper that exposes significant flaws in the visual understanding of frontier LLMs. The study introduces the concept of "mirage reasoning," where models produce elaborate reasoning and clinical findings for images they haven't actually seen. Remarkably, some models topped benchmarks without any image access, suggesting that high scores may be due to data leakage or linguistic patterns rather than genuine vision. Marcus argues this reinforces the idea that current AI lacks true world understanding, meaning professions requiring precise visual comprehension and physical robotics remain safe from immediate AI displacement.

Main Points

1. Frontier models exhibit "mirage reasoning" by hallucinating details. Models can generate detailed reasoning traces and clinical findings for images that were never provided, indicating they rely on internal linguistic biases rather than actual visual input.
2. High benchmark scores do not necessarily equate to visual competence. Models achieved top ranks on medical and general multimodal benchmarks without access to images, suggesting these benchmarks may be flawed or susceptible to linguistic guessing.
3. Real-world applications requiring visual precision are not yet ready for AI. Fields like architecture, civil engineering, and humanoid robotics remain insulated from AI disruption because current models lack the reliable visual understanding necessary for these tasks.

Metadata

AI Score

82

Website garymarcus.substack.com

Published At 2026-03-29

Length 207 words (about 1 min)


From a damning new Stanford paper on the illusion of visual understanding in LLMs:

> “Frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided, we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images.”
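The ablation the paper describes — rerunning a multimodal benchmark with the images withheld — can be sketched as a tiny harness. This is a minimal illustration, not the paper's actual evaluation code; `answer` and the toy `BENCHMARK` items are hypothetical stand-ins, with the model stubbed as a priors-only guesser to show how a "blind" run can still score well:

```python
# Sketch of a "blind baseline" ablation: run a multimodal benchmark
# twice, once with images and once with the image field stripped.
# If accuracy barely drops, the benchmark is answerable from text
# alone, and a high score says little about visual understanding.

from typing import Optional

# Toy benchmark items: (question, image path, gold answer).
BENCHMARK = [
    ("Is there a pleural effusion?", "xray_001.png", "no"),
    ("Is the cardiac silhouette enlarged?", "xray_002.png", "no"),
    ("Is there a rib fracture?", "xray_003.png", "no"),
    ("Is there a pneumothorax?", "xray_004.png", "yes"),
]

def answer(question: str, image: Optional[str]) -> str:
    # Stand-in for a real model client: this one ignores the image
    # entirely and exploits the label prior ("no" is the majority
    # answer in many clinical QA sets).
    return "no"

def accuracy(use_images: bool) -> float:
    correct = 0
    for question, image, gold in BENCHMARK:
        pred = answer(question, image if use_images else None)
        correct += pred == gold
    return correct / len(BENCHMARK)

with_images = accuracy(use_images=True)
without_images = accuracy(use_images=False)
print(f"with images:    {with_images:.2f}")
print(f"without images: {without_images:.2f}")
# A near-zero gap between the two runs flags "mirage" performance:
# the score does not depend on seeing the image at all.
```

The same two-pass comparison works against any real model API: only the `answer` function changes, and a benchmark whose blind run matches its full run is measuring text priors, not vision.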

AGI this stuff ain’t.

This study reinforces what Anh Totti Nguyen has been saying for a long time, in a series of underappreciated papers like Vision Language Models are Blind that I keep trying to draw attention to.

Also, re the very active discussion on AI and jobs: although _some_ white collar jobs (e.g., entry-level coder or market research assistant) may be in near-term jeopardy, many of those that require visual understanding (architect, cartographer, civil engineer, film editor, medical illustrator, urban planner, etc) probably aren’t vulnerable until entirely new techniques are developed.

And humanoid home robots? Don’t make me laugh. If your humanoid robot can’t understand the visual world, it’s just a demo, and not something you can trust.



Tags

Visual Understanding

Multimodal LLMs

Mirage Reasoning

AI Benchmarks

Computer Vision



View original → Published: 2026-03-29 22:32:36 · Indexed: 2026-03-30 02:00:43
