The mirage of visual understanding in current frontier models
Marcus on AI (Gary Marcus)
One Sentence Summary
This article highlights a Stanford study revealing that frontier AI models often exhibit "mirage reasoning," generating detailed visual analyses even without image input, questioning their true visual understanding.
Summary
Gary Marcus discusses a recent Stanford research paper that exposes significant flaws in the visual understanding of frontier LLMs. The study introduces the concept of "mirage reasoning," where models produce elaborate reasoning and clinical findings for images they haven't actually seen. Remarkably, some models topped benchmarks without any image access, suggesting that high scores may be due to data leakage or linguistic patterns rather than genuine vision. Marcus argues this reinforces the idea that current AI lacks true world understanding, meaning professions requiring precise visual comprehension and physical robotics remain safe from immediate AI displacement.
Main Points
* 1. Frontier models exhibit "mirage reasoning" by hallucinating details.
  Models can generate detailed reasoning traces and clinical findings for images that were never provided, indicating they rely on internal linguistic biases rather than actual visual input.
* 2. High benchmark scores do not necessarily equate to visual competence.
  Models achieved top ranks on medical and general multimodal benchmarks without access to images, suggesting these benchmarks may be flawed or susceptible to linguistic guessing.
* 3. Real-world applications requiring visual precision are not yet ready for AI.
  Fields like architecture, civil engineering, and humanoid robotics remain insulated from AI disruption because current models lack the reliable visual understanding necessary for these tasks.
Metadata
AI Score: 82
Website: garymarcus.substack.com
Published: Today
Length: 207 words (about 1 min)
From a damning new Stanford paper on the illusion of visual understanding in LLMs:
> “Frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images.”
AGI this stuff ain’t.
This study reinforces what Anh Totti Nguyen has been saying for a long time, in a series of underappreciated papers like Vision Language Models are Blind that I keep trying to draw attention to.
Also, re the very active discussion on AI and jobs: although _some_ white collar jobs (e.g., entry-level coder or market research assistant) may be in near-term jeopardy, many of those that require visual understanding (architect, cartographer, civil engineer, film editor, medical illustrator, urban planner, etc) probably aren’t vulnerable until entirely new techniques are developed.
And humanoid home robots? Don’t make me laugh. If your humanoid robot can’t understand the visual world, it’s just a demo, and not something you can trust.
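To make the benchmark critique concrete, here is a toy sketch (all data and label skew here are hypothetical, not drawn from the Stanford paper) of how a "blind" baseline that never sees an image can still post a high score on a yes/no medical QA set, purely by exploiting answer-distribution priors:

```python
# Hypothetical illustration: text-only priors inflating a "multimodal" score.
# We simulate a chest X-ray QA benchmark as yes/no items and a blind baseline
# that ignores both question and image, guessing only from label skew.
from collections import Counter

# Toy benchmark: (question, answer) pairs; answers skewed toward "no",
# as pathology labels often are in real medical QA datasets (an assumption).
benchmark = [
    ("Is there a pleural effusion?", "no"),
    ("Is the cardiac silhouette enlarged?", "no"),
    ("Is there evidence of pneumothorax?", "no"),
    ("Are the lungs clear?", "yes"),
    ("Is there a rib fracture?", "no"),
]

def blind_baseline(train_answers):
    """Build a 'model' that never looks at the question or the image,
    always returning the majority answer seen in training."""
    majority, _ = Counter(train_answers).most_common(1)[0]
    return lambda question, image=None: majority

model = blind_baseline([a for _, a in benchmark])
score = sum(model(q) == a for q, a in benchmark) / len(benchmark)
print(f"Blind accuracy: {score:.0%}")  # → Blind accuracy: 80%
```

A baseline like this sets the floor any genuinely visual model must clear; when frontier models score near their image-enabled performance with no image at all, it suggests they are operating more like this baseline than like a system that actually sees.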