📅 2026-03-29 22:32 · Gary Marcus · Artificial Intelligence · 9 min read · Score: 82

The mirage of visual understanding in current frontier models

Marcus on AI · @Gary Marcus

One Sentence Summary

This article highlights a Stanford study revealing that frontier AI models often exhibit "mirage reasoning," generating detailed visual analyses even without image input, questioning their true visual understanding.

Summary

Gary Marcus discusses a recent Stanford research paper that exposes significant flaws in the visual understanding of frontier LLMs. The study introduces the concept of "mirage reasoning," where models produce elaborate reasoning and clinical findings for images they haven't actually seen. Remarkably, some models topped benchmarks without any image access, suggesting that high scores may be due to data leakage or linguistic patterns rather than genuine vision. Marcus argues this reinforces the idea that current AI lacks true world understanding, meaning professions requiring precise visual comprehension and physical robotics remain safe from immediate AI displacement.

Main Points

1. Frontier models exhibit "mirage reasoning" by hallucinating details. Models can generate detailed reasoning traces and clinical findings for images that were never provided, indicating they rely on internal linguistic biases rather than actual visual input.
2. High benchmark scores do not necessarily equate to visual competence. Models achieved top ranks on medical and general multimodal benchmarks without access to images, suggesting these benchmarks may be flawed or susceptible to linguistic guessing.
3. Real-world applications requiring visual precision are not yet ready for AI. Fields like architecture, civil engineering, and humanoid robotics remain insulated from AI disruption because current models lack the reliable visual understanding necessary for these tasks.

Metadata

AI Score

82

Website garymarcus.substack.com

Published At 2026-03-29

Length 207 words (about 1 min)


From a damning new Stanford paper on the illusion of visual understanding in LLMs:

> “Frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided, we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images.”
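The ablation the paper describes — rerunning a multimodal benchmark with the images withheld — can be sketched as a tiny harness. This is a minimal illustration, not the paper's actual evaluation code; `answer` and the toy `BENCHMARK` items are hypothetical stand-ins, with the model stubbed as a priors-only guesser to show how a "blind" run can still score well:

```python
# Sketch of a "blind baseline" ablation: run a multimodal benchmark
# twice, once with images and once with the image field stripped.
# If accuracy barely drops, the benchmark is answerable from text
# alone, and a high score says little about visual understanding.

from typing import Optional

# Toy benchmark items: (question, image path, gold answer).
BENCHMARK = [
    ("Is there a pleural effusion?", "xray_001.png", "no"),
    ("Is the cardiac silhouette enlarged?", "xray_002.png", "no"),
    ("Is there a rib fracture?", "xray_003.png", "no"),
    ("Is there a pneumothorax?", "xray_004.png", "yes"),
]

def answer(question: str, image: Optional[str]) -> str:
    # Stand-in for a real model client: this one ignores the image
    # entirely and exploits the label prior ("no" is the majority
    # answer in many clinical QA sets).
    return "no"

def accuracy(use_images: bool) -> float:
    correct = 0
    for question, image, gold in BENCHMARK:
        pred = answer(question, image if use_images else None)
        correct += pred == gold
    return correct / len(BENCHMARK)

with_images = accuracy(use_images=True)
without_images = accuracy(use_images=False)
print(f"with images:    {with_images:.2f}")
print(f"without images: {without_images:.2f}")
# A near-zero gap between the two runs flags "mirage" performance:
# the score does not depend on seeing the image at all.
```

The same two-pass comparison works against any real model API: only the `answer` function changes, and a benchmark whose blind run matches its full run is measuring text priors, not vision.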

AGI this stuff ain’t.

This study reinforces what Anh Totti Nguyen has been saying for a long time, in a series of underappreciated papers like Vision Language Models are Blind that I keep trying to draw attention to.

Also, re the very active discussion on AI and jobs: although _some_ white collar jobs (e.g., entry-level coder or market research assistant) may be in near-term jeopardy, many of those that require visual understanding (architect, cartographer, civil engineer, film editor, medical illustrator, urban planner, etc) probably aren’t vulnerable until entirely new techniques are developed.

And humanoid home robots? Don’t make me laugh. If your humanoid robot can’t understand the visual world, it’s just a demo, and not something you can trust.



Tags

Visual Understanding

Multimodal LLMs

Mirage Reasoning

AI Benchmarks

Computer Vision



View original → Published: 2026-03-29 22:32:36 · Indexed: 2026-03-30 02:00:43
