Zhejiang University Team Cracks Multimodal Models' "Blind Confidence": Calibrate Confidence First, Then Allocate Compute | CVPR'26

量子位 (QbitAI) @听雨

One Sentence Summary

Zhejiang University and collaborators propose the CA-TTS framework, which uses confidence-driven reinforcement learning to calibrate a multimodal model's self-assessment and then turns the calibrated confidence into a resource allocation signal at inference time, achieving state-of-the-art results on four visual reasoning benchmarks.

Summary

This article covers work accepted at CVPR 2026 by a research team from Zhejiang University in collaboration with Alibaba, City University of Hong Kong, and the University of Michigan. The research addresses the 'perceptual bluntness' problem in large multimodal models, where a model keeps giving high-confidence answers even after the input image has degraded beyond recognition. The team proposes CDRL (Confidence-Driven Reinforcement Learning), a training method that uses a dual-reward mechanism to keep the model sensitive to visual degradation and honest about its own judgments. At inference time, the CA-TTS framework turns confidence into a resource allocation signal through three modules: confidence-weighted voting, self-reflection, and visual self-checking. Experiments show accuracy on Math-Vision rising from 23.0% to 42.4%, MMMU reaching 66.3%, and a test-time scaling efficiency 2.2× that of Majority Voting. The team also proposes the Perceive-then-Reason paradigm, a shift from 'reason then perceive' to 'perceive then reason'.

Main Points

* 1. Large multimodal models suffer from 'perceptual bluntness': confidence stays high even as visual information degrades

The research defines perceptual bluntness as a model's lack of sensitivity to changes in visual evidence, similar to a person who answers blindly without being able to read the question.

* 2. CDRL aligns confidence with visual evidence through a dual-reward mechanism

The perceptual sensitivity reward encourages the model to notice visual degradation, while the calibration consistency reward pushes the model to evaluate itself honestly; after training, the confidence drop on noisy images is 4.3× larger than before.

* 3. CA-TTS turns confidence into a resource allocation signal at inference time

Its three modules (confidence-weighted voting, self-reflection, and visual self-checking) form a multi-stage verification loop, so that compute is spent more effectively on problems with genuine uncertainty.

* 4. The Perceive-then-Reason paradigm redefines the foundation of reasoning

Shifting from 'reason then perceive' to 'perceive then reason', it makes sure the model has truly 'seen' the image before reasoning, so that complex reasoning chains are not built on unreliable perception.

Metadata

Website qbitai.com

Published At Today

Length 3281 words (about 14 min)

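Just how stubborn are large multimodal models?

A research team from Zhejiang University, together with Alibaba, City University of Hong Kong, and the University of Michigan, ran a very direct experiment:

They progressively added noise to the input image, from clear to nearly unrecognizable, while continuously monitoring the model's accuracy and confidence.

The result: accuracy fell off a cliff, but confidence barely moved. In other words, even when the image could no longer be made out, the model still answered with high confidence.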

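This kind of "blind confidence" is a major source of hallucination and misjudgment when large multimodal models perform complex visual reasoning. To address it, the team proposes the CA-TTS (Confidence-Aware Test-Time Scaling) framework: first calibrate the model's self-assessment through confidence-driven reinforcement learning, then turn the calibrated confidence into a resource allocation signal at inference time.

The effect is just as direct: on four mainstream visual reasoning benchmarks, CA-TTS reaches SOTA across the board, beating the previous best methods by 8.8% on average. On Math-Vision, accuracy rises from a baseline of 23.0% to 42.4%. The paper has been accepted to CVPR 2026.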

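Darwin said it long ago: ignorance begets confidence more often than knowledge does

The starting point of this work is a long-neglected question: does the model actually know what it does not know?

The team defines the phenomenon above as "Perceptual Bluntness": the model is insensitive to changes in the quality of visual information, so its confidence stays high even after the visual evidence has clearly degraded. In human terms, it is like someone who cannot read the question yet states an answer with complete conviction.

To measure this more stably in multimodal settings, the team did not reuse the token-level calibration common for text models. Instead, they define confidence as the mean negative log-probability over the entire output sequence (NMLP), which gives a response-level confidence measure. Built on this measure, the method has two stages: confidence calibration during training, and confidence-aware scaling during inference.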
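
As a rough illustration of a response-level measure, the sketch below averages per-token log-probabilities over a whole response. The function names, the `exp(-NMLP)` mapping to a score in (0, 1], and the example log-probabilities are assumptions made for illustration; the article does not spell out the paper's exact normalization.

```python
import math

def nmlp(token_logprobs):
    """Mean negative log-probability over all generated tokens (lower = more confident)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)

def confidence_score(token_logprobs):
    """Assumed mapping to (0, 1]: exp(-NMLP), i.e. the geometric mean of token probabilities."""
    return math.exp(-nmlp(token_logprobs))

# Hypothetical log-probabilities for two four-token responses.
sure = [math.log(0.9)] * 4
unsure = [math.log(0.5)] * 4
print(confidence_score(sure), confidence_score(unsure))
```

Because the score is averaged over the sequence, a long response is not penalized simply for containing more tokens, which is one plausible reason to prefer a response-level measure over a raw sequence probability.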

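Step 1: CDRL realigns visual perception with confidence

The core module of the training stage is CDRL (Confidence-Driven Reinforcement Learning). Its goal is not simply to raise answer accuracy, but to make the model report confidence that matches the visual evidence in both the "can see clearly" and "cannot see clearly" cases.

Concretely, the model processes both the original image and a noised image for the same question, and reinforcement learning optimizes a dual-reward mechanism:

Perceptual sensitivity reward: encourages a reasonable confidence gap between the original image and the noisy image. The larger the gap, the better the model perceives visual degradation.

Calibration consistency reward: gives a positive reward when the prediction is correct and confidence is high, and applies a penalty when the prediction is wrong but confidence is still high.

Together, the two rewards push the model to learn two things: stay sensitive to visual degradation, and stay honest about its own judgments.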
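
The two rewards can be sketched roughly as follows. The functional forms, the `margin`, `alpha`, and `beta` parameters, and the choice to score calibration on the clean-image confidence are all assumptions for illustration; the article describes the rewards only qualitatively.

```python
def sensitivity_reward(conf_clean, conf_noisy, margin=0.0):
    """Reward a confidence drop from the clean image to the noised one (assumed form, clipped at 0)."""
    return max(0.0, conf_clean - conf_noisy - margin)

def calibration_reward(conf, is_correct):
    """Assumed form: +conf when the answer is correct, -conf when it is wrong."""
    return conf if is_correct else -conf

def cdrl_reward(conf_clean, conf_noisy, is_correct, alpha=1.0, beta=1.0):
    """Weighted sum of the two terms; alpha and beta are hypothetical weights."""
    return (alpha * sensitivity_reward(conf_clean, conf_noisy)
            + beta * calibration_reward(conf_clean, is_correct))

# A "blunt" rollout: wrong answer, same high confidence with or without noise.
blunt = cdrl_reward(conf_clean=0.9, conf_noisy=0.9, is_correct=False)
# A well-behaved rollout: correct answer, confidence falls once the image is noised.
aligned = cdrl_reward(conf_clean=0.9, conf_noisy=0.4, is_correct=True)
print(blunt, aligned)
```

Under these assumed forms, the blunt rollout is penalized on both counts, while the aligned rollout earns reward from both terms.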

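For training data, the team selected 1,936 high-quality samples from 6 public benchmarks and used CLIP attention maps to locate key visual regions. This produces more targeted perturbations, with noise concentrated on the local regions that actually affect reasoning.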
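
The idea of attention-guided perturbation can be sketched in plain Python on a tiny grid. The attention values here are a hypothetical stand-in for a CLIP attention map (extracting one from a real model is not shown), and the top-fraction threshold and Gaussian noise are assumptions rather than the paper's procedure.

```python
import random

def top_region_mask(attention, keep_fraction=0.25):
    """Mark the highest-attention cells as the region to perturb (ties may add extra cells)."""
    flat = sorted((v for row in attention for v in row), reverse=True)
    k = max(1, int(len(flat) * keep_fraction))
    threshold = flat[k - 1]
    return [[v >= threshold for v in row] for row in attention]

def add_targeted_noise(image, mask, sigma=0.2, seed=0):
    """Add Gaussian noise only where the mask is True; pixel values are clamped to [0, 1]."""
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) if hit else px
         for px, hit in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

# Hypothetical 3x3 attention map and a flat grey image.
attention = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.8], [0.1, 0.7, 0.2]]
image = [[0.5] * 3 for _ in range(3)]
mask = top_region_mask(attention, keep_fraction=0.34)
noisy = add_targeted_noise(image, mask)
```

Pixels outside the mask are left untouched, so any change in the model's confidence can be attributed to the regions the attention map marked as relevant.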

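The results show that CDRL does more than just "lower confidence": confidence finally lines up with the visual evidence. On noisy images, the confidence drop of the trained model is 4.3 times that of the untrained model; under occlusion, the ratio reaches 4.7 times.

More notably, before training the model's confidence actually rose under viewpoint changes and mosaic interference. After CDRL training, confidence drops significantly under every visual perturbation, and the ECE and AUC metrics improve in step.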
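
For readers unfamiliar with ECE (expected calibration error), the sketch below shows the standard binned definition: group predictions by confidence, then take the sample-weighted average of the gap between accuracy and mean confidence in each bin. The bin count and the toy inputs are assumptions; the article does not say which ECE configuration the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(accuracy - avg_conf)
    return ece

# Toy data: an overconfident model versus a calibrated one.
overconfident = expected_calibration_error([0.95] * 4, [True, False, False, False])
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(overconfident, calibrated)
```

A model that reports 95% confidence while being right a quarter of the time has a large ECE; one that reports 75% and is right three times out of four has an ECE of zero on this toy data.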

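Step 2: CA-TTS turns calibrated confidence into an inference signal

With more trustworthy confidence in hand, the team further proposes CA-TTS, which turns "how sure the model is of itself" into a scheduling signal at inference time. It consists of three cooperating modules, with an expert model dynamically deciding when to step in:

Self-Consistency: replaces simple majority voting with confidence-weighted voting. After the model generates multiple candidate answers, they are first aggregated by internal confidence, and then an expert model is brought in as an external calibrator to evaluate the candidates a second time.

Self-Reflection: when the confidence of the preliminary result is insufficient, the expert model acts as a Critic and generates a critique that guides the base model to reason again, so that it does not keep agreeing with itself along the same wrong path.

Self-Check: further verifies the answer at the visual level. Using contrastive decoding, it compares the output probability distributions under the original image and the noisy image; if the answer truly depends on visual evidence, its support should drop under the noisy image.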
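
The difference between majority voting and confidence-weighted voting, and the intuition behind the visual self-check, can be sketched as below. The sample answers and confidences are made up, the expert model's second-pass evaluation is not modelled, and `visual_support` is a simplified log-probability difference rather than the paper's contrastive decoding procedure.

```python
import math
from collections import defaultdict

def majority_vote(candidates):
    """Pick the answer that appears most often, ignoring confidence."""
    counts = defaultdict(int)
    for answer, _ in candidates:
        counts[answer] += 1
    return max(counts, key=counts.get)

def confidence_weighted_vote(candidates):
    """Pick the answer with the largest summed confidence."""
    weights = defaultdict(float)
    for answer, conf in candidates:
        weights[answer] += conf
    return max(weights, key=weights.get)

def visual_support(p_clean, p_noisy):
    """Simplified self-check: log-probability the answer loses once the image is noised."""
    return math.log(p_clean) - math.log(p_noisy)

# Hypothetical (answer, confidence) samples: three hesitant votes, two confident ones.
samples = [("12", 0.35), ("12", 0.30), ("12", 0.30), ("15", 0.95), ("15", 0.90)]
print(majority_vote(samples), confidence_weighted_vote(samples))
```

On this toy data majority voting returns "12" while the weighted vote returns "15". That only helps if confidence is trustworthy, which is why calibration comes first in the pipeline. A positive `visual_support` value means the answer lost probability when the image was degraded, the behaviour expected of an answer that really depends on the image.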

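Unlike the common Tree-of-Thoughts approach, the key to CA-TTS is not just "thinking a few more steps" but a multi-stage verification loop: even if an earlier stage produces a wrong candidate, later modules still have a chance to correct it. The paper's "how many bricks are missing from the wall" case illustrates this. Tree-of-Thoughts fails at its final single-point evaluation, while CA-TTS corrects course in three steps (weighted voting, reflection, and visual self-check) and recovers the right answer.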
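
A verification loop of this kind might be wired together as in the sketch below. Everything here is hypothetical scaffolding: `sample_fn`, `critique_fn`, and `check_fn` stand in for the base model, the expert Critic, and the visual self-check, and the threshold and round limit are invented; the article does not describe the paper's actual control flow at this level of detail.

```python
from collections import defaultdict

def weighted_vote(candidates):
    """Return (answer, share of total confidence) for the best-supported answer.

    Assumes at least one candidate has positive confidence.
    """
    weights = defaultdict(float)
    for answer, conf in candidates:
        weights[answer] += conf
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def verify_loop(sample_fn, critique_fn, check_fn, n=4, threshold=0.6, max_rounds=2):
    """Vote, then reflect again while support is low or the visual check fails."""
    critique = None
    answer = None
    for _ in range(max_rounds + 1):
        candidates = [sample_fn(critique) for _ in range(n)]
        answer, support = weighted_vote(candidates)
        if support >= threshold and check_fn(answer):
            return answer
        critique = critique_fn(answer)  # feed the criticism into the next round
    return answer

# Toy stand-ins: the "model" only finds the right answer after being criticized.
def toy_sampler(critique):
    return ("9", 0.9) if critique else ("7", 0.5)

def toy_critic(answer):
    return f"re-count the bricks; '{answer}' looks unsupported"

def toy_check(answer):
    return answer == "9"

result = verify_loop(toy_sampler, toy_critic, toy_check)
print(result)
```

In the toy run the first round agrees unanimously on a wrong answer, the visual check rejects it, and the critique steers the second round to the answer that passes. A later stage can therefore overturn an earlier one, which a single final evaluation cannot do.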

Experimental results: leading on all four benchmarks

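CA-TTS was evaluated on four mainstream visual reasoning benchmarks. Note that the base model is Qwen2.5-VL-7B throughout, so the gains come mainly from the method itself rather than from differences in the base model.

A few numbers stand out. On Math-Vision, CA-TTS lifts accuracy from the 23.0% baseline to 42.4%, nearly doubling it; on MMMU it reaches 66.3%, 17.5 percentage points above the baseline. This suggests the gain is not isolated to one benchmark but is consistent across different types of visual reasoning tasks.

Ablation experiments further reveal how CDRL and CA-TTS divide the work: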

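CDRL alone yields a 3.4-point gain, showing that confidence calibration has independent value. CA-TTS alone yields a 15.0-point gain, showing that the inference framework by itself significantly improves decision quality. Combined, the total gain reaches 19.4 points, indicating that CDRL gives CA-TTS a more reliable policy to build on and that the two are clearly synergistic.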
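
The size of the synergy can be read off the three reported gains with simple arithmetic, using only the numbers quoted above:

```latex
\[
\underbrace{19.4}_{\text{CDRL + CA-TTS}}
- \bigl(\underbrace{3.4}_{\text{CDRL only}} + \underbrace{15.0}_{\text{CA-TTS only}}\bigr)
= 1.0 \ \text{percentage points}
\]
```

So the combination gains about one point beyond the sum of the individual gains. The combined figure also equals the Math-Vision improvement (42.4 − 23.0 = 19.4), which suggests the ablation is reported on Math-Vision, although the article does not say so explicitly.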

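The team also tested how much the method depends on the expert model. Even with Qwen2.5-VL-7B itself acting as the "expert", performance is still nearly 5 percentage points above plain Majority Voting (32.57% vs. 27.65%). In other words, a strong expert model does amplify the gains, but the framework does not stand only by leaning on a stronger model.

Test-Time Scaling: the widening slope is the more important result

If SOTA on four benchmarks shows the method is "more accurate", the test-time scaling curve shows why it is "better value".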

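On Math-Vision, the team compared how accuracy grows for different methods as the number of samples increases from 1 to 32. CA-TTS has a scaling slope of β = 3.65, versus 1.64 for Majority Voting and 1.19 for DeepConf. That is, CA-TTS scales 2.2 times and 3.1 times as efficiently as the latter two, respectively.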
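
The quoted efficiency ratios follow directly from the three slopes. The sketch below recomputes them and shows one way such a slope could be fitted. Treating β as the least-squares slope of accuracy against log2 of the sample count is an assumption (the article does not define the fit), and the accuracy series is synthetic, constructed only to exercise the function.

```python
import math

def scaling_slope(sample_counts, accuracies):
    """Least-squares slope of accuracy against log2(number of samples)."""
    xs = [math.log2(n) for n in sample_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accuracies) / len(accuracies)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Slopes as reported in the article.
beta = {"CA-TTS": 3.65, "Majority Voting": 1.64, "DeepConf": 1.19}
ratios = {name: round(beta["CA-TTS"] / b, 1)
          for name, b in beta.items() if name != "CA-TTS"}
print(ratios)

# Synthetic curve (not data from the paper): +3.65 points per doubling of samples.
counts = [1, 2, 4, 8, 16, 32]
synthetic = [27.0 + 3.65 * k for k in range(len(counts))]
fitted = scaling_slope(counts, synthetic)
```

Under the log2 assumption, a slope of 3.65 means each doubling of the sampling budget buys about 3.65 accuracy points, against roughly 1.6 and 1.2 for the two baselines.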

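This means that with more samples, CA-TTS is not just "rolling the dice more often"; it directs the extra compute more effectively at the genuinely uncertain problems. While Majority Voting and DeepConf gradually saturate at around 35%, CA-TTS keeps climbing and eventually exceeds 45%.

Seen this way, confidence calibration is not a side optimization; it redefines the efficiency ceiling of test-time scaling. It gives "compute a bit more" a sense of direction for the first time.

From "reason, then perceive" to "perceive, then reason"

The most noteworthy part of this work may not be yet another higher benchmark score, but the new ordering of the problem that it proposes.

Multimodal reasoning research has so far assumed that the model already makes full use of visual information, so the only remaining task is to strengthen reasoning. This paper is a reminder that a model may not have truly "understood" the image at all and can still give a highly confident answer. If that assumption is not corrected, even the most elaborate reasoning chain may rest on unreliable perception.

CA-TTS goes the other way: first use CDRL to build confidence that is sensitive to changes in visual evidence and consistent with accuracy, then let that confidence guide the allocation of reasoning resources. This is an explicit Perceive-then-Reason paradigm, a shift from "reason, then perceive" to "perceive, then reason".

This direction has its costs, of course. Multiple samples and expert-model calls add inference cost, and the current experiments focus mainly on mathematical reasoning and general VQA tasks. But if the goal is for large multimodal models in high-stakes settings to truly "know when not to be too confident", this line of work offers a convincing starting point.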

Paper title: Linking Perception, Confidence and Accuracy in MLLMs

Authors: Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu#, Qiang Zhu#

Affiliations: Zhejiang University, Alibaba Group, City University of Hong Kong, University of Michigan

Venue: CVPR 2026

Project link: https://github.com/anotherbricki/CA-TTS

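About the authors: the first author is Yuetian Du (杜越天), a PhD student at Zhejiang University whose research focuses on confidence calibration and test-time scaling for large multimodal models, advised by Professor Qiang Zhu (朱强). The work was completed under the guidance of Professor Qiang Zhu and Dr. Jie Liu (刘洁).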

Key Quotes

* Accuracy plummets, but confidence barely moves. In other words, when the image is already unclear, the model still gives high-confidence answers.
* When Majority Voting and DeepConf gradually saturate around 35%, CA-TTS continues to climb and ultimately breaks through 45%.
* Previously, multimodal reasoning research assumed that the model is already fully utilizing visual information, and that the next step is simply to strengthen reasoning capabilities. However, this paper reminds us that a model may not truly 'see' the image, yet still give a highly confident answer.
* Confidence calibration is not a peripheral optimization item, but is redefining the efficiency ceiling of test-time scaling. For the first time, it gives 'computing more' a real sense of direction.
* It proposes a new problem ordering: shifting from 'reason then perceive' to 'perceive then reason'.
