Accurately Recognizing "Kingdom-Phylum-Class-Order-Family-Genus-Species"! Peking University's Yuxin Peng Team Uses a Fine-Grained Tree Prior to Improve Generalization and Tackle Hierarchical Biological Category Recognition
量子位 (QbitAI) @衡宇
One Sentence Summary
Peking University Professor Yuxin Peng's team proposes the TARA method, which successfully injects category tree knowledge into the model by aligning multimodal large model representations with biological foundation models, solving logical consistency and novel class generalization challenges in hierarchical biological recognition.
Summary
This article introduces the latest research achievements of Professor Yuxin Peng's team at Peking University in the field of fine-grained multimodal large models—TARA (Taxonomy-aware Representation Alignment). Addressing challenges faced by current multimodal large models in biological recognition, such as lacking hierarchical knowledge of taxonomic ranks (kingdom, phylum, class, order, family, genus, species), difficulty ensuring cross-level prediction consistency, and poor generalization for novel classes, TARA proposes an innovative alignment framework. This method aligns the intermediate-layer visual representations of large models with biological foundation models (such as BioCLIP) while simultaneously performing free-granularity category text representation alignment, enabling the model to extract visual features with complete category tree structures. Experimental results show that this method not only significantly improves fine-grained species recognition accuracy but also ensures prediction results conform to the parent-child node logic of biological taxonomy, demonstrating strong robustness when handling rare or unseen species. The paper has been accepted by CVPR 2026 and open-sourced.
Main Points
* 1. Hierarchical visual recognition requires the model to not only identify the final species but also accurately predict the complete biological category tree hierarchy. Unlike traditional recognition, hierarchical recognition needs to cover all levels from kingdom to species. Due to the lack of category tree knowledge, existing models often make logical errors where prediction results at adjacent levels do not satisfy parent-child node relationships.
* 2. TARA injects structured knowledge from discriminative biological foundation models into generative large models through dual representation alignment. Through hierarchical visual representation alignment and free-granularity category representation alignment, the large model is encouraged to retain taxonomic information during feature extraction and can map features to the correct level names according to instructions.
* 3. Introducing a category tree prior effectively improves the model's generalization ability to recognize unseen novel categories or rare species. By learning the commonalities of known subcategories to summarize discriminative features of parent categories, the model can reliably determine high-level classification labels for new species not yet formally described by the scientific community.
* 4. The strategy of reinforcement fine-tuning without thinking, combined with alternating TARA optimization, achieves efficient training and inference. During training, the model is adapted to hierarchical recognition instructions through alternating optimization; during inference, no biological foundation model is required, and the optimized large model directly outputs results, ensuring efficiency.
Metadata
AI Score
85
Website qbitai.com
Published At Today
Length 2737 words (about 11 min)
> Contributed by the MIPL team | 量子位 QbitAI (WeChat official account: QbitAI)
Shown a picture of a Blue Dacnis, you can recognize it as a "bird", but can you recognize it as "Aves - Passeriformes - Thraupidae - Dacnis - Blue Dacnis"?

Like most people, today's multimodal large models cannot.

Objects in the real world typically carry extremely rich category hierarchies that form a category tree. For example, the Blue Dacnis is: Animalia - Chordata - Aves - Passeriformes - Thraupidae - Dacnis - Blue Dacnis (kingdom - phylum - class - order - family - genus - species).

Unlike traditional fine-grained visual recognition, hierarchical visual recognition aims to predict all category levels an object belongs to, not only the final fine-grained category. Although existing generative large models such as Finedefics and Fine-R1 perform well on fine-grained visual recognition, they lack category tree knowledge and therefore cannot recognize every level accurately from coarse to fine.

Meanwhile, discriminative large models trained with contrastive learning on hierarchical category labels (such as BioCLIP, BioCLIP2, and BioCAP) have representation spaces that already fully encode the inter-class and intra-class relationships in the category tree. Building on this observation, this work uses the representations of a discriminative large model to guide the learning of a generative large model, offering a new path for multimodal large models to learn the category tree.

This is the latest research from Professor Yuxin Peng's team at Peking University in the field of fine-grained multimodal large models. The paper has been accepted by CVPR 2026 and the code is open-sourced.
Although existing multimodal large models have markedly improved fine-grained recognition accuracy, on hierarchical visual recognition tasks that depend on category tree knowledge they still cannot recognize every level accurately from coarse to fine. Concretely, three challenges remain:
1. Poor same-level discriminability: at coarser levels, large intra-class variance dominates and the model tends to learn category commonalities; at finer levels, small inter-class variance dominates and the model tends to learn category differences. This tension makes it hard for the model to distinguish similar categories at every level from coarse to fine.
2. Poor cross-level consistency: lacking category tree knowledge, the model cannot guarantee that predicted categories at adjacent levels satisfy parent-child relationships. For example, a prediction of "Psittaciformes - Thraupidae" violates the parent-child relationship, since Thraupidae belongs to Passeriformes.
3. Poor novel-class generalization: existing models tend to mine the differences among fine-grained subcategories while neglecting to summarize their commonalities (the discriminative features for recognizing their parent node), making it hard to accurately recognize never-before-seen novel categories.
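The cross-level consistency requirement above can be made concrete with a small check against a taxonomy tree. The fragment below is a toy illustration (the tiny `PARENT` table and function name are assumptions, not the paper's code):

```python
# Toy sketch: validating cross-level consistency of a coarse-to-fine
# prediction against a small slice of the bird taxonomy.

# child -> parent edges (illustrative fragment, not the paper's dataset)
PARENT = {
    "Blue Dacnis": "Dacnis",        # species -> genus
    "Dacnis": "Thraupidae",         # genus -> family
    "Thraupidae": "Passeriformes",  # family -> order
    "Passeriformes": "Aves",        # order -> class
    "Psittaciformes": "Aves",       # another order under Aves
}

def is_consistent(prediction):
    """True iff every adjacent (coarse, fine) pair in the coarse-to-fine
    prediction satisfies the parent-child relation of the tree."""
    for coarse, fine in zip(prediction, prediction[1:]):
        if PARENT.get(fine) != coarse:
            return False
    return True

# A logically consistent path from class down to species:
ok = is_consistent(["Aves", "Passeriformes", "Thraupidae", "Dacnis", "Blue Dacnis"])
# The error pattern from the text: Thraupidae predicted under Psittaciformes.
bad = is_consistent(["Aves", "Psittaciformes", "Thraupidae"])
```

A model with poor cross-level consistency produces paths like the second one, which no valid traversal of the category tree can generate.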
To address these problems, Professor Yuxin Peng's team at Peking University proposes Taxonomy-Aware Representation Alignment (TARA), which injects category tree structure knowledge into multimodal large models. By aligning the visual representations of the large model with those of a biological foundation model, TARA encourages the large model to extract visual representations that carry the complete category tree structure. At the same time, by aligning the representation of the first token of the large model's output answer with the ground-truth category representation encoded by the biological foundation model, TARA encourages the large model to map these tree-structured visual representations to the category name at the level specified in the instruction.

Experiments show that the method not only strengthens the fine-grained visual recognition ability of existing large models, improving accuracy on the final fine-grained category, but also strengthens hierarchical visual recognition, improving accuracy at every level of the category tree from coarse to fine.
To inject category tree structure knowledge into multimodal large models, this work proposes the taxonomy-aware representation alignment method TARA. As shown in Figure 2, TARA consists of two main components:
1. Hierarchical visual representation alignment: aligns the intermediate-layer visual representations of the large model with the last-layer visual representations of the biological foundation model, encouraging the large model to extract visual representations that carry the complete category tree structure.
2. Free-granularity category representation alignment: aligns the representation of the first token of the large model's output answer with the ground-truth category representation encoded by the biological foundation model, encouraging the large model to map these tree-structured visual representations to the category name at the specified level.
The details are as follows:
1. Hierarchical visual representation alignment.
Biological foundation models trained with hierarchical category labels (e.g., BioCLIP, BioCLIP2, BioCAP) provide supervision signals that carry taxonomic information, encouraging the large model to extract visual representations with the complete category tree structure. Concretely, given an input image I and a question q asking for the category at a specified level (e.g., "Which phylum/class/order/family/genus/species does the animal in the image belong to? Choose from the following options: [ground-truth category, similar category 1, similar category 2, similar category 3]"), the visual encoder E_V(·) of the biological foundation model outputs the target visual features e_img = E_V(I) ∈ R^(N×d), where d is the feature dimension of the biological foundation model. The visual representations at layer ℓ of the large language model are denoted e^ℓ_img ∈ R^(N×D); a learnable projection layer P_V(·) maps them into the visual feature space of the biological foundation model, and the following alignment loss is minimized:
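The original loss was rendered as an image; a minimal sketch of such an alignment term, assuming a mean-squared-error form over the N visual tokens (the paper's exact formulation may differ), is:

```latex
\mathcal{L}_{V} \;=\; \frac{1}{N}\,\bigl\lVert P_V\bigl(\mathbf{e}^{\ell}_{\mathrm{img}}\bigr) - \mathbf{e}_{\mathrm{img}} \bigr\rVert_F^{2}
```

Here e^ℓ_img and e_img follow the definitions above, and the Frobenius norm sums squared differences over all token positions and feature dimensions.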
2. Free-granularity category representation alignment.
A single image corresponds to category labels at multiple levels simultaneously, but the level a user wants recognized varies. For example, an expert may want the object identified at the species level as an Acadian Flycatcher, while an ordinary user only needs it identified at the class level as a bird. By aligning the category text representations of the biological foundation model and the large model at the same level, the large model is encouraged to map its tree-structured visual representations to the category name at the corresponding level. Concretely, the text encoder E_T(·) of the biological foundation model outputs the target text feature y_label = E_T(C) ∈ R^d, where C is the ground-truth category name at the desired level. The answer representation sequence at layer m of the large language model is denoted e^m_answer ∈ R^(N'×D); a learnable projection layer P_T(·) maps the representation of the answer's first token into the text feature space of the biological foundation model, and the following alignment loss is minimized:
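As with the visual term, the original equation was an image; a plausible sketch, assuming the same mean-squared-error form applied to the answer's first token (an assumption, not the paper's exact formula), is:

```latex
\mathcal{L}_{T} \;=\; \bigl\lVert P_T\bigl(\mathbf{e}^{m}_{\mathrm{answer},1}\bigr) - \mathbf{y}_{\mathrm{label}} \bigr\rVert_2^{2}
```

where e^m_{answer,1} denotes the first token of the answer representation sequence defined above.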
Finally, the TARA alignment loss is defined as the mean of the two:
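The two alignment terms and their mean can be sketched numerically. The toy NumPy code below assumes the mean-squared-error form discussed above, with fixed random matrices standing in for the learnable projections P_V and P_T and random vectors standing in for real features:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, d = 4, 8, 6   # toy sizes: N visual tokens, LLM dim D, bio-model dim d

# Stand-ins for the learnable projection layers P_V and P_T.
P_V = rng.normal(size=(D, d))
P_T = rng.normal(size=(D, d))

e_img_llm = rng.normal(size=(N, D))    # LLM layer-l visual representations
e_img_bio = rng.normal(size=(N, d))    # bio foundation model visual features
e_ans_llm = rng.normal(size=(D,))      # first answer-token representation
y_label_bio = rng.normal(size=(d,))    # encoded ground-truth category name

def l2_align(pred, target):
    """Mean squared error between projected and target features."""
    return float(np.mean((pred - target) ** 2))

loss_v = l2_align(e_img_llm @ P_V, e_img_bio)    # hierarchical visual alignment
loss_t = l2_align(e_ans_llm @ P_T, y_label_bio)  # free-granularity category alignment
loss_tara = 0.5 * (loss_v + loss_t)              # TARA loss: mean of the two terms
```

In training, gradients of this loss would flow into the large model and the projection layers; the random stand-ins here only illustrate the shapes and the averaging.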
3. Model training and inference.
During training, no-thinking reinforcement fine-tuning (No-Thinking RFT) and TARA alternately optimize the large model and the projection layers P_V(·) and P_T(·), adapting the large model to hierarchical visual recognition instructions while it learns category tree knowledge. During inference, neither the biological foundation model nor the projection layers P_V(·) and P_T(·) are involved; the optimized large model performs recognition directly, ensuring efficiency.
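The alternating schedule can be sketched schematically. The stub functions below are assumptions that merely record which objective runs; in the real method each would perform a full optimization step (RFT updating the large model, TARA updating the large model and the projections P_V and P_T):

```python
# Schematic sketch of the alternating training schedule described above.

def rft_step(state):
    # No-Thinking RFT update on the large model (stub).
    return state + ["rft"]

def tara_step(state):
    # TARA representation-alignment update on the large model
    # and the projection layers P_V, P_T (stub).
    return state + ["tara"]

def train(num_rounds):
    state = []
    for _ in range(num_rounds):   # alternate the two objectives
        state = rft_step(state)
        state = tara_step(state)
    return state

schedule = train(2)   # -> ["rft", "tara", "rft", "tara"]
```

At inference time neither `tara_step` nor the biological foundation model is invoked; only the fine-tuned large model runs.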
Table 1 shows hierarchical visual recognition results on iNaturalist-Plant and iNaturalist-Animal. The method not only strengthens the fine-grained recognition ability of multiple large models, improving accuracy on the final fine-grained category, but also strengthens hierarchical recognition, improving accuracy at every level of the category tree from coarse to fine.

Table 2 shows hierarchical visual recognition results on novel categories in TerraIncognita (categories outside the existing category tree). These novel categories are not only unseen in the reinforcement fine-tuning training set; they are images of rare or scarcely recorded species with few or no available samples in public data, which could not have appeared in the model's pre-training data either.

Many of these samples are likely new species not yet formally described by the scientific community, for which only higher-level taxonomic labels (such as order and family) can currently be reliably determined. By introducing the category tree prior, the method encourages the model to learn the commonalities of subcategories and thereby summarize the discriminative features for recognizing parent categories, improving recognition accuracy on novel categories outside the known category tree.

The case studies in Figure 3 show that, compared with Alibaba's Qwen3-VL-2B, the method improves both same-level discriminability and cross-level consistency: it distinguishes similar categories at the same level while ensuring that predicted categories at adjacent levels satisfy parent-child relationships.
To address the problem that existing multimodal large models lack category tree knowledge and cannot recognize every level accurately from coarse to fine, this work proposes the taxonomy-aware representation alignment method TARA. By aligning the intermediate representations of the large model with those of a biological foundation model, TARA injects category tree structure knowledge, improving not only accuracy on the final fine-grained category but also the model's hierarchical visual recognition ability, raising accuracy at every level of the category tree from coarse to fine.

Paper title: Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
Paper link: https://arxiv.org/abs/2603.00431
Code: https://github.com/PKU-ICST-MIPL/TARA_CVPR2026
Lab website: https://www.wict.pku.edu.cn/mipl
Key Quotes
* Hierarchical visual recognition aims to predict all category levels, not just the final fine-grained category.
* Using representation guidance from discriminative large models to direct generative large model learning provides a new path for multimodal large models to learn category tree knowledge.
* This method introduces a category tree prior to encourage the model to learn commonalities of subcategories, thereby summarizing discriminative features for identifying parent categories.
* Compared with Alibaba's Qwen3-VL-2B large model, this method improves same-level discriminability and cross-level consistency, distinguishing similar categories at the same level while ensuring predictions at adjacent levels satisfy parent-child node relationships.
Tags
Multimodal Large Models
Hierarchical Visual Recognition
Biological Taxonomy
Representation Alignment
CVPR 2026