Letting the Lobster Read the Screen! Google's New Multimodal Model Puts Text, Images, Video, and Audio in One Space ==============================
量子位 (QbitAI) @henry
One Sentence Summary
Google has released Gemini Embedding 2, its first native multimodal embedding model, achieving deep alignment of text, images, video, audio, and documents within a unified semantic vector space.
Summary
This article introduces Google's newly released Gemini Embedding 2 model. Its core breakthrough lies in mapping multiple modalities—text, images, video, audio, and documents—into a single embedding space, enabling cross-modal semantic alignment. Regarding input capabilities, it supports long text, multiple images, long videos, and native audio embeddings, along with multimodal hybrid inputs. Technically, the model continues to utilize Matryoshka Representation Learning (MRL), allowing developers to dynamically adjust vector dimensions between 3072 and 768 to balance cost and performance. This advancement provides a unified 'sensory bus' for AI Agents (such as OpenClaw) to understand complex UI interfaces and real-world environments, serving as critical infrastructure for next-generation multimodal applications and embodied AI.
Main Points
* 1. Gemini Embedding 2 achieves true cross-modal semantic alignment.
By mapping data from different media into a unified mathematical vector space, textual concepts and their corresponding visual or auditory signals are brought extremely close in mathematical distance, significantly simplifying multimodal RAG and search workflows.
* 2. The model provides a 'sensory bus' for AI Agents to understand the real world.
Traditional agents rely on text labels, whereas multimodal embeddings allow agents to directly understand UI layouts, icons, and video environments, establishing the semantic foundation for agents to operate computers on behalf of humans.
* 3. Utilizing MRL technology to achieve dynamic dimension reduction of embedding vectors.
This technology compresses core features into the beginning of the vector, allowing developers to choose between 768 and 3072 dimensions based on compute and storage budgets, achieving a flexible balance between performance and cost.
Metadata
AI Score
76
Website qbitai.com
Published At Today
Length 1658 words (about 7 min)
> henry from Aofeisi | QbitAI (WeChat official account: QbitAI)
Native, launch!
Google has just released its first natively multimodal embedding model: Gemini Embedding 2.
The biggest change this time: text, images, video, audio, and documents are all mapped into one unified embedding space.
In other words, for the first time, data from different media share the same semantic coordinate system.
In terms of input, Gemini Embedding 2 supports a range of data types:
* Text: up to 8,192 tokens
* Images: up to 6 images per request, in PNG or JPEG
* Video: up to 120 seconds, in MP4 or MOV
* Audio: embedded natively, with no intermediate text transcription
* Documents: PDFs of up to 6 pages can be embedded directly
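These limits are easy to enforce client-side before sending a request. A minimal sketch: the `LIMITS` table mirrors the figures listed above, but the function and its field names are illustrative, not part of any SDK.

```python
# Hypothetical pre-flight check of the documented input limits.
# The limit values come from the list above; everything else is invented.
LIMITS = {
    "text_tokens": 8192,    # max tokens of text
    "images": 6,            # max images per request
    "video_seconds": 120,   # max video length
    "pdf_pages": 6,         # max PDF pages
}

def check_request(**parts):
    """Return the names of any inputs that exceed the documented limits."""
    return [name for name, value in parts.items()
            if value > LIMITS.get(name, float("inf"))]
```

For example, `check_request(images=7, video_seconds=300)` flags both oversized inputs, while a request within the limits returns an empty list.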
Beyond single modalities, the model also accepts mixed multimodal input (for example, image + text).
This means it can capture complex semantic relationships across media, and thus understand real-world information more accurately.
On evaluations, Gemini Embedding 2 not only improves on its predecessor overall, but also sets a new performance bar for multimodal embedding tasks.
It strengthens speech processing on one hand, and on the other surpasses the leading existing models on text, image, and video tasks, achieving SOTA.
At first glance, this may look like just a low-level data-engineering upgrade.
In fact, it lays a key foundation for AI Agents like the lobster to truly "see" the world.
Gemini Embedding 2 is now in public preview via the Gemini API and Vertex AI.
An embedding model, at its core, converts data into dense vector representations.
In this vector space, semantically similar data ends up close together, while dissimilar data sits farther apart.
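"Close" and "far" here are usually measured with cosine similarity. A toy sketch, with made-up 3-dimensional vectors standing in for real model outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Fabricated low-dimensional vectors; a real embedding has up to 3072 dimensions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]
```

Here `cosine_similarity(cat, kitten)` is far higher than `cosine_similarity(cat, invoice)`, which is exactly the "similar data sits closer" property the model is trained for.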
Traditional embedding models were built mainly for text.
For example, in Google's earlier paper, Gemini Embedding: Generalizable Embeddings from Gemini—
Gemini Embedding built its representations on the vast knowledge already present in a large language model's parameters, and the resulting embeddings were used for downstream tasks such as semantic retrieval, text clustering, classification, and ranking.
But that stopped at text.
The new Gemini Embedding 2 is the first to fully bridge multimodal data.
Text, images, video, audio, and documents are all compressed into the same vector space.
This means the model achieves "cross-modal semantic alignment": the textual concept "cat" and the visual concept of a cat photo end up extremely close, as mathematical vectors, in the shared embedding space.
In plain terms, when you search for "cat", the system can find not only relevant text, but also pictures, videos, and even sounds of cats.
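With every modality in one space, that search reduces to a nearest-neighbor lookup over a mixed-media index. A sketch with fabricated vectors (real ones would come from the embedding model; the filenames are invented):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Fabricated embeddings for items of different modalities, all in one index.
index = {
    "cat_care_guide.txt": [0.95, 0.05, 0.00],
    "cat_photo.jpg":      [0.90, 0.10, 0.05],
    "cat_meow.mp3":       [0.85, 0.05, 0.10],
    "q3_earnings.pdf":    [0.05, 0.10, 0.95],
}

def search(query_vec, k=3):
    """Rank every indexed item, regardless of modality, by similarity."""
    return sorted(index, key=lambda name: cosine(query_vec, index[name]),
                  reverse=True)[:k]
```

Querying with a stand-in embedding for the text "cat" (`[1.0, 0.0, 0.0]`) surfaces the text, image, and audio items together, with the unrelated PDF ranked last.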
Precisely because of this, many previously complicated multimodal pipelines can be drastically simplified.
Applications from RAG retrieval and semantic search to sentiment analysis and data clustering all benefit directly.
More importantly, this capability matters enormously for AI Agents.
In the past, an Agent operating a computer could usually rely only on textual information—
for example, recognizing labels like "Settings" or "Confirm" on buttons. But in real-world UIs, much of the information lives in visual structure:
icons, layout, color, and control positions, exactly the parts traditional text embedding models struggle with.
With multimodal embeddings, the picture changes.
For an Agent like OpenClaw (the lobster), which needs to operate a computer and read the screen, recognition is no longer limited to text.
It can directly understand which pixel region is the settings icon, which button is most relevant to the current task, and how a screenshot relates to a text instruction.
In other words, Gemini Embedding 2 provides a unified sensory bus: visual, auditory, and textual information can all be related within the same semantic space.
This lays the most important semantic groundwork for future Agents to truly understand screens and environments, and to operate computers on our behalf.
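The same nearest-neighbor idea suggests how an agent could ground a text instruction in a screenshot: embed each candidate screen region and the instruction, then pick the closest region. A hypothetical sketch; the regions, bounding boxes, and vectors are all invented.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Invented embeddings of cropped screen regions, keyed by bounding box (x, y, w, h).
regions = {
    (10, 10, 32, 32):  [0.9, 0.1, 0.1],   # gear-shaped settings icon
    (10, 60, 120, 40): [0.1, 0.9, 0.1],   # "New document" button
    (10, 120, 80, 40): [0.1, 0.1, 0.9],   # trash-can icon
}

def locate(instruction_vec):
    """Return the bounding box whose region embedding best matches the instruction."""
    return max(regions, key=lambda box: cosine(instruction_vec, regions[box]))
```

Given a stand-in embedding for "open the settings", such as `[1.0, 0.05, 0.05]`, `locate` returns the gear icon's bounding box—the agent now knows where to click.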
Technically, Gemini Embedding 2 continues to use Matryoshka Representation Learning (MRL).
This approach lets embedding vectors be dynamically truncated to lower dimensions while preserving semantic information.
(Note: MRL forces the model to pack the most essential features into the first dimensions of the vector, with secondary features placed later, so developers can decide how to trade information density against budget and compute.)
Gemini Embedding 2's default output is 3,072 dimensions, but developers can truncate as needed—for example to 3,072, 1,536, or 768 dimensions—balancing performance against storage cost.
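Because MRL front-loads the most important features, truncation is just slicing off the head of the vector and re-normalizing. A sketch; the 8-dimensional vector is a toy stand-in for a real 3,072-dimensional embedding.

```python
import math

def truncate(vec, dim):
    """Keep the first `dim` dimensions of an MRL embedding and re-normalize,
    so cosine similarities remain comparable after truncation."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy "full" embedding; in the real model this would be 3072 dimensions,
# truncated to e.g. 1536 or 768.
full = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]
short = truncate(full, 4)
```

The truncated vector is a quarter the storage cost and still unit-length, which is the practical payoff of MRL's "core features first" ordering.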
Beyond direct API calls, Gemini Embedding 2 can also be used through tools such as LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search.
By giving different kinds of data a unified semantic representation, Gemini Embedding 2 is providing key infrastructure for the next generation of AI applications: multimodal Agents, and even embodied AI robots.
References
[1] https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/
[2] https://arxiv.org/pdf/2503.07891
Key Quotes
* Mapping text, images, video, audio, and documents all into a single, unified embedding space.
* The model achieves 'cross-modal semantic alignment,' allowing the textual concept of a 'cat' and the visual concept of a cat photo to be extremely close in mathematical distance within the unified embedding space.
* Gemini Embedding 2 provides a unified sensory bus. Visual, auditory, and textual information can all be correlated within the same semantic space.
* MRL forces the model to compress the most core and critical features into the first few dozen dimensions of the vector, with secondary features placed later.
Tags
Gemini Embedding 2
Multimodal Embedding
Vector Space
AI Agent
MRL Technology