← 回總覽

超越语义相似性:推出 NVIDIA NeMo Retriever 的可泛化智能体检索管道

📅 2026-03-14 04:00 Radek Osmulski, Reza Esfandiarpoor, Yauhen Babakhin, Gabriel de Souza Pereira Moreira, Bo Liu 人工智能 11 分鐘 12562 字 評分: 87
智能体 RAG 信息检索 ReACT 架构 NVIDIA NeMo ViDoRe 基准
📌 一句话摘要 NVIDIA NeMo Retriever 推出一个基于 ReACT 的智能体管道,通过 LLM-检索器循环迭代优化搜索策略,实现了最先进的检索性能。 📝 详细摘要 NVIDIA NeMo Retriever 团队开发了一个可泛化的智能体检索管道,在 ViDoRe v3 和 BRIGHT 排行榜上获得了领先排名。该系统超越了传统的语义相似性,采用 ReACT 架构,其中 LLM(如 Claude 4.5 或 Nemotron)迭代地使用“思考”和“检索”工具来优化查询、重新措辞搜索尝试并分解复杂任务。为了克服智能体工作流的高延迟和资源需求,团队通过将模型上下文协议(MCP)

We are thrilled to announce that NVIDIA NeMo Retriever team has developed a new agentic retrieval pipeline that has officially secured the #1 spot on the ViDoRe v3 pipeline leaderboard. In addition, this exact same pipeline architecture achieved the #2 spot on the highly demanding, reasoning-intensive BRIGHT leaderboard.

In the rapidly evolving landscape of AI retrieval, many solutions are highly specialized, engineered to perform exceptionally well on specific, narrow tasks. However, real-world enterprise applications rarely have the luxury of perfectly curated, single-domain data. They require systems that can seamlessly adapt to a wide variety of challenges—from parsing complex visual layouts to executing deep logical reasoning.

That is why we prioritized generalizability in our design. Rather than relying on dataset-specific heuristics, we built an agentic pipeline that dynamically adapts its search and reasoning strategy to the data at hand. This allows us to deliver state-of-the-art performance across vastly different benchmarks without requiring any underlying architectural changes.

Here is a look at how we built it.

[](https://www.bestblogs.dev/article/8f1cb8a8#the-motivation-why-semantic-similarity-isnt-enough) The Motivation: Why Semantic Similarity Isn't Enough ------------------------------------------------------------------------------------------------------------------------------------------------------

For years, dense retrieval based on semantic similarity has been the standard for finding information. However, as the applications of retrieval expand, finding relevant documents goes beyond semantic similarity alone. Complex document search requires reasoning skills, an understanding of real-world systems, and iterative exploration.

There is a fundamental gap: LLMs are great at thinking and reasoning but cannot process millions of documents at once. Conversely, retrievers can easily sift through millions of documents but possess limited reasoning skills. Agentic retrieval bridges this gap by creating an active, iterative loop between the LLM and the retriever.

[](https://www.bestblogs.dev/article/8f1cb8a8#how-it-works-the-agentic-loop) How It Works: The Agentic Loop ----------------------------------------------------------------------------------------------------------- ![Image 1: Agentic retrieval pipeline overview](https://cdn-uploads.huggingface.co/production/uploads/6172d79a6c2ff4e51d0ea593/2AZCf05aUJxE6AoFS--N1.png)

_Agentic retrieval pipeline overview_

Our agentic retrieval pipeline relies on a ReACT architecture. Instead of a single "one-and-done" query, the agent iteratively searches, evaluates, and refines its approach.

The agent utilizes built-in tools like think to plan its approach and final_results to output the exact documents needed for a given query, alongside a retrieve (query, top_k) tool to explore the corpus. Through this loop, we observed successful search patterns emerge naturally:

* Generating better queries: The agent dynamically adjusts its search queries based on newly discovered information. * Persistent rephrasing: It continually rephrases queries until useful information is found. * Breaking down complexity: It translates complex, multi-part queries into multiple simpler queries with clear goals.

Finally, to synthesize the iterative discoveries, the agent calls a final_results tool to output the most relevant documents, ranked by their relevance to a given query. As a safety net—for example, when the agent hits the maximum number of steps or the context length limit—the pipeline falls back to Reciprocal Rank Fusion (RRF), which scores documents based on their ranks across all retrieval attempts in the agent trajectory.

[](https://www.bestblogs.dev/article/8f1cb8a8#engineering-for-speed-and-scale) Engineering for Speed and Scale --------------------------------------------------------------------------------------------------------------

Agentic workflows are notoriously slow and resource-intensive. To make this pipeline viable for leaderboard-scale evaluation, we had to rethink how the LLM agent and the retriever communicate.

Initially, the retriever was exposed to the agent via a Model Context Protocol (MCP) server—a natural choice, since MCP is designed precisely for giving LLMs access to external tools. But in practice, this architecture imposed a compounding tax on experiment velocity. Every run required spinning up a separate MCP server, loading the right dataset corpus into GPU memory, and orchestrating the lifecycle of both client and server. Each of these steps was an opportunity for silent misconfiguration or a server freeze under high-volume requests. The network round-trips added latency to every retrieval call, and the overall cognitive burden of managing the two-process setup made it significantly harder for other teams to adopt and iterate on the pipeline.

To resolve this, we replaced the MCP server with a thread-safe singleton retriever that lives in-process. The singleton loads the model and corpus embeddings once, protects all access with a reentrant lock, and exposes the same retrieve() interface to arbitrarily many concurrent agent tasks—achieving the key benefit of an MCP server (safe, shared access to a GPU-resident retriever from multiple threads) without incurring network serialization overhead or requiring a separate server process. This single architectural change eliminated an entire class of deployment errors and dramatically improved both GPU utilization and experiment throughput.

[](https://www.bestblogs.dev/article/8f1cb8a8#generalization-vs-specialization-across-benchmarks) Generalization vs. Specialization Across Benchmarks -----------------------------------------------------------------------------------------------------------------------------------------------------

A common observation in modern retrieval evaluation is that solutions highly optimized for one specific type of task often experience a performance gap when applied to a completely different domain.

| Pipeline | ViDoRe v3 | | --- | --- | | NeMo Agentic Retrieval (Opus 4.5 + nemotron-colembed-vl-8b-v2) | 69.22 (#1) | | Dense retrieval (nemotron-colembed-vl-8b-v2) | 64.36 | | INF-X-Retriever (INF-Query-Aligner + nemotron-colembed-vl-8b-v2) | 62.31 | | INF-X-Retriever | 51.01 |

| Pipeline | BRIGHT | | --- | --- | | INF-X-Retriever | 63.40 (#1) | | NeMo Agentic Retrieval (Opus 4.5 + nemotron-reasoning-3b) | 50.90 (#2) |

We placed #2 on the reasoning-intensive BRIGHT leaderboard with an NDCG@10 of 50.90. The #1 solution on that leaderboard, INF-X-Retriever, achieves an impressive 63.40. However, to test cross-domain adaptability, we benchmarked the INF-X pipeline (coupled with the same nemotron-colembed-vl-8b-v2 embedding model used in our agentic pipeline) on ViDoRe v3, a dataset focusing on visually rich and diverse enterprise documents. On this different task, its performance landed at an NDCG@10 of 62.31, lower than the dense retrieval score 64.36. In other words, INF-Query-Aligner does not improve over the dense retrieval baseline on ViDoRe v3.

In contrast, our same agentic pipeline (pairing Opus 4.5 with nemotron-colembed-vl-8b-v2) achieved the #1 spot on ViDoRe v3 with a score of 69.22.

This highlights a core strength of our approach: generalizability. Rather than relying on dataset-specific heuristics or a query-rewriter/aligner, our agentic loop naturally adapts its strategy to the dataset at hand, whether it requires multi-step logical reasoning or parsing complex visual layouts.

[](https://www.bestblogs.dev/article/8f1cb8a8#ablation-studies-open-vs-closed-models) Ablation Studies: Open vs. Closed Models ------------------------------------------------------------------------------------------------------------------------------ ViDoRe v3

| Agent | Embedding Model | NDCG @10 | Average sec/query | Total input tokens (M) | Total output tokens (M) | Average retrieval calls | | --- | --- | --- | --- | --- | --- | --- | | Opus 4.5 | nemotron-colembed-vl-8b-v2 | 69.22 | 136.3 | 1837 | 15 | 9.2 | | gpt-oss-120b | nemotron-colembed-vl-8b-v2 | 66.38 | 78.6 | 1860 | 13 | 2.4 | | gpt-oss-120b | llama-nemotron-embed-vl-1b-v2 | 62.42 | 78.1 | 1459 | 13 | 2.5 | | - | nemotron-colembed-vl-8b-v2 | 64.36 | 0.67 | - | - | - | | - | llama-nemotron-embed-vl-1b-v2 | 55.83 | 0.02 | - | - | - | BRIGHT

| Agent | Embedding Model | NDCG @10 | Average sec/query | Total input tokens (M) | Total output tokens (M) | Average retrieval calls | | --- | --- | --- | --- | --- | --- | --- | | Opus 4.5 | llama-embed-nemotron-reasoning-3b | 50.79 | 148.2 | 1251 | 11 | 11.8 | | gpt-oss-120b | llama-embed-nemotron-reasoning-3b | 41.27 | 92.8 | 1546 | 11 | 4.5 | | gpt-oss-120b | llama-nemotron-embed-vl-1b-v2 | 33.85 | 139.1 | 1516 | 12 | 6.6 | | - | llama-embed-nemotron-reasoning-3b | 38.28 | 0.11 | - | - | - | | - | llama-nemotron-embed-vl-1b-v2 | 19.56 | 0.08 | - | - | - |

We conducted extensive ablations to understand the tradeoff between frontier closed models and open-weights alternatives:

* Model Choice: On ViDoRe v3, swapping Opus 4.5 for the open gpt-oss-120b resulted in a small accuracy drop (69.22 to 66.38 NDCG@10) and makes much fewer retrieval calls. On BRIGHT, the gap was wider, indicating that deeper reasoning tasks still benefit heavily from frontier models like Opus 4.5. * Embeddings: Pairing the agent with specialized embeddings (nemotron-colembed-vl-8b-v2 for ViDoRe and llama-embed-nemotron-reasoning-3b for BRIGHT) yielded the best results, proving that a strong baseline retriever provides a higher ceiling for the agent to reach. * It's also interesting to note that the agent can close the gap between stronger and weaker embedding models. For example, on ViDoRe, the gap between the stronger nemotron-colembed-vl-8b-v2 and the weaker llama-nemotron-embed-vl-1b-v2 is about 8.5 in dense retrieval, but when coupled with gpt-oss-120b agent, the gap shrinks to about 4. Similarly, llama-embed-nemotron-reasoning-3b is about 19 points better than llama-nemotron-embed-vl-1b-v2 on BRIGHT, but the lead shrinks to about 7.5 when coupled with gpt-oss-120b agent.

[](https://www.bestblogs.dev/article/8f1cb8a8#the-cost-of-autonomy-and-whats-next) The Cost of Autonomy and What's Next -----------------------------------------------------------------------------------------------------------------------

There is no free lunch. Agentic retrieval is more expensive and slower than standard dense retrieval. Looking at our ViDoRe v3 results, the agent averages 136 seconds per query and consumes roughly 760k input and 6.3k output tokens per query. _(Note: the sequential latency is measured on a single A100 GPU with a single concurrent Claude API call—i.e., not searching multiple queries at the same time—to reflect true search time in real-world use cases)._

However, we believe agentic retrieval is a highly viable approach for high-stakes, complex queries. Our immediate next steps focus on cost reduction: we are actively researching ways to distill these agentic reasoning patterns into smaller, specialized open-weight agents. By fine-tuning smaller models to orchestrate the think and retrieve loop natively, we aim to deliver Opus-level accuracy at a fraction of the latency and cost.

[](https://www.bestblogs.dev/article/8f1cb8a8#build-your-own-agentic-pipeline) Build Your Own Agentic Pipeline --------------------------------------------------------------------------------------------------------------

While our leaderboard-topping runs explored combinations like Claude Opus and gpt-oss alongside our research embedding models, the true strength of this architecture is its modularity. For production-ready deployments, we highly encourage you to try pairing your agent of choice with our robust commercial embedding model llama-nemotron-embed-vl-1b-v2. To explore these models, dive into the tools, and start building your own highly generalizable retrieval workflows, visit the NeMo Retriever library today.

查看原文 → 發佈: 2026-03-14 04:00:00 收錄: 2026-03-14 06:00:29

🤖 問 AI

針對這篇文章提問,AI 會根據文章內容回答。按 Ctrl+Enter 送出。