Scaling Human Judgment: How Dropbox Optimizes RAG System Labeling with Large Language Models

📅 2026-03-15 10:16 · InfoQ 中文 · Artificial Intelligence · 11 min · 13,172 characters · Score: 76


InfoQ 中文 @InfoQ 中文

One Sentence Summary

Dropbox introduces a "human-calibrated LLM labeling" process, using a small amount of human data to calibrate large language models, enabling the scalable generation of millions of retrieval relevance labels for RAG systems.

Summary

This article describes how Dropbox optimizes the search ranking quality of its RAG system, Dropbox Dash, using Large Language Models (LLMs). Addressing retrieval quality, a core bottleneck in RAG systems, Dropbox abandoned the high-cost pure human labeling model in favor of a "human-calibrated LLM labeling" approach. This approach first uses a small amount of high-quality human-labeled data to calibrate an LLM evaluator, which then scales up labeling by a hundredfold, generating millions of relevance data points for training ranking models. The article also emphasizes the importance of correcting model errors by comparing them with user behavior (clicks/skips) and leveraging additional retrieval to gain context for understanding internal enterprise terminology.

Main Points

* 1. Retrieval quality is a core bottleneck for RAG systems, directly determining the accuracy of generated content. Among massive enterprise documents, only a tiny fraction of relevant content can enter the LLM's context, making search ranking precision crucial for the final answer quality.
* 2. "Human-calibrated LLM labeling" achieves a hundredfold increase in labeling efficiency by calibrating models with a small amount of human data. This method uses human labels as a baseline to guide LLMs in generating hundreds of thousands of labels, addressing the high cost, slow speed, and inconsistency issues of pure human labeling.
* 3. LLM labeling is not directly used for real-time ranking but for training more efficient supervised learning ranking models. Due to slow LLM inference speed and context length limitations, its primary role is to act as an "offline labeler" providing high-quality training data for ranking models.
* 4. By comparing LLM scores with user click/skip behavior, the most challenging labeling errors can be precisely identified and corrected. The evaluation process focuses on instances where model judgments diverge from actual user behavior; these discrepancies provide the strongest learning signals for continuous optimization of labeling quality.
* 5. Introducing additional retrieval to obtain context information is key to solving LLMs' understanding of specific internal enterprise terminology. For specific internal enterprise terminology (e.g., project codenames), allowing the LLM to perform additional retrieval before labeling significantly improves its accuracy in judging relevance.

Metadata

AI Score

76

Website mp.weixin.qq.com

Published At 2026-03-15

Length 1006 words (about 5 min)


Author | Sergio De Simone

Translator | 明知山

To improve the relevance of the responses generated by Dropbox Dash, Dropbox engineers began using large language models to assist human labeling, a practice that has played a key role in identifying the documents used to generate responses. Their approach also offers a valuable reference for any system built on retrieval-augmented generation (RAG).

As Dropbox principal engineer Dmitriy Meyerzon explains, document retrieval quality is the bottleneck of a RAG system: such systems must filter relevant content out of a massive document repository and then feed it to a large language model.

> An enterprise search index holds millions of documents, and the very largest enterprises have billions, so Dash can pass only a tiny fraction of the retrieved documents to the large language model. That makes search ranking quality, and the relevance labels used to train the ranker, critical to the quality of the final answer.

This means the quality of the search ranking model directly affects the quality of the final generated answer. Dash trains its ranking model with supervised learning, labeling query-document pairs according to how well each document satisfies the query. The main difficulty of this approach lies in generating a large volume of high-quality relevance labels.
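As a rough illustration of this supervised setup, the sketch below trains a tiny pointwise ranker on binarized relevance labels. The two features, the hand-rolled logistic-regression loop, and the document names are illustrative assumptions, not details of Dropbox's actual model.

```python
import math

# Each row holds features for one (query, document) pair; the feature
# names [semantic_similarity, title_match] are hypothetical.
X = [[0.9, 1.0], [0.8, 0.0], [0.3, 0.0], [0.1, 0.0]]
y = [1, 1, 0, 0]  # binarized relevance labels (1 = relevant)

# Plain stochastic gradient descent on the logistic loss.
w = [0.0, 0.0]
bias = 0.0
for _ in range(500):
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + bias
        p = 1 / (1 + math.exp(-z))
        grad = p - yi
        w = [wj - 0.5 * grad * xj for wj, xj in zip(w, xi)]
        bias -= 0.5 * grad

def score(features):
    """Predicted probability that the document is relevant to the query."""
    z = sum(wj * fj for wj, fj in zip(w, features)) + bias
    return 1 / (1 + math.exp(-z))

# At query time, candidates are sorted by predicted relevance.
candidates = {"doc_a": [0.85, 1.0], "doc_b": [0.2, 0.0]}
ranked = sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)
```

In production one would use a proper learning-to-rank model over richer features; the point here is only that the ranker learns from labeled query-document pairs, which is why label volume and quality dominate.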

To overcome the limitations of purely human labeling (high cost, slow turnaround, poor consistency), Dropbox introduced a complementary approach: using large language models to generate relevance judgments at scale. This method is cheaper, more consistent, and scales easily to large document sets. But large language models are not perfect evaluators, so their output must itself be evaluated before it is used for training.

> In practice, using large language models for relevance evaluation requires a standardized process that combines automation with human oversight.

The approach, called "human-calibrated LLM labeling," is straightforward: humans first label a small, high-quality dataset that is used to calibrate the LLM evaluator; the LLM then generates hundreds of thousands or even millions of labels, amplifying the human effort by roughly 100x. Note that the LLM does not replace the ranking system: using it to rank directly at query time would be too slow and would run into context-length limits.
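The calibrate-then-scale loop might be sketched as follows. `llm_judge` is a stub standing in for a real LLM call, and the 0-3 grade scale, the agreement metric, and the 0.9 threshold are illustrative assumptions rather than details from the article.

```python
# Small, high-quality, human-labeled calibration set:
# (query, document, human relevance grade 0-3).
HUMAN_CALIBRATION_SET = [
    ("quarterly report", "Q3 financials deck", 3),
    ("quarterly report", "team lunch menu", 0),
]

def llm_judge(query: str, document: str, prompt_version: str) -> int:
    """Stub for an LLM relevance call returning a 0-3 grade.

    A real implementation would prompt the model with the query,
    the document, and grading instructions.
    """
    return 3 if "financials" in document else 0

def agreement(prompt_version: str) -> float:
    """Fraction of calibration pairs where the LLM matches the human grade."""
    hits = sum(
        llm_judge(q, d, prompt_version) == grade
        for q, d, grade in HUMAN_CALIBRATION_SET
    )
    return hits / len(HUMAN_CALIBRATION_SET)

def label_at_scale(pairs, prompt_version="v1", min_agreement=0.9):
    """Label the large pool only once the judge is calibrated."""
    if agreement(prompt_version) < min_agreement:
        raise RuntimeError("LLM judge not calibrated; revise the prompt first")
    return [(q, d, llm_judge(q, d, prompt_version)) for q, d in pairs]
```

The key design point is the gate: the cheap, scalable labeler is trusted only after it reproduces the small human-labeled baseline closely enough, which is what amplifies rather than replaces human judgment.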

The evaluation step compares LLM-generated relevance scores against human judgments on a subset of query-document pairs that does not appear in the training set. The evaluation also focuses on the hardest errors to correct: cases where the LLM's judgment disagrees with user behavior, for example when a user clicks a document the model scored low, or skips one it scored high. Such errors provide the strongest learning signal.
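Mining those behavior disagreements could look like the sketch below; the score scale, thresholds, and log format are illustrative assumptions.

```python
def find_disagreements(rows, low=1, high=2):
    """Surface query-document pairs where the LLM score contradicts user behavior.

    rows: iterable of (query, doc, llm_score in 0-3, user_action 'click'|'skip').
    """
    hard_cases = []
    for query, doc, llm_score, action in rows:
        if action == "click" and llm_score <= low:
            # Users found it useful; the labeler graded it irrelevant.
            hard_cases.append(("model_underrated", query, doc))
        elif action == "skip" and llm_score >= high:
            # The labeler graded it relevant; users passed it over.
            hard_cases.append(("model_overrated", query, doc))
    return hard_cases

logs = [
    ("diet sprite docs", "perf-tool-guide", 0, "click"),  # model missed it
    ("diet sprite docs", "soda-wiki-page", 3, "skip"),    # model was fooled
    ("quarterly report", "Q3 financials", 3, "click"),    # agreement, no signal
]
```

Pairs where the model and users agree carry little new information; the two disagreement buckets are exactly the examples worth routing back to human reviewers or prompt revisions.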

One further consideration matters: context is often the key to judging relevance. Inside Dropbox, for example, "diet sprite" refers to an internal performance tool, not a soft drink. To address this, the team let the LLM perform additional retrieval to gather context and interpret internal terminology, which significantly improved labeling accuracy.
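The "retrieve before judging" idea can be sketched as a prompt-building step. Here a glossary lookup stands in for a real retrieval call, and all names and the prompt format are illustrative assumptions.

```python
# Stand-in for an internal knowledge source the labeler can retrieve from.
INTERNAL_GLOSSARY = {
    "diet sprite": "an internal performance-profiling tool",
}

def retrieve_context(query: str) -> str:
    """Placeholder retrieval: expand internal terms found in the query."""
    notes = [
        f"'{term}' means {gloss}."
        for term, gloss in INTERNAL_GLOSSARY.items()
        if term in query.lower()
    ]
    return " ".join(notes)

def build_labeling_prompt(query: str, document: str) -> str:
    """Prepend retrieved context so the judge interprets internal terms correctly."""
    context = retrieve_context(query)
    prefix = f"Context: {context}\n" if context else ""
    return prefix + f"Query: {query}\nDocument: {document}\nGrade relevance 0-3."
```

Without the retrieved note, a general-purpose model would plausibly grade soft-drink pages as relevant to "diet sprite" queries; with it, the judge sees the internal meaning before scoring.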

Based on the experience with Dropbox Dash, Meyerzon says this approach lets large language models consistently amplify human judgment at scale, making it an effective way to improve RAG systems. Original article: https://www.infoq.com/news/2026/03/dropbox-scaling-human-judgement/


Key Quotes

* Document retrieval quality is a bottleneck for RAG systems: these systems need to filter relevant content from massive document repositories and then feed it to large language models.
* This method, called "human-calibrated large language model labeling," is straightforward: a small batch of high-quality data is first human-labeled to calibrate the large language model evaluator.
* Large language models do not replace ranking systems: if used directly for ranking during queries, they would be too slow and limited by context length.
* The evaluation also focuses on the most challenging errors to correct, instances where large language model judgments diverge from user behavior, as these errors provide the strongest learning signals.


Tags

RAG

LLM Labeling

Dropbox Dash

Search Ranking

Data Labeling



View original → Published: 2026-03-15 10:16:00 · Indexed: 2026-03-15 18:00:13
