
Discord Engineers Add Distributed Tracing to Elixir's Actor Model Without Sacrificing Performance

📅 2026-03-28 18:13 · Steef-Jan Wiggers · Software Programming · 5 min read · 5310 words · Rating: 90
Elixir · Distributed Tracing · Actor Model · OpenTelemetry · Performance Optimization
📌 One-line summary: Discord implemented distributed tracing in Elixir's actor model by wrapping messages in a custom Envelope primitive and using dynamic sampling to handle large-scale fanout scenarios.

📝 Detailed summary: Discord engineers addressed a fundamental limitation of Elixir's actor model: the lack of built-in metadata for propagating trace context between processes. They developed a custom Transport library that introduces an Envelope primitive, wrapping standard messages together with an OpenTelemetry trace carrier. The solution provides drop-in replacements for GenServer functions and includes a normalization layer to support gradual, zero-downtime rollout. To prevent high-fanout events (for example, a single message reaching millions of sessions) from overwhelming their observability infrastructure, they implemented dynamic sampling based on fanout size.

Discord engineering published details on how they added distributed tracing to their Elixir infrastructure while handling millions of concurrent users. The team built a custom "Transport" library that wraps Elixir's message-passing system with trace context, solving a fundamental challenge in instrumenting actor-based architectures.

Unlike HTTP-based microservices, where trace context travels in headers, Elixir's actor model passes arbitrary messages between processes with no built-in metadata layer. Discord needed end-to-end visibility across its chat infrastructure, yet it faced a gap: OpenTelemetry's standard tracing worked within individual services but couldn't propagate context between Elixir processes.

The team identified three requirements for any solution: it had to be ergonomic enough for developers to adopt, support both raw messages and GenServer abstractions, and enable zero-downtime deployment across their production fleet.

Discord's solution introduces an "Envelope" primitive that wraps messages with trace context. The implementation is deceptively simple: a struct containing the original message and a serialized trace carrier:

```elixir
defmodule Discord.Transport.Envelope do
  # Pairs the original message with a serialized OpenTelemetry
  # trace carrier so context can cross process boundaries.
  defstruct [:message, trace_carrier: []]

  def wrap_message(message) do
    %__MODULE__{
      message: message,
      # Inject the current trace context into an empty carrier
      trace_carrier: :otel_propagator_text_map.inject([])
    }
  end
end
```

The library provides drop-in replacements for GenServer's call and cast functions that automatically wrap outgoing messages. On the receiving side, a handle_message function normalizes both old-style bare messages and new Envelope-wrapped ones, extracting trace context when present and cleaning it up after processing.

This normalization proved critical during rollout. Discord couldn't change all message passing overnight or update all nodes simultaneously. The library handles messages from both instrumented and non-instrumented code, enabling gradual migration without service restarts.
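The dual-format handling described above can be sketched roughly as follows. The `Transport.Receiver` module name and the `normalize/1` function are illustrative stand-ins, not Discord's actual API; the Envelope struct mirrors the earlier snippet:

```elixir
defmodule Transport.Envelope do
  defstruct [:message, trace_carrier: []]
end

defmodule Transport.Receiver do
  # Accepts either a bare message (from not-yet-instrumented nodes)
  # or an Envelope (from instrumented ones), and returns the inner
  # message plus any trace carrier to restore context from.
  def normalize(%Transport.Envelope{message: msg, trace_carrier: carrier}) do
    {msg, carrier}
  end

  def normalize(bare_message) do
    # Old-style message: nothing to extract.
    {bare_message, []}
  end
end
```

Because `normalize/1` degrades gracefully to an empty carrier, instrumented and non-instrumented senders can coexist in the same cluster during the migration.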

Discord's architecture creates unique scaling challenges. When a user sends a message to a guild with a million online members, that single traced operation could spawn a million child spans, one for each session process forwarding the message to its client.

The team implemented dynamic sampling based on fanout size. Messages sent to a single recipient preserve their sampling decision 100% of the time; messages fanned out to 100 recipients drop to 10% sampling, and at 10,000+ recipients only 0.1% of sessions capture spans. This approach maintains useful trace data without overwhelming their observability infrastructure.
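A minimal sketch of this policy, using the three data points the article reports; the `10 / n` interpolation between those points is an assumption for illustration, and `FanoutSampler` is a hypothetical module name:

```elixir
defmodule FanoutSampler do
  # Reported probabilities: 1 recipient -> 100%, 100 -> 10%, 10_000+ -> 0.1%.
  def keep_probability(fanout) when fanout <= 1, do: 1.0
  def keep_probability(fanout) when fanout >= 10_000, do: 0.001
  def keep_probability(fanout), do: min(1.0, 10 / fanout)

  # Decide whether one recipient's session should record a span.
  def sample?(fanout), do: :rand.uniform() < keep_probability(fanout)
end
```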

[Image 1: A visual representation of dynamic sampling preservation. Sampling is preserved 100% of the time when a message is sent to a single session, 10% of the time when fanned out to 100 sessions, and 0.1% of the time when fanned out to 10k+ sessions.]

_(Source: Discord blog post)_

Initial deployments revealed tracing overhead Discord hadn't anticipated. Their busiest guilds, those with millions of members, struggled to keep up with activity. Profiling showed processes spending significant time unpacking trace context, even when 99%+ of operations weren't being sampled.

The fix: only propagate trace context for sampled operations. Unsampled traces simply don't include context in their envelopes, saving serialization and parsing costs. This modified traditional head sampling semantics slightly but eliminated CPU spikes.
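A rough sketch of that fix, under stated assumptions: `sampled?` and `inject_fun` stand in for the real OpenTelemetry calls (roughly `:otel_span.is_recording/1` and `:otel_propagator_text_map.inject/1`), and the module layout is illustrative:

```elixir
defmodule Transport.Sender do
  defmodule Envelope do
    defstruct [:message, trace_carrier: []]
  end

  # Only pay the serialization cost of injecting trace context
  # when the current span is actually sampled; unsampled sends
  # ship an empty carrier, so receivers skip parsing entirely.
  def wrap_message(message, sampled?, inject_fun) do
    carrier = if sampled?, do: inject_fun.([]), else: []
    %Envelope{message: message, trace_carrier: carrier}
  end
end
```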

A second optimization targeted the sessions service, where capturing spans during fanout increased CPU usage by 10 percentage points. Discord forbade sessions from starting new traces after receiving fanned-out messages. Sessions can continue existing traces, but won't independently decide to sample. This single change recovered nearly all the overhead, dropping CPU usage from 55% to 45%.
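The continue-but-never-start rule could be expressed as a small policy function; `Sessions.TracePolicy` and its return atoms are hypothetical, not from the article:

```elixir
defmodule Sessions.TracePolicy do
  # A session continues a trace only when the fanned-out envelope
  # already carries context; it never starts a new root trace,
  # i.e. it never makes its own sampling decision.
  def span_action(carrier) when carrier in [nil, []], do: :skip
  def span_action(carrier), do: {:continue_trace, carrier}
end
```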

The most dramatic optimization came from analyzing gRPC request handling. When linking Elixir services to Discord's Python API, 75% of the request processing time was spent unpacking the trace context. The team built a filter that reads the sampling flag from the encoded trace context string without full deserialization. If the trace isn't sampled, context isn't propagated at all.
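If the encoded context is a W3C `traceparent` string (`version-traceid-parentid-flags`, with the sampled bit at `0x01` of the flags byte), such a filter can check the flag with a cheap string split instead of full deserialization. Whether Discord's carrier is exactly this format is an assumption; the article only says they read the sampling flag without deserializing:

```elixir
defmodule TraceFlagFilter do
  import Bitwise

  # Returns true only if the traceparent's flags byte has the
  # W3C "sampled" bit (0x01) set; malformed input counts as unsampled.
  def sampled?(traceparent) when is_binary(traceparent) do
    with [_version, _trace_id, _parent_id, flags | _] <- String.split(traceparent, "-"),
         {n, ""} <- Integer.parse(flags, 16) do
      band(n, 0x01) == 1
    else
      _ -> false
    end
  end
end
```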

The investment paid off during a recent incident where a guild failed to keep up with user activity. Traces showed members experiencing 16-minute delays connecting to the affected guild process, a quantifiable user impact that metrics and logs alone wouldn't reveal. The traces also exposed the downstream cascade: users couldn't even click into the guild during the outage.

Discord engineer Nick Krichevsky notes that while this type of severe degradation is rare, tracing has become essential for investigating issues they "_simply couldn't debug before_."

The Transport library represents a pragmatic approach to distributed tracing in actor-based systems. By wrapping Elixir's message passing rather than trying to retrofit HTTP-style metadata, Discord maintained its architecture's strengths while gaining observability that scales to production workloads.

