
Discord Engineers Add Distributed Tracing to Elixir's Actor Model Without Sacrificing Performance

📅 2026-03-28 18:13 · Steef-Jan Wiggers · Software Programming · 5 min read · 5310 words · Rating: 90
Elixir · Distributed Tracing · Actor Model · OpenTelemetry · Performance Optimization
📌 One-line summary: Discord implemented distributed tracing in Elixir's actor model by wrapping messages in a custom Envelope primitive and using dynamic sampling to handle large-scale fanout scenarios.

📝 Detailed summary: Discord engineers addressed a fundamental limitation of Elixir's actor model: the lack of built-in metadata for propagating trace context between processes. They developed a custom Transport library that introduces an Envelope primitive, wrapping standard messages together with an OpenTelemetry trace carrier. The solution provides drop-in replacements for GenServer functions and includes a normalization layer to support gradual, zero-downtime rollout. To prevent high-fanout events (for example, a single message reaching millions of sessions) from overwhelming their observability infrastructure, they implemented dynamic sampling based on fanout size.

Discord engineering published details on how they added distributed tracing to their Elixir infrastructure while handling millions of concurrent users. The team built a custom "Transport" library that wraps Elixir's message-passing system with trace context, solving a fundamental challenge in instrumenting actor-based architectures.

Unlike HTTP-based microservices, where trace context travels in headers, Elixir's actor model passes arbitrary messages between processes with no built-in metadata layer. Discord needed end-to-end visibility across its chat infrastructure, yet it faced a gap: OpenTelemetry's standard tracing worked within individual services but couldn't propagate context between Elixir processes.

The team identified three requirements for any solution: it had to be ergonomic enough for developers to adopt, support both raw messages and GenServer abstractions, and enable zero-downtime deployment across their production fleet.

Discord's solution introduces an "Envelope" primitive that wraps messages with trace context. The implementation is deceptively simple: a struct containing the original message and a serialized trace carrier:

```elixir
defmodule Discord.Transport.Envelope do
  # Pairs the original message with a serialized OpenTelemetry
  # trace carrier so context can cross process boundaries.
  defstruct [:message, trace_carrier: []]

  def wrap_message(message) do
    %__MODULE__{
      message: message,
      # Inject the current trace context into an empty carrier
      trace_carrier: :otel_propagator_text_map.inject([])
    }
  end
end
```

The library provides drop-in replacements for GenServer's call and cast functions that automatically wrap outgoing messages. On the receiving side, a handle_message function normalizes both old-style bare messages and new Envelope-wrapped ones, extracting trace context when present and cleaning it up after processing.

This normalization proved critical during rollout. Discord couldn't change all message passing overnight or update all nodes simultaneously. The library handles messages from both instrumented and non-instrumented code, enabling gradual migration without service restarts.
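The dual-format handling described above can be sketched roughly as follows. The `Transport.Receiver` module name and the `normalize/1` function are illustrative stand-ins, not Discord's actual API; the Envelope struct mirrors the earlier snippet:

```elixir
defmodule Transport.Envelope do
  defstruct [:message, trace_carrier: []]
end

defmodule Transport.Receiver do
  # Accepts either a bare message (from not-yet-instrumented nodes)
  # or an Envelope (from instrumented ones), and returns the inner
  # message plus any trace carrier to restore context from.
  def normalize(%Transport.Envelope{message: msg, trace_carrier: carrier}) do
    {msg, carrier}
  end

  def normalize(bare_message) do
    # Old-style message: nothing to extract.
    {bare_message, []}
  end
end
```

Because `normalize/1` degrades gracefully to an empty carrier, instrumented and non-instrumented senders can coexist in the same cluster during the migration.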

Discord's architecture creates unique scaling challenges. When a user sends a message to a guild with a million online members, that single traced operation could spawn a million child spans, one for each session process forwarding the message to its client.

The team implemented dynamic sampling based on fanout size. Messages sent to a single recipient preserve their sampling decision 100% of the time; messages fanned out to 100 recipients drop to 10% sampling, and at 10,000+ recipients only 0.1% of sessions capture spans. This approach maintains useful trace data without overwhelming their observability infrastructure.
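A minimal sketch of this policy, using the three data points the article reports; the `10 / n` interpolation between those points is an assumption for illustration, and `FanoutSampler` is a hypothetical module name:

```elixir
defmodule FanoutSampler do
  # Reported probabilities: 1 recipient -> 100%, 100 -> 10%, 10_000+ -> 0.1%.
  def keep_probability(fanout) when fanout <= 1, do: 1.0
  def keep_probability(fanout) when fanout >= 10_000, do: 0.001
  def keep_probability(fanout), do: min(1.0, 10 / fanout)

  # Decide whether one recipient's session should record a span.
  def sample?(fanout), do: :rand.uniform() < keep_probability(fanout)
end
```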

[Image 1: A visual representation of dynamic sampling preservation. Sampling is preserved 100% of the time when a message is sent to a single session, 10% of the time when fanned out to 100 sessions, and 0.1% of the time when fanned out to 10k+ sessions.]

_(Source: Discord blog post)_

Initial deployments revealed tracing overhead Discord hadn't anticipated. Their busiest guilds, those with millions of members, struggled to keep up with activity. Profiling showed processes spending significant time unpacking trace context, even when 99%+ of operations weren't being sampled.

The fix: only propagate trace context for sampled operations. Unsampled traces simply don't include context in their envelopes, saving serialization and parsing costs. This modified traditional head sampling semantics slightly but eliminated CPU spikes.
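A rough sketch of that fix, under stated assumptions: `sampled?` and `inject_fun` stand in for the real OpenTelemetry calls (roughly `:otel_span.is_recording/1` and `:otel_propagator_text_map.inject/1`), and the module layout is illustrative:

```elixir
defmodule Transport.Sender do
  defmodule Envelope do
    defstruct [:message, trace_carrier: []]
  end

  # Only pay the serialization cost of injecting trace context
  # when the current span is actually sampled; unsampled sends
  # ship an empty carrier, so receivers skip parsing entirely.
  def wrap_message(message, sampled?, inject_fun) do
    carrier = if sampled?, do: inject_fun.([]), else: []
    %Envelope{message: message, trace_carrier: carrier}
  end
end
```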

A second optimization targeted the sessions service, where capturing spans during fanout increased CPU usage by 10 percentage points. Discord forbade sessions from starting new traces after receiving fanned-out messages. Sessions can continue existing traces, but won't independently decide to sample. This single change recovered nearly all the overhead, dropping CPU usage from 55% to 45%.
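The continue-but-never-start rule could be expressed as a small policy function; `Sessions.TracePolicy` and its return atoms are hypothetical, not from the article:

```elixir
defmodule Sessions.TracePolicy do
  # A session continues a trace only when the fanned-out envelope
  # already carries context; it never starts a new root trace,
  # i.e. it never makes its own sampling decision.
  def span_action(carrier) when carrier in [nil, []], do: :skip
  def span_action(carrier), do: {:continue_trace, carrier}
end
```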

The most dramatic optimization came from analyzing gRPC request handling. When linking Elixir services to Discord's Python API, 75% of the request processing time was spent unpacking the trace context. The team built a filter that reads the sampling flag from the encoded trace context string without full deserialization. If the trace isn't sampled, context isn't propagated at all.
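If the encoded context is a W3C `traceparent` string (`version-traceid-parentid-flags`, with the sampled bit at `0x01` of the flags byte), such a filter can check the flag with a cheap string split instead of full deserialization. Whether Discord's carrier is exactly this format is an assumption; the article only says they read the sampling flag without deserializing:

```elixir
defmodule TraceFlagFilter do
  import Bitwise

  # Returns true only if the traceparent's flags byte has the
  # W3C "sampled" bit (0x01) set; malformed input counts as unsampled.
  def sampled?(traceparent) when is_binary(traceparent) do
    with [_version, _trace_id, _parent_id, flags | _] <- String.split(traceparent, "-"),
         {n, ""} <- Integer.parse(flags, 16) do
      band(n, 0x01) == 1
    else
      _ -> false
    end
  end
end
```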

The investment paid off during a recent incident where a guild failed to keep up with user activity. Traces showed members experiencing 16-minute delays connecting to the affected guild process, a quantifiable user impact that metrics and logs alone wouldn't reveal. The traces also exposed the downstream cascade: users couldn't even click into the guild during the outage.

Discord engineer Nick Krichevsky notes that while this type of severe degradation is rare, tracing has become essential for investigating issues they "_simply couldn't debug before_."

The Transport library represents a pragmatic approach to distributed tracing in actor-based systems. By wrapping Elixir's message passing rather than trying to retrofit HTTP-style metadata, Discord maintained its architecture's strengths while gaining observability that scales to production workloads.

