Title: QCon London 2026: Uncorking Queueing Bottlenecks with OpenTelemetry | BestBlogs.dev
URL Source: https://www.bestblogs.dev/article/e36ebac9
Published Time: 2026-03-18 07:00:00
QCon London 2026: Uncorking Queueing Bottlenecks with OpenTelemetry
===================================================================
InfoQ @Mark Silvester
One Sentence Summary
This article explores how Gearset engineers utilized OpenTelemetry and distributed tracing to resolve complex asynchronous queueing bottlenecks by shifting from infrastructure metrics to latency-based SLOs.
Summary
The article summarizes a QCon London 2026 presentation by Gearset engineers on overcoming the limitations of traditional logging and metrics in distributed systems. While standard dashboards failed to diagnose delayed backup jobs, the team implemented OpenTelemetry (OTel) to provide hierarchical visibility across service boundaries. Key technical strategies included creating custom wrappers for queue clients to ensure context propagation, adopting 'wide events' with rich metadata for discovery-based debugging, and using OTel's trace state to track total duration in asynchronous workflows. Crucially, the team shifted their alerting philosophy from infrastructure-centric metrics (like queue size) to customer-centric Service Level Objectives (SLOs) based on latency, arguing that delay is a more stable indicator of user experience than volume.
Main Points
* 1. Distributed tracing bridges the visibility gap between macro metrics and disconnected logs.
While metrics show system-wide trends and logs capture individual events, tracing provides the necessary hierarchical structure to group events from a single operation, allowing engineers to see cause and effect across service boundaries.
* 2. Effective queue observability requires custom context propagation standards.
Unlike HTTP calls, which are often traced automatically, queues require manual intervention. Gearset implemented wrappers to attach trace and parent span IDs as message metadata, ensuring the request journey remains intact through asynchronous hops.
* 3. Alerting should prioritize customer-facing latency over internal infrastructure metrics like queue size.
Using a traffic analogy, the authors argue that queue size is misleading; five items can be slow while a thousand are fast. Latency is a more stable Service Level Indicator that directly reflects the actual customer experience.
* 4. Adopting 'wide events' with rich metadata enables discovery-based debugging for hidden bottlenecks.
By attaching extensive metadata (timestamps, subgroup IDs, Kubernetes context) to spans, teams can query attributes in real time to uncover inefficiencies, such as obsolete code paths, that traditional monitoring would miss.
* 5. Cultural change in observability is best achieved through proving value during real incidents.
Rather than a top-down mandate, observability becomes a self-reinforcing practice when engineers see tangible benefits, such as using trace links to navigate 'mega traces' and resolving long-standing production issues quickly.
Metadata
AI Score
83
Website infoq.com
Published At Today
Length 564 words (about 3 min)
A customer backup job running an hour late on a Friday evening should have been straightforward to diagnose. For engineers at Gearset, it wasn't.
Despite comprehensive dashboards, metrics, and logs, the team found themselves guessing at root causes rather than identifying them. Metrics showed the forest but not the trees; logs showed individual trees but made it nearly impossible to trace a path between them.
During their QCon London 2026 presentation, Julian Wreford and Oli Lane from Gearset explained how distributed tracing with OpenTelemetry filled that gap. Tracing provides a hierarchical structure that groups events from a single operation, offering visibility into cause and effect across service boundaries. While HTTP tracing often works automatically, queues require custom work to maintain context. The team implemented OpenTelemetry's context propagation standard by creating wrappers for their queue clients that attach trace IDs and parent span IDs as message metadata, so that the full journey of a request remains intact.
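To make the wrapper idea concrete, here is a minimal sketch, not Gearset's code. It uses only the Python standard library and a hypothetical in-memory `TracedQueue`. The `traceparent` value follows the W3C Trace Context layout (version, trace ID, parent span ID, flags); `Message`, `SpanContext`, `new_span`, `inject`, and `extract` are illustrative names. In production, the OpenTelemetry SDK's propagators would typically do this work.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Message:
    """Hypothetical message shape: a body plus free-form metadata."""
    body: str
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 lowercase hex characters
    span_id: str   # 16 lowercase hex characters


def new_span(parent: Optional[SpanContext] = None) -> SpanContext:
    """Start a span: reuse the parent's trace ID, or begin a new trace."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    return SpanContext(trace_id, secrets.token_hex(8))


def inject(ctx: SpanContext, metadata: dict) -> None:
    """Write the context as a traceparent value: version-traceid-spanid-flags."""
    metadata["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-01"


def extract(metadata: dict) -> Optional[SpanContext]:
    """Read the traceparent value back; None if it is absent or malformed."""
    parts = metadata.get("traceparent", "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return SpanContext(parts[1], parts[2])


class TracedQueue:
    """Hypothetical wrapper; a plain list stands in for the real queue client."""

    def __init__(self):
        self._items = []

    def send(self, body: str, current: SpanContext) -> None:
        # Attach the sender's context before the message leaves the process.
        msg = Message(body)
        inject(current, msg.metadata)
        self._items.append(msg)

    def receive(self) -> Tuple[Message, SpanContext, Optional[SpanContext]]:
        # Restore the sender's context and start a child span under it.
        msg = self._items.pop(0)
        parent = extract(msg.metadata)
        return msg, new_span(parent), parent


producer_span = new_span()
queue = TracedQueue()
queue.send("backup-job", producer_span)
msg, consumer_span, parent = queue.receive()
# consumer_span shares producer_span's trace ID, so both ends of the
# queue hop land in the same trace.
```

Putting the logic in the wrapper rather than at each call site means every producer and consumer propagates context the same way without having to remember to do so.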
A central theme of the transition involved moving away from infrastructure metrics toward Service Level Objectives (SLOs) based on customer experience. The speakers used a traffic analogy to explain why alerting on queue size is often misleading. Just as Google Maps alerts drivers based on expected delay rather than the number of cars on the road, the Gearset team shifted to alerting on latency. They noted that a thousand items on a queue might be processed instantly, while five items could be significantly delayed. Latency is more stable and directly reflects the customer experience, reducing the need for constant re-tuning as system characteristics change.
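The contrast between the two alerting styles can be shown with a small sketch. The queue contents, the depth limit of 100, and the two-second age limit are all made-up values for illustration, and `QueueSnapshot` is a hypothetical type rather than anything from the talk.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical point-in-time view of a queue."""
    name: str
    enqueued_at: list  # enqueue time of each waiting message, in seconds
    now: float         # time the snapshot was taken, in seconds

    @property
    def depth(self) -> int:
        return len(self.enqueued_at)

    @property
    def oldest_age(self) -> float:
        """Seconds the longest-waiting message has been queued (0 if empty)."""
        if not self.enqueued_at:
            return 0.0
        return self.now - min(self.enqueued_at)


def depth_alert(q: QueueSnapshot, max_depth: int = 100) -> bool:
    """Infrastructure-style alert: fires on how many items are waiting."""
    return q.depth > max_depth


def latency_alert(q: QueueSnapshot, max_age_s: float = 2.0) -> bool:
    """Customer-style alert: fires on how long items have been waiting."""
    return q.oldest_age > max_age_s


# A thousand messages, all enqueued within the last half second.
busy = QueueSnapshot("busy", [999.5 + i * 0.0004 for i in range(1000)], now=1000.0)
# Five messages that have been sitting for roughly ten minutes.
stuck = QueueSnapshot("stuck", [400.0, 410.0, 420.0, 430.0, 440.0], now=1000.0)

for q in (busy, stuck):
    print(q.name, "depth alert:", depth_alert(q), "latency alert:", latency_alert(q))
```

In this toy data the depth alert fires on the healthy queue and stays silent on the stuck one, while the latency alert does the opposite, which is the failure mode the traffic analogy describes.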
To implement this strategy, the team adopted a three-step framework: define the Service Level Indicator (SLI), set the SLO, and then configure alerts.
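A rough sketch of those three steps follows, assuming a two-second "good" threshold and a 99.9% target as example figures. The function names and the sample data are hypothetical, and the alert rule is deliberately simple: it fires only once the window's whole budget is spent, whereas real SLO tooling often alerts on how fast the budget is burning.

```python
def is_good(latency_s: float, threshold_s: float = 2.0) -> bool:
    """Bucket a single event: good if it finished within the threshold."""
    return latency_s <= threshold_s


def sli(latencies_s: list, threshold_s: float = 2.0) -> float:
    """Step 1, the SLI: fraction of events in the window that were good."""
    if not latencies_s:
        return 1.0
    good = sum(1 for x in latencies_s if is_good(x, threshold_s))
    return good / len(latencies_s)


def error_budget_remaining(latencies_s: list, slo: float = 0.999,
                           threshold_s: float = 2.0) -> float:
    """Step 2, the SLO: share of the allowed bad events still unspent.

    1.0 means the budget is untouched; zero or below means it is exhausted.
    """
    bad = sum(1 for x in latencies_s if not is_good(x, threshold_s))
    if bad == 0:
        return 1.0
    allowed_bad = (1.0 - slo) * len(latencies_s)
    if allowed_bad == 0:
        return float("-inf")
    return 1.0 - bad / allowed_bad


def should_alert(latencies_s: list, slo: float = 0.999,
                 threshold_s: float = 2.0) -> bool:
    """Step 3, the alert: fire once the window's budget is used up."""
    return error_budget_remaining(latencies_s, slo, threshold_s) <= 0.0


# 10,000 events per window: a 99.9% target allows about 10 slow ones.
healthy = [0.3] * 9995 + [5.0] * 5     # 5 slow events, about half the budget
degraded = [0.3] * 9980 + [5.0] * 20   # 20 slow events, budget overspent

for name, window in (("healthy", healthy), ("degraded", degraded)):
    print(name, round(sli(window), 4), should_alert(window))
```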
They bucketed events into good or bad categories, such as whether a message was processed within two seconds, and defined an error budget for a target of 99.9% success. Once an alert fires, engineers can jump directly from a macro visualisation of the distribution to specific problem traces. The team shared a specific implementation trick for tracking total duration in asynchronous traces by using OpenTelemetry's trace state to embed and propagate the root span's start timestamp. This allows for calculating the time since the trace began for any child span, regardless of how many queues or services it has traversed.
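The trace state trick might look roughly like the sketch below. It treats `tracestate` as the comma-separated `key=value` list defined by W3C Trace Context and uses only the standard library. The key name `rootstart` and the nanosecond encoding are assumptions made for illustration; the talk summary does not say which key or format Gearset uses.

```python
import time
from typing import Optional

ROOT_START_KEY = "rootstart"  # hypothetical key name, not taken from the talk


def parse_tracestate(header: str) -> dict:
    """Split a tracestate header into its key=value members."""
    entries = {}
    for member in header.split(","):
        member = member.strip()
        if "=" in member:
            key, value = member.split("=", 1)
            entries[key] = value
    return entries


def format_tracestate(entries: dict) -> str:
    """Join members back into a single header value."""
    return ",".join(f"{key}={value}" for key, value in entries.items())


def start_root(now_ns: Optional[int] = None) -> str:
    """On the root span: record its start time (Unix nanoseconds)."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    return format_tracestate({ROOT_START_KEY: str(now_ns)})


def elapsed_since_root_s(tracestate: str,
                         now_ns: Optional[int] = None) -> Optional[float]:
    """On any child span: seconds since the root began, or None if unknown."""
    value = parse_tracestate(tracestate).get(ROOT_START_KEY)
    if value is None or not value.isdigit():
        return None
    now_ns = time.time_ns() if now_ns is None else now_ns
    return (now_ns - int(value)) / 1e9


# The root span stamps the header once; it then travels with every message.
header = start_root(now_ns=1_000_000_000)
# Several queue hops later, a consumer reads it back alongside other members.
elapsed = elapsed_since_root_s("vendor=abc," + header, now_ns=4_500_000_000)
print(elapsed)  # 3.5
```

Because the timestamp travels with the context, a consumer needs no lookup of the root span to compute total duration. The result does depend on wall clocks being reasonably synchronised across hosts.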
The adoption of "wide events", which involves attaching as much metadata as possible to spans, unlocked discovery-based debugging. By including created, sent, and received timestamps, along with FIFO subgroup IDs, the team could query attributes in real time to discover hidden waste. In one instance, they identified a long-standing bottleneck caused by an obsolete piece of code that was easily removed. The Gearset engineers also highlighted the role of the OpenTelemetry Collector, which they use to automatically enrich traces with Kubernetes metadata and scrub sensitive data before it reaches their query engine.
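As an illustration of why wide events help, the sketch below derives durations from the three timestamps and then slices them by an arbitrary attribute. The span records, attribute names, and values are invented for the example. A real setup would run this kind of query in the tracing backend rather than in application code.

```python
from collections import defaultdict
from statistics import median

# Hypothetical wide events: one dict per processed message, times in ms.
spans = [
    {"subgroup_id": "org-a", "created_ms": 0, "sent_ms": 100, "received_ms": 300, "pod": "worker-1"},
    {"subgroup_id": "org-a", "created_ms": 50, "sent_ms": 120, "received_ms": 420, "pod": "worker-2"},
    {"subgroup_id": "org-a", "created_ms": 90, "sent_ms": 150, "received_ms": 400, "pod": "worker-1"},
    {"subgroup_id": "org-b", "created_ms": 10, "sent_ms": 60, "received_ms": 9060, "pod": "worker-2"},
    {"subgroup_id": "org-b", "created_ms": 20, "sent_ms": 80, "received_ms": 12080, "pod": "worker-1"},
    {"subgroup_id": "org-b", "created_ms": 30, "sent_ms": 95, "received_ms": 10095, "pod": "worker-2"},
]


def with_derived(span: dict) -> dict:
    """Add derived durations so they can be queried like any other attribute."""
    return {
        **span,
        "publish_delay_ms": span["sent_ms"] - span["created_ms"],
        "queue_wait_ms": span["received_ms"] - span["sent_ms"],
    }


def group_median(events: list, by: str, value: str) -> dict:
    """Median of one attribute, grouped by another (an ad hoc 'group by')."""
    groups = defaultdict(list)
    for event in events:
        groups[event[by]].append(event[value])
    return {key: median(values) for key, values in groups.items()}


wide = [with_derived(s) for s in spans]
by_subgroup = group_median(wide, "subgroup_id", "queue_wait_ms")
by_pod = group_median(wide, "pod", "queue_wait_ms")
print(by_subgroup)  # {'org-a': 250, 'org-b': 10000}
```

The same events answer a different question just by changing the `by` argument, without re-instrumenting anything. In this invented data, grouping by subgroup separates the slow messages cleanly, while grouping by pod does not.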
While the technical implementation was significant, the speakers emphasised that cultural change was the most challenging aspect. They recommended engaging with teams on their own terms and proving the value of tracing by resolving real-world incidents, rather than promoting the technology as a cure for every operational problem. When engineers see the tangible benefits of discovery-based debugging, such as splitting "mega traces" into navigable sub-traces using trace links, they naturally begin to enrich their own spans, turning observability from a top-down mandate into a self-reinforcing practice.
Key Quotes
* Metrics showed the forest but not the trees; logs showed individual trees but made it nearly impossible to trace a path between them.
* Just as Google Maps alerts drivers based on expected delay rather than the number of cars on the road, the Gearset team shifted to alerting on latency.
* Latency is more stable and directly reflects the customer experience, reducing the need for constant re-tuning as system characteristics change.
* The adoption of 'wide events', which involves attaching as much metadata as possible to spans, unlocked discovery-based debugging.
* When engineers see the tangible benefits of discovery-based debugging... they naturally begin to enrich their own spans, turning observability from a top-down mandate into a self-reinforcing practice.
Tags
OpenTelemetry
Distributed Tracing
Observability
Queueing Systems
SLO/SLI