Talk Advanced CC BY-SA 4.0

Architecting Distributed Telemetry: OpenTelemetry Storage Spans and Tail-Sampling for Multi-Tenant Lakehouses

Approved

Session Description

Multi-tenant data lakehouses that span distributed infrastructure are hard to observe. Traditional application logging and metric collectors frequently saturate under high-volume data engineering runs, obscuring the root causes of execution bottlenecks, query planning overhead, and network write constraints across large storage systems. To keep parallel processing frameworks running efficiently, infrastructure engineering teams need open-source, vendor-neutral instrumentation directly within core data execution paths.

This technical talk covers the design and implementation of unified logging, metric collection, and trace propagation across distributed data platform pipelines. We will break down how to apply OpenTelemetry semantic conventions on parallel processing nodes to track granular lifecycle events without slowing computation. A significant portion of the session covers reducing telemetry overhead with tail-sampling collectors, showing how to batch and evaluate completed traces and dynamically drop repetitive system heartbeats while retaining 100 percent of anomalous pipeline error states. Attendees will leave with an actionable blueprint to isolate failing data pipelines, trace storage-layer operations, and budget the system resources consumed by platform monitoring tools.
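As a rough illustration of the tail-sampling idea described above, the sketch below buffers spans per trace and decides only once the whole trace is available. It is a minimal pure-Python model, not the collector's configuration format or API: the `Span` fields, the `heartbeat` span name, and the `baseline_ratio` parameter are all assumptions made for this example.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record (not the OpenTelemetry data model).
    trace_id: str
    name: str
    is_error: bool = False


def keep_trace(spans, baseline_ratio=0.1):
    """Decide on a whole trace after all of its spans have been buffered."""
    # Policy 1: always keep a trace that contains at least one error span.
    if any(s.is_error for s in spans):
        return True
    # Policy 2: drop traces made up only of repetitive heartbeat spans.
    if all(s.name == "heartbeat" for s in spans):
        return False
    # Policy 3: keep a deterministic fraction of the remaining traces,
    # bucketed by a hash of the trace id so the decision is repeatable.
    bucket = zlib.crc32(spans[0].trace_id.encode()) % 100
    return bucket < baseline_ratio * 100


def tail_sample(spans, baseline_ratio=0.1):
    """Group spans by trace id and return only the traces worth exporting."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return {
        trace_id: group
        for trace_id, group in traces.items()
        if keep_trace(group, baseline_ratio)
    }
```

The key design point is that the decision is made per trace rather than per span, so an error anywhere in a pipeline run preserves the full context around it. A real collector also has to bound how long it waits for late spans and how many traces it buffers in memory.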

Key Takeaways

Unified Pipeline Tracing Blueprint: Practical patterns for mapping and extending OpenTelemetry trace spans across complex, multi-tenant database clusters to cleanly isolate slow-running operations.
Telemetry Volume and Overhead Control: Implementation guidelines for deploying local OpenTelemetry collectors that tail-sample trace data, saving cluster resources while preserving the telemetry needed for debugging.
System Metric Conversion: Clear methods for converting unstructured platform runtime telemetry into structured engineering dashboards focused on cluster storage volume, network load, and memory allocation.
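To make the metric-conversion takeaway concrete, here is a minimal sketch of turning unstructured runtime log lines into a per-tenant storage metric. The `tenant=` and `bytes_written=` tokens are a hypothetical log format invented for this example; a real platform's logs would need their own pattern.

```python
import re
from collections import Counter

# Hypothetical log layout, assumed for illustration only:
#   "... tenant=<name> ... bytes_written=<integer>"
LINE = re.compile(r"tenant=(?P<tenant>\S+) .*?bytes_written=(?P<bytes>\d+)")


def bytes_written_by_tenant(lines):
    """Aggregate bytes written per tenant, skipping lines that do not match."""
    totals = Counter()
    for line in lines:
        match = LINE.search(line)
        if match:
            totals[match["tenant"]] += int(match["bytes"])
    return dict(totals)
```

The resulting dictionary is the kind of structured value that can feed a dashboard panel for storage volume per tenant. Lines that do not match are ignored here rather than raising, which is a choice that suits noisy logs but can hide format drift.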

References

https://opentelemetry.io/docs/specs/semconv/

https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor

Session Categories

Technology architecture

Engineering practice - productivity, debugging

Talk License: CC BY-SA 4.0