How can real-time event streaming platforms, handling millions of events and complex data processing, maintain peak performance and reliability? Doing so has historically been complex. Recent agent changes and the addition of messaging semantic conventions in OpenTelemetry make it well suited to monitoring highly distributed event-driven architectures (EDA) built on platforms like Kafka. In this session we will discuss how these changes help standardize telemetry, and explain how span links can stitch together the multiple traces that a single transaction produces in an EDA.
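As a minimal, library-agnostic sketch of the span-link idea (the classes below are illustrative stand-ins, not the OpenTelemetry API), a consumer span that processes a Kafka batch can carry links back to the producer span of each message, so the separate traces behind one transaction stay connected:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SpanContext:
    """Identifies a span within a trace (illustrative stand-in)."""
    trace_id: str
    span_id: str

@dataclass
class Span:
    """A span that may link to span contexts from other traces."""
    name: str
    links: list = field(default_factory=list)

def start_batch_span(messages):
    # Each Kafka message carries its producer's span context (e.g. in
    # record headers); the batch-processing span links to all of them.
    span = Span(name="kafka.process_batch")
    for msg in messages:
        span.links.append(SpanContext(msg["trace_id"], msg["span_id"]))
    return span

batch = [
    {"trace_id": "a" * 32, "span_id": "1" * 16, "value": b"event-1"},
    {"trace_id": "b" * 32, "span_id": "2" * 16, "value": b"event-2"},
]
span = start_batch_span(batch)
# One consumer span now references two distinct producer traces.
```

In a real OpenTelemetry setup the same shape is expressed by passing links when the consumer span is started; the sketch only shows why links, rather than a single parent, fit batch consumption.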
The talk will also cover how OTel enables automatic anomaly detection, which is particularly useful for identifying issues such as consumer lag, increased event-processing latency, and partition failures. By leveraging context propagation, OTel tracks end-to-end latency across the entire Kafka ecosystem, including producers, brokers, and consumers.
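Context propagation works by carrying a W3C `traceparent` value in each Kafka record's headers, so the consumer can continue the trace the producer started. A minimal stdlib-only sketch of that mechanism (the function names are illustrative, not a specific library's API):

```python
import re

# W3C Trace Context format: version-traceid-spanid-flags
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

def inject(headers, trace_id, span_id, sampled=True):
    """Producer side: write the traceparent into Kafka record headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"

def extract(headers):
    """Consumer side: recover (trace_id, parent_span_id), or None."""
    m = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    return (m.group(1), m.group(2)) if m else None

headers = {}
inject(headers, "ab" * 16, "cd" * 8)
parent = extract(headers)
# The consumer span is started with this parent, giving one trace that
# spans producer, broker hop, and consumer, so end-to-end latency is
# just the time between the first and last span in that trace.
```

OpenTelemetry's instrumentation does this injection and extraction automatically; the sketch only makes the header round-trip explicit.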
The talk covers real-world examples from gaming platforms and data systems that have enabled OTel for Kafka monitoring.
This talk provides significant benefits to the ecosystem by enhancing observability, performance, and reliability in event-driven systems like Kafka. By leveraging OpenTelemetry’s standardized telemetry and context propagation, teams can gain end-to-end visibility across producers, brokers, and consumers, making it easier to monitor real-time event streams. The ability to set automatic alerts for critical metrics like consumer lag and partition failures enables proactive anomaly detection and faster root cause diagnosis. This leads to improved system performance, reduced latency, and quicker issue resolution. Additionally, OpenTelemetry simplifies instrumentation and promotes cross-team collaboration, streamlining operations and future-proofing the monitoring strategy for scalable, fault-tolerant event-driven architectures.
This talk is therefore well suited to companies and DevOps teams running applications that serve millions of real-time requests or events and that want to adopt scalable observability for event-driven architectures. Since the talk also covers the core principles of OpenTelemetry, individuals and development teams looking to build robust monitoring will benefit equally.
OpenTelemetry’s trace-to-metric correlation allows you to not just monitor Kafka's performance but also diagnose the root causes of issues like consumer lag, partition failures, and queue delays. By integrating both tracing and metrics, you gain deeper insights into Kafka's event flow, enabling faster anomaly detection, root cause analysis, and ultimately, quicker resolution of performance issues. This ensures that your event-driven architecture can maintain real-time reliability, which is especially critical in high-performance systems like gaming platforms (e.g., PokerBaazi). Major metrics captured include Kafka producer total records sent, Kafka broker request processing time, Kafka consumer total records received, consumer messages per topic, total cluster throughput, and OpenTelemetry Kafka errors.
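Consumer lag, the headline metric above, is simple to derive from the offsets Kafka already exposes: per partition, it is the log end offset minus the consumer group's last committed offset. A small illustrative sketch of that computation and a threshold-based alert check (assumed helper names, not a Kafka client API):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = log end offset - last committed offset.

    A partition with no committed offset is treated as fully lagging.
    """
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

def lagging_partitions(lag, threshold):
    """Partitions whose lag meets or exceeds the alert threshold."""
    return sorted(p for p, n in lag.items() if n >= threshold)

# Example: partition 0 is 150 records behind, partition 1 is caught up.
lag = consumer_lag({0: 1_000, 1: 500}, {0: 850, 1: 500})
alerts = lagging_partitions(lag, threshold=100)
```

Emitting this value as a gauge alongside the consumer's spans is what makes the trace-to-metric correlation useful: an alert on the gauge points directly at the traces of the slow partition's consumers.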
This proposal seems less focused on FOSS. The references are also not sufficient to evaluate.
Not enough meaningful references to evaluate the proposal and the proposers' personal experience implementing these things. Also, that is not a good use of the "key takeaways" section. AI? I also agree with another review that this proposal does not seem very aligned with a FOSS themed event.
For future submissions, we recommend that you either reframe your talk to have a stronger FOSS-specific angle or provide more concrete examples and references to demonstrate your personal experience and expertise in the topic. This will help the program committee better assess the value of your talk to the IndiaFOSS community.