Talk Intermediate

Data engineering at internet scale

Approved

Session Description

Modern internet platforms generate continuously evolving, high-volume event streams that must be processed reliably, efficiently, and in near real time. While open-source technologies such as Apache Spark, Hive, Airflow, Kubernetes, and query federation engines have made large-scale data processing accessible, operating these systems at production scale introduces challenges rarely covered in documentation.

This talk explores practical engineering lessons from building and operating large distributed data platforms using open-source technologies, focusing on system design decisions, performance trade-offs, observability challenges, and scalability patterns.

The session walks through how real-world data engineering problems, such as schema evolution, distributed query execution, pipeline reliability, metadata management, and compute optimization, are solved using open-source ecosystems.

The talk places special focus on query federation and metadata abstraction, as inspired by open-source projects such as Apache Lens, and on how organizations contribute back to the ecosystem while solving internal scale challenges.

Key Takeaways

The goal of this talk is to help engineers understand:

What breaks first at scale
How open source enables innovation
How engineers can meaningfully contribute to FOSS projects while solving production problems

References

https://blogs.apache.org/lens

Session Categories

Story of a FOSS project - from inception to growth