Sink Framework Evolution in Apache Flink
- Data Lakes, Data Warehouses and Data Lakehouses
- Moscone South | Level 2 | 215
- 35 min
Apache Flink is one of the most popular frameworks for unified stream and batch processing. Like every other big data framework, Apache Flink offers connectors to different external systems to read from and write to. We refer to connectors for writing to external systems as sinks. Over the years, multiple frameworks existed inside Apache Flink for building sinks. The Apache Flink community also noticed the latest trend of ingesting real-time data directly into data lakes for further usage. Therefore with Apache Flink 1.15, we released the next iteration of our sink framework. We designed it to accommodate the needs of modern data lake connectors i.e. lazy file compaction, user-defined shuffling.
In this talk, we first give a brief historical glimpse of the evolution of the frameworks that started as a kind of a simple map operation until a custom operator model that simplified two-phase commit semantics. Secondly, we do a deep dive into Apache Flink’s fault tolerance model to explain how the last iteration of the sink framework supports exactly-once processing and complex operations important for delta lakes.
In summary, this talk introduces the principles behind the sink framework in Apache Flink and gives a starting point for developers building a new connector for Apache Flink.