
Backfill Streaming Data Pipelines in Kappa Architecture

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Engineering

Difficulty

  • Intermediate

Room

  • Moscone South | Upper Mezzanine | 155

Duration

  • 35 min

Download session slides

Overview

Streaming data pipelines can fail for various reasons. Because source data such as Kafka topics often has limited retention, prolonged job failures can lead to data loss. Streaming jobs therefore need to be backfillable at all times so that failures do not cause data loss.



One solution is to increase the source's retention so that backfilling is simply a matter of replaying the source streams, but extending Kafka retention is very costly at Netflix's data scale. Another solution is to use the source data stored in the data warehouse (DWH), an approach commonly known as the Lambda architecture. However, this introduces significant code duplication, because engineers must build and maintain a separate, functionally equivalent batch job.
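
For context on the replay approach, a backfill that simply rewinds the Kafka source might look like the sketch below, using Flink's open-source Kafka connector. It only works while the requested offsets are still inside the topic's retention window; the broker address, topic name, group id, and timestamp are placeholders, not values from the talk.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaReplayBackfill {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Rewind to a past timestamp instead of the committed group offsets.
        // This only succeeds while Kafka still retains the offsets at that timestamp.
        long backfillStartMillis = 1_650_000_000_000L; // hypothetical backfill start
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")   // placeholder broker
                .setTopics("events")                 // placeholder topic
                .setGroupId("backfill-job")
                .setStartingOffsets(OffsetsInitializer.timestamp(backfillStartMillis))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-replay");
        events.print(); // stand-in for the real pipeline logic

        env.execute("kafka-replay-backfill");
    }
}
```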



At Netflix, we have created the Iceberg Source Connector to provide backfilling capabilities to Flink streaming applications. It allows Flink to stream data stored in Apache Iceberg while mirroring Kafka's ordering semantics, enabling us to backfill large-scale stateful Flink pipelines at low retention cost.
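
Netflix's connector is internal, but the open-source Apache Iceberg project ships a FLIP-27 Flink source with comparable streaming-read capabilities (the Kafka-ordering mirroring described above is not part of it). Below is a minimal sketch of how such a backfill source might be wired up, assuming the OSS Iceberg Flink connector; the table path is a placeholder, and the internal connector may differ.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;
import org.apache.iceberg.flink.source.StreamingStartingStrategy;
import org.apache.iceberg.flink.source.assigner.SimpleSplitAssignerFactory;

public class IcebergBackfillJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Load the Iceberg table that archives the Kafka topic (path is a placeholder).
        TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://warehouse/db/events");

        // Read the table as an incremental stream, starting from the earliest snapshot
        // so the backfill can cover history no longer retained in Kafka.
        IcebergSource<RowData> source = IcebergSource.forRowData()
                .tableLoader(tableLoader)
                .assignerFactory(new SimpleSplitAssignerFactory())
                .streaming(true)
                .streamingStartingStrategy(
                        StreamingStartingStrategy.INCREMENTAL_FROM_EARLIEST_SNAPSHOT)
                .monitorInterval(Duration.ofSeconds(30))
                .build();

        DataStream<RowData> rows = env.fromSource(
                source,
                WatermarkStrategy.noWatermarks(),
                "iceberg-backfill-source",
                TypeInformation.of(RowData.class));

        rows.print(); // stand-in for the same stateful pipeline logic used against Kafka

        env.execute("iceberg-backfill");
    }
}
```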

Session Speakers

Xinran Waibel

Senior Data Engineer

Netflix

Sundaram Ananthanarayanan

Senior Software Engineer

Netflix Inc
