Streaming Data into Delta Lake with Rust and Kafka
- Data Lakes, Data Warehouses and Data Lakehouses
- Medien und Unterhaltung
- Moscone South | Level 2 | 202
- 35 min
Scribd's data architecture was originally batch-oriented, but in the last couple years, we introduced streaming data ingestion to provide near-real-time ad hoc query capability, mitigate the need for more batch processing tasks, and set the foundation for building real-time data applications.
Kafka and Delta Lake are the two key components of our streaming ingestion pipeline. Various applications and services write messages to Kafka as events are happening. We were tasked with getting these messages into Delta Lake quickly and efficiently.
Our first solution was to deploy Spark Structured Streaming jobs. This got us off the ground quickly, but had some downsides.
Since Delta Lake and the Delta transaction protocol are open source, we kicked off a project to implement our own Rust ingestion daemon. We were confident we could deliver a Rust implementation since our ingestion jobs are append only. Rust offers high performance with a focus on code safety and modern syntax.
In this talk I will describe Scribd's unique approach to ingesting messages from Kafka topics into Delta Lake tables. I will describe the architecture, deployment model, and performance of our solution, which leverages the kafka-delta-ingest Rust daemon and the delta-rs crate hosted in auto-scaling ECS services. I will discuss foundational design aspects for achieving data integrity such as distributed locking with DynamoDb to overcome S3's lack of "PutIfAbsent" semantics, and avoiding duplicates or data loss when multiple concurrent tasks are handling the same stream. I'll highlight the reliability and performance characteristics we've observed so far. I'll also describe the Terraform deployment model we use to deliver our 70-and-growing production ingestion streams into AWS.