Session

Deep Dive Into Streaming and Batch ETLs With Lakeflow Spark Declarative Pipelines

Overview

Experience: In Person
Track: Data Engineering & Streaming
Industry: Enterprise Technology, Consulting & Services
Technologies: Lakeflow, Unity Catalog
Skill Level: Advanced

Let's take a deep dive into the new declarative ETL pipeline framework in Apache Spark™: Lakeflow Spark Declarative Pipelines (SDP). We'll peel back the layers to learn how SDP's high-level Python and SQL abstractions translate into lower-level Spark SQL and Spark Structured Streaming queries. During the talk, you will learn how SDP automatically resolves complex dependencies and builds optimized Directed Acyclic Graphs (DAGs) for both batch and streaming workloads. We will walk through the internal state management and orchestration logic that lets SDP handle retries and incremental processing out of the box, replacing thousands of lines of imperative "glue code." You will leave with a clear mental model of the engine's architecture, an understanding of SDP's pros and cons, and the skills to debug and optimize your pipelines. You'll also understand where and how to use SDP alongside your existing ETL pipelines.
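To give a flavor of the dependency resolution described above, here is a toy sketch of how a declarative engine can derive an execution order from dataset dependencies using a topological sort. All dataset names are invented for illustration; SDP's actual planner operates on its own internal graph representation, not on this data structure.

```python
from graphlib import TopologicalSorter

# Hypothetical dataset dependency graph: each dataset maps to the set of
# datasets it reads from. (Names are made up for this illustration.)
deps = {
    "raw_orders": set(),
    "raw_customers": set(),
    "cleaned_orders": {"raw_orders"},
    "orders_by_customer": {"cleaned_orders", "raw_customers"},
}

# A declarative engine resolves an execution order from the declared
# dependencies, instead of the user scripting the order imperatively.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The point of the declarative model is exactly this inversion: you declare what each dataset depends on, and the engine guarantees that every dataset is materialized only after its inputs are ready.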

Session Speakers

Jacek Laskowski

Freelance Data Engineer
books.japila.pl