The Beauty of Delta for Polyglot Data and ML Workloads


TYPE: Lightning Talk
TRACK: Data Engineering and Streaming
INDUSTRY: Enterprise Technology
TECHNOLOGIES: Apache Spark, Delta Lake, Orchestration
SKILL LEVEL: Intermediate

Delta Lake is a powerful open table format that we leverage for 1,000+ datasets with 20,000+ daily job runs. Delta Lake pairs extremely well with Apache Spark™: exactly-once semantics when micro-batching make it a simple and very robust way to transport and transform data. It can also be used from Python and Rust, making it ready to use for ML. In this talk, I will show how my team uses Delta as the backbone for operational data streams, exploratory data analysis, and reproducible ML workloads. We will discuss how we keep tabs on data quality and effectively monitor streams using table metadata, all enabled by Delta Lake. I will show how we use Delta:


  • to implement a write-audit-publish (WAP) pattern that prevents bad data from entering or leaving the platform
  • to run jobs only when data has changed, improving the efficiency of our compute usage
  • to build auto-ML models that automatically monitor and alert on Spark Structured Streaming job behavior
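To give a flavor of the first two patterns, here is a minimal pure-Python sketch of the write-audit-publish gate and the change-detection check. All names and the in-memory "tables" are hypothetical stand-ins for illustration; in practice the staging and production areas are Delta tables, publishing is an atomic Delta commit, and the version numbers come from the Delta transaction log:

```python
from typing import Callable, Dict, List

def write_audit_publish(
    batch: List[dict],
    audit: Callable[[List[dict]], bool],
    staging: List[dict],
    production: List[dict],
) -> bool:
    """Write-audit-publish sketch: land data in staging, publish only if checks pass.

    Hypothetical in-memory stand-in; with Delta Lake the staging and
    production areas would be Delta tables and the publish step an
    atomic commit.
    """
    staging.extend(batch)            # WRITE into the staging area
    if audit(staging):               # AUDIT: run data-quality checks
        production.extend(staging)   # PUBLISH to the production area
        staging.clear()
        return True
    staging.clear()                  # reject the bad batch instead of publishing
    return False

def should_run(current_version: int, last_processed_version: int) -> bool:
    """Change-detection sketch: trigger a job only when the table has new
    commits since the last run (Delta versions increase monotonically)."""
    return current_version > last_processed_version
```

The audit callback is where schema or quality constraints would live; because the publish step is a single atomic commit, downstream readers never see half-validated data.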


Micha Kunze

Lead Data Engineer