Session

Scaling Data Engineering Pipelines: Preparing Credit Card Transactions Data for Machine Learning

Overview

Wednesday, June 11, 4:10 pm

Experience: In Person
Type: Breakout
Track: Data Engineering and Streaming
Industry: Enterprise Technology, Financial Services
Technologies: Apache Spark, Delta Lake, Databricks Workflows
Skill Level: Intermediate
Duration: 40 min

We discuss two real-world big data engineering use cases, focusing on building stable pipelines and managing storage at petabyte scale. The first use case shows how Delta Lake was used to optimize data pipelines, yielding an 80% reduction in query time and a 70% reduction in storage space. The second demonstrates how the Workflows ‘ForEach’ operator runs compute-intensive pipelines across multiple clusters, cutting processing time from months to days. This approach relies on a reusable design pattern that isolates notebooks into units of work, so that data scientists can develop and test them independently.
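
As a rough illustration of the ‘ForEach’ pattern described above, the sketch below assumes the workload can be split by calendar month and that one notebook processes exactly one month as its unit of work. The notebook path, task keys, parameter name, and monthly split are all hypothetical and are not taken from the session; the field names (`for_each_task`, `inputs`, `concurrency`, `{{input}}`) reflect our understanding of the Databricks Jobs task format and should be checked against the current documentation.

```python
import json
from datetime import date


def month_partitions(start: date, end: date) -> list[str]:
    """Return 'YYYY-MM' labels for every month from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def build_for_each_task(inputs: list[str], notebook_path: str, concurrency: int) -> dict:
    """Build a job task definition that runs one notebook per input value.

    Assumption: the nested notebook receives the current input through the
    '{{input}}' reference and reads it as a notebook parameter named 'month'.
    """
    return {
        "task_key": "process_all_months",        # hypothetical task key
        "for_each_task": {
            "inputs": json.dumps(inputs),          # inputs passed as a JSON array string
            "concurrency": concurrency,            # how many iterations run in parallel
            "task": {
                "task_key": "process_one_month",  # hypothetical task key
                "notebook_task": {
                    "notebook_path": notebook_path,
                    "base_parameters": {"month": "{{input}}"},
                },
            },
        },
    }


# Hypothetical notebook path; each iteration handles a single month.
task = build_for_each_task(
    month_partitions(date(2024, 11, 1), date(2025, 2, 1)),
    notebook_path="/Workflows/process_month",
    concurrency=4,
)
print(json.dumps(task, indent=2))
```

Because the notebook only depends on its `month` parameter, the same notebook can be run by hand for a single month during development and then fanned out unchanged by the job definition.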

Session Speakers

Brandon DeShon

Director, Data Scientist
Mastercard

Luke Garzia

Lead Data Engineer
Mastercard