Streaming ML Enrichment Framework Using Advanced Delta Table Features
- MLOps and DataOps
- Moscone South | Upper Mezzanine | 160
- 35 min
This talk covers the challenge of building a scalable framework for data scientists and ML engineers that can accommodate hundreds of generic or customer-specific ML models, run in both streaming and batch mode, and process 100+ million records per day from social media networks.
We achieved this goal using Spark and Delta. Our framework is built on careful use of Delta features such as Change Data Feed, selective merge, and Spark Structured Streaming from and into Delta tables. The data is saved in multiple Delta tables, where the structure of each table reflects a particular step in the overall flow. This brings great efficiency: downstream processing performs very few transformations, so even people without extensive experience writing ML pipelines and jobs can use the framework easily.
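To make the "selective merge" idea concrete, here is a minimal plain-Python sketch of the upsert semantics that Delta's MERGE INTO operation provides (matched rows get only the supplied columns updated, unmatched rows are inserted). The function name, table shape, and fields are illustrative assumptions, not code from the actual framework:

```python
# Sketch of selective-merge (upsert) semantics, modeled with plain dicts.
# In the real framework this would be a Delta MERGE INTO statement keyed
# on a record identifier; names and columns here are hypothetical.

def selective_merge(target, updates, key="id"):
    """Upsert `updates` into `target`, touching only the matched keys.

    target:  dict mapping key -> row dict (stands in for a Delta table)
    updates: list of row dicts (stands in for a micro-batch of changes)
    """
    for row in updates:
        k = row[key]
        if k in target:
            # WHEN MATCHED THEN UPDATE: overwrite only the supplied columns
            target[k].update(row)
        else:
            # WHEN NOT MATCHED THEN INSERT
            target[k] = dict(row)
    return target

# Example: enrich an existing post with a model result, insert a new post
posts = {1: {"id": 1, "text": "hello", "sentiment": None}}
batch = [{"id": 1, "sentiment": "positive"},
         {"id": 2, "text": "new post", "sentiment": None}]
selective_merge(posts, batch)
```

The key property this illustrates is that an enrichment step can write back only the columns it produced, leaving the rest of the row, and the rest of the table, untouched.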
At the heart of the framework is a series of Spark Structured Streaming jobs that continuously evaluate rules to determine which social media content should be processed by which model. Users can update these rules at any time, and the framework must automatically adjust its processing. In an environment like this, the ability to track records throughout the whole process, together with the atomicity of operations, is of utmost importance, and Delta tables provide all of this out of the box.
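The rule-evaluation step described above can be sketched as a simple routing function: each record is tested against the current rule set, and every match yields a (model, record) pair for downstream processing. This is an assumed simplification, not the framework's actual code; in production, each micro-batch of a Structured Streaming query would play the role of `records`, and the rules would be reloaded whenever users change them:

```python
# Hypothetical sketch of rule-based dispatch: deciding which ML model
# should process which piece of social media content.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    model: str                          # model that should handle matches
    predicate: Callable[[dict], bool]   # test applied to each record

def dispatch(records, rules):
    """Yield (model, record) pairs for every rule a record matches."""
    for rec in records:
        for rule in rules:
            if rule.predicate(rec):
                yield rule.model, rec

# Illustrative rules; real rules would come from a user-editable store
rules = [
    Rule("sentiment_en", lambda r: r["lang"] == "en"),
    Rule("image_tagger", lambda r: r.get("has_image", False)),
]
records = [
    {"lang": "en", "text": "great talk", "has_image": False},
    {"lang": "de", "text": "hallo", "has_image": True},
]
routed = list(dispatch(records, rules))
```

Because the rules are plain data evaluated per record, swapping in an updated rule set takes effect on the next batch without redeploying any job, which mirrors the live-updatable behavior the framework needs.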
In this talk we focus on the ideas behind the framework and on efficiently combining Structured Streaming with Delta tables. Key takeaways include an exploration of some lesser-known Delta table features and real-life experience from building an ML framework on scalable big data technologies, showing how capable and fast such a solution can be, even with minimal hardware resources.