This talk discusses developments within Apache Spark to allow deployment of MLlib models and pipelines within Structured Streaming jobs. MLlib has proven success and wide adoption for fitting Machine Learning (ML) models on big data. Scalability, expressive Pipeline APIs, and Spark DataFrame integration are key strengths.
Separately, the development of Structured Streaming has provided Spark users with intuitive, performant tools for building Continuous Applications. The smooth integration of batch and streaming APIs and workflows greatly simplifies many production use cases. Given the adoption of MLlib and Structured Streaming in production systems, a natural next step is to combine them: deploy MLlib models and Pipelines for scoring (prediction) in Structured Streaming.
However, before Apache Spark 2.3, many ML Pipelines could not be deployed in streaming. This talk discusses key improvements within MLlib to support streaming prediction. We will discuss currently supported functionality and opportunities for future improvements. With Spark 2.3, almost all MLlib workflows can be deployed for scoring in streaming, and we will demonstrate this live. The ability to deploy full ML Pipelines which include featurization greatly simplifies moving complex ML workflows from development to production. We will also include some discussion of technical challenges, such as featurization via Estimators vs. Transformers and DataFrame column metadata.
Session hashtag: #ML2SAIS
Joseph Bradley works as a Sr. Solutions Architect at Databricks, specializing in Machine Learning, and is an Apache Spark committer and PMC member. Previously, he was a Staff Software Engineer at Databricks and a postdoc at UC Berkeley, after receiving his Ph.D. in Machine Learning from Carnegie Mellon.