Machine Learning workflows involve complex sequences of data transformations, learning algorithms, and parameter tuning. Spark ML Pipelines, introduced in Spark 1.2, have grown into a powerful framework for developing ML workflows. This talk will cover basic Pipeline concepts and then demonstrate their usage:
(1) Building: Pipelines simplify the process of specifying an ML workflow.
(2) Debugging: Pipelines and DataFrames permit users to inspect and debug the workflow.
(3) Tuning: Built-in support for parameter tuning helps users optimize ML performance.
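As a rough illustration of the three steps above, here is a minimal sketch in plain Python. It is a toy model of the pipeline idea, not the Spark ML API: the classes `Scaler`, `ThresholdClassifier`, and `Pipeline`, the `accuracy` helper, and the sample data are all hypothetical names invented for this example. It assumes each stage exposes a `fit` method that returns a function transforming a list of row dictionaries.

```python
# Toy illustration of the pipeline pattern in plain Python.
# NOT the Spark ML API: every name below is hypothetical.


class Scaler:
    """Stage that learns the maximum of column "x" and rescales by it."""

    def fit(self, rows):
        max_x = max(r["x"] for r in rows)
        # The fitted stage adds a "scaled" column to each row.
        return lambda rs: [{**r, "scaled": r["x"] / max_x} for r in rs]


class ThresholdClassifier:
    """Stage with one tunable parameter: the decision threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, rows):
        t = self.threshold
        # The fitted stage adds a "prediction" column to each row.
        return lambda rs: [{**r, "prediction": int(r["scaled"] >= t)} for r in rs]


class Pipeline:
    """Chains stages: each is fit on the output of the previous ones."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model(rows)  # feed transformed rows to the next stage
            fitted.append(model)

        def transform(rs):
            for m in fitted:
                rs = m(rs)
            return rs

        return transform


def accuracy(rows):
    """Fraction of rows whose prediction matches the label."""
    return sum(r["prediction"] == r["label"] for r in rows) / len(rows)


train = [
    {"x": 1.0, "label": 0},
    {"x": 2.0, "label": 0},
    {"x": 3.0, "label": 1},
    {"x": 4.0, "label": 1},
]

# (1) Building: declare the workflow as an ordered list of stages.
model = Pipeline([Scaler(), ThresholdClassifier(0.5)]).fit(train)

# (2) Debugging: intermediate columns stay in the output for inspection.
output = model(train)

# (3) Tuning: refit the whole pipeline for each candidate parameter value.
# For brevity this scores on the training rows; real tuning would use
# held-out data or cross-validation.
grid = [0.3, 0.6, 0.9]
scores = {
    t: accuracy(Pipeline([Scaler(), ThresholdClassifier(t)]).fit(train)(train))
    for t in grid
}
best_threshold = max(scores, key=scores.get)
```

In Spark itself these roles are played by its own Pipeline, Estimator, and Transformer abstractions operating on DataFrames; the sketch only mirrors the overall shape of such a workflow.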
Joseph Bradley works as a Sr. Solutions Architect at Databricks, specializing in Machine Learning, and is an Apache Spark committer and PMC member. Previously, he was a Staff Software Engineer at Databricks and a postdoc at UC Berkeley, after receiving his Ph.D. in Machine Learning from Carnegie Mellon.