What are ML Pipelines?

Learn how ML pipelines automate and streamline the machine learning workflow from data preprocessing to model validation

Summary

  • Understand what ML pipelines are and how they connect preprocessing, feature extraction, model fitting, and validation into a unified workflow
  • Learn the difference between Transformers and Estimators as the two core pipeline stage types
  • Explore how Spark ML Pipelines enable scalable, distributed machine learning with native pipeline creation and tuning

Running machine learning algorithms typically involves a sequence of tasks: pre-processing, feature extraction, model fitting, and validation. For example, classifying text documents might involve text segmentation and cleaning, feature extraction, and training a classification model with cross-validation. Though there are many libraries we can use for each stage, connecting the dots is not as easy as it may look, especially with large-scale datasets. Most ML libraries are not designed for distributed computation, nor do they provide native support for pipeline creation and tuning.
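As a sketch of how such a workflow can be expressed end-to-end, the snippet below wires the stages of a text-classification task (tokenization, feature hashing, logistic regression) into a single Spark ML Pipeline and tunes it with cross-validation. The input `trainingDF` is an assumed DataFrame with "label" and "text" columns; the specific parameter values are illustrative.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Text segmentation: split raw text into tokens
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// Feature extraction: hash tokens into a fixed-size feature vector
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
// Model fitting: logistic regression on labels and features
val lr = new LogisticRegression().setMaxIter(10)

// Chain all stages into one pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Validation: grid-search over hyperparameters with cross-validation
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// trainingDF is assumed: a DataFrame with "label" and "text" columns
val cvModel = cv.fit(trainingDF)
```

Because the whole pipeline is a single Estimator, the cross-validator tunes all stages jointly rather than one library at a time, which is exactly the "connecting the dots" problem described above.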

ML Pipelines is a high-level API for MLlib that lives under the "spark.ml" package. A pipeline consists of a sequence of stages. There are two basic types of pipeline stages: Transformer and Estimator. A Transformer takes a dataset as input and produces an augmented dataset as output. E.g., a tokenizer is a Transformer that transforms a dataset with text into a dataset with tokenized words. An Estimator must first be fit on the input dataset to produce a model, which is a Transformer that transforms the input dataset. E.g., logistic regression is an Estimator that trains on a dataset with labels and features and produces a logistic regression model.
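The two stage types above can be illustrated with a minimal sketch. The DataFrames `df`, `trainingDF`, and `testDF` are assumptions: `df` has a "text" column, while `trainingDF` and `testDF` have "label" and "features" columns.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.Tokenizer

// A Transformer: transform() maps one dataset to an augmented one
val tokenizer = new Tokenizer()
  .setInputCol("text")    // existing column of raw text
  .setOutputCol("words")  // new column of tokenized words
val tokenized = tokenizer.transform(df)  // df assumed to have a "text" column

// An Estimator: fit() trains on a dataset and produces a model,
// and the resulting model is itself a Transformer
val lr = new LogisticRegression()
val lrModel = lr.fit(trainingDF)  // trainingDF assumed: "label" + "features"
val predictions = lrModel.transform(testDF)  // appends a "prediction" column
```

The symmetry is the key design point: because `lrModel` is just another Transformer, trained models compose with feature-processing stages inside one pipeline.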
