
Databricks' commitment to education is at the center of the work we do. Through Instructor-Led Training, Certification, and Self-Paced Training, Databricks Academy provides strong pathways for users to learn Apache Spark and Databricks, and to push their knowledge to the next level.

In that spirit, we are pleased to present some great new free content. First up is a series of short videos to help anyone get started with Machine Learning on Apache Spark and Databricks. We follow that with a sample module from one of our 3-day Instructor-Led Training classes.

Series: An Introduction to Machine Learning

In the past few weeks, Databricks Academy launched two new self-paced courses: Structured Streaming and Introduction to Data Science and Machine Learning. As part of the Machine Learning class launch, we created a series of videos featuring course developer Conor Murphy. If you’d like to follow along with Conor on your own computer, simply download the code. If you don’t have a Databricks account yet, get started for free on Databricks Community Edition.

In this video, Conor introduces the core concepts of Machine Learning and Distributed Learning, and how Distributed Machine Learning is done with Apache Spark. He also sets up the goal of the entire video series: building an end-to-end machine learning pipeline using Databricks.

In order to work on a data set, the data must be imported into a Databricks workspace. In this video, Conor provides a concrete example of importing data into Databricks. Once the data is loaded, Conor uses Databricks to do Exploratory Data Analysis (EDA) and Visualization of salient aspects of the data set.

There are three main abstractions in Apache Spark’s Machine Learning Library: Transformers, Estimators, and Pipelines. In this video, Conor discusses the transform() and fit() methods implemented in Transformers and Estimators, respectively, and how they are used to construct a full machine learning Pipeline. Conor then walks through the implementation of such a pipeline using Spark in Databricks.
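To make the relationship between the three abstractions concrete, here is a minimal plain-Python sketch of the contract. It is not Spark's API and not the code from the video: the class names AddRatio, MeanCenterer, and MeanCentererModel are hypothetical, and rows are simple dictionaries rather than DataFrames. The point is only that a Transformer exposes transform(), an Estimator exposes fit() and returns a fitted Transformer, and a Pipeline chains stages of both kinds.

```python
from statistics import mean


class AddRatio:
    """Transformer-like stage (hypothetical): nothing is learned from the data."""

    def transform(self, rows):
        return [{**r, "ratio": r["x"] / r["y"]} for r in rows]


class MeanCenterer:
    """Estimator-like stage (hypothetical): fit() learns a mean, returns a model."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        mu = mean(r[self.column] for r in rows)
        return MeanCentererModel(self.column, mu)


class MeanCentererModel:
    """The fitted result of MeanCenterer; it behaves like a Transformer."""

    def __init__(self, column, mu):
        self.column = column
        self.mu = mu

    def transform(self, rows):
        out_col = self.column + "_centered"
        return [{**r, out_col: r[self.column] - self.mu} for r in rows]


class Pipeline:
    """Chains stages; fitting it fits every Estimator-like stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator-like: learn from the data first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """A fitted pipeline: every stage is now Transformer-like."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


rows = [{"x": 2.0, "y": 1.0}, {"x": 6.0, "y": 2.0}]
model = Pipeline([AddRatio(), MeanCenterer("ratio")]).fit(rows)
out = model.transform(rows)
print(out)
```

Spark ML follows the same shape at a much larger scale: fitting a Pipeline yields a PipelineModel, which can then transform new data.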

In this video, Conor prepares the data for final model fitting. He first demonstrates how to create a Train/Test Split of the data set and discusses why this technique matters for detecting model overfitting. Next, Conor shows how to use Spark ML Transformers to complete data preparation via a Featurization Pipeline.
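The idea behind a Train/Test Split can be sketched in a few lines of plain Python. This is an illustration under our own assumptions rather than the code from the video: the helper train_test_split, the 80/20 ratio, and the seed are all hypothetical choices. In Spark, the DataFrame method randomSplit plays this role.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fit only on the training rows. A model that scores well on them but poorly on the held-out test rows is likely overfitting.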

Finally, Conor completes the end-to-end machine learning pipeline by training models on the full Pipelines developed throughout this series. He also shows how to use evaluation metrics to assess the performance of these Pipelines. Having selected a final model, Conor demonstrates how to save the model for later use.
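As a rough illustration of the evaluate-then-save step, the plain-Python sketch below computes root mean squared error (RMSE) on held-out data and then writes a model to disk and reads it back. Everything here is an assumption made for the example: the video does not necessarily use RMSE, the linear model and its parameters are hypothetical, and JSON stands in for Spark ML's own model persistence.

```python
import json
import math
import os
import tempfile


def rmse(labels, predictions):
    """Root mean squared error between true labels and predictions."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def predict(model, x):
    """Apply a simple linear model stored as a dictionary of parameters."""
    return model["slope"] * x + model["intercept"]


# Hypothetical fitted parameters and held-out test data.
model = {"slope": 2.0, "intercept": 1.0}
test_x = [1.0, 2.0, 3.0]
test_y = [3.0, 5.0, 8.0]

predictions = [predict(model, x) for x in test_x]
score = rmse(test_y, predictions)
print(f"RMSE on the test set: {score:.4f}")

# Save the selected model, then load it back for later use.
path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w") as f:
    json.dump(model, f)
with open(path) as f:
    restored = json.load(f)
```

Scoring on the test set rather than the training set is what makes the metric a fair basis for choosing between candidate models.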

We hope that you find these videos informative, as well as entertaining! The full video playlist is here. You can learn more about Machine Learning using Databricks in the Introduction to Data Science and Machine Learning course available at Databricks Academy.

Apache Spark Cost-Based Optimizer

Here we present an example module from Apache Spark Tuning and Best Practices, one of Databricks Academy’s 3-day Instructor-Led Training courses.

In this video, Databricks Instructor Jacob Parr presents the Apache Spark Cost-Based Optimizer. He first explains the various optimizers and how they are used within Apache Spark, and then goes into detail on the Cost-Based Optimizer, providing examples on actual data with code samples.

To learn more about the courses Databricks offers, check out Databricks Academy. If you’d like to get started with corporate training, contact us today!