Optimizing Apache Spark™ on Databricks

Description

In this course, you will explore the five key problems that account for the vast majority of performance issues in an Apache Spark application: skew, spill, shuffle, storage, and serialization. With examples based on datasets ranging from 100 GB to more than 1 TB, you will use the Spark UI to investigate and diagnose bottlenecks and learn effective mitigation strategies. You will also discover new features introduced in Spark 3 that can automatically address common performance problems. Lastly, you will learn how to design and configure clusters for optimal performance based on specific team needs and concerns.

Duration

2 full days or 4 half days

Objectives

  • Articulate how the five most common performance problems in a Spark application can be mitigated to achieve better application performance

  • Summarize the most common performance problems associated with data ingestion and how to mitigate them

  • Articulate how new features in Spark 3.x can be employed to mitigate performance problems in your Spark applications

  • Configure a Spark cluster for maximum performance given specific job requirements

Prerequisites

  • Hands-on experience developing Apache Spark applications (6+ months). We recommend the Apache Spark Programming course to get started working with Spark.

  • Intermediate experience in Python or Scala

Outline

Day 1

  • Review of Spark architecture and Spark UI

  • Skew

  • Spill

  • Shuffle

  • Storage

  • Serialization
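To give a feel for the skew topic listed above, here is a minimal sketch in plain Python, not Spark code and not taken from the course materials. It models hash partitioning of a dataset in which one key dominates, then shows how adding a salt could spread that key over several partitions. The function name `partition_sizes`, the sample keys, and the partition and salt counts are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions, num_salts=1):
    """Count records per partition in a toy hash-partitioning model.

    With num_salts=1 every record of a key lands in the same partition.
    With num_salts > 1 a per-record salt shifts the target partition,
    which spreads a hot key across several partitions.
    """
    sizes = Counter()
    for i, key in enumerate(keys):
        salt = i % num_salts  # deterministic stand-in for a random salt
        base = zlib.crc32(key.encode("utf-8"))  # stable hash across runs
        sizes[(base + salt) % num_partitions] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


# Hypothetical skewed dataset: one key holds 90% of the records.
keys = ["US"] * 900 + ["DE", "FR", "JP", "BR"] * 25

unsalted = partition_sizes(keys, num_partitions=8)
salted = partition_sizes(keys, num_partitions=8, num_salts=8)

print("largest partition without salting:", max(unsalted))
print("largest partition with salting:   ", max(salted))
```

In a real Spark job the salt would typically be appended to the join or grouping key on both sides of the operation; this sketch only illustrates why the largest partition shrinks.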

Day 2

  • Ingestion basics

  • Predicate pushdowns

  • Disk partitioning

  • Z-ordering

  • Bucketing

  • Optimization with Adaptive Query Execution (AQE)

  • Designing and configuring clusters for high performance
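As a rough illustration of the Adaptive Query Execution topic above, the sketch below collects a few Spark 3.x configuration keys related to AQE and applies them to a session builder. The configuration keys are standard Spark SQL settings; the values, the `AQE_SETTINGS` name, and the `apply_settings` helper are assumptions made for this example and are not course recommendations.

```python
# Spark 3.x configuration keys related to Adaptive Query Execution.
# The values below are illustrative assumptions, not tuning advice.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_settings(builder, settings):
    """Apply each key/value pair to a SparkSession-style builder.

    The builder only needs a config(key, value) method that returns
    a builder, as SparkSession.builder does in PySpark.
    """
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, a session could then be created with `apply_settings(SparkSession.builder, AQE_SETTINGS).getOrCreate()`. On recent Spark and Databricks runtimes AQE may already be enabled by default, so explicit settings like these are mainly useful for experimentation.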

Upcoming Public Classes

Public Class Registration

If your company has purchased success credits or has a learning subscription, please fill out the public training requests form. Otherwise, you can register below.

Private Class Delivery

If your organization would like to request a private delivery of the course, please fill out the request form below.

Questions?

If you have any questions, please refer to our Frequently Asked Questions page.