Optimizing Apache Spark™ on Databricks
In this course, you will explore the five key problems that account for the vast majority of performance issues in an Apache Spark application: skew, spill, shuffle, storage, and serialization. With examples based on 100 GB to 1+ TB datasets, you will investigate and diagnose sources of bottlenecks with the Spark UI and learn effective mitigation strategies. You will also discover new features introduced in Spark 3 that can automatically address common performance problems. Lastly, you will learn how to design and configure clusters for optimal performance based on specific team needs and concerns.
Duration
2 full days or 4 half days
Objectives
Articulate how the five most common performance problems in a Spark application can be mitigated to achieve better application performance
Summarize the most common performance problems associated with data ingestion and how to mitigate them
Articulate how new features in Spark 3.x can be employed to mitigate performance problems in your Spark applications
Configure a Spark cluster for maximum performance given specific job requirements
Prerequisites
Hands-on experience developing Apache Spark applications (6+ months). We recommend the Apache Spark Programming course to get started working with Spark.
Intermediate experience in Python or Scala
Outline
Review of Spark architecture and Spark UI
Predicate pushdowns
Optimization with Adaptive Query Execution (AQE)
Designing and configuring clusters for high performance
Upcoming Public Classes
Public Class Registration
If your company has purchased success credits or has a learning subscription, please fill out the public training requests form. Otherwise, you can register below.
Private Class Delivery
If your organization would like to request a private delivery of the course, please fill out the request form below.