In this course, you will explore the five key problems that account for the vast majority of performance issues in an Apache Spark application: skew, spill, shuffle, storage, and serialization. Working through examples based on 100 GB to 1+ TB datasets, you will use the Spark UI to investigate and diagnose bottlenecks and learn effective mitigation strategies. You will also discover new features introduced in Spark 3 that can automatically address common performance problems. Lastly, you will learn how to design and configure clusters for optimal performance based on specific team needs and concerns.
2 full days or 4 half days
Articulate how the five most common performance problems in a Spark application can be mitigated to achieve better application performance
Summarize the most common performance problems associated with data ingestion and how to mitigate them
Articulate how new features in Spark 3.x can be employed to mitigate performance problems in your Spark applications
Configure a Spark cluster for maximum performance given specific job requirements