Metrics-Driven Tuning of Apache Spark at Scale


Tuning Apache Spark can be complex and difficult, since there are many different configuration parameters and metrics. As the Spark applications running on LinkedIn’s clusters become more diverse and numerous, it is no longer feasible for a small team of Spark experts to help individual users debug and tune their Spark applications. Users need to be able to get advice quickly and iterate on their development, and any problems need to be caught promptly to keep the cluster healthy.

To achieve this, we automated the process of identifying performance issues and providing custom tuning advice to users, and made scaling improvements to handle thousands of Spark applications per day. We leverage the Spark History Server (SHS) to gather application metrics, but as the number and size of Spark applications have grown, the SHS has not been able to keep up: it can fall hours behind during peak usage. We will discuss changes to the SHS that improve its efficiency, performance, and stability, enabling it to analyze large volumes of logs. Another challenge we encountered was the lack of proper metrics for Spark application performance. We will present new metrics added to Spark that precisely report resource usage at runtime, and discuss how these are used in heuristics to identify problems.
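As a rough illustration of the kind of heuristic described above, the sketch below compares an executor's peak memory usage (as reported by runtime metrics) against its configured heap and suggests a smaller setting when the gap is large. The function name, thresholds, and headroom factor are hypothetical, not LinkedIn's actual implementation.

```python
# Hypothetical tuning heuristic: flag over-provisioned executor memory.
# Inputs are in MiB; headroom leaves a safety margin above observed peak use.

def recommend_executor_memory(configured_mb, peak_used_mb,
                              headroom=1.25, min_mb=1024):
    """Suggest a smaller spark.executor.memory when peak usage is far
    below the configured value. Thresholds here are illustrative only."""
    suggested = max(int(peak_used_mb * headroom), min_mb)
    # Only recommend a change when it would actually shrink the allocation.
    return suggested if suggested < configured_mb else configured_mb

# Example: an executor configured with 8 GiB that peaked at 2 GiB
print(recommend_executor_memory(8192, 2048))  # suggests 2560 MiB
```

A real analysis would aggregate such checks across all executors and stages before surfacing a recommendation, so that one outlier task does not drive the advice.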

Based on this analysis, custom recommendations are provided to help users tune their applications. We will also show the impact provided by these tuning recommendations, including improvements in application performance itself and the overall cluster utilization.

Session hashtag: #Exp7SAIS

About Edwina Lu

Edwina Lu is a software engineer on LinkedIn's Hadoop infrastructure development team, currently focused on supporting Spark on the company's clusters. Previously, she worked at Oracle on database replication. Edwina holds a Master's degree in Computer Science from Stanford University.

About Ye Zhou

Ye Zhou is a software engineer on LinkedIn's Hadoop infrastructure development team, focusing mostly on Hadoop YARN and Spark related projects. Ye holds a Master's degree in Computer Science from Carnegie Mellon University.