Understanding Memory Management In Spark For Fun And Profit

Allocation and usage of memory in Spark is based on an interplay of algorithms at multiple levels: (i) at the resource-management level across various containers allocated by Mesos or YARN, (ii) at the container level among the OS and multiple processes such as the JVM and Python, (iii) at the Spark application level for caching, aggregation, data shuffles, and program data structures, and (iv) at the JVM level across various pools such as the Young and Old Generation as well as the heap versus off-heap. The goal of this talk is to give application developers and operations staff easy ways to understand the multitude of choices involved in Spark’s memory management. This talk is based on an extensive experimental study of Spark on YARN that was done using a representative suite of applications.

Takeaways from this talk:

– We identify the memory pools used at different levels along with the key configuration parameters (i.e., tuning knobs) that control memory management at each level.
– We show how to collect resource usage and performance metrics for various memory pools, and how to analyze these metrics to identify contention versus underutilization of the pools.
– We show the impact of key memory-pool configuration parameters at the levels of the application, containers, and the JVM. We also highlight tradeoffs between memory usage and running time, which are important indicators of resource utilization and application performance, respectively.
– We demonstrate how application characteristics, such as shuffle selectivity and input data size, dictate the impact of memory pool settings on application response time, efficiency of resource usage, chances of failure, and performance predictability.
– We summarize our findings as key troubleshooting and tuning guidelines at each level for improving application performance while achieving the highest resource utilization possible in multi-tenant clusters.
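
As a rough illustration of how the container, application, and JVM levels stack up for one executor, the sketch below splits a heap size into the pools Spark's unified memory manager uses. It is not taken from the talk: the function name `executor_memory_breakdown` is hypothetical, and the defaults (300 MB reserved, `spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5, overhead of max(384 MB, 10% of the heap)) are assumed from the Spark 2.x and later documentation. Off-heap memory, Python worker memory, and YARN's rounding of container sizes are deliberately left out.

```python
# Sketch only: approximate memory pools for a single Spark executor on YARN.
# All defaults are assumptions based on Spark 2.x+ docs, not results from the talk.

RESERVED_MB = 300        # fixed amount Spark sets aside inside the heap
MIN_OVERHEAD_MB = 384    # floor for the non-heap overhead added to the container


def executor_memory_breakdown(heap_mb,
                              memory_fraction=0.6,     # spark.memory.fraction
                              storage_fraction=0.5,    # spark.memory.storageFraction
                              overhead_factor=0.10):   # share of heap added as overhead
    """Return the approximate size in MB of each pool for one executor."""
    # Container level: YARN is asked for the heap plus an overhead allowance.
    overhead_mb = max(MIN_OVERHEAD_MB, int(heap_mb * overhead_factor))
    container_mb = heap_mb + overhead_mb

    # Application level: the heap minus the reserved part is split between
    # the unified (execution + storage) region and user data structures.
    usable_mb = heap_mb - RESERVED_MB
    unified_mb = usable_mb * memory_fraction
    storage_mb = unified_mb * storage_fraction   # cache share protected from eviction
    execution_mb = unified_mb - storage_mb       # shuffles, joins, aggregations
    user_mb = usable_mb - unified_mb             # program data structures

    return {
        "container_mb": container_mb,
        "overhead_mb": overhead_mb,
        "unified_mb": unified_mb,
        "storage_mb": storage_mb,
        "execution_mb": execution_mb,
        "user_mb": user_mb,
    }


# Example: an executor started with a 4 GB heap (spark.executor.memory=4g).
for pool, size_mb in executor_memory_breakdown(4096).items():
    print(f"{pool:>13}: {size_mb:8.1f} MB")
```

The boundary between storage and execution is soft in the unified manager, so the two figures are starting shares rather than hard limits; the actual heap seen by the JVM is also usually a little smaller than the configured value.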

Learn more:

  • Deep Dive: Apache Spark Memory Management
  • Spark on YARN: a Deep Dive
  • Spark-on-YARN: The Road Ahead

    About Shivnath Babu

    Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease-of-use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

    About Mayuresh Kunjir

    Mayuresh Kunjir is a PhD candidate in the Computer Science Department at Duke University. His research focuses on resource management and query optimization in data analytics systems. Prior to joining Duke, Mayuresh received his MS from the Indian Institute of Science, Bangalore, where he worked on improving the power efficiency of commercial database engines.