Tangram: Distributed Scheduling Framework for Apache Spark at Facebook


Tangram is a state-of-the-art resource allocator and distributed scheduling framework for Spark at Facebook, with hierarchical queues and a resource-based container abstraction. We support scheduling and resource management for a significant portion of Facebook’s data warehouse and machine learning workloads, which equates to running millions of jobs across several clusters with tens of thousands of machines. In this talk, we will describe Tangram’s architecture, discuss Facebook’s need for a custom scheduler, and explain how Tangram schedules Spark workloads at scale. We will focus on several important features that improve Spark’s efficiency, usability, and reliability:

1. IO-rebalancer (Tetris) Support
2. User-Fairness Queueing
3. Heuristic-Based Backfill Scheduling Optimizations

 

See More Spark + AI Summit in San Francisco 2019 Videos


About Rui Jian

Rui Jian is a software engineer at Facebook. For the past 6 years, he has been working on large-scale distributed systems. He built Facebook's next-generation Map-Reduce execution framework and the indexing service for the Facebook social graph. He is currently focused on building the batch scheduler for ML and data warehouse workloads. Rui obtained a master's degree in Computer Science from Shanghai Jiao Tong University, Shanghai, China.

About Hao Lin

Hao Lin is a Ph.D. student at the School of Electrical and Computer Engineering at Purdue University, West Lafayette, under the supervision of Prof. Samuel Midkiff. His research interests include parallel data systems and cloud computing. Hao Lin has also worked as a software engineering intern on data infrastructure at Huawei Technologies and Google Inc.