Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark’s built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we’ll take a deep dive into the Cosco architecture and describe how it’s deployed at Facebook. We will then describe how it’s integrated to run shuffle for Spark, and contrast it with Spark’s built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
Dmitry joined Facebook’s data warehouse team about 3.5 years ago where he started Cosco as prototype and drove it to wide internal deployment. Before Facebook, Dmitry had more than a decade of experience building B2B and B2C applications.
Brian has been working on Spark at Facebook for over an year. He is excited to be a part of growing and scaling Spark at Facebook. Previously, he was a researcher at Seoul National University and software engineer at Samsung. He has a PhD from UIUC in distributed systems.