Best Exploration of Columnar Shuffle Design
To significantly improve the performance of Spark SQL, there is a trend to offload Spark SQL execution to highly optimized native libraries or accelerators in past several years, like Photon from Databricks, Nvidia's Rapids plug-in, and Intel and Kyligence's initiated open source Gluten project. By the multi-fold performance improvement from these solutions, more and more Apache Spark™ users have started to adopt the new technology. One characteristics of native libraries is that they all use columnar data format as the basic data format. It's because the columnar data format has the intrinsic affinity to vectorized data processing using SIMD instructions. While vanilla Spark's shuffle is based on spark's internal row data format. The high overhead of the columnar to row and row to columnar conversion during the shuffle makes reusing current shuffle not possible. Due to the importance of shuffle service in Spark, we have to implement an efficient columnar shuffle, which brings couple of new challenges, like the split of columnar data, or the dictionary support during shuffle.
In this session, we will share the exploration process of the columnar shuffle design during our Gazelle and Gluten development, and best practices for implementing the columnar shuffle service. We will also share how we learned from the development of vanilla Spark's shuffle, for example, how to address the small files issue then we will propose the new shuffle solution. We will show the performance comparison between Columnar shuffle and vanilla Spark's row-based shuffle. Finally, we will share how the new built-in accelerators like QAT and IAA in the latest Intel processor are used in our columnar shuffle service and boost the performance.