Spark SQL is a highly effective distributed SQL engine for OLAP and is widely adopted in production at Baidu for many internal BI projects. However, Baidu has also faced many challenges at large scale, including tuning the shuffle parallelism for thousands of jobs, inefficient execution plans, and handling data skew. In this talk, we will explore Intel and Baidu’s joint efforts to address these challenges and offer an overview of the adaptive execution mode we implemented for Baidu’s Big SQL platform, which is based on Spark SQL. At runtime, adaptive execution can change the execution plan to use a better join strategy and handle skewed joins automatically. It can also change the number of reducers to better fit the data scale. In general, adaptive execution reduces the effort involved in tuning SQL query parameters and improves execution performance by choosing a better execution plan and parallelism at runtime.
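To make the three runtime decisions described above more concrete, here is a minimal pure-Python sketch. It is not the implementation used at Baidu or Intel, and it does not call any Spark API: the function names (`coalesce_partitions`, `choose_join`, `skewed_partitions`), the thresholds, and the skew rule (a partition larger than a multiple of the median size) are all assumptions chosen for illustration. The idea it shows is that once the map stage has finished, the actual shuffle partition sizes are known, so the reducer count, the join strategy, and skew handling can be decided from measured sizes instead of estimates.

```python
from statistics import median
from typing import List


def coalesce_partitions(sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Group adjacent shuffle partitions so each reducer reads roughly target_bytes.

    Hypothetical helper: returns a list of reducers, each a list of partition indices.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


def choose_join(left_bytes: int, right_bytes: int, broadcast_threshold: int) -> str:
    """Pick a join strategy from the measured (not estimated) input sizes."""
    if min(left_bytes, right_bytes) <= broadcast_threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"


def skewed_partitions(sizes: List[int], factor: float = 5.0) -> List[int]:
    """Flag partitions much larger than the median (assumed skew rule)."""
    threshold = factor * median(sizes)
    return [idx for idx, size in enumerate(sizes) if size > threshold]


# Example: six small shuffle partitions (sizes in MB for readability).
sizes_mb = [10, 20, 30, 100, 5, 5]
reducers = coalesce_partitions(sizes_mb, target_bytes=64)
skewed = skewed_partitions(sizes_mb)
strategy = choose_join(left_bytes=10 * 1024**4, right_bytes=8 * 1024**2,
                       broadcast_threshold=10 * 1024**2)
```

In this sketch, six map-side partitions collapse into three reducers, the one oversized partition is flagged as a candidate for splitting, and a join whose smaller side turns out to be only a few megabytes is switched to a broadcast join.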
We’ll also share our experience of using adaptive execution in Baidu’s production cluster with thousands of servers, where adaptive execution helps to improve the performance of some complex queries by 200%. After further analysis, we found that several special scenarios in Baidu’s data analysis can benefit from the optimization of choosing a better join type. We got a 2x performance improvement in a scenario where the user wanted to analyze 1000+ advertisers’ costs from both the web and mobile sides, where each side has a full information table of 10 TB of Parquet files per day. We are now writing probe jobs to detect more such scenarios in our users’ current daily jobs. We are also considering exposing a strategy interface to upper-layer users, based on the detailed metrics collected from adaptive execution mode.
Session hashtag: #SAISEco12
Carson Wang is a software engineering manager in Intel’s data analytics software group, where he focuses on optimizing popular big data and machine learning frameworks and drives efforts to build a converged big data and AI platform. He has created and led several open source projects, such as RayDP (Spark on Ray), OAP MLlib (a highly optimized Spark MLlib), the Spark adaptive query execution engine, HiBench (a big data micro benchmark suite), and more.
Chenzhao Guo is a big data engineer at Intel. He graduated from Zhejiang University and joined Intel in 2016. He is currently a contributor to Spark, OAP, and HiBench.