Catalyst is an excellent optimizer in SparkSQL, provides open interface for rule-based optimization in planning stage. However, the static (rule-based) optimization will not consider any data distribution at runtime. A technology called Adaptive Execution has been introduced since Spark 2.0 and aims to cover this part, but still pending in early stage. We enhanced the existing Adaptive Execution feature, and focus on the execution plan adjustment at runtime according to different staged intermediate outputs, like set partition numbers for joins and aggregations, avoid unnecessary data shuffling and disk IO, handle data skew cases, and even optimize the join order like CBO etc.. In our benchmark comparison experiments, this feature save huge manual efforts in tuning the parameters like the shuffled partition number, which is error-prone and misleading. In this talk, we will expose the new adaptive execution framework, task scheduling, failover retry mechanism, runtime plan switching etc. At last, we will also share our experience of benchmark 100 -300 TB scale of TPCx-BB in a hundreds of bare metal Spark cluster.
Session hashtag: EUdev4
Carson Wang is a software engineering manager in Intel data analytics software group, where he focuses on optimizing popular big data and machine learning frameworks, driving the efforts of building converged big data and AI platform. He had created and led a few open source projects, such as RayDP - Spark on Ray, OAP MLlib - a highly optimized Spark MLlib, Spark adaptive query execution engine, Hibench - a big data micro benchmark suite, and more.