We propose a lightweight, on-the-fly Dynamic Repartitioning module for Spark that adaptively repartitions data during execution, with negligible overhead, to provide close-to-uniform partitioning. In our experiments with distributions common in practice (for example, power law), the time needed to complete a stage could be reduced by 38% to 59% on average. The approach also improves utilization. Using our full-fledged, real-time visualization tool, we demonstrate that dynamic repartitioning works under various popular use cases and that significant speedup can be achieved for common workloads. We also show how to fine-tune the partitioning mechanism.
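To make the skew problem concrete, the following is a minimal illustrative sketch in plain Python, not the module's actual algorithm and not Spark code. It assumes a deterministic power-law-like key distribution and compares hash-style placement (key modulo partition count) with a simple frequency-aware greedy placement. All names (`key_weights`, `hash_partition_loads`, `greedy_partition_loads`, `imbalance`) are hypothetical and chosen only for this example.

```python
import heapq


def key_weights(num_keys=100, scale=1000):
    # Assumed workload: key k has roughly scale / k records (power-law-like).
    return {k: scale // k for k in range(1, num_keys + 1)}


def hash_partition_loads(weights, num_partitions):
    # Hash-style placement: every record of a key goes to key % num_partitions.
    loads = [0] * num_partitions
    for key, count in weights.items():
        loads[key % num_partitions] += count
    return loads


def greedy_partition_loads(weights, num_partitions):
    # Frequency-aware placement (an assumption for illustration): visit keys
    # from heaviest to lightest and put each on the least-loaded partition.
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    for _key, count in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        load, p = heapq.heappop(heap)
        heapq.heappush(heap, (load + count, p))
    # Return loads ordered by partition index.
    return [load for load, _p in sorted(heap, key=lambda lp: lp[1])]


def imbalance(loads):
    # Ratio of the largest partition to the mean; 1.0 means perfectly uniform.
    return max(loads) / (sum(loads) / len(loads))


weights = key_weights()
hash_loads = hash_partition_loads(weights, 4)
greedy_loads = greedy_partition_loads(weights, 4)
print("hash placement imbalance:  ", round(imbalance(hash_loads), 2))
print("greedy placement imbalance:", round(imbalance(greedy_loads), 2))
```

Because a stage finishes only when its largest partition has been processed, the imbalance ratio is a rough proxy for how much a skewed partitioning can stretch stage time. The greedy placement here needs the key frequencies up front; estimating them during execution is the part this sketch leaves out.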
Zoltán is a researcher and project lead at the Hungarian Academy of Sciences. His main expertise and interests are data partitioning and scheduling in distributed data processing frameworks. His current work includes research and development on distributed tracing in Spark and QoS scheduling on Hadoop YARN. Zoltán has spoken at various Big Data-related conferences and meetups, including Hadoop & Spark Summit.