Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read, shuffle, and write partitions? How do you increase parallelism and decrease the number of output files? Where does shuffle data go between stages? What is the “right” size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn’t adding nodes decrease my compute time?
Daniel Tomes leads the Resident Solutions Architect Practice at Databricks and is responsible for vertical integration, productization, and strategic client growth. His big data journey began in 2014 at a major oil and gas company, after which he moved to Cloudera for two years as a Solutions Architect. He joined Databricks in 2017.