Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs:
Rose Toomey joined Bloomberg as a senior software developer in the AI Group in April 2020. Previously, she worked as a senior software engineer at Coatue Management, Lead API Developer at Gemini Trust, and a Director of Engineering at Novus Partners.