Spark has made writing big data pipelines much easier than before. But a lot of effort is required to maintain performant and stable data pipelines in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How to make sure that my pipeline always finishes on time and meets its SLA?
These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful piece of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by the data platform or third parties like the Data Mechanics platform.
Jean-Yves is the Co-Founder & CEO of Data Mechanics, a cloud-native spark platform available on AWS, GCP, and Azure. Their mission is to make Spark more developer friendly and cost-effective for data engineering teams. They are active contributors to open-source projects such as the Spark-on-Kubernetes operator and Data Mechanics Delight. Prior to Data Mechanics, Jean-Yves was a software engineer at Databricks, where he led the Spark infrastructure team.
Julien is the co-founder and CTO of Data Mechanics, a YCombinator-backed startup building a cloud-native data engineering platform. Their solution is deployed on a managed Kubernetes cluster inside their customers cloud account. Prior to Data Mechanics, Julien was a passionate Spark user as a data scientist and data engineer at the ride-sharing BlaBlaCar platform, and the user analytics platform ContentSquare.