Julien is the co-founder and CTO of Data Mechanics, a Y Combinator-backed startup building a cloud-native data engineering platform. Their solution is deployed on a managed Kubernetes cluster inside each customer's cloud account. Prior to Data Mechanics, Julien was a passionate Spark user as a data scientist and data engineer at the ride-sharing platform BlaBlaCar and the user analytics platform ContentSquare.
May 28, 2021 11:40 AM PT
Delight (https://www.datamechanics.co/delight) is a free, cross-platform monitoring dashboard for Apache Spark that displays system metrics (CPU usage, memory usage) along with Spark information (jobs, stages, tasks) on the same timeline. Delight complements the Spark UI when you troubleshoot a Spark application and look for its performance bottlenecks. It works for free on top of any Spark platform (open-source or commercial, in the cloud or on-premises). You can install it using an open-source Spark agent (https://github.com/datamechanics/delight).
In this session, the co-founders of Data Mechanics will take you through performance troubleshooting sessions with Delight on real-world data engineering pipelines. You will see how Delight and the Spark UI can jointly help you spot the performance bottlenecks of your applications, and how you can use these insights to make your applications more cost-effective and stable.
November 17, 2020 04:00 PM PT
Community adoption of Kubernetes (instead of YARN) as a scheduler for Apache Spark has been accelerating since the major improvements in the Spark 3.0 release. Companies choose to run Spark on Kubernetes to use a single cloud-agnostic technology across their entire stack, and to benefit from improved isolation and resource sharing for concurrent workloads. In this talk, the founders of Data Mechanics, a serverless Spark platform powered by Kubernetes, will show how to easily get started with Spark on Kubernetes.
We will go through an end-to-end example of building, deploying and maintaining a data pipeline. This will be a code-heavy session with many tips to help beginner and intermediate Spark developers be successful with Spark on Kubernetes, and with live demos running on the Data Mechanics platform. Topics include:
- Setting up your environment (data access, node pools)
- Sizing your applications (pod sizes, dynamic allocation)
- Boosting your performance through critical disk and I/O optimizations
- Monitoring your application logs and metrics for debugging and reporting
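As a rough illustration of the sizing topic in the list above, here is a minimal sketch that assembles spark-submit arguments for a Spark 3.x application on Kubernetes with dynamic allocation enabled. It is not taken from the talk: the helper name, the executor sizes, and the example API server, image and application path are all hypothetical, and only the Spark configuration keys are standard.

```python
def build_spark_submit_args(api_server, image, app_path,
                            min_executors=1, max_executors=10):
    """Hypothetical helper: build a spark-submit command for Spark on Kubernetes."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")

    conf = {
        # Container image used for the driver and executor pods.
        "spark.kubernetes.container.image": image,
        # Executor pod size (illustrative values, tune to your node pool).
        "spark.executor.cores": "4",
        "spark.executor.memory": "8g",
        # Dynamic allocation; on Kubernetes (Spark 3.0+) it relies on
        # shuffle tracking because there is no external shuffle service.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }

    args = ["spark-submit",
            "--master", f"k8s://{api_server}",
            "--deploy-mode", "cluster"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


if __name__ == "__main__":
    # All three values below are placeholders, not real endpoints.
    print(" ".join(build_spark_submit_args(
        "https://kubernetes.example.com:6443",
        "example.com/spark-app:latest",
        "local:///opt/spark/app/main.py",
    )))
```

Keeping the settings in a dictionary makes it easy to review sizing choices in one place before they are turned into command-line flags.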
Speakers: Jean-Yves Stephan and Julien Dumazert
June 23, 2020 05:00 PM PT
Since initial support was added in Apache Spark 2.3, running Spark on Kubernetes has been growing in popularity. Reasons include the improved isolation and resource sharing of concurrent Spark applications on Kubernetes, as well as the benefit of using a homogeneous, cloud-native infrastructure for a company's entire tech stack. But running Spark on Kubernetes in a stable, performant, cost-efficient and secure manner also presents specific challenges. In this talk, JY and Julien will go over lessons learned while building Data Mechanics, a serverless Spark platform powered by Kubernetes.
October 15, 2019 05:00 PM PT
Spark has made writing big data pipelines much easier than before. But a lot of effort is required to maintain performant and stable data pipelines in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How do I make sure that my pipeline always finishes on time and meets its SLA?
These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful pieces of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by the data platform or by third parties such as the Data Mechanics platform.