Delight (https://www.datamechanics.co/delight) is a free, cross-platform monitoring dashboard for Apache Spark, which displays system metrics (CPU usage, memory usage) along with Spark information (jobs, stages, tasks) on the same timeline. Delight is a great complement to the Spark UI when it comes to troubleshooting your Spark application and understanding its performance bottlenecks. It works for free on top of any Spark platform (whether it’s open-source or commercial, in the cloud or on-premise). You can install it using an open-source Spark agent (https://github.com/datamechanics/delight).
In this session, the co-founders of Data Mechanics will take you through performance troubleshooting sessions with Delight on real-world data engineering pipelines. You will see how Delight and the Spark UI can jointly help you spot the performance bottleneck of your applications, and how you can use these insights to make your applications more cost-effective and stable.
Jean-Yves Steph…: Hi everyone, and thank you for joining our talk about Delight, a new monitoring dashboard for Apache Spark that we developed and recently released. So, Delight is free. It works on top of any Spark platform and it’s a great complement to the Spark UI to help you troubleshoot the performance of your Spark applications. But before talking more about it, let’s introduce ourselves.
So, we’re the co-founders of Data Mechanics, a Cloud-Native Spark Platform for Data Engineers. Data Mechanics is a YCombinator startup, growing quickly. We are hiring by the way. So we have open positions in the US and Europe. Check out the website if you’re interested.
Quick personal introductions. So I’m JY and prior to Data Mechanics, I was a software engineer at Databricks, leading their Spark Infrastructure Team. So I have experience with Spark as an infrastructure provider.
And then Julien, our CTO, who will talk later, worked with Spark as a data scientist and data engineer at ContentSquare and BlaBlaCar. And so he has experience with Spark as an application developer.
This is actually our fourth Spark summit. So we’re thrilled to be here. In the past we’ve given talks about running Spark on Kubernetes, because that’s what we do at Data Mechanics. Today is going to be a different talk. And we hope that the next summit maybe we’ll be in person.
So let’s talk about the agenda for today. First, I’ll do a very quick primer about Data Mechanics, so you understand who we are and why we’re giving this talk. Then we’ll talk about our goals with Delight, and then we’ll explain how Delight works. And then the core of the presentation will be Julien taking the mic and showing you a few real-world, concrete examples of tuning the performance of Spark jobs with Delight. And then in our last section, we’ll tell you about our roadmap for Delight.
So first, a few words about Data Mechanics, our company. We are a managed Spark platform. So an alternative to EMR, Databricks, Dataproc, Cloudera, Hortonworks, and so on. Our platform is deployed on a Kubernetes cluster that we create and manage for our customers inside their cloud account. So if the customer uses Amazon, we deploy on top of EKS. If the customer uses GCP, we deploy on top of GKE. If the customer uses Azure, we deploy on top of AKS.
Now on this Kubernetes cluster, there is one service running all the time called the gateway, and the gateway is the entry point to start Spark applications, either by connecting a Jupyter notebook, or JupyterHub, JupyterLab, or by using our API, or one of our scheduler connectors like Airflow, Azure Data Factory, Composer, Argo, and so on. When you submit your application to the gateway, the gateway is going to automatically tune the infrastructure parameters of your pipelines based on the historical runs that we have recorded.
So we tune parameters like container memory, container CPU, the type of instance to use, and then some Spark configurations, like the default number of partitions, shuffle memory configs, and so on. And the goal is to make your applications stable and performant. The Kubernetes cluster does not have a fixed size. It automatically scales up and down based on load. So really the goal is that you don’t need to manage infrastructure anymore. You just submit dockerized Spark applications and monitor the logs and metrics through the web UI served by the gateway.
Last slide, just to illustrate the kind of work we do with customers. So Lingk is a data integration platform built on top of Spark. They were not happy with the developer workflow they had and with the high costs of their YARN-based infrastructure on EMR. And so we helped them migrate to Spark on Kubernetes, and while doing so we divided their AWS costs by three, and we also improved their pipeline startup time and overall duration, which translated into great user-facing value for them.
But I’ve talked about Data Mechanics. Let’s now talk about Delight. And first, what is our vision? What are we trying to solve? So, why are we doing this? The main tool that you can use today for Spark is the Spark UI. And there’s a lot of information in the Spark UI. But the useful information is often buried under a bit of noise. So it’s hard for most users to quickly identify what is the performance bottleneck of their application.
In addition, the Spark UI does not show any system metrics about your Spark code. So, memory, CPU, I/O. If you want these kinds of metrics you’ll need to use a general-purpose tool like Ganglia, but then these tools are not really built for Spark. So if you want to correlate these metrics with your Spark code, it’s going to be a bit hard. You’re typically going to have to switch back and forth between Ganglia and the Spark UI, and try to align the timestamps to understand what’s going on. Not the best experience.
And finally, once your application has completed, the only way to render the Spark UI is by using the Spark history server. And it is a bit slow and not always stable. So these are the problems that we want to solve with Delight. So what does Delight do? It displays memory and CPU metrics recorded within Spark, aligned on a single page with your Spark jobs and stages. So the goal here is to make it obvious which part of your application is the performance bottleneck, so you can focus your attention on it. And these metrics are aligned with a view of your Spark jobs and stages so that you know which part of your code to focus on.
And of course, we also want Delight to be very easy to set up. So you can just install a tiny agent in your Spark driver, and then everything is hosted for you and loads quickly. Nothing to maintain on our side… On your side. So this was our vision for Delight. Maybe to tell you a bit more about the timeline. We first started the project in July with a blog post. We received a lot of feedback, so we started working on it. In November we reached an intermediate milestone where Delight didn’t have any new screen yet, but it was basically just a dashboard and a hosted Spark history server. And then in February we released Delight internally to Data Mechanics customers and iterated on their feedback. And now since April Delight is publicly available.
It works on top of any Spark platform, whether it’s open-source or commercial, whether it’s in the cloud or on premise. So let me now explain how Delight works. So Delight is made of an open-source agent running inside your Spark application. It’s a jar attached to your Spark driver, which streams metrics to the Data Mechanics backend. The streamed metrics are the Spark event logs. And this is the same information used by the Spark history server to render the Spark UI. These logs contain metadata about your Spark application, like how much data was read and written by each task, how much memory and CPU were used, and so on.
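To make the event log idea more concrete, here is a minimal Python sketch that sums task durations per stage from event log lines. It assumes the JSON-lines layout that Spark's event logging writes, where a task end event carries a "Task Info" object with "Launch Time" and "Finish Time" in milliseconds. The sample lines are made up for illustration, and this is not a description of how the Delight backend processes the logs.

```python
import json

# Made-up sample shaped like Spark event log lines (one JSON object per line).
SAMPLE_EVENT_LOG = """\
{"Event": "SparkListenerApplicationStart", "App Name": "demo"}
{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Info": {"Task ID": 0, "Launch Time": 1000, "Finish Time": 4000}}
{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Info": {"Task ID": 1, "Launch Time": 1500, "Finish Time": 2500}}
{"Event": "SparkListenerTaskEnd", "Stage ID": 1, "Task Info": {"Task ID": 2, "Launch Time": 4000, "Finish Time": 9000}}
"""


def task_time_per_stage(lines):
    """Return {stage_id: total task duration in ms} from event log lines."""
    totals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only task end events carry the launch and finish times we need.
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        info = event["Task Info"]
        duration_ms = info["Finish Time"] - info["Launch Time"]
        stage_id = event["Stage ID"]
        totals[stage_id] = totals.get(stage_id, 0) + duration_ms
    return totals


totals = task_time_per_stage(SAMPLE_EVENT_LOG.splitlines())
print(totals)
```

Real event logs contain many more event types and fields; the sketch only reads the few it needs and ignores the rest.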
This metadata is sent over HTTPS and stored securely on our side. It is accessible only by you through a web dashboard served at delight.datamechanics.co. And then after 30 days we automatically clean up your logs. So if you want to get started with Delight you should first create a free account through our website. And then once logged in, head over to settings to create a personal access token. This token uniquely identifies you as the owner of your Spark applications, and you will need it later. And then you can head over to our GitHub page and follow the installation instructions specific to your platform. So there are instructions for Databricks, Dataproc, EMR, for spark-submit, for the Spark on Kubernetes operator, for Livy.
Maybe to give you an example, let me show you how to install Delight on top of Databricks. You can do this through an init script. And this init script just installs this open-source jar on the Spark driver. And it also inserts the access token that we created earlier. So you just basically define this init script, and then all your Databricks applications will automatically be visible in Delight.
Here’s another example on EMR. All we need to do here is pass a few options to spark-submit. And these options basically attach the jar again to the Spark driver. And you also pass the token that we created earlier. Once this is done your Spark applications will automatically appear on the Delight dashboard, which looks like this. They do not appear while the app is running. Right now, you have to wait until the application has completed.
And the dashboard gives you a few high-level statistics about your applications, like their start time, their duration, the volume of data read and written by Spark. There are maybe a few columns that you don’t know here, and I’d like to dive deeper into them and explain them. So Spark tasks, CPU, and efficiency.
Let me first explain the CPU column, which we also call CPU uptime, measured in core hours. So this is the lifetime duration of your Spark executors, multiplied by how many cores each executor has. So for example, if you have three executors with two cores each, running for one hour, then this column would indicate six core hours. So this statistic is typically proportional to your cloud costs, because it represents the amount of compute resources that you use.
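As a small worked example of this definition, the sketch below computes CPU uptime in core hours from a list of executors. The function name and the input shape are assumptions made for illustration, not part of Delight.

```python
def cpu_uptime_core_hours(executors):
    """Sum of (executor lifetime in hours) x (cores per executor).

    `executors` is a list of (lifetime_hours, cores) tuples.
    """
    return sum(lifetime_hours * cores for lifetime_hours, cores in executors)


# The example from the talk: three executors, two cores each, alive for one hour.
example_executors = [(1.0, 2), (1.0, 2), (1.0, 2)]
uptime = cpu_uptime_core_hours(example_executors)
print(uptime)  # 6.0
```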
Then the second metric is the Spark tasks column. And this shows the sum of the duration of all the Spark tasks in your application. So remember, Spark takes your code and then decides to split it into jobs, stages, and tasks. And a task is the smallest unit of work, which can be executed in parallel.
One executor core can run one task at a time. So for example, here in this example you can see we only had running tasks on the green side of the bars. And the sum of the durations is 72 minutes. Now, the last column I want to explain is the efficiency, which is the ratio between the Spark tasks time and the CPU uptime. So if this ratio is 100%, that’s great, you have perfect parallelism. All your Spark cores are running tasks 100% of the time. But in our example here the efficiency is about 20%. And just by looking at the image we understand why. It’s because there was basically a long task, a straggler task, that took a long time to compute, and while it was running the other cores were idle.
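The efficiency ratio described here can be sketched in a few lines of Python. The 72 minutes of task time come from the example above; the 360 core minutes of CPU uptime (six core hours) are an assumed value, chosen only because it is consistent with the roughly 20% quoted for that example.

```python
def efficiency(task_minutes, cpu_uptime_core_minutes):
    """Ratio of time spent running Spark tasks to provisioned core time."""
    if cpu_uptime_core_minutes <= 0:
        raise ValueError("CPU uptime must be positive")
    return task_minutes / cpu_uptime_core_minutes


task_minutes = 72              # sum of task durations, from the talk
cpu_uptime_core_minutes = 360  # assumed: 6 core hours of executor uptime

ratio = efficiency(task_minutes, cpu_uptime_core_minutes)
print(f"{ratio:.0%}")
```

A ratio close to 1 means the executor cores were almost always busy with tasks; a low ratio means provisioned cores sat idle for most of the run.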
And so this is why we have imperfect parallelism, and maybe there is something to improve for this specific application. So as you can see, this dashboard helps you identify which applications maybe are worth investigating. So in this example, the first line is a notebook, which had an efficiency of about 17%, which is not great. This is probably because it’s an interactive session and maybe you ran some code and then you had to go to a meeting, but your infrastructure didn’t scale down. So this is pretty common for notebooks.
The second line shows you more of an ETL pipeline, where we have 87% efficiency, which is great. Out of the 40 hours of CPU uptime, 35 hours were spent running Spark tasks. And this is typically the kind of efficiency we should aim for. Unfortunately, we don’t always get 80, 90% efficiency. On average, it’s much lower.
So, why? What are the common root causes? First, not enabling dynamic allocation. Particularly if you’re using Spark interactively from a notebook you should enable dynamic allocation, such that when your notebook is idle the executors go away.
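As an illustration, the sketch below builds the `--conf` arguments that turn on dynamic allocation. The configuration keys are standard Spark settings; the values are assumptions to adapt to your workload, and the helper function is hypothetical. Shuffle tracking is included because it lets dynamic allocation work without an external shuffle service on Spark 3.0 and later.

```python
# Standard Spark configuration keys; the values are illustrative assumptions.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "0",
    "spark.dynamicAllocation.maxExecutors": "10",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}


def to_spark_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


submit_args = to_spark_submit_args(DYNAMIC_ALLOCATION_CONF)
print(" ".join(submit_args))
```

With a minimum of zero executors, an idle notebook session can release all of its executors after the idle timeout and request new ones when the next cell runs.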
The second common reason is just defining too many executors for a pipeline. We call that over-provisioning. The third reason is not having enough partitions, either in the input data or in the default configuration that sets the number of partitions. It’s a good idea to have, on average, about three times as many partitions as your number of cores.
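The rule of thumb above can be written down as a short Python sketch. The executor layout (eight executors with two cores each) is an assumption for illustration; the resulting number is the kind of value you might set for `spark.default.parallelism` or `spark.sql.shuffle.partitions`.

```python
def suggested_partitions(num_executors, cores_per_executor, factor=3):
    """Rule of thumb from the talk: about `factor` partitions per core."""
    total_cores = num_executors * cores_per_executor
    return total_cores * factor


# Assumed cluster shape: 8 executors with 2 cores each, so 16 cores in total.
partitions = suggested_partitions(num_executors=8, cores_per_executor=2)
print(partitions)  # 48
```

Having a few partitions per core means a core that finishes its task early can pick up another one instead of waiting for the slowest task of the stage.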
Another common cause is what we saw earlier, which is a task duration skew. It’s often caused by a skew in the data. So, you have one partition, which is huge, and then it’s the straggler task. And then most of your cluster is idle.
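One simple way to spot this kind of skew is to compare the slowest task of a stage with the median task. The sketch below does that in Python; the durations are made-up sample data and the threshold of 5 is an arbitrary assumption, not a value used by Delight or Spark.

```python
from statistics import median


def skew_ratio(task_durations):
    """Duration of the slowest task divided by the median task duration."""
    return max(task_durations) / median(task_durations)


def has_straggler(task_durations, threshold=5.0):
    """Flag a stage whose slowest task is far above the median (assumed threshold)."""
    return skew_ratio(task_durations) > threshold


# Made-up task durations in seconds for one stage: seven fast tasks, one straggler.
durations = [12, 14, 13, 15, 11, 13, 12, 400]

ratio = skew_ratio(durations)
print(round(ratio, 1), has_straggler(durations))
```

When the ratio is large, the usual next step is to look at how the data is partitioned, for instance whether one key holds most of the rows.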
And the last reason is doing a lot of driver-only work. So if you have a lot of pure Python code that’s mixed with Spark code, well, when the driver is just running Python code the executors are idle. And so that’s another cause of inefficiency.
So now that we’ve seen the theory of it, I’d like to leave the mic to Julien, our CTO, who is going to walk you through concrete examples of performance tuning from customers that we work with at Data Mechanics. So now I leave it up to you, Julien.
Julien Dumazert: We’re going to work through a handful of real-life Spark applications to showcase how Delight for Apache Spark provides you, in a single glance, with the most important insights into the performance of your applications. This is a Spark on Kubernetes application from Weather 2020, a weather analytics platform. The top of the page tells us that the application reads data but does not write any outputs. At least not using Spark. The application is rather short, 15 minutes long, and out of the three hours and 15 minutes of CPU provisioned for this application, only one hour and 38 minutes were actually used to run Spark compute. That is an efficiency of only 42%. So more than half of the resources were actually wasted. This is worth digging into. The situation is confirmed by the executor cores usage breakdown. It also indicates how CPU cores are wasted.
48% of them fall into the category we call “Some cores idles”. This means that a Spark stage was actually running, but this stage did not use all available cores. Probably because there’s data skew or simply just not enough partitions. The CPU plot below gives more details. The CPU plot is the combination of a timeline of Spark stages and jobs, and of a graph of CPU cores utilization. Here it really catches the eye that during more than six minutes only one core out of 16 available is actually used. The culprit is stage one. Elements of the timeline can be clicked and open the relevant page of the Spark UI. So let’s go there. The Spark UI shows that this stage contains two tasks only. This is obviously not enough to use all of the 16 cores available. The next step would now be to dig into the Spark computation graph and into the code to understand why there are so few partitions in this stage. But let’s jump to our next example.
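The arithmetic behind this observation is simple enough to sketch in Python: since one core runs one task at a time, a stage can never keep more cores busy than it has tasks. The function below is a hypothetical helper for illustration, not part of Delight.

```python
def max_core_utilization(num_tasks, available_cores):
    """Upper bound on the share of cores a stage can keep busy at once."""
    if available_cores <= 0:
        raise ValueError("available_cores must be positive")
    return min(num_tasks, available_cores) / available_cores


# The stage from the talk: two tasks on a cluster with 16 cores.
bound = max_core_utilization(num_tasks=2, available_cores=16)
print(bound)  # 0.125
```

So even in the best case this stage leaves at least 14 of the 16 cores idle, whatever the tasks actually do.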
This is a Spark on Kubernetes application as well, from one of our customers, Jellyfish, a data-driven digital marketing company. Although the efficiency is high, more than 86%, the application takes more time than the data engineers expected. The reason is clearly visible in Delight. The bottleneck is shuffle. More than 57% of the resources are used shuffling data between executors. The CPU plot also indicates that stages 46 and 47 are the bottleneck. Shuffle performance can be degraded by two things, poor disk I/O and poor network I/O.
In this case, we decided to change disks and use SSDs instead of standard persistent disks. SSDs offer more I/O operations per second and better throughput. With SSDs the total duration is reduced from two hours to 26 minutes only. Shuffle is now negligible in the executor cores usage breakdown. The timeline shows that several stages failed. This is actually unrelated and is due to spot kills. It’s visible in the Spark UI if you know where to look for it. We’ll soon add the times of executor births and deaths to the CPU plot, so that all information is in the same place.
Moving on. Delight also addresses the problem of memory tuning. With the Spark UI you have no visibility into your memory usage. Any change to your container sizes is risky. We wanted to help with that. Since Spark 3.0, there are memory measurements in the Spark event logs, although this is not yet leveraged in the Spark UI. Delight uses this information to find and display the peak memory breakdown of the executors with top memory usage, so that you can see whether you are close to an out-of-memory error or, on the contrary, you can reduce memory allocation. This is an application from Ookla, the Speedtest company. At the moment, Spark categorizes our memory as other. But it’s clear from this graph that most of the memory is used by the JVM and that this executor was on the verge of running into an out-of-memory error.
And finally, here’s the last example, from Iktos, a company developing artificial intelligence for new drug design. In the case of a PySpark application, Delight separates JVM memory from Python memory and from memory used by other processes on the executor node.
I hope this gives you an idea of what can be done with Delight and how it improves the Spark performance tuning experience. This was a demo of Delight today, but we’ll now talk about what’s coming next.
First, we’d like to double down on memory metrics. This is a common pain point for Spark users, and we’d like to offer a graph of memory usage over time for every executor. With this graph, it will be possible to visually find the Spark stages that caused memory usage to go up. Then we’d like to provide similar metrics about the memory usage in the driver. Later on over the summer, we plan to surface issues to help users find their way in Delight and in the Spark UI. Users will get recommendations like: shuffle duration is high, try SSDs, or increase default parallelism because stage six has too few tasks.
Eventually we’ll make Delight data available in real time, which is especially important for long-running applications like streaming apps or Thrift servers. Now, this is our current roadmap, but we hope that we’ll receive lots of feedback from you and be able to adjust it accordingly. Getting started with Delight is really easy. All you have to do is create a free account on our website and add a couple of configs to the configuration of your Spark applications. You’ll find instructions on our GitHub page for major setups like EMR, Dataproc, Databricks, Apache Livy, Spark on Kubernetes, and generic spark-submit. And that’s it.
Applications will show up in the Delight dashboard when they’re completed. Well, that’s all from us. Thanks for listening. We hope that you will find Delight useful and that it will improve your daily experience as a Spark engineer.
Jean-Yves is the Co-Founder & CEO of Data Mechanics, a cloud-native spark platform available on AWS, GCP, and Azure. Their mission is to make Spark more developer friendly and cost-effective for data ...
Julien is the co-founder and CTO of Data Mechanics, a YCombinator-backed startup building a cloud-native data engineering platform. Their solution is deployed on a managed Kubernetes cluster inside th...