Sarah: The CEO-Finance-Report pipeline seems to be slow today. Why?
Jeeves: SparkSQL query dbt_fin_model in CEO-Finance-Report is running 53% slower on 2/28/2021. Data skew issue detected. Issue has not been seen in last 90 days.
Jeeves: Adding 5 more nodes to cluster recommended for CEO-Finance-Report to finish in its 99th percentile time of 5.2 hours.
Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of problems quickly. Instead of being stuck to screens displaying logs and metrics, users can now have a more refreshing experience via a two-way conversation with their own personal Spark expert.
We presented Jeeves at Spark Summit 2019. In the two years since, Jeeves has grown up a lot. Jeeves can now learn continuously as telemetry information streams in from more and more applications, especially SQL queries. Jeeves now “knows” about data pipelines that have many components. Jeeves can also answer questions about data quality in addition to performance, cost, failures, and SLAs. For example:
Tom: I am not seeing any data for today in my Campaign Metrics Dashboard.
Jeeves: 3/5 validations failed on the cmp_kpis table on 2/28/2021. Run of pipeline cmp_incremental_daily failed on 2/28/2021.
This talk will give an overview of the newer capabilities of the chatbot, and how it now fits in a modern data stack with the emergence of new data roles like analytics engineers and machine learning engineers. You will learn how to build chatbots that tackle your complex data operations challenges.
Shivnath Babu: Hello, everyone. I’m Shivnath Babu. Welcome to my talk on an AI Chatbot for performance and quality. I’m co-founder and CTO at Unravel. I’m also an Adjunct Professor of Computer Science at Duke University. My work and research focus on the manageability of the new data stack and data pipelines.
Unravel itself is a platform to simplify DataOps, which is DevOps for big data and modern data applications. At Unravel, we have built a platform that can monitor, manage, and troubleshoot all your big data clusters, which might be On-Prem Hadoop clusters or Databricks and EMR clusters on the Cloud. It brings together all the telemetry information about your applications, clusters, data sets, platforms, and users in a single place. And from this data, Unravel creates rich and intuitive visualizations that can help you understand the performance, the quality, and the resource usage of your entire platform in a single place. You can track and create reports on chargeback and capacity planning, as well as detect anomalous behavior. And Unravel also uses AI and ML to automatically find the root cause of problems and help you fix them.
So, this talk is about a Chatbot, and I’m sure all of you are very familiar with Chatbots. You use many of them on a day-to-day basis. It’s a program that can have a conversation with you, and we’re going to be talking about a Chatbot in the context of Spark. So, in 2019, we demonstrated a Chatbot for Spark, and this Chatbot is meant for usage by Spark users. Many of you have seen the happy Spark user, the user who is very excited that her applications have been created, moved into production, and are creating value for a large number of users. But at the same time, I’m pretty sure that either you or somebody you know very well has at times been an unhappy Spark user. Unhappy because maybe the application that was working great just failed, and you have no idea why it failed. Or the application that you moved into production was running well, but suddenly something happened and it’s not fast anymore.
It’s missing its SLAs. How do you find and fix the problem that the application is encountering? Or maybe you have to migrate an application from a current On-Prem environment to the Cloud, and then you are struggling with all of these choices of what node type and what configuration to use on the Cloud. Or maybe you’re already on the Cloud and your costs are shooting up, and you have no idea how to get them under control. So this Chatbot is for such Spark users, maybe somebody who’s having a problem with an application that has failed, struggling through all those daunting stack traces and trying to find the root cause of the problem.
That brings us to the Jeeves Chatbot that we presented at the 2019 Spark Summit. You can just go and ask this Chatbot, “My app failed, I have no idea why. Can you help me?” and the Chatbot can take it from there. It finds the application. It can find the root cause of the problem. Here, the Chatbot is saying that it has found that an executor running out of memory is what is causing the application to fail. Better, it can say how to fix the problem. And even better, it can actually show that the problem has been fixed and give a validation. Wouldn’t this be an excellent experience? And that’s what the Jeeves Chatbot enabled. Let’s quickly see the Chatbot in action.
So, we created the Chatbot using Slack, and this is a quick recording of the Chatbot itself. If you follow the conversation that I’m having with the Chatbot here, my CEO report application, specifically, is having some challenges, and I want to make it run faster. The Chatbot immediately found that the CEO report application was run a couple of times, and it’s telling me that application runs 163 and 176 are the runs of my application. It’s also confirming with me that speeding up the application is the specific goal I actually want to achieve, and I’ve confirmed that that is indeed the case. And then, the Chatbot has come back to me and said that my application can actually be run faster by tuning a couple of options: the number of containers and the size of those containers used in my application.
It also asked me whether I can provide the exact command that I’m using to run the application, so I provided the command. And then, the Chatbot is telling me here that it has created a tuning session for me. A session that I can participate in to help get my application to better performance.
So, let’s see that in action. Here, as you can see, are the two runs of the application that the Chatbot had identified for me: application run 163 and application run 176. You can see both of them here. And I can see the recommendations that the Chatbot has to improve performance. For example, you see here, there were recommendations around the sizing of these containers. On top of that, I can compare the performance of these applications at multiple levels. So, here I can quickly check the different configuration properties that were used.
You can see, for example, that I tried to change the Spark SQL shuffle partitions as I was tuning, but still, the performance remained at around two and a half minutes in terms of the running time of the application. The Chatbot further checked and found a configuration that can improve performance. And on top of that, it is using the command I gave it to run the application with the changed configuration. Right here, you see the third run, the new run, application run 189, which is with the new recommended configuration. And clearly, the application with this new recommendation ran in half the time. So in addition to telling me how to improve the performance, the Chatbot got me to a better place where my application is running 50% faster. But it’s not only about improving the performance of the application in terms of running time.
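To make this concrete, a recommendation like the one above ultimately amounts to a few configuration changes applied to the job’s launch command. Here is a minimal sketch, in plain Python, of applying a recommended configuration to a spark-submit command; the parameter values and the script name `ceo_report.py` are hypothetical, not the actual recommendations from the demo.

```python
# Build a spark-submit command that applies a recommended configuration.
# All values below are hypothetical examples, not Unravel's actual output.
recommended = {
    "spark.executor.memory": "8g",          # container (executor) size
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "200",  # shuffle parallelism
}

def apply_recommendation(base_cmd, conf):
    """Append --conf key=value pairs to a spark-submit command."""
    parts = list(base_cmd)
    for key, value in sorted(conf.items()):
        parts += ["--conf", f"{key}={value}"]
    return parts

cmd = apply_recommendation(["spark-submit", "ceo_report.py"], recommended)
print(" ".join(cmd))
```

A tuning session like the one in the demo can then rerun the job with the new command and compare the run against the earlier baselines.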
Here I’m asking it, “Did my application PageRank fail?” And it found the run, the specific run ID that failed. And I can ask the Chatbot to fetch the errors of the application. If I ask it to fetch the errors, it can immediately fetch all the error logs. So, what you see right here are the driver and the executor logs, which were created for my failed application, and I’m sure some of you have seen such logs, right? They are pretty huge. Stack trace after stack trace, and you still struggle to understand where exactly the application failed, or what caused it to fail. So instead of having to look at the logs like this, wouldn’t it be great if you could actually go back to the Chatbot, like I’ve done here, and ask it, “Why did that application fail?”
Now, let’s see what the Chatbot is able to do. I’m asking my question right here, “Why did the app fail?” and lo and behold, the Chatbot is able to tell me that it failed because the Spark executors went out of memory. And just as it asked about speeding up the earlier application, it’s now asking me, “Do you want to make the application reliable?” To get it running robustly, even if things like data sizes change.
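Under the hood, a diagnosis like this typically comes from matching known failure signatures against the fetched error logs. A minimal sketch of that idea, assuming a hand-picked signature list (the patterns and explanations here are illustrative, not Jeeves’s actual rule base):

```python
import re

# Well-known Spark failure signatures mapped to plain-English root causes.
# Illustrative only; a real system would use a much larger rule base.
SIGNATURES = [
    (re.compile(r"java\.lang\.OutOfMemoryError"),
     "Executor ran out of memory; consider increasing spark.executor.memory."),
    (re.compile(r"Container killed by YARN for exceeding memory limits"),
     "Container exceeded its memory limit; increase the memory overhead."),
    (re.compile(r"FileNotFoundException"),
     "An input path is missing; check the job's data dependencies."),
]

def diagnose(log_text):
    """Return the first matching root-cause explanation, or None."""
    for pattern, explanation in SIGNATURES:
        if pattern.search(log_text):
            return explanation
    return None

log = ("21/02/28 ERROR Executor: Exception in task 3.0\n"
       "java.lang.OutOfMemoryError: Java heap space")
print(diagnose(log))
```

The chatbot can then surface the one-line explanation instead of the raw stack traces.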
So this is what we actually saw in 2019 with the Jeeves Chatbot. Fast forward to 2021, and a lot of interesting things have happened. We actually had a pandemic. But on top of that, every company has used the last few years, and all the data that has been collected from many different data sources, to transform themselves into data companies. The entire healthcare space was totally transformed, and the same is true across the board, from healthcare to transportation, to finance and entertainment.
Every industry, every vertical is on the path to generating great insights from its data. And how are they doing that? They’re doing that by creating data pipelines. Almost any company, if you look at it today, is running at least a handful, around 10, data pipelines in production, which power their innovative data products. And where are these pipelines actually running? They’re running on a new data stack, and let’s see why a new data stack is required to run these pipelines. Ultimately, the pipelines are taking a lot of diverse data sources which have to be captured. Sometimes batch ingestion happens. Sometimes these data sources have to be captured in real-time. They have to be stored. And all these new data sources could be huge in volume.
They could be very diverse in their nature, from structured, semi-structured to unstructured data. And then they have to be passed on to innovative algorithms, sometimes involving very sophisticated machine learning and AI transformations, then published and finally consumed as part of advanced analytics, BI, or real-time applications to power these data products.
Altogether, this is the data flow that you refer to as a data pipeline. Companies are building very innovative data pipelines on a new data stack that has emerged over the past few years. And this new data stack is multi-system, given how pipelines have very different needs along the different stages of the pipeline. Take something as simple as orchestrating a few tasks on data. Even that turns out to be an interesting and complex problem, and we have new systems like Azure Data Factory, Fivetran, and Airflow for this phase of the pipeline. Then there is the streaming aspect of pipelines, where stream data is ingested, processed, and served from a real-time store to real-time apps.
For that, we have systems like Fluentd, Kafka, Spark, Druid, MongoDB, and Pinot. On the data lake side, to store these huge volumes of diverse data sets and to apply machine learning and advanced analytics on the data, we have systems like Databricks, storage systems like Delta Lake and Azure Data Lake Store, and we have Spark, or SQL engines like Trino and Presto. For those of you who are more on the data warehousing side of the pipeline, you have systems like Snowflake or Redshift, and transformations can be done in a system like dbt.
On the machine learning side of the pipeline, there are new systems like SageMaker, TensorFlow, and PyTorch. For storing all the diverse kinds of data and discovering them quickly, we have systems like DataHub and Amundsen, and for all the features that have been created from this data, systems like Tecton. So you see, to power all of those innovative data products and run the data pipelines that actually feed and generate the data for these products, you need a collection of systems.
And anytime you have applications that are composed of multiple systems, you know that a whole bunch of challenges can arise, right? The entire DataOps practice has emerged in the past few years to enable pipelines to be run effectively and efficiently on this modern data stack. We talked to a large number of companies to understand better how they manage their data pipelines, and we got a variety of interesting insights. Some of the companies told us that when running pipelines, especially on this new stack with multiple systems feeding into each data pipeline, problems can arise: problems in performance, problems in cost, problems in quality. A lot of the time, the detection of these problems is very reactive. And worse, fixing the problems in these pipelines can take hours, if not days or weeks.
As these pipelines have become mission critical, SLAs on these pipelines have become very important, so missing an SLA can actually lead to penalties. And as more and more users have come onto these platforms, something or the other will cause a problem where a user is not being as efficient as they need to be. On the other side of the fence, the operations teams are often complaining that these pipelines are not well tested. So there can be situations where the pipelines fail, and that causes all these problems around detection and time to fix.
A lot of the companies also told us that there is a problem in terms of coordination around when these pipelines get scheduled to run. If that scheduling is not handled properly, we can end up in scenarios where the pipelines cause artificial contention on the cluster, or, in a Cloud environment, where nodes scale up and a lot more cost is incurred unnecessarily.
We heard a lot of complaints around not having the right quality of data, or not understanding when the data actually breaks, which causes issues with the correctness of the data product itself. Many companies are trying to migrate their pipelines from On-Prem environments to the Cloud, and that itself is challenging if all the dependencies of the pipelines are not captured correctly.
And last but not least, many companies want to move to a better place where there is very good CI/CD for their data and their pipelines, just like the CI/CD for their software stack. A lot of companies are also in a modernization phase where they have adopted these platforms, especially running on the Cloud, and reducing cost and running these platforms more efficiently is a top priority.
If you look at all of these different kinds of use cases and questions that come up, you’ll see that they all fit in the overall development life cycle, or SDLC, of these data pipelines. The pipelines are created, maybe by a data scientist or a data analyst. Then there might be a data engineer who is tasked with taking the pipeline and moving it into production, which involves detecting problems in the pipeline during CI, and then deploying the pipeline in production, from where the operations teams take over. They do the day-to-day monitoring and operations, and sometimes some of the optimizations that need to be done as the data changes or more and more pipelines come on board.
The overall feedback from running the applications in production comes back, and then the development teams might take that feedback and update the pipelines or further tune them, and this entire process keeps going on. It’s a constantly running, iterative process.
In this entire process, notice that there are multiple phases and multiple stakeholders, who all have to coordinate to ensure that the different problems that we saw, which might be in one stage of pipeline development or specific to one stakeholder, can all be addressed. And to do that, we need an effective DataOps practice that has a holistic view of this entire development life cycle and all the stakeholders that are involved, so they can coordinate accordingly, especially in an environment where not everybody might be an expert in all of these different stages. That’s where the DataOps practice comes in. And what I’d like to show you in the rest of the talk is how we have created Unravel’s Pipeline Observer to simplify this entire DataOps practice for data pipelines running on the modern data stack. From a 30,000-foot level, the pipeline observer works as follows.
It collects all different kinds of telemetry data from all the different systems that comprise the modern data stack and power pipelines. The telemetry information might be in the form of logs, metrics, or traces. Or it might be in the form of more structured events, or in the form of configuration and metadata. All of this information constantly streams into the Unravel platform, where the pipeline observer is responsible for correlating the data and making it available in a real-time store. And from that store, there are microservices, for example, a microservice for baselining, or a microservice for detecting an anomaly in a run of the pipeline and then finding out what the root cause of that anomaly is.
Each of these microservices further enriches the data in the form of insights, and all of them are served by the pipeline observer UI, which can also be consumed via an API that can then power a Chatbot, or be used for proactive alerting when your pipeline is having a problem. On top of that, the pipeline observer can help track SLAs, especially deviations from the SLAs. It can help do very fine-grained cost chargeback for the pipelines, mapping them by relation to the data products that are running on top of these pipelines, or help with pipeline capacity estimation and planning as things change over time.
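As a rough illustration of the baselining idea, a run can be compared against the median duration of recent runs, and an anomaly flagged when it deviates beyond a threshold. The durations and the 2x alerting threshold below are assumptions for the sketch, not Unravel’s actual logic.

```python
from statistics import median

def baseline(durations):
    """Median duration of recent runs serves as the baseline."""
    return median(durations)

def deviation(current, base):
    """How many times slower (or faster) the current run is vs. the baseline."""
    return current / base

recent_runs_s = [48, 52, 50, 47, 51]  # recent run durations in seconds (hypothetical)
today_s = 400                          # today's run duration (hypothetical)

base = baseline(recent_runs_s)
factor = deviation(today_s, base)
if factor >= 2.0:  # alerting threshold, an assumption for this sketch
    print(f"Pipeline run is {factor:.0f}x slower than baseline ({base}s)")
```

A real baselining service would also segment by pipeline, input data size, and time window, but the comparison step looks essentially like this.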
So, I’d like to show you a demo. In the context of this demo, what I picked is a modern data stack where the data pipelines are running on compute powered by Azure Databricks. The storage is on Azure Data Lake. The orchestration is happening with Apache Airflow. The transformations are being done with dbt. We have data quality checks being run using Great Expectations. The Chatbot itself is on Slack, like I showed you earlier. And the underlying insights and APIs are powered by Unravel, which provides end-to-end observability around data pipelines.
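To sketch the orchestration piece, here is a toy dependency-ordered task runner in plain Python (not actual Airflow code); the task names and dependencies are hypothetical, but they mirror the ingest, dbt transform, quality check, publish flow of the demo stack.

```python
# Toy dependency-ordered task runner, sketching what an orchestrator like
# Airflow does. Task names and edges are hypothetical examples.
deps = {
    "ingest": [],
    "dbt_transform": ["ingest"],
    "quality_checks": ["dbt_transform"],  # e.g. a Great Expectations suite
    "publish": ["quality_checks"],
}

def topo_order(graph):
    """Return tasks so every task comes after all of its dependencies."""
    order, seen = [], set()
    def visit(task):
        if task in seen:
            return
        seen.add(task)
        for dep in graph[task]:
            visit(dep)
        order.append(task)
    for task in graph:
        visit(task)
    return order

print(topo_order(deps))
```

In a real orchestrator each task would also carry retries, schedules, and SLAs, which is exactly the metadata a pipeline observer can correlate with the telemetry it collects.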
I’ll focus on three demo scenarios: a pipeline that is in danger of missing its SLA, a pipeline that’s having a cost overrun, and a pipeline that’s having a data quality problem. So, let’s check it out in action.
So, back to the Chatbot interface in Slack. Here we can see the first big change that we have made: instead of having to ask and converse, the interface is actually telling us that a particular pipeline, the reporting-insights-hourly pipeline, is running significantly slower than the baseline. And in one click, we can check out what is going on with the pipeline. What you’re seeing here is the pipeline observer UI, where the central part of the UI is showing the reporting-insights-hourly pipeline and information about the run itself. If you look at the right side, there’s a live feed. And as you can see in the live feed, the pipeline run is delayed. It’s significantly delayed, as much as eight times slower than the baseline. The baseline itself is shown on the leftmost side.
The baseline can be built from recent runs of the pipeline. It can be built from all the runs in a recent time window, or you can compare with a very specific run. From the baseline, you can see that the duration of this pipeline is well under a minute; the time is shown in milliseconds, right? But today’s run is taking eight times more. So, what is going on here? With the pipeline observer, you can also see, on the right side, that a constant root cause analysis of the performance deviation is being run. And right here, it’s telling us that the delay is actually happening in a very specific component that is being run as part of the pipeline. And you can click on the cluster activity to check why that problem, which seems to be a resource delay, is happening.
So, let’s check out the cluster activity. What you’re seeing here is a point in time in the execution of the pipeline. You can see that this particular [inaudible] here is running as part of the pipeline itself, the reporting-insights-hourly, but it’s struggling to get resources. It’s stuck in an accepted state. And hogging all the resources, the UI is also showing us, is an ad hoc BI app that’s running and taking all the resources, starving the pipeline of the resources it needs to run and finish within the SLA time.
So notice how very quickly, going from overseeing a lot of different pipelines, to understanding which pipelines are having a problem, to understanding why that problem is happening, everything can be done intuitively using the pipeline observer interface and surfaced via the Slack Chatbot.
Let’s check out the next problem. This one is basically a problem where the pipeline is having a cost overrun. The pipeline job-insight-aggregate-v4’s overall cost is very significantly different from the baseline. So let’s check out that particular run.
So, this is the pipeline observer showing the run of another pipeline, which is not only delayed, but worse, is having a big cost overrun. If you look at the corresponding baseline value and then look at the cost, the cost is shown in cents. So this is around $2, which is what the pipeline usually takes to run. But today’s run has already taken $12; that’s six times more.
And once again, we can drill down into the applications themselves to understand which application is causing the significant overrun. At a glance, if you check out right here, you can see the cause of this cost overrun: in the baseline, only six components are running as part of this pipeline, but in today’s run, the one having the cost overrun, more components have actually shown up. And this is a very common problem that we see, where sometimes some changes are checked into the pipeline, or it could be that the data properties changed, causing the query optimizer to pick a different plan. And that’s causing the change in performance and, worse in this case, a huge cost overrun.
So, let’s check out the third problem, which is even more interesting. This is a data quality problem, where the pipeline that’s running is actually having an issue with data correctness. As you see in the interface right here, this time the cost is okay, the application is running well on time, but it is having a quality issue. We are using the Great Expectations tool, where you can define assertions that specify what it means for the data to be correct and of high quality. And right here, you can see that one of those assertions has failed, and you can see on which particular table and which data set the problem is happening.
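The idea behind such assertions can be sketched in plain Python (this is not the Great Expectations API; the checks and sample rows are hypothetical), in the same spirit as the “3/5 validations failed” message from the abstract:

```python
# Minimal data-quality validation sketch in the spirit of Great Expectations.
# Rows and checks are hypothetical, not the actual cmp_kpis assertions.
rows = [
    {"campaign_id": "c1", "spend": 120.0, "clicks": 340},
    {"campaign_id": "c2", "spend": -5.0,  "clicks": 210},  # bad: negative spend
    {"campaign_id": None, "spend": 80.0,  "clicks": 95},   # bad: missing id
]

validations = {
    "campaign_id is never null": lambda r: r["campaign_id"] is not None,
    "spend is non-negative":     lambda r: r["spend"] >= 0,
    "clicks is non-negative":    lambda r: r["clicks"] >= 0,
}

# A validation passes only if every row satisfies its check.
results = {name: all(check(r) for r in rows) for name, check in validations.items()}
failed = [name for name, ok in results.items() if not ok]
print(f"{len(failed)}/{len(validations)} validations failed: {failed}")
```

A pipeline observer can then surface the failing assertion, along with the table and data set it ran against, the way the demo does.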
So overall, you notice how, with the pipeline observer, managing the pipeline has gone from being some gigantic thing with a lot of different components to something much easier and much more tractable.
So, this is what we have actually done over the last couple of years: looking at data pipelines running on the new data stack, understanding all the different challenges that data teams face in running these pipelines, and seeing how we can bring together an AI-driven DataOps practice, powered by a tool like Unravel in conjunction with a bunch of other tools in the ecosystem, to enable pipelines to be developed and managed with ease while saving you time and money.
We’d love to get your feedback. Please sign up for a free trial at this particular URL. We are definitely hiring; we would love to have people who are passionate and interested in these problems working with us. These are pretty hard and interesting problems. And last but not least, your feedback on the talk is very important. Thank you.
Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease-of-use and manageability of data-intensive systems, autom...