Raven: End-to-end Optimization of ML Prediction Queries

May 28, 2021 10:30 AM (PT)

Download Slides

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.

We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.

This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.

We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

In this session watch:
Konstantinos Karanasos, Principal Scientist Manager, Microsoft
Kwanghyun Park, Senior Research Engineer, Microsoft



Konstantinos Ka…: Hello everyone. And thank you for joining this virtual session. Today, together with Park, we will be talking to you about Raven, a system we have been building to optimize machine learning prediction queries within Azure data analytics end-to-end. We are, Park, over the Gray Systems Lab of Azure data for Microsoft.
So, to set some context, what we’re after here is what we call enterprise-grade machine learning. So, what we want to do is to be able to support our enterprise customers in all their machine learning life cycle. We have identified three main areas that we are actively working on. So, this includes how do you create new models, which is what you’ll know as training, developing new models. Once we have done that and we are… We have deployed our own models, we will use them with new data to perform the… What we call, the model inference or scoring. And the third main area is governance. So, we want to be able to support provenance and tracking and various policies, I mean, both for the data and the models that are being used in the process.
Today, we’ll be talking about Raven, which focuses on the model inference side of things. And to make things a bit more concrete. Let’s see the following with use case, which is about across people. And it’s about the health domain. And what we want to do is, there is a data scientist that essentially wants to build a model to predict how long every patient is going to stay in the hospital.
In order to do that, he has access to many different data across different hospitals in the country. So, he will go on and do this data exploration and preparation. He will clean up the data, join different data sets, and so on. This is an interactive process. Then he will train various models, pick the best one and deploy it in a service like Azure ML, for example. So, this is our model that predicts the length of stay of a patient in the hospital.
Now, let’s see how we will consume this model. So, then there is an analyst or a software developer. And what she wants to do is to use this model, so that she can find the pregnant patients that expect to stay in the hospital for more than a week for specific hospitals. So, she has access to those specific hospital and wants to find those patients.
Now, let’s see how this would happen today. So, typically, say, with the… I have access to a web server that includes of this application logic. And data will typically reside in a database, like in traditional DBMS because this is high-value data. And then will also have a container that includes the model and can be accessed through REST end point. So, she will, first, enter her request to the web server. The web server will talk to the database in order to access the patient data. We get back those data and then we’ll access the container through a REST call in order to get the predictions back and update probably the database with the latest predictions. Now, if you have multiple models, you will even have more containers involved.
So, this approach is lacking A lot of enterprise features. First of all, very important for our customers is security. So, both the data and the models are residing out of the database, which very often is a no-go, for example, for something like hospital. You can also observe a couple of these extra infrastructure of keeping many different components, which increases the total cost of ownership. And we also are… Like in some tools that it can be present in mature database systems. And now, on the other hand, we also are getting hit on performance because we have a lot of data movement around, which increases latency and can drop throughput, especially for the batch coding scenario that we’re targeting here. So, instead, what we are after with Raven is to do all these in-engine.
So, here, you can have the data engine, and you can see that it includes both the data and it can also, within the boundaries, have the models. And [inaudible] databases we will see, can be Spark or any other data engine well. So, here, the analyst will access the web server. The web server with access directly the data engine and get back the predictions. So, we have a lot of enterprise features this way. So, it helps security because day and time models are within the boundaries of the data engine. We can reuse existing infrastructure. And we also drop the total cost of ownership.
But what about performance, which is our goal in the rest of the talk here? [Inaudible] without techniques, compared to our implementation, our integration of Raven in Spark, we can be up to 13x faster and in SQL Server, we can be up to 330x faster, as you will see.
So, now let’s see an overview of our Raven optimizer. So, the input to the optimizer is data and models. Data can be expressed in any type of queries, so typically, we are looking into the SQL Server, via T-SQL and Spark SQL. That’s what we care about here. We can also have support from Pandas, which gets translated to SQL with one of the projects we’re working on. And then, on the model side, that we can have ONNX, we can have PyTorch and any other model. But essentially, everything can be translated to ONNX, which we will be using as our format here.
So, once we have the… The first main idea is that we are expressing this prediction query within the same IR. So, this is what we call a unified IR. And you can see that we will have both data operations here, like joins and selections, as well as models. This is fed to the Raven optimizer. And then, it optimizes this plan and the outputs and optimizing the intermediate representation, as you can see here. And this can be executed either in SQL Server or on Spark, as we’ve integrated Raven with both. And this is part of our Azure Synapse Analytics offering as well.
So, Park is going to be talking to you in a bit about the specifics of the optimization. So, two main points before that. The first one is this unified intermediate representation that I mentioned. This is important because we can express both data and machine learning operations in a common graph. And the second one is that we’re embedding high-performance machine learning runtimes within our Azure data engines.
So, let me double-click very fast on each of them. So, how do we construct this IR? So, first of all, the Raven IR operators are the four of them. We have relational algebra, so everything you’re expecting from your database. We have linear algebra. We have other machine learning operators like decision trees and data featurizers, as well as UDFs.
And the way we construct this is we get this a SQL+ML plus PREDICT query. And we are analyzing it, we’re parsing it, and then we’re constructing this ir. And as I mentioned before, we also have an ongoing project that we can take Python, like Pandas expressions and express that to SQL and then to our IR well.
And then about the embedding of machine learning inference runtimes within Azure data engines. Just a couple of points here. When it comes to SQL Server, we already have a PREDICT statement that allows the users to integrate machine learning inference within simple SQL queries. This is… And we have embedded ONNX Runtime in the SQL Server. And this is available already in Azure SQL Edge and in the Data Warehouse offering, which is part of the Azure Synapse Analytics.
And when it comes to Spark, we have also introduced a new PREDICT operator that has a similar syntax to SQL Server. And we can support different types of models.

Kwanghyun Park: All right. So, from, now on I’ll briefly go over how Raven works with a practical example. So, here is one query, which is to find pregnant patients expected to stay in the hospital more than seven days. So, as you can see in the screen, the top yellow box is about fetching a pre-trained model. And then, the middle box is a data query, which is fetching data for the ML features. And then the bottom box is end-to-end fetching quarry, which is using both data operation and MLs pipeline.
So, as a first step, Raven is actually generating this unified intermediate representation. So, as you can see in this gray box, the bottom part is basically a whole bunch of data operations, including joins and joins filtering, and then the selections. And the middle part is some simple data pre-processing, which is using categorical encoding and then scaling. And the top part is actually the tree classifier, along with the extra filter to get the patients who is longer… Who’s staying longer than seven days. So, we are actually representing this whole thing in a single deck in this step.
And then, in the second step, we are actually applying our novel cross-optimizations and then create this optimized query plan for both data and ML pipelines. So, I’ll explain more details in the next slide about this list of optimizations. And so, basically, by using this optimized query plan, we can… I mean, in the final step, we actually decide where to run… I mean, we can decide which runtime we use for each of these components.
So, let me double-click this cross-optimizations with our example. So, as you can see here in the left box, which is basically original query plan without any authorization. And then the right-hand side is optimized query plan after applying our set of cross-optimizations. So, let me give you the list of projections. The first optimization that we are applying is called predicate-based model pruning.
As you can see in this example, here is one data bracket, which is about whether the patient is pregnant or not. And simply, we can actually… I mean, simply we can propagate this information to decision tree classifier, so that we can prune some of the branches in the tree. So, as you can see, since we know the top condition in the tree, which is about pregnancy, is always true, that means what we can do is we can prune the right-hand side of the tree completely. In this way, we can avoid weeding the unnecessary features.
The second optimization is called, modern projection pushdown, which is basically getting information from the ML pipeline and then prune… I mean, save some data fetching based on that information. So, in this example, what we did is we already pruned the right-hand side of the tree, which means we can avoid weeding a gender column. So, that means we can add additional project operator in the query plan, so that we can avoid weeding some unnecessary columns.
Then, the third optimization is called, model splitting. So, as you can see here, in the left-hand side, we have a pretty simple branch, which is in green color. And then the right, the red color is another decision tree. So, we can basically apply different optimizations to… I mean, these two different types of decision tree. So, by splitting this model at the end, basically we union result of this splitted models, as you can see in the right-hand side.
And the fourth one is called, a model-to-SQL translation. So, as you can see here, the green color, which is pretty simple decision tree. We can simply represent this as a SQL expression. In this way we can save… I mean, we can avoid invoking ML runtimes completely.
And then, the fifth one is called . So we, we have another project called, Hummingbird. So, this project is also presented in Spark Summit as well. So, by using that technique, what we can do is we can simply translate that deep decision tree, which is in red color here. We can translate it into a neural network, so that we can utilize whole bunch of sophisticated neural network optimizations and parallel execution plan, whatever.
The last of optimize… I mean, yeah, sixth one is called, I think… We use standard DB operations as well. So, by enabling these Raven optimizations, some cases, we can completely avoid complex drawings and some complex data operations. And then lastly, by enabling these neural networks, we can also apply compiler optimizations automatically. So these are the set of occupations we can support at this point.
And then, next slide, let me briefly explain what are the key ideas behind these Raven optimizations. So, the first thing is to avoid unnecessary computations. So, as you saw in the slide, what… I mean by using our first two optimizations, what we did was basically passing informations between modal and data pipelines. And then, sometimes you can… We could avoid some complex computations, and also we can avoid data fetching, which is pretty expensive.
And then the second important thing is we are picking up the right runtime for each operations, which means, in some cases, if the machine learning pipelines can be represented in SQL operations and that they can do better in data engine, then we actually avoid invoking ML pipelines. And then the other way around, which means if we can do better in ML runtimes with… For the data operations, then we go for that.
And then the third important thing is actually happening with the project called, Hummingbird. So, basically, by using this Hummingbird technique, we translate traditional machine learning models to neural networks, so that we can utilize hardware accelerations with the GPU and PGA.
So, in this slide, it is showing our results with Spark HDI cluster. We tested the Raven with two different datasets. One is with the hospital, which is predicting patients who is willing to stay more than a week. And then, the other data set is the hospital. So, it is basically predicting the hotel rating. In this graph, the y-axis is basically end-to-end query time with and without Raven optimizations. And then the x-axis is different model types, such as decision tree, logistic regression and gradient boost.
As you can see here, Raven can outperform other base ML runtimes, such as, as broken as SparkML, scikit-learn, ONNX Runtime by up to 44x, which is pretty huge, right? And then, one thing… One interesting thing is, so in case of simple models, such as decision tree with with depth 5, the main reason that we got this 44x was because we could avoid ML runtime invocation, because we could completely… You placed the ML decision tree into SQL, so that was the main reason. And then the other one is even for the complex models, such as the gradient boost with 20 estimators, we are able to get some gains, as you can see here, by enabling some other rules like, model projection pushdown.
The next slide shows how Raven can do better with hardware observations. We tested Raven with neural network translation for a GPU. In this graph, x-axis shows different parameters for the gradient boost models. The leftmost one is 60 estimators with depth 5, which is fairly simple. And then, rightmost one is using 500 estimators with depth 8, which is really complex.
As you can see here, the baseline which is in a green color, the performance is pretty sensitive as the model is getting complex. But in case of Raven neural net translation with GPU, that performance was quite stable, even though we used the complex models. So, at the end, we were able to get like 3x speed-up, even for the complex model.
And then, we also tested Raven in SQL Server as well. The main difference between this one compared to Spark is we also tested one of the popular RDBMS machine learning package, which is MADlib. So, we… To the fair comparison, we used DOP1 SQL Server with Raven, because MADlib is only support a single thread. So, even with single-thread Raven, could able to beat MADlib by large margin, like 230x. And then, even for the baseline SQL Server with DOP16… DOP means Degree of Parallelism, so that means we are using multi-threads. Even with that set up, we were able to improve the performance by 100x. So, the potential gain in SQL Server is pretty large, as you can see here.
And then the last evaluation is done with the SQL Server with GPU. As we saw in the previous slide, the trend is pretty similar to Spark. So, as the model gets complex, the gain by using hardware acceleration is getting bigger. So, in this case, we were able to get like 2.6x speed-up.
Okay. So, next, we’re going to give a quick demo with Raven implementation on Spark. So, one of my collaborator, [Dalisa,] will give a demo.

Speaker 3: Okay, everybody. In this demo, I’ll be showing how we can use Raven with Spark SQL and run the Raven optimizers on models to make predictions. So, here, I have a decision tree model. I’m going to show how we can run it in Raven. So, can import Raven just like a regular Spark session. And in… First case, I’m going to run it with all the optimizations turned off. And then in the next few examples, I’ll compare that with the prediction pushdown optimization turned on, ML-to-SQL optimization turned on, and do the ML-to-DNN optimization turned on for GPUs.
So, let’s run it with all the optimizations turned off first. So, let me look at a sample of the data set. Now, I’ll create a PREDICT query. I can look at the Spark, and that’s generated for this. I’ll now run the query. So, for that optimization, it takes 168 seconds.
So, now, let’s run with the ML-to-SQL optimization turned on. Set that to True. So, de-optimize that to True also. Let’s run this. So, now we can see with the ML-to-SQL optimization turned on, it has replaced the model in the Spark plan with a SQL expression. Now, let’s run it and see how fast it goes. ML-to-SQL turned on, it only took 13 seconds. Yes, now, I’ll run it with the projection pushdown optimization. So, I’ll set this to True. Then, let’s see it I doit False. Let’s run it and see how this performs. So, if you look at the query plan, we can now see that it’s replaced the model with batch [inaudible] optimized ONNX model. So, this model is the optimized model, the ML… I would say, prediction pushdown optimization.
So, let’s a have look at this model and compare it to the original. [inaudible] So, if you compare this to the original model, we can see, this model is much leaner and trimmed down, with some of the inputs pruned from the tree. So, let’s run it and see how long this model takes, takes 100 seconds.
And the next one I’ll look at the ML-to-DNN optimization on a GPU cluster. All right, so in this example, I’ll be looking at Raven’s ML-to-DNN optimization. Some optimization then takes traditional ML models and converts them to DNNs for a GPU optimization. So, this is a model, I have here. This is a tree classifier. And I’ll now run it, first, without the ML-to-DNN optimization. Then, in the next run, I’ll run it with ML-to-DNN optimization side to try if we can see the performance improvement. So, it’s that.
So, here’s our Spark plan. And now it’s running the query. Let’s see how long it takes. So, we can see, it took 156.7 seconds. Let’s see how it does with ML-to-DNN optimization. I’ll now run this with the ML-to-DNN optimization on Raven. I’ll set that to True, then start. How long it takes? I know you can see it took 27 seconds. This is an example of how you can sit down with Raven and run the various Raven optimizations for your models using Spark SQL. Thanks everybody.

Kwanghyun Park: Okay. So, to conclude this talk, Raven is the first step in a long journey of incorporating ML inference inside the data engine. I mean, and also we support novel Raven optimizer with the cross-optimizations and then operator transformations. As we showed, we were able to get a fairly large margin of performance improvement, both on Spark and SQL Server. So, our integration with Spark is almost done. And then, SQL Server side is ongoing work.
Thank you for attending this talk. We’ll be happy to get some questions for the rest of the time. Thank you.

Konstantinos Karanasos

Konstantinos Karanasos is a Principal Scientist Manager at Microsoft’s Gray Systems Lab (GSL), Azure Data’s applied research group. He is the manager of the Bay Area branch of GSL and tech lead fo...
Read more

Kwanghyun Park

Kwanghyun Park is a senior research engineer at Microsoft Gray Systems Lab (GSL), looking at query processing and optimizations in database management systems. He is also interested in large-scale dat...
Read more