Systems for performing ML inference are increasingly important, but are far slower than they could be because they use techniques designed for conventional data serving workloads, neglecting the statistical nature of ML inference. As an alternative, this talk presents Willump, an optimizer for ML inference. Willump speeds up ML inference applications through two new statistically-motivated optimizations which target ML applications whose performance bottleneck is feature computation. First, Willump selectively computes features. It automatically classifies most data inputs using only high-value, low-cost features, improving performance by up to 5x without statistically significant accuracy loss. Second, Willump accurately approximates ML top-K queries. It discards low-scoring inputs with an automatically constructed approximate model then ranks the remainder with a more powerful model, improving performance by up to 10x with minimal accuracy loss. Both optimizations automatically tune their own parameters to maximize performance while meeting a target accuracy level. Willump combines these novel optimizations with powerful compiler optimizations to automatically generate fast inference code for ML applications. Overall, Willump improves the end-to-end performance of real-world ML inference applications by up to 16x.

– Hi, I’m Peter Kraft, and today, I’d like to talk to you about Willump, a statistically aware end-to-end optimizer for machine learning inference.

So, as you already know, Machine Learning inference is an increasingly important problem today. And there’s been a recent focus on tools for ML prediction serving.

A common bottleneck in machine learning inference is feature computation. In many ML inference applications, a pipeline of transformations computes numerical features from raw data and then executes a model on those features. Often, this feature computation is the most expensive part of ML inference.

Feature computation is most often a bottleneck when using inexpensive ML models such as boosted trees or linear classifiers. But not neural networks. This happens most often when performing ML inference on tabular or structured data.

Feature computation is often a problem in production. For example, in this study of production ML inference applications at Microsoft, they found that Feature Computation accounted for over 99% of the run-time of some applications.

Unfortunately, the current state-of-the-art in ML prediction serving, doesn’t do everything it can to maximize the performance of future computation. They optimize ML inference using traditional serving and compiler optimizations, which work well but neglect the unique statistical properties of machine learning applications that could be used to improve performance by even more.

One statistical property of machine learning that can be leveraged to dramatically improve performance, is the amenability of machine learning to approximation.

For example, most ML classification workloads contain a mixture of easy and hard data inputs. Where an easy data input can be accurately classified by a computationally simple model, but a hard data input requires a more powerful model. Existing ML inference systems use the same expensive model to classify both easy and hard data inputs. But a statistically aware system could use a computationally cheap model on easy data inputs, and a more expensive one on hard data inputs, therefore significantly improving performance. Another interesting property of machine learning is that many ML applications are run as part of higher-level applications, such as Top-K queries. A Top-K query asks us to rank and return the K highest-scoring items in a dataset. For example, the 10 artists a user would like the most. Existing systems would do this naively by predicting every single item in the dataset with the same expensive model, and then rank and returning the top K.

A statistically aware system on the other hand, could a use a computationally cheap model to identify and discard the majority of low-scoring items, and then use a more powerful model on the minority of higher-scoring items to precisely rank and return them, dramatically improving performance.

The statistically aware optimizations I just discussed aren’t new, and they’ve been used in both industry and academia for a while. However, existing implementations of them are application specific and custom built. You have to do a lot of work yourself to adapt these optimizations for your particular workload and problem demand. For example, by constructing your own approximate model.

This creates a dilemma. Our relentless systems are easy to use, but they’re slow because they don’t implement statistically aware optimizations. The optimizations are much faster, but they require a lot of work to implement a significant ML and domain expertise.

So that begs the question, can we build an ML inference system that’s both fast and easy to use?

We think the answer is yes. And that’s why we developed the Willump, a statistically aware optimizer for machine learning inference. Willump optimizes ML inference applications whose performance bottleneck is feature computation. It uses automatic modeling agnostic, statistically aware optimizations to dramatically improve the performance of real world ML inference workloads.

And the remainder of this talk, I’m first going to give a high level overview of the Willump system then discuss Willump’s statistically aware optimizations in more detail. And then evaluate Willump’s performance on a real workloads.

At a high level, the goal of Willump is to automatically maximize the performance of ML inference applications whose performance bottleneck is feature computation.

Willump begins with an ML inference application written as a pipeline from raw data to features to predictions. The first thing Willump does is infer the transformation graph of this application to figure out which operators are computing which features.

Next, Willump optimizes feature computation using the statistically aware optimizations that I’ll discuss later in this talk.

Then, Willump optimizes individual operators using compiler optimizations.

Finally, Willump returns a compiled and optimized pipeline, ready to perform ML inference.

Now I want to talk about the first of Willump’s statistically aware optimizations, the end-to-end cascades optimization, which maximizes the performance of classification workloads.

So, the high level idea behind cascades, as I mentioned earlier, is that most classification workloads contain a mixture of easy and hard data inputs. Where an easy data input can be accurately classified by a computationally cheap model, but a hard data input requires a more expensive model. The idea behind cascades is that you can use a cheap model to identify and classify easy data inputs then cascade the others, the hard inputs to a more powerful model.

Cascades has been around for a while and are used for tasks like image classification and object detection. Unfortunately, existing cascade systems are application specific and custom built. If you want to use cascades, you have to manually construct your own approximate model and tune its parameters. This can be very difficult. It requires extensive expertise and both machine learning and your particular problem domain.

The goal of Willump, is to make this easier. Willump automatically constructs cascades for any ML application whose performance bottleneck is feature computation. It does this by cascading feature computation. Computing only some features for easy data inputs and cascading to computing all features for hard data inputs.

By default, an ML inference application computes all features for all data inputs, predicts with a model and returns that prediction.

Willump instead computes a handful of high value, low cost selected features for each data input, predicts with an approximate model that returns to that prediction. Unfortunately, the approximate model is not accurate enough to classify all data inputs by itself. Therefore we only return its prediction if it’s above a threshold, if it’s confidence in that prediction is above a threshold, which we call the cascade threshold.

If the confidence is not above that threshold we will compute all remaining features and cascade to the original model.

Okay, so that’s a high level overview of how cascades work. Now, let me talk to you about the more interest and difficult aspect of cascades, which is how we automatically construct them for any ML application whose performance bottleneck is feature computation.

Willump automatically constructs cascades ahead of time during model training for a models training set and an accuracy target.

The most important question to answer like constructing cascades is which features select for use in the approximate model. What are our high-value, low-cost features?

We want to select the features from which we can construct cascades that minimize expected query time, given an accuracy target.

So what is expected query time? Well, when we conduct a query, one of two things can happen. Either we approximate the query or we don’t approximate the query.

If we do approximate the query, then we only have to compute the selected features. So the first term in our expression for expected query time is the probability that you could approximate times the cost of computing selected features.

If you don’t approximate, then you have to compute all features. So the second term in our expression for expected query time is the probability that you don’t approximate times the cost of computing all features.

Sum these together and we get a complete expression for expected query time.

We want to construct cascades from the set of features that minimizes expected query time.

Here’s how we’re going to do it. First, we’re going to choose several set potential feature costs.

Then, we’re going to find the best set of feature that has that cost for each of the cost we’ve chosen.

Next, we’re going to have to figure out which set of features does the best overall. To do that, we’re going to train up approximate model and find a cascade threshold for each feature set we’ve selected. We’ll then compute the expected query time free set of features and construct cascades for a set of features that performs the best overall. The one that minimizes expected query time. So first we have to choose several potential feature costs. These will just be simple numbers, say one 10th of the total cost one fifth the total cost and so on. For each cost we’ve chosen, we want to find the best set of features that has that cost. So what do I mean by best? What I mean is that given a feature costs C-Max, we want to find the set of features that minimizes expected query time, given that the cost of those features is equal to C-max.

This is equivalent to find a set of features from which we can train the most accurate possible approximate model given that the cost of those features is C-max.

Unfortunately, computing approximate model accuracy is expensive. Therefore we’ll estimate the accuracy of an approximate model, trade on a set of features as the sum of the permutation important scores of those features that turns this minimization problem into a knapsack problem that we can solve easily.

Okay, so now we have several sets of several best feature sets. We have to figure out which of these best feature sets is the best one overall. To do that, we’re going to train a model and find a cascade threshold for each set.

We’re going to train the small, we’re gonna do this empirically using held-out data. We’re going to train approximate model, audio set of features on part of the overall training sets.

We’re then going to predict a held-out portion of the training set, and then use the predictions on that held-out set to empirically determine the cascade threshold, determine how often we can approximate while keeping accuracy above the target. Once we know how often we can approximate while keeping high accuracy for each set of features we selected, it’s easy for us to compute expected query time, free set of features you selected and then construct cascades from the set of features that minimizes expected query time overall. By following this algorithm, we can construct the cascades that minimize expected query time while keeping accuracy above the target.

And practice, this improves ambulant first performance by up to five X without statistically significant accuracy loss.

Alright, now I want to talk about Willump’s second optimization, the Top-K Query Approximation, which maximizes the performance of Top-K queries.

A Top-K Query asks us to rank the K highest scoring items of the dataset. For example, the 10 artists or 10 songs, a user would like the most.

Top-K Queries are interesting because they are asymmetric. They require high-value items to be predicted and ranked precisely because they’re going to be returned. Low-value items on the other hand, would only be identified as low-value and then they can be discarded because they won’t be returned.

Therefore we can approximate Top-K Queries by using an approximate model to identify and discard, low scoring items, and then ranking and returning high scoring items with a more powerful model.

Existing systems have done similar things. However, the systems were application specific and custom built. They required users to manually construct their own approximate models and tune their parameters. Just like in cascades, this is a difficult and error prone process. Willump on the other hand, automatically generates approximate models and tunes their parameters for any ML inference application whose performance bottleneck is feature computation.

Just like in cascades Willump automatically selects features and tunes parameters to maximize query performance while keeping accuracy above the target. If you’re interested in more details, please see our paper.

Overall, this improves performance of real world workloads by up to 10 X.

Okay, now I want to talk about how we evaluated Willump’s performance and show that it works on real workloads.

We evaluated Willump on benchmarks curated from top-performing entries to major data science competitions like Kaggle and WSDM. We’re going to look at three benchmarks in this presentation. All there more in the paper. We’ll look at music, which does music recommendation and queries remotely stored precomputed features. The next benchmark purchase, predicts a customer’s next purchase and uses tabular features automatically generated by the auto ML tool feature tools. The third benchmark toxic, performs toxic comment detection and compute string features.

First, we’re going to look at how much the end-to-end cascades optimization, improve throughput on classification workloads. We’ll compare against two baselines, the original implementation of the benchmarks and a compiled version of the benchmarks that runs a bit faster. We find the end-to-end cascades improve throughput by anywhere between 60% and the five X overall baselines.

Next we’ll look at how much the end-to-end cascades optimization can improve latency. We find the optimization dramatically improves median latency, but has a smaller effect on tail latency. Because the tail of a workload optimized for end-to-end cascades is going to be the set of data inputs that could not be approximated and therefore do not run any faster.

Finally, we’ll look at how much a Top-K, we’ll evaluate the Top-K query approximation on Top-K workloads. We find that it improves throughput by anywhere between 2.7 X and 10 X against all baselines.

So in conclusion, what I’d like you to take away from this talk is that it’s possible to dramatically improve the performance of ML inference through statistical optimizations, such as approximation, and now we can do these easily and automatically using simple algorithms like the Willump algorithm I just described. If you’re interested in learning more, please see our source code at github or look at our paper online, published MLCs earlier this year.

Stanford University

Peter is a third-year graduate student at Stanford University, advised by Peter Bailis and Matei Zaharia. His research focuses on systems, particularly on improving the runtime performance of machine learning applications. His most recent paper, Willump, on improving ML inference performance, will be presented at MLSys 2020.