How to use Apache TVM to optimize your ML models

May 27, 2021 03:15 PM (PT)


Apache TVM is an open source machine learning compiler that distills the largest, most powerful deep learning models into lightweight software that can run on the edge. This allows the output model to run inference much faster on a variety of target hardware (CPUs, GPUs, FPGAs, and accelerators) and save significant costs.
In this deep dive, we’ll discuss how Apache TVM works, share the latest and upcoming features and run a live demo of how to optimize a custom machine learning model.

In this session watch:
Sameer Farooqui, Product Marketing Manager, OctoML



Sameer Farooqui: Hello, everyone. Today I’m going to talk to you about how to use Apache TVM to optimize your machine learning models for faster inference in the cloud and at the edge. My name is Sameer Farooqui. I’m a product marketing manager at OctoML. I joined OctoML pretty recently, a month or two ago, after working in Google Cloud for about three years. And before that, I was actually an evangelist at Databricks for Apache Spark, which is probably how most of you guys know me. So I’ve been exploring Apache TVM this year, and I’m really excited to talk to you about it. So first, at OctoML, we want to enable faster artificial intelligence everywhere. And the way we do that is through Apache TVM, which is an optimizing deep learning compiler. There was an article in SiliconANGLE in April 2018 that brought my attention to machine learning compilation. The article title was “AI compilation wars: Intel, Google, Microsoft, NVIDIA, IBM and others arm for deep learning acceleration.”
And there are a few quotes in this article that I found especially telling. For example, “Cross-platform model compilers are harbingers of the new age in which it won’t matter what front-end tool you used to build your AI algorithms, and what back-end cloud, platform, or chipset is used to execute them.” Additionally, “Cross-platform AI compilers will become standard components of every AI development environment, enabling developers to access every deep learning framework and target platform without having to know the technical particulars of each environment.” And the most interesting point to me was that it predicted within the next two or three years, the AI industry will converge around one open source cross compilation framework supported by all front-end and back-end environments. This was a really interesting article to me. And I think TVM stands a really good chance to be this framework of choice in ML compilers. There was another article that I found really interesting from January 2020.
There is a quote from the co-creator of PyTorch and engineer at Facebook AI, Soumith Chintala. He said, “With PyTorch and TensorFlow, you’ve seen the framework converge. The reason quantization comes up and a bunch of other low-level efficiencies come up is because the next war is compilers for the frameworks, XLA, TVM, PyTorch has Glow, a lot of innovation is waiting to happen.” Another really important point that turned my attention to compilers and brought me to OctoML. So in this talk, I’m first going to talk about machine learning compilers, a little bit about how TVM works. I’ll cover some of the existing TVM use cases. And towards the end, I have a OctoML product demo that uses TVM internally. So first things first, what is a compiler? For a lot of the big data or deep learning engineers in the audience, just taking a step back, a compiler is software technology that takes in programs written by humans and turns them into something computers can understand.
The original compilers freed engineers from having to master the arcane operations of computer hardware and allowed even beginners to build fast and efficient software applications. And without good compilers, the entire world of software would be much slower, costlier, and error prone, just generally less capable. So here we have a three-phase static compiler. This is like most C compilers. The front end parses the source code, checks it for errors, and builds abstract syntax tree to represent the input code. In the middle, the optimizer is responsible for doing a broad variety of transformations to try to improve the code’s running time, like eliminating redundant computations. The optimizer also is independent of the language on the left and the hardware target on the right. And then finally, the back end or code generator maps the code on to the target instruction set. And in addition to making correct code, it’s responsible for generating good code that takes advantage of unusual features of the supported back-end architecture.
Common parts of a compiler back end include instruction selection, register allocation, and instruction scheduling. By the way, the JVM is an example of this implementation model, which uses Java bytecode as the interface between the front end and the optimizer. So exploring this a bit. The way classical compilers typically work is there are languages that come in on the left, and the common compiler bridges those different languages to different hardware targets. The most important win of this classical design comes when the compiler decides to support, like in this case, multiple source languages and target architectures. If the compiler uses a common code representation in its optimizer, like a common optimizer, then a front end can be written for any language that can compile to it, and a back end can be written for any target that can compile from it. And with this design, porting the compiler to support a new source language like BASIC, for example, requires implementing a new front end, but the existing optimizer and back end can be reused.
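As a rough illustration of that design, here is a minimal sketch in Python. Everything in it is hypothetical and invented for the example (the two toy input syntaxes, the tuple-based IR, and the two back ends); it is not how LLVM or TVM is implemented. Two front ends lower different syntaxes into one shared IR, a single optimizer folds constants, and two back ends generate output from the same optimized IR.

```python
# Shared IR: ("const", n) | ("var", name) | (op, lhs, rhs) with op in {"add", "mul"}
OPS = {"+": "add", "*": "mul"}

def atom(tok):
    """Turn a non-operator token into a constant or variable IR node."""
    return ("const", int(tok)) if tok.lstrip("-").isdigit() else ("var", tok)

def parse_postfix(src):
    """Front end 1: postfix syntax such as '2 3 + x *'."""
    stack = []
    for tok in src.split():
        if tok in OPS:
            rhs, lhs = stack.pop(), stack.pop()
            stack.append((OPS[tok], lhs, rhs))
        else:
            stack.append(atom(tok))
    return stack.pop()

def parse_prefix(src):
    """Front end 2: prefix syntax such as '* + 2 3 x'."""
    toks = iter(src.split())
    def expr():
        tok = next(toks)
        if tok in OPS:
            lhs = expr()
            rhs = expr()
            return (OPS[tok], lhs, rhs)
        return atom(tok)
    return expr()

def fold(node):
    """Shared optimizer: evaluate operations whose inputs are all constants."""
    if node[0] in ("const", "var"):
        return node
    op, lhs, rhs = node[0], fold(node[1]), fold(node[2])
    if lhs[0] == "const" and rhs[0] == "const":
        value = lhs[1] + rhs[1] if op == "add" else lhs[1] * rhs[1]
        return ("const", value)
    return (op, lhs, rhs)

def emit_python(node):
    """Back end 1: a Python expression string."""
    if node[0] in ("const", "var"):
        return str(node[1])
    sym = "+" if node[0] == "add" else "*"
    return f"({emit_python(node[1])} {sym} {emit_python(node[2])})"

def emit_stack(node):
    """Back end 2: instructions for a toy stack machine."""
    if node[0] == "const":
        return [("PUSH", node[1])]
    if node[0] == "var":
        return [("LOAD", node[1])]
    return emit_stack(node[1]) + emit_stack(node[2]) + [(node[0].upper(),)]

# Both front ends produce the same IR, so the optimizer and back ends are reused.
ir_a = fold(parse_postfix("2 3 + x *"))
ir_b = fold(parse_prefix("* + 2 3 x"))
```

Adding a third input syntax in this sketch would only require a new `parse_*` function; `fold` and both `emit_*` functions would stay unchanged.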
In the LLVM project, which is a very common compiler, this is the technique that’s used, and it supports many families of languages, like the languages GCC supports, Java, .NET, Python, Ruby, Scheme, and Haskell. So machine learning compilers do something similar for different deep learning frameworks on the left instead of languages, and different hardware chips on the right like CPUs, GPUs, or accelerators like FPGAs, or NPUs, or ASIC devices. And that is exactly what TVM is in a nutshell. It bridges different deep learning frameworks like PyTorch, TensorFlow, Keras, ONNX, MXNet, and many more, to different hardware back ends. So that’s fundamentally what a compiler like TVM does. I would encourage you to check out the TVM paper from February 2018. There’s a link to this on the left-hand side. And I want to give you a few quotes that I found really telling in this paper. One, “There is an increasing need to bring machine learning to a wide diversity of hardware devices.
“TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back ends.” And experimental results show that TVM delivers competitive state-of-the-art results compared to hand-tuned libraries for CPUs and GPUs. In addition to TVM, there are two other papers that I’ll cover in just a sec. I actually also wanted to give you this quote from Luis who is the CEO of OctoML, but also one of the co-authors of this paper, that I think he breaks down what TVM does in a really easy-to-understand way for beginners. And what he says is, “The way it actually works is when you set up a new hardware target, like a CPU or a GPU, the TVM engine runs a bunch of little experiments on that target hardware to learn how the new hardware target behaves for your ML model, and it tries out a bunch of different optimizations on that target hardware. By building that set of training data for how the hardware behaves, you can learn the personality of the hardware target, and it uses that to guide TVM’s optimization for that specific target.”
It’s a compiler as well. So you’re not going to change the accuracy of your model. And you can get anywhere from 2x to 3x to as much as 30x better performance on your hardware target. So that’s faster inference on your hardware targets. The other two papers that laid the foundation for the current generation of TVM are Relay and Ansor. Relay is a high-level IR. It’s an intermediate representation that enables end-to-end optimization for deep learning models. Relay is a functional, statically typed IR that unifies and generalizes existing deep learning IRs to express state-of-the-art models. And it essentially is the foundation for future work. The high-level way of thinking about Relay is that your deep learning neural network graph gets dropped down to the Relay format. And then TVM does all of its optimizations on that Relay-represented version of your neural network. And then finally, the third, I think, foundational paper in the space is Ansor, which is for generating high-performance tensor programs for deep learning.
Essentially, Ansor is a project for searching the space of possible optimizations. So Ansor can find high-performance programs that are outside the search space of existing hand-tuned state-of-the-art approaches. Okay. So TVM has been growing significantly in the past three or four years. Right now, there are about 500 contributors. And I really, really want to thank the Apache TVM contributors for bringing the project to where it is today. So TVM was founded… Oh, I’m sorry, OctoML was founded in October 2019. We’re based out of Seattle, Washington, and we have about 50 employees right now. And OctoML was founded by the computer scientists from University of Washington, the team that created TVM, XGBoost, MXNet, and Rust. We’ve raised about 50 million so far in a handful of rounds from Madrona, Amplify, and Addition. And in the open source, we partner with key hardware vendors like Qualcomm, Arm, and AMD, to essentially unlock any deep learning model to run on those vendors’ hardware chipsets. So who is using TVM?
The three prominent use cases I know about are these. At Amazon, every Alexa wake-up today across all devices uses a TVM-optimized model. So that’s for faster responsiveness. So less user-perceived lag, and also less energy usage. So less battery usage. Facebook also has been exploring TVM for language models and speech synthesis models, and a quote from them is that they’ve been contributing to TVM for the past year and a half or so, and it’s been a really awesome experience for them. Andrew Tulloch, an AI researcher from Facebook, said that they’re really excited about the performance of TVM. And Microsoft is another really large commercial user of TVM, for Bing query understanding, which is three times faster on CPU, and QnA bot, which is two and a half times faster on CPU. There are a few other companies listed on the right. Another interesting data point is who attended the TVM conference last year, and there were almost 1,000 attendees from a lot of prominent companies. So you can see that there’s a lot of interest in TVM from really major corporations.
So I want to talk to you about the landscape of deep learning. I think a lot of end users are pretty familiar with the orchestrator frameworks like MLflow if you use Databricks, or Kubeflow, if you perhaps use Google Cloud, or frameworks, right? The last decade of deep learning has been heavily focused on developer productivity and increasing the accuracy of the models to get state-of-the-art results, especially using PyTorch or TensorFlow. But this talk is going to be more focused on the layers below the framework like the accelerators like TVM, or ONNX Runtime, or XLA, or PyTorch Glow, which optimize your model, as opposed to having to use vendor libraries to hand-tune your model, which can be a lot more cumbersome and take months to do. And we’ll talk a little bit about the hardware that you’ll want to run your model on ultimately. Okay. So at the high level, TVM has three bags of tricks that it does to speed up your model. One of them is graph-level optimizations.
This essentially rewrites the data flow graph, the directed acyclic graph of your neural network, the nodes and edges, to simplify the graph and reduce device peak memory usage. The types of graph-level optimizations TVM does, for example, are operator fusion, which fuses multiple small operations together; constant folding, which pre-computes graph parts that can be determined statically, saving execution costs; a static memory planning pass, which pre-allocates memory to hold each intermediate tensor; and data layout transformations, which transform internal data layouts into backend-friendly formats. So those are four examples of graph-level optimizations. In addition, TVM does operator-level optimizations, which are more hardware target-specific low-level optimizations for individual nodes or operators in the graph. And finally, when you get your optimized model, you run that in the efficient TVM runtime. It’s a lightweight runtime system that provides a minimal API for, excuse me, loading and executing your optimized model in Python, C++, Rust, Go, Java or JavaScript.
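To make one of those graph-level passes concrete, here is a minimal sketch of static memory planning in Python. It is an illustration under stated assumptions, not TVM's actual algorithm: the function name `plan_memory`, the op list format, and the greedy reuse policy are all invented for this example. The idea is that once a tensor's last reader has run, its buffer can be handed to a later intermediate tensor, which lowers peak memory.

```python
def plan_memory(ops):
    """Assign each intermediate tensor to a reusable buffer.

    ops: list of (output_name, size_bytes, input_names) in execution order.
    Returns (assignment, pool_sizes): tensor name -> buffer id, and buffer sizes.
    """
    # Record the last step at which each tensor is produced or read.
    last_use = {}
    for step, (out, _, inputs) in enumerate(ops):
        last_use[out] = step
        for name in inputs:
            last_use[name] = step

    assignment, pool_sizes, free = {}, [], []
    for step, (out, size, inputs) in enumerate(ops):
        # Reuse a free buffer that is large enough, otherwise allocate a new one.
        buf = next((b for b in free if pool_sizes[b] >= size), None)
        if buf is None:
            buf = len(pool_sizes)
            pool_sizes.append(size)
        else:
            free.remove(buf)
        assignment[out] = buf
        # Release buffers of tensors whose last reader was this step.
        for name in dict.fromkeys(list(inputs) + [out]):
            if name in assignment and last_use[name] == step:
                free.append(assignment[name])
    return assignment, pool_sizes

# A hypothetical four-op chain; "x" is the graph input and is not planned.
ops = [
    ("conv", 400, ["x"]),
    ("bias", 400, ["conv"]),
    ("relu", 400, ["bias"]),
    ("pool", 100, ["relu"]),
]
assignment, pool_sizes = plan_memory(ops)
naive_bytes = sum(size for _, size, _ in ops)
planned_bytes = sum(pool_sizes)
```

In this chain the four intermediates fit in two buffers, so the planned footprint is 800 bytes instead of the 1,300 bytes a one-buffer-per-tensor allocation would use.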
So I mentioned that TVM does operator-level optimizations. This is what operators are in neural networks. These are essentially the nodes in your neural network DAG, right? Things like matrix multiply, convolution, pooling, softmax, dropout, LSTM, RNN, batch normalization. The edges between these operators or nodes represent the data flow moving between the operators. These operators and nodes, you should also keep in mind, are abstract, hardware-independent, and language-independent APIs. So they’re hardware agnostic. And when you create a neural network DAG, it’s not necessarily aware of the backend hardware it’s going to run on. And one of the things TVM will do is a bunch of optimizations to this agnostic DAG in order to optimize specifically for your hardware backend. So here’s a high-level diagram of the TVM internals. Remember that the main goal of a compiler like TVM is to not fundamentally change the program in the values it gives out. As far as ML compilers go, that means don’t change the accuracy of the model.
So there should be no observable effect of running the program to the user as far as accuracy goes. But, of course, the inference speed should be a lot faster if you use an ML compiler like TVM. So I’ll cover a few of these steps in the future slides. But at the very high level, step one is you import your neural network into the TVM internal format. Really, the import just takes a TensorFlow, PyTorch, or MXNet neural network and builds a fragment of the AST out of it, the graph for it. Then Relay represents your entire neural network in its IR language. Relay essentially provides a deep learning framework-agnostic way of representing that neural network DAG that is ideal for further optimizations, both graph-level and hardware operator-level optimizations. So in step three, for every operator or node in your graph, we have a corresponding tensor expression, which tells us how to implement an operation like matrix multiplication in a hardware-agnostic way.
Each operation is abstract in Relay, operations like add or multiply, and then the tensor expression essentially gives more meaning to it. How do you actually compute the add or the multiplication? The tensor expression or TE is also hardware agnostic. And Relay can take the TE based on the meaning and then combine and compile it to give you a more efficient version of it. One way to think of this is like, you give TVM a recipe to cook dinner, that’s your higher level neural network in TensorFlow, and then TVM can reorder that recipe and change a few steps to be able to cook the dinner more efficiently, but the end result you get is the same meal. So step four, scheduling. The older version of this was auto-TVM. The new version of this is called… The next generation version of this is called auto-scheduler. This is mostly carried out by automation. So you shouldn’t have to worry about it.
But essentially, given a description of a tensor expression, the auto-scheduler will produce a schedule that optimizes performance on a given target. So TVM defines a search space of possible schedules to explore, and then it searches for the ideal schedule. This is what Luis was calling many experiments that get run on your target hardware. And then finally, in step five, the tensor expression isn’t just given any schedule; we use the schedule that the search found together with the tensor expression to generate the final code. And finally, step six, tensor IR. Now that we have a low-level intermediate representation, we simply use LLVM to generate the code. So TIR is lowered to LLVM. Going back to the classical compiler like LLVM, as a programmer, you would write code in C on the left, and LLVM first converts that into LLVM IR to represent that C code. And this is perhaps the most important aspect of a traditional LLVM-like compiler, to drop the user code down to the intermediate representation format.
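The idea that a schedule changes how a computation runs without changing what it computes, and that the best schedule is found by running experiments, can be sketched in plain Python. This is a toy under stated assumptions: the functions `matmul_tiled` and `search_schedule` are hypothetical, the only schedule knob is one tile size, and the search is exhaustive rather than guided as in TVM.

```python
import time

def matmul_tiled(a, b, n, tile):
    """Compute the n x n product of a and b, visiting the loops in tile-sized blocks."""
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c

def search_schedule(a, b, n, candidate_tiles):
    """Run one timing experiment per candidate tile size and keep the fastest."""
    trials = {}
    for tile in candidate_tiles:
        start = time.perf_counter()
        result = matmul_tiled(a, b, n, tile)
        trials[tile] = (time.perf_counter() - start, result)
    best_tile = min(trials, key=lambda t: trials[t][0])
    return best_tile, trials

# Integer inputs keep every schedule's result exactly identical.
n = 24
a = [[(i + j) % 7 for j in range(n)] for i in range(n)]
b = [[(i * j) % 5 for j in range(n)] for i in range(n)]
best_tile, trials = search_schedule(a, b, n, [1, 4, 8, 24])
```

Which tile size wins depends on the machine the experiments run on, which is the point: the compute definition stays fixed while the schedule is chosen per target.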
This intermediate representation of the compiler is interesting because it can be a perfect world for the compiler optimization. Unlike the front-end code on the left, or the backend code which I’m not showing here, but the backend code for the target hardware that the compiler generates, this middle code, this LLVM IR, the optimizer isn’t constrained by either a specific source language or a specific target machine. So it has a lot more freedom about what it can do with the IR to find the optimizations. Okay. So how do we do that? So today, you provide a little metadata to TVM, like shape information and the model itself. Here, we’re loading a ONNX model from its serialized format into Python, and then converting this into a source code in Relay with the parameters. When we have those two things, we can compile and build executor in one step to execute the TVM program. Here, we create a pass context, set optimization level to one, create an executor using the default graph executor.
We pass the module containing all the source code derived from the model, then the CPU as the device and LLVM as the target. This will compile code for your CPU and execute it on one CPU. Then there is one line of code to evaluate the main function of that module, pass input parameters, and convert back to NumPy. And TVM shares memory representation with common frameworks like PyTorch, NumPy, and MXNet, allowing easy integration to pass tensors with almost no cost. Okay. So now we have dropped the neural network DAG into the Relay intermediate format, a high-level differentiable IR. So Relay’s goal is to build a high-level representation with a slightly richer programming model than some of the existing frameworks, to support capturing as much of your model as possible. Then we can holistically optimize each and every part of your model. Some previous IRs didn’t capture control flow, didn’t always capture functional abstraction, and couldn’t capture more complex data structures. So as an end user, you had to work around these, but not with Relay. Okay. So here I’ve rendered a Relay program as a DAG, a directed acyclic graph.
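The import-and-compile steps described earlier (load an ONNX model, convert it to Relay, build an executor inside a pass context, then run it) look roughly like the sketch below. It assumes a Relay-era TVM release (around 0.8 and later, before Relay was retired) with the `onnx` and `tvm` packages installed; the function name `compile_and_run` and its parameters are hypothetical, and the exact `create_executor` arguments have shifted between TVM versions, so check them against the release you use.

```python
def compile_and_run(onnx_path, input_name, input_data, opt_level=1):
    """Import an ONNX model into Relay, compile it for the local CPU, and run it once.

    input_data is expected to be a NumPy array matching the model's input shape.
    """
    # Imported lazily so this file still loads where onnx/tvm are not installed.
    import onnx
    import tvm
    from tvm import relay

    # Load the serialized model and describe its input shape to TVM.
    onnx_model = onnx.load(onnx_path)
    shape_dict = {input_name: input_data.shape}

    # Convert to a Relay module plus a dictionary of parameters (weights).
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    # Compile with the graph executor for an LLVM (CPU) target.
    with tvm.transform.PassContext(opt_level=opt_level):
        run = relay.build_module.create_executor(
            "graph", mod, tvm.cpu(0), "llvm", params
        ).evaluate()

    # Evaluate the main function and convert the result back to NumPy.
    return run(tvm.nd.array(input_data.astype("float32"))).numpy()
```

A call might look like `compile_and_run("model.onnx", "input", x)`, where the file name and the input name `"input"` are placeholders that depend on the model being loaded.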
We have a convolution 2D in red, a bias term in red as well, and then an addition and an activation function like relu. And say you want to do an optimization like fusion. So we can select a subgraph and say you want to fuse it. In many previous frameworks, you had to rely on a fused implementation having been written for your platform. But in TVM, since we holistically represent the entire program, including each and every kernel like convolution 2D, addition, and the activation function, we can take the implementations of each of these kernels at the lower level, at the TIR level, and combine the implementations in TIR to produce an implementation that’s optimized for the input shapes and your target device. And if you have another subgraph like that with the same convolution and bias and activation function, we can produce specialized code for each one that maximizes performance for each device target. So if you want to target two devices, we can target the graph in blue for CPU and in red for GPU, and then generate a specialized subgraph for each.
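A small sketch of why fusion helps, in plain Python rather than TVM: the helper names `run_unfused`, `fuse`, and `run_fused` are hypothetical, and a scale step stands in for the convolution so the example stays short. Running each operator as its own pass materializes an intermediate buffer per stage, while the fused kernel makes one pass and produces the same values.

```python
def run_unfused(xs, stages):
    """Apply each elementwise stage as its own pass, materializing every intermediate."""
    buffers_allocated = 0
    for stage in stages:
        xs = [stage(x) for x in xs]
        buffers_allocated += 1
    return xs, buffers_allocated

def fuse(stages):
    """Combine elementwise stages into a single kernel applied once per element."""
    def fused(x):
        for stage in stages:
            x = stage(x)
        return x
    return fused

def run_fused(xs, stages):
    """Run the fused kernel in one pass; only the output buffer is allocated."""
    kernel = fuse(stages)
    return [kernel(x) for x in xs], 1

# Stand-ins for the subgraph: scale (in place of conv), add bias, relu.
stages = [lambda x: x * 2.0, lambda x: x + 0.5, lambda x: max(x, 0.0)]
data = [-1.0, -0.25, 0.0, 3.0]

unfused_out, unfused_buffers = run_unfused(data, stages)
fused_out, fused_buffers = run_fused(data, stages)
```

Both paths return the same list; the difference is three intermediate buffers versus one, which is the memory traffic a fused kernel avoids.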
Relay can also do layout transformation. So some frameworks have fixed input formats like an NCHW tensor. The N stands for the number of images in a batch, C is the number of channels in the image, like three for RGB or one for grayscale, H stands for the height of the image, and W is the width of the image. And sometimes users have to change the data to meet the code to get the performance, or suffer the cost of transposing dynamically. In TVM, we change the code to match your data. If you want to pass NHWC data, we generate specialized subgraphs that compose well with changing layouts, shapes, and devices. Okay. So that’s a little bit about Relay. Next, I want to talk to you about the auto-scheduler in TVM, which was the project codenamed Ansor. This is a collaboration between OctoML, UC Berkeley, Amazon Web Services, and Alibaba. And the goal here is to automatically turn tensor operations like matrix multiply or convolution 2D into efficient code implementations for the target hardware.
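For readers unfamiliar with the two layouts, the sketch below shows what converting between them means at the level of a flat memory buffer. The function `nchw_to_nhwc` is a hypothetical helper written for this illustration; it shows the data movement a user would otherwise pay for, not how TVM rewrites code to avoid it.

```python
def nchw_to_nhwc(flat, n, c, h, w):
    """Reorder a flat NCHW buffer into a flat NHWC buffer."""
    out = [None] * len(flat)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi  # offset in NCHW
                    dst = ((ni * h + hi) * w + wi) * c + ci  # offset in NHWC
                    out[dst] = flat[src]
    return out

# One 2x2 image with 3 channels: channel 0 holds 0..3, channel 1 holds 4..7, channel 2 holds 8..11.
n, c, h, w = 1, 3, 2, 2
nchw = list(range(n * c * h * w))
nhwc = nchw_to_nhwc(nchw, n, c, h, w)
```

In NCHW each channel is stored as a contiguous plane, while in NHWC the channel values of one pixel sit next to each other, so the first three entries of `nhwc` are the three channel values of the top-left pixel.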
You might have heard the term auto-TVM. That was the first generation attempt at this goal. That was a template-based search algorithm to find efficient implementations for tensor operations. But this required domain experts to write a manual template for every operator on every platform. And in TVM, just these templates take up like 15,000 lines of code. These auto-TVM templates sometimes had inefficient and limited search spaces, making them unable to achieve optimal performance. The second gen auto-scheduler replaces auto-TVM. And auto-scheduler aims to be a fully automated scheduler for generating high-performance target code for tensor computations without the need for manual templates. Auto-scheduler can achieve better performance with faster search time in a more automated way, because of innovations in search space construction and a better search algorithm. So let’s compare the workflow for generating code for an operator in auto-TVM and auto-scheduler.
Auto-TVM is in the first column and auto-scheduler is in the second column here. So in auto-TVM, there are three steps. The first step is you write the compute definition in TVM’s tensor expression language, which is pretty easy because it’s just like math expressions. The second step is to write a schedule template, usually 20 to 100 lines of tricky DSL code that requires domain expertise of both the target hardware architecture and the operator semantics. This was the tricky part. And finally, the last step is an automated run by a search algorithm. That’s the tune step. So auto-scheduler eliminates that difficult second step via automatic search space construction, which enables exploration of many more optimization combinations, and accelerates step three with a better search algorithm. This shows the search process when optimizing a whole neural network. The deep learning models are the input at the very top, then it partitions the big model into smaller subgraphs with Relay’s operator fusion pass.
The task scheduler is used to allocate the time resource for optimizing many subgraphs. Each iteration picks a subgraph that has the most potential to increase end-to-end performance. Each subgraph tensor expression is analyzed and several sketches are generated for it. You can see here the sketch generation, random mutation, and the learned cost model in blue. Then evolutionary search is run with that learned cost model to get a batch of optimized programs. The optimized programs are sent to actual hardware for measurements. After measurements are finished, the profiling results are used as feedback to update all components of the system. This process is repeated then iteratively until the optimization converges, or it runs out of time budget. So here’s some performance results for AutoTVM versus auto-scheduler. You can see in blue, we have AutoTVM. In green, there’s auto-scheduler. The three charts on or the three bar comparisons on the left are CPU, the three on the right are GPU.
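The loop just described (generate candidates, rank them with a learned cost model, measure the most promising ones on hardware, feed the measurements back) can be sketched as a toy in Python. Everything here is a stand-in and an assumption: `measure` is a made-up latency formula in place of real hardware, `predict` is a nearest-neighbour lookup in place of a learned model, and the two tile sizes are the only schedule parameters. It is not the Ansor algorithm, only the shape of its feedback loop.

```python
import random

TILES = [1, 2, 4, 8, 16, 32, 64]

def measure(config):
    """Stand-in for running a program on real hardware (hypothetical latency)."""
    ti, tj = config
    return 1.0 + abs(ti - 16) / 16 + abs(tj - 8) / 8

def predict(config, measured):
    """Toy cost model: reuse the latency of the nearest already-measured config."""
    if not measured:
        return 0.0
    nearest = min(measured, key=lambda m: abs(m[0] - config[0]) + abs(m[1] - config[1]))
    return measured[nearest]

def mutate(config, rng):
    """Randomly change one tile size to produce a new candidate."""
    ti, tj = config
    if rng.random() < 0.5:
        ti = rng.choice(TILES)
    else:
        tj = rng.choice(TILES)
    return (ti, tj)

def tune(start, rounds=10, population=8, measure_top=3, seed=0):
    """Iterate: mutate the best config, rank by predicted cost, measure the top few."""
    rng = random.Random(seed)
    measured = {start: measure(start)}
    for _ in range(rounds):
        best = min(measured, key=measured.get)
        candidates = {mutate(best, rng) for _ in range(population)} - set(measured)
        ranked = sorted(candidates, key=lambda c: predict(c, measured))
        for config in ranked[:measure_top]:
            measured[config] = measure(config)  # feedback for the next round
    best = min(measured, key=measured.get)
    return best, measured

best, measured = tune(start=(1, 64))
```

Only a few candidates per round are actually measured, which mirrors why a good cost model matters: measurements are the expensive step, so the model decides where to spend them.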
The CPU that was used was an 18-core Intel Skylake 8124M, and the GPU was an NVIDIA T4 GPU. And this is a benchmark of 32-bit floating point, single-batch inference latency on three networks: ResNet-50, MobileNet-v2, and BERT base. You can see in the top chart that auto-scheduler outperforms in all cases, by between 1x and 9x, because it explores a larger search space and finds more efficient combinations of optimizations that are missed in our manual templates. So in the top graph, higher is better. In the bottom graph, lower is better. That’s the search time comparison. So less time is faster. And for the search time, it typically takes several hours to let the search converge for a single neural network. And auto-scheduler just requires much less time to converge in most cases, even though it searches a larger search space, because auto-scheduler has a better cost model and a better task scheduler.
So here are some real world results from this. In December 2020, we compiled Hugging Face’s BERT-base model, a common NLP model, against Apple’s new M1 chipset. And we got what we believe is currently the fastest BERT performance that we know of on Apple’s Arm-based M1. So we got 22% faster on CPU and 49% faster on GPU, which you can see in orange. Those are the TVM results. In light gray, those are the Apple Core ML 4 results. And we also, as a point of comparison, ran BERT base through Keras and TensorFlow GraphDef, both with ML Compute. Those results were essentially unusable, between 500 and more than 1,000 milliseconds for the model latencies. So how did this performance increase work for TVM? Well, machine learning and compiler engineers can’t cover all possible optimizations for all possible models when writing kernel libraries and heuristic compilers by hand. And this is doubly true for new hardware like the Apple M1 chip.
And also, TVM can fuse qualified subgraphs to reduce memory pressure on the M1, and directly generate code for the specified layouts. Apple’s Core ML only optimizes a fixed set of fusion patterns and just certain subgraphs. So TVM just has more flexibility. Another point I want to make is that the current generation of TVM uses the best of both worlds. So we care about performance, coverage, and portability, not code generation ideology. So in TVM, if we find that a vendor library like cuDNN, or TensorRT, or MKL performs better than TVM’s automatic search space-based approach, then we will flip a switch and use the vendor library where it’s appropriate for performance. And one final thing, there’s a new third generation of auto-scheduling coming soon called auto-TIR, which I would advise you to check out if you’re interested. All right. Before we move to the product demo, I also want to mention growing interest in tiny machine learning.
This is a space where you might want to run your machine learning model, like your deep learning model, on bare-metal embedded devices, which are very memory-, compute-, and power-constrained devices like refrigerators, right? We have a micro-TVM sub-project for this. These embedded devices are where there’s no operating system, there’s no virtual memory, there are no advanced programming languages allowed. So they’re very constrained. And bringing the power of deep learning to these types of devices is the goal behind micro-TVM. So check that out as well if you’re interested. So before I do the OctoML Octomizer demo, why would you want to use a hosted version of TVM? Well, you get fast access to all of OctoML’s cost models. So we aggregate the cross product of deep learning models and hardware targets. And we’re continuously improving that iterative cycle. So by using OctoML, you would converge faster and perhaps have a better search space experience to find the most optimal model than you would if you used TVM out of the box without a previous history of runs.
There’s also no need to install any software, and there’s no need to set up any benchmarking hardware like a harness. If you want to use TVM, it’s not simply a matter of installing the TVM software and running your model through it, you also have to build a hardware harness of the exact chipset you want to optimize your model for, and that can take some work to… If you want to optimize on a Snapdragon Android phone and also a iOS like Apple iPhone, you’d have to connect both of those hardware devices directly to your computer to give TVM access to running those micro experiments or trials to learn the personality of your hardware. So using something like hosted TVM in OctoML lets you bypass those steps as well. You get access to our world class support, access to our comprehensive benchmarking, which we will recommend to you what hardware device to use for inference to find the ideal cost and performance balance. So those are just some of the reasons.
All right. So let me switch to a live demo of the Octomizer. This is about five or six minutes where we’re going to essentially take a ONNX Model Zoo model, run it through OctoML for performance optimization, and we’ll compare the TVM results to ONNX Runtime. One of the benefits of using OctoML is we are relatively agnostic in that fashion where you can optimize within OctoML and whichever engine gives you the best results, that’s the engine you can use, even if it’s something different from TVM like ONNX Runtime.

Speaker 2: OctoML makes it easy to speed up your machine learning model’s predictions, saving compute time, costs, and energy. After you log in, the Octomizer homepage shows all of the models you’ve uploaded so far. Here, I have a couple of pre-trained deep learning models: OpenAI’s GPT-2 language model for text generation to predict the next word in a given sequence of words, and SSD-MobileNet-1, a single shot object detection model used to identify what objects are present in an image with bounding boxes. I’m going to show you how to upload a new computer vision model to the Octomizer and speed it up. You can upload your pre-trained custom TensorFlow, PyTorch, or ONNX models to the Octomizer. But for this demo, we’ll grab a pre-trained model from the ONNX Model Zoo on GitHub. Under image classification, find ShuffleNet-2, an extremely efficient vision model designed specifically for mobile devices with limited compute power.
Scroll down to download it. Note, the input to this model is an image with three color channels: red, green, and blue. And the output will be the prediction of one of the 1,000 classes in ImageNet, the dataset that this model was trained on. Return to the homepage and click “Add model.” Choose the ShuffleNet-2 model you downloaded, name it, and provide a friendly description with some labels. Then click “Upload.” After uploading, you’ll see the new model in the recent activity. Click into it to review the model’s details. For this model, the input type and shape are automatically detected. In this case, an image with three color channels with height and width of 224 pixels. The workflows and benchmarks tables will be empty until you Octomize the model. Click “Octomize” and choose the hardware targets you’d like to both Octomize and benchmark this vision model on. Let’s choose a couple of NVIDIA GPUs along with the three more affordable Intel CPUs.
By benchmarking on multiple hardware types, the Octomizer helps you choose the right balance of cost and performance to meet your objectives. We’re going to be adding many more CPUs, GPUs, and accelerators throughout this year. And if you’d like to request a specific hardware target for us to prioritize, let us know. After you click “Start,” several workflows will kick off to optimize this model for all of the selected hardware targets. Now, under the hood, the Octomizer uses Apache TVM, an open source machine learning compiler framework created by the founders of OctoML. This optimization process to speed up the inference time of your model generally takes a few hours to run. Internally, Apache TVM employs the latest graph-level optimizations such as operator fusion, constant folding, and static memory planning, along with operator-level optimizations for diverse hardware targets.
By mixing the best performing combination of its own machine learning-guided search for code generation with external vendor-supplied operator libraries, Apache TVM outputs a new lightweight and efficient model. This means that you’ll get the best possible performance across any device, even in the case where the vendor hasn’t caught up to the latest operators in the fast-paced deep learning industry. Also, you no longer have to worry about learning or using different vendor APIs and packages for every hardware you want to deploy to. Apache TVM is actively being used and developed by companies like Microsoft, NVIDIA, Samsung, Amazon, and more. After optimization completes, let’s check out the results. For each hardware target, you’ll see two jobs completed. The first job benchmarks your model against a baseline, ONNX Runtime. Here, we see that for Intel Broadwell CPUs, the average inference runtime was 17 milliseconds. The TVM optimized model runs more than twice as fast, only seven milliseconds.
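As a quick check on the "more than twice as fast" claim, the arithmetic can be sketched as follows. The two latencies are the Broadwell figures quoted in the demo; the variable names are chosen here for illustration.

```python
# Average inference latency on Intel Broadwell, as quoted in the demo.
baseline_ms = 17.0  # ONNX Runtime baseline
tvm_ms = 7.0        # TVM-optimized model

speedup = baseline_ms / tvm_ms                 # how many times faster
latency_reduction = 1 - tvm_ms / baseline_ms   # fraction of time saved

print(f"speedup: {speedup:.2f}x, latency reduction: {latency_reduction:.0%}")
# speedup: 2.43x, latency reduction: 59%
```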
Note that for NVIDIA GPU targets, the Octomizer uses the TensorRT-backed version of ONNX Runtime for the baseline. Scrolling down, we see a bar chart of benchmarks for the three CPUs and two GPUs. OctoML’s TVM inference results are in purple, while the baseline ONNX Runtime’s results are in blue. Lower bars mean faster inference. For this vision model, TVM beats ONNX Runtime by a large margin on all three CPU targets: the 2014 Broadwell, the 2019 Cascade Lake, and the 2015 Skylake. On the two NVIDIA GPUs, TVM is faster on the 2018 Tesla T4, but not on the 2014 Tesla K80, where TVM is about a millisecond slower. The Octomizer gives you objective benchmarks so you can easily determine the best engine for your model’s hardware needs. So you want to deploy on the Tesla T4 to get the absolute fastest inference in under a millisecond for each image. Next, you’ll want to package the optimized model for deployment to production.
You’re now ready to deploy. Scroll up to the workflows table and under the NVIDIA Tesla T4 GPU, you’ll see buttons to download either the Python wheel or C++ Linux shared object. For quick validation and testing or applications written in Python, the wheel is ideal. But for maximum performance with the minimum dependencies, the Linux shared object is the way to go. Click the one you’d like to deploy and when ready, you can download the optimized model. This optimized model reduces user perceived latency for online models, and increases throughput for offline models. Finally, to learn how to deploy the package into production, switch over to the Octomizer’s documentation. Here, you can learn the core concepts about the Octomizer, explore the software development kit, the command line interface, or the Python API. But we’re here to understand how to deploy our optimized model. Follow these steps to install and run the optimized model. If you need any help along the way, just let us know.

Sameer Farooqui: So I want to conclude by sharing some recent performance results. We ran a cross product of 60 models against hardware benchmarks, and we compared how OctoML performs against the best baseline target we could find on the internet for that model. And we found some interesting results here. So in this graph, this is the 60 models I mentioned. The blue lines are publicly available models and the red lines are private models. In general, you’ll see a few trends here. If you look at the far right, there is one publicly available model that TVM performs really well on, then there’s a bunch of red models, which are the private models that TVM performs really well on. So TVM does well on a lot of private models. And then if you start going towards the left, you can see more blue lines, which are public models. One takeaway here is that public models tend to be pretty optimized when they get released by framework writers or hardware vendors.
However, with private models, where you might have changed some of the layers, or done transfer learning, or changed hyperparameters or the activation function, those changes might not give you the best performance compared to using a publicly available model. So a takeaway here is that if you have private models, that’s probably another reason to consider using an ML compiler framework like TVM. So across the board, TVM got about a 2.1x improvement. That’s the dashed line. By the way, the zero on the y-axis is the baseline, the best performance that we could find on the internet for that model. So across the board, TVM got about 2.1x better performance, but a 2.5x average performance improvement on the private models. TVM did better on the private models. These are what some of those models are. The far right model, the blue one, was a vision model, Yolo-V3. And then you can see really good performance improvements on some of these models that we mentioned: 5.3x on a video analysis model, 4x on a random forest model in XGBoost, and 2.5x on a MobileNet model.
I should also mention that we recently did a collaboration with Microsoft, the Hummingbird project, to bring the ability to run tree-based models in XGBoost in TVM as well. So tree-based models, whether in scikit-learn or XGBoost, are one of the non-deep learning models that TVM supports. It essentially drops XGBoost or scikit-learn-based tree models into Relay and into a tensor language, which can then actually be optimized by TVM. All right. So we are hiring, like most startups. So if you are interested in working on the next generation of machine learning compilation and this is the space that you find interesting, the machine learning systems space, then I would encourage you to apply. We are remote-friendly, even though we’re based out of Seattle, and we’re growing very, very fast. And we’re hiring across all parts of the company, from marketing, where I work, to customer success, to consulting, to support, to, obviously, engineering. So please reach out to us if you’re interested in working on the next generation of machine learning compilation technology. And with that, I will conclude and open it up for questions and answers. Thank you.

Sameer Farooqui

I'm a Product Marketing Manager at OctoML, the company that built Apache TVM (a ML compiler that optimizes models to run inference much faster). Prior to this, I was a Strategic Cloud Engineer for Big...