MLflow Model Serving

May 27, 2021 03:15 PM (PT)


Discuss the different ways a model can be served with MLflow. We will cover both the open source MLflow and the Databricks managed MLflow ways to serve models, as well as the basic differences between batch scoring and real-time scoring, with special emphasis on the new upcoming Databricks production-ready model serving.

In this session watch:
Andre Mesarovic, Sr. Resident Solutions Architect, Databricks



Andre Mesarovic: Hello, good afternoon, everybody. My name is Andre Mesarovic. I am a Senior Resident Solutions Architect at Databricks, and today I’m going to talk about MLflow model serving. I’m assuming everybody here in the session is familiar with MLflow. MLflow is an end-to-end machine learning framework that has been around since June of 2018. It is an open source package that follows a model similar to Spark’s, in the sense that there’s an open source version and then there’s a managed offering by Databricks. It was initiated by Databricks folks, but like I said, it’s open source, and there are a lot of non-Databricks contributors out there. We’re not going to be talking about training here. MLflow handles experiment tracking, model registry, and ultimately serving, and we’re going to be focusing on the serving component in this discussion.
There are essentially two types of serving, as many of you are aware. One is offline scoring, which is also called batch scoring. Spark is ideally suited for this. The other type is online scoring or real-time scoring, where you’re basically scoring one record at a time with low latency requirements. In offline scoring, you’re typically consuming vast amounts of data. An example would be customer recommendations for, let’s say, a drugstore. Besides the standard MLflow model server for real-time scoring, we will also touch upon custom deployments. And then finally, we will look into the Databricks real-time model server, which has recently been announced, and there are some exciting new developments in that area.
So, as I mentioned before, there are two types of scoring, or prediction, as we can call it. Offline, on the left, is high-throughput bulk prediction with Spark on Databricks: you have a table, you create a data frame, and you consume it either through direct code or a UDF. Within offline scoring, you can break it down into two categories. One is batch prediction, which is just a classical Spark job. You can also have structured streaming prediction, which uses Spark Structured Streaming to consume a feed. Then on the other side is online prediction. As I mentioned, this is low latency, typically individual predictions, although you can score more than one record at a time if they come in together. This is essentially used for web servers, IoT devices, etc. This is outside of Spark; there’s no Spark involved here.
MLflow offers a real-time scoring server that comes in various forms. It’s essentially a Python Flask server wrapped by Gunicorn. You can build it as a container, specifically either an Azure ML container or a SageMaker container. You build that Docker container and push it into the specific cloud provider. And then finally, last but not least, there is Databricks model serving, which is essentially the ability to expose a public endpoint within Databricks, stand up the web server, and avoid the time-consuming and sometimes onerous steps involved in pushing a container out to a given cloud provider.
Here we’re going to address some of the vocabulary, the synonyms. Model serving can be thought of as equivalent to model scoring, model prediction, or model inference, and model deployment is the process of pushing the model out into the target platform. As I mentioned, if we break prediction down into categories, we have offline or batch prediction, and within that we have two subcategories, Spark batch and Spark streaming. And then there’s online, which is also called real-time scoring. Here’s a diagram of the categories of scoring. On the left we have the centerpiece of MLflow, the MLflow Model Registry. Typically, you go through many iterations of training with different hyperparameters. When you’re satisfied, you take the best model, the one with the best metrics, and you push it into the model registry, where you can have one or more versions of a given model.
Each version is identified by a monotonically increasing integer. In addition, you can tag those versions with what we call stages, which is either the production or the staging stage. So your favorite model would be tagged production, and then you can access it with a model URI to retrieve it and deploy it to your favorite target environment. On the upper right, as we talked about, offline serving has two modes, either Spark batch or Spark Structured Streaming. Online serving is a much broader category; it’s sort of like the Wild West, in the sense that there are a number of possible deployment targets. You can start off with the MLflow scoring containers. That’s a Docker container, made for either SageMaker or Azure ML, or just a plain Docker container. Then MLflow has a deployment plugin API, essentially an extension point, so you can implement your own targets. And currently there are four deployment targets.
There’s TorchServe, RedisAI, the Ray framework and Algorithmia. And then of course you can just roll your own custom web server, writing it from scratch if you have particular needs. Below that is custom TensorFlow Serving, which I’m going to demonstrate later. TensorFlow Serving is TensorFlow’s serving platform, and I will demonstrate how to grab a registered model and deploy it into TensorFlow Serving. And then there’s the brave new world of edge devices. There’s nothing to prevent you from taking a registered model and pushing it out to whatever edge device you have, in IoT, mobile, etc. And the last category is just hardware devices, if you really want to go that route, with Intel, Nvidia and so forth.
So, as we talked about, offline scoring is Spark’s sweet spot, where we can take advantage of the distributed nature of Spark if we have millions or billions of inputs to score. It can be either batch or structured streaming. One of the really nice things about MLflow is the ability to create a UDF, a user-defined function. This can be used from either Python or SQL, and it lets you parallelize the scoring for frameworks that are not parallel. Scikit-learn, for example, is not a parallel framework, it’s a one-node framework, but using a UDF you can actually parallelize the scoring. That, in my opinion, is a really cool feature. MLflow offers a one-line method for retrieving a model and returning a UDF with that model, and this is obviously done from the MLflow registry. So, here we illustrate the different ways of scoring. At the top, we create a URI.
So, MLflow has this concept of a model URI. The first part is the scheme, in this example models, which refers to a registered model. If you wanted to, you could directly access a run with the runs scheme prefix, but we’re going to be focusing on the registered model. MLflow also has this concept of flavors. A flavor is essentially the machine learning framework, so Scikit-learn, TensorFlow, PyTorch; that’s the native flavor. In this case, when you call the native flavor’s load model, you’re going to get exactly what you would get from Scikit-learn. That’s a Scikit-learn object, and you can call its native methods. In addition, every flavor has something called a Pyfunc flavor, which is simply a one-method wrapper API.
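A minimal sketch of the two URI schemes described above. Only the `models:/` and `runs:/` schemes come from the talk; the helper names, the model name `sklearn_iris`, the run ID and the artifact path `model` are illustrative assumptions, not MLflow APIs.

```python
def registered_model_uri(name, version_or_stage):
    """Build a registry URI such as models:/sklearn_iris/production."""
    return f"models:/{name}/{version_or_stage}"


def run_model_uri(run_id, artifact_path="model"):
    """Build a URI pointing at a model logged under a specific run.

    The default artifact path "model" is an assumption for illustration.
    """
    return f"runs:/{run_id}/{artifact_path}"


def uri_scheme(model_uri):
    """Return the scheme, that is, the part before ':/'."""
    scheme, separator, _ = model_uri.partition(":/")
    if not separator:
        raise ValueError(f"not a model URI: {model_uri!r}")
    return scheme


# Hypothetical registered model name and stage.
print(registered_model_uri("sklearn_iris", "production"))
```

The same string can then be handed to whichever loader or CLI command needs a model URI.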
So this allows us to generically call models with the predict function. In this example, coincidentally, both Scikit-learn and Pyfunc call the method predict. But in SparkML, for example, the equivalent method is called transform. With the Pyfunc flavor, we don’t have to worry about that, and that allows for powerful tooling, building containers, and just scoring in a generic way. Below that, we look at how to retrieve a UDF. We’re taking that single-node Scikit-learn object and creating a Spark UDF, a user-defined function. Then we can call that on a data frame, and that’s going to distribute the scoring across all the workers.
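The three loading styles described above can be sketched roughly as follows. This is not the code from the slides; it assumes `mlflow`, `scikit-learn` and `pyspark` are installed where it runs, and the function names and the `prediction` column name are made up for illustration. The imports are kept inside the functions so the definitions load even where those packages are absent.

```python
def load_native_sklearn_model(model_uri):
    """Native flavor: returns the original Scikit-learn object."""
    import mlflow.sklearn
    return mlflow.sklearn.load_model(model_uri)


def load_pyfunc_model(model_uri):
    """Pyfunc flavor: a generic wrapper exposing a single predict method."""
    import mlflow.pyfunc
    return mlflow.pyfunc.load_model(model_uri)


def score_with_udf(spark, df, model_uri, feature_columns):
    """Wrap the model in a Spark UDF so scoring is spread across the workers."""
    import mlflow.pyfunc
    predict = mlflow.pyfunc.spark_udf(spark, model_uri)
    # Column names are passed as strings; the UDF receives the column values.
    return df.withColumn("prediction", predict(*feature_columns))
```

With either loaded model, scoring is a call to `predict` on a pandas data frame; the native object additionally keeps every framework-specific method.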
And finally, in the last example, we show how you can even do it from SQL. We simply register the UDF under a given name, and then, using SQL mode in Databricks, we can score using SQL syntax. So for online scoring, like I mentioned, there’s a large variety of options. The key point here is that we need to do immediate, very low-latency scoring, typically one record at a time with immediate feedback. A very common option is a REST API web server. Like I said, MLflow comes with its own container, or you can build your own. There’s nothing to prevent you from embedding your scoring in a browser, like with TensorFlow.js. And then there are edge devices, such as mobile phones, sensors and routers, and then hardware accelerators. So it’s really up to the specific use case how you want to deploy this.
This is from the famous Martin Fowler’s team; basically, they have a nice breakdown of the different real-time serving options. One way many people do it is literally to embed the model in the application code. This, of course, introduces coupling, and you can run into tech stack issues or library compatibility issues. Another very common way is to do it as a web service, as we talked about before. And the third one is to publish the model as data in a registry and then retrieve it as needed. So the MLflow model server is the actual cornerstone of MLflow real-time deployment. This is essentially a Python Flask server. Or, in one very particular case, there’s a Java Jetty server for something called MLeap. We’ll get into this a little bit later.
So the web server exposes a very simple API. There’s simply one endpoint, called invocations. It accepts three types of inputs. One is CSV. Then there are two JSON formats, called pandas-split and pandas-records, which are part of the pandas spec. And then just recently announced is the support for tensor inputs, which are actually represented as JSON. The output is simply a list of predictions.
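To make the input formats above concrete, here is a small sketch that serializes the same two records as CSV, pandas-split and pandas-records. The column names and values are hypothetical iris-style data, and the content-type strings are the ones used by MLflow 1.x-era scoring servers as far as I know; later MLflow releases changed the request format, so check the documentation for the version in use.

```python
import json

# Hypothetical feature names and two example rows.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
ROWS = [[5.1, 3.5, 1.4, 0.2], [6.2, 3.4, 5.4, 2.3]]

# Content types as understood for MLflow 1.x (an assumption to verify).
CONTENT_TYPES = {
    "csv": "text/csv",
    "pandas-split": "application/json; format=pandas-split",
    "pandas-records": "application/json; format=pandas-records",
}


def to_csv(columns, rows):
    """Header line followed by one line per record."""
    lines = [",".join(columns)]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)


def to_pandas_split(columns, rows):
    """Column names stated once, then the rows of data."""
    return json.dumps({"columns": columns, "data": rows})


def to_pandas_records(columns, rows):
    """Every record repeats the column names."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


print(to_csv(COLUMNS, ROWS))
print(to_pandas_split(COLUMNS, ROWS))
print(to_pandas_records(COLUMNS, ROWS))
```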
So MLflow has basically two modes you can work with. One is the command line interface, the CLI, so from your shell you can invoke many methods. In this case, what we’re doing is building a server. Or you could do it through the API, so within your Python code you could be building a container that has the server. In the last line, there are links to the MLflow documentation that describes all of this. So the different modes of the MLflow model server are as follows. There’s the local web server, which essentially runs on your laptop. It’s a Python Flask server that just comes up, and you can then use cURL or Python requests to submit predictions. The other one is the SageMaker Docker container, which is obviously geared toward SageMaker. One of the nice things about the SageMaker Docker container is that you can run it in local mode. So before pushing it up to SageMaker, you can actually test it and see if it’s working fine. This is great. I actually use this one a lot on my laptop.
There’s an analogous Azure ML Docker container. There’s also a generic Docker container. And then, as I mentioned before, there are custom deployment plugin targets. This is a really nice feature that was, I think, released last fall, and it’s been eagerly adopted by third-party vendors, in this case TorchServe, RedisAI, Ray and Algorithmia. In fact, for MLflow 1.12 in November, there was close collaboration with the core PyTorch and TorchServe team, and we worked together to develop that plugin. So, diving into the weeds and getting a little bit more technical, there are three incarnations of the MLflow model server. The standard one, on the left there, is simply a Python Flask server. What we do is grab the model from the model registry and take the associated conda.yaml file, which is part of the MLmodel concept.
So every model has the actual model bits, serialized in the native framework’s format. For example, for Scikit-learn it would be a pickle file, and for TensorFlow it would be a saved_model.pb file. In addition, there’s a metadata concept called MLmodel, which amongst its components contains a conda.yaml file that lists all the packages, so the environment can be built for you. This is only for non-SparkML models. In the middle box here, if you want to deploy a SparkML model for real-time scoring, things get a little bit more complicated, because now we need to have two processes. One is the Flask web server, and then we need a Spark runtime. You submit a request to the Python web server, and it simply delegates that to the Spark server.
This is obviously only for SparkML models. And as you can imagine, it’s not as performant, because Spark is essentially not made for real-time scoring; it’s more geared toward ETL and batch mode. And then on the right-hand side, lastly, we have something called MLeap. MLeap is an open source package that translates a SparkML model into its own format, which does not require a Spark runtime. It only works for Scala or Java, that is, the JVM. It’s basically faster. The only catch is that commercial support for MLeap is no longer available, since the company that supported MLeap has gone on to other things, and it’s not quite as mature as one would hope. So it’s a buyer-beware situation.
Getting even more tactical, here’s an example. At the bottom, we have a client submitting a request and specifying the content type. It’s an HTTP POST request, and then we have three choices: on the left is the standard Python server, in the middle is the SparkML server, and on the right is the MLeap runtime. Those are three different ways we could deploy this, and each essentially returns a prediction. Here’s an example of the simplest way to launch a server. This would launch on your laptop, so we’re using the command line interface here to create a server on the given port, in this case 5001. The model-uri option, as I mentioned before, specifies the address of the model in the registry. In this case, it’s a Scikit-learn iris model, and we’re grabbing the production version of it.
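The launch command described above can be assembled from Python as in this sketch. The `mlflow models serve` command with `--model-uri` and `--port` is the real CLI; the registered model name `sklearn_iris` is a guess at the name on the slide, and the helper function is illustrative only.

```python
import shlex


def build_serve_command(model_uri, port=5001):
    """Assemble the CLI call that starts the local MLflow scoring server."""
    return ["mlflow", "models", "serve", "--model-uri", model_uri, "--port", str(port)]


# Hypothetical registered model name; "production" is the stage.
command = build_serve_command("models:/sklearn_iris/production")
print(shlex.join(command))

# To actually start the server (blocks until stopped, needs mlflow installed):
# import subprocess
# subprocess.run(command, check=True)
```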
Here’s an example of how to score with the CSV format. We’re using the standard cURL command. As I mentioned before, there’s simply one endpoint on the server, called invocations. We tell it the content type, in this case CSV, and then we submit the data that we want scored; in this case, we’re submitting two records. Then we get the response, the predictions for those two records. This is a similar example using the JSON split-oriented format. This is a standard pandas format, where you specify the column names first, and then you can have one or more rows of data, and we get the same response.
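A Python equivalent of the cURL call described above might look like this sketch, using only the standard library. The host, port and the two-column CSV body are placeholders; `score` is only defined here, not called, since it needs a running server.

```python
import json
import urllib.request


def build_invocation_request(payload, content_type, base_url="http://localhost:5001"):
    """Create the HTTP POST for the server's single /invocations endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/invocations",
        data=payload.encode("utf-8"),
        headers={"Content-Type": content_type},
        method="POST",
    )


def score(request):
    """Send the request and decode the JSON list of predictions."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Placeholder CSV body with a header line and two records.
csv_request = build_invocation_request("a,b\n1,2\n3,4", "text/csv")
print(csv_request.get_method(), csv_request.full_url)
```

Switching to one of the JSON formats only changes the payload string and the content type passed in.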
This is another option, the record-oriented format. This is a slightly more verbose JSON format, in the sense that each row specifies the column names in addition to the data. So the split-oriented format is probably preferred if you have multiple records. And finally, as I mentioned, there is a new format called tensor input, which was just released a few months ago. The request will be cast to a standard NumPy array. The input is actually in JSON, and there are two subtypes of the format, per the TensorFlow documentation, which you can see on the TensorFlow Serving web page. Here’s an example of what it would look like if you were submitting a tensor input [inaudible]. It’s quite similar to the JSON input. So here, we’re going to look at the two standard cloud provider containers that MLflow has.
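The two tensor sub-formats mentioned above follow the TensorFlow Serving REST conventions: a row format keyed by `instances` and a columnar format keyed by `inputs`. A small sketch, with made-up numbers and helper names:

```python
import json


def tensor_row_format(instances):
    """Row format: a list with one entry per example."""
    return json.dumps({"instances": instances})


def tensor_columnar_format(inputs):
    """Columnar format: the whole batch as a single tensor (or dict of tensors)."""
    return json.dumps({"inputs": inputs})


# Two hypothetical examples with four features each.
batch = [[5.1, 3.5, 1.4, 0.2], [6.2, 3.4, 5.4, 2.3]]
print(tensor_row_format(batch))
print(tensor_columnar_format(batch))
```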
As I said, there are essentially two Python containers that you can build. One is the standard Python container, which has the model embedded in it, and then there’s a SparkML one. As I’ve mentioned before, the SageMaker container is the most versatile in the sense that it can run on your laptop, which is a very useful feature. Flask, as some of you might be aware, is a very popular Python web server framework. We wrap it with Gunicorn to allow for scalability with concurrent requests. The container holds the serialized model itself, and it’s wrapped by Pyfunc, so the tooling can work in a generic way. The SparkML container has two processes within it. One is the Flask web server, which delegates the request to the Spark server. It’s generic, open source Spark; there’s nothing Databricks-specific about that container.
So here are just some links to documentation. These are examples of what you can build in the CLI. You first build a container, and then you can deploy it to SageMaker, in which case you’d first have to go into SageMaker, create your endpoints, and do all the SageMaker-specific stuff. Or you can just run it with run-local, which runs it on your laptop.
Here’s an example of using the CLI to build. In the first example, we’re building the container and then pushing it out to SageMaker. We specify the model URI, which points to the registry, and then your image URL, which is something you need to have previously created in SageMaker, so it’s all SageMaker-specific. In the second example, we’re building that container and then running it on the local machine, on port 5051. As I mentioned, this is, in my opinion, a really nice, cool feature, because the whole process of pushing a container out to a cloud provider, as anybody knows, is pretty tedious. There are a lot of manual steps, things can go wrong, and turnaround time is typically 10 minutes if you’re lucky. So it’s not really amenable to testing, because every time you make a change and want to test it, you have to go through the whole process again.
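The local-mode workflow described above can be sketched as two CLI calls assembled in Python. The subcommands and flags reflect my understanding of the MLflow 1.x `mlflow sagemaker` CLI and should be checked against `mlflow sagemaker --help` for the installed version; the container name and model name are assumptions.

```python
def build_container_command(container="mlflow-pyfunc"):
    """Build the SageMaker-compatible image locally without pushing it."""
    return ["mlflow", "sagemaker", "build-and-push-container",
            "--build", "--no-push", "-c", container]


def run_local_command(model_uri, port=5051, image="mlflow-pyfunc"):
    """Run the SageMaker container on the local machine for testing."""
    return ["mlflow", "sagemaker", "run-local",
            "-m", model_uri, "-p", str(port), "-i", image]


# Hypothetical registered model name.
for command in (build_container_command(),
                run_local_command("models:/sklearn_iris/production")):
    print(" ".join(command))
```

Running locally first keeps the slow cloud push out of the edit-and-test loop, which is the point made in the talk.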
So, Azure ML also has a very similar CLI. You can deploy either to Azure Kubernetes Service, which is the preferred mode, or to Azure Container Instances. There are links here to both the CLI and the API, so you can orchestrate this through either the Python API or the CLI. And the last link will take you to the actual code and show you how to do that. Here’s an example of what an Azure ML image build and deployment looks like. You specify the model URI pointing to the registered model, and of course you have to have an Azure ML workspace. That’s all Azure ML-specific stuff, and it’s pretty standard. We mentioned MLeap. Like I said, the intent of MLeap is to give you the ability to take a SparkML model, that is, the serialized bits in the SparkML format, and translate it into an MLeap format that can then be deployed without a Spark runtime.
So this gives you the ability to have low-latency scoring. It’s definitely faster than using the Spark server. As I mentioned before, the company backing MLeap is no longer providing support, and there are issues with stability, errors, bugs, and documentation. So it’s basically a buyer-beware issue, but it is available. Here’s a link to some code that shows you an end-to-end ML pipeline using MLflow. If you go to that link, you’ll see Python examples for each of the stages in the ML pipeline here. On the left, we go through the classical training cycle. Tweaking all the hyperparameters and giving different values to each one, there are typically going to be hundreds or thousands of different runs. We record each of those runs in the MLflow repository. When we are done, in the second box there, we select the best model as defined by your favorite metric. For example, with RMSE we take the run with the lowest score. Then you take that best run and push its model into the model registry.
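The select-and-register step described above can be sketched as follows. `select_best_run` works on plain dictionaries to show the logic; `register_best_run` uses `mlflow.search_runs` and `mlflow.register_model`, and assumes that the metric was logged as `rmse` and the model under the artifact path `model`. Both of those names, and the example run data, are assumptions.

```python
def select_best_run(runs, metric="rmse"):
    """Pick the run with the lowest value of an error metric."""
    return min(runs, key=lambda run: run["metrics"][metric])


def register_best_run(experiment_id, model_name, metric="rmse"):
    """Find the best run in an experiment and register its model."""
    import mlflow  # imported lazily; requires mlflow and a tracking server

    best = mlflow.search_runs(
        [experiment_id], order_by=[f"metrics.{metric} ASC"], max_results=1
    )
    run_id = best.iloc[0]["run_id"]
    # "model" is the assumed artifact path used when the model was logged.
    return mlflow.register_model(f"runs:/{run_id}/model", model_name)


# Made-up runs standing in for tracked training runs.
example_runs = [
    {"run_id": "a", "metrics": {"rmse": 0.81}},
    {"run_id": "b", "metrics": {"rmse": 0.74}},
    {"run_id": "c", "metrics": {"rmse": 0.79}},
]
print(select_best_run(example_runs)["run_id"])
```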
So now, using registry features, you have governance and auditability, and you can promote the model to different stages. Once we’re done with that stage, we can launch a server container with the production model, using Azure ML, SageMaker, or our own container. And then finally, we have a client submitting HTTP POST requests, in either JSON or CSV format. So those who want to see hands-on source code, feel free to go out to that link.
There’s a really nice feature that just came out last year, which is that MLflow now has a standardized plugin API. So you can create deployment plugins for your favorite target. In general, MLflow has a concept of plugins, and these plugins can touch upon various features of MLflow. One is, for example, the metadata store. MLflow stores its data in two places. One is the metadata: experiments, runs, input parameters to runs, and output metrics. This is typically stored in a SQL database, which can be MySQL, Postgres or SQLite, and there’s a plugin for this too. So somebody has written a plugin for SQL Server, for example. The other type of plugin, the one we’re concerned with here, is the deployment plugin. Here I have links to the four current plugins: TorchServe, RedisAI, Algorithmia and Ray Serve.
Here are examples of how to deploy with these plugins. We’re simply using the create command of the MLflow CLI deployments command. The first example is obviously TorchServe. We give it a deployment name, the -m is of course the standard URI to the model in the model registry, and then we have custom provider-specific configuration files. For RedisAI, it’s a similar thing. Here the name is going to be the Redis key under which this model is stored, and the -m is, of course, the model URI. It’s similar with Ray. Ray is a really cool new framework supported by a new startup called Anyscale that’s geared toward, in this case, hyperparameter tuning. In this case, the configuration specifies the number of replicas in your cluster.
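The shape of the `mlflow deployments create` call described above can be sketched like this. The `-t`, `--name`, `-m` and `-C key=value` options belong to the MLflow deployments CLI; the target name `ray-serve`, the configuration key `num_replicas`, the deployment name and the model name are assumptions that depend on the installed plugin and on your registry.

```python
def build_create_deployment_command(target, name, model_uri, config=None):
    """Assemble an `mlflow deployments create` call for a plugin target."""
    command = ["mlflow", "deployments", "create",
               "-t", target, "--name", name, "-m", model_uri]
    # Provider-specific settings are passed as repeated -C key=value pairs.
    for key, value in (config or {}).items():
        command += ["-C", f"{key}={value}"]
    return command


# Hypothetical deployment against an assumed Ray Serve plugin target.
command = build_create_deployment_command(
    target="ray-serve",
    name="iris_deployment",
    model_uri="models:/sklearn_iris/production",
    config={"num_replicas": 2},
)
print(" ".join(command))
```

The same deployment can be created from Python with `mlflow.deployments.get_deploy_client(target)` followed by `create_deployment(name, model_uri, config=...)`.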
And then finally, last but not least, there’s Algorithmia. If you happen to have an Algorithmia deployment, you can take MLflow registered models and deploy them there. Okay, so here we’re going to do a little bit of a deep dive into PyTorch. Last November, with MLflow 1.12.0, the MLflow team at Databricks had a very close collaboration with the core PyTorch folks and essentially integrated PyTorch Lightning, TorchScript and TorchServe. For those of you who are familiar with the PyTorch ecosystem, PyTorch Lightning is essentially a higher-level interface over raw PyTorch code. It’s sort of analogous to what Keras is to TensorFlow. There is now a serialized format called TorchScript, which essentially translates your PyTorch code into a generic, Python-free, optimized format. So you can actually be scoring this outside of Python.
And finally, there’s TorchServe, which is PyTorch’s analog to TensorFlow Serving. It’s a serving framework for TorchScript models, and MLflow now has a deep integration with it, using the deployment plugin we mentioned before. This is really, really exciting, in my opinion, because PyTorch is an increasingly popular deep learning framework that’s getting a lot of traction, and now we have native integration with MLflow. So there are a lot of exciting things coming down the road regarding this. Here’s a diagram from the PyTorch website. Essentially, you have your data scientist and the standard training routine. You’re building your PyTorch model, you go through a large number of distributed training runs, and you then generate an optimized model. This is the TorchScript, the quote-unquote Python-free serialized bits. You then register that in the model registry.
In this example, we’re doing this with the MLflow autolog concept. When you’re doing training with MLflow, there are two modes. In the classical way, before autologging came around, you had to put explicit calls to the MLflow API in your code. So you log all your input parameters explicitly with log parameter, then you log your metrics, and then you finally log the actual serialized model. Those are all manual steps that you need to execute on your own. And now there’s this new feature, or rather it’s been around for a while but it’s been enhanced, called autologging. Essentially, once you turn the autolog switch on, there are no MLflow calls at all. In fact, you don’t even have to import MLflow in your code; it’ll automatically introspect.
This works for a given number of frameworks, which is constantly expanding. We actually introspect into the training code of the ML framework itself, for example Scikit-learn, using, what’s it called? Monkey patching, with Gorilla. We trap the calls for the parameters, and we automatically take all the input parameters to that given framework and log them for you. So you don’t have to explicitly log anything. This is actually pretty cool, wicked cool. Then, once the model is in the registry, we access the MLflow TorchServe plugin, and we finally push it out to the TorchServe deployment framework, wherever that lives, be it in a cloud provider, etc. Then we can do inference on that using standard TorchServe functionality and APIs. Here are some resources on TorchServe and PyTorch. There was a blog post from Databricks, and there was also a blog post from the PyTorch team. There are some third-party articles out there on Medium. And of course, the source of truth is always on GitHub, in the source code.
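The idea of trapping framework calls by monkey patching, as described above, can be illustrated with a toy example. This is not MLflow’s implementation and does not use Gorilla; `ToyRegressor`, `enable_toy_autolog` and `logged_params` are invented purely to show the mechanism.

```python
import functools


class ToyRegressor:
    """Stand-in for an ML framework estimator with hyperparameters."""

    def __init__(self, max_depth=3, learning_rate=0.1):
        self.max_depth = max_depth
        self.learning_rate = learning_rate

    def get_params(self):
        return {"max_depth": self.max_depth, "learning_rate": self.learning_rate}

    def fit(self, X, y):
        # "Training" here just memorizes the mean of the targets.
        self.mean_ = sum(y) / len(y)
        return self


logged_params = []  # stands in for a tracking server


def enable_toy_autolog(cls):
    """Replace cls.fit with a wrapper that records parameters, then trains."""
    original_fit = cls.fit

    @functools.wraps(original_fit)
    def patched_fit(self, X, y):
        logged_params.append(self.get_params())
        return original_fit(self, X, y)

    cls.fit = patched_fit


enable_toy_autolog(ToyRegressor)
# The training code itself contains no logging calls.
model = ToyRegressor(max_depth=5).fit([[1], [2]], [1.0, 3.0])
print(logged_params)
```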
Here are some resources on RedisAI. Redis, for those of you who are not aware of it, is a NoSQL in-memory database that has been around for a long time, and it’s extremely powerful. They have recently expanded into the AI space with a new feature called RedisAI, which provides very, very low-latency, in-memory access to data, models, etc. And they have implemented their own deployment plugin for MLflow. So here are some resources, articles and a video. I’ve actually worked with Redis in the past, and I’m really impressed and excited about it. This is a great step forward for both MLflow and RedisAI. As I mentioned before, Ray Serve is an open source project that’s supported by a company called Anyscale, which is focusing on distributed hyperparameter tuning for models.
They have also recently, I think at the beginning of this year, developed a plugin. There’s a Databricks blog post with the Ray folks, and then here’s a Ray blog post, and you can look at the source code and the Ray documentation on how to use this. We demonstrated earlier in these slides the actual MLflow CLI commands to deploy this. And then there’s Algorithmia, which has also created its own plugin. Here’s a bunch of different articles and videos on Algorithmia. So here I’m going to take a bit of a detour and show you what you could do if, let’s say, you don’t have an MLflow deployment plugin yet, or you’re not interested in one, and you want to make a completely custom deployment. What we’re going to be doing is essentially building a standard TensorFlow Serving Docker container with a model pulled from the model registry.
So, I did this as sort of a side project, just to push the envelope on MLflow and see how we could integrate with TensorFlow Serving. As some of you might be aware, Keras, which is a high-level interface over TensorFlow, has actually been merged into the TensorFlow main branch. So right now Keras and TensorFlow are essentially married. Before, Keras had a concept of different underlying providers, and now it is essentially a part of the TensorFlow package. Previously, Keras was standardized on something called HDF5. I’m not going to get into the details of that; it’s sort of an old legacy format, basically a file system in a file. It was the default in Keras with TensorFlow 1, and it was originally the default in MLflow, but it no longer is. So we’re not going to dive into that; if you have legacy issues, we can cover that later. The current TensorFlow format is called SavedModel, which is a native TensorFlow serialized model format, also sometimes known as .pb. That is now the default in TensorFlow 2.x, which is the current version of the merged Keras and TensorFlow offering.
And as of MLflow 1.15, it is also the default there. So here’s an example; we can just focus on the current diagram here. This is just for informational purposes, because we’re in transition. If folks have a previous HDF5 legacy use case, you can look at the right there: there is a step that would translate the HDF5 format into SavedModel. But if you’re starting from scratch, you can just skip that step, and you will essentially get SavedModel by default. If you go to the previous link, it’s not only TensorFlow Serving that I demonstrate there. I also demonstrate other TensorFlow serving options, mainly the TensorFlow Lite model, which is made for edge devices and mobile, and TensorFlow.js, which is a format for browsers and JavaScript. So you can embed your scoring inside a browser or on edge devices. Those are two different formats that TensorFlow offers.
Now, those are not native, they’re not built into MLflow per se, but they can be integrated. MLflow has this concept of artifacts, as I mentioned before. Artifacts are essentially used to store your serialized bits. They’re essentially cloud storage based, and they can be any arbitrary object you want: a PNG, a plot, data, anything. Your models are stored under a managed concept called a model artifact. That’s the native flavor, which in this case is SavedModel. In addition, you can store other artifacts. Since MLflow doesn’t come with native support for TensorFlow Lite or TensorFlow.js, we can store those simply as bits. We take the TensorFlow code that serializes the model to a Lite model format or a JS model format, and we store the result as an arbitrary artifact. Then we can later retrieve it for scoring. So that demonstrates how extensible MLflow is.
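Storing converted model bytes as an arbitrary artifact, as described above, could look roughly like this sketch. `mlflow.log_artifact(local_path, artifact_path)` is the real MLflow call; the helper name, the file name `model.tflite`, the artifact path `tflite_model` and the injectable `log_artifact` parameter (added so the logic can be exercised without a tracking server) are assumptions.

```python
import os
import tempfile


def log_bytes_as_artifact(payload, filename, artifact_path, log_artifact=None):
    """Write serialized model bytes to a temp file and log it as a custom artifact."""
    if log_artifact is None:
        import mlflow  # imported lazily; requires mlflow and an active run
        log_artifact = mlflow.log_artifact
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, filename)
        with open(local_path, "wb") as handle:
            handle.write(payload)
        log_artifact(local_path, artifact_path)


# In real use the bytes would come from a converter, for example
# tf.lite.TFLiteConverter.from_keras_model(model).convert().
# Here a recording stand-in replaces the tracking server.
recorded = []


def fake_log_artifact(local_path, artifact_path):
    with open(local_path, "rb") as handle:
        recorded.append((os.path.basename(local_path), artifact_path, handle.read()))


log_bytes_as_artifact(b"\x00fake-tflite-bytes", "model.tflite", "tflite_model",
                      log_artifact=fake_log_artifact)
print(recorded)
```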
So here’s a TensorFlow Serving example. This is based upon the standard TensorFlow Serving documentation; we’re simply spinning up a TensorFlow Serving Docker container. The code has already pulled a registered model from the model registry, and it has used the TensorFlow Serving API to embed that SavedModel into TensorFlow Serving. So essentially, we extract the model from MLflow, go through the standard TensorFlow Serving steps, build a container, and then submit a request in TensorFlow JSON syntax, which is slightly different from MLflow’s syntax but essentially the same creature, and we get a prediction back. Here’s an example inside MLflow, for those of you who are familiar with an MLflow run. Besides the metadata, which like I mentioned before is stored in a SQL database, every run has another part stored in cloud storage, called the artifacts. In this case, we have a pretty fancy example here.
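A request in TensorFlow Serving’s JSON syntax, as mentioned above, goes to a `/v1/models/<name>:predict` URL rather than to `/invocations`. The sketch below only builds the request; the model name `keras_model`, the host and the feature values are placeholders, and 8501 is TensorFlow Serving’s usual REST port.

```python
import json
import urllib.request


def build_tf_serving_request(model_name, instances, host="localhost", port=8501):
    """Create a predict request for TensorFlow Serving's REST API."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


# Hypothetical model name and a single example with four features.
request = build_tf_serving_request("keras_model", [[5.1, 3.5, 1.4, 0.2]])
print(request.full_url)

# Sending it requires a running container; the reply is JSON of the form
# {"predictions": [...]}:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["predictions"])
```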
Up at the top, we have what’s called ONNX. I’ll get into that a little bit later, but ONNX is basically an interoperable model format that is supported by a large number of vendors, principally Microsoft, SageMaker, Facebook, Nvidia, Intel, etc. Its goal is to have a generic, common model format. As you’re well aware, every framework has its own proprietary or custom model format, and it’s very difficult or time-consuming to translate one into another. So let’s say you train in TensorFlow, and for some reason you want to do your scoring in TorchServe; there’s no really easy way to do that. ONNX has basically tried to solve this problem by translating that TensorFlow format into a generic ONNX format. And then there’s an ONNX runtime that will execute that.
And then below the ONNX model, we have a custom artifact, like I mentioned. ONNX is an MLflow concept, so you can save an ONNX model as a managed MLflow model. The TensorFlow.js model artifact below that, with its sub-artifacts, is a custom artifact. It’s not supported by MLflow, so you have to use native TensorFlow.js commands to save it. Same with the TensorFlow Lite model; that’s a custom artifact there. The TensorFlow model artifact is actually a managed MLflow artifact. You can see that from the sub-artifacts: under the ONNX model you’ll see MLmodel, and under the TensorFlow model you’ll also see an artifact called MLmodel, plus the conda.yaml. Those are two metadata concepts: the MLmodel file defines metadata around that model, and the conda.yaml defines its dependencies. And obviously, those don’t exist for custom artifacts.
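The distinction between managed and custom artifacts can be checked mechanically: a managed model directory contains an `MLmodel` file, while a custom artifact does not. The following Python sketch builds a throwaway directory layout loosely modeled on the run described above and classifies each sub-directory. The directory names are hypothetical, and the check is a simple heuristic for illustration, not an MLflow API.

```python
import tempfile
from pathlib import Path


def is_managed_model(artifact_dir):
    """Heuristic: a managed MLflow model directory contains an MLmodel file."""
    return (Path(artifact_dir) / "MLmodel").is_file()


def classify_artifacts(artifact_root):
    """Label every sub-directory of a run's artifact root as managed or custom."""
    root = Path(artifact_root)
    return {
        path.name: ("managed" if is_managed_model(path) else "custom")
        for path in sorted(root.iterdir())
        if path.is_dir()
    }


# Hypothetical artifact layout, created locally just for this demonstration.
layout = {
    "onnx-model": ["MLmodel", "conda.yaml", "model.onnx"],
    "tensorflow-model": ["MLmodel", "conda.yaml"],
    "tensorflow-lite-model": ["model.tflite"],
    "tensorflow-js-model": ["model.json"],
}

with tempfile.TemporaryDirectory() as tmp:
    for dir_name, file_names in layout.items():
        directory = Path(tmp) / dir_name
        directory.mkdir()
        for file_name in file_names:
            (directory / file_name).write_text("")
    result = classify_artifacts(tmp)

print(result)
```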
If you look at the right-hand box there, you’ll see more details on the differences between managed models and unmanaged models. And of course, the managed models all leverage the Pyfunc concept, which is a generic wrapper with one method called predict that allows all this magical tooling to happen for you. And then there are links there for those examples on GitHub. So as I touched upon, ONNX is an interoperable model format that has a lot of the heavyweight vendors out there supporting it. In theory, it’s train once, deploy anywhere. The caveat here is that it depends upon the actual converters for each one of these frameworks. So if you imagine you have, let’s say, 20 input frameworks and 20 output frameworks, somebody actually needs to create a converter from, say, TensorFlow into ONNX format. And these all vary in maturity. So this is essentially an open source project.
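Going back to the generic wrapper with a single predict method mentioned above, the idea can be sketched in plain Python. This is only an illustration of the pattern and does not import MLflow; in MLflow itself the corresponding flavor lives in `mlflow.pyfunc`, and custom models subclass `mlflow.pyfunc.PythonModel`. The two toy "framework" classes and all names below are hypothetical.

```python
class GenericModel:
    """Wraps any callable behind one uniform predict() method."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn

    def predict(self, model_input):
        return self._predict_fn(model_input)


class ThresholdClassifier:
    """Toy stand-in for one framework; its native method is classify()."""

    def __init__(self, threshold):
        self.threshold = threshold

    def classify(self, values):
        return [1 if value >= self.threshold else 0 for value in values]


class LinearScorer:
    """Toy stand-in for another framework; its native method is transform()."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def transform(self, values):
        return [self.slope * value + self.intercept for value in values]


def score_all(models, model_input):
    """Tooling that only needs predict(), regardless of the underlying framework."""
    return {name: model.predict(model_input) for name, model in models.items()}


models = {
    "classifier": GenericModel(ThresholdClassifier(0.5).classify),
    "scorer": GenericModel(LinearScorer(2.0, 1.0).transform),
}
results = score_all(models, [0.0, 0.5, 1.0])
print(results)
```

Because `score_all` depends only on `predict`, serving and batch-scoring tools can be written once and reused across flavors, which is the point of the wrapper.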
In my experience, I’ve had the best luck with, for example, Scikit-learn and TensorFlow. It really depends upon the interest, customer demand, and so forth. Microsoft is a huge supporter of this. According to their documentation, the Bing search engine, amongst many other products, is a heavy user of it. So it’s definitely got a lot of traction. And the ONNX flavor is a native MLflow model here. So you would essentially take your, let’s say, TensorFlow model, call the standard ONNX converter, and then save the result in MLflow as a native flavor. It’s also embedded: you can run ONNX models from the MLflow model server. So that’s definitely an option for those who are interested in it.
Okay, now we’re going to touch upon Databricks. Databricks, as I mentioned, has a managed MLflow offering, which essentially brings enterprise features to MLflow in terms of integrating it tightly with Databricks notebooks, security, ACLs, permissions, etc. So here are the four components of MLflow, the four pillars. We have models on the left here, which are essentially the serialized bits, the flavors. So Scikit-learn is a flavor, TensorFlow is a flavor, Pyfunc is a flavor. Then we have experiment tracking. This is your training cycle. You’re basically going through your hyperparameter loops, and we essentially record the input parameters, the output metrics, the serialized bits of the model, and any arbitrary artifacts you have.
And then once you’re done with your training, you take the best model and push it into the registry. And we have a new feature called model serving, which is tightly integrated into Databricks. As I mentioned before, if you want to push your model out to a third-party cloud provider, there are a lot of manual steps that are outside of Databricks, so many customers have asked for a much simpler, integrated solution. So Databricks now has an offering. It is not currently meant for production. You can call it lightweight. It is increasingly popular, and Databricks has on its roadmap a production-grade model serving that will handle scale and low-latency SLAs.
So essentially, we have one endpoint, an HTTP endpoint, that is publicly exposed using standard Databricks authentication, your personal token. Right now, as I mentioned, it’s basically in public preview; it’s in an experimental mode. There’s only one node; it’s a one-node cluster. You can select your instance type, so you have a choice there. And it’s meant for light loads and testing. So let’s say you will eventually want to run this at scale on your cloud provider, but you’re going through testing iterations; you can test all of this within Databricks. And this model server will actually contain all the versions of your model. So in a sense, it has a multi-model concept: not different models, but different versions of the same model.
There’s a nice feature: you can score directly from the UI for convenience, or, of course, using the standard HTTP protocol. As I mentioned, the roadmap is very exciting. There’s a major undertaking to provide production-grade serving. This will be a multi-node cluster with auto scaling, model monitoring, model drift and metrics. You’ll be able to access the input requests and the responses if you want to do custom model monitoring. It will be low-latency, highly available, scalable, all the fun stuff that Databricks offers, and in addition, there will be GPU support. So here’s an example: if you go to the registry, you go to the serving tab and you can choose your instance type. Then you launch the model, and you can see the different versions and, of course, track it through the model events.
Here’s an example of what the UI scoring looks like. There are actually two endpoints here. One is the raw version, so this is version seven, as you can see under the model URL widget here. In addition, this version has been tagged with the production stage. So you can invoke either one of those. The input, like we mentioned before, is the pandas JSON format. On the left there, you can just cut and paste a request, and the response is on the right.
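For readers who want to call those two endpoints from code rather than the UI, here is a minimal Python sketch that builds (but does not send) the HTTP requests. The URL pattern `https://<host>/model/<name>/<version-or-stage>/invocations`, the `Bearer` token header, the plain `application/json` content type, and every name below (host, model name, token, columns) are assumptions for illustration; check the Databricks model serving documentation for the exact form in your workspace.

```python
import json
import urllib.request


def build_scoring_request(host, model_name, version_or_stage, token, columns, rows):
    """Build a POST request for a model endpoint using pandas split-oriented JSON."""
    # Assumed URL pattern; verify against your workspace before use.
    url = f"https://{host}/model/{model_name}/{version_or_stage}/invocations"
    body = json.dumps({"columns": columns, "data": rows}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical values, used only for illustration.
host = "example.cloud.databricks.com"
token = "<personal-token>"
columns = ["sepal_length", "sepal_width"]
rows = [[5.1, 3.5]]

# One request addressed by version number, one addressed by stage.
version_request = build_scoring_request(host, "my-model", "7", token, columns, rows)
stage_request = build_scoring_request(host, "my-model", "Production", token, columns, rows)

print(version_request.full_url)
print(stage_request.full_url)
# To actually score, pass a request to urllib.request.urlopen(...) with a real host and token.
```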
Here’s an example of how you would do that with cURL. You simply take that endpoint URI, specify the appropriate JSON, and then give it the data. This is exactly the same syntax we saw before. This is the actual MLflow scoring server under the hood, so it’s the exact same format. It’s just a different incarnation of it. Here are some resources on Databricks model serving. There have been several blog posts on it, and there’s official documentation out there from Databricks, so you can go and get some more information. Feedback is very important to us, so if you have any questions or comments on this presentation, please feel free to reach out and get in touch with us. We’ll be looking forward to your feedback. Okay, that’s basically it. Have a good day and have fun with model serving. Thanks.

Andre Mesarovic

Sr. Resident Solutions Architect at Databricks with a focus on MLflow and ML production pipelines.