A successful machine learning platform allows ML practitioners to focus solely on their experiments and models and minimizes the time it takes to develop ML applications and take them to production. However, building an ML Platform is typically not an easy task due to the many different components involved in the process. In this talk, we will show how two open source projects, Ray (https://ray.io/) and MLflow (https://mlflow.org/), work together to make it easy for ML platform developers to add scaling and experiment management to their platform.
We will first provide an overview of Ray and its native libraries: Ray Tune (https://tune.io) for distributed hyperparameter tuning and Ray Serve (https://docs.ray.io/en/master/serve/index.html) for scalable model serving. Then we will showcase how MLflow provides a perfect solution for managing experiments through integrations with Ray for tracking and model deployment. Finally, we will finish with a demo of an ML platform built on Ray, MLflow, and other open source tools.
Amog: Hi, everyone. Thanks for attending our presentation. I’m Amog Kamsetty and I’m here with Archit Kulkarni from the Ray team at Anyscale. Today we’re going over two open source libraries, Ray and MLflow, talking about some of the work we’ve been doing recently to integrate the two of them and how we can use both of them to build an ML platform. Just to get started, this is obviously a huge team effort. It’s not just us two; there are many people from the Ray team, as well as folks from the MLflow team at Databricks, who helped with this.
So an overview of today’s talk: we’re first going to go over what ML platforms are and why they’re important, then dive into these two libraries that I mentioned, Ray and MLflow, and finally end with a demo of what a sample ML platform built with MLflow and Ray could look like. So first, what are ML platforms? To understand why we need an ML platform, you have to take a look at what a typical ML process looks like. For a hobbyist or independent researcher, ML development is mostly straightforward. But for production ML applications, this process is pretty complicated and involves many parts.
Usually it starts off with some sort of business need and some time spent in initial research to find an ML method or technique that will be able to solve that problem. Then we go into an offline experimentation phase where you try out a bunch of different models and techniques and make sure that those models perform well on our offline metrics. This involves a lot of training, which requires a lot of GPUs and other computational resources, as well as hyperparameter tuning of our models. Finally, when we’ve determined that our models improve our metrics, we go into productionization. And even then, if we don’t see any improvement, we can end up going back to the beginning. So this process is complicated overall and very time consuming.
And just to put it in simpler terms, we can basically split the ML process into two components. One is the execution, which involves the feature engineering, the actual training of our models, as well as tuning them and finally serving them in both offline and online settings. In addition to the execution, we also have a management aspect, which involves tracking our data, our code, and the configurations that we use in order to get reproducible results, and finally being able to deploy our model in a variety of different environments. How can we simplify this complicated process? Well, we can take another view of the process and see that there really is a split in responsibilities between what a data scientist or research scientist does versus what an engineer does. Data and research scientists want to focus on the ML aspects of this process: using their expertise, doing the ML research, doing the initial data analysis, as well as the analysis of their predictions and model evaluations during the production stage. The work that data and research scientists do is not generalizable across different applications.
The models used for recommender systems are different from the ones you use for fraud detection, for example, which are different again from those used for reinforcement learning. On the other hand, we have the engineers. They handle a lot of the infrastructure: creating the data pipelines and training pipelines, handling the storage of your metadata, as well as your features and your data management. They’re also in charge of integrating with CI systems, for example, and taking your deployed model and making it available in a prediction service.
The work that engineers do, unlike the work that the data and research scientists do, is generalizable across different applications. Regardless of which use case you’re trying to tackle with machine learning, all of them have a need for the work that engineers do. And this leads to a natural abstraction. This is where the ML platform abstraction comes in. With an ML platform, you can have a single set of tools that can be used by your data and research scientists across your entire company or organization, regardless of which applications they’re building or which use cases they’re tackling. The ML platform provides two main things. One of them is scalable infrastructure for your ML computation, and the other is productivity improvements: a set of tools that make your data and research scientists more productive.
So just to hammer this point home, we can take a look at the scale of ML platforms at some big tech companies. For example, LinkedIn has over 500 AI engineers building models, and they have over 50 engineers building the ML platform itself to assist those AI engineers. The ML work at LinkedIn actually constitutes more than 50% of their offline compute demand, and this is growing more than 2x a year. So clearly ML platforms have become much more important at these companies. Many other companies like Uber, Airbnb, and Facebook are also investing heavily in ML platforms and have their own dedicated teams building ML platforms from the ground up.
So when building an ML platform, you can leverage a lot of different tools that already exist in the open source world. But going back to the original point, the two things you need are scalability and productivity, and there are two libraries you can use to solve these problems: Ray and MLflow. With Ray and MLflow, you can solve both execution and management. Ray solves the execution aspect by providing a framework for easily building distributed, scalable applications, and MLflow helps with the management of your experimentation.
Archit: So now I’m going to give you a little overview of Ray and some of its libraries. First of all, what is Ray? It’s a simple and general library for distributed computing, so distributing work across multiple cores on a single machine or even across hundreds of nodes. It’s agnostic to the type of work: here we’ll be focusing on machine learning workloads, but it’ll work for anything you can parallelize with distributed computation. Ray also comes with an ecosystem of libraries for ML and other use cases. There are the native libraries built on top of Ray, for example RLlib, Ray Tune, and Ray Serve, and there are also third-party libraries that run on Ray; we’ve listed a few here. Finally, Ray comes with tools for launching clusters on any cloud provider, so it makes it really easy to actually spin up a bunch of machines and have Ray running on them. So there are three key ideas at play with Ray.
The first is executing remote functions as tasks and instantiating remote classes as actors. So this is the task and actor paradigm from distributed computing and it naturally maps onto functions and classes, which as a programmer you’re already familiar with. And the key thing here is this allows you to support both stateless and stateful computations. Another idea is asynchronous execution using futures. This is something that enables parallelism and I’ll give you a demo of this in a couple of slides. And finally, there’s a distributed immutable object store.
And this allows for efficient communication where you can pass arguments by reference. So let me give you a glimpse of the Ray API. Here we just have a couple of simple Python functions: one of them reads an array from a file and another one adds two arrays. The primitive that Ray exposes is this @ray.remote decorator. What this does is take an arbitrary Python function and designate it as a remote function. Remote functions can then be called with function_name.remote, highlighted here, and this will return immediately with a future, which takes the form of an object ID that represents the result.
So by calling this, many different functions can run at the same time, and you can also construct computation graphs using this API, right in the Python code. Now, if you want to actually get the results, you’ll call ray.get, and this will block until the tasks have finished executing and will retrieve the results. So this API enables the straightforward parallelization of a lot of Python applications, and you can find out more in the Ray documentation. Now I’ll introduce another key Ray feature: actors. These are essentially remote classes. As an example, here’s a simple Python class with a method for incrementing a counter. You can add the remote decorator to convert it into an actor.
Then when we instantiate the class, a new worker process is created somewhere in the cluster that hosts that actor, and each method call on that actor creates tasks that are scheduled on that worker process. So in this example, these three tasks are executed sequentially on the actor; there’s no parallelism among these tasks, and they all share the mutable state of the actor. You can also easily schedule these tasks on GPUs by specifying resource requests, and this allows users to easily parallelize, for example, deep learning applications. So we use Ray to program what feels like an infinite laptop. With Ray, you have access to a rapidly expanding open ecosystem of scalable libraries, some of which we talked about earlier, and Ray functions as the universal compute substrate for all these applications. And of course it runs on your favorite cloud. So now I’ll pass it over to Amog, who’s going to tell us a bit about Ray Tune.
Amog: Thanks Archit. So as Archit mentioned, there are many libraries which leverage Ray’s task and actor paradigms to build other applications, and one of these is Ray Tune. Ray Tune is a scalable hyperparameter tuning library which comes with many features. First, Ray Tune comes with various search algorithms and schedulers from state-of-the-art research, including Bayesian optimization, hyperband, and population-based training. In addition, Ray Tune is also compatible with many different ML frameworks: regardless of whether you’re training your models with TensorFlow, PyTorch, XGBoost, et cetera, they’re still compatible with Ray Tune’s hyperparameter tuning capabilities.
But the two main features that Ray Tune provides are focused on simplifying execution. Ray Tune allows you to easily launch distributed and multi-GPU tuning jobs. All you have to do is start a Ray cluster, connect to it, and then run your training function, and this will all run in parallel, regardless of whether you’re parallelizing across the cores in your laptop or across hundreds of GPUs in a large cluster. Ray Tune also has automatic fault tolerance to save up to 3x on GPU costs by allowing you to use spot instances. And finally, Ray Tune also interoperates with other hyperparameter optimization libraries. You may already be tied into one of these other libraries, such as Optuna, Hyperopt, or SigOpt. Ray Tune allows you to leverage their algorithmic advancements and benefits while still using Tune for distributed execution and its fault tolerance capabilities.
So let’s take a look at the API for Ray Tune. The first thing you need is to define a training function. In this example, we’re simply going to instantiate a convolutional network and train it for a certain number of steps. Now, to add Tune, what we need to do is add this tune.report line to the training function: whatever metric you pass in that you’re trying to optimize for, it’ll report back to Tune. Finally, we take our training function and pass it into a tune.run call, and we can also specify any hyperparameters we want through this config argument. So in this case, we have a learning rate of 0.1, and you can also specify a search space.
So what this will do when you run your training function is create a hundred different samples, or a hundred different training runs. Each of them will have a different learning rate hyperparameter, sampled from this uniform search space from 0.001 to 0.1, and all of these different training runs are going to run in parallel. With Tune you can also easily leverage state-of-the-art schedulers and search algorithms. For example, if you want to do asynchronous hyperband, you can do so with Tune, or if you want to use population-based training, you can do that as well. For some of these schedulers, you also might have to use checkpointing. That’s very simple to do in Tune through this checkpoint directory argument that gets passed to your training function.
Archit: All right, thanks Amog. I’m now going to talk a little bit about Ray Serve. So let’s get into it. Ray Serve is, in a nutshell, a web framework built for machine learning model serving. Here’s what it’s like to serve your model on the web with Ray Serve. You’ll just drop your model into a class, and this class will have an init function where you can initialize your model; this step might take a while, as maybe you have to download a lot of data. Then when it’s called, you can pass in a web request and do any kind of pre-processing and post-processing you want before and after passing it to your model.
So Ray Serve is high-performance and flexible. On the performance side, it’s framework agnostic, it easily scales, as we’ll see in the demo later, and it supports batching and other features that are very useful for machine learning applications. But it’s also flexible: you can query your deployments over HTTP and also from Python programmatically, and this allows you to easily integrate Ray Serve with other tools. What makes all this possible, of course, is that Ray Serve is built on top of Ray. That means as a user of Ray Serve, there’s no need for you to think about inter-process communication, failure management, scheduling, or any of these other things that make distributed computation hard. You can just tell Ray Serve to scale up your model.
So here’s what the Ray Serve API looks like. You can serve any function or stateful class that you like, and Ray Serve will use multiple replicas to parallelize across cores and across nodes in your cluster. And there are two ways to query your deployment, like I mentioned: from HTTP, you can just do it like that, or you can query from Python using the Python API. These will both query your model in the same way.
Now let’s talk about another piece of open-source software that handles an important part of the ML life cycle, and that’s MLflow. Before saying what MLflow is, let’s talk about some of the challenges of machine learning in production. Some of them are listed here: it’s difficult to keep track of experiments, reproduce code, and package and deploy models in a unified way, and there’s no central store to manage models, including their versions and stage transitions, when some of them are still being developed and some of them are in production. MLflow is one tool that handles all of these challenges, and it’s library agnostic and language agnostic, so it’ll work with your existing code.
And it does this through four key components: Tracking, Projects, the Model Registry, and Models. We’ll see how Tracking and Models work together with Ray Serve and Ray Tune in some of the integrations coming up in this presentation. To give you an idea of Tracking: MLflow allows you to log whatever parameters, metrics, or artifacts you want. And here’s some of the metadata associated with an MLflow model. One important thing to notice here is that the model can be serialized; that’s what this pickled model is. Any MLflow model can be loaded as a Python function, and that’s key to integrating it with Ray Serve. So now I’ll hand it back to Amog to talk about some of the integrations.
Amog: Thanks Archit. So, we have two integrations between Ray and MLflow. The first one is between Ray Tune and MLflow tracking. If you recall the previous example with Ray Tune, in order to use it with MLflow tracking, all you have to do is add this MLflow logger callback. What this will do is that whenever your metrics are reported back to Tune, they will also automatically get logged to MLflow under the MLflow experiment. There’s also an alternative API you can use, which is the MLflow mixin decorator. This allows you to have more fine-grained control of what’s being logged to MLflow within your training function. In this example, if we decorate our training function with the MLflow mixin, we can now tell MLflow to autolog and use MLflow’s autologging capabilities, which handle automatic logging for certain frameworks such as XGBoost or PyTorch Lightning, as we will see later.
Archit: Thanks. And now I’ll talk a little bit about the Ray Serve and MLflow integration. To get it set up, it’s a pip package which you can install, and then to start running it, you’ll start a Ray cluster and start a Ray Serve instance using these commands. Here’s what the command line interface looks like for this plugin. There are many commands; one of them is create deployment, sort of the most basic command. You can see here we’ve specified -t ray-serve: that means Ray Serve is the backend for this plugin, but you can use various other plugins as well, made by other communities and companies. Here you can pass in the model URI, so any model that’s stored in your registry; you can access it by a number of different URIs, which can point to S3, for example. And then finally there’s this -C num_replicas=100.
That’s telling Ray Serve, via a Ray Serve-specific config parameter, to scale up the model to serve with 100 replicas in parallel, so that’ll really increase your throughput. There are many other commands, like deleting, updating, and listing deployments, and they all have corresponding functions in the Python API, so from the command line or Python, you can use all of these commands. Here’s the one for creating a deployment from Python. Integrating with Ray Serve is easy because Ray Serve deployments can be called from Python, and this gives you a clear conceptual separation where Ray Serve handles the data plane, the processing, and MLflow handles the control plane, the metadata and configuration. So now we’re going to give a brief demo of an example of building an ML platform using some of these integrations: the MLflow–Ray Tune integration and the MLflow–Ray Serve integration.
Amog: Let’s see how a data scientist would use an ML platform built with Ray and MLflow to train and tune a PyTorch model for MNIST classification, and then later on deploy that model into a web service. This is going to be a very bare-bones, pseudo ML platform, but for the sake of the demo it shows how we can leverage these integrations. The first thing we need is an MLflow tracking server, which I’ve started here. This is where all the team members are going to log their experiments and be able to visualize their runs. Now that we have this tracking server, the next thing we need to do is start our training. I have a Jupyter notebook running locally, and what I’m actually going to do is connect it to a remote Ray cluster.
What this allows us to do is still develop from the comfort of a laptop, but be able to leverage resources in the cloud through this Ray cluster that we’re connected to. As you can see, we have four GPUs available to us, even though this is running on my laptop. So now that we have GPUs, we can begin our training. I’m going to define this PyTorch Lightning model, a very basic classifier here. Then we create a training function, very straightforward: instantiate the classifier and the dataset, create the PyTorch Lightning trainer, and call trainer.fit. What we want to make sure we do here is specify autologging so that we can log our results to MLflow, and then decorate this function with the MLflow mixin decorator. Once we’ve defined a training function, we can do hyperparameter tuning.
We want to tune the learning rate hyperparameter over this search space, so we pass in our training function as well as our configuration to tune.run. We want each training run to use one CPU and one GPU, and we want a total of four training runs. Once we create our tuning function, we can perform our hyperparameter tuning, and we will see these experiment results show up on our MLflow tracking server. This will take some time to train, but once it’s finished, we’ll be able to see the results shown on our experiments tab here.
And so the first thing we see is that the parameters are being logged, and soon, once this is finished, we’ll be able to see the metrics being logged as well. So now we have four training runs logged to MLflow, along with the models for each training run. Now that we have everything logged in MLflow, all we have to do is go through all these runs, which we’ve logged to our tracking server, find the one with the best validation accuracy, and register that model with the MLflow Model Registry. So we just go through this, the model is now registered, and this is great. Now anyone can go and deploy this model, and we’re going to hand it over to Archit, who’ll show how we can take this registered model and deploy it on a service using Ray Serve.
Archit: So here we’re logged into a serving cluster. We’ve seen this before: mlflow deployments create -t, that’s the serving backend. We’re going to specify Ray Serve here, but you can use any other plugin, like RedisAI, TorchServe, or Algorithmia, and that’s what’s nice about the MLflow deployments API and the plugin architecture. We specify the model URI that we had from earlier, and then we give this deployment the name ptl_deployment. So when we run this command, it’s going to spin up one replica of this model on the Ray Serve instance. Here’s the Ray dashboard, and you can see that one backend replica has been spun up. One thing you might want to do is update a model on the fly, and you can do that with this command. Here I’ve updated the existing deployment, but I’ve passed in a new model URI.
And so when you run this command, Ray Serve will, without bringing the service down, swap the new model into the backend. Another thing you might want to do is take advantage of all the resources on your cluster: you might want to scale your model horizontally by making more replicas. Here’s the command to do this. It’s the same command, but we’ve passed in num_replicas=10, so when we run it, Ray Serve will spin up 10 replicas that will all be serving the model in parallel. As soon as this command returns, we can go over to the dashboard and see that there are now 10 replicas serving in parallel.
And finally, you can use the deployments API to run predictions. For this one, we’ll go to Python; all these commands exist both in the Python and in the CLI interface. It’s as simple as this: you import the deployments API, and then you can run the prediction. Here we’ll just run it on some random data and get some output. Of course you can query this over HTTP as well, but here we’re just highlighting the integration with the deployments API.
So this demo was kind of manual, right? We logged into the serving cluster and ran all these commands, but you can also do this programmatically as part of a production pipeline. For example, you could connect this to your continuous integration workflow, so that upon each commit to your repo that updates a model, the CI could trigger a hook that connects to the serving cluster and runs a command to swap the new model in without bringing the service down. We’d like to thank some of the people that made this possible: in particular, thanks to Jules Damji, Sid Murching, and Paul Ogilvie for their help and guidance with MLflow, and of course, thanks to the ML platform demo team and the rest of the Ray team as well. And please join us at the Ray Summit, coming up in June, where you can learn a lot more about Ray and its integrations and see how people are using Ray in the community. Thank you.
Amog Kamsetty is a software engineer at Anyscale where he works on building distributed training libraries and integrations on top of Ray. He previously completed his MS degree at UC Berkeley working ...
Archit Kulkarni is a software engineer at Anyscale working on the open-source library Ray Serve. Prior to joining Anyscale, Archit completed his PhD in mathematics at UC Berkeley.