When it comes to Large Scale data processing and Machine Learning, Apache Spark is no doubt one of the top battle-tested frameworks out there for handling batched or streaming workloads. The ease of use, built-in Machine Learning modules, and multi-language support makes it a very attractive choice for data wonks. However bootstrapping and getting off the ground could be difficult for most teams without leveraging a Spark cluster that is already pre-provisioned and provided as a managed service in the Cloud, while this is a very attractive choice to get going, in the long run, it could be a very expensive option if it’s not well managed.
As an alternative to this approach, our team has been exploring and working a lot with running Spark and all our Machine Learning workloads and pipelines as containerized Docker packages on Kubernetes. This provides an infrastructure-agnostic abstraction layer for us, and as a result, it improves our operational efficiency and reduces our overall compute cost. Most importantly, we can easily target our Spark workload deployment to run on any major Cloud or On-prem infrastructure (with Kubernetes as the common denominator) by just modifying a few configurations.
In this talk, we will walk you through the process our team follows to make it easy for us to run a production deployment of our Machine Learning workloads and pipelines on Kubernetes which seamlessly allows us to port our implementation from a local Kubernetes set up on the laptop during development to either an On-prem or Cloud Kubernetes environment
Charles Adetilo…: Good day, everyone. Today, we’re going to be talking about infrastructure agnostic machine learning workload deployments, which basically means how to deploy your machine learning workloads to multiple cloud environments and to deploy it on-prem with just one deployment setup. Abi and I will be giving this talk. Abi and I together work at MavenCode. So just a little bit about us, we work for MavenCode. MavenCode is an artificial intelligence solutions company located in Dallas, Texas. We do product development, training, consulting around provisioning scalable data pipelines in the cloud. We do deployments of machine learning models in different platforms in different cloud environments. And we do a lot of things around streaming IOT-edge device application development as well. So a little bit about us. My name is Charles Adetiloye, I’m a machine learning platform engineer at MavenCode. I’ve been doing this for the past 15 years, building large scale distributed application. And with me today is Abi. Abi, you want to introduce yourself?
Abi Akogun: My name is Abi Akogun. I’m a machine learning and data science consultant, here at MavenCode. I have experience building and deploying large scale machine learning models across different industries and different verticals like healthcare, finance, telecommunications, and insurance. So in terms of agenda today, we’ll start by going through an overview of a machine learning model deployment workflow. Next, we talk about the various approaches to model training and model deployment in the cloud. After that, we’ll go ahead and talk about how we can deploy machine learning workloads in the cloud. We’ll talk about implementing feature storage backend, basically for machine learning model training. We’ll talk about how you can run containerized Spark ML workloads on Kubernetes with Kubeflow pipeline.
Overview of machine learning deployment workflow. When you deploy machine learning models, the first aspects to this has to do with getting your data. You source the data from different locations. And next thing is when this data comes, it doesn’t come prepared. It doesn’t come ready. So you have to do some kind of pre-processing. So you have a data, some parts of this data like you have skewed data, you have missing data points. You want to do some pre-processing to get the data ready for machine learning. After you’re done doing some pre-processing, where you look at missing values, you look at things like outliers. Then you go to the next stage, which is feature engineering. So during pre-processing, if you run into issues in terms of your data, you want to fix those processes and you want to create features from your data.
Once you have your feature ready, then you can go to the next stage. It is where you actually do the model training and your model evaluation. In model training, this is where you apply different logarithms and this is where you look at different metrics. After you’re satisfied and you pick which of the models you think is good for model deployment, you go to the next stage, which is where you want to use the model for actual inferencing and scoring as well as model management. And also, once you have gone through this entire pipeline, you’re ready to push your model for inferencing.
In terms of this machine learning workload deployments, you can have all these steps that we’ve put together right there, the pipeline. You can do this either on any of the clouds like Google cloud, AWS, Azure, and this also can be done on-prem. In terms of efforts for machine learning model deployment. So your data verification as when you do your data preparation and storage, after you’ve gone through the data preparation and storage, and you’ve gone through all the different processes that we’ve talked about, then you talk about your model inferencing, your model monitoring. And this is where you need efficient compute resource management. In terms of the workflow as well, data sourcing usually takes 16% based on experience from what we’ve been doing over time. Then the pre-processing part of this takes at least 32% of the time. And your feature engineering piece is 10%. In model training itself, and the model evaluation takes about 36% of your time. Then the scoring and management is only about 2%. And the model inferencing is about 4%.
A typical machine learning developer workflow is what we’re trying to show here. Like we said, we have a pipeline, you start from data sourcing, you go to pre-processing, you go to feature engineering. So all these data sourcing, the pre-processing and feature engineering, all these, once you get your features, you can put your features in a storage location. And that’s where we add things like S3, like Google cloud storage and Azure storage. That’s where they come into play. In this case, you have a transformed data, data that is ready to be used downstream by the data scientists. And you store this data with the right features in a storage location.
Once your data is in a storage location and the data is prepared, then we can talk about model training. The model training, the model scoring and managements as well as inferencing, this can all happen in a cloud environment. The data scientists, the ML engineers, they work hand in hand in time for pulling the data and get the data process before you start ML training on a managed cloud service.
Charles Adetilo…: Thank you very much, Abi. So just to illustrate a lot more better with what Abi just said. A typical developer machine learning workflow looks just like this. You have your storage where the data is, and you’re trying to wrangle the data, pre-process the data and make sure the data is ready for ML training. And you have the compute environment where you want to load all these your compute workload. So if we look at all the steps that Abi mentioned previously, the data sourcing and the pre-processing, you can enter that on the cloud by putting your data on things like S3 bucket, Google storage bucket and Azure storage buckets. And your data can come from different sources, streaming data from S&Qs or databases even from another bucket. So once we have all those data processed and we have it in a form where we can do the feature engineering and the data’s ready to do the ML model, there are different options out there right now in terms of managed services on Google cloud.
I mean, you have different… You have the Google cloud AI, you have the SageMaker and you have the Azure ML. So as a machine learning engineer or a data scientist, you have all these options across all the different clouds that you can use. So, depending on your skills and what you do and how you do things within your enterprise, you maybe boxed to one of these options and things like that. So this is just a view from the perspective of a single ML engineer. So if we look at it from the enterprise level where you have multiple data scientists and multiple teams doing things in different ways, it looks a little bit more chaotic, especially now when there’s a multicloud strategy in most organizations where you want to make sure you have all the major cloud providers, you want to have your on-prem and things like that.
So the question is, how can we make things a lot more better and not have a workflow like this? And that’s why we started looking at ways of building a cloud agnostic deployment that basically allows us to run on any of the major cloud providers on-prem. So imagine you have multiple data scientists working on different things and a lot of times they’re sharing data or pulling data. And all of the data scientists are connecting to all this data stores and trying to pull the data to train a model, and they can decide to just train their model with Google cloud AI or SageMaker or Azure ML and things like that. It becomes highly unmanageable. So, to solve this, we started looking at how we can simplify things. And the major things that we saw around ML training is, you have the compute, which is where you do your training and the storage.
So most of the time, implementing machine learning, it becomes a little bit expensive because of the cost of compute and the cost of storage and things like that. So, our goal was to see how we can simplify this thing, especially if you have multiple cloud environments across multiple team. We thought about abstracting our compute workloads to run on Kubernetes. And ML workloads, we thought about building everything on feature store so that, no matter where the data is coming from, if we can build it up and have it in the feature store, every data scientist or every ML engineers can go. Engineer on the team can go pick up the features they need, and they can use compute on the Kubernetes environment.
So going back to that workflow again. So imagine this is what we have, where you have your storage, you have your compute, and we go in, we pull all these data, we run a compute on them and we try to get the results. So instead of having it that way, why not let us have a kind of cloud neutral deployment where we can abstract a lot of things that we’re doing such that, our storage can be seen to the data scientist as a feature store? They don’t have to go through all the process of, pre-processing the data, getting the data from all the source and combining the data or doing aggregation and things like that. Can we expose the data to them just as a feature store? And when they want to deploy, instead of worrying about all the underlying infrastructure on Google cloud or Azure or AWS, can we deploy the workloads on Kubernetes?
So the question is, why the need for cloud agnostic infrastructure? So for us, it makes it easier to migrate our workloads. So sometimes you’re doing something on a particular cloud environment, so let’s say Azure, you want to move it to Google cloud, or you want to move it to AWS. So the migration was a lot more easier for us, especially since we do a lot of consulting in this space. We have multiple clients with different cloud providers and want to be able to shift the workloads across all these cloud providers. And the fact that we’re not tied to a particular cloud infrastructure stack, makes it easy for us to try new things and experiment a lot more easily. And it’s easier for us to implement best practices. So basically, our solution is portable across all the major cloud infrastructure and on-prem. So we can establish best practice patterns around building a feature store, around deploying models, around model serving and things like that.
And one of the things that this provides is a common based denominator for all enterprise ML work, especially in an ML environments where you have an hybrid of all these different cloud infrastructure and things like that. But the overall benefit is to control costs because you can basically forecast your demand, estimates your demand, you’re not running on a managed service where you don’t really understand what’s going on behind the scenes and stuff like that. But with the kind of deployment we have, you can easily manage your utilization, know what people are doing, know when things are bootstrapped and how long they’re going to be on and things like that.
So a cloud agnostic ML environment that we’ll work with looks like this, we have the feature store. When using Apache, we need to build our offline feature store. And one of the reasons that we work with Apache is, it basically allows us combine historical and fresh data that are coming into the store. And we can combine it into one and write it out into the cloud. Hudi backs up into Amazon S3 bucket, basically means most of all the cloud infrastructure storage buckets, so we can use them with Google cloud or Azure storage as well. And once we have our feature store, we have Kubernetes set up. Most of the time we can use a managed Kubernetes or you can set your own Kubernetes up. But the fact that we’re running on Kubernetes allows us to package our model training into containers that can connect to this feature store to grab the data and use it for the training. So that’s on a high level the kind of abstraction that we have and we run.
So what’s the feature store all about? So feature is a measurable observable attribute that is part of an input to a machine learning model. So let’s say you want to train a model, and before you train your model, you need some input. So you have the feature vector that represents all the attributes that goes inside your algorithm for model training. And over time, as more data starts coming in, you can have more and more features coming in for your model training. So features that derive from your raw data store, it could be, let’s say, a user table, it could be a streaming data source from IOT, or it could be aggregates of inputs, or it could be something that you compute over windows of time, like every five minutes tell me the number of people that clicked on a particular item or hourly or daily and things like that.
So you have your feature vector that you compute, and you fit into the model for your model training. So, feature changes over time and over time, as you add more features, you may want to retrain your model once you see that your models are drifting. So you have the time component. You have all these features flowing into your model when you use it to train your model and things like that. So, the advantages of having a feature store includes for us is, it makes it easy to operationalize our ML workflow. Most importantly, the data management and storage of the data for our models. Features can be shared easily among teams running different models and pipelines. And we get to version our datasets and track changes easily.
And in terms of consistency between the input attributes that we use for our training and the model serving, basically, if you have everything in the feature, especially in online feature store, we can use the same attributes that you’re using for your model training. And you can use it to enrich your input that you use in the inferencing.
So there are two types of feature stores. Mainly the first one is the offline feature store that you use for batch training. And you have the online feature store where you use that for inferencing, so you basically use it to enrich the input data that the user is sending to inference and you base before you send it to the model and do your inferencing and things like that.
So we’re going to dwell a lot more on the offline feature store, which is where we do a lot of ML training. So, let’s say we have a streaming and batch data source coming in, we’ll use a Kubeflow pipeline to spread the inflow of data. So once comes in, we’ll try to write it to our feature store, which is Apache Hudi sitting on any of the major cloud providers so we have a configuration for the feature store build with Hudi. So the data flows into our pipeline and once the data flows into our Kubeflow pipeline with bootstrap as Spark operator, that writes the data at the end of it to the Hudi storage system.
So why did we choose Hudi? The need for a unified platform where new data can be made available in addition to the historical data within minutes. So every time over time, you keep acquiring data, it could be an IOT application, it could be a database, where you’re adding more users and stuff like that. So one of the main challenges before Hudi was, how do you reconcile the new data that you’re getting with the old data without having to recompute the whole thing all over again? So Hudi solved that problem for us. So that’s one of the reasons why we’re using Hudi. And if we need to do a quick computation, because we can have a Spark job that runs and computes extended data frame, we can add more column computes and derive the values that we want to do. And we can just write it back to Hudi.
The incremental versioning of our feature collections. So one of the things we need to do sometimes is, you want to go back in time and maybe retrain your pipeline, or retrain a model to try to see what’s going on. So, the fact that Hudi has a timestamp on all these roles allows us do a time-travel in terms of how we query. And whenever we want to run a model training on new datasets or historical datasets, we do some experiments and things like that. And it backs up into Azure, Google and AWS cloud storage layer. So with that, we ensure portability of our datasets across all the major cloud providers. And of course, we do a lot of things with Spark and PySpark, so it makes it a lot easy working with Hudi.
So getting data into Hudi feature store for us, it is dominantly with Kubeflow pipeline, where we have these stages in Kubeflow pipeline. So Kubeflow, it’s a toolbox that runs on Kubernetes that allows you to do a lot of ML training and model training on Kubernetes. So for us, we use Kubeflow to orchestrate a lot of all this process. So let’s imagine in this case, we have a Kafka data streamer that connects with Kafka data store. So we have a components that will connect to the data store to pick up the data. It will connect to the topic, to connect to the Kafka server, pick up the data. And once we pick up the data, one of the things we try to do is to validate the Schema. So we try to run a validation to make sure that the data we’re getting matches what we’re expecting. Depending on the condition in some cases, we may need to drop the datasets. In some cases, we may need to assume a default value if the data is not there. So we’ll run a validation and we’ll create a Schema that maps the dataset that came in.
Then the next thing we do, that’s the pre-processing, which is we manage to transform the data into a new form to make sure that it’s in the format that we want to write in our data store as a feature. So in this case, once we’ve done the pre-processing and the transformation, we’ll now write it to the Hudi table where we just have it as a data frame that we’ll write into the Hudi feature store. So we do a lot of all these orchestration with Kubeflow pipeline.
And just to focus a little bit more about Hudi writer, this is what it looks like. We connect to a feature store, we create a PySpark session. We make sure we have all the packages that we need. We do some configuration, the Hudi options, we add some configurations in there for the write And we write inside the data store. We write the data frame. In this case, the data frame will contain all the features that we want to write and we just do an input write. And we specify the format as an Hudi table.
So that’s how everything looks on a high level. So apart from looks like this, we have a feature store where we have the datasets that the data scientists can consume. And once they consume the data from the feature store, they want to run their ML workloads with it, on Kubernetes. And because it’s Kubernetes, that’s the common denominator. You can run Kubernetes on AWS, on Azure, or Google cloud. And that means you can run your workload, your ML jobs on any of the major cloud providers as long as you can containerize it and you can auto-scale your Kubernetes node pool and things like that.
So for us, we run machine learning workloads on Kubernetes, we’re leveraging an open source project called Kubeflow that allows you to run ML workflows on Kubernetes. Basically, it’s a toolbox with a lot of things that you can use to run your workloads on Kubernetes. So it has these operators. So operators are things that helps you manage the deployment of your components or your containers on Kubeflow, you can manage the life cycle of a particular deployment by monitoring things and making sure things are deployed successfully. So there’re different kinds of operators of Kubeflow. There’s a TensorFlow, the PyTorch, the XGBoost and the Spark Operate of course, which is getting a lot of leverage right now and it gets bundled with the Kubeflow deployment.
So on a high level, the cloud agnostic ML, machine learning development environments in a typical enterprise looks like this. You may be running on Azure or AWS or Google cloud, it doesn’t matter, as long as you have to Kubernetes. And your Kubernetes will contain a bunch of CPU node pools or GPU node pools if you re doing things with GPU. And you have Kubeflow that allows you to deploy your machine learning… Kubeflow that allows you to run your machine learning components. And Kubeflow has a bunch of operators. For us, we do a lot of things around the Spark Operator.
So a data scientists or an ML engineer manager, will just need to whip up a Jupyter Notebook around the PySpark jobs. And the whole life cycle of the job is controlled within their namespace on Kubernetes. And that means they get allocated a certain percentage of resource within their namespace and they can run their job easily within that namespace. And you have other operators as well, like the TensorFlow operator, depending on what you’re doing, if you’re doing things with TensorFlow or PyTorch operator. So from there, we can connect to a feature store, which is prepare datasets. And basically we can use it to run our ML models and things like that.
So on a good day, the workflow for a typical ML developer in our team looks like this. You write your PySpark code, you containerize it. And once you containerize it, you create a Spark application YAML, which is the CRD that defines your Spark job is going to run on Kubernetes. And with your Kubectl apply, and that deploys the job to the Kubernetes cluster. So for you to deploy your job to the Kubernetes cluster, it looks like this, where you come in there, you do a Kubectl apply, you submit your jobs. And once you submit your job, you have your Spark driver running in a container, and you have your executors running in their own containers as well.
And everything is all coordinated because you’re using the Spark Operator. And all these runs on Kubernetes, all containerized running on Kubernetes. And any of this, it can be on cloud, it can be on-prem, it can be on your local laptop. As long as you have Kubernetes, you can easily do that. So what you need to do is to just come in and do Kubectl apply and you basically deploy your Spark workloads on the Kubernetes cluster.
So now imagine you have multiple people from different teams around their ML jobs and they’re using the Spark Operator or TF-operator. We have the underlining infrastructure which is Kubernetes it can narrow the scale. And with Kubernetes cluster, you can be on the Amazon, Azure, Google cloud or on-prem or whatever you have been running. And all these teams can now come in and deploy their jobs. And depending on the configuration and what they’re trying to do, some people may be running a three node cluster, which is, you have three executing containers and one driver container. Some people may just have two executing and one driver and things like that. But the factor is, you have a share resource where everyone can come in and deploy their job, and we can monitor the life cycle of these jobs as things are going and things like that. So it doesn’t matter what you’re doing, we have a common base and everybody’s doing things the same way across the enterprise.
So a little bit more details around the YAML file for the deployment. So the Spark application YAML file allows you to specify the configuration of the Spark job that you’re trying to deploy. So in this case, we’re trying to deploy a Spark application for click stream analysis, that’s the ML job we’re trying to run. So you deployed a job in a particular namespace. So let’s say they’ve created a click stream-jobsnamespace. So that’s where the job is going to run and that’s where the job is going to reside. So we have a container that contains the workload, what we’re trying to deploy. And we have a Python file built inside that container that contains a PySpark code. And we’re saying, we want to run with PySpark 3.1.1. And we have different policies around it, depending on how you want to kick off a job.
If something fails in this case, we’re saying that if a job fails, we’ll try to retry three times and the interval between retries is 10 and things like that. Then you specify the number of drivers, which is typically one and the resource allocation to the driver. Same thing with the executor. In this case, I have one driver and one executor, and I have all these configurations. So this is all I need to submit a new job that will run my ML job in a Kubernetes environment run with Spark Operator enabled.
So we might decide to make things a little bit more fancy, where we want to connect to the feature store and we want to be able to split our datasets into train test split, and we want to fit in the training data to the Spark ML container that is going to run our Spark training job. And we have a test data that we’ll use for evaluation after that and we can do some predictions when the whole thing is done. So we can pipeline this and that’s where we use Kubeflow pipeline to orchestrate this whole process. We have a feature store reader that connects the feature store that contains all the attributes and all the datasets that we need for the model training. Then we feed it into the training test plate. And those are components that we’ve implemented and the Kubeflow pipeline components.
So just to dive a little bit deeper, the DSL, which is the domain specific language for describing the Kubeflow pipeline that deploys the job looks like this, what we have on the right. So you specify the DSL and the DSL creates the graph that you saw in the previous slide and you have all these components. In this case, we have a feature store reader component that goes in and reads data from the feature store. We have our trained test splits dataset that take whatever we got from our feature store, the output of it, which is going to be in packet format and it does the split. And you specify the split ratio 0.8 and 0.2 points. 0.8 for training and 0.2 for testing. And it’s going to execute after the feature store reader. So the first one will be used to connect to the feature store, get the datasets, export into the pocket format to a store relocation.
And we now train the split on the datasets. Then once we’ve done that, the next thing is to create the staging area for the training data. And we now do our ML training on the click stream data that we’ve prepared for training. And for that, this is a YAML, far on the right that contains the information that we saw in the previous slide. So we’re trying to deploy this resource so you use a resource op DSL on the Kubeflow pipeline. But if you notice this YAML configuration is similar to what we saw previously. So we’re trying to run a Python type code, which it goes without saying, it’s a PySpark job with Python 3. This is our base image, the image pool policy and stuff like that. Then we run our training and once we run our training, we specify the number of drivers, the drivers and the executors. In this case, we’re running with just one driver and we’ve bumped things up to three executors.
Abi Akogun: Thank you, Charles, for that detailed overview of how we do infrastructure agnostic machine learning workload deployment here at MavenCode. Next, we’re going to talk about cost comparison with managed cloud services on AWS. In terms of compute utilization, managed services running on AWS, they go around hundred percent utilization. Then your Kubeflow plus Hudi feature store takes about 30%. When you’re talking about start-up time, managed services running on [inaudible] takes about 66 seconds. But Kubeflow and Hudi feature store on AWS takes about 15 seconds. In terms of team agility and productivity, we have about six times productivity for both Kubeflow and plus Hudi feature store compared to AWS. Then on GCP, if you do a similar comparison on GCP, you can see we have a 26% utilization cost for Kubeflow plus Hudi feature store on Google cloud storage. In terms of start-up time, we have about 12 seconds. In terms of team agility and productivity, we have six times productivity.
In summary, we have a simplified common deployment of ML training process across multi-cloud and on-prem infrastructures. After the initial hump in learning curve and the overall team efficiency improves significantly. Teams are not locked into a particular cloud infrastructure. With this, what we have described here today, you’re free to go with any of the cloud providers. And also, it is easy for you to control costs. At the same time, it’s much easier for you to forecast your future capacity for demand capacity. Thank you very much.
Abi is a Machine Learning and Data Science Practitioner with experience building and deploying large-scale Machine Learning Applications in different industries that include Healthcare, Finance, Telec...
Charles is a Lead ML platforms engineer at MavenCode. He has well over 15 years of experience building large-scale, distributed applications. He has always been interested in building distributed syst...