We build machine learning products to support discovery and automation within the fitness, health, and wellness sector. Our products range from recommender systems that help consumers discover products from our customers within our fitness marketplace, to natural language techniques that let our customers create automated marketing emails to delight their own customers. In this talk, we will present our solution for training and deploying machine learning models into our production environment. We will talk about how our pipeline has evolved with tools like dbt, Airflow, and SageMaker to address various pain points in building and scaling the data pipelines that support our machine learning solutions across the breadth of our wellness and beauty product ecosystem.
Genna: Hi, everyone. Today Brandon and I are really excited to speak to you about our experiences building machine learning services for our health and wellness marketplace. So I believe this audience is familiar with how tedious and time consuming it can be to bring trained machine learning models into production. Today we’re going to speak about how we leveraged open source tools to design workflows around reusability and scalability that drastically reduce the time it takes to bring our services into production.
I’m going to start by introducing you to Mindbody and the types of machine learning products our team builds. Next I’ll dive into the details about the recommendation services we have built for our health and wellness marketplace. I will then hand it off to Brandon to describe how we productionize these services. Specifically, he will speak to the challenges we faced and how we went about addressing them.
So let me briefly tell you about Mindbody. We are the leading software provider for the health and wellness industry. Our mission is to connect the world to wellness, and we do that through two types of product offerings: Mindbody business and the Mindbody app.
Mindbody business provides software for studio owners that offers scheduling management, marketing automation, and anything else an owner might need to run and expand their business. The Mindbody app is probably what you’re more familiar with. It is the one-stop shop where you will find and book all of your fitness, wellness, and beauty needs in our global marketplace. So the AI/ML team at Mindbody was formed in 2019, and we’re charged with a mission to build machine learning powered services to complement our business and marketplace software.
Some examples of the services we’re building in our business-facing software include a machine learning powered business advisor suite. This includes services such as automated marketing campaigns and promotions; customer engagement engines for lead scoring, churn prediction, upselling, automated staff management and integrated goals; and next-best action recommenders. Within this sector, there are a wealth and variety of opportunities for providing machine learning services to help automate running the business.
So for our consumer-facing app, the machine learning opportunities that complement an online marketplace include recommendation engines, personalized search, personalized campaigns, and trust signals or data tagging. So today we’re going to focus on a couple of recommendation engines that we have built for the marketplace.
So with that, let’s jump into the problem we are addressing with these recommender systems. This is our consumer marketplace. On the left, you have an enormous number of consumers, each with many dimensions of preference: fitness modality, location, willingness to spend, instructor preferences, et cetera. On the right, you have an enormous number of providers, each with their own unique value proposition. So today our marketplace is a booking utility. This means that a consumer enters knowing what they want, books it, and then leaves. So we want to shift our booking utility into a marketplace where our consumers go to find their next favorite fitness class through rich, meaningful discovery. We are doing this by building recommender engines that enable this discovery by pushing relevant recommendations and offers to our consumers.
So to date, we have launched two different recommendation engines, one for our dynamic price offerings that offer last-minute classes at discounted rates, and one for our virtual fitness class marketplace. The dynamic pricing recommender is designed to increase purchases by surfacing relevant offers to the marketplace consumers. The approach consists of an ensemble of collaborative filters and content-based recommenders to personalize the recommendations. The end-to-end development cycle for this service was about nine months. The virtual class recommender is designed to increase the diversity of inventory that is surfaced to marketplace consumers and is actually three recommenders in one. It ensembles a couple of collaborative filters and consumer data in different ways to provide different sets of fitness class options. So we were able to apply some of the learnings from the dynamic pricing recommender and shorten the development cycle to five months.
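To make the collaborative filtering side of that ensemble concrete, here is a minimal sketch of an item-based collaborative filter. This is a toy illustration with hypothetical booking data and function names, not the production recommenders described above:

```python
from collections import defaultdict
from math import sqrt

# Toy interaction data: consumer -> set of classes they booked (hypothetical)
bookings = {
    "ann":   {"yoga_8am", "spin_6pm", "pilates_noon"},
    "ben":   {"yoga_8am", "spin_6pm"},
    "carla": {"spin_6pm", "hiit_7am"},
}

def item_similarity(a, b):
    """Cosine similarity between two classes, based on who co-booked them."""
    users_a = {u for u, items in bookings.items() if a in items}
    users_b = {u for u, items in bookings.items() if b in items}
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))

def recommend(user, top_n=2):
    """Score classes the user hasn't booked by similarity to past bookings."""
    seen = bookings[user]
    all_items = set().union(*bookings.values())
    scores = defaultdict(float)
    for candidate in all_items - seen:
        for item in seen:
            scores[candidate] += item_similarity(candidate, item)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A content-based recommender would score candidates from class attributes instead of co-booking history; an ensemble combines signals like these.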
The next recommender we want to build is a general recommender to provide personalized recommendations to consumers for all of the health, beauty, and wellness offerings in the marketplace. Our goal is to reuse the work we’ve done for the previous recommenders to dramatically reduce the development life cycle and rapidly put this new recommender into service.
So now that I’ve introduced you to the problem setting and the machine learning approach we took to solve this problem, I’m going to hand it over to Brandon to talk about how we productionized these services and reworked our MLOps architecture to produce a framework that dramatically reduces the time it takes to launch a new recommendation engine.
Brandon: Thanks, Genna. So now that you’ve got a feel for the problem that we’re solving, I want to take a step back and ask this question: “What goes into a machine learning project?” Most of the time when we think of machine learning, our first thought is that box right there in the center, your ML code and your models. Now this diagram is very popular right now in MLOps. It was put together by the Google Cloud team, and it really highlights just how complicated it is to deliver machine learning. There are two things this diagram is really helpful for understanding. First, the model selection and ML development effort really is just one task in a long list of engineering work that goes into delivering machine learning. There is a lot to consider when you sit down with your team and plan your project.
Second, the relative effort of model development is actually really small compared to the infrastructure, data engineering, and tooling that’s needed. And this is why it’s often said that nine out of ten models never actually make it to production. And now for us at Mindbody, this diagram is very real. We were often done with training and tuning our models months before we were actually able to deliver.
And one more way I like to think about this is by dividing the project into four different categories. To make our project successful, we need to invest in all four of these aspects of our product. And I like to think of them as levers or gears, where if you invest your time in one of these, it will actually have a multiplying effect on the success of your project, and if you’re really strategic, on future projects as well.
For example, an engineering team might focus heavily on the ML code and build some really, really great models. But if they don’t take the time to consider something like UI, it will have a really large impact on the usability and effectiveness of that effort as a whole. In our case, we realized that we needed to focus on those bottom two elements, our data and our infrastructure. We all know that with data, clean, quality, representative data can really be the deciding factor between a good and a bad model. And, similarly, if you don’t have the infrastructure to support the amount of training, or the deployment pipelines and testing to support your development and serving goals, you’re going to have a serious bottleneck, and you really won’t be able to measure the success of your models, iterate, and make improvements.
So our MLOps team took a step back, and we started to drill into how we could improve both our data and our infrastructure, and our platform as a whole, so that it would have a multiplying effect on all of our future projects.
And so here is a rough layout of what our recommender systems look like. Now, it was just a natural progression. On the left, we were given the task to build a dynamic pricing recommender. I’m calling it Recommender One here for simplicity. So we went to our data warehouse, and we copied over a data set that we were going to use as our training data. And we engineered features, we trained up a model, we even built an API and deployed it. It was a Flask Python API, which we wrapped in a Docker container and deployed to Kubernetes. And we also needed some consumer features available in our fast production database at time of serving.
So we built a pipeline to compute all these features and preferences and upload that to our production database. And it worked. It worked great. This recommender is out in production serving traffic. But then it was time to build our second recommender. I’m calling it Recommender Two. This is our virtual class recommender. And we followed a similar process. This time we trained up three models. One of those models is actually the same one that we used in the first recommender. We engineered our features, we built an API, we uploaded data to our production environment. And, again, it works. So we now have two working recommenders in production. But following this pattern, imagine what would happen when we want to deploy and build a recommender three or four or five or 10 or any other machine learning project that uses similar features in our marketplace.
This pattern comes with some obvious pain points. Duplicate data sets and feature engineering efforts. We really started from the ground up every time when it came to data and feature engineering. We copied over raw data from the data warehouse. And for each consumer, we needed a lot of the same features for both recommender systems. We needed the past visit history, the past sale history, aggregated sales by category, time of day, day of week. And every time we started a new project, it felt as if we were either reinventing the wheel, going back to the raw data, or just copy and pasting massive SQL queries from one repo to another. And so now we’re having problems managing two queries in two different places.
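As a rough illustration of the feature logic that kept getting duplicated, the aggregations mentioned above (sales by category, time of day, day of week) might look something like this sketch. The records and field names here are hypothetical:

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw sale records copied over from the data warehouse
sales = [
    {"consumer_id": 1, "category": "yoga", "ts": "2021-03-01T08:00:00"},
    {"consumer_id": 1, "category": "yoga", "ts": "2021-03-03T18:30:00"},
    {"consumer_id": 1, "category": "spin", "ts": "2021-03-06T10:00:00"},
]

def consumer_features(consumer_id, records):
    """Aggregate one consumer's sales by category, time of day, and day of week."""
    mine = [r for r in records if r["consumer_id"] == consumer_id]
    by_category = Counter(r["category"] for r in mine)
    by_time_of_day = Counter(
        "morning" if datetime.fromisoformat(r["ts"]).hour < 12 else "evening"
        for r in mine
    )
    by_day_of_week = Counter(
        datetime.fromisoformat(r["ts"]).strftime("%A") for r in mine
    )
    return {
        "total_sales": len(mine),
        "by_category": dict(by_category),
        "by_time_of_day": dict(by_time_of_day),
        "by_day_of_week": dict(by_day_of_week),
    }
```

When this kind of logic lives in each project's repo, two services can silently disagree on what "a visit" or "a sale" means; centralizing it is exactly what the feature store in the next section addresses.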
Another pain point was redundant models across both these projects, with really no mechanism to share them between projects. To deploy these models to be used by these APIs, what we were doing was really just manually uploading models to S3 buckets. Each API would pull from a different bucket dedicated to its own use case. You can see right here, model A is used in both projects. And really, in each project, we were maintaining our own separate versions and files of what was just the same model.
And then the third one was managing all of our different jobs. These jobs were non-standardized and very hard to manage. It’s not explicitly drawn here on this diagram, but if you look at most of these arrows, there is some sort of job or automation, or even just a manual job that needs to run. For example, model training, we need jobs to run model training, data upload between offline and online, evaluating our models, processing data for these features.
And really the way that we went about it was just following the path of least resistance. So for example, for model training, what we were doing was SSHing into an EC2 instance and running there, because we needed a lot of compute power. Or for model evaluation, we said, “Well, we’re deploying onto Kubernetes, so why don’t we just deploy cron jobs on these same Kubernetes clusters and run them in production?” Or a data upload job. Some of them were cron jobs, some of them were just run on our CI/CD build agents. And this really led to a disjointed platform which was impossible to manage, let alone monitor. And, again, the biggest problem here, in my opinion, was the scalability. It’s easy to imagine how this is going to break down when we want to build an ecosystem of high-quality ML at our company.
And now on paper this seems obvious if you look at this. It’s a swim lane approach. But when you’re deep in a project, you’re dealing with deadlines. Sometimes it seems like the fastest way forward is just to get tunnel vision, put on the blinders, and build out your single pipeline. Don’t think of anything else. Don’t be distracted by building out anything else. But at Mindbody, what we did was we took the blinders off and we started to address these pain points. And we realized that they could actually be solved pretty easily with some simple, easy-to-use open source tools.
So let’s dive in. The first pain point was duplicate data sets and feature engineering efforts. And the way that we solved this was by building out a feature store. Now, a feature store really is just an ecosystem of databases and queries. It’s not just a data store. I think of it as, again, an ecosystem of more than just a database: there are queries, there is code. And what it does is it provides our machine learning pipelines with a single, unified access point to features for machine learning. Now on the data store side, it consists of two parts. One is the offline feature store, which is a large database or set of databases that holds all of the features used for model training; it’s versioned, and it’s very easy to access. And then the online feature store, which needs to be a lightning-fast production-ready database that our APIs and online models will be able to grab features from at serving time.
Now, the tools we used: we used dbt, which is open source. It stands for data build tool. It’s a way for us to version, save, and reuse our SQL transformations and separate out our features from the raw data. And then also Great Expectations, which is a library for data quality that lets us test our data and make sure it’s up to spec with what we expect.
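To give a feel for the kind of check Great Expectations encodes, here is a minimal hand-rolled sketch of a single expectation. This is illustrative only; it does not use the Great Expectations API, and the column and row data are hypothetical:

```python
def expect_column_values_between(rows, column, low, high):
    """A minimal data-quality check in the spirit of a Great Expectations
    expectation: every value in `column` must fall within [low, high]."""
    failures = [r for r in rows if not (low <= r[column] <= high)]
    return {"success": not failures, "unexpected_count": len(failures)}

# Hypothetical feature rows headed for the offline feature store
rows = [
    {"consumer_id": 1, "visits_last_30d": 4},
    {"consumer_id": 2, "visits_last_30d": 0},
    {"consumer_id": 3, "visits_last_30d": -1},  # bad upstream data
]

result = expect_column_values_between(rows, "visits_last_30d", 0, 100)
```

Running checks like this before features land in the store is what keeps one bad upstream extract from quietly degrading every downstream model.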
And our result here was that we had the standardized access to features from one central place. And you can see here now, instead of the swim lane approach, it’s more of a fan out, with a feature store right at the center. And I like to think of this as “features as a service.” On the right side, you can see our model training. Now instead of having their own features, they pull features from the offline feature store.
And similarly, the online feature store, which is synced up with the offline, it wraps our features in an API, which then any other online machine learning service or API can pull the online features.
So that was the first pain point. The next one was our redundant models across projects, with no sharing mechanism. And the way we solved this was with MLflow’s Model Registry. The Model Registry is just a centralized model store. It allows you to collaboratively manage the full model life cycle, and it provides model lineage, model versioning, and stage transitions. So when we have a model we want to actually deploy to production, through MLflow we can graduate this model into production. And now we can share models from one place. And now if you look at our updated diagram, we have the model registry here on the top right. And there are a lot of arrows pointing to it, really just showing that wherever in our pipeline we need a model, we access it from one place. Our training, when it’s done training a model, can upload it to the registry, and our APIs can pull down that model.
Another thing to note is our model training A. Now there’s just one model training pipeline for that model. And we don’t even think of model training A as being for the first recommender or for the second recommender. It’s just a model. It goes to the registry and whoever needs it in production, it can be the online feature store, it could be a totally different service that we build in the future, can leverage and use that same model.
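The registry behavior described here — versioning, stage transitions, and pulling the latest production model from one place — can be sketched with a toy in-memory registry. This illustrates the concept only; it is not MLflow’s actual API, and the model names and URIs are hypothetical:

```python
class ModelRegistry:
    """Toy registry illustrating versioning and stage transitions,
    in the spirit of MLflow's Model Registry (not its real API)."""

    def __init__(self):
        self._versions = {}  # model name -> list of version records

    def register(self, name, artifact):
        """Add a new version of a model; versions are numbered from 1."""
        versions = self._versions.setdefault(name, [])
        versions.append({"version": len(versions) + 1,
                         "stage": "None", "artifact": artifact})
        return versions[-1]["version"]

    def transition(self, name, version, stage):
        """Move a version into a stage; only one Production version at a time."""
        for v in self._versions[name]:
            if stage == "Production" and v["stage"] == "Production":
                v["stage"] = "Archived"
        self._versions[name][version - 1]["stage"] = stage

    def latest(self, name, stage="Production"):
        """What a serving API would call to pull down the live model."""
        for v in reversed(self._versions[name]):
            if v["stage"] == stage:
                return v["artifact"]
        raise LookupError(f"no {stage} version of {name}")

registry = ModelRegistry()
v1 = registry.register("model_a", "s3://models/a/v1")
registry.transition("model_a", v1, "Production")
```

The point of the pattern is that training pipelines only ever write to the registry and serving code only ever reads from it, so no consumer needs to know which project produced the model.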
And then the final pain point was dealing with our jobs. They were non-standardized and really hard to manage, and really hard to monitor, too. We decided to go with Apache Airflow. It’s an open source tool used for orchestration that allows us to programmatically author, schedule, and monitor our workflows. And they’re just written as pipelines in DAGs. So it’s a visual diagram where you can see the different stages of a process and link them together. So for example, down here, here’s a model training workflow. First it gets fresh data, then it retrains the models, and then it computes metrics.
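That three-stage workflow can be sketched in plain Python. In Airflow, each of these functions would become a task in the DAG with the same linear dependency; the function bodies here are hypothetical stand-ins:

```python
def get_fresh_data():
    # In Airflow this task would pull from the offline feature store
    return [{"consumer_id": 1, "label": 1}, {"consumer_id": 2, "label": 0}]

def retrain_model(data):
    # Stand-in for the heavy training step Airflow hands off to SageMaker;
    # here it just fits a trivial majority-class "model"
    positive_rate = sum(r["label"] for r in data) / len(data)
    return {"kind": "majority_class", "positive_rate": positive_rate}

def compute_metrics(model, data):
    # Final task: score the refreshed model so runs are comparable over time
    predictions = [1 if model["positive_rate"] >= 0.5 else 0 for _ in data]
    accuracy = sum(p == r["label"] for p, r in zip(predictions, data)) / len(data)
    return {"accuracy": accuracy}

# The linear dependency: get_fresh_data >> retrain_model >> compute_metrics
data = get_fresh_data()
model = retrain_model(data)
metrics = compute_metrics(model, data)
```

The value of expressing this in a DAG rather than as a script is that each stage gets its own retries, schedule, and monitoring in one place.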
One thing that I thought was really neat was that, in order for us to still leverage the large compute power of something like an EC2 instance or Amazon SageMaker, what we did was we just deployed our training jobs as Docker images into Amazon ECR. And then with Airflow, when we got to stages where we needed a lot of computing power, such as retraining the models, we would just orchestrate Amazon SageMaker to pull down that Docker image and then run it on a large instance.
So this was super useful because there’s still just one place to go and see how our jobs are doing, if they’re failing, if they’re passing, how often they’re running, and we can update them all from one place. It doesn’t matter which project they’re a part of.
So now let’s look at our updated diagram. And right at the center, we have our orchestration piece, with a bunch of arrows pointing out just to signify that it is managing and orchestrating all the different pieces and jobs. So on the right side, it’s running all of our model training jobs. It’s also pointing to our offline feature stores, so it manages all of our feature transformations. And it also manages the syncing between our offline and our online feature store. And there are other things in here, like model evaluation. We’re even using it to compute our metrics for the success of our recommenders. So we’re able to compute conversion rate based on past history and visits. So it’s very, very useful, one place to go.
And just lastly to look at this diagram, this now is our updated architecture. And the biggest highlight for me is that we shifted from being project focused, where we’re looking in these swim lanes, isolated on a single use case. Now we are platform focused. We’re building out each use case with a larger platform in mind. And so with that, I’ll pass it back to Genna, and she’s going to dive a little deeper into some of our key takeaways.
Genna: Thanks, Brandon. So now the question is, “What did we get out of investing all of this work into incorporating these open source tools and consolidating these engines?” Well, there are three takeaways.
First, we have a better platform for maintaining our services. Instead of multiple services doing the same thing but in slightly different ways, we have a single framework to monitor and maintain.
Second, the feature store has provided us with a centralized, consistent data framework. Before, our data processing pipelines were branching off from the data warehouse, and it was up to each individual engineer to determine how their services defined a specific piece of data, for example, consumer visits. This led to a lot of inconsistencies in our data processing across our different services and also a lot of repetitive work for Brandon, who is our resident data guru. So now the feature store centralizes all of the logic and all of the data transformations so that engineers can reuse each other’s work and our definitions will be consistent across all of our engineers. And this ends up saving Brandon a lot of time so he can focus on other aspects of the MLOps pipeline.
So lastly, designing out this framework and using reusable and scalable components allows us to branch beyond our existing services and reuse code across all of our machine learning projects. So this means that there’s no more reinventing the wheel, and it actually allows our machine learning engineers to focus on machine learning versus rebuilding different software components.
So the key result of this platform rationalization and consolidation was that we were able to dramatically reduce the development life cycle of launching a new recommender service by over 85%. So we’ve gone down from a nine-month release cycle to a one-month release cycle for our services. So not only does this benefit our machine learning team, but from a company perspective, it lowers the barrier for other product teams to work with us so that we can rapidly incorporate machine learning services across Mindbody’s full ecosystem of products. So if you like what you heard or have a passion for health and wellness, our team’s hiring, so feel free to reach out and thank you so much for listening.
Genna Gliner is a machine learning engineer at Mindbody working on building ML products for our customers and consumers. Prior to Mindbody, Genna was a data science consultant at Oliver Wyman who spe...
Brandon Davis is a Machine Learning Engineer at Mindbody. With a background in Software Engineering and API Development, he brings this experience to the MLOps space to lead the ML team's efforts in ...