The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

May 28, 2021 10:30 AM (PT)


Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.


In this session watch:
Elijah Ben Izzy, Data Platform Engineer, Stitch Fix



Elijah: Hello, everybody. My name is Elijah, and I’m going to talk about how we built an abstraction to enable simple ML operations at Stitch Fix. All right, so what’s on the agenda for today? First, I want to talk about Stitch Fix and data science at Stitch Fix. Then, I’m going to talk about the problem we solved, thinking about some common data science workflows and the motivation for making them easier. Next, I want to talk about something at the core of that, which is how we represent the model. Then, I want to talk about the capabilities that that representation of a model unlocks. Finally, a little bit about where we’re going next.
If there’s one thing I want you to take home, it’s that the right abstraction can enable separation of concerns between data science and platform. In this case, it’s the abstraction around how we represent a model that allows data scientists to focus on writing the best possible model, and platforms to focus on giving them the best possible infrastructure.
A little bit about me. I grew up in Berkeley, California, studied applied math and computer science at Brown, then worked at Two Sigma, a quantitative investment company, and now I’m at Stitch Fix, working basically on what I studied: the intersection of software engineering and mathematics, and specifically how to build out software engineering systems that make researchers and data scientists really productive.
What’s Stitch Fix and why do we keep showing up at machine learning conferences? Stitch Fix is a personal styling service. There are two avenues with which you can interact with the product. Either you get five hand-picked items by a professional stylist, you keep what you like and you send back the rest so it’s an iterative process, we have the chance to learn from what you like and what you don’t, or you can shop at your own personal online curated store. Again, check out what you like and we can learn from it once you review it and send back information to us. Either way, it’s an iterative process in which we get to learn from your preferences, and data science is behind every piece of that puzzle.
There’s an algorithms organization, almost 150 data scientists and platform engineers, split across three main verticals and the data platform. These verticals are merchandising and operations algorithms, focused on optimizing the operations around clothing and merchandise; client algorithms, focused on giving the clients the best possible recommendations; and styling and customer experience algorithms, focused on optimizing the styling and stylist experience as well as the customer experience. All of this is built on top of the team of which I am a part, data platform. We build tools to make these guys more productive.
I kind of blew by the explanation of Stitch Fix. There are a lot of interesting things to learn, so if you’re more curious, I highly recommend the Algorithms Tour. It’s algorithmstour,, or just Google Stitch Fix algorithms tour. It’s a great visualization of how data science is woven into the day-to-day of Stitch Fix and how important it really is to the business.
Data science at Stitch Fix. What do the data science organization and the data science day-to-day look like? First, I want to talk about what it’s not like. It’s not like a lot of other organizations. In a typical organization, you might have horizontal teams. These are an engineering team, a data science team, perhaps a data engineering team, each of which is responsible for managing their entire horizontal. You’ll have handoffs between them, so if a data scientist wants to get their model into production, they’ll write a white paper, get some sort of implementation ideas and give that over to engineering, who then takes the baton, writes out some code, tests it out and then works with the data engineering team to get the data they need to run in production. So there are a lot of handoffs between them and a lot of coordination required.
At Stitch Fix, we don’t do it this way at all. We have a very specific way of managing it that’s core to the culture and the way we run data science. At Stitch Fix, we have a single data science organization that handles all of the data science related questions. We explicitly forbid handoffs, so no one person starts a project and then the other person finishes it between functions. Data scientists are empowered to own their projects end to end, from idea to working with the business to implementation and into monitoring and management.
We don’t want to have anything in their way to get their model out into production. We have a ton of them, so we really need to build data platform tools that scale to a wide array of problems and a bunch of different data scientists. It’s all built on top of these data platform tools and abstractions. The goal of data platform is to make all of these data scientists’ lives easier while still allowing them to manage the full stack and to own everything end to end.
So what’s the problem we’re trying to solve, and how does this relate to the organizational structure? Well, the problem comes when you have verticals organized like this. When we say data scientists are full stack, meaning they are able to own everything, get anything they need out into production, that does not mean that the data scientist builds the stack from the ground up. Again, this is why we have data platforms. The goal of our team is to scale without more complex infrastructure that data scientists have to manage and more cognitive burden on these data scientists.
Data scientists should always be full stack as I’ve been repeating, but can we shorten this stack? Can we build a machine learning platform on top of which the data scientists can really easily get their stuff out in production? That’s what we’re going to be talking about.
To figure out what this machine learning platform needs to look like, let’s take a look at some data science workflows. First, you might think of a data scientist training their model. They write an ETL or run it on a batch job, then save some model artifact on S3 and then copy it into a production environment so they have it available for the next step of that process, which is inference. Inference could take a few different forms. This is just a sampling of the types of inference that data scientists do. There’s writing the model and running it in a microservice, so serving predictions in real time; running predictions in batch, for instance, getting the recommendations for all clients for analysis; and then streaming predictions, so running based off of streaming features coming in on a Kafka topic and making predictions on those.
And then you got ad hoc, or after-the-fact, analysis. This could involve tracking metrics on those models, evaluating their performance, comparing them with other models and sharing it with other teams, understanding how your models relate to other teams and how you can work together better. We wanted to enable all of this. There were a few different ways and platforms that I showed to do the inference side, but not much work had been done on the analysis side that was common to all of data science.
How can we optimize this workflow, and what is the actual problem we’re trying to solve? Well, the goal is to build an abstraction to give data scientists all of these capabilities, those black boxes that we just showed, for free. The caveat is we’ve got largely uniform workflows, so they all follow that same path of training and analysis, or training and inference and then analysis, but the technologies they use are independent from each other.
We don’t want to tell a data scientist whether they have to use TensorFlow, scikit-learn, [pysen], PyTorch. They know best and we trust them to do that. Even though they all write in Python, they can use whatever technologies they want on top of that. And some don’t even write in Python. The question is, what do we put in the middle of this system to make it so they don’t actually have to write code to get any of these other capabilities, so that they get them with a little config and the press of a button?
All right, I’m just going to tell you what we did, sort of set the scene for the rest of the talk. But the first question is, do we build or do we buy? Hats off to MLflow, TensorFlow Extended, ModelDB. These are all great options, and we’ve actually talked with a few of the creators and really dug into these products to see if they would work for us. Really awesome stuff, and I’m very excited about the direction the industry’s going. However, we decided to build our own.
Why is that, you might ask? Well, it gave us seamless integration with current infrastructure that we had, which gave us leverage. One of the hard parts about building these systems is building all of the components. We had a lot of great components. As I mentioned, the Model Lifecycle team, my team, is not the only one in the data platforms team. Each of the pieces of the platform is available and reliable and [inaudible 00:09:18], so we got a lot of leverage out of reusing those rather than pulling in something from the external world and having everybody relearn.
The model tracking and management data model that we use isn’t quite standard. We have a lot of different segments and varying ways to slice and dice the models. We found that open-source and commodity-based options were a little more opinionated than we would like about how to organize models. I’ll get to what that means in a bit and how we built around that.
How do you custom build while [inaudible] pivoting is necessary? We invested in interface design heavily to allow for plug and play with open-source options. We can sort of quickly iterate, figure out what users need, change our direction [inaudible]. But once we figure out something that really works, we can take out some subcomponent of our system and replace it with an open-source option.
We called what we built the Model Envelope. The data scientist sticks the model in an envelope, gives it to us and then we can do anything we want with it. We have all the necessary information.
It looks something like this. The data scientist only writes the training script, and the rest is configuration driven. Here’s an example of an ETL, sort of simplified, but it gives a good sense. You can see the data scientist might load some data, vary their features, train their model, and then at the end of it, they tack on a piece that calls into our APIs. In this case, they save a model and log metrics. We’ll go over what all of this means in a little bit.
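A minimal sketch of what the tail end of such a training script could look like. The transcript does not show the real client API, so `InMemoryRegistry`, `Envelope`, `save_model`, and `log_metrics` are hypothetical names invented for illustration, and the "model" is a toy dictionary rather than anything a data scientist would actually train.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Hypothetical stand-in for one saved model envelope."""
    artifact: bytes
    tags: dict
    metrics: dict = field(default_factory=dict)


class InMemoryRegistry:
    """Hypothetical registry; the real one is a platform service."""

    def __init__(self):
        self.envelopes = []

    def save_model(self, model, tags):
        # The platform, not the data scientist, handles serialization.
        envelope = Envelope(artifact=pickle.dumps(model), tags=dict(tags))
        self.envelopes.append(envelope)
        return envelope

    def log_metrics(self, envelope, **metrics):
        envelope.metrics.update(metrics)


# --- the data scientist's training script ---
def train(rows):
    # Toy "model": remembers the mean of the training labels.
    mean = sum(label for _, label in rows) / len(rows)
    return {"kind": "mean_predictor", "mean": mean}


registry = InMemoryRegistry()
model = train([(1.0, 2.0), (2.0, 4.0)])
envelope = registry.save_model(
    model, tags={"business_line": "womens", "env": "staging"}
)
registry.log_metrics(envelope, validation_loss=0.25)
```

The point of the shape is that the last three lines are the only platform-facing code in the script; everything above them is ordinary training logic.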
Once they do this, once they save their model, they then send their model in an envelope over to the model envelope registry, where we can give them all these capabilities. We’re going to talk next about how we represent models in the model envelope registry and what all of these… [Inaudible] use that representation to unlock all of these capabilities.
So first, how we represent a model. All right, I want to compare it to writing a recipe, particularly to think about what information we need to write down a recipe in a way that anybody can use it and cook what you want. I’m going to put… It’s three parts. The first is the instructions. This is how to actually cook it and what to do as you’re going along in your meal preparation. Next is the cookware. These are all the extra bits and pieces that you need the right ones of to make this work: your oven, your casserole pan, et cetera. And the ingredients. Without the right ingredients, your recipe’s going to be garbage. You need a way to write it down and communicate it with the readers and cookers of your recipe.
And now, onto representing a model and how does this compare. Well, the function, this is what the model does. That corresponds to the instructions. The context, or the cookware, this is where and how to run a model. The data is the data the model needs to run on so that it runs effectively and so that we can get good results. We’re going to talk about what each one of these means and how this enables that separation of concerns that I talk about so the data scientist can focus on giving us the best function, the best context and the best data, and we can do the rest for them in a way that’s self-service to avoid handoffs.
So the function, what is this? This is the artifact and the shape. The artifact is a serialized model. This is bytes including state, so any sort of state or anything that’s important for the model to run, and metadata around that serialization.
What does this look like? Well, the data scientist passes an object and the platform serializes it. The platform derives metadata around that serialization. This is where a separation of concerns starts to come in. The data scientist actually just creates a Python object and we serialize it for them. They don’t have to think about how to serialize it efficiently, how to deserialize it. All they do is pass an object and then we’ve got it.
And the shape. This is the function inputs and the function outputs. So what does this look like? Well, the data scientist can pass us a sample dataframe or they can specify type annotations for a function. In this case, I’m showing the sample dataframe, but it’s similar with type annotations. You can see they pass an API input and API output.
We can take these and, from the shape of these dataframes, derive what their function needs to run, so the shape of the inputs. We then serialize that, represented in a custom format. All they actually do is worry about getting us some data and specifying the shape of their function, and we can run it. We can store it and use it later for them.
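As a rough illustration of "pass an object, get bytes plus metadata back", here is a sketch using only the standard library. The choice of `pickle` and the particular metadata fields are assumptions; the talk does not say which serializer or which metadata the platform actually uses.

```python
import hashlib
import pickle
import sys


def serialize_with_metadata(model):
    """Serialize a Python object and derive metadata about the serialization."""
    payload = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
    metadata = {
        "format": "pickle",
        "pickle_protocol": pickle.HIGHEST_PROTOCOL,
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    return payload, metadata


model = {"weights": [0.5, -1.25], "bias": 0.1}  # toy stand-in for a model
payload, metadata = serialize_with_metadata(model)
restored = pickle.loads(payload)
```

Recording the protocol and interpreter version alongside the bytes is what lets a later process decide how to load the artifact without asking the author.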
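The type-annotation route mentioned above can be sketched with the standard library alone. `derive_shape` and the example `predict` function are hypothetical, and the output dictionary is just one plausible serializable format, not the custom format the talk refers to.

```python
import inspect
from typing import get_type_hints


def derive_shape(func):
    """Describe a function's inputs and output from its type annotations.

    Every parameter and the return value must be annotated.
    """
    hints = get_type_hints(func)
    params = inspect.signature(func).parameters
    inputs = {name: hints[name].__name__ for name in params}
    return {"inputs": inputs, "output": hints["return"].__name__}


def predict(client_age: int, avg_spend: float) -> float:
    return 0.01 * client_age + 0.5 * avg_spend


shape = derive_shape(predict)
```

Because the result is plain strings and dictionaries, it can be stored next to the artifact and read back by services that never import the model's code.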
And next, the context. This is the environment of the model and the index, how to look it up. The environment is composed of a few things. This would be installed packages, so what libraries it needs to run; custom code, anything that isn’t in a published library but is code it needs to run; and the language and version, what version of Python, or other languages if you want, this model needs to run on.
So what does this look like? Well, in this case, we do a lot of automatic deriving for them. With installed packages, we can actually freeze the entire environment, or they can pass pointers if they have a very special case. For instance, if their environment is super cluttered, they can say, “We just need these packages,” and then we freeze those packages and all their dependencies. They can pass any custom code needed, just the pointers to the code or to the modules, and we can serialize it, zip it up and make it available for them later. And the language and version, that’s automatically derived. We never have to tune that, the goal being that their training environment should match their production environment almost exactly.
That’s, again, where separation of concerns comes in. They are focused on making the best possible training environment, using the right libraries for their model, and we as platform are focused on making that environment available for them in production and in inference. We’ll show you how that works in a bit.
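To make the "freeze the environment" idea concrete, here is a small sketch built on `importlib.metadata`. `capture_environment` is a hypothetical helper, and unlike the system described it does not resolve the transitive dependencies of the packages named in `only`.

```python
import platform
from importlib import metadata


def capture_environment(only=None):
    """Record installed package versions plus the interpreter version.

    `only` optionally restricts the freeze to a set of lowercase package names.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and (only is None or name.lower() in only):
            packages[name] = dist.version
    return {
        "language": "python",
        "version": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


env = capture_environment()  # freeze everything that is installed
```

A record like this is enough input to rebuild a matching image later, which is the property the talk is after when it says training and production should match almost exactly.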
And the index. This is a set of key-value tags, both of which are strings, which form the spine of this envelope registry. It helps look up models. Again, what does this look like? Well, platform derives a set of base tags, and some are required. Every model has instance name, instance description to help you understand what the model does. We also derive username and team name from the environment because we have insight into that in our Stitch Fix workflow system.
The data scientist can also pass as many custom tags as they want. Those are used to organize the models, to slice and dice the data. If they want to say, “This model is applicable for the women’s business line in the US region, and this one’s applicable for the men’s business line,” and then they have a staging and prod [inaudible], these tags can help them identify that. These come in handy later.
And then the data. This is the training data and the metrics. The training data is comprised of features and summary statistics on these features. What this looks like is that the data scientist can optionally pass a spec that specifies where the features for their model come from. And, as they pass data in to specify their API, we use that data again to derive summary statistics on it so we can understand what the shapes are, what features this model was trained on and the boundaries of them, as well as some different summary statistics.
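Since the tags are string key-value pairs, looking models up by tag reduces to a subset match. The registry contents and the `matches` helper below are made up for illustration.

```python
def matches(tags, query):
    """True when every key/value pair in the query appears in the model's tags."""
    return all(tags.get(key) == value for key, value in query.items())


# Hypothetical registry entries: each model carries a flat dict of string tags.
registry = [
    {"id": 1, "tags": {"business_line": "womens", "region": "US", "env": "staging"}},
    {"id": 2, "tags": {"business_line": "womens", "region": "US", "env": "prod"}},
    {"id": 3, "tags": {"business_line": "mens", "region": "US", "env": "prod"}},
]

query = {"business_line": "womens", "env": "prod"}
prod_womens = [m["id"] for m in registry if matches(m["tags"], query)]
```

The same match can be reused for picking which model to deploy, which model to run in batch, and which models to plot together.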
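A sketch of deriving per-feature summary statistics from the sample data. The choice of min, max, mean, and standard deviation is an assumption; the talk only says "summary statistics" and "boundaries".

```python
import statistics


def summarize(rows):
    """Compute per-feature summary statistics from a list of feature dicts."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        summary[name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),  # sample standard deviation
        }
    return summary


sample = [
    {"age": 30, "avg_spend": 50.0},
    {"age": 40, "avg_spend": 70.0},
    {"age": 50, "avg_spend": 90.0},
]
stats = summarize(sample)
```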
All right. And metrics. These take two forms, scalars and things I call fancy metrics, which really are just things that are not scalars. Scalars might be your validation loss, number of iterations, whatever a data scientist wants. Fancy metrics have a bigger shape. These could be ROC curves, learning curves, who knows what. It’s kind of open and extensible.
What this looks like is that a data scientist logs metrics using a platform metrics schema library that we’ve built. They can log a scalar. They can log a fancy metric. We have these validators that ensure that the metrics are the right shape. So again, data scientists only have to worry about logging the best metrics to the platform, and platform can help them analyze it and view it and store it so they don’t even have to think about it.
I’ve talked to you about how we represent the model, so why are we doing it this way and what capabilities does it unlock? Let’s go over a few. First, I want to talk about online inference. The approach that we take is to generate and automatically deploy microservices for a model’s predictions. So what does this look like? Well, the data scientist generates and tests out a service locally. We have a tool to let them run it locally before they put it into production and [inaudible] around, get some sample results. Then, they set up an automatic deployment rule. This is all actually based on tags. They say when a new model is published that matches this tag, deploy it to this service under this namespace. They publish the model and then wait for that rule to kick in.
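The validator idea can be sketched as follows. The function names and the representation of a curve as a list of (x, y) pairs are assumptions, not the actual metrics schema library.

```python
from numbers import Real


def validate_scalar(value):
    """A scalar metric is a single real number (bools are rejected)."""
    if isinstance(value, bool) or not isinstance(value, Real):
        raise ValueError("scalar metric must be a real number")
    return float(value)


def validate_curve(points):
    """A 'fancy' metric, sketched here as a non-empty list of (x, y) pairs."""
    if not points:
        raise ValueError("curve must contain at least one point")
    return [(validate_scalar(x), validate_scalar(y)) for x, y in points]


logged = {
    "validation_loss": validate_scalar(0.31),
    "roc_curve": validate_curve([(0.0, 0.0), (0.2, 0.7), (1.0, 1.0)]),
}
```

Validating at logging time means a malformed metric fails in the training script, where the author can see it, rather than later in a dashboard.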
Platform runs a cron job. This is a CD system to determine which model should be deployed. Then, they generate code to run model microservices. We actually have a whole code generation system that can generate code with a specified [inaudible] they’ve given us to serve a model’s predictions. And then we deploy that model with their config to AWS so they can have it across our whole BPM. Finally, we take responsibility for monitoring and managing the model infrastructure. We found that if… This is a system that we build. This is code that we’ve written, and we get a lot of leverage out of monitoring and managing it. We actually have hundreds of services that platform monitors, and we control what they look like. We’ve built them in a nice, clean way so they’re really not that much work.
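One pass of such a rule-driven deployment check might look like the sketch below. The rule and model record layouts, the `published_at` field, and the service name are all hypothetical; the real CD system's data model is not described in the talk.

```python
def select_deployments(rules, models):
    """For each rule, pick the newest published model whose tags match."""
    plan = {}
    for rule in rules:
        candidates = [
            m for m in models
            if all(m["tags"].get(k) == v for k, v in rule["tag_query"].items())
        ]
        if candidates:
            newest = max(candidates, key=lambda m: m["published_at"])
            plan[rule["service"]] = newest["id"]
    return plan


rules = [{"service": "recs-prod", "tag_query": {"env": "prod", "region": "US"}}]
models = [
    {"id": "m1", "published_at": 1, "tags": {"env": "prod", "region": "US"}},
    {"id": "m2", "published_at": 2, "tags": {"env": "staging", "region": "US"}},
    {"id": "m3", "published_at": 3, "tags": {"env": "prod", "region": "US"}},
]
plan = select_deployments(rules, models)
```

Running this on a schedule and comparing the plan with what is currently deployed is one simple way to get deployment without anyone calling a deploy command.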
Okay, so how do we use each piece of this way of writing a recipe? How do we use the function, the context and the data? Let’s go over each one of these.
The function. This is the serialized artifact loaded on service instantiation and called during endpoints. We can load up their artifact when the service starts and call to it with the data passed in, in the endpoints. And then we use this function shape to create an OpenAPI spec and validate the inputs to push errors upstream.
You can see this is actually a generated service. This is the OpenAPI spec. You might be familiar with this UI. There’s a query function that takes in the features from the model that we’ve derived earlier and can validate. If they give the wrong feature, we’ll get a 422. Again, separation of concerns. All they have to do is think about giving us the right function and communicating the shape to us in one of the ways we’ve given them, and we can generate the best service possible for them.
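The input validation described here can be approximated without a web framework. `EXPECTED_INPUTS` and `validate_request` are illustrative names, and the 422 status mirrors the one mentioned in the talk.

```python
# Hypothetical function shape, as it might be derived at training time.
EXPECTED_INPUTS = {"client_age": int, "avg_spend": float}


def validate_request(payload):
    """Return (status_code, errors) for a JSON-like request body."""
    errors = []
    for name, expected in EXPECTED_INPUTS.items():
        if name not in payload:
            errors.append(f"missing feature: {name}")
        elif not isinstance(payload[name], expected):
            errors.append(f"{name} must be {expected.__name__}")
    for name in payload:
        if name not in EXPECTED_INPUTS:
            errors.append(f"unexpected feature: {name}")
    return (422, errors) if errors else (200, [])


ok = validate_request({"client_age": 34, "avg_spend": 81.5})
bad = validate_request({"client_age": "thirty-four"})
```

Rejecting the request before the model is called is what pushes the error upstream to the caller instead of surfacing it as a failure inside the model code.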
Next is the context. As I said earlier, the tags spec is used to automatically deploy whenever a new model is published. Note, the user never actually has to call deploy. It’s all done through system-managed CD, continuous deployment. They never have to worry about when to get their model [inaudible]. All they do is save it and perhaps do some messing around with the tags, and then we get it out to production for them.
We use the stored package versions to build Docker images. The custom code is made accessible to those Docker images for deserialization and the execution of the model. You can see the Model Envelope and then that JSON config go to the CD system. The Docker image is built and that’s all shipped onto ECS. So their model is out, and the microservice is serving production results.
And finally, the data. We can use the summary statistics to validate and monitor input to see data drift, to see if their data is different between training and production, which is really valuable to know. This feature pointer we can use to load the feature data.
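One simple way stored training statistics could be compared against live inputs is a mean-shift check. The rule used here (live mean more than three training standard deviations from the training mean) is an assumption chosen for illustration, not the platform's actual drift test.

```python
def drift_flags(training_stats, live_rows, tolerance=3.0):
    """Flag features whose live mean is more than `tolerance` training
    standard deviations away from the training mean."""
    flags = {}
    for name, stats in training_stats.items():
        live_mean = sum(row[name] for row in live_rows) / len(live_rows)
        distance = abs(live_mean - stats["mean"]) / stats["stdev"]
        flags[name] = distance > tolerance
    return flags


training_stats = {
    "age": {"mean": 40.0, "stdev": 10.0},
    "avg_spend": {"mean": 70.0, "stdev": 20.0},
}
live_rows = [{"age": 42, "avg_spend": 150.0}, {"age": 38, "avg_spend": 170.0}]
flags = drift_flags(training_stats, live_rows)
```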
You can see there’s now a query ID endpoint in which they can pass a client, let’s say, and then we load up all the features for that client and pass it into the model and call it for them. Again, that really makes it so they have to worry about less. All they have to worry about is giving us the right features that they use, and we then pull data from the feature store and run it on their model in the most efficient way. Again, separation of concerns.
So the next capability we want to unlock is batch inference. Our approach is to generate a batch job in the Stitch Fix workflow system, which is built on top of Airflow and Flotilla, Airflow being an open-source workflow orchestration system, and Flotilla being the Stitch Fix job execution environment for running jobs on the [inaudible].
So what does that look like? Data scientists create a config for a batch job, either run locally or run on Spark. There are two modes that they can use. Then, they give us a tag query to choose the model that’s part of the config, saying, “We want to always run the latest model that matches this set of tags,” and they give us input and output tables, what data to run the model on and where to save the output to, all within our data warehouse. And then, they execute it as part of an ETL. Again, they’re full service. They… Sorry. They’re full stack. They manage their ETLs, so we give them this piece to run as part of it. Platform spins up a Spark cluster if specified, if they want to use the Spark mode, loads the input data and optionally joins with features if they’ve specified a feature table and a feature source, executes the model’s predict function over the input and saves that to the output table.
How do we use each piece of this? How do we use the function, the context and the data, or all the pieces of the recipe? Well, again, the serialized artifact is loaded on batch job start. We load it up, deserialize it for them. The function shape is used to validate against the inputs and the outputs. This is where we’re using Hive as part of the data warehouse. We can validate the schema of Hive against their model shape to make sure their model will work before they even run any code. And, we use the Spark function mapPartitions as well as PyArrow, the tool for data serialization, to run models that take in pandas DataFrames, a common pattern [inaudible], efficiently on Spark.
I mention this to you because this really gets back to that separation of concerns piece. They give us a model and we can figure out how to run it efficiently in batch, which can be kind of confusing and kind of complicated.
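The partition-wise pattern can be shown without a cluster. The sketch below writes an iterator-in, iterator-out function of the kind that PySpark's `RDD.mapPartitions` accepts, then simulates two partitions with plain lists. The toy model and the batch size are assumptions, and the PyArrow and pandas conversion the talk mentions is left out.

```python
from itertools import islice


def model_predict(batch):
    # Toy stand-in for a model that scores a whole batch of rows at once.
    return [2.0 * row["x"] for row in batch]


def predict_partition(rows, batch_size=2):
    """Consume an iterator of rows and yield predictions batch by batch.

    With Spark available, a function of this shape could be passed to
    rdd.mapPartitions; here it is only exercised locally.
    """
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield from model_predict(batch)


# Simulate two partitions locally, without a Spark cluster.
partitions = [[{"x": 1.0}, {"x": 2.0}, {"x": 3.0}], [{"x": 4.0}]]
predictions = [list(predict_partition(p)) for p in partitions]
```

Calling the model once per batch rather than once per row is where the efficiency comes from, since most Python models have a high fixed cost per call.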
And now, how do we use the context? Well, again, we build a Docker image with the frozen package and language versions used to install dependencies, and make their custom code available so their model can be deserialized with the exact same mode on which it was trained. We then use the tags to determine which model to run, as we talked about earlier.
And finally, the data. We can use the feature pointer to load feature data if they want to give us a table of IDs instead of a table of features. If they want to give us a table of clients and do some prediction for those clients, we can join that with the feature table they specified for us, or the feature source. And then every time they run as part of our batch job, we store a pointer to the evaluation table, the output, in the registry so they can see every run they’ve ever had and analyze the data. So we run the Model Envelope on the union of the feature IDs and the feature store, and then save that output in our registry. It’s something they don’t even have to worry about. They might not know that it’s available, but then when they look back and debug, we have the output saved for them.
So, okay. We’ve published our recipe. We want to get reviews of it. We want to see how it tastes, what the chef thinks of it, and that’s tracking metrics. So how do we do this? Well, the approach is to allow for metrics tracking with tag-based querying. Again, we use these tags as the model spine. It’s useful for querying and running models, as well as analyzing metrics.
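The ID-to-feature join and the run bookkeeping can be sketched with dictionaries standing in for warehouse tables. `run_batch`, the `run_log` list, and the output table name are hypothetical.

```python
def run_batch(ids, feature_table, predict, output_table, run_log):
    """Join ids to features, predict, and record where the output went."""
    results = {}
    for client_id in ids:
        features = feature_table.get(client_id)
        if features is not None:  # inner join: skip ids with no features
            results[client_id] = predict(features)
    # Keep a pointer to the output so past runs can be inspected later.
    run_log.append({"output_table": output_table, "row_count": len(results)})
    return results


feature_table = {101: {"avg_spend": 80.0}, 102: {"avg_spend": 40.0}}
run_log = []
results = run_batch(
    ids=[101, 102, 103],
    feature_table=feature_table,
    predict=lambda f: f["avg_spend"] > 50.0,
    output_table="analytics.recs_scores",  # hypothetical table name
    run_log=run_log,
)
```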
What do we do? The data scientist logs metrics using the Python client, as I showed you before, and explores those in a tool we built called the Model Operations Dashboard. Then, they can save a URL from the Model Operations Dashboard for their favorite visualization and share it with their friends or other data scientists.
Platform has built and manages this dashboard, and we add fancy new metric types to those metrics schema libraries. If a data scientist wants to visualize something in a really cool way, we can build it for them and it’s pretty easy. It’s a nice extensible system.
All right. I want to show you some pictures because I think this is really pretty and I’m really proud of it. Here is what a time series analysis of models might look like. Each one of these dots is a different model, and it’s sort of split across different tags. You’re seeing staging versus prod and a few different metrics. They can see how their models change over time. This is obviously a more contrived example. It’s a nice way to demonstrate what the capabilities of the system are.
Then, they can view their models in a scatter plot. You can view the metrics in the scatter plot. Each one of these dots is a model and a metric. They can compare two metrics and see how they vary against each other. This is really useful for hyperparameter tuning or understanding the relationship between hyperparameters and other metrics.
And finally, they can save these fancy metrics, CDFs, PDFs, time series, ROC curves, learning curves. Whatever they want, they can save it and visualize it in our system, see how across the last few models these different ones have varied. Happy to answer more questions about this later. There’s a lot of stuff I just whizzed by on.
In summation. What was the value added by this platform, specifically by separating concerns? What are data scientists concerned with and what is platform concerned with? Data scientists are concerned with creating the best model, with ensuring that they have the model that solves the business problem in the most optimal way. Choosing the best libraries. Again, this is something that we trust them to do. If they want a version of PyTorch, a version of pysen, whatever it is, they can choose the best one and understand all the trade-offs. We don’t have to worry about that. And determining the right metrics to log, how to understand whether the model is actually performing well in production, whether the model was actually performing well and getting the right insight into that.
Platform. Platform is concerned with making all these deployment contexts really, really easy so they don’t have to worry about it. They can just press a button. Ensuring that the environment in prod matches the environment in training so they don’t get any surprises when they ship their model out into production. Providing really easy and shareable metrics analysis, and wrapping up complex systems. They can scale without having to get a more and more complex system that they have to manage.
Finally, we’re concerned with giving them behind-the-scenes best practices. I didn’t really mention this, but I think it’s a really cool piece of the system where if we find the best way to write a microservice or a really cool, new way to run mapPartitions that works with PyArrow and makes everything more efficient, we can do that and they won’t even know. They might notice that their stuff gets more efficient, but it’s really, again, not their concern. The data scientist focuses on creating the best model, on writing the recipe, and platform focuses on building the optimal self-service infrastructure for them, which is cooking it. Again, it’s self service, so there really aren’t handoffs. They manage it the whole way, but it taps into infrastructure that we build to make it really easy and really straightforward for them to do that.
So what do we want to do next? A few ideas. We want to go through some more advanced use of the data. So I talked a bit about how we use the training data statistics to have visibility into prod/training drift. This is really more of a prototype in our system, and we’d like to invest more in it and understand some of the use cases and how this can help data scientists manage their models.
We want to build out more deployment contexts. I mentioned that people want to predict on streaming and Kafka topics. That currently isn’t a capability, although we’ve prototyped it. There’s been some apps [inaudible], so we’d like to be able to build that out and give that as an offering to data scientists.
We want to build more sophisticated feature tracking and integration. Feature stores are all the rage. We’re starting to dig in and think about what featurization looks like at Stitch Fix. We have some capabilities that I showed to do this, but it’s not as sophisticated as we would like, so we’re focusing more on building that out.
We want to look at Lambda-like architecture. As I showed you, currently we have a service that we deploy and that takes a while. We’d actually like to have a system that we query for any model’s predictions and get it back instantly. That’d be awesome, but it requires potentially more unified environments. We might have to have some trade-offs with the data scientists here. We’re thinking through options here. I think it’s a really interesting problem.
And finally, we want to be able to attach external capabilities to replace home-built components of our own system. As I told you earlier, we invested heavily in interface design and ensuring that the subcomponents each have very defined responsibilities. If we have something from the outside world that we like, or if we want to give something that we built out to the outside world, that’s entirely possible. If you have any ideas or are interested in anything here, feel free to ask or reach out.
Thank you so much. If you have any questions, feel free to ask me now or reach out later. Really appreciate it. Have a great day.

Elijah Ben Izzy

Elijah has always enjoyed working at the intersection of math and engineering. More recently, he has focused his career on building tools to make data scientists more productive. At Two Sigma, he was ...