The ‘feature store’ is an emerging concept in data architecture that is motivated by the challenge of productionizing ML applications. The rapid iteration in experimental, data-driven research applications creates new challenges for data management and application deployment. These challenges are complicated by production ML pipelines with interdependent modeling and featurization stages. Large tech companies have published popular reference architectures for ‘feature stores’ that address some of these challenges, and an active open source ecosystem provides a full workbench of power tools. Still, the abstract role of the feature store can be a barrier to implementation. We demonstrate an implementation of a feature store as an orchestration engine for a mesh of ML pipeline stages using Spark and MLflow. This is broader than the role of a metadata repository for feature discovery. The metadata in a feature store allows us to break the unit of deployment down to the level of the ML pipeline stage so that we can break the anti-pattern of ‘clone and own’ ML pipelines. We isolate concerns of pipeline orchestration and provide tooling for deployment management, A/B testing, discovery, telemetry and governance. We provide novel algorithms for pipeline stage orchestration, data models for feature stage metadata, and concrete system designs you can use to create a similar feature store using open source tools.
– Hello Spark Summit, my name is Nate. I’m a data architecture consultant and I focus on the area of industrialized machine learning. That’s just the term my employer uses to refer to the work we do in machine learning that is focused not just on the data science, but also on taking the research outputs of that data science and putting them into production. Having the opportunity to work at this intersection of data architecture and machine learning, I’ve had a lot of exposure to this emerging concept of the feature store, so that’s what I’d like to talk about today.
I’d like to provide an overview of this space of feature stores that are becoming more popular, describe an approach to implementing feature stores which I use in my day-to-day work and which I feel might be a little underrepresented, and then at the end provide a demo of how you might take some of these concepts and put them into production for yourself. Before we dive into what makes some of these approaches to feature stores different, let’s start with what makes them the same. What unifies all these approaches is that there’s a common problem we’re trying to solve. It’s a challenge that I’ve had to address on every data science team I’ve worked on: there’s a fundamental tension between the scientific, experimental nature of the work we’re doing and the fact that at every one of these iterations we have to be creating business value. So what we need is the ability to tell a more complete story about the business value that we’re creating with data science at every one of these iterations. One way we can do that is, at the outset of every iteration of every experiment, to be really clear about the business hypothesis we’re setting out to validate or invalidate in that experiment. When we validate our hypothesis we create lift in production, we advance some KPI, and we’ve created value that way.
But even if we invalidate our business hypothesis, we’ve still created some business insight from that: we’ve learned something. What’s critical is that I’m not learning this just as an individual, but capturing that insight in a repository to accelerate future research, and one of the repositories we can use to capture those insights is the feature store. In other words, when we’re using a feature store, every experiment we run is accelerating future research. Now, one of the original intentions of this talk was to do a survey of a lot of the material that has already been put out on this topic. In doing that research I identified a relatively new resource that has done a lot of that legwork for us. So a big thank you to featurestore.org; it means today we can focus not so much on surveying everything that’s been put out there, but on how these approaches are distinct, and we can dive deep into one of them. Do your own research, you’re not going to find every commercial product on this site, but it’s an excellent resource, so thank you to the guys at featurestore.org. There are a lot of ways to break down these different approaches; the one I’ve chosen is to identify what’s being automated in each approach: as an engineer, what effort do I no longer have to do when I’m deploying a machine learning application? By far the most common feature store implementation is one where we’re automating the delivery of the data to the scientist or to the production application. The most common question that comes up with this approach is: how is this different from my existing data management and data governance framework? We already have a data catalog, I already have strategies for managing my ETL runtimes, so what’s new here? The way I answer that is that when we’re working with machine learning applications, we want to extend that governance framework to capture some of the nuances and edge cases of our machine learning applications. So here’s an example of some of the semantics we might need to extend our data governance framework with. If you’re not doing machine learning you’re probably not dealing with training and test data, but here’s why you might want your governance framework to be aware of these concepts in the context of machine learning. Here’s a typical example of what we might be doing when we’re deploying a machine learning application. Say I have some sales data; one of the first things I might want to do with that sales data is create some customer segments. Depending on the model I’m creating, I’m probably going to break my data down into a train set and a test set. I do some machine learning that produces a model, the model makes predictions, and then, crucially, those predictions are themselves features that can be used to accelerate future research. So this seems like the ideal scenario for a feature store: I’m creating a business insight and using it to accelerate future research. In this case, that future research might involve building a next-best-action model.
A next-best-action model is clearly something that could benefit from a customer segmentation. So similarly, I’m going to join the customer segmentation to some other feature data from my source data, and I’m going to do a train/test split. This seems like the ideal scenario for a feature store, but what has gone wrong here? If we were to build this model, what we’ll probably find is that the model performs really well on our test data, so we would continue to tune our model and continue to perform really well on our test data, and then when we go to deploy our model, we would find that our performance on our test data is not a good signal of how our model is going to generalize to unseen data. The fundamental reason we want to create these training and test sets is to give us that signal so that we know if we’re overfitting our model, and what we might find in this system is that our test data is no longer providing that signal. The reason is that information about our test data has snuck into our training process through the segmentation feature. So this is an example of why you might want to extend your data governance framework to be aware of these training and test semantics, or other machine learning semantics. In theory we want to deliver our data to our scientists or to our production applications in a way that helps us avoid this situation.
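To make the fix concrete, here is a minimal sketch, assuming a KMeans-based customer segmentation over an illustrative sales_df DataFrame (the DataFrame and column names are assumptions, not the example from the talk): the segmentation model is fit on the training split only, so information about test rows never leaks into downstream training.

from pyspark.ml.clustering import KMeans

# sales_df and the column names below are illustrative assumptions.
# Fit the segmentation on the training split only, then apply it to both
# splits, so no information about test rows leaks into downstream training.
train_df, test_df = sales_df.randomSplit([0.8, 0.2], seed=42)

segmenter = KMeans(featuresCol="sales_features", predictionCol="segment")
segment_model = segmenter.fit(train_df)

train_with_segments = segment_model.transform(train_df)
test_with_segments = segment_model.transform(test_df)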
A second approach to the management of features: there are a number of approaches out there which describe how we automate not the way we deliver the data, but how we do the ETL that’s involved in feature engineering. A number of scientists have observed that a lot of the ETL we do during feature engineering feels repetitive, it feels like something that can be automated: window my data based on these dimensions, pivot my data based on those dimensions, create a whole bunch of features, feed that into a feature selection algorithm, and that could be a more automated method of feature engineering. So this is another class of approaches to feature management that’s out there. Where I’d like to spend the remainder of this talk, diving more deeply, is on the automation of how we construct our machine learning pipelines. One example of automation in constructing our ML pipelines is the methodologies behind AutoML, right, and these are methodologies that target the citizen data scientist: scientists who may not have a depth of knowledge in algorithmic design but can still benefit from machine learning. What I’d like to share today is an approach to automating the construction of our ML pipelines in a way that targets the folks who are doing that algorithmic design, a way that lets us expose an API to them which is not doing the algorithmic design for them, but instead isolating the operational concerns of wiring the pipeline together and deploying that pipeline, so that they can focus on algorithm design.
So for the uninitiated, what are we talking about?
What do I mean when I say ML pipeline? Some of us may have a vision of how machine learning works, a common vision which focuses on the very last step, where we have an estimator that takes some featurized data, our final featurized dataset, and then does something like a logistic regression or a decision tree to make a prediction. In fact, in an industrialized context this is typically the last stage of a whole pipeline of estimators, each estimator creating features to feed downstream to the pipeline. An example of that is the one provided earlier: as a first step of creating the next-best-action model I might use a customer segmentation. All right, so these multiple estimators are assembled into a pipeline, and that full pipeline is what gets fit to the data. What we’ve found when putting these types of applications into production is that it’s crucial that we manage not just the model that comes out of that pipeline when we fit it to our data, but the pipeline itself. We need the same level of insight and control into the pipeline as we need into our models. And here’s what that looks like in code. This is from the Spark documentation, just the quick start documentation for what a pipeline looks like. We define some of the stages of the pipeline: in this case we’re vectorizing our inputs using a hashing estimator, and that then feeds into a logistic regression. All of this gets wired into a pipeline, the pipeline is what gets fit to our training data, that outputs a model, and the model makes predictions which we evaluate for model quality. The basic insight behind the approach I’m going to describe in the last few minutes here comes down to this question: why does this line of code exist?
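For reference, here is a minimal sketch along the lines of the Spark ML Pipeline quick start (the column names, parameters and the training and test DataFrames follow that example rather than the exact code on the slide); the line in question is the explicit Pipeline(stages=[...]) wiring.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# `training` and `test` are assumed to be Spark DataFrames with "text" and
# "label" columns, as in the quick start example.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

# The line in question: explicitly wiring the stages together.
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the whole pipeline to the training data; the result is a PipelineModel.
model = pipeline.fit(training)

# The fitted model makes predictions, which we evaluate for model quality.
predictions = model.transform(test)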
Is it here just for syntactic purposes? What I’d like to convince you of today is that in fact we have all the information at this point to construct the pipeline implicitly, without having to explicitly wire these stages together, and by doing so we create opportunities to optimize the construction of that pipeline and layer governance on top of it. So let’s talk about that algorithm, and first, as a point of contrast, let’s talk about how these pipelines typically get deployed. Here’s a pipeline very much like the one we saw in code: we’re doing some tokenization of text, vectorizing that text, and then building an NLP model on top of it. In our first scenario, we’re creating a sentiment model. Now, often what will happen is I’ll deploy this sentiment model, I’ll go into my next experimental iteration, and now I want to deploy a new kind of model, maybe I want to try to make some inference about the toxicity of text rather than the sentiment of text. All too often, that sentiment pipeline will exist in a notebook, I’ll copy-paste large portions of that notebook, and with it copy-paste large portions of the pipeline. There will be a lot of redundancy, not just in the code but also in the computation, and these featurization steps can be very computationally expensive. And now if I try to layer any governance or automation on top of these pipelines, that’s also getting copy-pasted all over the place. As an alternative, what I would like is to be able to declare the stages of my pipeline and, given the current metadata about my pipeline, in particular what I need to come out of the pipeline (in this case a sentiment prediction and a toxicity prediction), write a relatively naive algorithm which works backwards to construct a pipeline that is maybe a little more optimal (a sketch of that backward construction follows this paragraph). So I have not just removed a line of code here; by implicitly constructing the pipeline from the stages and metadata I’ve declared, I’ve already seen some early optimizations that I can make when I construct the pipeline. But so far this is a pretty simple algorithm, so let’s take a look at a more interesting case: we want to experiment with two different strategies for vectorizing our text. Maybe my sentiment model will perform better if I use a different strategy for turning my text into a vector.
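Here is that sketch: a minimal, hedged illustration of the naive backward construction, assuming each declared stage exposes getInputCol() and getOutputCol() and has a single input (real stages can have several, which this sketch ignores).

from pyspark.ml import Pipeline

def build_pipeline(stages, required_outputs):
    # Map each output column to the declared stage that produces it.
    producers = {stage.getOutputCol(): stage for stage in stages}
    needed, frontier = [], list(required_outputs)
    # Walk backwards from the required outputs, collecting the stages needed
    # to produce them and, transitively, the stages that feed their inputs.
    while frontier:
        column = frontier.pop()
        producer = producers.get(column)
        if producer is not None and producer not in needed:
            needed.append(producer)
            frontier.append(producer.getInputCol())
    needed.reverse()  # upstream stages first
    return Pipeline(stages=needed)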
If I construct this graph in the same way, what I’d find is that I can’t actually feed this graph into Spark, and the reason is that I have multiple incoming edges to a node, right? Spark can’t tolerate that level of ambiguity: my sentiment estimator needs to know exactly where to get its vectors from. One solution to this problem is to reconstruct the graph from the possible traversals of the original graph, and that would look a little something like this. There are some interesting applications here, for example A/B testing: I can compare how my pipeline performs when I use Word2Vec as my vectorizer versus when I use TF-IDF as my vectorizer (a sketch of that expansion follows this paragraph). So that’s already emerging just from this algorithmic approach to implicitly constructing the pipeline, but another thing that’s emerging here is some really interesting metadata. From this, I naturally have metadata about data lineage: I know where my sentiment model is getting its predictions from, and I know how that data is constructed in turn. These are opportunities for better management of our operational tasks, such as metadata management and runtime optimization, and they come just from the algorithm itself. But the real value here is the opportunity to create new types of APIs based on this algorithm, and the real way that I get accelerated from an operational perspective is the fact that I can now build APIs that better isolate the concerns of algorithmic design from the operational concerns: how am I building my pipeline, how am I constructing it, my DevOps, how am I deploying that pipeline, the automation around runtime management, the metadata management and discovery, as well as how I layer governance on top of that pipeline.
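As a hedged illustration of that expansion (the stage and column names here are assumptions), the ambiguous graph can be unrolled into one concrete pipeline per vectorization strategy, which also gives us a straightforward A/B comparison:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, Word2Vec

# Two competing strategies both produce the "text_vector" column.
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tfidf_strategy = [
    HashingTF(inputCol="tokens", outputCol="raw_counts"),
    IDF(inputCol="raw_counts", outputCol="text_vector"),
]
word2vec_strategy = [Word2Vec(inputCol="tokens", outputCol="text_vector")]

sentiment = LogisticRegression(featuresCol="text_vector", labelCol="sentiment_label")

# Expand the ambiguous graph into one concrete pipeline per traversal.
candidates = [
    Pipeline(stages=[tokenizer, *strategy, sentiment])
    for strategy in (tfidf_strategy, word2vec_strategy)
]
# Fitting and evaluating each candidate on held-out data gives a simple
# A/B comparison of the vectorization strategies feeding the sentiment model.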
So with that, I’m going to provide a demo of what an API like that could look like, as well as an example deployment from that API that shows some of the new ways we can interact with this metadata and layer governance on top of our pipelines.
So after the demo I’m handing things off to Nate of the future to answer any questions you have; I hope you’ll stick around for those questions and also to provide feedback at the very end of the talk. The feature flow ML pipeline orchestration algorithm creates new opportunities to write APIs that separate the concerns of algorithmic design from our operational concerns, concerns such as metadata management, runtime management and governance. So here’s another example of such an API, implemented as a library to be used as part of a DevOps deployment pipeline.
So here we’re doing just some typical algorithmic design, defining some stages of our ML pipeline: in this case we’re defining some NLP tokenization strategies and building a sentiment model out of that.
Recall that we don’t explicitly wire together the stages of this pipeline; we rely on the feature flow algorithm to do that. But we do define some metadata around our pipeline, and specifically we’re telling the algorithm the outputs we’re expecting: by one means or another I’m expecting a sentiment prediction, and I’d like that to be joined to the text that was provided as input to the pipeline.
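As a hedged sketch of what such a declaration could look like (the FeatureFlow name, its arguments and the column names are assumptions for illustration, not the actual demo library’s API):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import Tokenizer, Word2Vec

class FeatureFlow:
    # Hypothetical wrapper for illustration: it holds the declared stages plus
    # the pipeline metadata (the outputs we expect) and defers the wiring to an
    # orchestration algorithm like the backward construction sketched earlier.
    def __init__(self, stages, required_outputs):
        self.stages = stages
        self.required_outputs = required_outputs

flow = FeatureFlow(
    stages=[
        Tokenizer(inputCol="text", outputCol="tokens"),
        Word2Vec(inputCol="tokens", outputCol="text_vector"),
        LogisticRegression(featuresCol="text_vector",
                           labelCol="sentiment_label",
                           predictionCol="sentiment_prediction"),
    ],
    # Pipeline metadata: the sentiment prediction, joined back to the input text.
    required_outputs=["sentiment_prediction", "text"],
)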
So, noticeably absent from these algorithmic concerns are integrations with our operational tools, such as our MLflow model management tool or our Databricks runtime management tools. Those concerns are instead hidden behind this API, right? In particular, for all the pipeline stages we have defined, we now have an API method which can be used to deploy those stages.
So here, all those operational concerns, the governance for how our pipeline is evaluated and all the sticky details of how to configure our cluster, have been captured in a reusable way as part of this API.
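A hedged sketch of the kind of thing such a deploy call could hide (the function name, experiment and model names are assumptions; the MLflow calls themselves are standard APIs):

import mlflow
import mlflow.spark

def deploy(flow, training_data, experiment_name, registered_model_name):
    # The caller only declares stages and required outputs; experiment tracking,
    # model registration and cluster configuration stay behind the API.
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        pipeline = build_pipeline(flow.stages, flow.required_outputs)  # earlier sketch
        model = pipeline.fit(training_data)
        mlflow.log_param("required_outputs", ",".join(flow.required_outputs))
        # Log the fitted pipeline model and register it in the model catalog.
        mlflow.spark.log_model(model, "model",
                               registered_model_name=registered_model_name)
    return model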
To make that more concrete, here’s an example of what a deployment like that might result in. Here the API is managing three jobs for us. We have a job to fit our pipeline to the data that’s been provided, and then, because the output of that pipeline is a model which we store in the model catalog, and the model catalog has this notion of a staging model and a production model, we have two jobs for making predictions using models from the model catalog: one for making predictions using the model that’s been promoted to staging, and the other using the model that’s been promoted to production. So all of the details of the integration with MLflow Model Management have been encapsulated into these jobs, and we have controls and guarantees around what metrics are being collected and what standards are being used to collect all this data in MLflow as well as the model catalog.
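As a hedged sketch of those two prediction jobs (the model name is an assumption; the models:/name/stage URI and mlflow.spark.load_model are standard MLflow conventions):

import mlflow.spark

def score_with_stage(batch_df, model_name, stage):
    # Load whichever pipeline model is currently promoted to the given stage
    # ("Staging" or "Production") in the model catalog, and score the batch.
    model = mlflow.spark.load_model(f"models:/{model_name}/{stage}")
    return model.transform(batch_df)

# Given a Spark DataFrame `batch_df` of new text to score:
# staging_predictions = score_with_stage(batch_df, "sentiment_pipeline", "Staging")
# production_predictions = score_with_stage(batch_df, "sentiment_pipeline", "Production")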
So now, what about metadata management? When we did this deployment we captured all sorts of metadata about our pipelines; where did that metadata end up?
You can put it anywhere; here we demonstrate a UI we threw together, and this helps to illustrate some of the additional insights that can be created by managing our pipeline stages, our featurization stages, in addition to managing the feature data itself. So for example, in addition to searching for features such as text vectors, I can also identify the relevant stages which are consuming text vectors, or the strategies I have for creating text vectors, and scientists can compose these strategies and share them with one another.
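A hedged sketch of the kind of feature-stage metadata record that could sit behind such a UI (the field names are assumptions): enough to answer which stages produce or consume a given feature, and which pipelines use a given stage.

from dataclasses import dataclass, field
from typing import List

@dataclass
class FeatureStageRecord:
    stage_name: str                       # e.g. "word2vec_vectorizer"
    input_features: List[str]             # e.g. ["tokens"]
    output_features: List[str]            # e.g. ["text_vector"]
    pipelines: List[str] = field(default_factory=list)  # pipelines using the stage
    owner: str = ""                       # team or scientist who published it

def stages_consuming(records, feature):
    # Discovery helper: which registered stages consume a given feature?
    return [record for record in records if feature in record.input_features]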
Also, because I’m managing not just the feature data but also the pipeline, I can tie these pipelines to the featurization stages. So here we can see some information about a pipeline where this featurization stage is being used. I have a simple visualization of how this pipeline has been wired together, so I have insights into the pipeline, but also, because I’m managing the runtime, I have insights into that runtime. Here we’re capturing that using a popular visualization called Facets.
Nate is a Data Architecture and ML Engineering consultant at Accenture. He leads the design and technical delivery of complex ML applications. With his background in productionizing research applications, he helps enterprise clients develop their playbook to transition from promising research results to high value industrialized deployments.