Scaling Data and ML with Apache Spark and Feast


Gojek, Indonesia’s first billion-dollar startup, has seen explosive growth in both users and data over the past three years. Today, it uses big data-powered machine learning to inform decision making in its ride-hailing, lifestyle, logistics, food delivery, and payment products, from selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning. Features are at the heart of what makes these machine learning systems effective. However, many challenges still exist in the feature engineering life cycle. Developing features from big data is often an engineering-heavy task, with challenges in both the scaling of data processes and the serving of features in production systems.

Teams also face challenges in enabling discovery, reducing duplication, improving understanding, and providing standardization of features throughout organizations. In this talk, Willem Pienaar will explain the need for features at organizations like Gojek and will discuss the challenges faced in creating, managing, and serving them in production. He will describe how leveraging open source software like Spark and MLflow allowed their team to build Feast, an open source feature store that bridges data engineering and machine learning. He will explain how Feast and Spark allow them to overcome these challenges, the lessons they learned along the way, and the impact the feature store had at Gojek. Finally, he will demonstrate how democratizing the process of creating, sharing, and managing features dramatically reduces time to market and leads to key insights.


Video Transcript

– Hello everybody, my name is Willem and I’m the data science platform lead at Gojek.

Scaling Data and ML with Feast and Apache Spark

Today I’ll be talking to you about scaling data and ML with Feast and Apache Spark.

So the agenda for the day: I’ll give you a bit of background on some of the data challenges we face at Gojek, why we decided to create Feast, and the high-level functionality of Feast. Then we’ll talk a bit about how you can get data into Feast, how you can use it to train a model, how it’s used for serving, statistics generation and validation, and a bit about the project and the road ahead for us.

So what is Gojek? Gojek is an Indonesian super app. Today we are classified as a decacorn. We have more than 15 products and services. We are most famous for ride hailing, with cars and motorcycles. We have food delivery, one of the biggest food delivery networks in Southeast Asia, as well as digital payments, logistics, and lifestyle services. The company has more than 500,000 merchants and restaurants and more than a million drivers, so the scale of our operations in Southeast Asia is very big.

Machine learning at Gojek

So given the product diversity and the scale of the organization, ML and data are critical to what we do. The most classic example is matchmaking: matching customers and drivers. But it’s not just matchmaking. There’s dynamic pricing, so surge pricing, where machine learning models calculate prices across all the service areas that we cover. Those are just two of the bigger examples. We also have routing, recommendation systems (for example for food products and for delivery), and incentive optimization. We have supply positioning: how do you make sure that the drivers are located in the right position for trips to be started, and balance supply and demand? And then of course fraud prevention. All of these use cases are basically just the tip of the iceberg, so ML and data are critical to what we do. When we originally started, the typical flow for a data scientist was something like this: they’re given a business problem.

Machine learning life cycle prior to Feast

They think in terms of notebooks and data and basically hacking something together. Most of our original use cases, when Gojek started, followed this approach. Teams had an idea that going into production would require some kind of model serving, requests for features, and then some integration with the production system. But for the most part the focus was on what you can extract out of data in order to deliver value.

So teams typically evolve these notebooks. They’ll hack together a notebook and iterate on that. This is a very common practice, not just at Gojek but across the industry. These notebooks evolved into Airflow pipelines, or any kind of ML pipelines, that are monolithic end-to-end flows. You start with your data lake or your data warehouse and you transform your data. You can use Spark for that. Then you use some framework like XGBoost or TensorFlow, whatever you require, to train a model, and you somehow deploy that model into your serving environment. Your production system integrates with that serving environment, and you make predictions based on feature values that you’re also processing and serving.

This is the approach Gojek took originally, and it worked. You’d need to manage this whole end-to-end pipeline, and you’d need some kind of system that can transfer the data you’ve created in a batch pipeline into a production environment. Eventually you’d also start hooking up streams, transforming those streams, and then populating your online stores as well as your historical source data. For a monolithic system this works; this is fine. An MVP that does this can easily be created. But at Gojek this approach didn’t really scale for us. I think for most organizations it is great for the first project, but it doesn’t really allow you to reach higher efficiency and scalability.

Problems with end-to-end ML systems

So here are some of the problems we experienced with this end-to-end monolithic approach. One of the biggest was the inability to iterate independently. We had teams that wanted to iterate on features, team members who wanted to iterate on the modeling aspect, and others who wanted to iterate on the models in production. They want to add new features, they want to experiment with different types of models using A/B testing, or they just want to retrain models. All of these life cycles are independent, yet they’re coupled extremely tightly if you evolve a notebook into an end-to-end production system. That was one of the bigger problems we faced.

Another pain was that code needed to be ported over from the training environment, where for example Python was being used to manipulate data and then train a model. It needed to be ported into the stream processing engines, or into the serving environment, in order to create features. This duplication of work was one of the pains we experienced, and the duplication of code led to inconsistency, because now the data that you’re receiving in serving is different from the data that you’re using to train your model. That inconsistency can lead to data quality loss and model performance loss.

Another problem was that we were building these systems from the ground up. Yes, you’ve got production-grade infrastructure, but it’s not built with ML in mind; it’s generic data infrastructure. So the quality and monitoring tools that are available need to be configured for your specific use case, and we needed to configure them in a way that made sense. One of the problems with this approach is that the production engineers who are typically on call for managing these models do not really have the context of the data that is being used in the model. They do not know what the right properties or shapes of the data are. This disconnect between the production side and the data science side is one of the big problems that we also faced.

And then finally, when you build these monolithic systems, there’s a deep lack of reuse and sharing, especially of features. Firstly, there’s no knowledge of what is being used in these pipelines, so another team doesn’t know which features are being created here. And if they do know which features are created here, they don’t have a way to extract that data out of this pipeline, out of this whole flow. So one of our bigger challenges was how to avoid teams recreating these pipelines from scratch every time a new project is started, because clearly spending two thirds of the time creating features is not an efficient way to go about this.

So Feast is one of the approaches to solving these key data challenges with productionizing machine learning.

Feast background

To give you a bit of background about the project: Feast was developed in collaboration between Google Cloud and Gojek. It was developed at the end of 2018 and open sourced in early 2019. Gojek is the primary contributor, and Feast is running in production for most of our large-scale ML systems. Our pricing engines, our matchmaking systems, and many of the other large systems within Gojek run on Feast. Today Feast is a community-driven effort. It has adoption at multiple tech companies, some of which I’ve listed here. These companies are either adopters of or contributors to the project.

Machine learning life cycle prior to Feast

So if we go back to this original diagram I think the question really is how can we improve this monolithic approach to ML?

If you introduce Feast, the stages of this pipeline are decoupled. Each of these stages (creation of features, training of models, serving of models) has a touch point with features, and they were all handled together. Feast allows you to decouple them and iterate independently. If you’re a data creator, and this might still be the same data scientist, you’re creating features either by processing a stream or by processing your batch data in a data lake or a data warehouse. But that data ends up in Feast, and that’s where that specific iteration cycle ends.

When you’re training a model, you’re not processing your data from scratch and you’re not really transforming raw data; you’re selecting data from the feature store. This decoupling means that there’s no strong relationship between the person training the model and the person creating the features, although normally in projects those were the same person. The decoupling allows the person training the model to select any number of features from any number of teams throughout the organization. This is very powerful for us because it makes it possible for teams to iterate independently.

So when you’re training a model, it’s mostly a matter of selecting the features that apply to your use case, then creating your model binary and persisting it in a model store like MLflow. What teams typically also do is store a list of features with that model binary. These are the features, or feature references as we call them, that were used to train the model. Then they ship that into the model serving environment. In model serving you’re loading your model from a registry like MLflow and serving it in your production environment. All you need to do to get the same data to that model is send another request to Feast for online values, and Feast will return those to you.

So here you have three stages that are now decoupled, and you can iterate on them independently. What you’ll find at scale is that you build up a base of features that are reusable across projects, and you automate your production serving environments. So all of the focus goes into the modeling environment, and that’s where data scientists spend most of their time, instead of crafting features from scratch.

What is Feast?

So what is Feast? In summary, Feast is an ML-specific data system, and it is an attempt to solve these specific problems. I highlight that aspect because if you’re only focused on data storage or stream processing or any one of those aspects, none of those are ML specific, but there are specific constraints to running ML in production, like the consistency between training and serving, that need to be upheld. So Feast can be seen as an opinionated configuration of production data technologies in order to serve these ML needs.

For example, Feast allows you to consume from multiple sources, either batch sources or streaming sources, and it stores and processes that data in a way that lets it be served later in a point-in-time correct way, both for training your models and for model serving. When you’re ingesting data, it stores the data in online stores and in historical stores.

Feast allows you to standardize the definitions of features. As a data creator you define the entities, you define the features, and you map these onto the data that you create. Then as an organization you have a standardized definition of those features that you can reuse across projects. This encourages reuse and sharing, because with Feast you now also have a single canonical reference to each one of those features. You’re not saying "look at this CSV, look at this column", or "look at this key in this database", or "look at this query". You’re talking about the name of a feature, this reference, and that’s all your model sees. It doesn’t see the implementation details of the infrastructure.

Feast allows you to ensure consistency between your training and serving environments through that abstraction. Partly through the ingestion layer and where it stores data, Feast ensures that the training and serving layers always receive the same data. So the model is trained on the same data that it will get in serving, and that ensures that the performance of the model is upheld.

Feast is also able to do a time travel, or point-in-time correct, query of your features for producing model training data sets. This is a critical aspect of creating training data sets that a lot of data scientists do by hand today, but it is quite error prone. If you do this incorrectly you’re either going to provide stale data to your model for training, or you’re going to leak data from the future to your model. In both cases the model will perform badly, and it will be very hard to debug and to know that there is a problem.

Feast also allows data creators to define schemas. These schemas can be used, together with statistics that Feast generates on the data, to validate the quality of your data. So if there’s a data shift, Feast can let you know that something’s wrong, and it can prevent you from shipping a bad model, or from continuing to serve a bad model, or a good model on bad data, in production.

So what is Feast not? Feast is not a workflow scheduler, and it’s not a pipelining solution. It isn’t a data warehouse, though it is a storage system for data: it abstracts the warehouse, lake, or online stores, so it’s not just a database. Feast is not a transformation system. This is a very key point. Some other feature stores have transformation capability as a key value add. We consider existing tools to be a better solution to these problems, so we try to focus on the other challenges of productionizing ML. We encourage the use of Spark or Pandas in upstream systems, and then we become the productionization layer.

Feast is not a data discovery or cataloging system for your whole organization. An aspect of what we do is feature discovery, reuse, and cataloging, but that covers a subset of the data that’s available in your organization. Feast is not a Git for data, or a data versioning or lineage system. There are tools and products that do that. We might integrate with those, but they’re solving a different kind of problem from what we’re trying to solve.

Feast is not a model serving or metadata tracking system. We produce artifacts and we produce metrics that can be tracked in those systems, but we do not serve models. When you serve a model, because a model is ultimately a transformation, the data it produces can be fed back into Feast. So you can use Feast to track statistics about your models, and use schema validation to see if your models are drifting from their traditional outputs. In some ways Feast can integrate with model serving, but it doesn’t serve models itself.

So let’s talk a little bit about how you can get data into Feast, and how you can define features, entities, and schemas.

Create entities and features using feature sets

When you’re working on ML systems, typically what you’re trying to do is predict some kind of event, some kind of phenomenon, on an entity. The entity can be a customer, it can be a driver, or it can be a booking or an order on which you want to make some kind of prediction. And you do that through features. These features are attributes or properties of that entity, or of any entity in the world that you’re operating in. So the key concept in Feast is the definition of these features and the definition of these entities, and the feature set is how you do that.

On the screen we have an example of a feature set, the driver weekly feature set. This feature set maps onto one table, which we’ve shown there in green and yellow. This is a table that represents drivers. On the left you can see driver IDs, and you can see there’s a conversion rate, an acceptance rate, and an average daily trips column. These are batch features; they’re calculated on a daily basis. You can think of the feature set as a kind of bulk way to define sets of features that occur together. Importantly, these all occur on the same timestamp, so they’re computed as part of the same aggregations. A very important thing to note is that a feature set is an ingestion concept. It’s a means of defining a schema for how data will be loaded into the system, or sourced by Feast into the system. It’s not a means of selecting features for training a model. It’s only a grouping of definitions.
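To make the idea of a feature set as an ingestion schema concrete, here is a minimal sketch in plain Python. It is not the Feast SDK: `FeatureSetSketch`, `infer_feature_set`, the column names, and the sample values are all hypothetical stand-ins, loosely modeled on the driver weekly table described above. It only shows how entity columns and feature columns that share one timestamp might be separated into a schema.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSetSketch:
    """Hypothetical stand-in for a feature set: a named grouping of definitions."""
    name: str
    entities: dict = field(default_factory=dict)  # entity name -> type name
    features: dict = field(default_factory=dict)  # feature name -> type name


def infer_feature_set(name, rows, entity_columns, timestamp_column="datetime"):
    """Infer entity and feature definitions from the first sample row."""
    feature_set = FeatureSetSketch(name)
    for column, value in rows[0].items():
        if column == timestamp_column:
            continue  # the shared event timestamp is neither entity nor feature
        type_name = type(value).__name__
        if column in entity_columns:
            feature_set.entities[column] = type_name
        else:
            feature_set.features[column] = type_name
    return feature_set


# Made-up sample rows in the shape of the driver weekly table.
rows = [
    {"datetime": "2020-06-01", "driver_id": 1001, "conv_rate": 0.81,
     "acc_rate": 0.93, "avg_daily_trips": 14},
    {"datetime": "2020-06-01", "driver_id": 1002, "conv_rate": 0.64,
     "acc_rate": 0.88, "avg_daily_trips": 9},
]

driver_weekly = infer_feature_set("driver_weekly", rows, entity_columns={"driver_id"})
print(driver_weekly)
```

The schema says nothing about which model will use which feature; it only describes how rows are shaped when they are loaded.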

Ingesting a DataFrame into Feast

So here’s an example of ingesting data into Feast. We’re going to take the existing data frame that we showed in the previous slide, loaded in from a CSV, the driver weekly data CSV. We’re going to define a feature set, and we’re going to infer the entities and features from that data frame. Then we have an object, the feature set object, that we can apply, and apply essentially registers that feature set with Feast. So now Feast knows about that entity, the type of the entity, and the features that are associated with that entity. The final step there is ingestion. The ingestion step loads the data into Feast. The important thing to note here is that once you ingest that data, depending on the subscriptions of the stores inside the Feast serving layer, those stores will immediately be updated with this data. So when you do an ingestion, all of our stores immediately get access to this data, and the data is available to them in a consistent way.

But Feast is also capable of ingesting from streams. In this case there’s a subtle difference: we’re defining a driver stream feature set. On the right you can see an example data frame of this event stream. You’ll see there’s a trips today feature in the one column in green. These are just events that are coming in for different driver IDs on some kind of stream, and the topic is the driver stream topic. What happens here is that you’re defining a feature set and registering it with Feast using the apply method. As soon as you do that, Feast is going to provision an ingestion job. This ingestion job is going to start streaming in the data, and it’ll make sure to populate all the stores that are subscribed to this feature data.

What happens to the data?

This is the high-level architecture of what happens to that data. If you look at the five stages here, you’ve got your data first. You can think of that first layer as the raw layer: your existing streams, your existing data warehouse or data lake, or just your notebooks where you’re doing iteration. That’s the area where you’re working with your data.

Then you have the ingestion layer. This ingestion layer is going to take data from wherever you’ve created it and populate stores. These stores can be either historical stores, used for creating training data sets, or online stores. An aspect of what actually happens that’s not really shown here is that there’s a streaming layer that populates the online stores. That streaming layer allows you to have as many online environments as you want. So if you have a large organization, teams can have their own independent serving environments that just tap off of a stream, and their stores automatically get populated when anybody loads new data sets or streams into Feast.

The Feast serving layer is the fourth layer. That’s the way that teams interact with the data that’s persisted within Feast. A model training pipeline or a model serving system can send a request to the Feast serving layer, either for a training data set or for online features, and Feast will export that or return it at low latency.

All of this is managed through Feast Core. You can think of Feast Core as having a dual purpose. First, it is a registry. It’s the central place where you define features, entities, and feature sets, and it allows you to search and discover these features. It allows you to track metadata, attributes, and properties of those features. Feast Core also allows you to generate statistics. Second, an important job that Feast Core does is manage the ingestion layer. When you register feature sets, Feast Core is the one spinning up the jobs and ingesting the data into these stores.

So this is a high-level view of how Feast takes data and stores it in a way that is consistent. The reason it’s consistent is that the ingestion layer writes to all of the storage locations in one go. That ensures that at no point in time do these stores become inconsistent with each other. That unified storage is a key aspect of what Feast does.
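The "write to every store in one go" idea can be illustrated with a toy ingestion layer. This is a sketch under simplifying assumptions, not how Feast is implemented: `IngestionSketch` is a hypothetical class, both stores are in-memory Python objects, and timestamps are plain integers. The historical store keeps every row, while the online store keeps only the latest values per entity.

```python
class IngestionSketch:
    """Toy ingestion layer: each ingested row goes to both stores in one call."""

    def __init__(self):
        self.historical = []   # append-only log of every row, used for training
        self.online = {}       # entity key -> latest feature values, used for serving
        self._latest_ts = {}   # entity key -> timestamp of the values held online

    def ingest(self, rows, entity="driver_id", ts="event_ts"):
        for row in rows:
            # The historical store keeps every version of the data.
            self.historical.append(dict(row))
            # The online store only keeps the most recent version per entity.
            key = row[entity]
            if key not in self._latest_ts or row[ts] >= self._latest_ts[key]:
                self._latest_ts[key] = row[ts]
                self.online[key] = {
                    name: value for name, value in row.items()
                    if name not in (entity, ts)
                }


store = IngestionSketch()
store.ingest([
    {"driver_id": 1001, "event_ts": 2, "trips_today": 5},
    {"driver_id": 1001, "event_ts": 1, "trips_today": 3},  # late-arriving older event
    {"driver_id": 1002, "event_ts": 2, "trips_today": 8},
])
print(store.online[1001], len(store.historical))
```

Because both stores are fed by the same call, a training query over the historical rows and an online lookup can never disagree about what was ingested.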

So once you’ve stored your data inside of Feast feature serving is the next important thing to look at.

Feature references and retrieval

So here we have two loops. If you look at the right, there’s the model training loop and the model serving loop. For training, what you want is a training data set, and for serving you want just the features that you need to do a prediction. But there’s one thing that the model always needs, and that is for the data to be in the correct format, a consistent format in both cases. Feast helps with this by abstracting away the infrastructural aspects, the implementation details. The only API contract that the user needs is the feature reference list. Feature references are canonical references to features stored within Feast.

An important thing to note here, if you look at this feature list, is that the references are based on the two feature sets that we registered. The reason that’s important is that you can reference features from any number of feature sets. As long as a feature is registered in Feast you can reference it. So you can send a request to Feast serving either for a training data set or for online serving. For training, Feast will build the data set: it will do a point-in-time correct join and stitch together the data to produce the training data set. The only requirement is that you have the correct entities to join that feature data onto. In training you’re going to send Feast a list of entities, so driver IDs and timestamps, and then Feast will join the feature data onto that. In serving you’ll send just the driver IDs and the list of features, and Feast will send back the feature values attached to those driver IDs. So in online serving you’re only getting back the latest data, but historical serving gives you the correct view of the data for training.

So one of the key value adds here is the consistency between these environments, and the fact that the historical data sets represent the point-in-time correct view of the data.

Events throughout time

An important thing to talk about is this point-in-time correctness, and the events that happen throughout the life cycle of the project. If you look at this timeline, the green diamonds are feature values that are being computed. That can be during batch transformations or on streams. It doesn’t really matter, but at some point in time a new value is populated, and it’s ingested into Feast or streamed into a store, and it becomes available. The red square is an event in the prediction system. That’s some kind of event that mandates a prediction. It could be that a booking is made, or some kind of transaction is made, and you need to make a prediction. The final purple square is the outcome. That’s the final event that happens, or the final data point that arrives, that tells you whether it was a success or a failure. These three colors represent the three types of data that you deal with when you’re building an ML system. You need all three of them to be able to train a model, to know the outcome of a model, and to have the correct labels to train that model.

Ensuring point-in-time correctness

So if you look at this view, suppose you’re doing a prediction online. Let’s assume that at the booking event you have your model already, and in order to make a prediction you want to provide the correct feature values. So for each of those four green rows you’re going to send the latest values to your model. That’s very easy to do in online serving. All you need to do is make sure that the latest values are available and that they’re not too stale, and then you always have a point-in-time correct view of the data.

But how do you do that for historical serving? This is very tricky because the timestamps don’t line up perfectly. If you have many different booking events, many different labeling events or outcomes, and all of your features are calculated at different timestamps, producing this training data set is non-trivial. It is something data scientists are required to do, and it is extremely error prone and often a complex task for them to undertake. This is something that Feast does natively. It stitches together the data for you. The source of each feature, its update rate, and its frequency are all abstracted away from you. It will produce a point-in-time correct, time-traveled training data set for you, regardless of when the data points were updated.

Point-in-time joins

So a quick example here uses our existing data frame of driver features. On the right you have a new data frame with some labeled events. These basically just say whether something was completed successfully or not. Those are the target values that you want to join onto. On the left you have features and driver IDs. If these two data frames have been loaded into Feast and you query it, you can produce this final training data set, and it doesn’t matter that the timestamps don’t line up. You can indicate to Feast that one table, the table on the right, is your base table, and that the one on the left contains the features that you want to join onto it. Feast ensures that the join happens in a point-in-time correct manner.
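The join described above can be sketched in a few lines of plain Python. This is an illustration of the general point-in-time ("as of") join technique, not Feast’s implementation: `point_in_time_join`, the column names, the `trip_completed` label, and the integer timestamps are all assumptions made for the example. For each labeled event, the sketch picks the most recent feature row for the same driver whose timestamp is at or before the event, so no value from the future can leak into training.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(entity_rows, feature_rows, feature_names):
    """Attach to each entity row the latest feature values known at its timestamp."""
    # Group feature rows per driver, ordered by event time.
    history = defaultdict(list)
    for row in sorted(feature_rows, key=lambda r: r["event_ts"]):
        history[row["driver_id"]].append(row)

    joined = []
    for entity in entity_rows:
        rows = history.get(entity["driver_id"], [])
        timestamps = [r["event_ts"] for r in rows]
        # Index of the last feature row at or before the entity timestamp.
        idx = bisect_right(timestamps, entity["event_ts"]) - 1
        out = dict(entity)
        for name in feature_names:
            out[name] = rows[idx][name] if idx >= 0 else None
        joined.append(out)
    return joined


# Feature table (left): values become known at event_ts.
features = [
    {"driver_id": 1001, "event_ts": 1, "conv_rate": 0.70},
    {"driver_id": 1001, "event_ts": 5, "conv_rate": 0.80},
    {"driver_id": 1002, "event_ts": 3, "conv_rate": 0.60},
]
# Base table (right): labeled events with their own timestamps.
labels = [
    {"driver_id": 1001, "event_ts": 4, "trip_completed": 1},
    {"driver_id": 1001, "event_ts": 6, "trip_completed": 0},
    {"driver_id": 1002, "event_ts": 2, "trip_completed": 1},
]

training = point_in_time_join(labels, features, ["conv_rate"])
for row in training:
    print(row)
```

The third labeled event happens before any feature value exists for that driver, so it gets `None` rather than a value from the future. A production system would typically also bound how stale a joined value may be; that is not modeled here.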

Getting features for model training

So here’s a code snippet using our Python SDK that shows how you can retrieve feature values from Feast for training a model. On the left you have your list of features; those are the feature references, the canonical references that I spoke of earlier. This list of features is all you need to pass to Feast, along with the entity data frame. That data frame contains driver IDs and timestamps, and those typically map straight onto the timestamps at which the events that need a prediction occur. You send that as a request to Feast, and Feast returns a data set that can be materialized into a data frame, a Pandas data frame, or it can just keep the data persisted on disk or in an object store.

Essentially what it’s going to do is produce this final training data set based on multiple feature sets. If you look on the right you have two different categories of feature data: the batch data that we ingested originally and the trips today streaming feature. It’s irrelevant where the data came from. This is the nice thing about Feast: you can just join these in a point-in-time correct way, and it doesn’t matter that one is coming from a stream and the other from batch. Feast can join all of this together for you and prevent the typical errors that occur, like leaking feature data or joining data together in a way that propagates stale data to your model training. A team would typically produce this data frame, use it to train their model, and then produce their model binary with a list of features.

Getting features during online serving

So here we have an example of going to production. This is the Python SDK again, and you’ll notice that the list appears identical. You don’t have a Spark query or a SQL query in training and then some kind of different database query in serving. The only difference here is the method that you call for online serving. The list of features is consistent. That’s one of the key values that Feast brings: the list of features, the feature references, unifies both environments.

In production you’d have driver IDs, which are your entities. You would not have timestamps, because you’re always requesting the latest feature values. So you’d send the list of features and the driver IDs, and Feast will attach the latest feature values and return them. In this case we’re only asking for one driver ID’s features, but in reality you’d ask for, let’s say, 100 driver IDs or more, and Feast would give you these values at low latency. Our current stores operate at single-digit millisecond latency for this, and operating at scale is really one of the requirements that we have. We also have JVM and Go clients, so this Python client is shown for illustrative purposes.
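A rough sketch of the online side, again in plain Python rather than the Feast SDK: the same list of feature references that was used for training is resolved against a key-value store holding only the latest values. The `feature_set:feature` reference format, the `get_online_features` helper, and the stored values are assumptions made for illustration.

```python
# The same reference list would be stored alongside the trained model.
FEATURE_REFS = ["driver_weekly:conv_rate", "driver_stream:trips_today"]

# Toy online store: (feature set, driver ID) -> latest feature values.
ONLINE_STORE = {
    ("driver_weekly", 1001): {"conv_rate": 0.81, "acc_rate": 0.93},
    ("driver_stream", 1001): {"trips_today": 7},
}


def get_online_features(feature_refs, driver_ids, store):
    """Return one row per driver with the latest value for each reference."""
    rows = []
    for driver_id in driver_ids:
        row = {"driver_id": driver_id}
        for ref in feature_refs:
            feature_set, feature = ref.split(":")
            # Missing entities or features come back as None instead of failing.
            row[ref] = store.get((feature_set, driver_id), {}).get(feature)
        rows.append(row)
    return rows


online_rows = get_online_features(FEATURE_REFS, [1001, 1002], ONLINE_STORE)
for row in online_rows:
    print(row)
```

No timestamps are passed in the request: the caller only names entities and features, and the store answers with whatever is current.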

Feast also allows you to generate statistics, use those statistics to generate schemas, and then use those schemas across training runs as well as in production to validate data and catch data shifts.

So Feast integrates with TFX. Feast has interoperability with TFX through its feature specifications. TFX and TFDV are tools developed by Google that allow you to define schemas on data sets.

Feature validation in Feast

These schemas can be used for validating data. We felt that these were great tools, and we didn’t want to reinvent something that already existed; we wanted to complement them. That’s why we integrated those schemas into our feature specifications. Feast is able to generate statistics that are compatible with TFDV, so once you load your data into Feast you can export these statistics in a TFDV-compatible format. You’re then also able to use those TFDV schemas to validate the statistics. You can do this during training runs, and you can do it at ingestion time as well, when you’re streaming data into your stores.

Feast is also able to integrate with monitoring and alerting systems. Currently we have an integration with Prometheus. Our ingestion pipelines can produce metrics and statistics, and these statistics can trigger alerts based on schemas that are defined upstream. These schemas are defined by the data scientist, based on the properties of the data. I think this is one of the key reasons we wanted to integrate the TFDV and TFX interfaces: the data creators know the most about the data that they’re publishing or authoring for consumers. They have the domain knowledge, so it’s important for them to be the ones defining the schemas that will ultimately be used for validation. If we limited this to TFX or TFDV alone, it would be a siloed batch process that happens in a single pipeline. By incorporating this into our feature specifications, we can apply those schemas not only for training but also for serving and ingestion, so at all the touch points where features occur in your organization.

Infer TFDV schemas for features

Here’s an example of inferring schemas from existing features and then registering them with Feast. The first step is generating statistics based on an existing Iris feature set that is within Feast. The computation of the statistics object that is returned is all done by Feast behind the scenes, so it abstracts away the infrastructural and computational aspects, but the statistics that are returned are 100% compatible with TFDV. In fact, it’s TFDV statistics that are returned. Those statistics can then be used to infer a schema. That’s all happening client side, outside of Feast, and you can then use the normal TFDV approach of tweaking the schema and setting the values. Once you’ve defined that schema, you want to update the feature set that you’re working with in order to enrich it with the schema. So we’re going to retrieve the feature set using the get feature set method, import the schema into that feature set, and re-register it using the apply method. The apply method is an important update method for that feature set. So we’re registering that Iris feature set with Feast, and if you look to the right, you’ll see that the Iris feature set now has not just the name and the value type of each feature but also presence, fraction, and count. These are TFX and TFDV properties that can ultimately be used to validate those features. The schema can be used not just for validating training data sets, ingestion, or serving; it can also be used to indicate the intent and the properties of that feature. An important thing to note here is that when you’re registering these properties about these features, you’re doing it on a feature-by-feature basis. So even though they’re grouped together here, they can ultimately be used across feature sets.
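The flow described here (compute statistics, infer a schema, tweak it, then enrich the feature set) can be sketched in plain Python. This is only an illustration of the idea, not Feast’s or TFDV’s actual code: the functions `compute_statistics` and `infer_schema`, the dictionary layout of the feature set, and the Iris column names are all assumptions made for the example.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def compute_statistics(rows: List[Row]) -> Dict[str, Dict[str, int]]:
    """Count total and non-missing values per feature.

    Assumes every row carries every feature key (missing values are None).
    """
    stats: Dict[str, Dict[str, int]] = {}
    for row in rows:
        for name, value in row.items():
            entry = stats.setdefault(name, {"total": 0, "non_missing": 0})
            entry["total"] += 1
            if value is not None:
                entry["non_missing"] += 1
    return stats


def infer_schema(stats: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """Infer a simplified presence constraint per feature from the statistics."""
    schema: Dict[str, Dict[str, float]] = {}
    for name, entry in stats.items():
        fully_present = entry["non_missing"] == entry["total"]
        schema[name] = {
            "min_fraction": 1.0 if fully_present else 0.0,
            "min_count": 1 if entry["non_missing"] > 0 else 0,
        }
    return schema


# Hypothetical Iris rows; petal_width is missing in one of them.
iris_rows = [
    {"sepal_length": 5.1, "petal_width": 0.2},
    {"sepal_length": 4.9, "petal_width": None},
]

stats = compute_statistics(iris_rows)
schema = infer_schema(stats)

# "Tweak" step: the feature author encodes domain knowledge by hand.
schema["petal_width"]["min_fraction"] = 0.5

# Enrich a hypothetical feature set definition with the per-feature schema.
feature_set = {
    "name": "iris",
    "features": {
        "sepal_length": {"value_type": "DOUBLE"},
        "petal_width": {"value_type": "DOUBLE"},
    },
}
for name, presence in schema.items():
    feature_set["features"][name]["presence"] = presence
```

In the real workflow the statistics come from Feast, the inference and tweaking happen with TFDV on the client, and the enriched feature set is re-registered with the apply method; the sketch only mirrors the shape of those steps.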

Visualize and validate training dataset

So here’s an example of creating a training data set and then returning statistics and a schema from that training data set. This is going back to our previous example of the driver-based features. You’re sending your driver entities to Feast, you’re sending a list of features, and Feast is going to return a data set. But for statistics and validation purposes, Feast will also return precomputed statistics based on the data set that has been produced. So for that exported data set, it will run a TFDV-compatible operation that produces statistics. It will also provide the schemas that have already been registered on those features. Remember, you’re now picking features that cross feature sets; you can pick from any kind of driver features in the organization. Feast will propagate the schemas that those feature authors created and return them with your data set. So you can use TFDV client side to validate the data set that you’re going to use for training your model, prior to training, and you can use TFDV to check for anomalies. Because we are using TFDV, you also benefit from being able to use Facets, which is illustrated on the right in that UI. So you can generate the statistics and visualize them for people, for EDA or for debugging purposes.
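The propagation of per-feature schemas into a training data set can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the Feast or TFDV API: the registry `REGISTERED_SCHEMAS`, the `feature_set:feature` reference format, the feature names, and both helper functions are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical registry: schemas are stored per feature, so features picked
# from different feature sets each bring along their author's constraints.
REGISTERED_SCHEMAS: Dict[str, Dict[str, float]] = {
    "driver_stats:trips_today": {"min_fraction": 1.0},
    "driver_ratings:avg_rating": {"min_fraction": 0.9},
}


def collect_schemas(
    feature_refs: List[str], registry: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Gather the registered schema for every requested feature."""
    return {ref.split(":", 1)[1]: registry[ref] for ref in feature_refs}


def find_anomalies(
    dataset: List[Dict[str, Optional[float]]],
    schemas: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the features whose presence fraction violates their schema."""
    anomalies: List[str] = []
    for name, constraint in schemas.items():
        present = sum(1 for row in dataset if row.get(name) is not None)
        fraction = present / len(dataset) if dataset else 0.0
        if fraction < constraint["min_fraction"]:
            anomalies.append(name)
    return anomalies


# Hypothetical training data set joined across two feature sets.
dataset = [
    {"driver_id": 1, "trips_today": 12, "avg_rating": 4.8},
    {"driver_id": 2, "trips_today": 7, "avg_rating": None},
]

schemas = collect_schemas(
    ["driver_stats:trips_today", "driver_ratings:avg_rating"], REGISTERED_SCHEMAS
)
anomalies = find_anomalies(dataset, schemas)
```

The point of the design is that the person requesting the data set never writes the constraints; they arrive with the features, so the check can run before any model training starts.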

So what are the key takeaways from incorporating Feast into your life cycle? The key value adds of Feast start with sharing: teams can now start with selection instead of creation. They can quickly get up to speed and focus on iteration, and they can independently work on different stages of the lifecycle without being stuck on a monolithic approach.

What value does Feast unlock?

They can focus on independent aspects. Feast provides consistency between training and serving and ensures point-in-time views of your data. Feast allows for centralized definitions, and it allows you to reuse features throughout different projects. Feast also ensures the quality of the data that you’re producing, and allows users to encode their knowledge into their schemas to ensure the quality of the data that ultimately reaches your models. So finally, the road ahead.

So Feast 0.6 is landing soon, in June of 2020. With that we’ll have statistics, validation, and improved discovery and metadata functionality. The community is currently developing Databricks support, Azure support, and AWS support. We have BigQuery SQL sources landing soon, as well as JDBC connectors. We’re also looking at developing a derived features system for online serving, as well as a user interface, and allowing for automated training-serving skew detection.

So we encourage you to get involved. There are some links to our open source project. We are on the Kubeflow Slack channel. You can have a look at our mailing list and join that if you want to find out more. There’s a link to the slide deck, and yeah, that’s it from my side.

About Willem Pienaar


Willem is a tech lead at Tecton where he currently leads open source development for Feast, the open source feature store. Willem previously led the data science platform team at GOJEK, working on the GOJEK ML platform, which supports a wide variety of models and handles over 100 million orders every month. His main focus areas are building data and ML platforms, allowing organizations to scale machine learning and drive decision making. In a previous life, Willem founded and sold a networking startup and was a software engineer in industrial control systems.