Personalization is one of the key pillars of Netflix as it enables each member to experience the vast collection of content tailored to their interests. Our personalization system is powered by several machine learning models. These models are only as good as the data that is fed to them. They are trained using hundreds of terabytes of data every day, which makes it a non-trivial challenge to track and maintain data quality. To ensure high data quality, we require three things: automated monitoring of data; visualization to observe changes in the metrics over time; and mechanisms to control data-related regressions, where a data regression is defined as data loss or distributional shifts over a given period of time.
In this talk, we will describe the infrastructure and methods that we used to achieve the above: ‘swimlanes’ that help us define data boundaries for the different environments used to develop, evaluate, and deploy ML models; pipelines that aggregate data metrics from various sources within each swimlane; time-series and dashboard visualization tools spanning an atypically long period of time; and automated audits that periodically monitor these metrics to detect data regressions. We will explain how we use Spark to run aggregation jobs that optimize metric computations, SQL queries to quickly define and test individual metrics, and other ETL jobs to power the visualization and audit tools.
– Hi, everyone.
Have you heard of the sock universe? That’s the universe where socks seem to move in and out of their own accord.
You put six socks into a dryer and sometimes five come out and sometimes seven come out. Or maybe your red socks are replaced by blue socks. You know what’s similar to the sock universe? It’s the data flowing through the ETL pipelines. You put six attributes in, sometimes five come out, sometimes seven come out. Or maybe two of the attributes exchanged values. Today, my teammate Preetam and I are going to talk about data quality, using the sock universe as a palette. I’m Vivek. I’ve worked at Apple, Sumologic, and Amazon. Right now I’m working as a senior software engineer at Netflix. Netflix is a video-on-demand streaming service. This is a recommendations page. You may have heard of some of these popular titles, like “Money Heist,” “Ozark,” and “Tiger King.” Let’s see how this recommendations page gets generated. The first step is that we need data: we need member data and video data.
Using these two things, we can do the feature generation. Once we have generated the features, we can send them to a model scoring service, and from there, we can compute the recommendations. When we are doing the online feature generation, we take a snapshot of the member data and send that to a historical fact store. From there, we can recompute the features on demand. Why do we do this? Because this helps us improve our experimentation speed. We can add new features or edit existing ones, and once we are happy with the new feature values, we can do the model training and send the updated model to compute fresh recommendations. Today, our focus is going to be the data going in and out of the historical fact store.
On Friday at 11:10 a.m., I am going to be talking about the Spark limitations that we had in the offline feature generation Spark job. Do check that talk out. Coming back to our historical fact store, there are four major components to it.
The first one is the logger that takes the facts from the online feature generation and sends them to the ETL pipeline. The ETL pipeline transforms those facts and stores them in Hive. And from Hive, we can query them on demand and give them to the offline feature generation. Let’s get some more facts about this historical fact store. It has more than ten petabytes of data and there are hundreds of attributes in it. When we talk about an attribute, what does an attribute mean? Let’s talk about thumb ratings, thumbs given to a show like “Ozark.” So, we need the video ID for “Ozark” and we need the rating that the member gave, whether the member gave a thumbs up or a thumbs down. These two things form two different attributes. Now combine these thumb ratings with the subscriber information and viewing history. All of that combined forms one row in the historical fact store. There are more than a billion such rows flowing through our ETL pipelines every day. All of the recommendation machine learning models use data coming from the historical fact store. That’s why we care about the quality of data in this fact store. So what can cause bad data to come into this fact store?
We think there are two main reasons. The first is the data at the source. This historical fact store is only taking a snapshot of the data; it’s not the source of truth. The sources of truth are services like the thumb rating service. We talked about a thumbs up or thumbs down to “Ozark.” Let’s say there’s a bug in that thumb rating service. In that case, the thumbs up given to “Ozark” can be marked as a thumbs down. It’s an egregious error. But that error is going to be reflected in the historical fact store too, because the information at the source was wrong. The second one is a bug in the ETL pipeline. There are three steps: logger, ETL, and query. What if there’s a bug? We are converting from POJO to proto, to JSON, and so on. During those conversions, maybe we dropped an attribute, or we exchanged some attributes. That’s a bug in the source code. Let’s take some examples of bad data to understand this problem more.
The first example we are going to talk about is entanglement. Coming back to our sock universe, let’s say we have ten white socks and two red socks. We put all of them together in a washer. What you get at the end is ten pink socks and two slightly less red socks, because your red socks were leaking color. That’s entanglement in the sock universe.
In the Netflix world, let’s think about this. What if there’s a bug in the ETL pipeline and we mark 10% of the kids’ profiles as adult profiles? With that, the recommendations for adults are going to start including more and more kids’ content. With that, members may start giving thumbs down to the content. And that’s entanglement: a 10% change in the kids’ profile information can cause a 20% change in the thumb ratings.
And usually, this kind of thing manifests itself very slowly. It’s not a sudden change and it’s very hard to detect. In the example shown, we had a particular attribute that always had values between 4 and 5 million, but at some point, it started drifting to between 3.5 and 5.5 million, and at the extreme, to between 3 and 6 million.
And we were able to detect this and we fixed it.
After that, the values started coming down again. Let’s take a second example of bad data: a drastic change.
In the sock universe, if you put six socks into a dryer and suddenly five come out, that’s an example of a drastic change.
In the shown chart, we had a particular attribute that was arriving at 600 million rows every day, and at a certain point in time, it dropped to only 150 million. We got an alert, we fixed it, and we saw a V-shaped recovery.
Let’s move on to the third example of bad data. In the sock universe, let’s say you have 20 pairs of socks, but you wear your favorite Santa socks every day. The remaining 19 pairs are underutilized. The same goes for data, and we think that any data that does not get used is bad data. In the shown example, we had a particular attribute that was coming in 15 million times every day, but it was only consumed 1 million times. We talked to the application owners, they removed this attribute for the remaining 14 million rows, and that was a 95% saving on this particular attribute. And it was a huge quantity of data. So, let’s summarize what we talked about. We talked about three bad data examples: entanglement, unstable data, and underutilization. There are two reasons for these. The first one is data at the source and the second one is bugs in the ETL pipeline.
To talk about the solutions to these problems, I have my teammate Preetam with me. Over to you, Preetam. Thank you. – Thank you, Vivek. Hello, everyone. I’m Preetam, I work at Netflix, and I’ve been here for about a year. Previously I’ve worked at Thumbtack and Yahoo. Today I’ll be talking about solutions to the data quality problems that Vivek just mentioned. Data quality often does not receive as much attention as it needs in the industry and in most machine learning pipelines. So we’ll walk through some high-level solutions to the problems that were listed. Essentially, we’ll start with the first problem, which is addressing issues with the data at the source. From the previous slide, you could see that we had a bunch of components.
For this particular solution, we’ll be concentrating on only a couple of components: the online feature generation and the historical fact store. Essentially, the online feature generation component generates features, and there’s a machine learning model that scores those features and then generates recommendations. At the time of generating the features, this component logs raw facts into the historical fact store. What we do on top of that is build a data aggregations pipeline based on Spark. We store these aggregations in a data store, which we are going to call the aggregations data store throughout this presentation. On top of this aggregations data store, we have an automated monitoring and alerting component, and we also have a metric visualization component, both powered by this aggregations data store. We’ll be talking about all of these components in more detail. Let’s start with the data aggregations component.
So, we’re gonna start with a simple example. You can see that in this snapshot of a data set we have a few members, a few videos, and we also have the play duration, which essentially is the amount of time, in seconds, that a member played a video in a given session. So, play duration is a raw fact, and for this example, I constructed a feature which is log base 10 of the play duration. You can see from this data set that there are a few problems already. There are null values in two different rows, and the corresponding feature value for those null values is 2.556. If you’re wondering why the feature value is 2.556, it’s essentially because at the time of constructing the feature, when you have missing or null data, you would handle it by imputing the data with some default value. The default value in our case happens to be 360, which is row number one, and that just happens to be the median of the play duration. Log 10 of 360 is 2.556. So you can see already that having too many of these null values could dilute the importance of your feature. Another example in this same data set is this extreme case where you had 1 billion seconds of play. These are some of the things that you might want to be considerate about when you are building your data sets. The problem this causes is even worse when a machine learning model has been trained on data that does not have any of these outlier or null values; that can result in bad predictions online. So, in terms of aggregations, here’s an example of an aggregations table. Here we are aggregating by date. There are a bunch of columns here, and these are just examples. The key thing to take away is that these aggregations are rollups of a raw fact, which is the play duration.
You can see that there’s an aggregation on the number of null plays, there’s a median play duration, and there’s also 99 percentile play duration. And so you can have a bunch of those. So, from these aggregations itself, you can see that the number of null plays for a particular day is high, right? So, that kinda gives you an idea of how these aggregations would help and so you can have multiple of these kinds of aggregation columns. So, moving on.
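To make the aggregation idea concrete, here is a minimal plain-Python sketch of the kind of rollup the slide shows. The real pipeline runs these as Spark jobs; the row layout, column names, and the nearest-rank percentile are illustrative assumptions:

```python
import math
import statistics

# Toy snapshot of the fact store: one row per (member, video) play.
# None models a missing play_duration, as in the slide example.
rows = [
    {"member_id": 1, "video_id": 10, "play_duration": 360},
    {"member_id": 2, "video_id": 11, "play_duration": None},
    {"member_id": 3, "video_id": 12, "play_duration": 45},
    {"member_id": 4, "video_id": 10, "play_duration": None},
    {"member_id": 5, "video_id": 13, "play_duration": 1_000_000_000},
    {"member_id": 6, "video_id": 14, "play_duration": 600},
]

def aggregate(rows):
    """Daily aggregates over the raw fact play_duration."""
    durations = sorted(r["play_duration"] for r in rows
                       if r["play_duration"] is not None)
    n = len(durations)
    # 99th percentile via nearest-rank; a real Spark pipeline would
    # typically use approx_percentile instead.
    p99 = durations[min(n - 1, math.ceil(0.99 * n) - 1)]
    return {
        "num_null_plays": sum(1 for r in rows if r["play_duration"] is None),
        "median_play_duration": statistics.median(durations),
        "p99_play_duration": p99,
    }

agg = aggregate(rows)

# The slide's imputed feature value: log10 of the default (the median, 360)
imputed_feature = round(math.log10(360), 3)   # -> 2.556
```

The aggregate surfaces exactly the signals discussed above: a spike in `num_null_plays` or in the 99th-percentile play duration would show up immediately in the daily aggregations table.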
So, once you have this aggregations table, you would want to build automated monitoring and alerting. And automation is the key here, because you have a lot of data and you want to make sure that all of this data is automatically monitored.
So, we will describe in a little more detail what kind of automated monitoring you would want to have here. Coming back to our sock universe example, ideally, you would have two blue socks or two red socks. But in the case where you end up with one blue sock and one red sock, you will at least want to detect that before it goes to production, right? It’s the same with data. The red line here represents last week’s data and the blue line here represents the current week’s data. From this, you can see that there is a deviation from last week’s pattern. So the key point to note here is that you can use historical data as a baseline when you are monitoring your data. In this case, it was last week, but you could end up using 15 days, one month, a year; it’s completely up to you, according to your use case. That’s one way to automatically detect issues with your data. Another thing you can do is a distribution check. Let’s say you take an attribute from your data set and you have another data set, say last week’s data. You can plot a histogram for each and overlay them against each other. Visually, you can see from this particular histogram that there’s a difference between them. Although you can see this visually, you would also want to automate it, and we’ll be talking about how to do this automation, basically how you compare two different distributions automatically, in the next few slides. The other thing that you should keep in mind is typical distributed-systems issues. For example, late-arriving data: that could happen when you have a fast producer and a slow consumer, and the data could arrive out of order, so your downstream pipelines could consume less data or more data, and that could cause issues. So having simple checks to detect these typical distributed-systems issues is important.
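A minimal sketch of the week-over-week baseline check described above, in plain Python. The 25% tolerance and the daily row counts (in millions) are made-up illustrative values:

```python
def deviates_from_baseline(current, baseline, tolerance=0.25):
    """Flag days where the current week's daily row count deviates
    from the same weekday in the baseline week by more than `tolerance`."""
    flagged = []
    for day, (cur, base) in enumerate(zip(current, baseline)):
        if base == 0 or abs(cur - base) / base > tolerance:
            flagged.append(day)
    return flagged

baseline = [600, 610, 590, 605, 600, 400, 380]  # last week's daily counts
current  = [595, 615, 150, 600, 610, 395, 375]  # day 2 dropped drastically

flagged_days = deviates_from_baseline(current, baseline)  # -> [2]
```

The same shape of check works for any aggregated metric (null counts, medians, percentiles), with the baseline window and tolerance tuned per use case.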
So let’s dive into how to do distribution checks automatically.
So, you have two data sets and they have thousands of attributes. You would want to compare the distributions of each attribute independently, and to be able to do that at scale, you want to prune the attribute set using some sort of statistical test. We’ll talk about which statistical test we can use for this. After pruning the list of attributes, you will arrive at a smaller list. Let’s say you started off with a thousand attributes, you applied a statistical test, and you arrived at 10. Once you have that set of 10 attributes, you can manually inspect them using histograms or other techniques. So, let’s describe the statistical test in a little more detail. There are several ways you can do this. One way is the Kolmogorov-Smirnov statistical test.
So, to describe this test, let’s start with an example. Here we have a particular attribute from a data set, and there’s a histogram plotted on the left side. On the x-axis you have the attribute values, and on the y-axis you have the counts for those attribute values. The graph on the right is an empirical cumulative distribution function. Even though that might sound a little wordy, what it essentially gives you is the fraction of data points that are less than or equal to a given attribute value. You can see that on the x-axis, if you consider the value one, the corresponding value is 80%, which is 0.8 on the y-axis. And you can clearly see from the histogram that about 80% of the data points were less than one, right? So that’s the empirical CDF. Now if you take a different data set for the same attribute and chart a histogram, visually you can see that they’re different, but then again, we want to do this automatically. So we will plot the empirical CDF for this attribute as well. On the right you can see that the green line is the empirical CDF for the second data set and the blue line is the empirical CDF for the first data set. The way this test works is it computes the point of maximum discrepancy, denoted by the red line there. More concretely, in mathematical notation, the empirical CDF is F_n(a) = (1/n) · Σ I(x_i ≤ a), where I is an indicator function that returns one if x_i ≤ a is true and zero otherwise; you sum that over all data points i and average by dividing by the number of data points n. The maximum discrepancy value is denoted by the variable Dn.
So it compares the two CDFs and then computes this red line here. The important point to note is that the closer the value Dn is to one, the more likely it is that your two distributions are different; the closer Dn is to zero, the more likely it is that the two distributions are the same. Okay, so you don’t have to worry about implementing this yourself; there is a Kolmogorov-Smirnov test implementation in Spark MLlib that you can leverage to do this automatically. So, moving on.
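To make the statistic concrete, here is a plain-Python sketch of the two-sample discrepancy Dn described above: evaluate both empirical CDFs at every observed value and take the maximum gap. (Spark MLlib’s built-in `kolmogorovSmirnovTest` is a one-sample test against a theoretical distribution; this toy version just computes the two-sample discrepancy directly.)

```python
def ecdf(sample, a):
    """Empirical CDF: fraction of points in `sample` that are <= a."""
    return sum(1 for x in sample if x <= a) / len(sample)

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic Dn: maximum discrepancy between the two
    empirical CDFs, evaluated at every observed value."""
    return max(abs(ecdf(sample_a, v) - ecdf(sample_b, v))
               for v in set(sample_a) | set(sample_b))

same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])  # identical -> 0.0
diff = ks_statistic([1, 1, 1, 2], [5, 5, 6, 6])  # disjoint  -> 1.0
```

Attributes whose Dn exceeds a chosen threshold are the pruned-down set you would then inspect manually with histograms.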
So, we talked about automatic distribution checks.
When you have to do this automation, you need to be able to specify your audits in some sort of a DSL. So we wanna describe, at a very high level, the DSL that we use at Netflix. Here we are using PySpark: we’re loading up the aggregations table for a given dateint, and that gives us a data frame. Then we have this pseudo-code that represents the DSL that I was talking about. At Netflix, we have a DSL written by the data platform team, so shout out to them. Although this code is not the same DSL that we use, it should give you an idea of how to structure your audits when you are building them. You can see here we have an audit on the number of null plays with an upper-bound threshold; if it exceeds the upper bound, we will trigger some sort of pager alert. And you can do fancier things, like anomaly detection, and then trigger some sort of alert based on that. Right?
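As a sketch only (this is not Netflix’s actual DSL), a threshold-audit specification over the daily aggregates might look like this; the metric names, bounds, and actions are all hypothetical:

```python
# Hypothetical audit spec: each audit names a metric from the
# aggregations table, a bound, and the action to take on violation.
AUDITS = [
    {"metric": "num_null_plays", "upper_bound": 1000, "action": "page"},
    {"metric": "median_play_duration", "lower_bound": 60, "action": "email"},
]

def run_audits(daily_aggregates, audits):
    """Return the (metric, action) pairs whose bounds were violated."""
    triggered = []
    for audit in audits:
        value = daily_aggregates[audit["metric"]]
        if "upper_bound" in audit and value > audit["upper_bound"]:
            triggered.append((audit["metric"], audit["action"]))
        elif "lower_bound" in audit and value < audit["lower_bound"]:
            triggered.append((audit["metric"], audit["action"]))
    return triggered

alerts = run_audits({"num_null_plays": 5000,
                     "median_play_duration": 480}, AUDITS)
```

Because audits run over the small aggregations table rather than the raw facts, evaluating hundreds of them per day stays cheap.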
Okay, so, moving on. Once you get an alert, you would want to debug it further. For that, we have a very simple dashboard, built on Tableau, which spans a long period of time. You can see that the time range here is 30 weeks, but we can go much higher than that. The reason we can keep so much data for such a long period of time is that we are operating on top of aggregated data. That’s the key reason we compute aggregations. So you got an alert and you would want to dive into what happened with your data.
So you can see that from this graph, there was an issue here and then you can probably see from another graph that there was another issue at some point. So these deviations may or may not be expected.
But it should give you a good starting point to dive into the issue, to see which code path actually caused it, or whether there was a system-related issue that caused it. Okay, so that was about visualization.
So, at a high level, like Vivek described, we have the logger, ETL, and query components. We have a siloed set of independent metrics components for each of these different parts of our pipeline. We just described the automated data monitoring, but we also have some additional components to measure things like data usage, to make sure that the data that’s logged is actually used.
So, one thing that’s missing with this, is one unified consolidated view. Not only in terms of a dashboard where you can view all of these metrics together, but also a DSL where you can describe automations on top of these metrics.
So we don’t have that right now and it’s a work in progress.
So, so far we’ve talked about issues with data at the source, so now let’s talk about how to address issues that arise due to bad code.
So, we will describe the concept of swimlanes here. In essence, a swimlane is an isolated environment in which a pipeline can run end to end with all of its necessary resources. More concretely, say you have a dev branch where you’re making some code changes and you want to compare it against the master branch. Let’s say you’re adding an additional attribute, and you want to ensure that the existing attributes in master stay exactly the same even after your additions; that means you’re not causing any regressions to the already existing attributes. The way you would do that is using this concept of swimlanes, and we do this automatically at Netflix. The master branch will run ETL Code 1 and your developer branch will run ETL Code 2. You can see that these pipelines have their own sets of resources, including their own version of the fact store. The advantage of having your own fact store is that it also gives you the ability to do validations.
So you can attach a validation job to those two different fact stores and compare the attributes to see if they’re exactly the same. You could also attach a feature generation component on top of the fact store in your swimlane. So you get the idea of what a swimlane is: you can basically attach anything to it, and it’s a completely isolated environment. Of course, when you’re using feature generation, or when you’re operating at the feature level, you can’t do exact matches; at that point, you would want to do distribution checks, using the techniques that we described earlier.
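A sketch of what such a validation job might do, with hypothetical row keys and attribute names (a real job would run as a Spark join over the two swimlanes’ fact stores):

```python
def validate_swimlanes(master_rows, dev_rows, shared_attributes):
    """Compare rows from the master and dev fact stores key by key and
    report mismatches on the attributes both branches should share."""
    mismatches = []
    for key, master_row in master_rows.items():
        dev_row = dev_rows.get(key)
        if dev_row is None:
            mismatches.append((key, "missing in dev"))
            continue
        for attr in shared_attributes:
            if master_row.get(attr) != dev_row.get(attr):
                mismatches.append((key, attr))
    return mismatches

master = {("m1", "2020-06-01"): {"thumb_rating": 1, "video_id": 10}}
dev    = {("m1", "2020-06-01"): {"thumb_rating": 0, "video_id": 10,
                                 "new_attr": 7}}  # new attribute is fine,
                                                  # but thumb_rating regressed

issues = validate_swimlanes(master, dev, ["thumb_rating", "video_id"])
```

Note that only the shared attributes are compared, so a dev branch adding a brand-new attribute passes validation as long as it leaves the existing ones untouched.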
So, just to talk a bit about results. We were able to detect 80% of issues proactively. We had a 15% cost saving due to better detection of unused data.
And also, importantly, we were able to achieve a 99% validation rate during critical data migrations. And this is important because at Netflix we run several different A/B tests and so whenever we need to do data migrations, we need to ensure that they are completely transparent to the end A/B test.
And finally, we were able to improve developer productivity, because you have an isolated environment in which you can run your pipeline end to end. I’m gonna end with a reference to the sock universe where you have matching socks. It’s important to note that even though these problems might seem fairly trivial, they’re extremely important to get right to ensure that a machine learning pipeline, or workflow, runs correctly in production. Without that, you could run into several unintended problems. All right, so with that, I’m going to end. Thank you so much. Your feedback is important to us, so don’t forget to rate and review the sessions.
And now Vivek and I will open it up for further questions.
Preetam is currently a Senior Software Engineer on the Personalization Infrastructure team at Netflix. He builds systems that power machine learning models operating at petabyte scale. Prior to Netflix, he developed end-to-end machine learning models as part of the data science team at Thumbtack. He has also worked on the content recommendation and mobile search systems at Yahoo. He obtained his Master’s from the College of Computing at Georgia Tech.
I work as a senior software engineer on the Personalization Infrastructure team at Netflix. I work on distributed systems and big data, currently focusing on storing and querying petabytes of data. Previously, I have worked at Apple, Sumologic, and Amazon in similar roles.