Organizations rely heavily on time series metrics to measure and model key aspects of operational and business performance. The ability to reliably detect issues with these metrics is imperative to identifying early indicators of major problems before they become pervasive. This is a difficult machine learning and systems problem because temporal patterns are complex, ever changing, and often very noisy, traditionally requiring significant manual configuration and model maintenance.
At Zillow, we have built an orchestration framework around Luminaire, our open-source Python library for hands-off time-series anomaly detection. Luminaire provides a suite of models and built-in AutoML capabilities, which we process with Spark for distributed training and scoring of thousands of metrics. In this talk, we will cover the architecture of this framework and the performance of the Luminaire package across detection and prediction accuracy as well as runtime efficiency.
Smit Shah: Hello everyone. My name is Smit Shah, and along with me I have my colleague Sayan Chakraborty; we are both from Zillow. Today we are super excited to share the topic, Scaling AutoML-Driven Anomaly Detection With Luminaire. Detecting anomalies in your data as early as possible has always been the most important thing for any organization, and so it is for Zillow. Now, with the increasing number of key metrics that the company monitors, it has become really important to scale this and make it an AutoML-driven solution. One important note before we go further: if you have any questions, please make sure to use this session’s Q&A feature.
So let’s begin. As I said, we are both from Zillow, and within Zillow we are part of the data governance platform team. Our team focuses on two primary components: platforms and processes around data governance.
There are four main pillars our team focuses on: data discovery, quality, security, and privacy. Along with me I have my colleague Sayan Chakraborty, who is a senior applied scientist at Zillow, and I am Smit Shah, working as a senior software development engineer in the Big Data domain.
So our agenda for today: we’re going to talk about what Zillow is, [inaudible] not many of you might know. We are going to talk about why to monitor data quality, the data quality challenges, the main topic, which is Luminaire and scaling, and we are going to wrap up with some key takeaways. Okay. So let’s begin.
So let me first talk about Zillow. Zillow is an online real estate company, which is re-imagining real estate to make it easier to unlock life’s next chapter. Now, this is possible by offering customers an on-demand experience for selling, buying, renting, and financing with transparency and nearly seamless end-to-end services.
Now, this is possible with the different brands and businesses that are available under Zillow Group’s umbrella, which you can see in the top right section. We are right now the most viewed real estate website in the United States, and our Q4 2020 charts show the same. We had around 201 million average monthly users and 2.2 billion visits, and we have more than 135 million homes available on our website.
So let’s go to the next topic, which is why we should monitor data quality and why it is important for Zillow. As I said, Zillow is all about homes, and we are a real estate company. So let’s take the example of a home details page for a specific home. Over here, you can see information related to bedrooms, bathrooms, square feet, and address, plus all the images that we show. So you can see data is very important to us, and bad data will lead to bad decisions and broken customer experiences.
Along with that, we share information related to the Zestimate, which estimates the price of each home based on various factors. We have a service called Zillow Offers, which is an easy and hassle-free way for customers to sell or buy a home. We also have our Zillow Premier Agent services, which help customers connect to agents. And we have our economic research team, which publishes a lot of market research analysis on our website, along with various other teams.
Now, apart from the customer-facing services, we also have a lot of ML- and AI-based services, which are surfaced either internally or externally: for example, the Zestimate, recommendations, and so on. We want to make sure they perform reliably and that the data used by those systems is of high quality. Overall, it is very important for us to monitor data quality, and I assume it is similar in your organization.
So then let’s talk about what an anomaly means and why it is important. Let’s take the example in this time series chart. The y-axis is the observed values and the x-axis is time. Over here, you see there is a certain spike. An anomaly can be any data instance or behavior that is significantly different from its regular pattern, which is pretty obvious. And why is it important? Because these kinds of anomalies are inevitable. Anomalies are going to happen within your data, and we just want to make sure that you detect them. They are going to be complex, because your time series might have different seasonalities, there might be certain spikes or drops, or there might be changes in your trend. And all of this is time-sensitive too, because you want to catch anomalies as early as possible to avoid longer downstream effects. Overall, it is very important to catch anomalies, and it also helps you make better business decisions.
So let’s talk about various ways to monitor data quality. There are two main ways. The first is rule-based checks. Rule-based checks are very easy: you might have various domain experts who either generate or use these data sets, and they have sets of predefined rules or thresholds that they can set on the data. For example, the percentage of data should be less than 2% for a given metric, and so on. They are less complicated to set up and easy to interpret, and they work well when the properties of the data are simple and remain stationary over time. But what happens when you cannot apply rule-based checks to some of your metrics?
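A rule-based check of the kind described here can be sketched in a few lines. This is only an illustration: the function name, the sample readings, and the reading of the 2% rule as "the observed percentage must stay below 2" are assumptions, not part of Zillow's system.

```python
def rule_based_check(observed_pct: float, max_pct: float = 2.0) -> bool:
    """Return True when the observed percentage stays below the fixed threshold."""
    return observed_pct < max_pct


# Hypothetical daily readings for one metric, in percent.
daily_pct = {"2020-06-06": 0.8, "2020-06-07": 1.1, "2020-06-08": 3.4}

# Collect the days on which the rule was violated.
violations = [day for day, pct in daily_pct.items() if not rule_based_check(pct)]
print(violations)
```

The threshold is fixed by a person, which is why this style works only while the metric stays simple and stationary.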
That’s where the second kind comes in, which is machine learning based data quality checks. These checks are rules that are set through mathematical modeling: for example, models that learn from the historical data and produce predictions, or that do anomaly detection on a future value. They work really well when the properties of the data are complex or change over time, as in the example I showed before. And they are a more hands-off approach, because you don’t need much domain expertise. They learn from the historical trends or any other external parameters that you supply. Okay. So let’s go on to the various data quality challenges, and I’m going to hand it over to my colleague Sayan. Thank you.
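To contrast with a fixed rule, here is a minimal sketch of a check whose threshold is learned from history. It uses a plain z-score against the historical mean and standard deviation, which is a deliberately simple stand-in and not one of Luminaire's models; the function names and data are hypothetical.

```python
from statistics import mean, stdev


def fit_baseline(history):
    """Learn a simple baseline (mean and sample standard deviation) from history."""
    return mean(history), stdev(history)


def is_anomalous(value, baseline, z_threshold=3.0):
    """Flag a value whose z-score against the learned baseline exceeds the threshold."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


history = [100, 102, 98, 101, 99, 103, 97, 100]
baseline = fit_baseline(history)
print(is_anomalous(101, baseline))
print(is_anomalous(150, baseline))
```

Nobody chose the acceptable range by hand here; it follows from the data, so retraining on newer history moves the range with the metric.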
Sayan Chakrabor…: Hey, thanks, Smit, for such a great introduction. Hi guys. My name is Sayan, and I’ll be talking about the different challenges of data quality monitoring. First of all, data quality is a highly contextual problem, and in general the context depends on the use case. For example, monitoring user traffic over a short time window to understand the status of a website is one context, whereas observing the same user traffic over a longer-term time window to understand business growth or something similar is another.
Also, data quality depends on the reference time frame. It changes a lot when you observe it over a shorter time window versus a longer one. One good example is COVID. If you take a time frame from within COVID, it might not look anomalous unless you take a reference time frame from the past few years.
Also, temporal patterns depend on lots of internal and external factors. All the external factors need to be considered in the model for it to understand whether a spike is anomalous or not. For example, user traffic might spike because of a product launch or a marketing campaign; if that is not known to the model, it would be flagged as an anomaly.
So let’s talk about the challenges that we face due to the contextual problems I just mentioned. All of this context from the previous slide creates challenges for the modeling part, because there is no one model that can fit all of these problems. There are different time series with different patterns, and you need different models for them.
Also, the definition of an anomaly changes for different levels of aggregation. For example, you might want to observe user traffic per second, or you might want to observe user traffic per day, and the pattern of anomalies changes a lot in terms of the modeling context. All of this contextual information and the corresponding technicalities require strong expertise, not only from the machine learning side but also lots of domain experience. For example, you might need to know economics or business, and from the software perspective you need to understand data engineering in order to add anomaly detection for your data. That creates a huge friction point for PMs or data engineers who want to monitor their data and get high-quality data.
Another obvious challenge is scalability. You want a system that is scalable enough to support large amounts of data across teams. All of these things created the motivation to build a central data quality platform, not only to democratize access to high-quality data across teams but also to create a data quality standard in the company. So before diving deeper into the system that we built to do anomaly detection at our company, let’s create a wishlist of requirements for our system.
First of all, given all the contexts and scenarios the data can come from, the system should be able to catch data irregularities across different kinds of contexts. Then the system should be able to scale to support a wide range of data from different teams, because there are different services processing enormous amounts of data every day. Then the system should require minimal configuration and should handle all the complexities related to anomaly classification automatically. And finally, last but not least, the system should require minimal maintenance over time. This is very important for time-dependent data, keeping scalability in mind, and it also reduces intervention from developers, which in turn improves resource allocation for the company.
So before we started building this platform, we looked at several existing solutions, and we couldn’t find anything that met all four criteria. That created the motivation for building our own internal anomaly detection platform, which we call Luminaire.
So let’s dive deeper into Luminaire and understand what it is. Luminaire is Zillow’s internal anomaly detection platform that performs automatic anomaly detection and is also self-aware. We recently open-sourced the Luminaire modeling library, which we also call Luminaire, but that only contains the models. In the next few slides, I’ll go over some examples of how we built our system around this modeling library to automate the data quality needs within our company. Before diving deeper into Luminaire, let’s understand some key features. Luminaire is a time series anomaly detection platform that is capable of data profiling and preprocessing. It is integrated with different kinds of models for the different scenarios that I mentioned before.
It is capable of running for both batch and streaming use cases. It has an AutoML layer built in to optimize all the configurations. And finally, it is not only proven to work well in different contexts but has also been shown to outperform many existing solutions in different use cases. You can use the GitHub link and tutorials from the slides to go deeper into how Luminaire works and how you can use it for your system.
Another good resource for understanding Luminaire’s deeper mechanisms is the scientific paper that we published at the IEEE Big Data conference last year. So let’s get a high-level understanding of the Luminaire components. At a high level, Luminaire has two main components: one is the training component and the other is the scoring component.
The training component consists of different subcomponents that can be called layers. At the very bottom, we have the data profiling and preprocessing layer, followed by the modeling layer, which can be called for both batch and streaming use cases. And finally, we have the automation layer that optimizes the configuration of the bottom two layers.
And then we have the scoring component, which pulls the right model based on the identifier for the given time series and scores a given time point or time window. I’ll be going over a couple of code examples in the next few slides, and all the code examples are taken from our GitHub. So you can always go back and revisit them in the future for reference.
So first is the data profiling and preprocessing layer. This layer prepares the data before sending it for training. It performs all the transformations needed for anomaly detection, and it includes imputation if the data has missing values. This layer also performs some data profiling, such as whether the time series has had a change point in the past or whether there is a trend change, which is integral information about the time series and can also be leveraged at training time.
So you can see, you can use the data exploration class from Luminaire, pass your configuration, call the profiling function, and get all the profiling information along with the preprocessed data.
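The slide code itself is not reproduced in this transcript. As a rough, library-free sketch of what a preprocessing and profiling step does, the example below fills interior gaps by linear interpolation and returns a small profile alongside the cleaned series. The function names and the profile fields are hypothetical and do not mirror Luminaire's API.

```python
def impute_linear(values):
    """Fill interior gaps (None) by linear interpolation between known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant step between two known points that bracket a gap.
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def profile(values):
    """Return the imputed series plus a small profile of the raw data."""
    missing = sum(v is None for v in values)
    summary = {
        "n_points": len(values),
        "n_missing": missing,
        "fill_rate": 1 - missing / len(values),
    }
    return impute_linear(values), summary


raw = [10.0, None, 14.0, 15.0, None, None, 21.0]
clean, summary = profile(raw)
print(clean)
print(summary)
```

Returning the profile together with the cleaned data matches the shape described in the talk, where profiling information travels with the preprocessed series into training.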
Next is the modeling layer. The modeling layer consists of different types of models that support different scenarios, and it is capable of running models for both batch and streaming use cases. In this example, we have the Luminaire structural model, which can understand time series patterns based on the historical structure of the time series. You can use Luminaire’s LAD structural model class, pass your hyperparameters, and call the training function to do the training.
Similarly, for streaming data use cases, Luminaire has a window density model. You can use the window density model class and pass your time series, and it will run the training based on the configuration you specified.
Finally, at the top of the training component we have the AutoML layer. The AutoML layer consists of an optimization algorithm that can optimize all the data transformation steps, the data truncation step, the selection of the model, and also the model parameters that I mentioned on the previous slides. You can use Luminaire’s hyperparameter optimization class to get the best configuration for a given time series.
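The idea behind configuration optimization can be illustrated with a toy search. The sketch below tries a few candidate window sizes for a moving-average forecaster and keeps the one with the lowest backtest error. The search space, the forecaster, and all names are assumptions made for illustration; Luminaire's optimizer covers far more (transformations, truncation, model choice) and is not implemented this way.

```python
from statistics import mean


def one_step_mae(series, window):
    """Mean absolute error of a one-step-ahead moving-average forecast."""
    errors = [
        abs(series[t] - mean(series[t - window:t]))
        for t in range(window, len(series))
    ]
    return mean(errors)


def optimize_window(series, candidates):
    """Return the candidate window with the lowest backtest error."""
    return min(candidates, key=lambda w: one_step_mae(series, w))


# Hypothetical series with a period of 2: alternating low and high values.
series = [10, 20, 10, 20, 10, 20, 10, 20, 10, 20]
best = optimize_window(series, candidates=[1, 2, 3])
print(best)
```

The point is that the configuration is chosen by an objective measured on the series itself, so no one has to tune each metric by hand.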
At scoring time, Luminaire pulls a model object, and you can run the scoring function to score a given time point for the batch scenario. In this example, you can see we scored for 2020/6/8: we passed in a data point, and it generated lots of information such as anomaly probabilities, confidence levels, predictions, and so on.
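To make the shape of a scoring result concrete, here is a minimal sketch that scores one observation against a prediction under a normal error assumption. The numbers, the field names, and the normality assumption are all illustrative; they are not Luminaire's output format or method.

```python
from statistics import NormalDist


def score(value, predicted, std_err, confidence=0.95):
    """Score one observation against a model prediction, assuming normal errors."""
    dist = NormalDist(mu=predicted, sigma=std_err)
    # Two-sided probability mass lying closer to the prediction than the observation.
    anomaly_prob = abs(2 * dist.cdf(value) - 1)
    # Half-width of the central confidence interval around the prediction.
    half_width = NormalDist().inv_cdf(0.5 + confidence / 2) * std_err
    return {
        "prediction": predicted,
        "lower": predicted - half_width,
        "upper": predicted + half_width,
        "anomaly_probability": anomaly_prob,
        "is_anomaly": anomaly_prob > confidence,
    }


# Hypothetical observation five standard errors above the prediction.
result = score(value=2000, predicted=1500, std_err=100)
print(result["is_anomaly"], result["lower"], result["upper"])
```

Reporting a probability instead of only a yes/no flag is what lets a downstream alerting layer choose its own sensitivity.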
Also, in the streaming scenario, instead of scoring a time point you can score a window. You can take a trained model for the streaming use case and pass a time window to the scoring function. You get the time window scored, and this gives you information like anomaly probabilities and so on.
So with that, I’ll pass it over to Smit again, and he will walk you through how we scaled our system to support a wide range of data sets across the company. He will also walk you through how we achieved self-awareness within our platform. Thank you.
Smit Shah: Okay. So let me go through the various scaling challenges that Sayan mentioned and how we are able to solve them. This is the architecture that we are using within Zillow, and you can see how we are leveraging our open-source Luminaire package; it is also something that you can incorporate and build systems around. Let me break this down into four sections. Also, for everyone, there’s a link at the bottom that you can use to refer to our blog, which has more details.
Okay. So let me go to the very first part of this architecture, which is the training data that we use to generate the trained models. In this case, we are going to prepare our input data set. Usually you don’t want to monitor only one metric at a time. There are always going to be cases where you want to monitor, say, K metrics, and that’s where all this scaling comes into play.
So what we want to do is divide the whole training data set into K time series and give those as input to training. Each of these K time series is initially assigned a unique identifier. This unique identifier becomes the key component when we store all the information related to the model and the scoring results in our system and map them together. So once you have prepared how to distribute your training data, let’s go through the training process.
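Splitting a flat data set into K identified time series might look like the sketch below. The identifier scheme (a short hash of the metric name and its dimensions) and the row layout are assumptions for illustration; the talk does not say how Zillow builds its identifiers.

```python
from collections import defaultdict
import hashlib


def series_id(metric_name, dimensions):
    """Build a deterministic identifier from the metric name and its dimensions."""
    key = metric_name + "|" + "|".join(
        f"{k}={v}" for k, v in sorted(dimensions.items())
    )
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def split_by_series(rows):
    """Group flat (metric, dimensions, date, value) rows into one list per series."""
    grouped = defaultdict(list)
    for metric, dims, date, value in rows:
        grouped[series_id(metric, dims)].append((date, value))
    for points in grouped.values():
        points.sort()  # ISO dates sort chronologically as strings
    return dict(grouped)


rows = [
    ("page_views", {"site": "web"}, "2020-06-02", 120),
    ("page_views", {"site": "app"}, "2020-06-01", 80),
    ("page_views", {"site": "web"}, "2020-06-01", 110),
]
series = split_by_series(rows)
print(len(series))
```

A deterministic identifier means the training job and the later scoring job derive the same key independently, which is what lets models, configurations, and results be joined afterwards.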
Now this is our training process, which is the actual AutoML part. This process takes place for all of your K metrics. Each of our metrics goes through this entire cycle, and that is something we also want distributed.
So let’s take the example of time series one. When we first load the historical data for time series one, since this is the very first time, we don’t have any historical information about scoring or models for that specific time series. So in this case, we first go through configuration optimization, for which we use our open-source package, which Sayan also explained. Using that configuration optimization, we figure out the right model to use from the suite of models that we provide, along with the relevant parameters that need to be passed to that model. All this information is then stored in our config storage.
The training data for time series one also goes through our data preparation process. Here too we use our open-source package, which is responsible for cleaning the data. For example, there might be trend shifts in your historical data, so you want to exclude those from your training process, along with the other benefits that it comes with.
Then we use this clean data together with the configuration that we stored in our database to run the training process. This training process generates a trained model, which we store in our model storage, and we associate it with the unique identifier for time series one. At the end of this process, we have stored K models in our database.
During data preparation we also store some data profile information in our database. After the training process comes the scoring process. Here you want to score newer data points that are coming in for this data set, so you follow a similar process, splitting the data into K time series or K data points.
Whether we’re scoring one data point or several future data points, the scoring process identifies the specific time series and figures out the unique identifier associated with it. It pulls the relevant model stored in our database, which will be the most recent model at that time, and does the scoring. The scoring results are then stored in our results storage database. We also have alerting services on top of the scored results because, as Sayan mentioned, we also output anomaly probabilities.
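The "pull the most recent model for this identifier" step can be sketched with an in-memory stand-in for the model storage. The class, its methods, and the dictionary used as a model are hypothetical; a real deployment would back this with a database or object store.

```python
class ModelStore:
    """In-memory stand-in for a model storage keyed by time series identifier."""

    def __init__(self):
        self._models = {}  # series_id -> list of (trained_on, model)

    def save(self, series_id, trained_on, model):
        """Store a model version along with the date it was trained."""
        self._models.setdefault(series_id, []).append((trained_on, model))

    def latest(self, series_id):
        """Return the most recently trained model for the series."""
        versions = self._models[series_id]
        return max(versions, key=lambda version: version[0])[1]


store = ModelStore()
store.save("ts-1", "2020-05-01", {"mean": 100.0})
store.save("ts-1", "2020-06-01", {"mean": 105.0})
print(store.latest("ts-1"))
```

Keeping older versions and selecting by training date means a retrain never has to overwrite anything; scoring simply picks up the newest entry.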
So you can set up the sensitivities based on that and alert your stakeholders accordingly. Now, there is one important pipeline here: information about the scoring performance metrics is stored in our log storage database. This becomes very important because one of the key requirements that Sayan mentioned was minimal maintenance of our models, and this is what helps us drive that. So let’s see what that looks like. The final step is using those logs, which hold our scoring performance metrics, the next time you go through the training process. Because this is time series data, your patterns keep changing, so you want to make sure you keep retraining your models on newer historical data as well.
Let’s take the example of time series one again. This time, when the data comes in, the process looks at the log storage and figures out whether any of our performance metrics were not meeting our specific benchmarks. If the answer is yes, we go through the whole configuration optimization process again and store the new configuration in our database. If the answer is no, we skip that step. Either way, we do the data preparation on the newer data and clean it. Then during training, we use either the past configuration stored in our database or the newer one, and generate the model object. So this whole loop keeps going round and round, and that way we have very minimal maintenance and very little human intervention.
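The retraining loop described here reduces to one decision per time series: re-optimize the configuration only when the logs say performance slipped. The sketch below shows that control flow with toy stand-ins. The benchmark used (share of scored points flagged as anomalies) is a hypothetical choice, since the talk does not name the performance metrics, and `optimize` and `train` are placeholders for the real steps.

```python
def needs_reoptimization(score_logs, max_anomaly_rate=0.2):
    """Decide whether logged scoring performance misses the (hypothetical) benchmark."""
    if not score_logs:
        return True  # no scoring history yet, so treat it like a first run
    flagged = sum(1 for log in score_logs if log["is_anomaly"])
    return flagged / len(score_logs) > max_anomaly_rate


def retrain(series_id, data, config_store, score_logs, optimize, train):
    """One pass of the training loop for a single time series."""
    if series_id not in config_store or needs_reoptimization(score_logs):
        config_store[series_id] = optimize(data)
    return train(data, config_store[series_id])


# Toy stand-ins for the optimization and training steps.
calls = []


def optimize(data):
    calls.append("optimize")
    return {"window": 2}


def train(data, config):
    return {"config": config, "n_points": len(data)}


configs = {}
first = retrain("ts-1", [1, 2, 3], configs, [], optimize, train)
healthy_logs = [{"is_anomaly": False}] * 9 + [{"is_anomaly": True}]
second = retrain("ts-1", [1, 2, 3, 4], configs, healthy_logs, optimize, train)
print(calls)
```

The first pass optimizes because nothing is stored yet; the second reuses the stored configuration because the logs look healthy, so the expensive optimization step runs only when it is needed.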
So let’s go through an example of how we are scaling it, because we talked about K time series metrics and you need a way to handle them.
Here we are using Spark to do the distributed processing. Let’s take an example of the K training data sets. In this case we have two metrics, metric one and metric two. All the historical data and the observed values for each metric are preprocessed into a single row with its specific run date. We use Spark UDFs, which are user-defined functions: we put all the Luminaire open-source functions within the UDF and do the training. Each metric then generates a model object associated with that metric, and we store this model object. Now comes the scoring part. During scoring, we again pull the newer data points that need to be scored. In this case, you’ll see that the time series column has 42 and 42; those are the new data points that need to be scored.
We pull the relevant model object that was stored in our database for each specific metric. Then for scoring we again use a UDF with our open-source functions. The scoring generates results for each metric and each data point that was observed, and you can use the scoring results for all your downstream processes.
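The one-row-per-metric pattern can be sketched without Spark. In the example below, a thread pool stands in for the cluster and plain functions stand in for the UDFs; in Spark, `train_one` and `score_one` would be wrapped as user-defined functions and applied to each row. The toy mean and standard deviation model and all values other than the two 42s are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, stdev


def train_one(row):
    """Train a toy model for one metric from its single row of history."""
    metric_id, history = row
    return metric_id, {"mean": mean(history), "std": stdev(history)}


def score_one(model, value, z_threshold=3.0):
    """Score a new value against the metric's stored model."""
    z = abs(value - model["mean"]) / model["std"]
    return {"z_score": z, "is_anomaly": z > z_threshold}


# One row per metric, with the full history collected into a single list.
training_rows = [
    ("metric_1", [40, 41, 39, 42, 38]),
    ("metric_2", [10, 12, 11, 9, 13]),
]

# Each row is trained independently, so the work parallelizes per metric.
with ThreadPoolExecutor() as pool:
    models = dict(pool.map(train_one, training_rows))

# New data points to score, one per metric.
new_points = {"metric_1": 42, "metric_2": 42}
results = {m: score_one(models[m], v) for m, v in new_points.items()}
print(results)
```

The same observed value of 42 is ordinary for one metric and anomalous for the other, which is why each metric needs its own model and why per-row independence makes the job easy to distribute.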
So this explains how we are leveraging Spark and building an AutoML process. Let’s talk about other things that are already in our system and that our listeners can also do, which is integrating with central data systems. We wanted a centralized place for all the information about the data, the metrics, and the health of the data. One of the things was creating a self-service UI, because we wanted to make it easier for anyone within our organization to create a data quality job.
So it’s just one click and one form for them to fill out, and we do all the orchestration behind it. As I said, we want to surface all of our health metrics related to scoring on our central UI and tie them to the table that the monitoring is associated with, to bring more transparency. We want to tag all the producers and consumers of that specific data to give us a clear picture of who needs to be notified if there are any anomalies. One very important feature is smart alerting, because setting up the thresholds for your alerting is always tricky. Our users always find it tricky to decide: should I be alerted if the anomaly probability is greater than 95%, or greater than 98%, or something else?
So what we have done is abstract that process away from them. We use data-driven decisions to figure out the sensitivity for them. This helps us reduce some work for them and avoid sending too many wrong alerts.
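The talk does not describe how the sensitivity is derived from data. One plausible approach, shown purely as an assumption, is to pick the alert threshold from past anomaly probabilities so that the historical alert rate stays within a budget. The function name, the budget, and the floor value are all hypothetical.

```python
def pick_alert_threshold(past_probabilities, max_alert_rate=0.05, floor=0.9):
    """Choose a threshold that keeps historical alerts within a budget."""
    ranked = sorted(past_probabilities, reverse=True)
    budget = int(len(ranked) * max_alert_rate)  # number of alerts allowed
    if budget >= len(ranked):
        return floor
    # Alert only on scores strictly above the first score outside the budget,
    # and never go below the floor.
    return max(floor, ranked[budget])


past = [0.10, 0.35, 0.50, 0.62, 0.70, 0.81, 0.93, 0.96, 0.97, 0.999]
threshold = pick_alert_threshold(past, max_alert_rate=0.2)
alerts = [p for p in past if p > threshold]
print(threshold, alerts)
```

With this kind of rule a user states how many alerts they can tolerate instead of guessing between 95% and 98%.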
So let’s go to some of the future directions for our system as well as our open-source package. As we discussed, our open-source package currently only supports time series, but we want to go beyond that, so we will be working more on that. We will also be happy to take more contributions from the community. We want to build detection systems not just for the data but also for our ML pipelines. As I said, we want to make sure the ML pipelines are reliable too, and we want to monitor their performance as well.
What we’ve talked about so far was only detection, but we also want to go another step by diagnosing the problems and figuring out the fix. That’s what we want to incorporate as root cause analysis. And finally, as Sayan mentioned, for some of the anomalies that happen, the user might have more feedback or understanding about why this is an anomaly. So we also want to incorporate user feedback to get that labeled information.
So let me wrap up what we have talked about so far. We talked about Luminaire as our platform and also Luminaire as our Python library, which we have open-sourced and which supports anomaly detection for a wide variety of time series patterns and use cases. We also talked about a proposed technique for building a fully automated anomaly detection system that scales for big data use cases and requires very minimal maintenance.
Thank you, everyone, for attending our presentation. We’ll go through your questions right now. I just wanted to let everyone know that Zillow is growing and doing a lot of hiring right now, and we also have some open positions on our team. If you’re interested in this type of work, I encourage you to apply or reach out to us on LinkedIn. Thank you very much.
Sayan is a senior applied scientist in the Zillow A.I. team. Sayan’s role is positioned in the intersection of ML and software engineering where he is building the centralized ML system to automate ...
Smit is a data and software engineering enthusiast. Currently working as a Senior Software Engineer, Big Data at Zillow where he is building centralized data products and democratizing data quality. H...