Anomaly Detection at Scale!

May 27, 2021 11:00 AM (PT)

Download Slides

We all know how to create ML models, but the path to turning them into a highly scalable easy to use system by users is not always clear. What happens when you need to run thousands of them, on many different datasets, simultaneously and at a huge scale? AND, do it reliably so you can sleep well at night!!


To achieve exactly that, we’ve decided to go down the serverless route and build an anomaly detection system on top of it. We’ll go over the pros and cons of building such a system using serverless and when such an approach could work for you. 


Our SpotLight anomaly detection system is capable of easily reusing ML models, and scale to run millions of time series simultaneously with ease. Our system eliminates manual work and allows our end users with no scientific background to set anomalies to detect in a plug and play way and get alerts in no time.


In this talk, we’ll walk you through the architecture and share useful ideas you can adopt and implement in your own projects.

In this session watch:
Opher Dubrovsky, Big Data Team Lead, Nielsen
Max Peres, Developer, Nielsen



Opher Dubrovsky: Hi, everybody. This talk is about Anomaly Detection at Scale. Before we start, I want to talk about a little problem with data. This is Waldo. If I ask you where Waldo is now, it’s obviously very easy to find him, but if I ask you where Waldo is now, it’s much more difficult. You have to put more work into it and search through a crowd of people. So one of the problems with data is scale. At small scale, lots of things are really easy to do. At larger scale, they become a tougher problem to crack. If you look at our data, we have a lot of data. And within it, we have lots of hidden anomalies hiding inside the data and our users expect us to help them find those anomalies in the data. Currently in our anomaly detection system, which we talk about, we have over 1.1 million time series that we constantly run and detect anomalies within them.
So in this talk, we’re going to talk about detecting anomalies, we’re going to talk about architecting for costs and for scale, we talk about predicting the future, some models, and give you a summary, a recipe you may want to use in your use cases. Before we start, I want to introduce ourselves. My name is Opher Dubrovsky. I’m a Big Data Dev Lead at Nielsen. And I think serverless is the revolution. And with me is Max Peres, a Big Data Engineer. And in his view, the devil is in the details. We are part of Nielsen Marketing Cloud, which is a data management platform, DMP for short, and we build marketing segments and device graphs. Our data is used for running campaigns and making business decisions. In a nutshell, we’re cloud native, we run predominantly on AWS. We use a lot of Spark, we run many Spark jobs every day and lots of Lambda functions and other serverless technologies.
In total, we store about five petabytes of data and process new data in roughly in the order of about 60 terabytes every day. In 2019, we were asked to alert on anomalies within our data. Our users found that in many cases, they have a hard time tracking datasets that we have in the system. And in many cases, they find about problems in the data after customers complain. And this is obviously not a great situation to be in. So we set out to build an anomaly detection system and make sure it can run on a huge scale of data. We started with a pilot, ended up going in with a beta into production, and then spend time a doing scale improvements, and redesigns. In terms of the amount of data, when we started out, we had 15,000 time series and this grew over time. And since today we have over 1.1 million times series, and this is growing daily. In total, we have about 150 million data points in the system.
So how do you detect anomalies? Let’s look at an example. Let’s say I was asked to predict the weather at noon on some particular day. I can look back at the weather since the morning, and I would see that it was sunny and warm throughout the previous hours. So very likely I would predict that at noon, it could also be sunny. We could wait till noon and then check the weather. And if it’s sunny, then we would conclude that my prediction was okay and everything is great. We can do this again and again. And then if it’s some hour, suddenly it turns, the temperature drops, it becomes very cloudy and dark, and it starts snowing, that would obviously look very odd.
And in this case, if I was running an alert service, I would probably send an alert to people. Hey guys, something has changed in the weather, beware. So we would determine this is an anomaly. So in a nutshell, detecting anomalies involves creating some prediction model, predicting the future, and then comparing it to the actual values when the time comes. And this is all there is to it. So we started a pilot within a naive architecture. We built it on a Postgres relational data storage. The reason we did this, it was really quick and easy to get started, and we could learn a lot about the user needs. So this was awesome for a POC. This was also great for a beta.
It started to become a bottleneck at scale and became a bottleneck for two reasons. One scaling up and putting a lot of data and lots of queries on the relational database started becoming a bottlenecks and we had slow downs. And second, a relation database is not very cheap. So as you store more and more data and need to enlarge it, it’s obviously costs more money. So to move to the next step, we rearchitected the system and let’s talk about how we did it and how we achieve scale and very low cost. The first decision we made was to go all out on serverless. We decided to use Lambda functions as the basic compute unit, on the functions are the serverless compute unit of AWS. We chose lazy execution, which means we don’t process any data until it’s needed. And the reason for that is we don’t want to process data and then discover nobody really needed it and we just wasted resources. And we moved the storage of the data from a relational database to low cost S3 storage. At the end of the day, we got very low costs and endless scale.
Let’s look at the architecture. We have a work manager Lambda that wakes up, checks which jobs need to get processed and which data arrived and needs processing, creates a task that is basically a message that’s sent on a queue. And that invokes a worker Lambda that does all the work. The work Lambda calculates the new data that comes in, does any aggregations or cleaning if it’s needed, it reads the previous values of the time series, does the predictions and stores that back. And if it finds anomalies, it will report the anomalies. So if you look at our unit of work, a Lambda gets a message with the time series that needs to and where to find them. It reads the configuration for those time series, which tells it how to aggregate the data, if there’s any other cleaning that’s needed. And the model to use, reads the time series files for those time series from S3, does all the predictions and calculations and stores back the results.
And if it finds anomalies, we still use Postgres, but only for a little sliver of data of the anomalies. So we store them in Postgres for local purposes. But again, this is a very small amount of data, so it’s not a big deal anymore. So what’s the secret to scale and low cost in our system? We use low cost storage, S3, we use low cost processing, which is Lambda functions, and they terminate once the processing is completely terminate and we no longer pay for them. And we get endless scale because we can invoke as many Lambda functions as we need as the system grows with more and more data. If you look at our costs for 10,000 time series, running a prediction once a day, we pay a mere $2.70, as you can see, storage is very negligible and the DB is also cheap because we don’t store much in it. So we just store a small set of the anomalies. So in total, for 10,000 time series, we pay $4.36 a month, which is obviously negligible.
All right. So let’s move on to talk about predictions and models and to do that, I’m going to pass it on to Max. Max?

Max Peres: Thank you, Opher. All right. So let’s talk about the predictions. Prediction is very difficult, especially about the future. If you would ask people back in the 50s, how would the future will look like? They will probably say that they go on to have flying cars back in their garages. But today, currently when we already in this future point, we see that they were wrong. And the reason for that, is that it’s very hard to predict so far into the future. And this is what we look, actually see in this chart, the longer the prediction periods, the higher the error rate will be. And, but in our case, just like Opher told you in the weather demonstration, we don’t need that, we use short prediction. So we get only small errors. All we need is the next observation in order to identify, we have an anomaly. Right?
Okay. So let’s see how does our algorithm flow works. So those numbers actually can be everything, right? They can be web traffic, they can be sales, or they can be anything that you can imagine, the algorithm will work in any case. So we take the first end observations, and training our model, and predicting the first value. This one is 11, and then we just compare it with the real value, it’s 10. So between them, the standard deviation way boundaries that is allowed, so we want to alert on anything. And then in the sliding window mode, we take the end next observation, and again, training our model and predicting the next value. And again, here with calculating the difference between the real value and the predicted one, and we see that there is no difference. And again, let’s take the third one. Again, we are predicting the number 10 and the real value is 18. But in this case, the standard deviation is higher in the allowed the boundaries. And therefore we will alert on this observation as an anomaly.
Great. So now we understand how does our algorithm works, but which model should we use? There is hundreds of them. And if there is a huge variety of libraries that we can support, where for example, we use a currency, the stock model library, but there is a lot of more that we can choose. Luckily, our infrastructure built in the plug and play way. So all we need is to import the needed library with the model that we want to implement, extend some basic apps, and implement few simple methods. And that’s it, the time for production is super fast. All right. So now let’s take some a real life example of a web traffic time series. And only by looking on the chart, we can already identify the anomaly, right? It’s here, but let’s see how can we choose the right model in order to feed it, so we could identify this anomaly.
First, we need to decompose the signal into three components. We have the seasonality component, the trend component, and the noise. So the seasonality, it’s actually, it’s a cycle that is repeated of a trend, goes up and down in a certain period of time. For example, if we want to predict tomorrow sells, the most naive way to do that is take today’s observation and say that tomorrow going to be the same, right? But what if tomorrow is a weekend date. It means that the sales are going to be much more higher than today. That’s why we need to take into consideration the last week weekend and take the, some way weight from that, or is it going to be holidays? So maybe we should take a last year’s holiday period and take away some weighted value from last year’s observation.
And this is why it’s important to identify the seasonality component in our signal. The next one is the trend. The trend is actually, it’s the growth rate of our signal. It can grow linearly, for example, by some certain percentage between the observation or it can grow exponentially if it’s in a double himself, every observation. This one is very important to understand because there’s some of the models to support trends and some others are require stationary data. And obviously you can fix your data in order to be stationary, but it’s very important to identify whether you have the trend or not. And the last and not the least is the noise component. Let’s take an example of measuring the weather. Let’s say that currently we have a 25 degrees of Celsius. We take some measurement device and it would never give you the exact number of, it will probably show you 24.9, maybe 25.1. But if the device will be legit, all the observation, the variance will be distributed there normally around the real number, the 25 degrees of Celsius.
And so actually the residual between the observed value and the real one, this is the noise because it’s unexplained. We do not know where, what is the reason for that noise. And this component is very important in our case, because if our noise component distributed, normally we can test it by, we have a certain test for that. It means that our model is actually have all the needed feature in order to forecast the future. And all we need is to add the noise component to complete it and we were very, very precise. And again, there is really some specific models that do support noise, if it’s distributed normally. After we decompose the signal and they chose the right model that feeds our specific time series, we can see how does it, they perform.
We see that our model actually takes a couple of days behind and it’s actually, it’s expecting to have the same pattern that it has lack in the previous days, but the real that actually have some decrease in the data. So that’s why it’s actually shows an anomaly on this specifically observation. All right, trust, how can we actually gain trust with our end user? We have understood that no matter how good your infrastructure is and how precise your model, we still need to gain trust with our end user in order for him to really work with our infrastructure. And the solution for that was that we have implemented the push notification via Slack. With all the statistical data that the end user would like to have. But we have, we saw that it’s very hard to imagine and see with all this data, how does, what’s really happening with your data, only by see some message.
So we’ve added a picture and as you know, one picture equals a thousand words. And by looking on the same picture, you getting so much more information about the anomaly that you see, or the difference between the real data and the predicted one. And by that, the end user could actually believe in our system and they started to use it.
Okay. So let’s summarize and see what we have talked about. We talked about how to keep our costs low by using a cheaper components, how to design our system to be scalable and their support in endless scale. The importance of gaining the trust with our end user and how to fit the right model for our specific data. And here’s the recipe of how to do it. Low cost, get the loose, all those expensive parts, such as their relational data stores. And in our case, we use the serverless AWS Lambda infrastructure, and save all our data in S3 buckets. Scale, we took all our workload, split it into small unit of work so the cost would be steady and the scale will be endless. Trust, build some feature that we would be able to gain trust with our end users, such as our push notification to see all the needed information. And models, by using some plug and play infrastructure that we can implement in no time, some new models, the time to production goes super fast and makes our infrastructure very good. Thank you very much. Don’t forget to rate us high. Thank you.

Opher Dubrovsky

Opher is a big data team lead at Nielsen. His team builds massive data pipelines that are cost effective and scalable (~250 Billion events/day). Their projects run on AWS, using Spark, serverless L...
Read more

Max Peres

I am an experienced Data Engineer with a history of big data and machine learning projects. I love to tackle big data and scale challenges and find ways to run the same workloads at 1/10 the processin...
Read more