AI Modernization at AT&T and the Application to Fraud with Databricks

May 26, 2021 03:50 PM (PT)


AT&T has been involved in AI from the beginning, with many firsts: “first to coin the term AI”, “inventors of R”, “foundational work on Conv. Neural Nets”, and more. We have applied AI to hundreds of solutions. Today we are modernizing these AI solutions in the cloud with the help of Databricks and a variety of in-house developments. This talk will highlight our AI modernization effort along with its application to fraud, which is one of the applications that benefits most.

In this session watch:
Mark Austin, VP, Data Science, AT&T Chief Data Office
Prince Paulraj, AVP, Data Insights, AT&T Chief Data Office



Mark Austin: Okay, greetings. This is Mark Austin, and I’m here with Prince Paulraj. And we’re going to talk a little bit about our journey on AI modernization, and the application to fraud with Databricks. To start out, we’ll start out with a little bit of history on AT&T and our history of AI. I’ll give you a little bit of background on the fraud problem, and why it’s so challenging, and why we need something like Databricks. And then we’ll talk about different pieces of the strategy. Think of these as different pieces of the modernization in terms of creating AI, which is creating models and features, deploying and serving it to make sure it’s real time, monitoring it, and then governing it. That will be all about the bias and explainability. And then we’ll wrap up with some conclusions at the end.
Let me just do a quick fly over of AT&T’s history in AI. This is part of our modernization. You’re going to see some AI stuff here, some technology stack as well. But you could even go back to the 50s. Claude Shannon started looking at AI to solve the chess problem, auto-playing chess. And then in 1955, the actual term artificial intelligence, it was AT&T, IBM, Harvard, and Dartmouth that first coined that term. And then in the 70s, some of the technology stack that we all know and love, Unix, C, C++, of course S which turned into R, first one of the statistical programming languages for data science was out of the 70s.
And then in the 1980s and 90s we have the neural network foundational work, [inaudible] out of the labs was looking at convolutional neural nets there. And then in the 2000s, it’s all been about applying AI. Of course, to fraud and many other things. Of course AT&T has got wireless and HBO Max and fiber. And then in the 2000s as well, there was a pretty important hackathon. The Netflix competition was a million dollars, and Chris Volinsky out of our group and team won that competition for the Recommender Algorithm there.
In terms of applying AI, one of the most important ones to apply it to has been the fraud problem. And fraud is so big, it’s a billion dollar issue for the industry, and there’s many different things that fraudsters go after here. I’m just showing a few of these here. Of course you have the gaming fraud. This is where there’s no intention to pay. Identity theft is huge, the fraudsters getting stuff out of the dark web, getting credentials, masquerading as the customer. We got to detect that. And of course there’s the obvious bribing and impersonating the customer or getting illegal unlocks to be able to get the phones there as well.
Now technically, you have to detect all these things. And in the past, years ago, we used to use rules. And yes, we found a lot of things, you can catch those things. But it’s really been the advance of applying the machine learning and AI, which has really made a dent. Now, years ago we started with rules like you can see here. And then we started applying machine learning. You could see the fraud stops going down. That’s the line going down there with machine learning one through machine learning five. And we were so successful that we just put more on that. I haven’t even put the latest on here. But you can see the year after that, we added 20 more algorithms, knocked the fraud down. And the fraud’s dropping here. You can look, the percentage is probably down 70-80% versus just the rules alone.
Hugely successful here. But it’s all about doing this at speed, doing it in realtime. And from a technical challenge, for any kind of purchase, there’s multiple side transactions that go off. We have to monitor all of these actual transactions. And when you add these up, believe it or not, it’s about 10 million transactions per day that we have to score likelihood to be fraud on. And we have to do this fast, we have to do it in 50 milliseconds or less so that it’s not a delay for the customer. And when we’re doing this, it amounts to capturing hundreds of realtime features. API calls, driver’s license checks, a variety of other things on there. And then it’s probably four times as many batch features. All of these things, grabbing them in realtime, scoring these things, and making that call.
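To put the quoted figures in perspective, here is a rough back-of-the-envelope calculation. It only uses the two numbers from the talk (about 10 million transactions per day, 50 milliseconds per score). The peak factor is an assumption added for illustration and is not something stated in the talk.

```python
# Back-of-the-envelope load estimate from the figures quoted in the talk.
TRANSACTIONS_PER_DAY = 10_000_000   # "about 10 million transactions per day"
LATENCY_BUDGET_S = 0.050            # "50 milliseconds or less"
SECONDS_PER_DAY = 24 * 60 * 60

# Average arrival rate if traffic were perfectly flat (real traffic is not).
avg_tps = TRANSACTIONS_PER_DAY / SECONDS_PER_DAY

# Little's law: mean number of requests in flight = arrival rate x time in system.
avg_in_flight = avg_tps * LATENCY_BUDGET_S

# Hypothetical peak-to-average ratio, purely an assumption for illustration.
ASSUMED_PEAK_FACTOR = 5
peak_tps = avg_tps * ASSUMED_PEAK_FACTOR

print(f"average rate: {avg_tps:.1f} transactions/s")
print(f"average in flight at 50 ms each: {avg_in_flight:.1f}")
print(f"assumed peak rate: {peak_tps:.0f} transactions/s")
```

On average that is a little over a hundred scoring requests per second, so the hard part is less the raw volume and more gathering hundreds of features inside the 50 millisecond window.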
Now, in terms of the process, and this is what we’re going to dive down a bit more, doing all this in a platform like Databricks and Spark, doing this in realtime so that we can score these things is super important. And this is what our modernization is around. And if you broke it into the three components, you could say it starts with creating the AI. That’s getting the data, developing the features, building the models. And then deploying and serving it, that’s the pipeline, deploying the model. And then monitoring the AI at the end, making sure it’s doing what you intended it to do. And of course, you can’t forget the govern AI in the middle there too. We have to make sure that everything we do is done in an unbiased way, it’s explainable, it’s interpretable.
This is the overall process, and we’re going to dive down into each one of these things. We’ll show you a bit of the technology stack, a bit of our strategy, a mix of external and internal technologies to actually do this. Let me start out with create AI. Create AI, again, this is about creating the features, creating the models. Of course, Spark is amazing for that. You can do batch, you can do realtime, you can do the Spark streaming there. But then you’re actually creating the features. And this is where sharing and cataloging those in a common place becomes super important. We even look on our own team, and sometimes we’ll have two data scientists almost on the same team creating almost the same features.
We’ve found that it’s super important to get those in a shared place. Delta Lake is great for that, but even further when you want to serve that in a realtime, you need something like a feature store. You can serve online and offline, and you can serve these features up for your actual scoring of your actual models. Now, building the models is really two pieces. And that’s these boxes on the right there. Individually, a data scientist wants to be able to try many different things. They want to try different hyper-parameters, they want to try different models. We have the auto ML, stuff like H2O Driverless AI. That’s the automatic thing. But we’re also going to have our own thing.
Individually, a data scientist is going to try to get the best thing they can do. But what we found, if a model is super important like in stopping fraud where one percent could be millions of dollars of savings, you almost want to put that out to the crowd as well. And that’s the piece on the bottom right. Pinnacle, think about that, after you have the best one, you crowdsource it out. And if I go to the next slide, you create a competition or a collaboration, and you put this model out for people to compete on. Think of this as our own internal Kaggle where everybody’s competing on a model, different features, maybe different data, and different models to get the best it can be.
This has been hugely successful for us. We’ve found on some of these important models, on average, we’ve done over 200 of these competitions, we’ve improved the accuracy by about 29%. We got about 1,100 people that are on the platform that get the notice. Not all of them show up to compete, but many times it’s not uncommon to have 70 or 80 people competing on an algorithm here. Of course, we have the auto ML bots. H2O Driverless AI is one. And we’ve benchmarked about 4,700 of these models, and that’s what’s resulted in the 29% improvement. That’s a bit of the create AI. Of course, what’s important next is to deploy and serve it. And I’m going to hand it over to Prince to talk about that.

Prince Paulraj: Thank you Mark. I’m going to talk about the next piece of the puzzle in the machine learning pipeline, that is deploy and serve AI. Before I touch on model deployment, I just want to talk about offline model training. Most of the time, our data scientists have a problem where they want to travel back in time and create the features, or sometimes backfill the features. Those are very critical components of fraud ML. We need to have that sort of capability in terms of technology, and of course Delta Lake is doing a great job of helping us there.
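The "travel back in time" idea can be sketched as a point-in-time lookup: for a training example observed at time t, only feature values known at or before t may be used. The class below is a minimal in-memory illustration, not Delta Lake's time travel and not AT&T's implementation. The class name, entity ids, and timestamps are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


class FeatureHistory:
    """Keeps every value a feature has had per entity, ordered by timestamp."""

    def __init__(self):
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, entity_id, ts, value):
        # Insert in timestamp order so late-arriving backfills land correctly.
        times = self._times[entity_id]
        idx = bisect_right(times, ts)
        times.insert(idx, ts)
        self._values[entity_id].insert(idx, value)

    def as_of(self, entity_id, ts):
        # Latest value written at or before ts; None if nothing was known yet.
        times = self._times.get(entity_id, [])
        idx = bisect_right(times, ts)
        return self._values[entity_id][idx - 1] if idx else None


history = FeatureHistory()
history.write("acct-1", 100, 2)  # value as known at t=100
history.write("acct-1", 300, 7)
history.write("acct-1", 200, 4)  # backfilled later, but timestamped t=200

# A label observed at t=250 must only see features known by t=250.
print(history.as_of("acct-1", 250))
```

Reading features "as of" the label time is what keeps information from the future out of the training set.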
Now talking about the model deployment: once the model has been built, it needs to be recorded. We need to understand the modeling needs. Which features are actually going into the model is very important. And then we version those models. That gives us an opportunity, when we’re working in an A/B model framework or champion/challenger mode, to roll back and switch versions. It’s very important for us to track the version of the model, and MLflow and H2O MLOps are the tools that really help us with this part of model deployment.
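As a rough illustration of why versioning matters for champion/challenger and rollback, here is a tiny in-memory registry. It is a sketch under stated assumptions: it does not use the MLflow or H2O MLOps APIs, and the class, method names, and toy models are hypothetical.

```python
class ModelRegistry:
    """Toy registry: records each model version with the features it needs."""

    def __init__(self):
        self._versions = []          # version number = list index + 1
        self._champion_history = []  # versions in the order they were promoted

    def register(self, model, feature_names):
        self._versions.append({"model": model, "features": tuple(feature_names)})
        return len(self._versions)

    def promote(self, version):
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown version {version}")
        self._champion_history.append(version)

    def rollback(self):
        # Drop the current champion and fall back to the previous one.
        if len(self._champion_history) < 2:
            raise RuntimeError("no earlier champion to roll back to")
        self._champion_history.pop()
        return self.champion

    @property
    def champion(self):
        return self._champion_history[-1] if self._champion_history else None

    def features(self, version):
        return self._versions[version - 1]["features"]


registry = ModelRegistry()
v1 = registry.register(lambda row: 0.1, ["orders_30d"])
v2 = registry.register(lambda row: 0.2, ["orders_30d", "account_age_days"])
registry.promote(v1)   # v1 is the champion
registry.promote(v2)   # the challenger wins and becomes champion
registry.rollback()    # v2 misbehaves in production, so switch back
print(registry.champion, registry.features(registry.champion))
```

Recording the feature list next to each version is what makes a rollback safe: the serving layer knows which inputs the restored model expects.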
The next one to talk about is the model online scoring. Just as important is doing this lightning fast; things need to happen in 50 milliseconds. The online scoring is very important, because we might pre-compute some of the features offline, but store them in an online feature store like Redis. Then we also enable some of the streaming features and actually compute them at runtime, right? It’s very important for us to provide high scalability and lightning-fast speed. If you look at the offline feature store and the online feature store, what we have at AT&T is called Atlantis. That’s the enterprise-level feature store that’s really helping us, syncing the data between online and offline and serving it at runtime. That’s really helping us a lot.
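The scoring path described here (look up pre-computed features, merge in features computed at request time, score, and check the latency budget) can be sketched as follows. A plain dict stands in for the online key-value store so the example runs anywhere. The feature names, weights, and toy model are made up for illustration and are not from the talk.

```python
import math
import time

# Stand-in for an online key-value feature store. The account id and
# feature names are hypothetical.
ONLINE_STORE = {
    "acct-1": {"orders_30d": 3.0, "account_age_days": 420.0},
}

# Made-up weights for a toy logistic model, for illustration only.
WEIGHTS = {"orders_30d": 0.3, "account_age_days": -0.005, "amount": 0.002}


def toy_fraud_score(features):
    # Missing features default to 0.0 so an unknown account still gets a score.
    z = sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def score_transaction(account_id, realtime_features, budget_ms=50.0):
    start = time.perf_counter()
    precomputed = ONLINE_STORE.get(account_id, {})   # batch features, looked up
    features = {**precomputed, **realtime_features}  # realtime values win on clash
    fraud_score = toy_fraud_score(features)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "score": fraud_score,
        "elapsed_ms": elapsed_ms,
        "within_budget": elapsed_ms <= budget_ms,
    }


result = score_transaction("acct-1", {"amount": 900.0})
print(result)
```

In a real system the lookup would be a network call, so the budget check would cover the store round trip as well as the model itself.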
Now of course, we are using Delta Lake for our offline feature store. And the fourth thing that’s very important is feature governance. Like Mark mentioned, we don’t want the data scientists on our team to recreate the same features many times, so we need to provide a metadata layer where data scientists can go and search whether some of the features already exist, and reuse them. Or maybe they announce the features. With great access control in place, and also monitoring these features with statistics, and also handling compliance and legal, those are all very important challenges that we have as part of feature governance.
I’m going to dig a little deeper into Atlantis. If you look at AT&T, we have multiple data pipelines in place. If you think about it from the machine learning point of view, we have Databricks, we have Snowflake, we have our in-house Pinnacle Kaggle-style platform, and we have H2O Driverless AI, and also Jupyter. We have different model pipelines. The data scientists can get into any one of the pipelines. Either they are consuming the data or processing the data in batch mode or realtime mode. They have to create these features. But really, we need a feature store in a centralized fashion where people can actually consume and reuse features, from both the model scoring point of view and the model training point of view.
That’s why there is a big need for us to have a centralized offline feature store. That’s really helping us and the data scientists working on different pipelines reuse features across the enterprise. One of the great benefits that we got from the feature store at an enterprise level relates to this: the old way of doing things is batch learning. Most of the time we do batch learning. You create a snapshot of static data, you split it into training and testing, and you evaluate the model. If you look at the chart here, you can look at the blue line. That’s basically the model that has been trained offline. And it’s tested, and evaluated. And look at the line, the ROC curve.
And then when you’ve actually deployed in production, you can look at the green line. It’s not performing as expected, right? That’s where, especially in the fraud cases, and it’s very [inaudible] in nature, we have to build and retrain the model even online. That’s why the concept of online learning or incremental learning comes into play. And this feature store, because we are keeping all this data offline and online in one place, that’s really helping us. That’s really enabling our data scientists to do the online learning. That’s the wonderful benefit that we have in this enterprise-level feature store.
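Online or incremental learning means the model is updated one labelled example at a time instead of being refit on a static snapshot. The talk does not say which algorithm is used, so the sketch below swaps in plain logistic regression trained with stochastic gradient descent. The class, learning rate, and toy stream are assumptions for illustration.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss for a single example is (p - y) * x.
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(n_features=1)

# Toy stream: the single feature is 1.0 for fraud and 0.0 for legitimate traffic.
for _ in range(200):
    model.learn_one([1.0], 1)
    model.learn_one([0.0], 0)

print(model.predict_proba([1.0]), model.predict_proba([0.0]))
```

Because each update only needs the current example and its label, the same loop can keep running in production as confirmed fraud outcomes arrive.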
Now, I’m going to talk about monitor AI. It is very important for us to monitor all the machine learning models in the fraud space. Because fraudsters come up with different schemes, and it can be really challenging. We have to monitor the data, the model, and also the infrastructure and the process around it. When I talk about the data, data drift is a really important thing. Because there are [inaudible] systems pushing data to your pipeline. You need to know some important, very strong features, maybe the values [inaudible] value, you’ve got to know. You need to notify your data scientists. We are using MLOps there.
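One common way to quantify data drift on a single feature is the Population Stability Index (PSI), which compares how a baseline sample and a recent sample fall into the same bins. The talk does not name the drift metric used, so treat this as one possible sketch. The bin count, the smoothing constant, and the sample data are assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline sample and a recent sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]        # training-time distribution
same = list(baseline)                           # no drift
shifted = [0.5 + i / 200 for i in range(100)]   # values pushed toward the top

print(population_stability_index(baseline, same))
print(population_stability_index(baseline, shifted))
```

A PSI of zero means the two samples bin identically, and larger values mean more drift. The alerting threshold is a tuning choice for whoever owns the model.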
And then also from the model drift point of view. The performance of the model is very important for us. When we see some sort of drift happening, we are able to visualize that and monitor the health of the model, and let the data scientists and the AI engineers know when that has happened. We really need to monitor the data, the data that goes into the model, as well as the model, what is [inaudible]. And now we talk about the infrastructure. These models are always deployed on some physical machine, either on premises or in the cloud. But at the end of the day, you have to look at the system performance as well. What is the CPU, what is the RAM, what is the I/O usage?
And these models are always wrapped as a microservice. What if the VM goes down? What if the [inaudible] connectivities are bad? You need to correlate the infrastructure along with the data and model. Then the process is something that really helps you coordinate all of this, correlate it, and create the necessary remediation or actions. And you want to provide [inaudible] functionality to our data scientists and AI engineers, so they can set the threshold for when a breach has happened and things like that. And then also, take some automated remediations. Maybe give an indication to our data scientists to retrain the model, especially during the online training piece. And sometimes you want to perform A/B testing or champion/challenger mode, or you even want to roll back the model.
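The idea of correlating data, model, and infrastructure signals against thresholds and mapping breaches to remediations can be sketched with a few rules. This is not how Watchtower works internally (the talk describes it as using machine learning for these decisions). The field names, threshold values, and action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    feature_drift: float    # drift score for a key feature
    model_auc: float        # recent model performance
    cpu_utilization: float  # 0.0 to 1.0
    service_up: bool        # is the scoring microservice reachable


# Hypothetical thresholds that data scientists or AI engineers would set.
THRESHOLDS = {"max_drift": 0.25, "min_auc": 0.80, "max_cpu": 0.90}


def choose_actions(snapshot, thresholds=THRESHOLDS):
    """Map threshold breaches to remediation actions."""
    actions = []
    if not snapshot.service_up:
        actions.append("restart_service")
    if snapshot.cpu_utilization > thresholds["max_cpu"]:
        actions.append("scale_out")
    if snapshot.model_auc < thresholds["min_auc"]:
        # Performance has already degraded: fall back to the previous model.
        actions.append("rollback_model")
    elif snapshot.feature_drift > thresholds["max_drift"]:
        # Inputs are drifting but performance still holds: retrain early.
        actions.append("retrain_model")
    return actions or ["no_action"]


healthy = HealthSnapshot(0.05, 0.93, 0.40, True)
drifting = HealthSnapshot(0.40, 0.88, 0.40, True)
degraded = HealthSnapshot(0.40, 0.70, 0.95, False)

for snap in (healthy, drifting, degraded):
    print(choose_actions(snap))
```

Looking at all three signal types in one place is the point: a drop in model quality caused by a failing VM needs a different response than one caused by drifting inputs.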
All those things are an important part from the process point of view. The workflow of actions really helps with this. What we are doing there is we are putting an AI to monitor the AI. Let me take a little deep dive into Watchtower, that is our internal platform that we built. Like I talked about on the previous slide, we monitor data, model, and infrastructure. And this is an end-to-end platform that is actually helping us do it in a realtime manner. You can look at the drift that happens, and Mark mentioned 10 million transactions get scored. You never know, the model drift happens in realtime, and one particular feature can really damage the model as well. How important that feature is is what really matters.
If you look at the right-hand side, we collect all the instrumentation, the logs from the models, the data, what features are going into the model, and how the scoring is happening in the system. We collect all those [inaudible] in the logs, and then we set up the monitoring in place. Then through this, our machine learning framework, we take the decisions. The decision will tell us to take the actions, and some of the actions are automated. We go and retrain the model, or we go and reboot the server. It depends on what sort of problem comes up. And based on the feedback, the intelligence in the models keeps learning, continuously learning about all this [inaudible] and the monitoring capabilities.
And over a period of time, that’s really getting sharper. And in a realtime fashion, when realtime transactions are happening in milliseconds, this is really helping us to take actions in [inaudible]. The last one to talk about is govern AI. This is all about understanding bias, fairness, and transparency. If you look at AT&T, we have a workflow in place. When a use case or a machine learning model comes into play, we have a framework where we first learn about what has been done in the past. And then we document it from the legal point of view, the privacy standpoint, and the process standpoint. And the real place is where we are evaluating a model, understanding what the bias is.
If we detect some sort of bias, can we mitigate that? If you mitigate it, if you de-bias the model, how well it performs after it has been de-biased also matters. And we also use some vendor tools and open-source tools to understand the model explainability, and also the data drift. Because at the end of the day, our business user needs to make a decision on whether the unbiased model is doing a good job or not, and on what sort of features are providing importance to the model prediction. All the points that I talked about are in one framework, and that’s how we govern the AI. We create AI, and deploy and serve AI, and monitor AI, and also govern AI. Back to you, Mark.
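The bias evaluation in the govern AI step can be illustrated with one simple fairness measure, the demographic parity difference: the gap between groups in how often the model flags a transaction. The talk does not say which metrics or tools AT&T uses, so this is only a sketch. The function names, group labels, and predictions are hypothetical.

```python
def flag_rate_by_group(predictions, groups):
    """Fraction of cases flagged (prediction == 1) within each group."""
    totals, flagged = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        flagged[group] = flagged.get(group, 0) + int(pred)
    return {g: flagged[g] / totals[g] for g in totals}


def demographic_parity_difference(predictions, groups):
    """Largest gap in flag rate between any two groups (0.0 means parity)."""
    rates = flag_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs (1 = flagged as fraud) for two groups.
predictions = [1, 0, 1, 1, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(flag_rate_by_group(predictions, groups))
print(demographic_parity_difference(predictions, groups))
```

Computing the same measure before and after de-biasing, alongside the usual accuracy metrics, gives a way to check both that the gap shrank and that the model still performs.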

Mark Austin: Thanks Prince. I think you gave a great view of the deploy, serve, and monitor pieces, and why they’re important as well. Of course, throughout this whole talk, you’ve seen not only our journey, you’ve seen how important it is for fraud, you’ve seen the 10 million transactions, doing this in realtime and doing it fast as well. A bit of the technology stack: we love what Databricks is bringing, we love Delta Lake, we love MLflow. All of those are super important here. We like the advancements. You’ve seen a little bit of the things that we’ve built internally, to kind of close it off. But we’d love to hear others’ thoughts on this, and I hope it was useful. Thank you.

Mark Austin

Dr. Mark Austin currently manages a team of ~250 Data Science and developers in the Chief Data Office applying AI, machine learning, and automation with applications across many areas of the company s...

Prince Paulraj

Prince Paulraj is an accomplished ML/AI leader known for delivering meaningful and actionable insights that delivered value for AT&T across multiple organizations, including “Global Fraud Management...