Battling Model Decay with Deep Learning and Gamification

Conversational AI systems suffer from two forms of decay: concept drift, when interpretation of data changes, and data drift, when the underlying distributions of the data change. These forms of decay cause static AI models to degrade, often within days of creation. Using a combination of state-of-the-art NLP transfer learning tasks, a modern data pipeline using Databricks, and a network of experts completing distributed gamified data labelling tasks, Directly is able to provide a more effective and powerful end-to-end machine learning and conversation automation solution than systems that train static models and then expect performance to stay steady over time. This talk will dive into the specific mechanics required to create and maintain a living, breathing AI ecosystem, including lessons learned by creating a global network of experts and the pitfalls of training/hosting/versioning high-performance dynamic AI. Both technical and non-technical attendees are highly encouraged to participate in this talk. We will have deep dives into AI code/theory that will always be backed by an underlying real business use-case and performance metrics.

Video Transcript

– Welcome to Battling Model Decay with Deep Learning and Gamification. My name is Sinan Ozdemir. I’m the Director of Data Science at Directly.

Machine learning models degrade over time

As many of you know, machine learning models degrade over time. This is due to a variety of reasons, one of which is shifting trends and data. To keep the performance of machine learning up, continuous upkeep is needed.

Upkeep is expensive, time consuming and often a reactive process

Upkeep is unfortunately expensive, time consuming, and often a very reactive process.

Directly proactively and continuously enhances virtual agents by leveraging community experts to boost automation rates and help fellow customers. We do this by providing an AI platform and a gig economy that help companies transform customer service: we tap highly skilled product experts from around the globe to help our customers, and we build a partner ecosystem to help them analyze data, author automations, and engage with their customers to give personalized answers.

Now in this talk, I’d like to talk about three main areas of our machine learning pipeline. First, I want to talk about how we run daily deep learning based clustering algorithms on recent customer inquiries in order to capture new automation potential from trending intents. Second, I want to talk about how we take these trending intents, validate them, and send training tasks to our gamified intent discovery and training platform. Third, as we discover these new intents and automate them, I want to talk about how we update triage algorithms to effectively route incoming inquiries to either our automations, our experts, or our customer’s internal agent network.

So let’s start with our trending cluster analysis.

The core of our Understand product begins with our daily cluster analysis. And as we say at Directly, this really is the voice of our end users.

Our Understand product, as we’re seeing here, gives our clients real-time insights into what their users are asking, what they’re talking about, and how Directly is taking action on these inquiries. On the right, we offer up the top automated insights from our cluster analysis to show our customers which opportunities we are actively pursuing for deflection and automation.

In order to stay up to date with our customers’ latest clusters and intents, we developed a clustering pipeline that leverages much of Databricks’ technology. Raw questions are ingested and stored in the Delta Lake. These questions are encoded using deep learning and then clustered in a high-dimensional space. These cluster maps are then streamed immediately to our in-house automation operations teams, as well as to our clients’ dashboards, for transparency, through the Understand product I showed previously. This pipeline of ingestion, encoding, and clustering runs at least once a day per customer, and oftentimes even more, so that we are always surfacing the most urgent and largest clusters ready for automation.
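
As a rough illustration of the encode-then-cluster step, here is a minimal Python sketch. It is not Directly’s implementation: `encode` is a toy bag-of-words stand-in for a deep learning encoder, the greedy cosine-threshold `cluster` function and its `threshold` value are assumptions chosen for brevity, and the sample questions are invented.

```python
import math
import re
from collections import Counter


def encode(question: str) -> Counter:
    """Toy bag-of-words encoder (assumption: stands in for a deep learning encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", question.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(questions: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy single pass: a question joins the first cluster whose seed is similar enough."""
    clusters: list[tuple[Counter, list[str]]] = []
    for question in questions:
        vector = encode(question)
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(question)
                break
        else:
            # No existing cluster was close enough, so this question seeds a new one.
            clusters.append((vector, [question]))
    # Largest clusters first, so the biggest automation candidates surface at the top.
    return sorted((members for _, members in clusters), key=len, reverse=True)


# Invented sample questions, loosely modeled on the refund example in this talk.
questions = [
    "refund for my reservation in Canada",
    "refund for my reservation in Israel",
    "how do I edit my profile",
    "refund for my reservation in Disney",
    "cannot edit my profile photo",
]
clusters = cluster(questions)
```

A production pipeline would swap the toy encoder for learned sentence embeddings and the greedy pass for a proper clustering algorithm; the sketch only shows the shape of the data flow from raw questions to ranked clusters.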

Now, let’s take a closer look at the raw cluster analysis coming out of our pipeline to see how quickly trending intents can rise and fall. We are looking at the beginning of March, March 1st to the 15th, for one of our clients, a large global travel and hospitality company. They have just issued a statement regarding their COVID-19 refund policy. Our cluster pipeline is outputting tightly packed clusters of questions on the left, all from people asking about location-specific refund statuses. We can see that in some of the clusters, people are asking about their reservations in Disney, in Israel, in Canada. Being able to isolate these location-specific questions with this granularity allows us to author automations that are highly tailored to our end users’ needs. Now, we still see clusters like the one on the right, corresponding to longer-standing intents like troubleshooting and editing profiles, but these clusters are much lighter in volume at this time in comparison.

Now, let’s look at the same company, same source of data, two weeks later: March 16th through the 31st. We no longer see as many of those location-specific COVID refund questions on the left. They’ve actually been replaced with questions from people who have already submitted their request for a refund and are now asking about the status of said refund. This is a potential new source of automation that did not exist two weeks prior. And if we hadn’t caught this, we would not have been able to deflect it. Intents, and therefore opportunities for automation, can rapidly rise and fall within weeks and sometimes days. It is relatively simple to focus on longer-standing intents like profile editing and troubleshooting, but only truly proactive data pipelines like ours are able to act on these intents that can vanish as suddenly as they arise. Now, of course, I was able to label these clusters for us fairly simply, for the purposes of this talk. But in production, clustering is only the first step in this process. We also have to be able to effectively utilize the raw output of these cluster analyses in our pipeline to automate these intents.
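
One simple way to surface this kind of shift is to compare per-intent volumes between two time windows. The sketch below is only an assumption about how that could look: the `volume_shifts` function, its `ratio` and `min_volume` parameters, the intent labels, and all the counts are hypothetical and are not taken from the client data shown in the talk.

```python
from collections import Counter


def volume_shifts(previous: Counter, current: Counter,
                  ratio: float = 2.0, min_volume: int = 5) -> tuple[list[str], list[str]]:
    """Return (rising, falling) intents between two time windows."""
    rising, falling = [], []
    for intent in sorted(set(previous) | set(current)):
        before, after = previous[intent], current[intent]
        # Rising: enough volume now, and at least `ratio` times the earlier volume.
        if after >= min_volume and after >= ratio * max(before, 1):
            rising.append(intent)
        # Falling: enough volume before, and it shrank by at least `ratio`.
        elif before >= min_volume and before >= ratio * max(after, 1):
            falling.append(intent)
    return rising, falling


# Hypothetical cluster volumes for two consecutive two-week windows.
first_half = Counter(location_refund=40, profile_editing=8)
second_half = Counter(location_refund=6, refund_status=35, profile_editing=7)

rising, falling = volume_shifts(first_half, second_half)
```

With these made-up counts, the refund-status intent is flagged as rising, the location-specific refund intent as falling, and the longer-standing profile-editing intent is left alone because its volume is roughly stable.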

24/7 Intent Discovery and Training

To be able to continuously and confidently detect these trends in our clustering pipeline, we also developed an intent discovery and training pipeline that is capable of producing results to enhance our AI models 24/7.

It is based on the theory that among the clusters like the ones I just showed you, there are potential intents to automate, and therefore deflect for our clients.

Intent Discovery and Training with Gamified Platform

The questions that were previously extracted to the Delta Lake and clustered go through a secondary pipeline to highlight potential clusters and trending intents. These clusters are flagged for review via a proprietary automated process of Directly’s making, and are validated as new intents by a specialized team of in-house automation specialists. These automation specialists confirm the new intents coming out of the cluster analysis. Once these intents are validated, further data labeling and training tasks are sent to our global network of experts through our gamified training platform, seen here on the bottom right. The results of these tasks, which run 24/7, are, among other things, training phrases for our natural language models and authored automated responses for sending texts. If you think back to our previous example, intents surrounding location-specific COVID refunds can be sent immediately to our training platform to author automations within days and sometimes hours. Directly’s ability to draw on the collective knowledge of experts around the world to continually train our automation AI and automation pipelines sets us apart in a big way.
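
The talk does not describe how answers from many experts are combined into a single training label. As one plausible sketch, assuming a simple majority vote, the hypothetical `aggregate_labels` function below accepts a label only once enough experts have answered and enough of them agree; the `min_votes` and `min_agreement` values are illustrative, not Directly’s settings.

```python
from collections import Counter
from typing import Optional


def aggregate_labels(votes: list[str], min_votes: int = 3,
                     min_agreement: float = 0.6) -> Optional[str]:
    """Return the consensus label, or None when the experts have not agreed yet."""
    if len(votes) < min_votes:
        return None  # Not enough expert answers to trust the result.
    label, count = Counter(votes).most_common(1)[0]
    # Accept the top label only if a large enough share of experts chose it.
    return label if count / len(votes) >= min_agreement else None


# Hypothetical expert answers to one labeling task.
agreed = aggregate_labels(["refund_status", "refund_status", "location_refund"])
too_few = aggregate_labels(["refund_status", "refund_status"])
split = aggregate_labels(["refund_status", "location_refund", "profile_editing", "refund_status"])
```

Tasks that come back as `None` could be sent to more experts or escalated to the in-house automation specialists, which keeps low-agreement labels out of the training data.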

Let’s see what this looks like. We are looking at a graph of an automation classifier’s match rate percentage, for one of our clients.

What that looks like

We are specifically looking at the performance for the first 30 days of this model’s lifespan, so not that long. In the first two weeks of our engagement with this customer, we deployed an industry-specific model that is able to capture intents from the industry that the client belongs to. We were able to fine-tune that model with a sample of data provided to us prior to the engagement. This way, we are able to capture some intents and automations that are a bit more specific than an industry model. Now, that’s only in the first two weeks. As questions were coming in during that two-week period, we were sending the tasks we talked about previously to our global network of experts through that gamified platform. These tasks are then used, along with our specialized team of automation experts, to validate and train new intents, and to fine-tune our natural language model with company-specific intents and automations that are not going to be found in an industry-specific model and were not captured from our sample data.

Two weeks after launch, where that orange arrow is, we turned on the updating of our natural language models from the trainings from our expert network. We can see that within days, our intent match rate for this one classifier grew by 150%, and it has since continued to climb steadily. This shows how our distributed training pipeline is able to train natural language and automation classifiers without having to rely on reactive in-house teams, and therefore can be more proactive about finding these new intents and automating them.

Predictive Routing: Deep learning based triage

Now, our global network of community experts is always training our AI models. But that’s not all they’re doing. They’re also engaging with our clients and users, and answering the questions that our models are not successfully able to classify or contain.

Our deep learning based triage system, which we call Predictive Routing, is charged with the task of routing incoming questions to our Directly network and sending the end user to the optimal customer journey, whether that’s with one of our automations, one of our experts, or back to our client’s internal agent network if it requires specialized access.

This is yet another pipeline that relies heavily on Databricks. Questions that are attempted by our experts, and therefore were not able to be contained by our automation, are normalized and fed into our predictive routing triage model training pipeline, and the results are tracked via MLflow. The predictive routing models, very similarly to our clustering and intent models, are updated at least daily, and they are always learning from our experts’ recent behavior patterns of answering questions. What this all means is that predictive routing can more accurately, effectively, and quickly predict expert behavior and route to our expert network the questions that have the highest chance of resolution with a high CSAT or NPS. All other questions that are rejected by predictive routing are either fed into our automation pipeline or, if our predictive routing model has detected that they require a higher level of account access than our experts or automation are able to provide, sent back to the clients.
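
The routing decision described above can be sketched as a small rule-based function. This is an illustration under stated assumptions, not the Predictive Routing model itself: the `Inquiry` fields, the `route` function, the two threshold values, and the destination names are all hypothetical, and in a real system the scores would come from trained models.

```python
from dataclasses import dataclass


@dataclass
class Inquiry:
    text: str
    automation_confidence: float   # hypothetical score from an intent classifier
    expert_resolution_prob: float  # hypothetical score from a routing model
    needs_account_access: bool     # hypothetical flag for specialized account access


def route(inquiry: Inquiry, automation_threshold: float = 0.8,
          expert_threshold: float = 0.6) -> str:
    """Pick a destination for one incoming inquiry."""
    if inquiry.needs_account_access:
        # Neither experts nor automations have this access, so send it back to the client.
        return "internal_agent"
    if inquiry.automation_confidence >= automation_threshold:
        return "automation"
    if inquiry.expert_resolution_prob >= expert_threshold:
        return "expert"
    # Rejected by routing: feed it to the automation pipeline for future intents.
    return "automation_pipeline"


# Invented inquiries with made-up scores.
destinations = [
    route(Inquiry("how do I reset my password", 0.9, 0.7, False)),
    route(Inquiry("when will the outage be over", 0.2, 0.75, False)),
    route(Inquiry("transfer account ownership", 0.2, 0.9, True)),
    route(Inquiry("something unusual", 0.2, 0.1, False)),
]
```

Keeping the thresholds as parameters makes it easy to retune the split between automation and experts as the daily retrained scores change.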

Now let’s see what this looks like in a case study from one of our large gaming clients.

Now we’re looking at the performance of two predictive routing models for this large gaming client. The purple line represents the performance of a predictive routing model that is being trained using the pipeline I just outlined: trained every day, learning from the most recent expert behavior, and trying to effectively route questions with the highest chance of resolution. The orange line is the exact same predictive routing model, but with continuous training turned off. It is trained only on the initial day and then never trained again. Now, within the first few days and the first few weeks, we see that both models’ performances are about the same. They rise and fall with the types of questions being asked by the network and by the end users. It is worth noting that the purple line, which is continuously learning, is generally above the orange line, but not by that much.

On September 21st, our client experiences an outage for their most popular global online game. Both models are inundated with several new intents, all asking about the outage. When is the outage gonna be over? What is causing the outage? When is this gonna be fixed? Our community experts were there, handling the backlog of tickets that our automations were not able to handle because they had just never seen these intents before. And this is true for both models. The difference, however, is that the purple line, the one being trained daily, is learning from the experts’ work every single day. That model quickly recovers in routing performance, while the orange line, which is still only learning from data from day one, decays much more rapidly and takes a much larger hit to performance. It only recovers when the outage is over and people just stop asking about it.
The purple line recovers much quicker because it is able to understand these new intents coming in, and understand which types of questions our experts are able to handle and resolve with a high CSAT or NPS. So this goes to show that continuous training is not only for clustering and not only for intent classifiers. It is also being used to optimize the expert network that is supporting all of our AI models. Continuous training of models is imperative to a modern, effective machine learning pipeline, and it greatly improves the experiences of both our clients and our clients’ end users.

In order for Directly to proactively capture automations, capture these new intents, and optimize our expert network, we must be able to surface potential intents from daily trending cluster analysis. We also have to validate these potential new intents with our specialized automation teams, and automatically send training tasks to our global network of experts for further training and fine-tuning of our automation models. These expert networks are able to keep the quality of the machine learning high, like we’ve seen in previous examples. And lastly, we must always be optimizing the expert network supporting our automation, and we have to do this 24/7. Because as these new intents come in and as expert behavior changes, our predictive routing, our deep learning triage models, have to maximize resolution rates and maximize CSAT scores, all while understanding which questions are most likely to be flagged for automation or sent back to our client’s internal agent network. Now, all of this is possible through Databricks. And it is necessary to stay relevant in the market’s growing demand for proactive, intelligent automations.
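
The contrast between the daily trained model and the one trained only on day one can be mimicked with a deliberately tiny simulation. Everything here is an assumption made for illustration: a "model" is just the set of intents it has seen, `match_rate` and `simulate` are hypothetical helpers, and the daily question lists are invented rather than taken from the gaming client.

```python
def match_rate(known: set[str], questions: list[str]) -> float:
    """Share of a day's questions whose intent the model has already seen."""
    return sum(q in known for q in questions) / len(questions)


def simulate(days: list[list[str]]) -> tuple[list[float], list[float]]:
    """Compare a model trained once on day 0 with one retrained after every day."""
    static_known = set(days[0])      # trained on the first day only
    continuous_known = set(days[0])  # starts identical, then keeps learning
    static_rates, continuous_rates = [], []
    for questions in days[1:]:
        static_rates.append(match_rate(static_known, questions))
        continuous_rates.append(match_rate(continuous_known, questions))
        # Daily retraining: fold in the intents observed today.
        continuous_known |= set(questions)
    return static_rates, continuous_rates


# Invented intent stream: an outage intent appears on day 2 and is gone by day 4.
days = [
    ["login", "billing"],           # day 0: initial training data
    ["login", "billing"],           # day 1: business as usual
    ["outage", "outage", "login"],  # day 2: outage begins
    ["outage", "outage", "login"],  # day 3: outage continues
    ["login", "billing"],           # day 4: outage over
]
static_rates, continuous_rates = simulate(days)
```

In this toy run both models take the same hit on the first outage day, the retrained model is back to full match rate on the second outage day, and the static model only recovers once the outage questions stop arriving.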

So thank you. Thank you for sitting here and listening to me talk about battling model decay with deep learning and gamification. I hope each and every one of you will go back and really think about the pipelines that you are building, and whether or not they are ready to be continuously updated and optimized, so that you’re always capturing all of the latest shifts in data.

About Sinan Ozdemir


Sinan is a former lecturer of Data Science at The Johns Hopkins University and the author of 4 textbooks about Data Science and Machine Learning. He is the founder of the acquired company, an enterprise-grade conversational AI platform with RPA capabilities. Sinan is currently the Director of Data Science at Directly in San Francisco, CA.