With advancements in Artificial Intelligence (AI) and cognitive technologies, automation has become a key prospect for enterprises across many domains. Conversational AI is one such area in which many organizations are investing heavily.
In this session, we discuss the building blocks of conversational agents, and the Natural Language Understanding Engine built with transformer models, which have proven to offer state-of-the-art results on standard NLP tasks.
We will first talk about the advantages of transformer models over RNN/LSTM models, and later discuss knowledge distillation and model compression techniques that make these parameter-heavy models work in production environments with limited resources.
Rajesh Shreedhar Bhat: Hi everyone. I’m Rajesh Shreedhar Bhat, and I work as a senior data scientist at Walmart. It’s been a five-year journey with Walmart for me, and I mainly work with image and text data.
Dinesh Ladi: Hi everyone. This is Dinesh. I joined Walmart a year back, and I have around four to five years of experience. I mostly work as a full-stack data scientist, from building models and data munging to building dashboards, et cetera.
Rajesh Shreedhar Bhat: Thanks, Dinesh. So today’s topic is conversational AI with transformer models. The agenda looks something like this: we’ll start with why we need chatbots in conversational AI, then talk about the chatbot conversation framework and the use case we are trying to solve at Walmart. Then I’ll walk through a typical chatbot flow diagram, and we’ll cover the NLU Engine, which is the crux of any chatbot we build. Later we’ll talk about transformer models, specifically for intent classification. Then Dinesh will cover the data and model training summary, productionizing BERT for CPU inference, and the ensembling technique we used alongside Microsoft LUIS.
So let’s start with why we need conversational AI chatbots. As most of you are aware, messaging is a popular form of interaction nowadays, and chatbots streamline the interaction between people and services. If you look at any industry, retail, real estate, or anything else, there is usually a chatbot in place for customer support or for answering questions on certain topics, handling most of the user questions. If the bot is not able to answer, the question is redirected to a real agent. Previously, users would call customer care or other services provided by the company and get their answers directly. Now the chatbot answers most of the questions, and whatever it cannot handle is redirected to real agents, who answer accordingly.
Another thing to note with chatbots is scalability, based on the load. Take a retail industry example [inaudible]: the traffic coming into the website might differ across months, based on festivals and special events. Accordingly, we can scale the bots so they keep up with the questions coming in. And of course, a bot is always available: humans are not answering the questions, the bot is, so once it’s deployed and up and running, it’s available around the clock. It’s also helpful for organizations spread across multiple geographies, since there is no need to hire people in each one; a single bot is up and running and answering all the [inaudible].
Coming to the chatbot conversation framework, it’s classified based on conversations and responses. Conversations can be open domain or closed domain. In an open domain, the user can ask the bot any question. In a closed domain, say a bot answering queries about restaurants, users are expected to ask only questions related to that particular domain. Take restaurant search and booking as an example: a user would only ask questions about finding a particular restaurant, or the type of restaurant they are looking for, those kinds of things.
If a user asks something else, it’s classified as an out-of-domain question, and the bot gives some default answer for it. Coming to responses, they can be retrieval based or generative. In a retrieval-based bot, the bot gives a predefined answer for a particular type of question: if the same question is asked by multiple users, the response given by the bot stays the same. In a generative bot, the answer is generated on the fly, based on how the question is asked, and returned to the user. I hope these differences are clear.
Now coming to the classification made by combining conversations and responses. An open-domain, retrieval-based bot is impossible to build, because the user can ask any question and the bot should be able to look up a particular answer and return it; we would need everything under the sun in the knowledge base. So that’s impossible, I would say. Coming to open-domain, generative bots: this is basically general AI. There is no fixed knowledge base; there are different frameworks for attempting this, and people are using [inaudible] for it. So it’s not impossible, but it’s pretty hard to solve. Now, a closed-domain, generative chatbot: one thing that’s easier here is that it’s closed domain, so the user is asking only certain types of questions, not anything under the sun. In that sense it’s a bit easier, though still generative.
There are a few solutions already available in that bucket, closed domain and generative. But the problem with generative responses is that it’s pretty hard to control what comes out of the chatbot: there could be grammatical errors or other mistakes, so that’s a bit harder. Coming to the last category, closed domain and retrieval based: this is easier on both dimensions, conversations and responses. It’s a closed domain, and the responses are retrieval based. Given a question, we have a knowledge base defined for it, and based on a rule, an ML model, or a deep learning model, we find the most suitable known question for the user’s query and then return the answer defined for it.
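The closed-domain, retrieval-based idea above can be sketched in a few lines. This is a minimal illustration, not the production system: the knowledge base, the similarity threshold, and the string-similarity matcher are all assumptions.

```python
# Minimal sketch of a closed-domain, retrieval-based responder:
# match the query to the most similar known question and return
# its predefined answer, else fall back to a default response.
from difflib import SequenceMatcher

# Hypothetical knowledge base of known questions -> canned answers.
KNOWLEDGE_BASE = {
    "how many leaves do i have": "You have 12 annual leaves remaining.",
    "how do i apply for leave": "Open the HR portal and submit a leave request.",
}

def answer(query, threshold=0.5):
    """Return the answer of the most similar known question,
    or a default response if nothing is similar enough."""
    best_q, best_score = None, 0.0
    for known_q in KNOWLEDGE_BASE:
        score = SequenceMatcher(None, query.lower(), known_q).ratio()
        if score > best_score:
            best_q, best_score = known_q, score
    if best_score < threshold:
        return "Sorry, I can't answer that. Redirecting you to an agent."
    return KNOWLEDGE_BASE[best_q]

print(answer("How many leaves do I have?"))
```

Because the answers are predefined, every user asking the same question gets the same response, which is exactly the retrieval-based behavior described above.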
So there can be a rule-based system, an ML model, or a deep learning model sitting there. I hope this framework is clear to everyone. Now, coming to the use case: at Walmart, the bot we have built is mainly for answering questions about HR policies. At Walmart’s scale, it becomes very difficult to manually answer the questions coming in over mail or other channels; we can’t have an actual person sitting and answering all these questions from employees across different geographies. That’s why a bot is built and trained on a knowledge base, then deployed, and it’s able to answer the queries coming in from end users.
So how does this bot help? One, it’s obviously very convenient to get queries clarified on various policies, because the bot is trained on those policies and available 24/7, as I said earlier. It also eliminates person dependency: once the bot is deployed, up and running, and answering questions, there is no manual intervention. And it provides a consistent experience. By consistent experience I mean that if we ask the same question to different people, we might get different answers based on their understanding, their tenure with Walmart, and their knowledge of the policies available at Walmart. A bot, on the other hand, is trained on a carefully curated knowledge base and deployed from that, so it can’t answer differently for different users; the answers are consistent. Right now it is integrated with various communication platforms, like Slack and Zoom, and it may be extended to other platforms.
Okay. So I’ll talk about how a typical chatbot flow looks. Basically, when a user types in a query, what comes in is unstructured data, natural language text. The main aspect here is understanding what the text says, and then, based on that understanding, getting the relevant answers to the end user. The crux of any chatbot is the NLU Engine, which is nothing but a natural language understanding engine: given unstructured text, it converts it into a structured form. That is the role of the NLU Engine. I’ll talk in detail about its components in the upcoming slides; for now, think of it as a black box that takes unstructured text and gives us structured data.
Once we have the structured data, we can use it to call an API, where an API call is necessary. Say, in this example, a person is looking for a Mexican restaurant in the center of the town; then a particular API needs to be hit based on the structured data retrieved, and the results are displayed to the end user. That’s one path. The other is an FAQ kind of scenario, where no API call is required to get the information; in that case, the answer can be static plain text. So there are two ways: either make an API call, get the results, and show them to the user, or, if it’s an [inaudible], give a static answer.
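The two response paths just described, an API call versus a static FAQ answer, might be dispatched from the NLU Engine's structured output roughly like this. The intent names, slot names, and the stand-in search function are hypothetical.

```python
# Illustrative sketch of routing the NLU Engine's structured output.
def handle(nlu_output):
    intent = nlu_output["intent"]
    entities = nlu_output.get("entities", {})
    if intent == "restaurant_search":
        # Dynamic answer: hit a (hypothetical) backend API with the slots.
        return search_restaurants(cuisine=entities.get("cuisine"),
                                  location=entities.get("location"))
    if intent == "faq_leave_policy":
        # Static answer: no API call needed, return predefined text.
        return "Our leave policy allows 12 paid leaves per year."
    return "Sorry, I didn't understand that."

def search_restaurants(cuisine, location):
    # Stand-in for a real API call; the result below is dummy data.
    return f"Found 3 {cuisine} restaurants in {location}."

print(handle({"intent": "restaurant_search",
              "entities": {"cuisine": "Mexican", "location": "centre of town"}}))
```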
Okay. Now I’ll talk about the different components of the NLU Engine. Intents and entities are the core of the [inaudible] engine, so let’s understand what these are. The intent basically tells what the user wants to do: for example, searching for a restaurant, raising a ticket for some issue in the company, booking a taxi, booking a flight, anything like that. So here, based on the query, the intent could be restaurant search, booking a taxi, and so on. Coming to entities, these are the attributes which give details about the user’s task. Take restaurant search as an example: the type of restaurant the user is looking for adds more detail to the query. Okay, he’s looking for a restaurant, but what type of restaurant, and where exactly is he looking for it?
These are additional pieces of information coming with the query, and they are nothing but entities. In the NLP domain, these two tasks are well known: for intents it’s a classification task, or sentence classification, and for entities it’s a named entity recognition task. Those are two well-known problems in the NLP space. We have specifically used transformer-based models for intent classification. The reason is that transformer-based models have proven to do well in multiple scenarios: classification, named entity recognition, question answering, and so on. And the main reason is contextual embeddings. Previously, with [inaudible], GloVe, or fastText kinds of models...
What we used to get was a fixed vector representation for a word, irrespective of the context in which it was used. In the example given here, "open a bank account" versus "on the river bank", the word "bank" had a fixed representation; it was not contextual. With transformer models, the representation depends on the context in which the word is used: in "bank account" it is related to finance, and in "river bank" it is a natural element, something close to a river, like a person sitting along the river bank. What I’m trying to say is that the word "bank" gets different embeddings based on the context in which it is used. Initially "bank" has the same embedding, but after passing through the transformer blocks with the self-attention mechanism, we get contextual embeddings.
So "bank" in "bank account" will have a different embedding from "bank" used with "river"; those are contextual. The other thing to note with transformer models: previously, with RNN and LSTM kinds of models, we used to process the information sequentially and then classify a sentence or perform other tasks. Everything was sequential; with transformers, parallel training is possible. That’s because of positional embeddings, or positional encodings, which are also learned. One might ask: if we give the model the entire sentence and process it as a whole, what about word order, and how is it taken care of? That’s why positional embeddings are there; they are added to the word embeddings. We learn the positional embeddings, the model is trained collectively, and the final decision is made based on the task, whether that’s classification or anything else.
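As a concrete illustration of injecting word order, here is the fixed sinusoidal positional encoding from the original Transformer paper. Note this sketch differs from BERT, which learns its position embeddings as mentioned above; but the purpose, giving each position a distinct vector that is added to the word embedding, is the same.

```python
# Sinusoidal positional encodings (Vaswani et al.): each position gets a
# distinct vector, so the model can recover word order even though the
# whole sentence is processed in parallel.
import math

def positional_encoding(max_len, d_model):
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)      # even dimensions: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=8)
print(pe[0][:4])  # position 0: alternating sin(0)=0.0, cos(0)=1.0
```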
So that was an overview of why transformers are used for classification tasks rather than traditional approaches with RNN or LSTM models. This is just the gist of it; covering the technicalities of why transformers are better would take a lot of time, but I hope the intuition is clear.
Specifically, we are using the BERT model. It’s based on the transformer architecture, and BERT mainly involves two stages: pre-training, which people also call self-supervised learning, and fine-tuning. Fine-tuning could be for classification, question answering, or other kinds of problem statements. I’ll start with pre-training.
In pre-training there are two tasks: one is the masked language model, and the other is next sentence prediction. In the masked language model, there is no explicit label. Given an input sentence, as seen below, certain words are masked, and the task is to find an appropriate word for each masked position. In this example, say the blank is in the fifth position; we want to find the appropriate word for that fifth position. In doing so, the model is learning the structure of a sentence, its grammar, what the most appropriate word would be, what comes after what, and all that information. It’s not a random set of words: it’s a proper sentence with certain words masked, and we are trying to predict the masked words.
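The masking step can be sketched as below. The 15% rate follows the BERT paper; the whitespace tokenization and the always-use-[MASK] treatment are simplifications (BERT actually also sometimes keeps or randomly replaces the chosen word).

```python
# Sketch of building masked-language-model training pairs: no human labels
# are needed, because the original words at the masked positions *are* the
# labels. Tokenization here is a naive whitespace split for illustration.
import random

def mask_tokens(tokens, mask_rate=0.15, seed=42):
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok          # the model must predict this word
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, labels

tokens = "the man went to the store to buy a gallon of milk".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```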
That is the task at hand. As you can see, there is no explicit labeling: we just mask words and try to predict the word at each masked position, preparing the data purely from the sentences available. That’s why it’s called self-supervised learning, or people call it pre-training as well. Coming to the second pre-training task, next sentence prediction: given two sentences, the model should say whether they come one after the other or not. Basically, given a sentence A and a sentence B, the model should tell whether sentence B is the next sentence after sentence A. That is how it learns context, and that is the main task here.
If we learn the embeddings based on context, those learned embeddings can then be used for other tasks, be it classification or anything else. That is the main goal. In this pre-training task, sometimes two consecutive sentences are taken and the model should predict that the second is indeed the next sentence; and sometimes a random sentence is paired with sentence A, in which case the model should predict that it is not the next sentence. So those are the two pre-training tasks, and the model is trained on open-source [inaudible] like the Wikipedia corpus or social media corpora. The trained model weights are given to us, so we can fine-tune the model for downstream tasks.
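Preparing next-sentence-prediction pairs might look like the sketch below; the toy corpus and the 50/50 true-versus-random split are illustrative assumptions.

```python
# Sketch of next-sentence-prediction data prep: half the pairs are genuine
# consecutive sentences (label 1), half pair a sentence with a distractor
# that is not the true next sentence (label 0).
import random

corpus = [
    "I went to the bank.",
    "I opened a savings account.",
    "The weather is nice today.",
    "Transformers process sentences in parallel.",
]

def make_nsp_pairs(sentences, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))  # true next
        else:
            # pick a sentence that is *not* the true next one
            distractor = rng.choice([s for s in sentences if s != sentences[i + 1]])
            pairs.append((sentences[i], distractor, 0))
    return pairs

for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "->", b)
```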
Now, let’s see how classification is done with BERT. As I said, there are different tokens, we get their embeddings, and finally contextual embeddings for each token. There is a CLS token as well, a classification token: a special token used for the classification task. Why should we include a special token? Why can’t we just rely on the contextual embeddings of the regular tokens? The reason is this: say we have an initial representation for token one; we then get a contextual representation for that particular token, but that representation is still somewhat biased towards the token itself. It takes the context into account, yet remains anchored to that token. When I say biased, I mean that whatever the initial representation of that particular token is, we don’t want to lose it entirely.
So we want to retain that, while also looking at the context and modifying the representation accordingly. If you take any of these tokens, its representation is a little biased towards that specific token. That’s why a CLS token is included: it’s not part of the sentence, so it captures the overall semantics of the sentence, I would say. That’s why the CLS token’s representation is used for classification tasks. If you’re doing a named entity recognition task, the CLS token doesn’t come into the picture: we have a contextual representation for each token and predict a tag for each token, so CLS is not required there. For a classification task, though, we need a contextual representation of the sentence as a whole.
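A classification head over the CLS vector boils down to a linear layer plus argmax. Everything below is a toy stand-in: the 4-dimensional "CLS" vector, the weights, and the intent names are made up, whereas in the real system they come from a fine-tuned BERT/DistilBERT.

```python
# Sketch of an intent-classification head on top of the [CLS] embedding.
def classify(cls_vector, weights, biases):
    """Linear layer over the [CLS] embedding followed by argmax."""
    scores = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, biases)]
    return scores.index(max(scores))

# Toy setup: pretend 4-dim [CLS] output, 3 intents, hypothetical weights.
cls_vector = [0.2, -0.1, 0.7, 0.05]
weights = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
biases = [0.0, 0.0, 0.0]
intents = ["leave_balance", "restaurant_search", "raise_ticket"]
print(intents[classify(cls_vector, weights, biases)])
```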
That’s why the CLS token, which is not part of the sentence, is used, and we take its embedding for the classification task. Okay. There are different models: BERT large, OpenAI’s GPT, and various organizations have come up with different variants. Dinesh will be talking about DistilBERT and different methods to compress these models. We have used DistilBERT, which is a lighter version of BERT, and that is what is productionized and used for the intent classification task. Over to you, Dinesh.
Dinesh Ladi: Thank you, Rajesh. Let’s talk about the data for this particular use case. Whenever a user types a query into the chatbot, that query can relate to any policy in the company, and to any sub-policy within that policy. Let’s take an example: say a user types in asking what his leave balance is. The first level of prediction is which policy the query belongs to; since the user has typed a query related to leave, the first-level prediction should be the "leave" policy. Then we need to identify which sub-policy within leave the utterance belongs to; since he asked how many leaves he has, it should be leave balance.
Within leave we have sub-policies like leave intro, leave balance, leave apply, and others. So we need to predict that the utterance belongs to the policy "leave" and the sub-policy "balance". Similarly, we have 50-plus policies in the company, and each policy has its own sub-policies. The idea is to predict the policy and its sub-policy given an utterance. For that, we train models at two levels: one model to identify the policy at the top level, and then 50-plus models to identify the sub-policy within each policy; so one model plus 50-plus models overall. One of the challenges we faced was that the data was highly imbalanced: a few classes had many examples, and a few classes had very few.
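The two-level prediction can be sketched as below, with keyword lookups standing in for the one top-level model and the 50-plus sub-policy models; the policy and sub-policy names are illustrative.

```python
# Sketch of hierarchical intent prediction: level 1 picks the policy,
# level 2 picks the sub-policy using a per-policy "model".
def policy_model(utterance):
    # Stand-in for the top-level classifier.
    return "leave" if "leave" in utterance.lower() else "other"

SUB_POLICY_MODELS = {
    # Stand-ins for the per-policy classifiers.
    "leave": lambda u: "balance" if "balance" in u.lower() else "apply",
    "other": lambda u: "general",
}

def predict(utterance):
    policy = policy_model(utterance)                   # level 1
    sub_policy = SUB_POLICY_MODELS[policy](utterance)  # level 2
    return policy, sub_policy

print(predict("What is my leave balance?"))  # ('leave', 'balance')
```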
The way we started with model training was to extract BERT embeddings and train a logistic regression model on them. Even though the initial results were pretty good, around 87% training accuracy and 84% validation accuracy, we thought we could improve those numbers. So we took a different approach and fine-tuned the BERT model, adding a few more layers on top. That improved the training accuracy by a significant amount, and we achieved around a 10% improvement in validation accuracy as well. When we dug deeper into why the validation accuracy was still a little lower than the training accuracy, we identified that a few utterances were being confused among multiple policies; some policies were so similar that even looking at them manually it was hard to say which policy an utterance belonged to. So we had to merge those policies, since they were very similar. After merging the policies and retraining the model, even though we didn’t see a significant difference in training accuracy, we did see a small boost in validation accuracy.
And that is just one model; similarly, we had 50-plus models. Serving all these models would require a huge amount of resources, so the challenge in front of us was: how do we efficiently serve these models without using [inaudible] resources? There are a few methods to make model inference more efficient and cheaper.
The first technique we [inaudible] with was knowledge distillation. Knowledge distillation is nothing but this: you have a large, accurate teacher model, and a student model which is small and compact. You use the teacher model to train the student model: you feed in the same data, take the teacher’s predictions as soft labels, and use those soft labels as the labels for the student, training a smaller, compact version of the teacher. The widely popular distilled version of BERT is DistilBERT, created by the people at [inaudible]. DistilBERT is roughly 40% smaller than BERT and 60% faster, with minimal impact on accuracy or whatever metric you want to look at. In sum, the student model is smaller, faster, cheaper, and lighter, and for our use case we thought DistilBERT was good enough.
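The core of the distillation objective, training the student on the teacher's temperature-softened probabilities, can be written out in plain Python. The temperature value and toy logits are assumptions; real setups (including DistilBERT's) combine this with additional loss terms.

```python
# Sketch of a distillation loss: cross-entropy between the teacher's
# temperature-softened probabilities (soft labels) and the student's.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between teacher soft labels and student predictions."""
    soft_labels = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_labels, student_probs))

teacher = [4.0, 1.0, 0.2]   # confident teacher logits (toy values)
aligned = [3.8, 1.1, 0.3]   # student close to the teacher
off     = [0.1, 3.9, 0.2]   # student that disagrees
print(distillation_loss(teacher, aligned) < distillation_loss(teacher, off))  # True
```

A student that mimics the teacher's distribution gets a lower loss, which is exactly the training signal described above.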
The next idea is quantization. There are several ways to do quantization: you can do quantization-aware training, or post-training dynamic quantization. The idea of quantization is this: the weights in the model are generally float32. If you reduce the precision of those weights, bringing them down to float16 or int8, you can reduce the memory footprint by two or four times. What we do is, after training, we change the data types of the weights of a few layers, in our case the fully connected linear layers, from float32 to float16. This reduces the size of the model significantly while having minimal impact on performance. Obviously this varies from use case to use case, but in ours we saw a good reduction in model size with very little impact on model performance.
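The float32-to-float16 idea can be demonstrated with nothing but the standard library: packing the same weights at half precision halves the bytes at a small rounding cost. This is a stand-alone illustration with made-up weights, not the framework-level quantization used in production.

```python
# Sketch of post-training float32 -> float16 quantization of weights:
# packing at half precision halves the memory footprint, with a small
# precision loss on each value.
import struct

weights = [0.12345678, -1.9876543, 3.14159265, -0.00054321]

fp32_bytes = struct.pack(f"{len(weights)}f", *weights)  # 4 bytes per weight
fp16_bytes = struct.pack(f"{len(weights)}e", *weights)  # 2 bytes per weight

restored = struct.unpack(f"{len(weights)}e", fp16_bytes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(len(fp32_bytes), len(fp16_bytes))  # 16 8 -- half the memory
print(max_error)                         # small rounding error
```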
So we went ahead with this quantization technique. The final one is not really a modeling technique; it’s more about how you run inference, how you handle the data during inference. Generally, when training deep learning models, the model requires the data to come in batches of 16, 32, 64, and so on, for efficient training and for fitting the batches into the GPU. Since text comes in variable sizes, if you look at the example on the right side, every text has a different number of characters and a different number of words. To fit the texts into the GPU together, they need to be of similar sizes, so you have to bring everything to a constant length.
What we generally do is pad the inputs at the end with zeros, then [inaudible] the texts into batches and do the training; that’s what is required during training. But in our case, inference happens one query at a time, so the batch size is one and we don’t have to do the zero padding: since you’re predicting for only one query, nothing needs to be brought to a common length. The gain depends on the type of queries the model gets, and in our case we saw a significant improvement in inference speed when not using padding, because most of our queries are very short. To compare all the methodologies, we tracked BERT against DistilBERT and looked at how the performance changed across the different scenarios. With BERT, we measured the validation accuracy, the model size was [inaudible], and the inference latency was around 190 milliseconds. When we moved to DistilBERT, as you saw earlier, the model size is around 40% smaller than BERT, and with that smaller, compact student model we saw an improvement in inference speed as well.
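The padding saving can be illustrated with a toy token count: padding a batch to its longest member processes many more tokens than handling each query at its own length. The queries and the whitespace "tokenizer" below are made up.

```python
# Sketch of why skipping padding helps at inference time: with batch size 1
# there is no need to pad every query to the longest sequence length, so
# far fewer tokens pass through the model.
def tokens_processed(queries, pad_to_max=True):
    lengths = [len(q.split()) for q in queries]
    if pad_to_max:
        # Training-style batching: everything padded to the longest query.
        return max(lengths) * len(queries)
    # Inference one query at a time: each query keeps its own length.
    return sum(lengths)

queries = [
    "leave balance",
    "how do I apply for parental leave in my second year at the company",
    "payslip",
]
print(tokens_processed(queries, pad_to_max=True))   # 42
print(tokens_processed(queries, pad_to_max=False))  # 17
```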
When we used the no-padding technique, even though there is no change in the model itself, we saw a significant improvement in inference speed. Finally, when we used quantization on DistilBERT along with no padding, the model size decreased by almost half compared to the previous scenario, and we got a further incremental improvement in inference speed. So going from scenario one, BERT, through DistilBERT, to quantized DistilBERT with no padding, the model size decreased by around 2.5x, and the inference speed improved by almost 20x: from 198 milliseconds on average per inference down to around 9 milliseconds.
So this is how we improved BERT and ran it with CPU inference instead of GPU, which is expensive. We already had a system in place that used Microsoft LUIS as the inference engine. Microsoft LUIS is basically an enterprise solution; we don’t have much control over it, and it’s a black-box solution for us. We wanted something where we could control the model, [inaudible], so we wanted to use a custom model, DistilBERT. Along with that, instead of just replacing LUIS, which would be cheaper, we wanted to see whether we could augment LUIS with our custom-built DistilBERT, whether ensembling LUIS and DistilBERT would make the solution better. If you look at these few policies here, there was a significant improvement for some policies but not much for others.
The improvement comes whenever LUIS and DistilBERT diverge in their predictions. If you look at the probability distributions, in the first one you can see DistilBERT in red on the left side and LUIS on the right side. In these cases, where the predictions diverge, you can see a significant improvement: here LUIS alone was at 73% and DistilBERT at 73%, and when we ensembled both there was around a 13% improvement compared to just using LUIS. But for some policies, like RSU, where both models’ probability distributions look very similar, we didn’t see a big improvement from the ensembling technique.
That’s it from us. This is the team behind the project; along with Rajesh and me, two other folks helped us in designing and building the solution. And if you want to look at the sample code, for TensorFlow and PyTorch, you can access this URL or scan the QR code to see the sample code for the solution we have built.
Thank you for attending.
Rajesh Shreedhar Bhat is working as a Sr. Data Scientist at Walmart, Bangalore. His work is primarily focused on building reusable machine/deep learning solutions that can be used across various busi...
Dinesh has a Bachelor’s and a Master’s degree in Energy Engineering from the Indian Institute of Technology, Bombay. He has overall work experience of 4.5 years in data analyses...