Testing is a crucial enabler for the success of chatbots and voice assistants. Doing it manually requires enormous time and effort. As DevOps and, increasingly, AIOps grow in importance, automated testing will remain critical to ensure that bots actually do what their designers intend.
Unlike traditional software, where the application follows a predefined flow, a chatbot runs without any restrictions: talking to a chatbot has no barriers. Combined with unpredictable user behavior, this makes it extremely difficult to verify the correctness of conversational AI. Training data and test sets are infinitely large. In fact, quantity plays a major role in quality assurance for bots, but it makes manual testing impossible.
We will tackle the main questions that arise: “Why are bots failing?”, “What and how should you test?” and, of course, mainly “How can we automate the testing and training?”. In addition, you will get hands-on experience running your first automated tests using Botium.
Christoph Borner: Hi guys, it’s a pleasure to meet you. My name’s Christoph and today we are going to talk a bit about bots testing bots, or from manual to automated testing for conversational AI. Conversational AI, meaning chatbots and virtual assistants, is more or less the most trending topic right now in the whole IT industry. More and more big enterprises and businesses out there are developing these chatbots, and quality assurance, and especially automated testing, is becoming more and more the key enabler for the success of these bots. So let’s have a look. Before we go into details, a few words about me.
My name is Christoph Borner. I’m founder and CEO at Botium. We are located in Vienna, in the heart of Europe. I’m a multiple-time keynote speaker and in my free time, I play drums in a rock band. If you want to stay in contact with me, or in case you have any questions we cannot answer in the Q&A session, you will find me on social media. I usually use grisuBotium as a nickname, same on GitHub. You can follow my blog on botium.ai/blog, or maybe the easiest way of contacting me is to write directly to the email address that you can see here.
So, the playbook of today’s session: we will talk a bit about the question of why chatbots are real game changers right now. We will discuss why failures cost especially heavily when it comes to conversational AI, then maybe the three most important topics: what do users expect from bots, and what can testing do to fulfill these expectations? We’ll create a solid test plan. We will see where testing adds value in the entire bot development life cycle. We will have a look at the key enabler, test automation, and finally, I will give you some hands-on experience with test automation for chatbots, meaning that we will fully automate, end to end, a smoke test for a bot that I’ve developed in Dialogflow, so that you can see how fast and easy test automation for conversational AI can be.
Before we go into details about the game changer chatbots, I have included here a short video that was produced by the guys from HubSpot. That’s a pretty cool CRM, in case you are looking for one. This video will really set the context and explain what chatbots are, for those of you who haven’t heard about them before. So let’s see the video.
Speaker 2: If you’re not building a bot, you’re already behind. The age of robots is now, and the more people interact with them, the more they’ll realize the value of humanless interaction. But what exactly is a bot? A bot is a computer program that automates certain tasks, typically by chatting with a user through a conversational interface. Bots plug into messaging apps, apps that have even surpassed social media platforms in usage, and play into a larger shift we’re seeing in consumer behavior. People shop and buy in a world of immediacy; messaging is how they communicate. Having an app won’t cut it anymore. Today, half of smartphone users download a whopping zero apps per month. Bots meet users where they already are, no app download or URL necessary. That’s the super power of bots: efficiency. There’s no complicated phone menu or ill-informed service rep. By chatting in a familiar, conversational interface, bots ask what they need to understand and solve a problem. Nothing more, nothing less.
Christoph Borner: So now that we know what chatbots are and how they work, let’s shed some light on the question of why they are these big game changers. For that, I have some big quotes here from famous IT trend analysts, and they are really big. Have a look, for example, at the prediction that already last year, 85% of all customer interactions out there were handled without a human agent, meaning by a chatbot. For this year, 50% of all enterprises are expected to spend more on chatbot development than on their mobile apps. For next year, there is an estimate of a potential 8 billion US dollars in cost savings from the use of chatbot conversations, and overall the market for the upcoming years is estimated at close to 10 billion US dollars. Pretty big numbers, pretty big quotes. This is mainly why chatbots are considered these big game changers right now.
Coming to the slide, Why Failures Cost So Heavily, I included here a real-life example from a pretty big airline. I was afraid to use a real screenshot, because I didn’t want to bring these guys into a bad light. But the story here is that this airline changed their entire ticket booking process to a chatbot, meaning that you couldn’t go onto their website or some booking.com and book your flight tickets; you had to use the chatbot. In general, a pretty good idea, but the problem was it was not working. The screenshot here is a fake, but the real implementation looked exactly like this one. If you told the bot, “I want to book a flight,” it answered with, “Sorry, I don’t understand.” The user tried to change to, “I want to buy a ticket.” The bot again said, “Sorry, I don’t understand,” and so on.
So in short, the worst case that can happen, and it took two to three days for this big airline to realize that they were not selling tickets anymore. After a failure like this, it’s pretty easy to put a value on your fixing costs and also on the revenue loss, because usually you know how many tickets you are selling from Monday till Wednesday. You can say, “Usually we sell 100,000 tickets, so our loss is a few million US dollars.” But what was pretty hard to estimate was the loss of reputation, and also all the loss due to the shitstorms that came up. You have to see that all these bot technologies are mainly used out there by millennials, by Generation X. These people are using bots permanently, the whole day, and the same people started shitstorms immediately.
They were asking questions like, “If these guys are not able to develop a simple chatbot to book a ticket, are they able to do the maintenance of a Boeing or an Airbus?” and so on. It was a pretty big reputation loss, and I think the airline is still suffering from it. Long story short, with some testing, and especially with continuous test automation in place, this problem could have been completely avoided. Coming to one of the most important questions out there when you’re thinking about developing a chatbot, and this is definitely: what do users expect from conversational AI? After a lot of research within our community (our open-source product has around 150,000 users, so we can really collect a lot of data), there are four main topics. The first one is definitely accurate answers. Users are expecting accurate answers delivered quickly. The only way to achieve this is a well-thought-out conversational design behind it; I cannot shorten this more.
The second big topic is great user experience, meaning that users are expecting your service on all preferred channels and platforms: on Facebook Messenger and at the same time in a WhatsApp client, on a website, in a mobile app, using voice, speaking to an IVR system, and so on. Users are expecting the same great user experience everywhere, on all channels and platforms. Next topic, fast response. I think it’s self-explanatory: users are expecting answers now, not in a few seconds, and also on stressful days. So no matter if it’s Super Bowl weekend, half time, or something like that, they are expecting immediate and accurate answers. Last but not least, highest security. Of course, state-of-the-art security and data privacy have to be in place. In Europe, in addition, we have GDPR compliance. So let’s have a look at what we can do in testing to meet these expectations.
In terms of accurate answers, we are doing conversation flow testing to check if answers are appropriate and delivered quickly, and we are combining this conversation flow testing with permanent NLP score checking, all to improve the chatbot’s understanding. For great user experience, we are doing full end-to-end testing to verify the end-user experience, and we are doing it across browsers, across mobile apps, and across devices, for example in the case of an Alexa skill. We are testing text-based, we are testing voice, we are testing IVR systems. Everything is important to understand users on all the different kinds of channels. In terms of fast response, we have added performance testing to ensure that chatbots stay responsive under high load, in detail by doing stress and load testing in combination. And for security testing, we have included security test sets in our products, based on OWASP, to keep bots secure and GDPR compliant at the same time.
Good. If you take the stuff I’ve mentioned under what testing can do (I’m jumping back once to the last slide), you more or less have the ingredients for a solid bot testing strategy. All the testers here, all the QA guys, can read between the lines that we are talking about different test types. If I go to the next slide and take these ingredients, meaning these test types, and combine them, then we end up with what we call a solid test plan, or a solid test strategy if you want, or what some people also call a holistic test approach. To sum it up again, we’re talking about conversation flow testing to identify flaws in chatbot dialogues, combined with end-to-end testing to really verify the end-user experience throughout all channels and platforms.
We are adding permanent NLP score testing to improve the chatbot’s understanding, and performance and security testing to keep the bot secure and responsive under high loads. Finally, in production, of course, we also have to do monitoring to get notified when problems arise. As all the QA guys can see, we are covering both functional and non-functional testing, and we’re covering smoke, regression, and also acceptance testing. If you take all these test types, combine them, and continuously execute them through your CI/CD pipelines (and continuously means with every change in your conversation model, with every change to the widget of the bot, with every change that is somehow connected to the bot development), then you end up with the highest possible level of end-user satisfaction in production, and with the highest level of confidence on the bot owner’s side to ship bots, and changes to bots, into production.
Good. Now I want to talk a bit about adding value, because a lot of people expect that testing only adds value in the testing phase of the bot development life cycle. But let’s start one step earlier: how does the bot development life cycle look? At the bottom of the slide, you can see the different phases. We go through planning, design, development, training, testing, and deploying to production. More or less, this ends up in an infinite loop, because chatbot development usually doesn’t end with the deployment to production. Usually you try to learn from your users, you add new intents, you change your conversation models, and you start over more or less from the beginning. But the important thing on this slide is what we can see underneath this gray line in the middle with the different phases: the typical challenges of the bot development team.
The business guys and architects start with challenges like analyzing user needs and having to decide on channels, on an NLP engine, or on a bot development platform. Designers have to do the conversation flow design, and so on. On top of this gray line with the phases, you can see the value that we can add in testing. For example, in the planning phase, when an architect or the team has to decide on the right NLP engine (natural language processing, for those who are not used to this term), we can provide benchmark tests in the domain where the chatbot should act later on. So testing already brings a lot of value in this early phase, and we can go on with this example. In the design phase, we can import conversation models and directly generate data sets, meaning tests and training data, out of them. In the development phase, we can automate the unit testing and do some voice-based testing. In the training phase, of course, we generate all the NLP scores.
We have an AI model download included. Finally, in the testing phase, this is of course where we can add most of the value. As we’ve seen before: the conversation flow testing, the end-to-end testing, the non-functional testing in terms of performance, security, GDPR, and so on, to more or less cover all the quality gates that you have to cover for any software, to make the time to market shorter, to be able to do quick adaptations, and to implement continuous integration in your pipelines. As I said before, in production we can finally do monitoring to permanently collect all the chatbot analytics. So the main takeaway here, and I know this slide is pretty complex for those who are new to it, is that testing brings value in all phases of the bot development life cycle, and this is essential. Going one step further now, on the next slide I’m going to talk a bit about automation. Before, it was mainly about testing itself; let’s have a look now at why automation is so important when it comes to conversational AI.
Well, first of all, the main factor is most probably that test sets are infinitely large when we’re talking about conversational AI. There are actually no barriers for users. Users can ask a bot more or less everything, and as we see in production, they do. So a banking chatbot out there gets asked by end-users about the weather in Vienna today, and so on. Now, the banking chatbot doesn’t have to be able to answer this question, but it should not reply with something like, “Sorry, I don’t understand,” or, in the worst case, break and expose some confidential information. I have added here a showcase from an actual customer that is using our Botium Box. They operate in the telecommunications domain, and as you see in the top row, this customer runs a support center that gets around 100,000 incoming calls per week.
Internally, they calculate costs of $3 per call, which ends up in annual costs of 16 million US dollars to run the support center. Then these guys implemented a chatbot, and this chatbot is right now answering 90,000 of these 100,000 incoming calls. Reliably, 24/7, and customers are super happy with the bot. We usually ask at the end of the conversation, “Did you like it? Were you happy?”, and end-user satisfaction there is super high. Only the remaining 10,000 calls are handed over to a human agent. Therefore, this telecom company could reduce their annual costs from 16 million to 1.6 million US dollars. So it was definitely worth implementing this chatbot, and the ROI was reached pretty quickly. But the important thing here is: why did they reach this ROI so quickly? Why could they reduce their costs so dramatically?
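As a sanity check, the cost figures can be reproduced with a quick back-of-the-envelope calculation. This is just a sketch using the numbers quoted in the talk ($3 per call, 100,000 calls per week, 10,000 calls remaining for human agents); the 52-week year is my assumption.

```python
# Back-of-the-envelope ROI calculation for the telecom support-center
# example from the talk. All figures are the ones quoted on the slide.
COST_PER_CALL = 3            # USD, the customer's internal calculation
CALLS_PER_WEEK = 100_000     # incoming calls at the support center
WEEKS_PER_YEAR = 52          # assumption: full-year operation

annual_cost_before = COST_PER_CALL * CALLS_PER_WEEK * WEEKS_PER_YEAR

# The chatbot now answers 90,000 of the 100,000 weekly calls;
# only 10,000 are still handed over to human agents.
calls_to_agents = 10_000
annual_cost_after = COST_PER_CALL * calls_to_agents * WEEKS_PER_YEAR

print(f"before the bot: ${annual_cost_before:,} per year")  # roughly 16 million
print(f"after the bot:  ${annual_cost_after:,} per year")   # roughly 1.6 million
```

This matches the roughly 16 million to 1.6 million US dollar reduction mentioned above.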
The answer is they were doing heavy testing, mainly test automation. This is why, at the moment, the bot is intelligent enough to really answer 90,000 calls with high end-user satisfaction. To give you an example of the savings from automated testing there versus manual testing: in the first column on the left, you see the training data that is used in this project. The telecom bot knows around 100 intents, so there are around 100 topics it can answer. Behind each of these 100 intents there are more or less 100 user examples or utterances saved, meaning 100 different ways of asking about this topic. So if you multiply those two numbers with each other, you end up with 10,000 conversations. This is typically the starting point for a regression test, a regression test that just covers the happy paths.
So even without any negative test cases or whatever, it already consists of 10,000 conversations. Then the team there is doing roughly 100 builds in a sprint, meaning 100 times a week they are changing something on the bot, and they are doing this for 50 sprints a year. If you multiply these numbers, you see that every week they have to execute the 10,000 test cases of the regression test 100 times. At this point it becomes pretty clear that you cannot do this manually anymore. Without going too much into detail, I’ve collected the KPIs in the last line, where I compare the manual effort versus what we were doing in test automation. For speeding up the intent definition and the NLP provider selection, we saved around $100,000 with our benchmark testing; test data generation, $50,000; the unit testing, around $300,000 per year; all the NLP testing and the automation of the training, also around $300,000. And then there is the testing phase.
Of course, this is where most of the tests are happening: 100 smoke tests per sprint, 50 regression tests, 25 end-to-end tests. In brackets, I always have the hours the tests would need if done manually, and to be honest, they are pretty conservative. Five performance tests per sprint and five security tests. This ends up in a saving of 3 to 5 million per year. In addition, with the monitoring, in this project we end up with a potential saving of 4.3 million US dollars. So as you can see, the ROI of test automation is reached here immediately.
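The test-volume arithmetic from the regression example above can be sketched in a few lines. The numbers are the ones from the talk; the "one sprint per week" mapping is taken from the speaker's own description.

```python
# How regression-test volume explodes for a chatbot, using the numbers
# from the telecom project described in the talk.
intents = 100                 # topics the bot can answer
utterances_per_intent = 100   # different ways of asking about each topic

# Happy-path regression test: one conversation per utterance.
happy_path_convos = intents * utterances_per_intent   # 10,000 conversations

builds_per_sprint = 100       # roughly 100 changes per sprint
sprints_per_year = 50         # one sprint is roughly one week

executions_per_week = happy_path_convos * builds_per_sprint
executions_per_year = executions_per_week * sprints_per_year

print(f"{executions_per_week:,} test-case executions per week")
print(f"{executions_per_year:,} test-case executions per year")
```

A million test-case executions per week is clearly beyond any manual QA team, which is the whole argument for automation here.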
On the other side, it’s totally necessary. Just imagine you do a small change in your conversation model: to ship it into production, you have to do at least a smoke and a regression test, some end-to-end testing, and, depending on the change, maybe also performance and security testing. If you have continuous test automation in place, all of this can happen in a few minutes, whereas if you have to do it manually, it might take days or even weeks. So time to market, or time to value as we call it nowadays, is one of the key factors in test automation. Good. From this complex slide, I would like to move now to the hands-on part. As promised at the beginning, I want to show you some test automation on a live bot, and therefore I’m going to share my screen with you.
So this is the bot we are going to automate, as promised in the beginning. It’s a Dialogflow-based bot; Dialogflow is one of the biggest bot building platforms out there, provided by Google. We have here a pretty small bot that I developed for a webinar lately. As you can see, this guy knows just three intents, more or less: a welcome intent to do a greeting, he can tell us a joke, and he can also tell us who he is. The cool thing is that Dialogflow has a sandbox environment included where you can try the bot end-to-end. So this is the widget of the bot, and we can test if he really knows the three intents. If I say, “Hi,” he replies with, “Hey buddy.” If I ask him, “Who are you?”, he will tell us that he is the Botium Webinar Bot.
Finally, I can ask him to tell me a joke, and there we go, a typical developer joke: Why is six afraid of seven? Because 7, 8, 9. This widget you could integrate into your live website, your live mobile app, and so on. We now want to automate a real end-to-end test case, meaning that we will really talk to this bot through this widget. For a simple smoke test, we will ask exactly those three questions, as I did here, and check the replies of the bot. This is of course just a simple smoke test, but we will do it end-to-end, and we will have full CI/CD integration ready. This could be used, for example, to do a smoke test after deployment. Let’s say you do a change on a conversation and you redeploy the bot; then you need this smoke test to check if the deployment was okay, if the bot is there, if you can start a conversation, and so on.
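Under the hood, Botium represents conversations like this as BotiumScript convo files: the first line names the test case, and `#me` / `#bot` sections alternate between what the user sends and what reply is asserted. A minimal sketch of the three-step smoke test described above might look like this (the exact reply texts are taken from the demo; treat the wording as illustrative):

```
end-to-end smoke test

#me
Hi

#bot
Hey buddy

#me
Who are you?

#bot
I am the webinar bot

#me
Tell me a joke
```

The live chat recorder shown in the demo produces test cases equivalent to such scripts, so you can also hand-edit or version-control them alongside your bot.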
So let’s jump into our flagship product, called Botium Box. You see here the Enterprise Edition, but there is also a completely free version available for our community users, which you can register for on botium.ai. This is the quick start menu, and, no surprise, you see here exactly those six test types from the slides before that define a solid test plan or test strategy. Once again, we’re talking here about conversation flow or regression testing, NLP score testing, monitoring, end-to-end testing, and performance and security testing. In this case, we want to do real end-to-end testing, because this is always the coolest thing to show. Therefore, we start our Quickstart. I will give the whole project a name, let’s call it Demo Data AI Summit, and I select my demo bot.
I have tested this bot before, so I don’t need to enter my credentials again, but in general, this would just mean entering my Google Dialogflow credentials here to connect, or, even easier, downloading the credentials as a JSON from my Dialogflow account and dragging and dropping them here into Botium Box, and we are connected. The second step is about composing tests, and here we have several options. We could select from out-of-the-box test sets: if you type just “small talk,” you’ll find a lot of out-of-the-box test sets in different languages. We could use our visual test case designer, where you can define your test cases and assertions in clear text, and of course you can assert for everything, not only text: you can also assert for cards, carousels, NLP data, everything you could imagine. We could also directly download the conversational model of the bot and generate data sets out of it.
That’s a very good starting point for NLP score testing. We could also use a brand new feature that we have, called the Conversation Crawler. This feature will crawl through the entire conversation tree and generate test cases for every path in the conversation tree, a very, very good starting point for creating a regression test. It usually ends up after a few minutes with thousands of test cases, depending of course on the size of the conversation model. But for the very easy smoke test that we are planning to do, I will use our live chat recorder. This is more or less a capture-and-replay feature for creating tests, and maybe the easiest way of composing them. For everyone who is used to Selenium or Appium as the industry standards for automating web and mobile apps, you will be pretty thrilled right now when I show you what it takes inside Botium to test end-to-end. You just tell Botium Box where you want to test.
In my case, I want to test against a Chrome browser. We could connect here to device cloud providers, but for this quick demo, I will just use the integrated headless Chrome. That’s it, meaning that you don’t have to deal with any Selenium code or whatever; you don’t have to script anything. We will now just record a test session. We have told Botium Box to run the tests end-to-end in the headless Chrome, and that’s already it. So I hit save on the project, and we now start a live session with the bot. You will see in a second: if I say, “Hi,” to the guy, he should reply with, “Hey buddy.” There we go. If we compare this to the live bot, this is exactly what we see here. We sent, “Hi,” and he replied with, “Hey buddy.” For, “Who are you?”, we are expecting, “I am the webinar bot.”
We are also expecting to get back the joke. So let’s do exactly the same. Let’s send, “Who are you?” He replies with, “I’m the Webinar bot.” We’ll do, “Tell me a joke.” Of course, following these principles, you could have a pretty long conversation with your bot, going through various paths of your conversation model. At the end, you just save this whole thing as a new test case. Let’s call it end-to-end smoke test, and that’s already it. We are ready. If I start the test session right now, Botium Box will do all the magic in the background, meaning it will execute the tests that we have just recorded. In a few seconds we will get the results; the tests are running fully end-to-end in the headless Chrome, as we specified. There we go.
This is the result. As you can see, there is the visual representation of the conversation flow. Down here, we also have the image attachments, so we can see that the tests were really running fully end-to-end, exactly against the bot that we wanted to test. Once again, for those who are used to writing tests in Selenium and so on: just imagine how much time it would take you, starting with setting up the framework, creating a WebDriver, setting capabilities, connecting to the bot, executing the tests, doing the assertions, and so on. Depending on the environment you do this in, most probably days or even longer to set up what we have done here in five minutes. The coolest thing: on the test project level, you can see that we have even generated full CI/CD integration for this project, meaning that there is a webhook now that can be called from our continuous integration and delivery pipeline.
Whenever we do a change to the bot, to the conversation model, to the visual experience of the bot, or whatever, we can call this webhook to really execute those tests. There are a lot of open-source tools like Jenkins that can call these tests and execute them, or you could trigger them not only on changes: you could run them on a nightly basis, daily in the morning, whatever you prefer. We also have scheduling integrated here to do this. All of this was set up in just a few minutes. This was a pretty easy smoke test for this Dialogflow-based bot. Apart from that, I could of course show you all the other test types here for the next few hours, but I think I will save the time and give you the option now to ask some questions, if there are any.
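To make the CI/CD idea concrete, here is a minimal sketch of a pipeline step that calls such a webhook and fails the build if the test session cannot be started. The URL shape, project id, and header name are purely illustrative assumptions; in practice you would copy the real webhook URL from the test project's settings in Botium Box.

```python
# Sketch: trigger a chatbot test session from a CI/CD pipeline step
# (e.g. after a deployment stage in Jenkins). Only Python stdlib is used.
# NOTE: the URL path and the auth header name below are hypothetical,
# for illustration only; use the webhook your test tool generates.
import urllib.request


def webhook_url(base: str, project: str) -> str:
    """Build the (hypothetical) webhook URL for a test project."""
    return f"{base.rstrip('/')}/api/testprojects/{project}/start"


def trigger_tests(base: str, project: str, api_key: str) -> int:
    """POST to the webhook and return the HTTP status code.
    A non-2xx status should fail the pipeline step."""
    req = urllib.request.Request(
        webhook_url(base, project),
        method="POST",
        headers={"X-API-KEY": api_key},  # header name is an assumption
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # In a real pipeline you would do something like:
    #   status = trigger_tests("https://botium.example.com", "demo-smoke", key)
    #   assert 200 <= status < 300, "smoke test session could not be started"
    print(webhook_url("https://botium.example.com", "demo-smoke"))
```

The same call works from a Jenkins shell step, a cron job for nightly runs, or any other scheduler, which is exactly the "trigger on change or on schedule" pattern described above.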
Serial entrepreneur, developer, tester, keynote speaker, and drummer. Studied information technology at the Technical University of Vienna and worked in various fields of software engineering. Active...