Natural Language Query and Conversational Interface to Apache Spark

May 28, 2021 11:40 AM (PT)


Apache Spark has been a great technology for processing and analyzing Big Data. However, it is not accessible to business users, who don’t have technical or programming skills. In this talk, I’ll talk about recent efforts in the space of “Conversational analytics”. This paradigm allows any user to ask text and voice questions, in natural language, of their data to a bot and receive back a natural language and visual result. A key technology is natural language to SQL translation, where we translate natural language queries from a user into Spark SQL queries that can go against a Databricks system, and that can be easily trained on different schemas and databases.

This NLP technology needs to be further combined with dialog management, natural-language generation/narration, data understanding and modeling, augmented analytics and automated visualization generation in order to achieve the goal of “Conversational Analytics”. Using such a technology, a user can ask, in plain English, “How many cases of Covid were there in the last 2 months in states that had no social distancing mandates by type of transmission”, and then dig deeper into the results in a conversational manner to uncover hidden insights from Covid datasets in a Spark instance. We believe that having access to such data and insights at their fingertips can help users make appropriate decisions quickly, improve data literacy and even overcome the scourge of fake news for the general public.

In this session watch:
Anand Ranganathan, Corporate (CIO, CTO, Chief Data Officer), Unscrambl



Anand Ranganath…: Hello everybody. My name is Anand Ranganathan. I’m the Chief AI officer at Unscrambl. Unscrambl is a startup in the AI space and we are pioneering a new way for people to get access to data and insights, and it’s something we call conversational analytics.
In this talk, I’m going to be focusing on how natural language queries and conversational analytics can be supported on top of Apache Spark, as well as similar databases. The problem that I’ll tackle is that data adoption still remains very low in most enterprises. In fact, it’s been going down, especially since the start of the pandemic. So, while enterprises have done a great job of getting data from various sources into platforms like Spark, the last-mile problem, which is the problem of business users actually getting access to data, still remains challenging. In fact, only 31% of businesses think that they are data-driven.
This is a statistic from 2020, and it has actually gone down since 2017. So most business users still find it very difficult to access data. They may not know the right programming languages or tools, whether it’s SQL, Python, Scala or even BI tools like [inaudible] and Tableau. So it often takes a very long time for them to figure out how to get the data. The data might be split or spread across different databases or data warehouses, or even Excel sheets in an enterprise. So they often have to rely on IT teams or data analyst teams.
And this can add a delay of hours to days to weeks from the time they need some data to the time they actually get it. So while tools like Spark have helped tackle the problem of actually processing and analyzing big data, they don’t actually solve the problem of consuming data, especially for non-technical users. Most business users are pretty familiar with collaboration tools like Microsoft Teams and Slack, and productivity has actually been pretty high in the pandemic, working remotely, but data is still not part of those tools. So while they can communicate with each other on these kinds of platforms, data is separate.
So there’s a big gap today between business users and the data that they need. And in fact, this is a problem not just in enterprises, but even in the general public. In the open world, there’s a whole bunch of interesting data sets that organizations, including governments and enterprises, have released on various kinds of socially relevant topics. As an example, there have been lots of interesting data sets on COVID. But most users, most of the general public, have no way of really analyzing this data. So there’s still, again, a gap between the public and data sets of interest. The way most enterprises tackle this problem today is by creating dashboards and reports. But the challenge with that is that each dashboard or report tends to be fixed and pretty rigid. They really can’t support any drill-downs.
You can’t ask a slightly different question. And oftentimes, there is a danger that the dashboard creator might have used some assumptions or biases when creating the dashboard which the business user might not be familiar with. So it’s very easy to create fake news or fake stats in dashboards. And this is a problem not just in the public space, but even in enterprises. So while the challenge remains in terms of business users getting access to data, the whole BI and analytics space has slowly been moving towards something Gartner calls the age of the augmented consumer. In this space, more and more data is being consumed using data stories in a conversational manner, insights [inaudible] automatically for business users, and they can get insights in a personalized fashion that’s [inaudible] to their need and requirement.
They can get access to insights from configurable analytics without necessarily needing a data scientist always on site. And most importantly, they need to be able to collaborate on data and insights. So that’s where we’ve been pioneering this new space called conversational BI and analytics, where the idea is to allow any user to ask text- or voice-based questions of their data within a conversational interface, and receive back a natural language or visual analysis of the most interesting insights and data for that specific user. This is a new space, and a few enterprises are starting to tackle it, but we think of it as one of the frontiers in enabling data democratization and data literacy. So before I jump ahead, I just want to clarify that there are often two ways people interpret the term conversational analytics.
In fact, if you do a Google search, half the results will be on the first definition, the other half on the second definition. The first definition is analyzing conversations, and this often tends to be in the context of chatbots, or chats with customer support, or calls to a call center, where enterprises might want to analyze those conversations. The second definition is accessing insights and analysis through a conversational interface. For the purpose of this talk, we’re going to focus on the second definition, because we’re focused on how insights and analysis can be accessed through a conversational interface, especially for data in Spark. So I just have one slide describing our product. What we have built is a personal AI-powered data analyst.
It’s called QBO, where the idea is that users can ask questions of the data using natural language conversations. So essentially they undertake conversations with QBO. And this can be as part of collaboration platforms like Microsoft Teams, or it can be in customer interfaces, or it can be through speech. And they can start asking questions of the data in plain English, like: Why have sales dropped in a given region or a certain month? Why were there fewer customer acquisitions in February? What’s the daily revenue by device in a given week? And QBO tries to connect to different data sources, enterprise-wide or external data sources. Users can just converse with QBO and, most importantly, collaborate with each other along with QBO to get interesting insights. So what does it take to actually have a conversational analytics system? The first step is actually analyzing and understanding a user’s query and responding to it. So here are some of the key steps that are involved in that process. And here’s an example of a data set we’re going to be using through the course of this talk.
This is the Citi Bike data set, which has information about all bike share trips taken in New York City as part of the Citi Bike program. So users can ask questions like “number of trips in winter 2017 by age and gender”. That’s the kind of natural language question that we want users to be able to ask in a conversational analytics system. When a user asks a question, there are several steps involved. The first step is to understand the question. This could include techniques like entity detection and recognition, as well as entity construction, to understand some of the key concepts in that query.
Then the system needs to understand what kind of query it is and how to map various terms and phrases in that query to relevant terms and concepts in the database. So, as an example, in this particular case, number of trips could be one entity, winter 2017 could be another one, and age and gender a third one. And each of those now needs to be mapped to terms actually present in the database. So for instance, the database might not have a [inaudible] or concept called age, but it has a birth year concept.
So the system must be able to derive age from birth year. Again, the system needs to understand what winter 2017 means and come up with a reasonable time period for that definition. All that needs to happen in a seamless manner, and finally a SQL query can be generated. That goes to a backend database that contains this data, and when the results come back, the system needs to decide on the best way of presenting the results, for instance in the form of a chart. As an example, in this case, we’ve picked something called a heat map chart. Or the results could also be explained back to users in the form of natural language narratives, and all this is then presented back to the user as part of a conversation. So those are some of the key steps that a conversational system must take in order to answer users’ queries. The way conversational analytics bots like QBO need to be set up is that they act as an intermediary between users and different data sources.
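As a rough illustration of the steps described above (resolving “winter 2017” to a time period, deriving age from birth year, and generating a SQL query), here is a minimal Python sketch. It is not how QBO works internally; the table name `trips`, the columns `start_time`, `birth_year` and `gender`, and the choice of 1 December to the end of February as “winter” are all assumptions made for the example.

```python
from datetime import date
from typing import Tuple

# Hypothetical table name; the real Citi Bike schema may differ.
TABLE = "trips"


def resolve_winter(year: int) -> Tuple[date, date]:
    """Map 'winter <year>' to a half-open date range [start, end).

    Assumption: winter runs from 1 December of that year to the end of
    February of the following year.
    """
    return date(year, 12, 1), date(year + 1, 3, 1)


def age_expression(reference_year: int) -> str:
    """Derive age from a birth_year column (assumed column name),
    since the database is assumed to have no age column."""
    return f"({reference_year} - birth_year)"


def build_query(year: int) -> str:
    """Build a SQL string for 'number of trips in winter <year> by age and gender'."""
    start, end = resolve_winter(year)
    age = age_expression(year)
    return (
        f"SELECT {age} AS age, gender, COUNT(*) AS num_trips\n"
        f"FROM {TABLE}\n"
        f"WHERE start_time >= DATE '{start.isoformat()}' "
        f"AND start_time < DATE '{end.isoformat()}'\n"
        f"GROUP BY {age}, gender"
    )


if __name__ == "__main__":
    print(build_query(2017))
```

The generated string could then be handed to a Spark session for execution; a real system would also need to handle other seasons, other phrasings, and validation of the generated query.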
There might be a certain step involved where the system is configured and the natural language processing is trained, so that it’s able to understand users’ queries in the context of a given data set, as well as a given industry or domain. Once that’s done, users can start asking questions to QBO in natural language. The bot needs to be able to figure out which database might have the right answer, get back the query results, and then present the results back to users in the form of charts and natural language. So at this point, I’m going to pause and show a quick demo of the tool, so you get an idea of what we think of as a conversational analytics system. All right. So here’s a quick demo of the conversational analytics system. This demo is going to be using the Citi Bike data set, which has information about all bike trips in New York City from, I think, 2011 or ’12 to 2020.
And we’re going to show how people can ask questions in natural language, within a conversational interface, to get interesting insights from the data. So this is our web interface. We also have a Microsoft Teams interface where users can chat with this bot from within Teams. In the web interface, we want to make this collaborative. So we get an idea of what the key kinds of questions are that people have been asking recently, who has been asking questions, and how many questions people have been asking in the last 30 days. The way users can start asking questions is by creating a thread in a channel. These channels could be either one-on-one conversations with the bot or group chats where different people could be chatting with each other, along with the bot. So I’m going to start a one-on-one conversation with QBO and create a new thread. The interesting thing about a conversational interface is that we have a transcript of all previous analysis.
Which means that I know what is going on and what kind of analysis I did yesterday, a week back, a month back. And I can go back to those transcripts and understand how we might have arrived at certain kinds of insights or analysis. So I’m going to say “demo for AI summit”, and that’s the name of the demo. There are some hints over here with sample questions of what I can do to get started. I can see the data sets available. So in this case, Citi Bike has information about programs, stations and trips, and there are some sample questions here to get started. But let me start by asking the same question I just showed in the slides. I can ask a question like “number of trips in winter 2017 by age and gender”. So when I ask this question, QBO tries to figure out exactly… It goes through the different steps that I laid out earlier. It came back with a replay of the same question to help explain to the user how it interpreted the question.
So in this case, it clarified what it thought December… or I’ll say winter 2017 means. So instead of winter 2017, it shows the period from December 2017 to early 2018. It mapped age to something called age group and number to total number. I can also look at the actual SQL query that got generated. If I had any doubts about the correctness, I can look at the SQL query, go through it, and verify that it actually does match my requirements. In this particular case, the SQL query gets converted to a Spark SQL query that goes to a Spark backend that has the actual Citi Bike data. The other thing I can do is change this visualization and use other kinds of charts. If I didn’t want a heat map, if I wanted a bar chart, a pie chart, or some other kind of chart, I can switch the visualizations around. But let me keep it as a heat map for now. So that’s basically how QBO works.
Let me ask a few other kinds of questions. So what is, let’s say, the yearly average duration and max duration of all the trips? Again, QBO does some internal mapping to figure out how best to interpret this question. It knew to interpret max as maximum and duration as trip duration, and it comes back with a chart. One of the things it does is figure out the best way of showing the data, and in this case it’s showing this as a line chart. So yeah. Let me take a couple of other examples. What is the total number of trips yearly? And again, now it’s… So basically the number of trips yearly. And we see a big decline in 2020, which we can expect to be because of the pandemic. But one of the things we want a conversational analytics bot to do is to answer not just basic questions, but more complex questions.
So maybe you want to analyze why 2020 went down compared to 2019, just based on the data available. We might know that there are other causes like the pandemic. Let’s see what the data says. So if I ask a question like, why is the total number of trips in 2020 less than 2019, I can go a bit deeper. In this case, QBO doesn’t just do one SQL query; it runs multiple SQL queries to look at the data from different angles and comes up with the analysis. So one thing it warns about is that we don’t have full data for 2020. We only have data for three months. So that’s actually one reason why 2020 is so much lower. But even with three months, it’s possible that it’s not going to meet 2019 standards.
So this immediately gives us an insight into why this was lower. But it still went ahead to do some analysis and said that if it still considered 2020 and 2019 completely, some of the key reasons are a big decline in the number of trips taken by the 21 to 40 age group, a decline in the number of trips taken by male riders, and a decline in the number of trips taken by subscribers. So those are the biggest differences, some of the biggest reasons why overall the number of trips went down. This is an example of something we call diagnostic analytics, where the bot is not just answering basic what questions, but why questions. And the whole goal of such a system is to answer not just what and why questions, but also what-will-happen-next and what-should-I-do-about-it kinds of questions. So that completes the demo. I’m now going to go back to the slides.
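One simple way to picture the diagnostic step from the demo (comparing two years along several dimensions and reporting the segments with the biggest declines) is sketched below in Python. This is only an illustration under assumptions, not QBO’s actual method: the function name `rank_declines` is hypothetical, and it assumes the per-segment trip counts for each year have already been fetched, for example by one SQL query per dimension.

```python
from typing import Dict, List, Tuple


def rank_declines(
    before: Dict[str, int], after: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Rank segments by change in trip count between two periods.

    `before` and `after` map a segment label (for example an age group,
    a gender, or a rider type) to its trip count in each period.
    Segments missing from one period are treated as having zero trips.
    The result is sorted so the largest declines come first.
    """
    changes = []
    for segment in set(before) | set(after):
        delta = after.get(segment, 0) - before.get(segment, 0)
        changes.append((segment, delta))
    # Sort by change (most negative first), then by label for stable output.
    return sorted(changes, key=lambda item: (item[1], item[0]))
```

Running this once per dimension and keeping the top few entries from each would give a list of candidate explanations, which could then be turned into a natural language narrative.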
So, yep. Hopefully you got a good sense of at least our vision of conversational analytics and how people can interact with a personal data and AI analyst to understand their data sets better. When we use a term like personal AI-powered data analyst, one of the assumptions we’re making is that an AI system like this can essentially, over time, replace a human data analyst. And when you think of that question, naturally the Turing test comes into play. So if we want to come up with a Turing test for conversational analytics, the definition of that test might be: can a conversational analytics bot be indistinguishable from a human data analyst or a human data scientist? We’ve dug into this question a little bit and we’ve come up with some definitions or requirements for what it might take for a conversational analytics bot to imitate a human data analyst.
So human data analysts, these are people in the company who respond to business user requirements for data sets and insights, dig into the different data sets available, come up with reports, dashboards or other analysis, or even just PowerPoint presentations, present them back to the business user, and work collaboratively to help figure out or understand what to do next. So some of the key tasks that a human data analyst might perform are gathering and clarifying the requirements of business users. And oftentimes some of these requirements might be framed or phrased in a very ambiguous or poorly formed manner, often in natural language. Once they get the requirements and clarify what exactly a business user needs, they might then need to retrieve and analyze relevant data from various data sources. They need to understand the semantics of the data and how to analyze it. They need to present the results to users in an appropriate way, using charts or natural language.
They might then collaborate with the business user to jointly decide what the relevant findings and insights are and dig into different parts of the data sets together. Oftentimes they need to continuously learn the preferences of business users so that they can respond to requests more quickly. And finally, another key thing that a data [inaudible] analyst must do is to proactively keep looking at the data and coming up with findings on their own, not just rely on business users to ask questions all the time. Those are some of the key properties or behaviors that an AI-powered data analyst might need to exhibit to be able to imitate a human data analyst. So let’s go into some of these points and see exactly what they mean. Firstly, requirement gathering. Very often users in enterprises can ask ambiguous or poorly framed questions.
So, as an example, let’s say it’s the insurance space, and somebody wants to know the number of policies issued last year. Sorry, number of policies last year. That’s a common question that we hear from some of our insurance customers too. There are different ways of interpreting that question. One interpretation is number of policies issued last year; the other interpretation is number of policies active last year. So the data analyst, either an AI-powered or a human one, must be able to ask the business user what exactly they mean, and this can happen through the course of a conversation. And the business user can clarify and say, oh, I really meant the number of policies issued last year. Another critical property or requirement that a data analyst must satisfy is that they need to be able to interpret users’ queries and map them to the data available. Business users may not know what is available and how it’s modeled. So they might ask questions the way they want. And there’s often a semantic gap between their questions and the actual physical data model that is present in the database.
And this semantic gap needs to be bridged by the personal data analyst. So, in this case, the personal data analyst must be able to understand different terms the user might use. So terms like ADPU, which is jargon for average duration by user, or be able to understand that age group needs to be derived from an attribute called birth year. And then they can answer the user’s questions. So a critical requirement for a data analyst, whether it’s an AI-powered one or a human data analyst, is to be able to map terms a business user uses to concepts, which could be tables or columns or other information, present in the physical data model. Another key property of data analysts is iterative refinement, which means they need to be able to engage with the business users in the course of an analysis of some kind of data.
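The idea of bridging the semantic gap can be sketched as a small lookup from business terms to expressions over the physical data model. The Python below is a deliberately simplified illustration: the column names (`trip_duration`, `user_id`, `birth_year`), the SQL expressions, the age buckets, and the substring matching are all assumptions for the example, and a real system would use trained language models rather than a fixed dictionary.

```python
from typing import Dict

# Hypothetical semantic layer: business term -> SQL expression.
# All column names and formulas here are illustrative assumptions.
SEMANTIC_MAP: Dict[str, str] = {
    # "ADPU": average duration by user
    "adpu": "SUM(trip_duration) / COUNT(DISTINCT user_id)",
    # Age group is derived from birth year; bucket edges are made up.
    "age group": (
        "CASE WHEN YEAR(CURRENT_DATE) - birth_year <= 20 THEN '0-20' "
        "WHEN YEAR(CURRENT_DATE) - birth_year <= 40 THEN '21-40' "
        "ELSE '41+' END"
    ),
    "duration": "trip_duration",
}


def map_terms(question: str) -> Dict[str, str]:
    """Return the known business terms found in the question,
    each paired with its expression in the physical data model."""
    text = question.lower()
    return {term: expr for term, expr in SEMANTIC_MAP.items() if term in text}


if __name__ == "__main__":
    for term, expr in map_terms("ADPU by age group").items():
        print(f"{term} -> {expr}")
```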
So very often, business users won’t stop at just one question. They might keep asking more and more questions to drill down into some interesting aspect of the data that they found. So they might ask at the start: How many bike trips were there in New York City? How many in Jan 2019? How many were longer than 15 minutes? Show those ones by age groups. Which start station did those trips start from? What was the top start station [inaudible] durational trips? How many trips ended with the start stations? So they might keep drilling down into different aspects of the data in fairly this exact manner. And the data analyst must keep up and figure out answers to those questions as quickly as possible.
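The drill-down pattern above, where each follow-up question narrows the previous one, can be modeled as a conversation context that accumulates filters and groupings. The following Python sketch assumes a hypothetical `trips` table with `start_time`, `trip_duration` (in seconds) and `age_group` columns; it illustrates the idea rather than any particular product’s implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DrillDownContext:
    """Keeps the filters and groupings built up over a conversation."""

    table: str
    filters: List[str] = field(default_factory=list)
    group_by: List[str] = field(default_factory=list)

    def add_filter(self, condition: str) -> "DrillDownContext":
        """Narrow the current result set with one more condition."""
        self.filters.append(condition)
        return self

    def add_group_by(self, column: str) -> "DrillDownContext":
        """Break the current result down by one more column."""
        self.group_by.append(column)
        return self

    def to_sql(self) -> str:
        """Render the accumulated context as a single SQL query."""
        select = ", ".join(self.group_by + ["COUNT(*) AS num_trips"])
        sql = f"SELECT {select} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.group_by:
            sql += " GROUP BY " + ", ".join(self.group_by)
        return sql


if __name__ == "__main__":
    ctx = DrillDownContext("trips")                       # "How many bike trips were there?"
    ctx.add_filter("start_time >= DATE '2019-01-01'")     # "How many in Jan 2019?"
    ctx.add_filter("start_time < DATE '2019-02-01'")
    ctx.add_filter("trip_duration > 900")                 # "longer than 15 minutes"
    ctx.add_group_by("age_group")                         # "Show those by age groups"
    print(ctx.to_sql())
```

Keeping the context as structured data, rather than as a growing SQL string, makes it straightforward to store alongside the conversation transcript and to undo or replace a single step.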
Finally, and this is an interesting point, the data analyst must have memory. So there needs to be a transcript or a remembrance of what analysis they did with the business user or [inaudible] of time. And the conversational interface is ideal for that, because we naturally keep a transcript of all our analysis. So if the business user comes back after a few weeks with a question like, why was I interested in Third Avenue and 75th Street last week, they know why they were interested, because they can see the transcript. Another critical thing that data analysts often do is to help guide the business users from descriptive to diagnostic to prescriptive analytics. So from what questions to why questions to what’ll-happen-next questions to what’s [inaudible] the next questions. All of that could potentially happen in the course of a conversation. So starting from a question like the ones we’re talking about on trips, users might go to: why do trips decline in February?
Predict the number of trips next February, and then: what kind of offer should I give in February to increase the number of trips taken? Data analysts are often part of teams, where they might work with multiple business users, and that’s something a personal AI-powered data analyst must do as well. It should facilitate collaboration between different users, where different people could ask questions one by one on different parts of the data and the bot keeps responding to them as they keep asking questions. So that ends the talk. One of the things that I encourage you to do is to imagine the future of work with the presence of a personal AI-powered data analyst. If you’re interested in trying out this system, you can check us out at our website, or QBO.AI, and see how conversational analytics might work for you. Thank you for your time, and I look forward to any questions, comments or suggestions.

Anand Ranganathan

Anand Ranganathan is a co-founder and the Chief AI Officer at Unscrambl, Inc. He is a data scientist, AI engineer, Big Data developer, architect, and researcher rolled into one person. He is leading U...