Azure Databricks and Data Science with Avanade

Ryan will present a case study of a large pan-European supermarket, a story which illustrates the power of Azure Databricks and big data computing. He will share how this client improved their bottom line by leveraging data science to help fill shopping baskets, even in the midst of the global pandemic.

Speaker: Ryan Price


– Good afternoon everyone. My name is Ryan Price, and I’m very happy to be here as part of the Databricks summit and looking forward to sharing a great story with you today. So welcome everyone; we’ll just go ahead and get started. Today we’re going to talk a little bit about Databricks and some data science work that we’re doing here at Avanade in the Netherlands. I’m joining you today from my home near Boxmeer in the Netherlands, and it’s a sunny day outside. Unfortunately we’re not able to do this live together as we have in the past. We’re doing this under the current circumstances with COVID, which is unfortunate, but I’m glad that we can still do it in this format anyway. So thanks again for joining. I lead our Data & AI practice here in the Netherlands. I’ve been doing that for about the last eight years. I’ve been part of Avanade. My role here is that I’m responsible as a solution architect for solutioning and supporting our sales teams, and for leading a team of solution architects to create solutions like the one I’m going to talk about today and deploy these for our clients. We do this over the full breadth of the data and AI span, from data platform to intelligent industry solutions to intelligent automation, so really the full span of the data and AI space. I’ve been doing this, as I said, for about eight years, and before that I spent 25 years in the defense industry, where I was also doing a lot of data science and working with data, but also a lot of program management and things of this nature. So I’ve always been involved with technology. It’s a passion of mine. In my spare time here in the Netherlands I enjoy spending time in the outdoors, and I was also a pilot in my Air Force career, so I really still enjoy getting out and doing some flying.
And I really like doing the work that we do at Avanade, and I spend a lot of hobby time in that space as well. Later on today Alexander Sherstnev will be joining me for the question and answer session. Alexander is a very highly competent data scientist. He’s been working on things like the proton accelerator program, has been doing a lot of university work, and will be a great addition to our question and answer time. He was also involved in solutioning and in the delivery of the project we will be talking about today. For those of you that haven’t heard about Avanade, I’d like to introduce us a little bit. We’ve been around since 2000. We’re a joint venture between Microsoft and Accenture, and in the meantime we’ve grown over the last 20 years to more than 40,000 people located across more than 20 countries. You see some of the stats here in the presentation, but we focus on the full breadth of the Microsoft platform. What we’re talking about today falls under our data and artificial intelligence group, but we also have modern workplace, business applications, and applications and infrastructure. So we operate over the full spectrum of the Microsoft practice. Here in the Netherlands, where I’m presenting from today, we have a team of over 400 people that are dedicated to the Microsoft platform, and of those, more than 200 are involved in the data and AI piece, doing everything from deep data science work to data platform work to intelligent automation, over the full gamut. I’m also very proud of the collaboration that we have with Databricks. We have more people trained on the Databricks platform in Europe than any other company. So we’re really proud of the work that we do together and of our partnership with Azure Databricks. So again, thanks all for joining.
So a little bit of background about the client. We’re talking about a major supermarket and consumer goods client here in Europe. They’ve got operations across multiple countries and quite some scale, as you see here: well over 2 billion in turnover, more than a hundred million transactions, 50,000 different articles, and several million people working at the company as well. We’ve been working and engaging with this client for a number of years, both in the Azure data platform space, providing support there and bringing all of the data together, and on the high value side, delivering data science models and helping provide and demonstrate the insights. So it’s pretty exciting work. As you know, retail and consumer goods have very tight margins, and things like pandemics can have a substantial impact, which really makes the opportunity to create high value with data science quite opportune. So good stuff there. If you look at the typical kinds of things we’re doing in data science in consumer goods and in the supermarket industry, we’re looking at things like store sales forecasting, online marketing, which is quite interesting, loyalty programs, customer journey, and customer engagement. All of these are areas that are getting a lot of focus and can create quite some value. What we’re talking about today will be specifically focused on the online marketing piece. So, just to talk a little bit about the industry: I know that we probably have a lot of industry experts in the audience today, or tuning in, but for those that maybe don’t have quite so much experience, here is a little bit of background. Specifically, what we’re going to talk about is an online cross-sell model that we use for the online marketing we just talked about. So what is it?
Yeah, the idea is that we want to try to cross-sell a particular product, which results in selling more than you otherwise might have. So, for example, if I go into a supermarket and look at the shelves, you’ll see that they are also arranged to achieve cross-sell. Maybe I go in and choose a Coca-Cola, and you may find that you have peanuts very close to the Coca-Cola or the soft drinks, because the chance of you buying peanuts if you buy a Coke might be higher if they are situated next to each other. This is the same kind of thing, but then online. What this isn’t is stock replenishment or predicting replenishment, or things of that nature. This is really about creating additional value. And why do you want to do this? Why is this interesting? Because the bottom line is that it can increase revenue and really enhance the customer journey, and we know how important that is. You know what it’s like if you’ve ever been online and been recommended a product that is nothing you would ever buy. As I mentioned, I like flying and I also like doing outdoor things, whether it’s sports or fishing. If I get recommended a product that’s interesting to me, then I might actually click on it and take a look at it. If I get offered something that’s not interesting, not only will I probably not click further, I might even get irritated and not want to shop there anymore. So these are the things that are quite important in this kind of marketing. Ultimately what we really want to do is encourage our customers to reach a little bit deeper into their pockets and spend a little bit more. So they’ll be happier and have a better experience, and ultimately we’ll be happier because we are selling more. And how do we do that? Again, just to repeat a little bit, it’s all about providing relevant recommendations. And how do we further do that?
Behind the scenes, we’re using data science and statistical modeling to help predict, or help make, the proper recommendations. So those are the details behind the scenes. It’s also important to note that the model isn’t personalized; that is to say, it wouldn’t offer me something any different than another person. There isn’t a personal element to it. It’s purely based on the product itself: if you buy a particular product, what’s the likelihood that any person would buy the next product that’s recommended? And if you look at the metrics that are measured in the industry in this space, you’re looking at the additional turnover, and at conversion, the average number of products per basket. In this particular example we’re several months into it now, and so far we’re showing about an additional three and a half percent turnover. So what’s that? 18 million a year if we project this out and things continue similarly, so quite a positive impact. And if you look at it from a conversion perspective, it’s two additional products per basket, and if you go out over a year, then you’re looking at 10 million products. So this is having a significant impact in a time where there are quite some challenges in the market. So very positive. Now let’s take a look at the modeling approach. Basically what we want to do is predict the list of relevant cross-sell products, what we call child products, based on the parent product that a customer puts into the basket. If you look at the scenario that’s shown here, the way the algorithm works is that it’s basically going to analyze all possible product pairs that could be bought together.
And it will define the pairs that are more likely to be bought together than independently. We’re doing this with a binomial statistics test, and at the end it will select the pairs with novel child products. This is also done with binomial statistics testing. So that’s a little bit of data science, but let’s look at it in maybe more simplistic terms with the depiction here. You have a basket that has products in it; in this case we’re talking about Greek yogurt and croissants. What we’re saying is that if you have placed Greek yogurt in your basket and you’re offered a croissant, the probability is quite high that you would like to purchase that. As you can see in this particular case, where it would result in 200,000 sales, it would result in a further hundred thousand, so 300,000 products that would be sold. So the basic conclusion is that somebody who would like to buy Greek yogurt will probably also want to buy a croissant. And who wouldn’t want to buy a croissant? They’re good, right? Especially when you’re on vacation in France for the summer. Okay. Now I’m just going to talk a little bit about the data sources and the architecture. This isn’t really a depiction of the full data platform. It’s very much zoomed in on the specific data science work that we’re addressing and talking about here, but it was basically redeployed on a very large scale Azure data platform. We’re using Databricks as our data science and big data engine to do this work. Although the model itself is run in batch, the results are being consumed in real time. So just to walk across: the entire breadth, from ingestion to consumption, is being orchestrated with Azure Data Factory.
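To make the pair-selection step described earlier in this segment concrete, here is a minimal Python sketch of a one-sided binomial test for co-purchase. It is an illustration under stated assumptions, not the model from the talk: the function names (`binomial_tail`, `cross_sell_pairs`), the 0.05 threshold and the toy baskets are all invented for the example, and the separate novelty test on child products is left out. The idea is that if two products were bought independently, the number of baskets containing both would follow a binomial distribution with success probability equal to the product of the two individual purchase rates; a pair is kept when the observed co-purchase count is improbably high under that assumption.

```python
from collections import Counter
from itertools import combinations
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly (only practical for small n)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def cross_sell_pairs(baskets, alpha=0.05):
    """Return (product_a, product_b, co_count, p_value) for pairs bought
    together more often than independence would predict, best first.
    alpha is a hypothetical significance threshold chosen for this sketch."""
    n = len(baskets)
    # Number of baskets containing each product, and each unordered pair.
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    results = []
    for (a, b), k in pair_counts.items():
        # Probability that one basket holds both products if they are independent.
        p_indep = (item_counts[a] / n) * (item_counts[b] / n)
        p_value = binomial_tail(k, n, p_indep)
        if p_value < alpha:
            results.append((a, b, k, p_value))
    return sorted(results, key=lambda r: r[3])


# Toy data, invented for illustration only.
baskets = (
    [["greek yogurt", "croissant"]] * 8
    + [["greek yogurt"]] * 2
    + [["croissant"]] * 2
    + [["cola", "peanuts"]] * 3
    + [["cola"]] * 5
    + [["milk", "bread"]] * 10
    + [["milk"]] * 10
    + [["bread"]] * 10
)

pairs = cross_sell_pairs(baskets)
for a, b, k, p_value in pairs:
    print(f"{a} + {b}: bought together {k} times, p-value {p_value:.4f}")
```

On this toy data, yogurt and croissants pass the test, while milk and bread, which appear together about as often as chance predicts, do not. The exact tail sum is only workable for small basket counts; at the data volumes described in the talk one would need a distributed implementation and a numerically stable tail function.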
Again, we’re using Azure Databricks as our big data and data science platform. If you look at the required data, our data sources in this case, we’re basically harvesting raw transaction lines, the product master, and the product hierarchy. In some instances there might be data that’s blacklisted, or seasonal products, and for the stores where that plays a role we can also take that into consideration. We are then taking the data and ingesting it into Databricks, where we’re basically doing some data preparation: some data wrangling, eliminating any poor quality data, and addressing anything that might come up, to make sure that we have good data quality and good data preparation. Then we’re moving our data into the data lake after we’ve transformed it and created the product data that we need. From there we can ingest it into our data model and run the data model that’s making the cross-sell product pairs and the prioritization, and that’s being consumed by Azure Cosmos DB. From there, the online website is able to consume the recommendations and the results of the data model. So again, we’re running the model in a batch scenario, but we’re consuming in real time. That gives a little bit of detail about the architecture that we’re using here. Now, talking a little bit about the modeling results: I think we’re pretty excited about the value that we were able to create with this project. It is a complex model in the sense that there are over 15 different parameters, and it requires a grid search for the optimization. The two main metrics for this are parent coverage, by which we mean how many products will have cross-sell products associated with them, and child coverage.
So, how many products become cross-sell products? That’s a percentage of the assortment, and on the right you see some of the parameters in the table. So how do we do? If you look at the parent coverage, where the model returns at least six cross-sell products, we have more than 90% coverage of the assortment. I think that’s quite impressive. If you look at the child coverage, it’s 92% in this case. That means that 92% of the assortment appears as a cross-sell product for at least one product selected by the customer. So both are over 90%, which is quite impressive. The model is taking about 50 minutes to run. We’re using a cluster of 13 four-CPU virtual machines on Azure Databricks. So even though it’s quite a heavy setup, it still takes a little bit of time to run because of how much data we’re actually dealing with here. It’s fully automated, and right now we’re running it every two weeks. Again, we’re scheduling and orchestrating this with Azure Data Factory, but that is adjustable, and we can run it more frequently or less frequently if we find that that’s more optimal. If we look at some of the example output, for these wafers on the left you can see a number of the recommended pairings, and those are actually ranked so that you start from the highest and move down. So this gives you a tabular form to see what all of this data science running in the background is really resulting in. So why Databricks? What is the real value of Databricks, and why is it uniquely positioned to be a great platform to do this on?
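As a side note on the two coverage figures quoted in this segment, the definitions can be written down in a few lines of Python. This is a sketch under assumptions: the function names, the `min_children` parameter and the five-product assortment are hypothetical, and the model's actual output format is not described in the talk. Parent coverage is taken as the share of the assortment that has at least a given number of recommended child products; child coverage is the share of the assortment that shows up as a child for at least one parent.

```python
def parent_coverage(recommendations, assortment, min_children=1):
    """Share of the assortment with at least `min_children` cross-sell products."""
    covered = sum(
        1 for product in assortment
        if len(recommendations.get(product, [])) >= min_children
    )
    return covered / len(assortment)


def child_coverage(recommendations, assortment):
    """Share of the assortment recommended as a child of at least one parent."""
    children = {child for kids in recommendations.values() for child in kids}
    return len(children & set(assortment)) / len(assortment)


# Toy assortment and recommendations, invented for illustration only.
assortment = ["wafers", "coffee", "tea", "soap", "batteries"]
recommendations = {
    "wafers": ["coffee", "tea"],
    "coffee": ["wafers"],
    "tea": ["wafers", "coffee"],
}

print(parent_coverage(recommendations, assortment))                  # 0.6
print(parent_coverage(recommendations, assortment, min_children=2))  # 0.4
print(child_coverage(recommendations, assortment))                   # 0.6
```

With `min_children=6`, the first function corresponds to the "at least six cross-sell products" reading of parent coverage mentioned in the talk.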
So one of the main things is calculation tractability. This model needs to be run on a distributed cluster in order to provide the performance we need, and we know Databricks makes that very easy. It’s very easy to scale out and increase your performance. And this really is big data. We’ve got over a year of historical data for estimating child product novelty, including 900 million transaction lines, 40 gigabytes just for that, and one month of data to run the market basket, plus all the calculations for the binomial probability. So this is definitely a big data scenario. Your memory consumption is also unpredictable here. Internally, the model is really building on the possible product-to-product pairs, and there are tens of millions of transactions, so it’s really impossible to estimate your memory consumption. So Spark is very advantageous, and of course we all know Databricks leverages Spark, which is very helpful when you’re looking at these kinds of scenarios. Then there are the productionizing requirements and productionizing patterns. Databricks really allows you to run various models in batch. We’re running four different models and productionizing them with the same pattern, which is also something that Databricks is well suited to do. Then data availability: all this data is in CSV, Parquet files or Delta tables stored in the data lake, and it’s mounted easily into a Databricks notebook. So the interaction and integration between Azure and Databricks is also quite handy here. And lastly, the integration into the business pipelines: Databricks notebooks, as you may or may not know, are very easy to schedule with Azure Data Factory for consumption.
So Databricks is just a great platform, whether you’re doing data engineering work, as we mentioned earlier, or you’re doing the data science, or you’re doing the business analysis. It’s just a great collaboration platform and a way of integrating those various activities into a process. So lastly, just to wrap things up: I think what you’ve seen here today is not a highly complex case. I think we see it quite often, but I think it really does illustrate the power of Databricks and the power of big data computing, and how it can bring value to your clients or to the market. Especially in times like now, with pandemics, this is a capability that can really help improve the bottom line. And again, even though it’s based on a fairly simple idea, it really shows a significantly high level of success. The statistics have been around for years, and we’ve been able to do these things mathematically, but we haven’t had the big compute engines that were able to do them from the compute side and the technical side. And now we know Databricks is really a game changer in this space. A lot of large retailers and consumer goods companies have these massive amounts of big data, and conversion and turnover are absolutely important to them. Whether it’s replenishment or forecasting, we’ve only touched lightly on one scenario today, and there are of course many more big data scenarios that could equally bring value. But today it was just a zoom in on this one particular interesting case.
Lastly, Databricks, and I’ve mentioned it already a couple of times, is really helping us in terms of bringing the full breadth of the business together with IT and joining these two worlds: from the data preparation and data wrangling that data engineers and analysts might be doing, to actually building the data science models that our data scientists would be doing, and providing the engine underneath for the compute that’s necessary to handle these big data scenarios. And then ultimately on the consumption side, making access easy for our business analysts, and in this case having it consumed in something like an online website, is quite easy to do. I think an equally important aspect is the ability to industrialize this and really create a production-viable solution that can bring value. So this is the presentation for today. I would just like to thank everybody for your attention. I hope you found it informative, interesting, and also relevant in the times that we have. And of course, we’re here to also talk about Databricks, and Avanade is a very important partner to Databricks. We’re also quite happy to be able to share this kind of story and our passion for working with the technology here at Avanade. So now what I would like to do is hand this back to you, those participating. And again, thank you for dialing in. What we’ll do now is open up to a number of questions, and Alexander Sherstnev will join me as well. So if you have some of the real technical data science questions, he will be able to help out on that front as well. So again, thank you very much. We look forward to your questions and hope you have a great summer. Take care. Thank you.

About Ryan Price

Ryan leads the Data & Artificial Intelligence group at Avanade Netherlands where he is responsible for the full breadth of offerings in this space including Data Platform Modernization, Intelligent Industry Solutions and Intelligent Automation. As the lead Solution Architect in the Data & AI space, he and his team help clients in their digital transformation journey by implementing industry leading solutions on the Microsoft Platform.