Energy wastage by residential buildings is a significant contributor to total worldwide energy consumption. Quby, an Amsterdam-based technology company, offers solutions to empower homeowners to stay in control of their electricity, gas and water usage. Using Europe’s largest energy dataset, consisting of petabytes of IoT data, the company has developed AI-powered products that are used by hundreds of thousands of users daily to maintain a comfortable climate in their homes and reduce their environmental footprint. In this talk, Erni and Stephen will take you on a tour of how Quby leverages the full Databricks stack to quickly prototype, validate, launch and scale data science products. We will explore the technical workflow of a data science project from end to end. Starting from developing a notebook prototype and tracking model performance with MLflow, we move towards production-grade Databricks jobs with a CI/CD pipeline and monitoring system in place. We will see how Quby manages more than 1 million models in production, how Delta Lake allows batch and streaming on the same IoT data, and the impact these tools have had on the team itself.
– Hi, welcome to our talk. This is Stephen and Erni from Quby, and today we’re gonna be talking about saving energy in homes with a unified approach to data and AI. We’re gonna take you on a tour of how Quby leverages the full Databricks stack, including Delta Lake and MLflow, to quickly prototype, validate, launch and scale data science products. We’re gonna explore the technical workflow of a data science project from end to end. We’re going to explain, really, how to automate those end-to-end workflows, from data ingestion through featurization, prototyping, validation, re-training and productionization of the models. We’ve actually even open sourced a repository covering some of the key elements of our approach.
But more on that later. First, I’d like to introduce you to Quby.
So, at Quby, we’re an Amsterdam-based tech company. We believe in a world where living without wasting natural resources is easy. We’ve been active since around about 2004, in the areas of energy insight and control for residential buildings, and we help to make buildings more efficient by encouraging people to change their energy consumption and behaviors.
So, what’s the scale of the problem we’re dealing with? Energy usage is massive. Within the EU, households use about 44% of all of the natural gas and 20% of all of the electricity. But how much of that is just wasted? At Quby, we’re not here to tell people how much to consume, but we wanna make sure that all of their use is intended.
So, how do we do that? We offer a number of technology products. At the center of the screen, you can see our smart thermostat and energy display, and we also offer a number of apps that provide energy insights. We operate as a technology provider to a number of large utilities, insurance companies and banks, and we generally partner with the largest players, the top three providers in each territory. We’re currently active across Europe, in the Netherlands, Belgium, Germany and Spain, and around half a million homes have our technology installed or have home owners using our technology through apps.
So, to give you a little bit of an idea of the stack we’ve got going on, we’ve shown it here. We’re collecting a large amount of IoT and customer data. This ranges from, for example, electricity meter data coming in at either a one- or 10-second resolution, through to water data, data on the heating system, user interaction data, and profile information that users provide themselves. We store all of this data in the cloud, working with AWS as our cloud provider, and on top of that, we’re running a lot of machine learning models using the Databricks platform with MLflow and Delta Lake.
On top of this, we offer personalized insights focusing on the areas of advice, insights, smart thermostat control and home monitoring. To give a bit more of a flavor of that, I wanna dive into one example in a little bit of detail. The product is called the Waste Checker. We’ve identified that over 40% of appliances in homes are inefficient or being used inefficiently. So, we’ve developed a tool called the Waste Checker that investigates the usage within homes and offers personalized advice to home owners on how they can change their behavior or change their inefficient appliances. The example on the screen shows the advice you would receive if you’ve got an inefficient dishwasher.
How do we do this? It really starts with the data. In this instance, we’re concentrating on the gas, electricity, central heating or water data. As my colleague Erni will explain in detail in a minute, we’re collecting a large amount of data on a daily basis, and that comes with some challenges. On top of that, we’re running a number of machine learning models. On the screen here, you can see an example of electricity meter data at a 10-second resolution. This is from one home for one day. We run algorithms on top of that which pick out the particular patterns of different appliances, so we’re able to identify when a washing machine is running, or a dishwasher, or a refrigerator, or even something like an electric vehicle. From these cycles, we analyze the behavior, and then we turn that into personalized advice for our end users. This is presented through an app, where we offer advice on whether appliances are inefficient or not.
So, moving on from there, we’ve got a large unified data analytics setup. I’m gonna now hand over to my colleague Erni, who’s gonna take you through some more of the details of that. Over to you, Erni.
– Thank you, Stephen. So, yes, I’m gonna walk through the unified data analytics setup, and I’m gonna show you how we run our batch, streaming and R&D processing at the same time using Databricks. It all starts with the IoT data stream that we are ingesting into our Delta Lake, and it’s more than 1 TB of data coming in every day. We store this in Delta Lake, on top of AWS cloud storage. Over time, we have also added SQL and click data streams. This collection of data sums up to more than 3 PB at the moment, and it’s still growing.
On top of that, we run our batch processing pipeline. We have several ETLs that run on top of that, and we also run a lot of models. If you consider that we have one model for each user, for each use case, that easily adds up to more than a million models that are re-trained every day. What we do is read the data from Delta Lake, refine it and store it back, and we keep refining it until it’s ready to ship to the user. At that point, we send it to our service APIs, which are responsible for storing the data and serving it to the customers.
Taking a closer look at the batch processing pipeline, we can see that all the transformations have more or less the same shape: some data in, and some data out. This makes it easy to run unit tests on them. We put some data in, and we check our expectations on the data that comes out. It also makes it quite easy to go from batch to streaming, because most of the things that work on batch also work on a stream. So, select, filter, withColumn: they all work the same way on streaming without touching the code. Everything except the non-time-based window functions, like lead, lag, first or last. In that case, we have to tweak the code a bit to make it run on the stream.
But going back to the unified setup: we have this batch processing pipeline, and then we also have exploratory data science Notebooks. So, on the same production or staging data, we can run our Notebooks and also train our models.
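The "data in, data out" shape described above can be pictured with a small sketch. This is plain Python rather than the Spark code used at Quby, and the function name, column names and sample readings are all hypothetical. It only illustrates why a row-wise transformation (the equivalent of a filter plus a derived column) gives the same result on a batch and on a stream, and why that makes it easy to test.

```python
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def add_power_kw(rows: Iterable[Row]) -> Iterator[Row]:
    """Data in, data out: keep positive readings and add a derived column."""
    for row in rows:
        if row["power_w"] > 0:  # plays the role of a filter
            # plays the role of adding a column; the input row is not mutated
            yield {**row, "power_kw": row["power_w"] / 1000.0}


# Batch: the whole (made-up) dataset is available up front.
batch = [
    {"user_id": 1, "power_w": 1500.0},
    {"user_id": 1, "power_w": 0.0},
    {"user_id": 2, "power_w": 250.0},
]
batch_result = list(add_power_kw(batch))


def reading_stream() -> Iterator[Row]:
    """Hypothetical stream that yields the same readings one at a time."""
    for row in batch:
        yield row


# Streaming: the same function consumes rows as they arrive.
stream_result = list(add_power_kw(reading_stream()))

# Row-wise logic does not care how the rows were delivered.
assert batch_result == stream_result
```

A function like lag would not fit this pattern as-is, because it needs to remember earlier rows, which is the kind of case the talk says needs extra work on a stream.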
We also recently open sourced a repository that you can download at this link. What we did is try to show you the setup for this part of the pipeline. You can think of this as being for you if you’re part of a data science team and you want to get some IoT pipelines into production using Databricks. What we try to do is help you set up your own production pipeline on Databricks, and answer some of the questions you might have when you start using it. How do you run unit tests on the code? How do you deploy it to a Notebook? How do you run integration tests between Notebooks and, maybe, jar code that has been tested? How do you go back and forth from Databricks, rapidly prototyping in the Notebooks and getting the code back into your jar? And how do you manage multiple environments, like staging and production, and deploy to them? We think this demo is a bit special, because it doesn’t just give you an overview of the code; it’s really a walkthrough of our way of working within Databricks, and we tried to bake in as many best practices as we could.
So, in this demo, I’m gonna show you how we run different environments. Here, you can see that we have a staging environment with an initialize job, and some raw data being stored by that job. Then, a create features job that creates some electricity power features out of that. This is replicated in a production environment, and the same goes for the integration test environment. That environment doesn’t have scheduled jobs; it’s triggered by a script that you will also find in the repository. What I’m gonna try to do in this demo is take the raw data from the staging environment and extract it into a new environment that I’m gonna create only for myself.
So, this way, I don’t touch any data that is being stored by the actual production environment. And, on top of that, I will try to prototype in a Notebook some new feature creations, and then, I’m gonna import them back into the code so that you can see the full process of creating these features. So, let’s go to do that.
So, let’s get started. You can start by downloading the Git repository, and this is what you’re gonna get on your computer: a jobs folder, some scala folders, scripts, and a Makefile that can help you get started. The requirement to run this is having the Databricks CLI installed. If you type databricks, you should see something like this, which means that you’re ready to go. Once you have it configured and connected to your Databricks environment, you’re ready to go. To help you get started, you can type make, and this will give you an overview of the actions that are available in the repository.
But before I go into this, I wanted to get started with the demo, so, let’s go to the code. If you explore the code, you have the Databricks workflow folder, and you have a jobs folder. I promised you that we’re gonna create a new environment. If you expand the environments, you’ll see staging, production and integration tests. These are the configurations of your jobs. If you want to create a new environment, you copy and paste one of the existing ones, and I’m gonna call it dev_erni, okay. I can also add it to Git, why not? The active jobs for now are initialize and create features, the two jobs that I mentioned. But what I want to do in this demo is work with the real data that we have coming into staging. So, I remove initialize and only keep the create features job, which is what I want to modify.
So, I won’t touch the raw database configuration, because we want to get data from the staging environment. We could also type production here, if you want to work on production data. With the feature database, we don’t want to overwrite the staging features. We want to have something new, like that, for the new features. This way, we can try out the whole pipeline completely independently in a new database without compromising the staging environment. Here, I can define some cluster configuration, and I can also write all the standard configuration (indistinct). I could also add my email to get notified if one of my jobs fails, because what I would like is an environment that is also scheduled, that runs daily, on top of raw data that is real, like the staging environment.
Once I have saved this, I’m ready to go. I can go back to my Databricks workflow folder in the terminal, and make will help me once again. We want to deploy not the staging environment but the dev_erni environment. When I hit enter, this is gonna build the jar with all the code that we have, and it’s also gonna deploy the Notebooks, which are the containers. Let’s say the Notebooks are the starting point of each job, and they trigger the transformations sequentially. And it’s also gonna create the jobs on Databricks.
So, if we go to the Databricks environment, here in the jobs list we can see what we already saw in one of the first slides. We have a staging environment with initialize and create features, we have a production environment, again with initialize and create features, and now we’ve also got dev_erni with the create features job. This job has never run, while the other ones have always run successfully.
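To make the environment setup described above concrete, here is a rough sketch of the idea in plain Python. The real repository keeps its own configuration files, whose format is not shown in the talk, so every key, database name and email address below is an assumption for illustration only. The point is just that the personal environment reads the same raw database as staging, writes to its own feature database, and activates only the job being worked on.

```python
import copy

# Hypothetical staging configuration; keys and values are illustrative only.
staging = {
    "environment": "staging",
    "raw_database": "staging_raw",
    "feature_database": "staging_features",
    "active_jobs": ["initialize", "create_features"],
    "notification_emails": [],
}


def derive_environment(base, name, feature_database, active_jobs, emails):
    """Copy an existing environment and override only what must differ."""
    env = copy.deepcopy(base)  # never mutate the environment we copied from
    env["environment"] = name
    env["feature_database"] = feature_database  # separate output database
    env["active_jobs"] = list(active_jobs)
    env["notification_emails"] = list(emails)
    return env  # raw_database is inherited, so real raw data is read


dev_erni = derive_environment(
    staging,
    name="dev_erni",
    feature_database="dev_erni_features",
    active_jobs=["create_features"],  # the initialize job is left out
    emails=["erni@example.com"],  # placeholder address
)
```

Because the raw database is inherited and only read, while the feature database is overridden, a run in the personal environment cannot overwrite staging features.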
And we can go and take a look at what we just deployed. We didn’t deploy initialize, because we didn’t want it; we just wanted to use the staging data. In real production, instead of an initialize job you would probably have a streaming job that collects the data and stores it into your raw tables. So, let’s take a look at this job. It has never run, so you don’t have any active or completed runs here, but you can go directly to the Notebook. You can see that you have the pipeline, your environment, and an independent Notebook for each environment, which you can go and actually change. Since this is the job that you deployed, you have permissions to change it, but you wouldn’t usually have permissions to deploy or change any production job, because that is up to your continuous integration and deployment user to manage.
So, in create features, we have some boilerplate code that reads some parameters. We have the definition of a repository, to which we pass the raw database and the feature database. We have some code to read the raw data from the raw table. And then, an existing transformation. This is a transformation that has already been written and tested inside the jar, so we can use it directly like this.
And then, it’s persisting the results. What I would like to do now, oops, let me go back to the Notebook, is just add a new transformation. For instance, from the power that we have here, we might want to create something like this. I’m pasting it to save some time. We group by date, because we want to maintain the partition column, which is the UTC date in this case. We also group by user ID, because we want every user to have a different feature set. And then we group by a certain timestamp range, which is a kind of time-based window, of one hour, let’s say. And we run some aggregations: min, max and average, in this case. Just some simple features.
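The grouping just described can be sketched without Spark. The version below is plain Python over a handful of made-up readings, not the Notebook code from the demo, and the function and field names are hypothetical. It groups by UTC date, user ID and a one-hour window, then computes min, max and average power per group.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

# (user_id, UTC timestamp, power in watts); sample values are made up.
readings = [
    ("u1", datetime(2020, 1, 1, 10, 5, tzinfo=timezone.utc), 100.0),
    ("u1", datetime(2020, 1, 1, 10, 45, tzinfo=timezone.utc), 300.0),
    ("u1", datetime(2020, 1, 1, 11, 10, tzinfo=timezone.utc), 50.0),
    ("u2", datetime(2020, 1, 1, 10, 30, tzinfo=timezone.utc), 20.0),
]


def hourly_power_features(rows):
    """Group by (UTC date, user, one-hour window) and aggregate the power."""
    groups = defaultdict(list)
    for user_id, ts, power in rows:
        # Truncate the timestamp to the start of its one-hour window.
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        # The date stays in the key so the partition column is preserved.
        groups[(ts.date(), user_id, window_start)].append(power)
    return {
        key: {"min": min(values), "max": max(values), "avg": mean(values)}
        for key, values in groups.items()
    }


features = hourly_power_features(readings)
```

Each key in `features` identifies one user-hour, so every user ends up with their own feature rows, as described in the talk.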
And then, we can also display it here straight away. So, if we want to run this, we can run it through the job, but we don’t have to, because we just need to connect to a cluster, and then we could run it straight away from here, and it would take the code from the cluster.
After that, we can persist the results, so we can duplicate this cell and also persist the electricity features. If we want, we keep the electricity power; otherwise, we remove it.
And then, at the end, it’s good to write some assertions, to assert that what is happening is what you expect. In this case, we are checking that the users for which we get the power signal are at least 90 percent of the total users that we have in the raw data. This way, we don’t get surprises where only half of the users get through. If this fails, the job fails, and we get an email notification. This way you can also test whether you are matching the expected accuracy, precision and recall, if you have a model and labels coming in, or whether you are actually meeting the requirements of your business case.
Once you have set this up, added your code, tweaked it, and displayed it a little bit to see if it works, you can go back to your terminal and be helped by make once again. You see that there is an import_notebooks target. You can take this import_notebooks and import them from the environment that we have. The script will fetch them, and then we can run git status. We will see that we have a new file, which is the new environment, and that we have also modified create features. With git diff, I can show you that we have this new transformation inside the code in our IDE. What I would usually do is go here into the Notebooks section, and I will see the new code right here. It has been taken from Databricks and imported into the IDE.
The first thing I would do is refactor this code, like we did for the electricity power. I would create a class in here, in scala, main, transformations. Here we have all the transformations. Next to electricity power, I would create electricity features, and then I would use the same pattern to create some unit tests. So I would write something exactly like this.
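A unit test in that pattern, applied here to the 90 percent user coverage check mentioned earlier, might look like the sketch below. It is plain Python rather than the Scala tests in the repository, and the function names and user IDs are hypothetical.

```python
def user_coverage(raw_user_ids, feature_user_ids):
    """Fraction of raw-data users that also appear in the feature output."""
    raw = set(raw_user_ids)
    if not raw:
        raise ValueError("no users found in the raw data")
    return len(raw & set(feature_user_ids)) / len(raw)


def assert_min_coverage(raw_user_ids, feature_user_ids, threshold=0.9):
    """Fail loudly when too few users make it through the pipeline."""
    coverage = user_coverage(raw_user_ids, feature_user_ids)
    if coverage < threshold:
        # In a scheduled job, an uncaught error like this fails the run.
        raise AssertionError(
            f"only {coverage:.0%} of users have features, "
            f"expected at least {threshold:.0%}"
        )
    return coverage


def test_coverage_check():
    # Input data: ten made-up users in the raw table.
    raw_users = [f"user_{i}" for i in range(10)]

    # Expected value: nine of ten users is exactly at the threshold.
    assert assert_min_coverage(raw_users, raw_users[:9]) == 0.9

    # Expected failure: only half of the users got through.
    try:
        assert_min_coverage(raw_users, raw_users[:5])
    except AssertionError:
        failed = True
    else:
        failed = False
    assert failed


test_coverage_check()
```

The same three ingredients appear as in the talk: some input data, some expected values, and assertions on what comes out.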
So, with some input data, some expected values, and some assertions for the transformation that we have just written on the real data. You can also take a look here at how we use the storage and the schemas, and how we created some helper functions to streamline the workflow a little bit. So, yes, I would encourage you to try this yourself.
Let me go back to the slides. Okay, so, going back to our unified data setup. We have seen a little bit of how we manage our environments, and now we can take a look at how we manage our models. When we run these exploratory data science Notebooks over and over, we also train our machine learning models, and we log them to MLflow. In MLflow, we have a list of everything that we have trained, and at the moment there are more than 500 models that have been logged. Once we have chosen the right one, the model that we want to deploy, we deploy it to the MLflow Model Registry. In the model registry, we now have 15 models deployed, for instance, and we can see at which moments in time they were deployed and to which environment. I’m gonna show you a little bit more in the next slides. On top of Delta Lake, we can also run our streaming analytics jobs: some dashboards and real-time services. So this is how we run the whole stack, from batch to streaming and R&D.
Going back to MLflow, we think that MLflow solves, more rigorously, some crucial aspects of model development. First, it helps us choose the right model. Second, find the latest model. And third, deploy it in production. We believe that the source code of a model is not just the code; it’s the combination of code, configuration and data. How we used to do this before: we would have a list of Notebooks, each one with its own code, configuration and data.
And they would each have their own peculiarities, and we would have to choose between them by going from Notebook to Notebook. With MLflow, this workflow has changed. We have a dashboard in which we can log all the metrics, and perhaps the tags, of each of the models that we have trained, and we can easily go back to the exact point in time and the data that we used to train them. It’s also much easier to find the latest model, and which one is running in production. You no longer need to approach a data scientist; we go to a dashboard, and we see each of the versions deployed and where. Not only that: the MLflow Model Registry is also helping us untangle the deployment of the model from the deployment of the code. Previously, we used to bundle the code with the model into a deployment bundle, and then we deployed that to each of the environments. With MLflow, we reversed this dependency. It’s MLflow that decides which models get deployed to which environment. This makes it easier for data scientists to iterate on the model, and to have the right one in the right environment at the right time. So, with this, I would like to give the word back to Stephen.
– Thanks, Erni. That was really insightful to see how we’re using MLflow in particular at Quby. There have been a lot of other transformations we’ve been able to make by using unified data analytics. This has really been powering Quby’s transformation into an AI-first company. A huge thing there is that we spend less time just managing the data infrastructure. We need only a relatively small data engineering team to handle, as Erni was saying, millions of models being trained on a daily basis. It also means that data scientists in particular are able to collaborate and re-use code and models across different product teams.
So, that actually allowed us to move from more of a centralized data science setup into something that’s a little bit more de-centralized, with the impact of data being felt across the entire organization.
In addition to that, there’s really the uniform way of working that provides a number of benefits. That includes a uniform way of logging metrics, or, as you’ve just seen in MLflow, a uniform way of tracking performance. Uniformity makes it very easy for data scientists to contribute in multiple teams, to pick up things where others have left off, and to build upon a model that has already been developed, really making a difference across the whole organization. Ultimately, what does this lead to? It’s really faster development cycles. When we were starting to get active with data science within Quby, our first proofs of concept were taking months. We’re now at a stage where we’re able to bring new features to our end users on a per-sprint basis, and that’s every two weeks. That makes a huge difference in the rate at which we can bring features to market, and it also has cost implications from a commercial perspective.
Ultimately, we’re doing all of this for a purpose. The end result here is that we’re saving energy in homes across Europe. With the Waste Checker service that I showed you earlier, in the last 12 months we’ve identified 87 million inefficient appliance cycles, and out of that, there’s 67 million kilowatt hours of wastage that’s been identified and can now be targeted. Another product we have, called the smart thermostat advice, gives people alerts when they’re using gas for heating the home but other sensor data suggests that they’re not actually in the building. We’ve saved 87,500 cubic meters of gas in the last winter, and with the fall/winter 2020, we expect this to be a huge amount. So, overall, that means that step by step we’re enabling a transition to a sustainable energy system of the future.
We hope that this talk has given you some inspiration in the use cases that we’re offering at Quby, as well as some of the techniques and tools that we’re using behind the scenes to make this happen. I hope this gives you inspiration to go off and explore how you can use data, data science, and machine learning to solve some of the world’s toughest problems. Thanks for your attention.
Dr. Stephen Galsworthy is a data-driven executive and advisor who loves to create products which address significant challenges. With an analytical background, including a Master’s degree and Ph.D. in Mathematics from Oxford University, he has been leading data science teams since 2011. Currently Stephen is Chief Data Officer at Quby, a leading company offering data-driven home services technology and known for creating the in-home display and smart thermostat Toon. In this role, he is responsible for the creation of value from data and Quby’s overall product strategy to enable commodity suppliers such as utilities, banks and insurance companies to play a dominant role in the home services domain.
Erni Durdevic is a Senior Machine Learning Engineer at Quby, a leading company offering data-driven home services technology, known for creating the in-home display and smart thermostat Toon. In this role, he is responsible for building end-to-end data science products. He enjoys pairing with Data Scientists and Data Engineers to transform proofs-of-concept into products running at scale. Erni has a Master's degree in Computer Science Engineering and a passion for tackling the world's toughest problems using Data and AI.