In this talk we present how Databricks has enabled the author to achieve more with data, enabling one person to build a coherent data project with data engineering, analysis, and science components, with better collaboration and productionalization methods, on larger datasets, and faster.
The talk includes a demo that illustrates how the multiple functionalities of Databricks help build a coherent data project: Databricks Jobs, Delta Lake, and Auto Loader for data engineering; SQL Analytics for data analysis; Spark ML and MLflow for data science; and Projects for collaboration.
Francois Callewaert: Good morning, good afternoon, everyone. My name is Francois Callewaert. I am a data scientist at Databricks, and as one of the top users of the product within the company, I’d like to share with you my experience of how it has empowered me to do more with data and how it can empower you to do the same. First, I’d like to give a brief intro to explain my personal motivation for using Databricks, then explain how Databricks can be used to help you manage the entire data life cycle, and then we will explore that with a demo.
And to begin with, and to motivate the rest of the presentation, let me talk a little bit about my experience with data tools. I started my data science journey using Jupyter to do data exploration and machine learning, but quickly, as I joined my first company, I realized that more tools were needed to productionalize the results: for example, using Power BI to do more advanced data analytics and share it with stakeholders. And to build and refresh my own datasets, I needed even more tools: in SQL I could use a database and Data Factory, but with Python I would need a VM, Spyder, and Windows Scheduler. And finally, for model serving and delivery, I used Azure ML and PowerApps.
And as you can see, it paints a picture of using a different tool for each application, which makes things difficult. It can take several days to learn a new technology, then more days to debug it, and once you’ve implemented your pipeline, you need to monitor it very closely, knowing that a failure in one tool can fail the entire pipeline. So this is hard to scale, and that’s what changed completely when I started using Databricks, because now I use one single tool for all those applications.
And that really empowers you, because you spend so much less time learning the tools. You can focus a lot more on the business applications and also on expanding the things you can do. Maybe in the past, as a data scientist, you used to work with a one-gigabyte dataset in Jupyter, and at productionalization time, on one terabyte, you would want a data engineer to take care of that. That could take time to plan and prioritize, and potentially weeks to months could pass before your results were productionalized. With Databricks, you can do that within a couple of hours and have full control over the results and over their productionalization, and it really expands your capabilities as a data person.
Now, to explain which parts of the data life cycle Databricks covers, let me walk through the data life cycle itself. First, data is created by the interaction between users and software. Typically, a software engineer codes the logging of some of the user interactions and saves that into raw logs, what we call at Databricks the Bronze dataset. You may have hundreds of locations with raw logs from multiple software engineers, maybe in different formats. So you need a data engineer to collect all those logs from their different locations, process them, potentially clean them to deduplicate and remove PII, and aggregate everything in one centralized location where everyone can access it, which we call the Silver dataset. At that level, our data is centralized, but as logs it is not yet useful for the business.
You need to do a little bit more, typically aggregates and joins, to transform it into business data such as metrics and features. Maybe in the past you would need a data engineer to take care of that process, but I believe that with Databricks any data person can do it themselves, and I will show that in the demo. Once you have business data, you can do analytics, which can then feed decision-making, and modeling by data scientists, which can be used to create recommender systems. The final goal is to improve the UX of your software and eventually the success of your business.
Where does Databricks lie in that process? We have Spark and Delta to process data and save it in a reliable manner. We have SQL Analytics, which enables the data analyst to share visualizations with the company. We have Spark ML and MLflow to do machine learning and productionalize the models. And you can use all of that in Notebooks for interactive usage and in Jobs to schedule your processes at regular intervals.
And that paints the picture of what we call the Lakehouse: Databricks offers a unified platform that enables you to manage your entire data life cycle. Not only does it enable your company to make data people work together more efficiently, it also enables one user to blur the lines between all those data personas and manage the entire data life cycle for a project. And that’s what I want to demonstrate next with the demo.
I’ll start with a dataset from Kaggle. It’s an eCommerce dataset of user actions: view, cart, and purchase. Those are our raw logs. And now I have 18 minutes to show you that you can be the data engineer, the data analyst, and the data scientist in one project. All the code is available on GitHub.
I will start with the data engineer. Let’s assume that our data engineer starts with eCommerce events data, and a software engineer has coded a process to log those events every few seconds as CSV files, CSV files that will be appended to a cloud bucket, what we call raw logs. The work of our data engineer is going to be to take those raw logs, join them with a dimension table that provides the ID-to-name mapping, and save them in Delta format in what we call cleaned event logs.
Let’s start the demo. Here I am in a Databricks notebook called Prepare data, which you can run if you want to reproduce what I’m doing. I want to show here the technology that I used, Repos, to gather all the notebooks that will be used throughout this demo, and which you can use to sync all the results to GitHub. Very convenient.
When you’ve run that notebook, what you get is two tables. The first one is this raw events table, where you can see here one event, which is a purchase, with a product ID and a category ID. Please note that we do not have the category name here, and that’s because we have a second table that provides the mapping between the category ID and the category name. The job of our data engineer will be to produce the final table, which contains all the information. For that, our data engineer is going to use a Databricks technology called Auto Loader, which I’m going to illustrate in this demo.
I’m running the notebook here. First part: define the expected schema of the input data. Here, load the dimension table, ecommercedemo.categories. And here’s how the Auto Loader stream is created: spark.readStream with the cloudFiles format, and we are going to read CSV files with the schema defined in the cells above and with the path provided here. When we run that, the next step is going to be to join with the dimension table and write as a Delta table called ecommercedemo.eventstream. And as you can see, now we are running that.
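The pipeline I just described can be sketched roughly as follows. This is an illustrative assumption, not the exact notebook code: the schema columns, the raw-logs path, and the checkpoint location are made up for the example, while the table names match the ones shown in the demo.

```python
# Sketch of the Auto Loader pipeline described above. Column names in the
# schema and the raw-logs path are illustrative assumptions.
SCHEMA_DDL = (
    "event_time TIMESTAMP, event_type STRING, product_id LONG, "
    "category_id LONG, price DOUBLE, user_id LONG"
)

def build_event_stream(spark, raw_path="/mnt/raw-events"):
    """Incrementally read raw CSV logs, join the category dimension table,
    and write the cleaned events to a Delta table."""
    from pyspark.sql import DataFrame  # pyspark is available on a Databricks cluster

    categories = spark.table("ecommercedemo.categories")  # ID -> name mapping
    events = (
        spark.readStream
        .format("cloudFiles")                    # Auto Loader
        .option("cloudFiles.format", "csv")
        .schema(SCHEMA_DDL)
        .load(raw_path)
    )
    return (
        events.join(categories, on="category_id", how="left")
        .writeStream
        .format("delta")
        .option("checkpointLocation", raw_path + "/_checkpoint")
        .toTable("ecommercedemo.eventstream")
    )
```

On a Databricks cluster you would call `build_event_stream(spark)` and the stream keeps picking up newly arrived CSV files without reprocessing old ones, which is the point of Auto Loader.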
We do not have any events recorded here because our software engineer hasn’t yet started the logging. We’re going to start that process here. I simulated it by starting from the full dataset and iteratively saving parts of it as CSV files in my cloud bucket. When I do that and come back to my stream, I can see that we are indeed logging several thousand records per second. And my dataset here, which was empty, if I refresh it you will be able to see that data is coming: we already have eight files and 1.3 megabytes of data. And that’s the end of the data engineering demo.
That was the data engineer process, and now we have our cleaned event logs available to the entire company. Next, our data analyst is going to take that data, do some data exploration using Databricks notebooks, and once she has found the right metrics to provide to stakeholders, create a job to productionalize those metrics and that dataset, and eventually create a dashboard to share with stakeholders using SQL Analytics.
And this is the next notebook, the data analyst’s. Three simple parts. First, we look at the table, the same table that you saw earlier. The next part is to get some statistics to better understand the data we have. A slightly long SQL query, but simple results: the start date is October 1st, the end date is November 30th, so we have two months of eCommerce data; 68 million events, 64 million views, and a couple of million cart and purchase events. You can see we have a lot of products, but only 129 categories. As a data analyst, I feel that this is a human-processable number of categories, so we’re going to start exploring the sales per category. That is the purpose of the third cell, where we look at the sales per category, and here you can see that the top category is smartphones, with sales of around $335 million in the last two months.
Now we’ve identified an important metric: sales per category. And we would like to give our users the ability to see sales per category, per date. That’s the next step, where our data analyst shares the processed dataset, aggregated at the business level, so that users don’t need to look at the raw events. She can do that in a very simple manner. Here’s the query that provides the sales per category per day, and the only thing to add to create that table and provide it to her stakeholders is this line: create or replace table ecommercedemo.categoryinsights.
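A gold-table query of the shape described here might look like the sketch below. The column names (event_type, category_name, price, event_time) are assumptions about the cleaned events schema, not the exact query from the notebook; only the two table names come from the demo.

```python
# Hypothetical sketch of the "create or replace table" step.
# Column names are assumed; table names match the ones in the demo.
GOLD_TABLE_SQL = """
CREATE OR REPLACE TABLE ecommercedemo.categoryinsights AS
SELECT
  category_name,
  DATE(event_time) AS event_date,
  COUNT(*)         AS purchases,
  SUM(price)       AS sales
FROM ecommercedemo.eventstream
WHERE event_type = 'purchase'
GROUP BY category_name, DATE(event_time)
"""

# In a Databricks notebook this would be executed with:
#   spark.sql(GOLD_TABLE_SQL)
```

Because the statement is CREATE OR REPLACE, rerunning it in a scheduled job simply refreshes the table in place, which is what makes the daily schedule in the next step so simple.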
And the next step is to schedule a job to run this notebook on a regular basis. Here, what I did is schedule it every day. Here you can see there is a job called Gold tables job. If I go to this job, you can see that there have been a couple of runs before, a couple of manual runs, and the last two have been scheduled automatically; the schedule is 12:00 AM, and the job lasts only a few minutes. Now our data analyst has productionalized this dataset, which will be available to everyone in the company and refreshed every day.
The next step you can take as a data analyst is to create a dashboard to share visualizations with your company. For that, we use a Databricks component called SQL Analytics, which enables you to create dashboards and queries. Here I created a dashboard with two visualizations. The first one is the total sales by category: here, for example, if I search for bicycle, I can see that there were 362 sales. The second part shows the sales over time, and I can select whichever category I would like to show, which is very convenient: the user of the dashboard can look at whatever categories they are interested in. And it was very simple to code. That is the dashboard.
Next I’ll show the two queries that were written to generate those visualizations. It’s basically two SQL queries: you run this one for the sales per category and the next one for the sales over time. And please note this little thing called a parameter, which enables the user to parameterize the query and what is visualized in the chart. Now that our data analyst has built great insights, the next step is for our data scientist to, again, start from the logs, do some exploration in notebooks with feature engineering and machine learning, productionalize the feature table with a job, and eventually create an MLflow experiment to productionalize the model.
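For readers unfamiliar with SQL Analytics parameters: a dashboard query embeds a `{{ ... }}` placeholder that the dashboard renders as a selector. The sketch below assumes the gold table has event_date, sales, and category_name columns; it is not the exact query from the demo.

```python
# Illustrative sales-over-time query with a dashboard parameter.
# "{{ category }}" is the SQL Analytics placeholder syntax; the user's
# dropdown selection is substituted there before the query runs.
SALES_OVER_TIME_SQL = """
SELECT event_date, sales
FROM ecommercedemo.categoryinsights
WHERE category_name = '{{ category }}'
ORDER BY event_date
"""
```

The parameter is what lets one saved query back a chart that every dashboard viewer can re-scope to the category they care about.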
And here we’re going to try to solve one problem: find the users who are most likely to buy headphones. I showed this notebook before for the data analyst, but there is a second cell here for the data scientist, where the scientist creates the feature table, starting from the events dataset and aggregating it at the user level. As you can see, it’s very few lines, and the result is a table aggregated at the user-ID level where you can see whatever the user has done on the platform. These are just Boolean fields that tell us, for example, whether the user has put a bag in a cart. Here you can see this user viewed a bag, and if you go a little further, you can see they also viewed a costume and a dress. So it paints a picture of a user who is interested in clothes. That’s the dataset our data scientist is going to use to build predictions.
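The shape of that feature table, one row per user with a Boolean column per category-action pair, can be sketched in pandas for illustration. The demo builds it in Spark; the toy events below are made up.

```python
import pandas as pd

# Toy event log mirroring the demo's cleaned events (values are made up).
events = pd.DataFrame({
    "user_id":       [1, 1, 2, 2],
    "category_name": ["bag", "dress", "smartphone", "headphone"],
    "event_type":    ["view", "view", "purchase", "view"],
})

# Pivot to one Boolean column per (category, action) and one row per
# user -- the shape of the feature table the data scientist builds.
features = (
    events.assign(flag=True,
                  col=events.category_name + "." + events.event_type)
          .pivot_table(index="user_id", columns="col", values="flag",
                       aggfunc="any", fill_value=False)
)
print(features.loc[1, "bag.view"])  # True: user 1 viewed a bag
```

Each cell answers "did this user ever do this action in this category?", which is exactly the kind of signal the headphone-purchase model consumes.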
Let’s dig deeper into the data scientist’s notebook. I will not explain the code in much detail, just the process. We start from the top 10 categories and we pick one, headphones, and we want to predict whether a user is a potential headphone buyer based on their actions in the other categories. There are a couple of steps. The first is data preparation; here I want to highlight one line of code that converts the Delta format to pandas, and then we’re ready to do some modeling.
Here again, we model headphone purchase as a function of activity on the other features, using logistic regression. With a random guess, the score is 0.5, which makes sense since we have a balanced dataset, but using all the features, our model score is 0.684. So we gained a lot of knowledge about the likelihood of users buying headphones.
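The modeling step can be sketched with scikit-learn. Everything below is synthetic and made up for illustration (27 random Boolean features with a couple of informative ones), so the score will not match the 0.684 from the demo; it only shows the baseline-versus-model comparison.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the balanced headphone dataset: 27 Boolean
# activity features, only the first two of which carry real signal.
n = 4000
X = rng.integers(0, 2, size=(n, 27)).astype(float)
y = ((X[:, 0] + X[:, 1] + rng.normal(0, 0.8, n)) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
score = model.score(X_te, y_te)
print(round(score, 3))  # well above the 0.5 random baseline
```

The takeaway mirrors the talk: on a balanced dataset, any accuracy meaningfully above 0.5 means the activity features genuinely predict the purchase.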
Now, there’s something I don’t really like about this model: it uses 27 features. I’d like to make it simpler for productionalization, so we will use a technique called Recursive Feature Elimination: we start from the 27 features and iteratively eliminate the least important one. Because this process will generate around 27 models, we want to use a technology called MLflow Tracking to track those models and their performance at each iteration. For that, we first create an experiment with MLflow; second, for each model we create an MLflow run that we can give a name, and for each run we log the score, the model, and metadata, namely the coefficients and the feature importances.
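That elimination loop can be sketched as follows, again on made-up synthetic data. The MLflow calls are shown as comments because logging requires a tracking server; the run-per-model structure is the pattern the talk describes, not the notebook’s literal code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(2000, 27)).astype(float)
y = ((X[:, 0] + X[:, 1] + rng.normal(0, 0.8, 2000)) > 1).astype(int)

features = list(range(27))
scores = []
while len(features) > 1:
    model = LogisticRegression(max_iter=1000).fit(X[:, features], y)
    scores.append(model.score(X[:, features], y))
    # With MLflow Tracking, each iteration would be a named run, e.g.:
    #   with mlflow.start_run(run_name=f"{len(features)}-features"):
    #       mlflow.log_metric("score", scores[-1])
    #       mlflow.sklearn.log_model(model, "model")
    # Drop the feature with the smallest absolute coefficient.
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    features.pop(weakest)

print(len(scores))  # 26 models, one per elimination step
```

Plotting `scores` reproduces the pattern described next in the talk: the score stays flat while noise features are removed, then drops once informative ones start going.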
Then we have a piece of UI here that we can use to access the experiment. If I click here, I reach a UI that shows me the summary of my experiment. My experiment is called Headphone purchase prediction, and you can see the runs it logged. It started with all the features, and then we removed one feature at a time. What’s interesting is that at first you do not see any change in the score of the model, then after some time you see a slight decrease, and finally, at the second-to-last run, you can see that the score is still very good, very close to 0.684.
Let’s choose that model to understand it a little more deeply. This leads us to another UI, which shows the summary of information about each model. Here we first have information about the feature importance: the two features used are clocks.view (I think clocks is probably smartwatches) and smartphone purchase, and it makes sense that a headphone purchase could be linked with a smartwatch and a smartphone. Please also note that every run has a unique ID, which enables Databricks to store it so that it’s accessible in the future by anyone in your company.
If I come back to the data scientist’s notebook, I can retrieve that run, retrieve the model, and run my final prediction. And here it is: a user who viewed a smartwatch and purchased a smartphone has a 92% probability of being a headphone buyer. That’s extremely powerful: you can imagine this kind of prediction, scaled across thousands of products at Amazon, generating billions of dollars in additional revenue.
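Retrieving a logged model by its run ID is a one-liner with the MLflow API. The function name and the "model" artifact path below are assumptions about how the run was logged; the `runs:/` URI format is the standard MLflow one.

```python
# Sketch of loading a model back from an MLflow run by its unique ID.
# The artifact path "model" is an assumption about how it was logged.
def load_headphone_model(run_id):
    import mlflow.sklearn  # available on a Databricks ML runtime
    return mlflow.sklearn.load_model(f"runs:/{run_id}/model")

# Hypothetical usage inside the notebook:
#   model = load_headphone_model("<run id copied from the experiment UI>")
#   model.predict_proba([[1, 1]])  # smartwatch viewed, smartphone purchased
```

Because the run ID is stored by the tracking server, anyone in the company can load the exact same model later, which is what makes the handoff to a machine learning engineer straightforward.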
In this demo, I focused on the breadth of the data processes that you can cover with Databricks. Let me give you a brief recap of the highlights. First, you saw code saved in GitHub that you can version-control and share with everyone in your company. You saw data engineering with an SLA of a few seconds that saves several gigabytes of data per day and that you could potentially scale to terabytes. You saw a data engineer and a data analyst productionalize their datasets, metrics, and features with an SLA of one day, which could be reduced to a couple of minutes. You saw a dashboard available to the entire company for monitoring how product categories are selling. And you saw a model, available to the company in the UI, that machine learning engineers can pick up at any time to productionalize.
As a result, I hope I’ve conveyed the idea that you can really do more with data, and more simply, with Databricks. Thank you.
Francois is a graduate of the Ecole Polytechnique and Northwestern University. He has worked for Microsoft and currently works at Databricks.