Data is the new oil and to transform it into new products, you need a high performing oil refinery. Every organization is realizing the value of creating a data driven culture to accelerate innovation, increase revenue and improve their product.
While most organizations have standardized on Apache Spark for data processing, Delta lake allows for bringing performance, transactionality and reliability to your data. They can unlock significant value with the next generation Lakehouse architecture to run their downstream applications of Machine Learning, AI and Analytics. Through this session, we will explain how you can leverage the Lakehouse platform to make data a part of each business function, the leadership, Sales, Customer Success, Marketing, Product, HR teams so they can produce actionable insights to accelerate innovation further and drive revenue for the business.
In a nutshell, we will cover:
Vini Jaiswal: Data is the new oil and to transform it into new products, you need a high-performing oil refinery. Today, I will walk you through my own journey of data end to end from nine years ago till date, because I think it might resonate with the audience here.
I am Vini Jaiswal, senior developer advocate at Databricks, where I help data practitioners to be successful building on open source technologies like Apache, Spark, Delta, MLflow and on Databricks.
Throughout this session I will explain how you can leverage lakehouse platform to make data a part of each business function so that different teams can produce actionable insights to accelerate innovation further and drive revenue for the business.
In a nutshell, we will cover data evolution and evolution of technology, quick overview of robust lakehouse architecture, where I will talk about evolution from data architectures to data lakes and to modern lakehouse platform. And I will also demo you how you can build it yourself.
So over the last 20 years, advances in technology have completely changed the world we live in. Today, the amount of data that is being generated every second has increased dramatically due to the invention of new sources of data that didn’t exist before. Almost everything we do is now generating data.
Think about the activities you did today. For example, you must have called a friend or attended a business meeting, the logs that are getting generated or the data that is getting generated out of audio files, video files. So it’s a lot of it. You might be using the smart devices. You are generating data for AI, you are using Wi-Fi that generates data too, and so on.
So let’s see what this big data is. Well, let’s start with velocity, which refers to the speed at which new data is generated and the speed at which data moves around.
According to the IDC, which is a global provider of market intelligence and advisory services, it’s predicted that in 2025, each connected person will have at least one data interaction every 18 seconds. That’s about 6 million people we are talking. Even now, think about the speed at which your Uber ride gets to your doorstep. You verify that this is your driver, this is your passenger. Then you start the ride. And as soon as you’re done with the ride, you check out on the screen and the ride’s completed. All these things are happening super fast.
From a business perspective, how can you make use of this data coming in so quickly? How can your teams process such an enormous amount of data constantly? The second [inaudible] take of big data is volume. Volume refers to the incredible of data that’s being generated every second, every day. The IDC forecast that the amount of data that exists in the world will grow from 33 zettabytes in 2018 to 177 zettabytes in 2022. 20 to five.
Think about how much data each and everyone offers, generates every single day through social media, through phone calls, through chats, through Gmail, through cell phones, through personal smartphone devices, fitness jackets, et cetera. If we look at this from the business perspective, holy cow, it’s great. It’s so much more data coming in that we can use. However, how are we going to store that data? How are we going to capture this data?
Finally, when we look at big data, we look at data variety. There are so many types of data that is being generated today. Social media posts, credit card, autonomous vehicle, transaction, data stocks, data… Just to name a few. And this makes it incredibly difficult to make sense of it all together. So again, thinking from a business perspective, we ask ourselves, how can we combine all these disparate data sources and extract useful information out of it?
So let’s look at mini history of technology and evolution, because there is big data and now you need technology to manage the data together. This was back in 2012, I used to work for an audit firm as a data engineering intern. We got the data from our retail clients in the form of magnetic tapes. We had like a mini-infrastructure lab where we would post all the data together. It was a lot of manual work when we did the ETL process. Why? Because not all the new arriving data had the same schema. Different retailers would use different file formats, their format changes with dates.
Also the data collection process in different country varies. For example, if a retail store in a Western country or… For example, United States has only two lines in the address, whereas in Eastern countries, some of them have four lines of address.
And the column length is also important. You have to have a scalable column length to cater to different sizes. UltraEdit was a software we used back then to do text comparison and data discovery. There were missing fields, no schema enforcement manually go into an added software, Excel to merge different types of data, different volumes before ETLing and bunch of macros in Excel.
Then do the ETL process in SQL database, bunch of views and materialized views as the data layers. For downstream, we combined the code in the Ms Access and other reporting tool so that the auditors can do their jobs. So what were the problems here? Schema changes, schema inconsistency, data was in proprietary formats because when we got it, it was in retail stored format. We had to make a meaningful human readable format. Transition data was outsizing the Excel and other traditional tools we had.
Also, previously, I worked as a maintenance quality control data analyst at Southwest. This was back in 2013. I built a reporting tool using Southwest in-house technology and its own proprietary data warehouse. Southwest data was siloed and maintained in different legacy platforms. Absence of an enterprise wide view made it difficult to quickly consume integrate, lens, analyze the growing data resulting in discrepancies, duplications and multiple sources of truth.
I would manually pull data from legacy systems and various sources, including Access, Excel, SQL Server or ACO and Southwest its own data warehouse. Then I would work with quality inspectors to get the maintenance records and to manage overall health of Southwest airlines fleet. We would analyze data generated data by flight operations, maintenance checks, in flight data and other relevant reports.
And finally, after a few weeks, we would have a quality report for the supervisors and decision makers available in some form of dashboards because they needed to know how the maintenance quality tests are performed. So this was a very critical business use case. While we had the tools and the data, there were some problems that might resonate to you. Between so many departments like 25 plus departments generating and analyzing different data, information was siloed, making it difficult to complete.
Data was in legacy format, proprietary systems, it ran delivering the analytics. We were spending a lot of time building the tools, compiling the data, collecting the data, gathering it all together to support all different data types as well as compute instead of accelerating the value from data. So what was happening is, time to value was getting delayed, slower processing delayed the value generation, which could have been accelerated our fleet operations, high maintenance standards.
So I hope those problems resonate with you in some form. So when we talk about evolution, we started with data warehouses, data warehouses started with on-prem, but obviously it moved to cloud eventually. They are phenomenal at BI applications, answer to, you can run SQL queries on top. But some challenges of data warehouses were, it had limited support for ML. You had all these data, you had all these people working on it, but you couldn’t do ML on it because data warehouses lack those capabilities. It wouldn’t support different file formats and you just had work with the structured formats.
And data was in proprietary formats, usually providing a SQL interface where it would be easy to get the data in. But when you want to get data out of proprietary systems, as I explained earlier, you had to deal with a lot of complexities, your time is actually spent on building the infrastructure and tools around it.
So then comes data lakes. When I was at Citibank, these were the problems that resonated with us. Namely, we wanted our systems to support all of our data, not just a subset and we wanted it to be open meaning, this data can be accessed by a wide variety of systems and APIs. The problem was it couldn’t support ML and wide variety of data. It was complex to use.
With vast amount of historical data across our hundreds of systems, we often found ourselves struggling to get data access. Data was siloed. There were immediate off tools that our staff in each department would use. And we had to build some central hub to house all of our data, thankfully for data lakes, so that our businesses can accelerate the product innovation and services.
So we heard about data lakes, data as a wire, data streams and schema on, schema on write. Data warehouses [inaudible] like this data management capability, but data lakes struggled at this. So you end up with complex data quality issues. However, data lakes does support all the formats and machine learning.
So this is where the need for lakehouse architecture comes into play. You have data lakes, you have data warehouse, both have pros and cons. So how do you take the advantages from both of these systems and build something. So we have now seen this industry gravitate towards this concept of lake house, which effectively says, can we take from best of both worlds from data warehouse notion, bring the data management capabilities that make it easier to give high quality, reliable pipelines for your data lake. So that it’s performing and it’s supporting all downstream workloads on top. And we have seen all the cloud vendors talk about it.
So this is how a typical end to end data architecture looks like. You have the source data. It is either getting into the state as a batch or as chain. Data engineers write the pipelines to make data available to downstream applications. And downstream users might be data scientists who will be implementing machine learning models or maybe they are doing like AI use cases.
Another downstream users, may be data analysts who would want to write SQL queries to generate business insights and they might need to talk to stakeholders. And to facilitate the data engineering process, you will need a data platform where you will build the required capabilities of compute and storage, implement proper security controls, et cetera. For example, an alternate data provider, you might have to integrate your existing infrastructure to support different client facing applications, that’s where integration comes into play.
So what if all of this is simplified as well as unified? So you don’t need to invest resources into maintaining the infrastructure, building the infrastructure and you can start realizing the value out of the data you have.
So this is how lakehouse platform looks like. While most organizations have standardized on Apache Spark for data processing, data lake allows for bringing performance, transactionally and reliability to your Delta lakes. This is where combining best of two was comes into play. You can actually unlock significant value with next generation lakehouse architecture to run downstream applications such as machine learning, SQL analytics, AI.
In a nutshell, it has three main properties to remember. It’s simple, meaning data only needs to exist once to support all your data workloads on one common platform. It’s open, because it’s based on open source and open standards, it makes it easy to work with the data. So you can feed in the data, integrate with existing tools and when you need to take data out, it’s not much of a processing.
It’s collaborative, it provides you an unified platform so that the data engineers, data analysts, data scientists are not working in silos. They are working together and they are able to get the value much more faster and easily.
So let’s take a look at how we can build the lakehouse architecture. All right. So in this demo, we will be building a unified pipeline with lakehouse architecture. I will show you how to ingest streaming data into raw table and feed it into Delta Medallion architecture. Similarly, I will show you how to build a batch pipeline and then we can look at the lakehouse ending where we will talk about Databricks SQL and visualization.
I’m using Databricks notebook here, but you can run the code in your choice of ID. This notebook is made available in Github repo as well. So the data sets that I’m using for streaming use case is the Wikipedia edits in real time with a multitude of different languages. As a data practitioner, you need to build a high quality data engineering pipeline so that the downstream applications and users like machine learning engineers and data scientists or analytical folks can actually generate quality results and businesses can rely on data driven business decision making.
So let’s walk through this Delta architecture, which is a vast improvement upon the traditional Lambda architecture. So text files, RDBMS data and streaming data is all collected into a raw table. This is the first layer, which in Delta concept, we also call it the bronze layer. Then you have a view, which is a query table view. A raw table is parsed into this query table also known as silver tables in Delta concept. They may be joined with the mentioned tables and so on.
And summary tables also known as gold tables, our business level aggregates. These are often used for reporting and dashboarding. This would include aggregations such as daily active users and final visualizations. And then you can also top it off with actionable insights or a dashboard, which we call it, platinum layer.
As I said, we described Delta Medallion architecture, we use the terminology, bronze, silver and gold. I am setting up relevant data directories here just to mimic that architecture, silver, gold and bronze. Once we set the parts, we need to set up the schema. So the first step is we are going to put the schema around our data and then write the raw file into the bronze Delta directory.
So what we are going to do is, we’re going to use the readStream, which is going to read our Wikipedia IRC tenors that has been dumped into our Kafka server. And these are the Kafka server properties that we’re using so that we can get the data from this fire hose, which is setup in the form of a Kafka server and dumps the raw data into our data lake in the bronze directory.
So writestream is what dumps the data in. And then here we have option, checkpointLocation, checkpointPath, which is bronze. And then we are giving the query name. Notice that we use this .start and bronzePath. This is so that the data is streamed into the path we want and not a default location. So as our streaming is initializing, you can actually notice how input versus processing rate looks like over the period of time. Once this is done streaming, I’m going to run this command. So that we can take a look at the data. So I’m going to create a data frame out of my stream.
So I’m just going to use spark.sql and then write a query here. So I’m going to run that. Notice how our data looks like. This is just a quick glance of first three rows. Data is not perfect in human readable format. There are way too many columns.
So as you remember, we said query tables comes next after the raw data. So that’s what we are going to do. We are going to apply some transformation so that it’s in human readable format. So let’s do that. I’m going to create my query table. So I’m going to use .format(Delta). I’m using Spark reader again here, readsStream, loading it from the bronzePath and then selecting the columns that I really want.
And then I’m writing the stream into my silver table using the .format(Delta). Again, this is where I’m giving the part. And here I can specify output mode, if this is append. Append is basically if I have already existing data, it’s going to just append on top of it. And then it starts to stream into the silver part.
So as this is done, this gives a nice dashboard. You can also see the raw data from here and look how the data format looks like here. Great. So once this team is done initializing, we are going to prepare it for loading it into the silver part so we can run a query. Let’s look at how the format looks like. Much more human readable. Earlier, you could see that there were some characters which were not in human readable format, but now we do have some nice format of the data here.
So what we’re going to do next is either you can use this silver table for analytics or building machine learning orders or you can go further, level down and just build the queries that really is needed by your business. We’ll need those two fields into my gold table. Again, same thing, I’m going to specify the read stream and the part where it is reading the data from and then what are the columns I want?
So with column countryCode and geocoding.countryCode3. So that I have it in a Spark readable format where I can use this display function to build a map chart. Then I’m applying some filters because I don’t want null values to be there, I want the namespace to only match article and is anonymous true so that I’m not dealing with GDPR issues and then grouping by country code.
So once we run this… All right. So now I’m going to create visualizations out of my gold data frame. So what I’m doing here is using the display function because display function will allow me to do some plotting on my data in Databricks. Let’s see. This is where I can again see the data in real time. And this is where I’m using the plot option which is like a geographical map. And you remember, I had the geo-encoding here. That’s why I’m able to use those values here.
So he’s our country code and it’s total. This is how I’m applying the specialization. And if we noticed, this stream for a little bit, we will see that this data [inaudible] real time.
So if there is a spike we will be able to see that. These are the number of records. You see how it became darker blue. So you can see all the changes in real time using the map plotting functions as well. So that was for the streaming. I’m going to stop all the streams. This is how we built the streaming pipelines using Delta Medallion architecture. Now, I also want to show you how the batch pipelines work.
So I’m going to run some prerequisite code, which is nothing but it is actually showing us the raw data. So in my raw data here, I’m using some customer information and sales transaction information. So if you look at the raw data, we have customer IDs, some skews off sales, some first sale date, ship date, customer name, customer flag. Basically, if it’s a loyalty customer or not. Email address and if they have provided any reviews, what does the review date look like?
So all those awesome tables. So as a first layer, we call it bronze layer, we are going to ingest the raw data into our bronze layer. So what I’m doing here is using a Spark APIs, I’m creating a data frame and then display the customer data bronze. It hasn’t changed much. I’m just able to use the display function and view it in a nice format.
The next is, once we have data in the bronze layer, we are going to add it to the silver layer. So what I’m doing here is applying the transformations to my bronze table. And while applying the transformation what I’m doing is, because the email addresses are PII information, I’m pseudonimizing that. So I’m adding another column in my bronze table so that I’m anonymizing one of the fields.
This is one of the GDPR use case or audit use case. And then I’m using sha algorithm, you can use salt keys and then a lot of other encryption algorithms available there. So you can use algorithm of your choice. So now let’s display our customer silver data.
So notice how all the fields are there from the bronze table. But then there’s another email field which is pseudonymized. Each column and row is equivalent to email address field here.
And then let’s do transformation too on the bronze layer, because we don’t want the downstream application to consume any kind of PII data. So we are going to separate our look up table, where we have all the PII data available and we have a separate table which can be used for downstream application, where all the fields are anonymized and PII data is removed.
So that’s why I’m applying second transformation here. You can see that I only selected the PII data for the second transformation and because we have email address anonymous and also our main table that we are going to reference for downstream applications, so that there is a link just in case, if we need to retrieve the record from customer lookup table for various reasons. For example, if a customer is asking, like, “I need to change my name.” Your system is saying, “My name is wrong.” Then you can go back, do the customer look up and change those fields.
So once this is done, now I’m going to apply some business level transformation. As I said, I’m going to create a pseudo tables, which can be used for downstream applications. So I’m going to drop email address, all the PII information and what I’m using here is mode overwrite and format delta. And then I’m saving it as a gold table.
So once that is done, let’s look at how our gold table looks like. And this is the main table, which is going to be used by all of our BI analysts, all of our other reporting. You can further drill down or aggregate this based on the use case you are trying to do. And then this is how the data looks like. You have customer_id, customer_sk. Notice that you have no PII data. You only have a linkage to the customer look up table.
So this is how the transformation work. You saw the stream and batch processing in delta. Now we have done that. We have done the engineering part. Now comes the downstream applications. You need a unified platform. So because we have this as a unified lakehouse platform, we will be able to do Databrick SQL for the serving layer. And we can also do MLflow for ML experiments.
For example, we have these models here where you can actually go ahead and create the models. I’m not going to go into detail into this just for this session. But I do want to show you how the Databrick SQL works. So I’m going to go back to SQL analytics. Remember that we created customer, gold customer, silver customer pseudo table and I will show you how does that replicate here.
As long as you have the data in, it is going to match up the tables here, but to do that, you have to create endpoints. So SQL endpoints allow you to run processing. You need a processing engine to run any SQL. So I’m going to give it a name, Lakehouse Demo Data AI Summit, and then for the cluster size select appropriately based on your table, how much processing you need, how faster you want to go.
I’m going to select 4X-Large for this one, because I have huge data and I want to speed up the processing. You can also specify Auto Stop. This will allow you to terminate the clusters if there is any inactivities saving you costs so that you don’t run into huge business where your clusters were inactive.
Multi-cluster load balancing. This is another thing you can do. Depending on your workload, it will auto scale up and down. As a data analyst, I don’t want to go into infrastructure concepts. So that’s why database SQL makes it easy for SQL analytics to just have this capability that’s easy to understand, easy to implement and cost optimized parts or reliability optimized based on, if you are trying to save costs or if you have production application, you will build reliable optimized clusters here.
And then for chargebacks, I just do have a reporting mechanism you can put tags here, so it’s easy for your organization to understand. Report development. I’m going to create the cluster. Here, another good thing is, you can also provide permissions for your organization users or groups. So I will give all my data scientists can use permission. If you give can manage, they can edit the clusters, they can actually delete the clusters as well. So I’m going to use that and then save.
So once I have my endpoint created, I can actually run SQL queries, build the dashboards in my Databricks SQL. Some more details that I can show you here is connection details. For example, earlier, I showed you end to end architecture. Let’s say, if you have a client facing application, you have another BI tool that you’re using, like Tableau or Power BI, how do you connect that? We do provide native integration. So you have all these connection details and if you have Tableau, you can use Tableau, input all these connection details.
If you have other BI tools, similarly, you can use JDBC or DBC connection to do that. And then nice thing is you can also set up monitoring. So for example, once this cluster is being used in the applications, you will be able to see how many queries were done, how many clusters were created, things like that.
So basically gives you a good idea on observability of the work you are doing. Main page of query where you can see all the queries that has ever been written into your SQL workspace. But you can also control this as an admin so that not everybody will have access to see what queries were written. And then you have dashboards. Now, if you want some beautifying plots, you can use Databrick SQL.
You can use bar chart, doughnut chart and also do the image loading here. And then you can set up alerts so you can create new alerts every time something fails. You can also view the history of your queries and points we already covered. So that’s pretty much it. If you want to create a new query. You can go into this query window.
Let’s look at our own tables. We had this default database, this we created. Customer data bronze. We created customer look up silver and then we also created customer pseudocode for downstream applications. All of our type of tables are reflected here.
So I’m going to run this query and the query’s result is this. All right, so in this query editor, I’m going to select star from the default customer_pseudo_gold, these are our tables. Notice we feed at customer_data_bronze, we created customer_lookup_silver. And then we also created customer_pseudocode for downstream applications, all our type of tables that are reflected here.
So I’m going to run this query and the query’s result is this. [inaudible] visualization out of the data that we have. So I’m going to select a counter. Let’s pick a counter to see how other reviews are done. Then good thing about the visualization tool here is you have chart, you have cohort, counter, details view, funnel and all these different charts.
So I’m going to use a counter for this one, just because I want to see how many reviews were done. How many reviews were, yes, no flag. So I’m going to create a label. Y/N flag. And then here I will pick the customer flag detail and then value column, I will select, last_review_date. Select preferred customer. And then here, I want to count the rows for each N or Y. So I’m going to see that once I’ve done the visualization, now I can also edit the visualization back or maybe I can build a dashboard out of it.
So I can either add it to existing dashboard. I can add download this result as a CSV file, et cetera. Since I didn’t specify one of the dashboards, let’s specify our dashboard here. So I’m going to create a dashboard, new dashboard. I’m going to call it, Lakehouse Demo. Here it is. And once I have this dashboard, I can actually go back to my query, which was this one and then I’m going to add to dashboard and Lakehouse. It will appear, you just have to type first few characters and it will pop the rest of it and then do okay.
So once I add this, now you’ll notice that it will get added to your Lakehouse Demo dashboard. Dashboard Lakehouse Demo DAIS. Notice how it quickly got added. So that was the demo. I hope you learnt something from lakehouse architecture. This is how you build the lakehouse architecture.
I have few more takeaways from this session. So we have AMA sessions where you can bring in your questions, ask anything you want in terms of our technologies. And then some more talks that I have gathered together for your convenience, please take them out. Even if you are not able to check it out right now, this will be recorded and you can log back in and watch it.
We are working on a book. We are releasing the early preview of three chapters. So please make your way there. This is the link to get access and we’d love your feedback. If you have any feedback about the session, please rate us so that we can bring in awesome content to your disposal.
Vini Jaiswal is a Senior Developer Advocate at Databricks, where she helps data practitioners to be successful in building on Databricks and open source technologies like Apache Spark, Delta, and MLfl...