What is Azure Databricks?
Azure Databricks starts by bringing a collaborative experience: coauthored multilingual notebooks and advanced role-based access controls. We're really trying to blend together the roles of data engineers, data scientists, and business analysts. In these multilingual notebooks you're able to write and develop in Python, Scala, R, and even SQL, which I'll focus on a little bit more today. Azure Databricks, being a first-party service on Azure, has deep service integrations across the entire Azure portfolio. We have ML lifecycle management solutions like MLflow and Azure Machine Learning; these two work together behind the scenes to let you manage the machine learning development lifecycle end to end. Azure Databricks is a fully managed Apache Spark service providing an optimized runtime that effortlessly scales to your most demanding workloads. This product is then able to bring together all the data that you have from your wide variety of sources and drive business outcomes: developing machine learning models, powering business intelligence tools, taking action in real time, and building reliable data lakes. That's the key here, building reliable data lakes. A lot of people have this promise and this understanding that they can have reliable data lakes, and they start down this journey of creating their data lakes, and oftentimes I see customers time and again wind up with data swamps: data that they don't feel they can efficiently access, data that they feel is not clean and not reliable.
How Delta Lake Can Bring Reliability to Your Data Lakes
Before we do that, I wanna take a step back and look at the high-level reference architecture of how you would use Azure Databricks in your end-to-end solution on Azure. To start on your journey, you're gonna be bringing in data from a wide variety of sources: data that's structured or unstructured, data coming from traditional relational SQL stores, or maybe data that's streaming in real time from sensors or other IoT devices. Your key goal is to get this data centralized into a single data lake storage. You can use tools like Azure Data Factory, and you've got real-time messaging buses like Azure Event Hubs, Azure IoT Hub, and Apache Kafka; these help you stream that data, and you can stream it directly into Azure Databricks, or you can stream and store it in your data lake storage as cold storage as well. Then you have Azure Databricks here. You wanna think of Azure Databricks as a decoupling of compute and storage: you have all of your data in your Azure Data Lake storage, and now you're able to efficiently run and process this massive amount of data. You can start exploring that data, cleaning it, merging it together, joining it, enriching it, building and training machine learning models with Azure Machine Learning services and MLflow together. When you're doing these advanced analytics, you may choose to send this data to another downstream service to take further action. For example, if you train and build a machine learning model that you wanna deploy and serve, you may choose something like Azure Kubernetes Service. Or if you want to stream that data in real time to an operational database like Cosmos DB, you can power real-time line-of-business applications. Maybe your goal is BI solutions and BI scenarios; in that case, you may wanna use something like Azure Synapse, one of our next-gen data warehouse offerings on Azure.
And you can load and serve that data into Azure Synapse and take advantage of Power BI to create enterprise-grade BI solutions. So now that you roughly understand what Azure Databricks is, I wanna turn it over to Mike Cornell, a Senior Solution Architect at Databricks, to introduce a relatively new open source technology called Delta Lake. It's important to note that 75% of our customers today use Delta Lake, which is very exciting. Delta Lake helps you take this to the next level, and Mike's gonna show you how.
What is Delta Lake?
– [Mike] Yeah, great, thanks Kyle. So let's talk a little bit about what Delta Lake is, and before we get into the specifics, let's talk about some of the reliability challenges that customers have historically had using data lakes. A common issue is this idea of failed production jobs. When you're running production jobs using a framework like Spark and those jobs fail, a lot of times they leave your data in some sort of corrupted state: we don't know what files were written and what files weren't, there's no process for committing files, and no easy way to roll things back. A second problem that we find in lakes is a lack of schema enforcement. As files are being written and new data is being inserted into tables, there's no real way to enforce schemas, so over time, if a date field formatted as yyyy-MM-dd today changes to yy-MM-dd tomorrow, there's no easy way to catch that. There's also no easy way to evolve schemas: let's say today you're dealing with 10 columns and tomorrow your source data set adds an 11th or 12th column; there's no easy way to add those two columns to your existing schemas, right? Finally, data lakes have provided a real lack of consistency. If you've got multiple jobs that are reading and writing from a single path inside of your lake, there's no real isolation, no real way for job one to know what's being read and written by job two. When a user queries a table while a job is writing to that table at the same time, there's no easy way for that user to understand exactly what the state of that data is.
With all three of these challenges, a lot of times we don't recognize problems until a user begins to query the data lake path. So a lot of times a user, two or three months down the road, begins to query some data from the lake and finds that there's some inconsistency with the schema or the data just doesn't look right, and then the data engineers have to go back and figure out what happened. They find out that three months ago some job failed or some schema changed, and that resulted in the data being incorrect, and trying to go back and fix that is extremely painful. So now that we understand the challenges, how do we fix them? Delta Lake is a brand new open source project that helps address these challenges. It's open source, it's an open format, it's built on top of Parquet, and it adds the reliability and performance enhancements that help with the challenges we just talked about. As we'll note, it's fully compatible with Spark APIs, so you can use all of the same code that you've been using historically with Delta Lake with minimal change. And to show how easy it is to use: today, where you've been saying create table using parquet, or reading and writing with a format of parquet, it's very simple to plug Delta in. All you do is, wherever you were writing parquet, you write delta instead. To stay on the easy track as well, you'll note that Delta fully supports SQL. SQL is a universal language used by everybody, whether you're a data engineer, a data scientist, a SQL analyst, or someone who just wants to query tables on the front end. SQL is a common language used across the entire data stack, and we wanna make sure that Delta supports SQL and show you how easy it is to use SQL on top of your data lake using Delta.
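To make that concrete, here's a sketch of the kind of one-word change being described; the table name and path are hypothetical:

```sql
-- Before: a table backed by plain Parquet files
CREATE TABLE events USING parquet LOCATION '/mnt/lake/events';

-- After: the same statement, with delta swapped in for parquet
CREATE TABLE events USING delta LOCATION '/mnt/lake/events';
```

The DataFrame API is the same story: `df.write.format("parquet")` becomes `df.write.format("delta")`.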
So what exactly is Delta Lake? Delta Lake tries to solve all of these reliability challenges by providing a collection of Parquet files along with a transaction log. What this transaction log provides is a way for us to do ACID transactions, so now you can do your inserts, your deletes, your updates, even merge transactions like you would with a traditional SQL data warehouse. You're able to do that inside of Delta Lake. What Delta Lake also does is create versioned Parquet files, so as your data changes over time, we create new versions of that data. When you insert new records, when you delete records, when you update records, you actually create a new version of that data, and now you might have two versions of the same records. What this enables is capabilities like time travel: we can select what a table looked like at some point in time. Hey, show me what this data looked like three weeks ago; because of the transaction log and because of the versioning, we can go back and see what tables looked like. We also provide schema enforcement, so that no data can be written to a Delta path if it doesn't match a specific schema. We also provide schema evolution, which is kind of the opposite of that: if over time your schema changes, you can actually allow the schema to evolve and change with the data that's flowing into it. Finally, we support both streaming sources and sinks, so you can stream into Delta and you can stream out of Delta; at the same time you can have batch jobs that are reading and writing from Delta, and you can have customers and users that are writing select statements and querying Delta. And because of our snapshot isolation and the versioning of the data, each job and each user is always getting a consistent, isolated snapshot view of that data.
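As a sketch of the time-travel syntax (the table name and values here are hypothetical), Delta lets you query a past snapshot by timestamp or by version number:

```sql
-- What did this table look like three weeks ago?
SELECT * FROM sensor_readings TIMESTAMP AS OF '2019-09-10 00:00:00';

-- Or pin to an exact version recorded in the transaction log
SELECT * FROM sensor_readings VERSION AS OF 12;
```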
So now that you understand a little bit about what Delta Lake is, I’d like to turn it back over to Kyle so he can show you a little bit about how we use Delta Lake.
– [Kyle] Awesome, thanks Mike. This is wonderful; it's amazing to look at all of the features this offers us and how it can provide ACID guarantees, snapshot isolation, schema enforcement, and efficient upserts. Let's take a look at exactly where it fits in, back to the reference architecture I was chatting about. Sometimes people hear the name Delta Lake and, if they're already familiar with the Azure platform, they can get confused: hey, Azure Data Lake, Delta Lake, sometimes it's even hard to hear which one someone's talking about. So something to call out here in this high-level reference architecture that we talked about just earlier: where Delta fits in is actually right down here in your existing data lake storage. This isn't about creating a new storage technology, a new storage container, these kinds of things. You can think of Delta Lake as a storage format technology, and what you do is take all of that existing data you had and convert it, upgrade it, from your raw format, whatever it was, JSON, CSV, Avro, Parquet, whatever format you're dealing with, to this Delta format. And out of the box, just by switching those keywords like Mike showed you, everything else works as if it was your original data. So the advantages that Mike just described on the last slide will all apply when you upgrade your raw format data to this Delta format. That's how it fits in at this high-level overview. Let's take one more double-click down just to make sure we drive this point home, going through a multi-step process to understand how you would use Delta Lake at scale on Azure. Step number one is you load your raw data into some Azure Data Lake storage.
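A minimal sketch of that upgrade step, with hypothetical paths: Parquet data can be converted in place, while other raw formats are read once and rewritten as Delta:

```scala
// Parquet can be upgraded in place via SQL:
spark.sql("CONVERT TO DELTA parquet.`/mnt/lake/raw/events_parquet`")

// Other formats (JSON, CSV, Avro) are read once and written back out as Delta,
// with no business logic applied yet:
spark.read.json("/mnt/lake/raw/events_json")
  .write.format("delta")
  .save("/mnt/lake/bronze/events")
```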
You get it from some other storage, you bring it onto Azure Data Lake, and you bring it in in its raw format as it is, unaltered, untouched. Then in step number two, you can combine streaming and batch if you have that scenario, or if you're just using batch, you can convert this raw data right away to the Delta format. And there's an advantage to doing that at this stage, without introducing any business logic, without cleaning the data at all, without performing any extra actions: just by converting it to this Delta format, you now have a performance-optimized, schema-enforced, reliable source of data. Because honestly, how many times have you had to go back and reprocess data? How many times have you been working on a project where someone's asked you, or you've lost some data, or something's gone wrong, and you have to go reprocess months and months of data, and it's a really painful exercise? Now, if you've converted the raw format to a Delta format right away, without any other business logic, you can land at that Delta source instead of having to go all the way back to raw, getting some of those reliability benefits but also some performance benefits that we'll talk about near the end of this webinar. So you convert it to Delta format, then you can start to join it, clean it, enrich it, transform it; maybe you start developing and training some machine learning models, doing some more advanced analytics; and then you take it down to your downstream serving layers, or whatever else you wanna do with this data once you've got it into an actionable place to drive business outcomes. Now, I lay out this flow in this bronze, silver, gold kind of design paradigm, and I mean we're not talking literally, you don't have to have exactly three stages or anything like that.
We're just trying to show you an idea, a design pattern, of enabling and democratizing this data at various levels. So for example, we talked about personas and people's roles in the data lifecycle. If you have data engineers in your organization that need to touch this raw data, that need to create a shareable platform for this data throughout the company, they're gonna be using the bronze tables for sure. If you have BI analysts, people that are working with aggregated data, people that are working with summarized data, people that don't need to go munging through the trenches trying to figure out what data is what, they may be going for something more like a gold or silver kind of table. So this just gives you the idea that by creating a multi-stage pipeline in this format, you're democratizing the data at each stage, and you're enabling schema enforcement and data reliability. Mike was talking a little bit about schema enforcement and how you can allow for schema evolution in the cases you choose, and here you can get really precise as you go through your pipeline and say, hey, at this bronze table we're going to allow for some schema evolution. If something from my upstream data source changes, I'm going to allow that schema change to happen, because I don't wanna just break the pipeline right away. But when it comes to my silver and gold tables, man, I wanna protect that gold table, because my BI analysts are making a report for the CEO, making a report for our executives, and we can't have anything go wrong; we need to enforce reliability and the schema and not allow unexpected changes to happen at that layer. So hopefully this gives you the idea of how you would use Delta Lake with Azure Databricks at scale on Azure. What I wanna do now at this stage is walk you through a demo. I'm going to show you these principles live, hands on keyboard, with some code running in Azure Databricks.
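One way to express that per-stage policy, as a sketch with hypothetical paths and DataFrame names: opt in to schema evolution at bronze, and rely on Delta's default schema enforcement at gold:

```scala
// Bronze: tolerate upstream column additions by opting in to schema merging
bronzeUpdates.write
  .format("delta")
  .option("mergeSchema", "true") // allow the schema to evolve here
  .mode("append")
  .save("/mnt/lake/bronze/emissions")

// Gold: no mergeSchema option, so writes with an unexpected schema fail
// instead of silently changing the table the executive reports read from
goldAggregates.write
  .format("delta")
  .mode("append")
  .save("/mnt/lake/gold/emissions")
```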
What I'm going to do is ingest a stream of IoT sensor data with Azure Databricks, clean this data, and build a multi-stage streaming pipeline with Delta tables. Then I'm gonna use Spark SQL to explore this data a little bit, and I'm going to use time travel, the capability Mike mentioned, with SQL queries to debug and correct some erroneous data which sneaks into my pipeline. So this is the format of the demo; let me flip over to the browser. To give you context and tell a little bit of a story as we go: the IoT sensors in the data set we're using, I'm going to describe as sensors that are measuring carbon emissions for a company. The data set that I'm using is actually stored here; when you see this prefix called DBFS, that's the Databricks file system. Anytime you create an Azure Databricks workspace, you get a default storage account created with that workspace, and you have access to a bunch of demo data sets. I just want to call this out so you know that this data set I'm playing with, and many more, are available to you as soon as you create an Azure Databricks workspace. So here I'm using an IoT data set that has this JSON format. You can see there are device IDs, sensor names, IP addresses, all kinds of information here. One of the fields that we're gonna leverage is this measure of CO2 level, and I've run this command to just pull out the head of the file. In the next cell, I run a Spark SQL command, a regular SQL statement like you may be more familiar with, and I'm specifying that this is a JSON file. After I run this, right away it infers the schema and pulls the data out into a more structured format, maybe easier for me to read here, just to inspect it. So I just wanted to show you what kind of data we're working with.
I have a simulator that's going to run in the background; let me kick that off. This simulator is just taking this raw data and pumping it into Azure Event Hubs. We don't have to get into the details here, but Azure Event Hubs is basically a real-time messaging bus; you can think of it as a giant buffer in memory where lots of different sensors can pump data into it, and it holds that data in memory highly efficiently so that I'm able to consume it and stream it in real time through Databricks. So I can make that connection, I define a couple of connection settings for that Event Hub, and then I can start streaming this in. I'm creating this new device raw data frame; I say spark.readStream from my Event Hubs and I'm loading this data in. I'm going to take the raw stream body as is, and I'm gonna write that stream in Delta format to a location in my Azure Data Lake storage that I've named the Delta Bronze Path. As I write this down, you're gonna see we're able to inspect this stream as it goes, and you see I throttled my simulator to be really basic, really easy, like one record a second. But you can imagine the scale you can get to, with millions, hundreds of millions of sensors streaming in real time, and massive data sets that you're processing in this micro-batch mode with Structured Streaming. So let's take a peek at what's flowing through at this point in time. Here I ran this command that says spark.readStream, format delta, reading the Delta files at this Delta Bronze Path. And you can see it's just a string body, but we don't understand the structure or schema. I can write the same command with SQL here. Sorry, I advanced the cell after I executed. I can run the same command with SQL, just select star from the Delta Bronze Path, and let's take a peek. Same results as up above; here I'm using a more familiar SQL kind of command. So I see it's a raw string and, yeah, I see the basic information here.
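The bronze ingestion step just described looks roughly like this; the Event Hubs connection settings, paths, and names are placeholders, not the exact demo code:

```scala
// Read the raw stream from Azure Event Hubs
val deviceRawDF = spark.readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap) // connection settings defined earlier
  .load()

// Write the raw body, as is, to the bronze Delta path
deviceRawDF
  .selectExpr("CAST(body AS STRING) AS body")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/lake/checkpoints/bronze")
  .start(deltaBronzePath)
```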
So now we can take this to the next level and make it a little bit more actionable. This cell is a little bit boring and verbose, but I'm basically typecasting, adding column names, doing a little bit of cleanup, converting timestamps. Let me execute this cell, and then I will write this clean stream down to the next stage, which will actually be the final stage for the purposes of this demo, in what I call the Delta Silver Path. As I write this down into my Delta Silver Path, let's take a peek at what it looks like at this stage. As this starts running, you're going to see that here too I can just write a basic SQL statement to explore this data: select star from the Delta Silver Path. So now it has structure, right? I've added that structure, I've cleaned it up, I've converted some timestamps, and I'm exploring this data. Something to call out in this notebook: this is a multilingual notebook. I don't know if you noticed as I've been going through, but this notebook's default language that I've chosen is actually Scala, so some of the other code that I'm writing is in Scala. But here, when I come in to execute this SQL command, you'll see I used this header that says percent SQL, and I'm able to switch into another language. I can do the same thing in another cell: I can create another cell and say percent Python, and then I can start writing Python code, or R, et cetera. So you see the multilingual support of this notebook. Also something to call out: these are coauthored notebooks, so multiple people could be in here. Mike and Clinton and I could all be in the same notebook authoring code; I can see where everyone's cursors are and what they're typing as they go. And if I wanted to collaborate a little bit more with the people I'm working with here, I can highlight some lines of code and write comments to my peers.
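The bronze-to-silver cleanup cell described above is roughly this shape; the column names and schema are assumptions inferred from the demo's description, not the actual notebook code:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumed shape of the sensor JSON payload
val sensorSchema = new StructType()
  .add("device_id", LongType)
  .add("device_name", StringType)
  .add("device_type", StringType)
  .add("co2_level", LongType)
  .add("timestamp", LongType)

// Parse the raw bronze body, typecast, and convert the timestamp
spark.readStream
  .format("delta")
  .load(deltaBronzePath)
  .select(from_json($"body", sensorSchema).as("r"))
  .select($"r.*")
  .withColumn("event_time", to_timestamp(from_unixtime($"timestamp")))
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/lake/checkpoints/silver")
  .start(deltaSilverPath)
```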
I can even come in and, let's see, where did it go? Here, there's MLflow tracking that I can get at. I also have automated version control, where I can come in and link this to my Azure DevOps account, or Bitbucket, or GitHub, and have it automatically sync with version control. So there are a lot of collaboration capabilities in these notebooks that I wanted to call out as we move through. Where I was: I just took a peek at what this Delta Silver Path looks like, and I have this data that's been formatted. Let's do something to make it a little bit more interesting. What we're going to do is a sum of the CO2 level, grouped by device type and timestamp, so we'll get a graph down below that shows us what the CO2 level is for various device types over time. It's gonna chart the trend through time. All right, so you see this data streaming in real time; again, this is the CO2 levels for various device types, meters, thermometers and so on, and these are calculating the quote-unquote CO2 level for the carbon emissions that the sensors are measuring. So this data will update in real time; it seems to be a little bit slow for me today, but this graph moves every few seconds and will update with new values; there, you see it update. So we see that there are some orange bumps that are a little bit higher than others. But what we're going to do is simulate a failure; I'm gonna simulate that something went wrong with this data. So let me go over to my break-all-the-things simulation notebook and run this. Let me just make sure that it runs really quick, then we'll flip over. What we wanna do is watch this stream of data, and we understand that, yeah, there are some bumpy orange values here for the meters that are measuring the CO2.
But as we watch this stream, what's going to happen is some old data, some data that has already been recorded, already been captured, already in the past, is going to unexpectedly change values. So let's take a peek at that after this runs. Okay, looks like it's running fine for me, good, good, good. Okay, watch these values: we're gonna see the orange lines get exaggerated values in the past. All right, there it went. You see some of these extreme peaks showed up that weren't there previously. Let's try to debug this together and understand what's going on; I'm gonna show you the power of Delta Lake time travel and the simplicity of how you can debug these streams with SQL commands. So let's go down here and maybe use one of these values. If I hover, this tells me the x-axis has a number ending with 4405; let's come down here and remember that number, 4405. Let's look at what we have right now: if I select the sum of the CO2 level, grouped by device type and timestamp, from the Delta Silver Path, this is what we have, and here's one of those peaks that we're gonna analyze, at 4405. So if I come down here, 4405 is the peak that I wanna analyze. Let's see what's going on in the data at that timestamp. Let me sort it by the CO2 level here. And I see that there were two records of the same CO2 level recorded at the same timestamp, and hey, there's the same device ID. Is this duplicate data? Was it late-arriving data? Looks to me like this is probably duplicate data: the same device ID, at the same exact granular timestamp, pumped the same CO2 level value. Hey, but you know, the device name is different. So maybe something happened where perhaps there was a firmware update on the sensors and somehow the device name got reset, or, you know, you've probably experienced weird cases with your data pipelines before where duplicate data somehow erroneously got pumped into the system.
So let's see what we can do about this, let's see how we can correct the situation. Let's use time travel to find an older version of the table to confirm we aren't crazy, that we weren't imagining that these values all of a sudden spiked up. So I run this command right here that says describe history on my Delta Silver Path table. What this does is let me see all the versions of this table: the users that modified it, some of the operation parameters, what notebook and what cluster the changes came from, all these different values. But let me go down here and look for an older version; let me see which older version I wanna use. Maybe I passed the one I was already thinking of. Here, let's grab a timestamp from the past, and let's say this is what the data used to look like. Let me replace this timestamp right here and run this. So in the past, for 4405, it was something like this: this wasn't a particularly large peak in the CO2 levels measured for these meters. So in the past this wasn't a problem, but now it is. So let's write a query that shows us the difference between the two versions of this data. I'm basically doing an anti-join; if you pause the video, or go back to review this later and look through this query, it's basically an anti-join to see what the difference is between these two versions. I'm querying from the Delta Silver Path, and then I'm joining with the same Delta Silver Path but looking at a previous timestamp. This concept of time travel is so powerful: I can go back in the past, I can immediately roll back to a previous version, I can use it to correct things like we're about to do, and I can use it to inspect other values. So here, this result shows that the orange values, the CO2 levels that are standing alone, these are the double-pumped data values.
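The anti-join being described can be sketched like this; the path, column names, and timestamp are placeholders:

```sql
-- Current rows that have no matching row in the older snapshot:
-- these are the records that changed after the fact
CREATE OR REPLACE TEMP VIEW diff AS
SELECT cur.*
FROM delta.`/mnt/lake/silver/emissions` AS cur
LEFT ANTI JOIN delta.`/mnt/lake/silver/emissions`
  TIMESTAMP AS OF '2019-10-01 12:00:00' AS old
  ON  cur.device_id = old.device_id
  AND cur.timestamp = old.timestamp
  AND cur.co2_level = old.co2_level;
```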
And so what I can do, you can see that I made a temp view when I ran this query; I said create or replace a temp view called diff. So now I can come and say, let's merge into our original table using diff, and we're gonna set up the join condition to say: if it has the same device ID but the device name is different, you know, because of the firmware restart we're talking about in this story, if we find that match, then let's delete those erroneous records. So let's run this query. And after that runs, we'll be able to see what the correction looks like down here from the silver path. Okay, all right, this looks like what we had expected. 4405 is somewhere around here; you can tell that this isn't an abnormal peak, and we have all the current data up to the current timestamp as well. So here we were able to use time travel and SQL commands to debug what went wrong with our data pipeline. We were able to go back and compare previous versions. We were able to use, like Mike was calling out, simple merge statements, a simple kind of upsert, to correct this data, roll back to the previous good state, and make our data pipelines reliable as we go. Now, you might think, how could I even prevent this from happening altogether? Because it's one thing to go in and debug and figure out what was going on and correct the issue, but how can I prevent this from the get-go is probably a better thing to be thinking about. And Delta Lake helps you do those things. So for example, instead of writing regular appends to your data, like you see us doing up above with some of our Scala commands where we say writeStream, format, output mode append, we can write using merge statements instead. We can even write a query that says, hey, if you ever see a device name that's different but the device ID is the same, if there's something weird happening there, then let's not merge that data in; let's keep it separate.
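The corrective merge is roughly this shape (again with placeholder names): rows matching a diff record on device ID and timestamp, but carrying a different device name, get deleted:

```sql
MERGE INTO delta.`/mnt/lake/silver/emissions` AS t
USING diff AS d
  ON  t.device_id   =  d.device_id
  AND t.timestamp   =  d.timestamp
  AND t.device_name <> d.device_name  -- same device, renamed by the firmware reset
WHEN MATCHED THEN DELETE;
```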
So there are various options you have at hand because of Delta Lake: because it enables these ACID transactional guarantees, snapshot isolation, and efficient upserts, you're able to do a lot more with your data lakes. You can truly democratize your data; you can truly make it accessible and useful for more people at your company. So hopefully you enjoyed this demo. I wanna hand it back to Mike to show us one more aspect of Delta Lake.
– [Mike] Yeah great thanks Kyle.
Performance Enhancements of Azure Databricks on Top of Delta Lake
So one of the key features that we provide is indexing, along with what we call Z-Ordering, which is the ability to change the order of the data in your files so that, if you know you're going to be querying or joining on specific columns, we can reorder the data in those files so that queries return even quicker. What this enables is the second feature here, called data skipping. Because we're indexing those files and we know the range of the data stored in each of them, we're able to skip entire files when we read. So if you write a query that has a where clause in it, and we've indexed or Z-ordered on one of the columns in that where clause, we can actually take a look at the indexes of those files and say, you know what, out of these 100 files we don't need to open 98 of them, because we know the data resides in just two. So we're able to skip 98 files and not open the ones we don't need. Another key feature that we have here is this idea of compaction. Here at Databricks, we understand the optimal file size for Parquet files when working with Spark, and a lot of times when you have jobs, especially streaming jobs, that are writing to a file location, you end up with a lot of tiny files. If Spark or Databricks has to go and pop open all of those small files when somebody queries that table or path, you get a pretty significant performance hit; it degrades performance. So we have a process for compacting smaller files into larger files, so that when Spark goes to read them, it reads them much faster. The final thing that I'll comment on here is the Delta cache that we provide.
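Both compaction and Z-Ordering are exposed through a one-line OPTIMIZE statement; the table and column names here are hypothetical:

```sql
-- Compact small files, and co-locate rows by a column you filter on often,
-- so data skipping can prune whole files
OPTIMIZE sensor_readings ZORDER BY (device_id);
```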
So whenever you read and query a Delta Lake table or path inside of Databricks, we will actually take some of the data that comes through your query and put it out onto the SSDs of your cluster. This way, for subsequent queries, when you or somebody else on the cluster writes similar queries, instead of Databricks having to make the round trip all the way back down to your storage, your Data Lake or blob storage, we're able to access the cached data that's on the SSDs, and this often results in four and five x query improvements. So what I'd like to do is show you a couple of these features in a quick demo; let me switch over here. What I'd like to show you is some of the better performance that we get with file compaction and the Delta cache. What you see here is a very simple query, a simple select with a small aggregate: we've got a where clause, we've got a group by, we've even got this order by statement. This is a very simple query against a Delta Lake table, and it takes about 15, 16 seconds. So we wanna take note of how long this query takes to execute: about 16 seconds. So let's take a look at the files underneath this table. Here we issue a simple command, and this is a list of the actual files for this table. You can see that the files are all about 1.2 megs in size. This is considered a small file size for Spark; Spark would really like to be reading files that are in the hundreds of megs, or even a gig, in size. So what's happening right now is that when we issue a query against this table, there are 1200 small files that we're actually having to read through, and that's making performance a little bit slower than we'd like.
So let's run this OPTIMIZE statement. This is the key feature that compacts the files, so hopefully what's going to happen here is that it takes all 1,200 of those really small files and compacts them into one or more much larger files that we can query much faster. After we run the OPTIMIZE statement, in addition to all of the small files, we get a new version of the Delta data that has now been compacted into two much larger files. You can see we now have a 900 megabyte file and a 300 meg file that make up all of our data, instead of all the individual 1.2 meg files. All right, so let's fire off that same query and see how fast it performs. When we run the same exact SQL query, you'll note that it now comes back in somewhere between three and four seconds. That's a four x improvement over the 15 or 16 seconds that the first query took. So that was simple compaction: a one-line SQL statement that takes the small files and compacts them into fewer, larger files. Let's take that a step further and look at the cache. If we look at the Spark log from the query we just ran, you'll notice that we read data from the file system: we had tens of megabytes read from the actual file system, and no data was read from the Delta cache, zero bytes. But because we were reading from a Delta table, you'll also see that about 26 megs of data was written to the cache. So now we have some data sitting in the cache; let's see what happens if we run a similar query. This one isn't exactly like the old query; I've added some additional aggregations. The first query had the average; this query adds in functions to get the minimum, the maximum, and a count of the rows.
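The caching in the demo happens passively as a side effect of reading, but Databricks Runtime also lets you warm the Delta cache explicitly. A sketch, again assuming the hypothetical `sensor_readings` table:

```sql
-- Explicitly preload data into the Delta cache on the cluster's SSDs,
-- rather than waiting for a first query to populate it passively
CACHE SELECT deviceId, temperature, event_date
FROM sensor_readings
WHERE event_date >= '2019-06-01';
```

This step is optional; as the demo shows, simply querying a Delta table populates the cache on its own.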
But it's a very similar query; it includes a lot of the same elements, so hopefully it can hit the cache. When we run this query, you'll note that it only took 1.28 seconds. That's three times faster than the query off the optimized data, and more than 12 x faster than the original query against the raw table. So let's take a look at the Spark log and see exactly what happened here. This time we didn't read any data from the actual file system, so we didn't read from Azure Data Lake in this case. Where we did read, you'll see, is the cache: instead of zero cache hits, we actually hit the data that had been put down into the cache. So as the raw data from the Data Lake passed through, we wrote some of it out to the SSDs, and this time a similar query accessed the cache instead of the file system. You'll also see that we didn't write any new data to the cache in this case, because we're still dealing with the same version of the data, so there's nothing new to write out. The final comment here is that using these Delta Lake performance optimizations that we've included with Azure Databricks really, dramatically improves query speed. The first query we issued took right around 15 or 16 seconds. The second, just by running the simple OPTIMIZE statement, gave us almost a four x improvement. And for the third query, we didn't have to do anything to cache that data; it happened passively, and the query returned in just over one second. So hopefully we've been able to show you a little bit more about what Delta Lake is, some of the usage patterns for it, and really how easy it is to get started using Delta Lake with SQL. Thanks.
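The third query described above, with the added aggregations, would look something like this sketch (same hypothetical table and columns as before):

```sql
-- Similar shape to the first query, with MIN, MAX, and COUNT added;
-- it overlaps enough with the earlier read to be served from the Delta cache
SELECT deviceId,
       AVG(temperature) AS avg_temp,
       MIN(temperature) AS min_temp,
       MAX(temperature) AS max_temp,
       COUNT(*)         AS readings
FROM   sensor_readings
WHERE  event_date >= '2019-06-01'
GROUP BY deviceId
ORDER BY avg_temp DESC;
```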
– [Clinton] Well, thank you everyone for attending today's webinar. We'd now like to switch over to a live Q&A to answer some of the questions we see coming in. First of all, one question that's come in is whether you'll be able to get these slides, a link to the recording, and these demo notebooks, and the answer is yes. Within a couple of days we'll send links to all of that to the email you used to register, so you can review the slides, go back to the demo notebooks, and see the recording of this session. Okay, so we have a lot of good questions that have come in. First, let's talk a little bit about the format we use for Delta. There's been a question about whether this is just Parquet format, and how users are charged for using it. Kyle, would you mind taking that question?
– [Kyle] Yeah, definitely, thanks Clinton. So the question again was: is Delta Lake just a storage format, and how are users going to be charged? A couple of things to note. First, when you're using Delta Lake in Azure Databricks, it's built into the product and it's free; you don't have to pay for access to this feature. And when you think about your costs, remember that Azure Databricks separates compute and storage, so your compute won't be affected at all by using Delta Lake. In fact, you may even find some cost savings based on the performance enhancements Mike was just talking about; we've seen customers get up to 100 x performance on reads, and these kinds of things. And with the reliability you're bringing to your Data Lake, you don't have to reprocess data as much, so you may see some compute savings there too. Now, on the storage side, it is a different storage format: you're saving versions of the data's history, and there's a transaction log that gets built, so you will be storing a little bit more data than normal. Those costs should be minimal. You'll also incur extra transactions on your reads and writes, and those should be minimal as well, depending on the design you choose. So for storage, you'll want to consider what costs you're adding; for compute, no impact, and in fact maybe some cost savings there.
– [Clinton] And then there's another question coming in about limits on the data stored on the SSDs. Mike, you talked a little bit about using the cache, and I'm curious if you could take this question: what is the storage limit on the SSDs, and how much data can be cached?
– [Mike] Yeah, sure. So with the Delta cache, we might read data from where the storage sits, the Azure Data Lake store, and then write some of those bits out to the SSDs that are on the cluster. With Azure Databricks we actually have auto-scaling SSDs, so the SSDs will grow as they need to, and the cache will also purge as it needs to. There's this constant cycle of growing and purging. The cache continues to monitor the queries: data that's being queried more frequently is kept around a little bit longer, and data that hasn't had any queries for a while gets purged. As far as how much data, it's hard to come up with an exact number; again, things grow and shrink as they need to. Are you going to be able to cache terabytes of data onto the SSDs? Likely not; the cache is going to unpersist a lot of that. But it's being constantly tuned by the query volume, if that makes sense.
– [Clinton] Thank you, Mike, yeah, that's helpful. So there's another question coming in about time travel. Kyle, we talked a little bit about time travel: how long can Delta Lake store data for time travel, and is there some configuration for how far back you can go?
– [Kyle] Yeah, certainly, and it kind of relates to the other question I answered about cost. Theoretically you can keep data, and have it available for time travel, for as long as you want, but practically you'll probably want to consider a few things. We talked about the storage costs of saving extra versions of this data, so there's an extra volume of data that you're storing. And when you go to reference data in the past, you may find it's practical to time travel for up to months, but beyond that you may want to consider other options. For example, you may want to start snapshotting versions of the data: here's a snapshot of what the data looked like this quarter, this month, or this year. So if you need to time travel over a period of years, you probably want to consider a slightly different option. But time travel is really great for scenarios like the one in the demo, where we did live debugging on data that was close to us in near time. That's an excellent scenario, and of course you can also travel a little farther back, approaching months. Mike, I don't know if you have any comments on what you see customers using, and the practicality of how long you'd want to time travel for.
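Time travel is exposed directly in Spark SQL. A sketch, again assuming a hypothetical Delta table named `sensor_readings`:

```sql
-- List the versions (and their timestamps) recorded in the transaction log
DESCRIBE HISTORY sensor_readings;

-- Query the table as it existed at a specific version...
SELECT COUNT(*) FROM sensor_readings VERSION AS OF 12;

-- ...or as of a point in time
SELECT COUNT(*) FROM sensor_readings TIMESTAMP AS OF '2019-10-01T00:00:00';
```

The version number and timestamp here are illustrative; DESCRIBE HISTORY shows which values are actually available for a given table.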
– [Mike] Yeah, for sure, and I think you pretty much nailed it. What we also provide in Databricks is a way for you to clean up those snapshots for time travel. So if you decide that you really only need to keep, say, three or six months of transactions and history for time travel, we provide a VACUUM command. It's a very simple command that lets you vacuum up all of the files and transactions that exist before a certain date. So if you want to keep only the transactions written in the last three months, you can vacuum up all of the older transactions, and that will clean up your storage and reduce the amount of storage being kept.
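As a sketch of what that clean-up looks like (same hypothetical table; the retention window is an example only):

```sql
-- Remove files no longer referenced by table versions newer than ~90 days
-- (2160 hours); time travel beyond that window is no longer possible
VACUUM sensor_readings RETAIN 2160 HOURS;
```

Note that Databricks enforces a minimum retention check by default to protect concurrent readers, so very short retention windows require an explicit configuration override.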
– [Clinton] Excellent, thanks Mike. So just before Kyle's demo, he introduced some steps for ingesting and processing the data. Kyle, could you walk us through those key steps again?
– [Kyle] Yeah, and maybe the question was targeted at those four bullet points I talked about, covering what I was going to do during the demo, or maybe it was about the multistage nature; I'll talk to both just in case, to make sure I answer the question accurately. To cover what the demo entailed: first we ingested a stream of IoT sensor data with Azure Databricks, reading it through Event Hubs, where it was stored. Then we cleaned this data and built a multistage streaming pipeline with Delta tables. We used Spark SQL to explore the data, and then we used time travel SQL queries to debug and correct an erroneous data pipeline. But in terms of setup steps, there was a similar question that I want to address at the same time. We talked about that bronze, silver, gold storage layout and progression, this iteration of your data. Don't think of it as a magic recipe where you have to have exactly three stages, bronze, silver, and gold, and that's what's been wildly successful for people. It's more of a design paradigm, to help you understand that you want an iterative data pipeline and an iterative data lake, because different people in your organization will want to access data at different levels, in different variations, and at different maturity levels. By iterating on your data and having these formats, or checkpoints as you might call them, bronze, silver, gold, you've enabled that data to be accessible, usable, and high performance at each of those stages.
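The bronze/silver/gold progression can be sketched in SQL. All table names, columns, and the refinement steps below are assumptions for illustration, not the exact pipeline from the demo:

```sql
-- Bronze: raw ingested events, stored as-is in Delta format
CREATE TABLE sensor_bronze
USING DELTA
AS SELECT * FROM raw_sensor_feed;

-- Silver: cleaned and filtered, e.g. dropping malformed readings
CREATE TABLE sensor_silver
USING DELTA
AS SELECT deviceId, temperature, event_date
   FROM sensor_bronze
   WHERE temperature IS NOT NULL;

-- Gold: business-level aggregates ready for BI tools
CREATE TABLE sensor_gold
USING DELTA
AS SELECT deviceId, event_date, AVG(temperature) AS avg_temp
   FROM sensor_silver
   GROUP BY deviceId, event_date;
```

In the demo these stages were streaming rather than batch CTAS statements, but the layered idea is the same: each table consumes from the one before it.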
– [Clinton] Thank you, Kyle. Okay, one other question; we're just about out of time, and thank you, by the way, for all these great questions. There's a question about using code in the notebooks with Azure Databricks. Mike, I wonder if you can address it: can we code in R, Python, Spark, Scala, and SQL too? Are all of these options?
– [Mike] Yeah, for sure, that's one of the great things about the notebooks in Databricks, and really the notebook concept in general. Databricks fully supports SQL, R, Python, and Scala, plus all of the native Spark APIs, so if you're doing PySpark, writing Spark in Scala, or using Spark SQL, you can certainly do all of those things. You can also bring just your native Python and R to the table, so if you're doing single-threaded Python or single-threaded R, using your favorite pandas libraries and those kinds of things, you can absolutely use all of those inside the Databricks notebooks.
– [Clinton] Awesome. Okay, one last question I wanted to squeeze in, since we've seen several of these coming in: Parquet format is good for reading but a little slow on writing, so how does Delta Lake do on writes? Kyle, maybe you could address that.
– [Kyle] Sure, yeah. So like we talked about, the Delta format is actually Parquet format under the hood; it's actual Parquet files being written, so on writes you should have parity in terms of performance. When it comes to reads, remember those key points Mike called out at the end of the demo: you have file compaction, and you can Z-order, so in addition to your traditional index you now have this multidimensional indexing, which in turn enables data skipping. When you follow the best practices on tuning, you can actually see 100 x performance improvements on reads, and we've seen many customers achieve that. Really powerful.
– [Clinton] That's great, and thank you so much to all the attendees for your questions. We'll follow up with an email and do some outreach after this to try to address any questions we didn't have time to get to in this session. But again, I just want to thank you for your participation and involvement, and Kyle and Mike, I want to thank you both for presenting. I really appreciate it, thanks.