Delta from a Data Engineer’s Perspective


Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end to end Big Data solutions.


Video Transcript

– Hello everybody, and welcome to this Delta tutorial and intro. My name is Palla Lentz, and I’m an associate RSA. Right after my slides, I will pass it on to Jake Therianos, a Databricks customer success engineer, for a Delta demo.

A Data Engineer’s Dream…

What is it that all data engineers want? We wanna push play on a pipeline and have it flow beautifully and flawlessly as it continues to process incremental data in the most cost efficient way. Oh, and don’t forget, it’d be a dream if we could have both streaming and batch processing work seamlessly together.

The Data Engineer’s Journey..

So the data engineer begins their journey to make this dream a reality. But there are a lot of hills to climb on the way. First, we start with a stream from Kafka into our Table, on which we have a Spark job running in order to create meaningful reports from our data. However, after some time, our data volume may grow, and our Spark job may simply not be able to keep up with the number of files created every single time we write to the Table.

So we add a batch process which compacts our Table. Problem solved. But what about this late arriving data that the data engineer didn’t plan for in their original architecture? How do we update our reporting Table with everything that might be going on with our incoming data? This causes delays in our pipeline.

Based on the user’s SLAs, they might not be able to handle or accept hours worth of latency for data being unavailable.

So then the data engineer tries to add streaming to the mix. Now it’s a Lambda architecture. This seems pretty great in theory, but the data engineer quickly realizes the operational overhead they have created for themselves. Now they have to maintain this pipeline in two different places.

Not to mention, how can the data engineer be certain that the data being processed through the stream is validated against the batch process? Now there is a need for validation checks, which add complexity and latency to our pipeline. Even worse, what if something goes wrong and the data engineer needs to reprocess an entire batch of data? There will need to be cleanups in multiple places, and this is becoming harder and harder to manage. More than it’s really worth without a large team focused solely on this pipeline alone.

Updates and merges become overly complicated and costly as a data engineer devises a way to make sure that no data is missing, and no data is duplicated throughout the pipeline.

At this point, the data engineer may be thinking there has to be a better way to do this.

What was missing?

But in order to work smarter, the data engineer needs to identify what is missing.

So basically, what is missing from the pipeline that we showed previously is the ability to read consistent data, the ability to read incrementally from a large Table while still maintaining good throughput, the ability to roll back in case of bad batch writes, the ability to replay historical data, and the ability to handle late arriving data.

So… What is the answer?

The answer to our problems is Delta. With Delta we can stream and perform batch merges into the same Delta Table at the same time. This unifies streaming and batch in a continuous flow. With Delta we can also query historical snapshots of the data, with the capability to retain the historical data indefinitely. And lastly, the Delta architecture allows compute and storage to scale independently and elastically.

So how can this be, let’s discover how Delta works.

Delta On Disk

First, let’s see how Delta is written to disk. Delta is essentially Parquet files plus a transaction log per Delta Table. This transaction log keeps track of all of the manipulations performed on the data and can be thought of as a CDC log. The transaction log is then reconciled with the Parquet files in order to build an accurate and current picture of the data. Because the transactions have a timestamp, users are able to query back in time and see what the Table would have looked like at any given point. Data within a Delta Table can still be partitioned; however, there are even more optimizations you can do on a Delta Table that I will get to in a moment. You’ll notice that there are two Table Versions saved for this Table. At any given time, the user is able to take an actual snapshot of the data and store that as a Version. This can come in handy if the user does not want to save all of the historical transaction data in the log, but still wants to have certain snapshots at certain points in time.

Table = result of a set of actions

In other words, a Delta Table is a set of actions. Those actions can be reconciled with the data itself to build a picture of the data for the user. The actions in the transaction log consist of changes to the metadata, entire files that are added or appended to the Table, and removals of files from the Table. The result of these actions is the Version of the Table itself. Keep in mind that we can even time travel to see what the Table looked like at a specific point in time, based on what these actions were for that Table at that time. We will show you an example of time travel in our demo.

Implementing Atomicity

Changes to the Table are stored as ordered atomic units, which we call commits. You can see on the right side that we first added two Parquet files in one job, and then followed that with the removal of those two files and the addition of a file called 3.parquet. Since this is the order in which these tasks were performed, this is the order in which they are stored within the transaction log.
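The commit sequence just described can be sketched in a few lines of plain Python. This is purely an illustration of the idea, not Delta’s actual implementation: each commit is an ordered list of add/remove actions, and replaying the log in order reconciles it into the current set of files.

```python
# Illustrative sketch only: a Delta-style commit log as ordered add/remove actions.
# Replaying the commits in order reconstructs the current set of data files.

commits = [
    [("add", "1.parquet"), ("add", "2.parquet")],              # first job
    [("remove", "1.parquet"), ("remove", "2.parquet"),
     ("add", "3.parquet")],                                    # second job
]

def replay(commits):
    """Reconcile the log into the current file set."""
    files = set()
    for commit in commits:
        for action, path in commit:
            if action == "add":
                files.add(path)
            elif action == "remove":
                files.discard(path)
    return files

print(replay(commits))  # {'3.parquet'}
```

The current Version of the Table is whatever the replay produces; replaying only a prefix of the log would give an earlier Version, which is the essence of time travel.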

Solving Conflicts Optimistically

But let’s say that two users try to do something at the same time to the same rows. The order of operations would be as follows. Record start Version, record read/writes, attempt the commit. If someone else wins, check if anything that you’re trying to read has changed, and if so, try again.

So let’s say that I was trying to write an update to a record that someone changed milliseconds before me. Their action on the data would win, and my process would be reattempted so that I get the latest, accurate version of the data when I’m reading it in. If I’m user two in this diagram, my task would have to wait for user one’s commit to go through, and then my commit would be the next one.
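The protocol above (record the start version, attempt the commit, retry on conflict) can be sketched in plain Python. This is a toy model of optimistic concurrency, not Delta’s real conflict-resolution code:

```python
# Toy sketch of optimistic concurrency control (not Delta's implementation).
# A writer records the table version it read, then attempts to commit; if
# another writer committed first, it re-reads the latest version and retries.

class Table:
    def __init__(self):
        self.version = 0

    def try_commit(self, expected_version):
        if expected_version != self.version:
            return False          # someone else won; caller must retry
        self.version += 1
        return True

def commit_with_retry(table, max_attempts=5):
    for _ in range(max_attempts):
        start_version = table.version   # record the version we read
        # ... recompute the write against the data as of start_version ...
        if table.try_commit(start_version):
            return table.version
    raise RuntimeError("too many conflicts")

t = Table()
user2_start = t.version           # user two reads at version 0
t.try_commit(t.version)           # user one's commit wins first -> version 1
lost = t.try_commit(user2_start)  # user two's stale attempt fails
v = commit_with_retry(t)          # user two re-reads and retries -> version 2
print(lost, v)  # False 2
```

User two’s commit only succeeds after re-reading the version that user one produced, which is exactly the ordering shown in the diagram.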

Handling Massive Metadata

Large Tables can have a significant number of files in them, especially if we have yet to run an optimization command such as OPTIMIZE, which we will cover later in the demo. So how can we scale the metadata? Of course, we can use Spark, and we can use checkpointing to our advantage, choosing a time frame for how often we want to checkpoint, for example, every 10 seconds, every minute, and so on.

So let’s dive into the Delta architecture a little bit more.

Connecting the dots..

If you recall, this is our data engineer’s dream pipeline, the one they originally architected but could not bring into reality.

With Delta we brought in the ability to read consistent data. We talked about the functionality of the optimistic conflict resolution transaction log. We also discussed how we can read incrementally from a large Table with good throughput, with Delta’s ability to optimize the underlying files without disturbing the process running on the Table, and its ability to scale to handle the metadata.

Delta also gives us the ability to rollback in case of bad writes. With Delta, we can roll back to a certain place in time and continue as if those bad changes never happened.

Similarly, with Delta, we have the ability to replay historical data, or compare Versions of data without having to save hefty snapshots of the entire data itself. We can query an older Version of the Table without touching our current Version of the Table and even compare them side by side.

With Delta, we have the ability to handle late arriving data without delaying downstream processes, due to Delta’s capability to have many processes writing to the Table at a single point in time.

If we replace the middle piece with Delta, the data engineer can still reach their original goal of having a robust streaming and batch continuous pipeline.

The Delta Architecture

In my time working at Databricks, I have seen Delta help many customers wrangle their most complicated pipelines into simple and easy-to-maintain workflows. Delta can also be used directly for reporting Tables to provide the end user with the most up-to-date data that the data engineer could possibly supply.

So now that you know how Delta works, I’m going to pass it to my colleague Jake for a Delta demo. – [Jake] Hi everyone, my name is Jake Therianos, I am a customer success engineer here at Databricks. I have been with the company for about a year now; before this, I was in the tech consulting industry, mostly working as a data engineer with Oracle databases and Hadoop implementations. I came over because I wanted the opportunity to start working with some of these newer, awesome Big Data technologies like Spark and Delta Lake.

So today I’ll be using this notebook to demo some of these Delta Lake architecture concepts that Palla just explained to us. This notebook has a lot of good content; it gives us some nice code examples of these features here listed in bold. I’ll try to get through as many of these as possible, time permitting. We’ll start by showing Delta’s ability to unify batch and streaming processes with its ACID transactions. A Delta Table is both a batch Table as well as a streaming source and sink for multiple concurrent reads and writes. So this ability, along with Schema Enforcement and Evolution, provides tremendous advantages over your typical data lake architecture. We’ll then get into deletes and Upserts, and hopefully have enough time to show you its audit history and time travel capabilities.

Okay, so let’s dive into it.

So basically, what I’m doing here first is downloading data from the internet and landing it in a DBFS location. Specifically, this is loan risk data that we’re generating, which is relatively small in size.

So in case you wanted to run this notebook on the Databricks Community Edition, you can absolutely do so, and we’ve already configured it for you to do that. Using our Community Edition is free and a great way to get some practice on Databricks and learn some Spark.

So coming down here, the first thing we’re doing is creating a Parquet Version of the Table. We’re doing this because we don’t want to dive into Delta right away. We want to start with your standard Parquet format to give you some context on how this operates without Delta, and to really give you that baseline from Palla’s examples earlier. So I’m creating this Parquet Table and defining my temp view called loans_parquet.

So then we can go ahead and view the data and the Schema. It has four columns: loan ID, funded amount, paid amount, and state. So now, how many records do I have? I can run this count query and see that I have 14,705 records in this data set.

So now I’m gonna go ahead and kick off the Spark stream that Palla mentioned earlier. So I define this particular function called generate and append data stream, which is generating a read stream of random data values, as you can see here, at five records per second. And then I’m just adding columns on to match my original data. All of these values are just randomly generated in the columns. And then I’m gonna go ahead and create a write stream and write this data out to my Table path, which will trigger micro batch processes every 10 seconds.

So now let’s go ahead and define the function. And then we can simply run the function and start up our stream.

Okay, so our stream’s running. I’m gonna go ahead and run this count command.

So notice here that this is on the batch Table not on the stream.

So our count is 35. Something doesn’t look right. Okay, so a few problems here.

Clearly our record count is off. Before, we had 14,705 records, but now we’re only seeing 35.

But first, let’s go ahead and try to add a second write stream. So I’m just running the same code again and writing to a second data source.

So when I kick off the stream,

and what we’ll notice here is the input rate and the processing rate are both at zero. So no data is actually being written into my Table from the second stream. If we come back up here to our first stream, you can see our input rate of five records per second, which we defined earlier, so that makes sense. And we have a processing rate, so we have data going into the Table. And if we rerun our count function,

you can see that that count is going up.

So I do have data going in. But I can’t add a second stream here. So basically, what this means is if I’m building these big data pipelines, where I need to scale and process multiple threads from multiple sources and load these all into my central target Table, I’m not able to do that. Another way to think of this is that by opening up my first stream, I’ve basically locked my Table to all other sources writing to that Table while the stream is active. So there’s one problem. Now the next problem that we discovered earlier: what happened to all my rows?

So where did our existing 14,705 rows go? So let’s look at the data once again.

So here we can see the actual data and all my columns sitting here, and you can see our initial four columns. But then we also have these additional two columns. This is not what we were expecting. So what happened here is our initial function to generate our streaming data up here is actually using the rate stream source, .format(“rate”).option(“rowsPerSecond”, 5), to generate our new records. It’s really used mostly for development and testing purposes, but it actually suits our needs for this example quite nicely. It produces new streaming records and, by default, it includes a timestamp column and a value column. So the data that we’re generating accidentally has six columns instead of our initial four columns as intended.
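A plain-Python sketch makes the column mixup concrete. The rate source emits `timestamp` and `value` by default, and the four loan column names below (`loan_id`, `funded_amnt`, `paid_amnt`, `addr_state`) are my guess at the dataset’s actual names, used here only for illustration:

```python
# Sketch of the schema mixup (plain Python, not Spark): the rate source emits
# `timestamp` and `value` by default, and the demo appended the four loan
# columns on top of them without dropping the defaults.

expected = {"loan_id", "funded_amnt", "paid_amnt", "addr_state"}

rate_record = {"timestamp": "2020-06-24 12:00:00", "value": 0}  # rate-source defaults
generated = dict(rate_record, loan_id=99999, funded_amnt=5000,
                 paid_amnt=0.0, addr_state="CA")

extra = sorted(set(generated) - expected)
print(extra)  # ['timestamp', 'value'] -- the two columns that broke the Table
```

Six columns land in a Table whose Parquet schema only knows about four, which is exactly the corruption described above.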

So we didn’t account for these extra columns, and we’ve effectively broken the format and the Schema of my Parquet Table. In other words, I’ve basically corrupted my data by accident. So I’m gonna come down here and go ahead and stop all my streams.

But basically, what it comes down to is that because I’m using a Parquet Table, I actually can’t run multiple streams against my Table at the same time. And once there’s a Schema change, I end up corrupting my Table. Neither of these are very fun or easy issues to work with; however, they are pretty common problems that most of us see as data engineers when working with pretty much any data pipeline. So let’s go ahead and do this all over again with Delta.

So I’m going to rerun this portion of my code again, which is just creating my Table and my temporary views. But this time I’m doing it in Delta format. So I’m cleaning everything up with this delete command, and then I’m creating my loans_delta Table.

Okay, so let’s go ahead and run our count again.

And surprise, surprise, 14,705 rows.

And now I can go ahead and look at the actual data again. So there are my four columns, just like we expected to see. So we’re at a good starting point here. So we’re back in action ready to go using the Delta Table now instead of our Parquet Table. But can I do some of these Delta architecture things that Palla just mentioned? So let’s start getting into Schema Enforcement.

And we’re just going to cover this at a high level. There are some great in-depth talks on Schema Enforcement out there, which I highly recommend you look at at some point. Feel free to reach out to myself or Palla for some links if you’re interested, and I’ll try to send some links in the chat when this actually plays.

Okay, so now I’m creating a view called loans_delta_stream off of my Delta path location. So this is a streaming view straight off of our persistent Delta Table.

Now I’m just running a count star query on this view. And so what this is doing is essentially giving us a count that updates automatically.

And I’ll let that kick in. Okay, cool, so that’s going we can see our count of 14,705, which makes sense.

And so now we can try running our function to append our Table. But this time, we’re using the Delta format and giving it the Delta path, as you can see here. We’re running the same code as before which, if you remember back, is generating those two additional columns that broke our Table last time when we were using the Parquet format. So I’ll go ahead and run this.

So if I run this now,

okay, we get an error: there’s a Schema mismatch. So here’s one of the really, really cool things about Delta and Schema Enforcement: it will protect the Table from you writing anything to it that would corrupt the Table. So let’s open up the error message.

And it tells me that if I want to go ahead and merge these Schemas, I can turn on this option here, mergeSchema. We’ll talk more about this in a sec. Down here, it explains the issue and what’s actually going on. So you can see I have my Table Schema with four columns, and then my data Schema, which has six total fields.
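The behavior of that error can be sketched in plain Python. This is a toy model of Schema Enforcement, not Delta’s actual write path: reject an append whose columns don’t match the Table schema, unless the caller explicitly opts in with a mergeSchema-style flag.

```python
# Toy sketch of Schema Enforcement (plain Python, not Delta's implementation):
# an append is rejected on a schema mismatch unless mergeSchema is enabled.

def check_append(table_schema, data_schema, merge_schema=False):
    if set(data_schema) != set(table_schema):
        if not merge_schema:
            raise ValueError(
                "schema mismatch: table has "
                f"{sorted(table_schema)}, data has {sorted(data_schema)}")
        return sorted(set(table_schema) | set(data_schema))  # evolved schema
    return list(table_schema)

table = ["loan_id", "funded_amnt", "paid_amnt", "addr_state"]
data = table + ["timestamp", "value"]   # the six-column stream from before

try:
    check_append(table, data)
except ValueError as err:
    print("write rejected:", err)

print(check_append(table, data, merge_schema=True))  # six-column evolved schema
```

The column names are placeholders; the point is that the default path protects the Table, and the merge path returns a widened schema instead of silently corrupting the data.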

All right, so let’s go ahead and rewrite our function to produce the data that we actually want. All I’m doing here is adding this Select statement that selects the four columns that we want, and then I’m updating the rows per second to 50. So I’ll go ahead and run that to define the function.

And then let’s go ahead and run this again.

So we’re basically doing the exact same thing here, except increasing the input rate.

So as you can see, I’ve got data going into my Table at a rate of about 50 records per second.

Now if we scroll up to my count that I generated earlier, you can see that this count is actually increasing, which is a good thing. So now everything looks like it’s working.

Now, because I actually have ACID transactions, as I mentioned earlier, I’m actually able to run a second stream down here.

So what happens when I run the second stream?

So I’m gonna go ahead and initialize this. And I could actually go ahead and run a third and fourth and fifth stream. But for the purposes of this demo, we’re just gonna stick to two.

So the same idea here: you can see that this is actually running, and if you look at the input and processing rates here, yep, we’re inputting records at a rate of 50 per second and processing them at a comparable rate.

So now, if I go back up, you can see that my count is still increasing and now it’s increasing at an even faster rate. And that’s because I have two write streams writing in this Table right now. So right now what I have going on is a read stream for the count and two write streams all into the same Delta Table.

So now, just for a sanity check, let’s go ahead and run a batch query.

So that count looks correct. So now what we have is two write streams, a read stream, and a batch read, all from the same Delta Table.

So I’m gonna go ahead and stop all my streams. And now let’s start talking about Schema Evolution.

So now you all might be asking, hey Jake, what happens if you actually have to change the Schema of the Table? The data in my company slowly evolves over time, or we have end users in my department that can’t make up their minds about the data they’re interested in seeing. Well, don’t worry, because we’ve got you covered with that backup option of mergeSchema.

So what we’re gonna do here is actually create some dummy data with an extra column and do a simple write to append that to our Table. So you can see here, we have our four columns with this extra column called closed, containing Boolean values. Now we’re creating a DataFrame, we’re appending that DataFrame, and we’re using this option mergeSchema set to true.

So I run that.

So as I’m running the append, I can go ahead and view my data. I’m sorting on this closed field so that you can see the two new records at the top that I just added to my Table.

So you can see down here, I have these two new records that I’ve just added, and then all of my old data that had no concept of this closed field beforehand. What Spark and Delta are actually doing here is adding a column to my data set and imputing null values where needed. So that’s Schema Evolution for you: you can actually update your Schema, or evolve it, over time.
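The null-imputing behavior just described can be sketched in plain Python. This is only an illustration of the idea (the column names are placeholders, and the real work happens inside Spark and Delta):

```python
# Sketch of Schema Evolution (plain Python): appending rows that carry a new
# `closed` column evolves the Table schema, and existing rows read back with
# None (null) for the column they never had.

old_rows = [{"loan_id": 1, "funded_amnt": 1000,
             "paid_amnt": 1000.0, "addr_state": "CA"}]
new_rows = [{"loan_id": 2, "funded_amnt": 500, "paid_amnt": 0.0,
             "addr_state": "NY", "closed": False}]

# The evolved schema is the union of all column names.
schema = sorted({col for row in old_rows + new_rows for col in row})

# Old rows are imputed with None for columns they don't have.
table = [{col: row.get(col) for col in schema} for row in old_rows + new_rows]

print(table[0]["closed"])  # None -- old row had no concept of `closed`
print(table[1]["closed"])  # False
```

This mirrors what the notebook shows: the two appended records carry real Boolean values, while every pre-existing row reads back null for the new column.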

So to recap, we just walked through some of the advantages that Delta Lake will give you over regular Parquet. It has full support for simultaneous batch and streaming workloads, and any number of reads and writes to and from a Delta Table. It gives you Schema Enforcement and Schema Evolution capabilities to ensure you maintain data quality, so your data will always be usable and readable. And it gives you ACID transactions that will guarantee data integrity, even during several concurrent workloads and queries.

Okay, so now we have some time left. So I definitely wanna go ahead and show a few more features here.

So the first one I want to cover

is what most of you know as an Upsert, otherwise called merge. So scrolling down here to Upsert. An Upsert is essentially a logical, atomic combination of a typical SQL insert and update, hence the name Upsert. The logic of an Upsert basically goes: when trying to insert new records into a Table, check to see if the record already exists based on a match predicate key. If so, update the existing record; if not, insert the net new record.

So Delta gives us this Upsert capability with the merge command. Just real quick before we dive into that, let’s walk through what an operation like this would look like using normal Parquet Tables. So here are the steps. First, we identify the new rows to be inserted. Then we identify the rows that will be replaced, i.e. updated. Next, we identify all of the rows that are not impacted by the insert or update. We create a new temp Table based on all three of those insert statements, delete the original Table, rename the temp Table back to the original Table name, and then drop the temp Table.

So here we have a little moving diagram to show what that looks like in Parquet.

Now with Delta Lake, that same operation can be done in one simple atomic step.

So here I’m showing you the Spark Python API syntax, but almost all Delta commands support the original SQL syntax that most analysts and DBAs are used to. Right here, I’m just creating a subset of my initial loans_delta data using a filter on the loan ID, and then making a temp view called loans_delta_small. This just has three records, which we can see when I run this display command down here.

So here’s our three records. All right, now let’s say we wanna go ahead and add some new loan information. We wanna add two new records: the first record is an existing loan with an ID of two, which has been fully repaid, so the corresponding row needs to be updated. The next record is a new loan with an ID of three, which has been fully funded in California, and this row needs to be inserted as a new row. So I’m creating this DataFrame that we just ran through.
And then down here, I can perform my actual merge of this data on the small Delta Table. So this actually shows the syntax for Spark SQL right here, and then my actual command is using the PySpark DataFrame API syntax. Both are doing the same thing at the end of the day. Both of them are using this merge command: you identify the predicate that you want to match on, and then you have your match logic after, so when matched what to do, and when not matched what to do.

So as you can see,

this Upsert command saves you a lot of steps. Let’s go ahead and run this

and then view the results of our data so we can see what our new data looks like. So we have our two original records, then our third original record, which has been updated with the new paid amount, and then finally our last inserted record.

So as you can see, Upsert is a super powerful feature, and it acts a lot like the SQL DML merge command, but it provides even more flexibility to assist with more advanced use cases, including streaming, complex conditionality, and deduplication. This notebook actually does cover some of those, but I won’t get into them right now, just so I have time to hit on a few other important features.

So scrolling back up

to this Delete command.

So, delete from a Delta Lake Table. Similar to Upsert, there’s delete: in the same way we were matching on a predicate to perform an Upsert before, Delta lets you specify a predicate to determine which individual rows need to be deleted. This is just another standard SQL-style command that Delta supports. And of course, just like Upsert, we provide the standard SQL syntax as well as the Python API syntax. So if we were using SQL, we could write DELETE FROM table WHERE predicate, and I can actually show you that real quick.

So in notebooks we can actually use these magic commands to specify what language we’re typing in. And I could write DELETE FROM, and that would be my loans_delta Table, WHERE, and then we can just use funded amount equals paid amount as the predicate. This is what our delete statement would look like if we were using the SQL syntax. But we are going to stick with the Python API syntax for now, so I’m going to delete that cell.

So here we’re just defining our DeltaTable object, and then we’re actually running our delete method and passing in our predicate. Our predicate here can actually just be a simple SQL-style string.

So I’m gonna go ahead and run this code to get a count of my fully paid loans.

And notice here, this is just a count on top of my loans_delta temporary view. And so my count is 5,134. So I can come down here and run my delete command.

And then I can go ahead and run my count command again. And look, zero: all of those fully paid loans have now been deleted from my Table.

So it’s really important when working with Delta and using these super handy commands to really understand how it’s all working under the covers; then you can really start building efficient data pipelines and unlocking the power of Delta’s optimization potential. We’ve talked briefly about Delta’s transaction log, and that’s a really key concept if you want to start opening up the hood and understanding how this all works. We’ll go over the transaction log in a few minutes. But the key thing to understand with this delete command is that Delta is not actually deleting any physical data or files. It’s simply intelligently understanding which files contain the records that need to be deleted, marking them as removed in the transaction log, and then writing new copies of those files, just without the rows that you deleted. That’s not to say that if there are five files containing rows that match your predicate, Delta will be rewriting exactly five files; instead it can combine the data that it needs into maybe just one file, write it to disk, and then add that file to the transaction log. And of course, this gets more complex when you start working with things like partitions.
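The copy-on-write delete just described can be sketched in plain Python. This is a toy model (file names and row fields are made up for illustration): files containing matching rows are marked removed in the log, and the surviving rows are rewritten into a single new file.

```python
# Toy sketch of Delta's copy-on-write delete: matching files are marked removed
# in the log (not physically deleted), and surviving rows are rewritten into
# one new file that is then added to the log.

files = {
    "1.parquet": [{"loan_id": 1, "funded": 100, "paid": 100}],  # fully paid
    "2.parquet": [{"loan_id": 2, "funded": 200, "paid": 50}],   # still open
}

def fully_paid(row):                       # the delete predicate
    return row["funded"] == row["paid"]

log = []
survivors = []
for path, rows in files.items():
    if any(fully_paid(r) for r in rows):
        log.append(("remove", path))       # mark the whole file removed
        survivors.extend(r for r in rows if not fully_paid(r))
    # files with no matching rows are left untouched

log.append(("add", "3.parquet"))           # one rewritten file for survivors
files["3.parquet"] = survivors

print(log)  # [('remove', '1.parquet'), ('add', '3.parquet')]
```

Note that 2.parquet is never touched: only files that actually contain matching rows get marked removed and rewritten, which is why a delete can be much cheaper than a full Table rewrite.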

Okay, so moving on now, another feature here is audit Table history. This records all commits or operations that happened to our Delta Table, based off the Table’s transaction log. We can use this History command to view the history of operations made on the Table, Version by Version, with tons of detailed information. So if I go ahead and run this History command,

I get this nice pretty transaction log. It’s easy to read. And I can go in here and see all of my operations on the Table. So here’s my initial write that first you know, initialize my Table, and I’ve got all of my streaming updates.

And then I’ve got that write that I showed you earlier where we made the append, you know, when we were working with the Schema Evolution, and then we’ve got our Delete command right here.

So if we go over here, you can see all this super detailed information, where it shows us our number of removed files, number of deleted rows, number of added files, and number of copied rows. Scrolling back over, another really important thing to note here is that each operation comes with a Version and a timestamp. So this actually brings me to the next feature I wanna talk about, which is time travel.

So with time travel, we’re actually able to query old Versions of a Table based on either the Version number or the timestamp; I’ll show you examples of both. When thinking about time travel, it’s important to remember what we just discussed, the concept of the transaction log, and that Delta doesn’t actually delete files. Instead, it just records the state of the Table in the Delta log, which is actually right here inside of the Table’s main path, which I will show you now.

So here’s the Table’s main path that we defined, and here’s all the data. Within that main path, we also have this folder called _delta_log. So if we go in here,

you can actually see that each transaction produces a JSON file. All details about that transaction are contained in each of these JSON files, and they’re numbered for us. And every 10 transactions, we checkpoint by adding a Parquet file, which basically aggregates all of our previous transaction data. That’s how we deal with large-scale metadata management in Delta.
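The checkpointing scheme can be sketched in plain Python. This is an illustration of the layout, not the real log reader; the zero-padded file naming mirrors what you see in the _delta_log folder, and the checkpoint interval of 10 matches the behavior described above:

```python
# Sketch of _delta_log checkpointing: each commit is a numbered JSON file;
# every 10 commits, a Parquet checkpoint aggregates the state so far, so a
# reader only needs the latest checkpoint plus the few commits after it.

CHECKPOINT_INTERVAL = 10
NUM_COMMITS = 23                                     # commits 0..22

commits = [f"{n:020d}.json" for n in range(NUM_COMMITS)]
checkpoints = [f"{n:020d}.checkpoint.parquet"
               for n in range(NUM_COMMITS)
               if n > 0 and n % CHECKPOINT_INTERVAL == 0]

latest_checkpoint = checkpoints[-1]                  # covers commits 0..20
tail = commits[20 + 1:]                              # only commits 21 and 22 to replay

print(latest_checkpoint)
print(len(tail))  # 2 -- far fewer files to read than all 23 commits
```

Instead of replaying 23 JSON files, a reader loads one aggregated checkpoint and replays just the two commits after it, which is what keeps metadata handling scalable on very large Tables.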

So if we come down here, we can kinda see our most recent transaction number. And we can actually go into that and see what’s going on in that transaction log.

So you can see all this really detailed information. And this is actually where the History command pulls from when it creates that really nice audit history Table. So you can see all that similar information: timestamp, user ID, the operation (delete), and the operation metrics that we went through before, the two removed files, the number of rows deleted, the number of files added, the number of rows copied. And then you can actually see the detailed transaction history. So here it says remove, and then has a file path, and then another remove with a file path. So see here, we removed two files. These are the two files that were removed. So Delta identified these files as containing rows matching our predicate. It marked in the transaction log that these files need to be removed from our Table; we’re not actually physically removing the files from disk storage. And then it’s adding back a file here, and you can see the size of the file, and you can even see the number of records that it’s adding back. This file contains all of the data from those removed files that does not match our predicate. And so that’s basically how these transaction logs work in a nutshell.

So by maintaining these transaction details, Delta can intelligently interpret what any given Version of the Table should look like and query the appropriate files. So if I come down here, I can run my query using this option for time travel. And here I’m just using versionAsOf to specify that I want to see a specific Version number, which I’m passing through here with this previous Version variable. And that just contains the Version number of my Table as it was right before I ran that delete command.

So then I'm creating a temporary view from that and running a count. And as you can see, after I run this,

we can get the count of our Table as it was before we actually ran that Delete command.

We can also do the same thing with a timestamp. This works similarly, but the big thing to note here is that Delta will always go back to the closest timestamp that is prior to the date given. I'll show you what I mean by that. If I come up to the transaction log and take my most recent timestamp, so the timestamp of the current Version,

and I pass that through as my argument

and run it

and I'll get the state of the Table as it is right now: zero records. However, let's say I pass in a time that is just one second before my current timestamp. This is a time that is in between my previous Version and my current Version, obviously much closer to the current Version, but still in between both of them. Let's see what I get here.

Okay, so now it's querying the Table as it was in the previous Version. That's what I mean when I say it will always go back to the closest timestamp that is prior to the date given.
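The timestamp rule just demonstrated can be sketched as: resolve the latest version whose commit time is at or before the requested timestamp. A toy Python model, with invented commit timestamps:

```python
# Toy model of "timestamp as of" resolution. The version numbers and
# timestamps here are made up for illustration.
from datetime import datetime

commit_times = {  # version -> commit timestamp
    0: datetime(2020, 6, 1, 12, 0, 0),
    1: datetime(2020, 6, 1, 12, 5, 0),
}

def version_as_of_timestamp(commit_times, ts):
    """Pick the latest version committed at or before ts."""
    eligible = [v for v, t in commit_times.items() if t <= ts]
    if not eligible:
        raise ValueError("timestamp precedes the earliest version")
    return max(eligible)

# One second after version 0's commit, still before version 1's commit:
print(version_as_of_timestamp(commit_times, datetime(2020, 6, 1, 12, 0, 1)))  # 0
# At exactly version 1's commit time:
print(version_as_of_timestamp(commit_times, datetime(2020, 6, 1, 12, 5, 0)))  # 1
```

So a timestamp that falls anywhere between two commits resolves to the earlier one, which matches the behavior shown in the demo.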

And finally, we have the Vacuum command.

So while it's nice to be able to time travel to any previous Version, sometimes you want to actually delete data from storage completely, either to reduce storage costs or for compliance reasons such as GDPR. The Vacuum operation deletes data files that have been removed from the Table for a certain amount of time, and by default Vacuum retains all the data that is needed for the last seven days. What I mean by that is, even if

you have files that are no longer used by the Table, meaning they have been marked as removed in the transaction log, as long as they were removed within the past seven days, by default Vacuum will not delete them; it only deletes files that are older than seven days. For the purposes of this example, since we just created our Table and haven't waited seven days, we can change that default setting. So we'll set the retention duration check (spark.databricks.delta.retentionDurationCheck.enabled) to false, and then we can pass in a retention period of zero hours, which will delete all files that are no longer referenced in the transaction log. So we go ahead and run this.

And we can go ahead and rerun our previous time travel query and try to query that old Version of the Table.

And that's going to fail with this error here, because those files no longer exist. So Vacuum is actually really powerful for GDPR use cases when used along with the Delete and Update commands, because you can actually delete the physical data.
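The retention rule behind Vacuum can be sketched as a simple filter: only files whose logical removal is older than the retention period are eligible for physical deletion. A toy Python model with invented file names (the real Vacuum also scans the storage directory and handles many more cases):

```python
# Toy model of Vacuum's retention check: default retention is 7 days
# (168 hours); a retention of 0 deletes everything no longer referenced.
from datetime import datetime, timedelta

def vacuum_candidates(removed_files, now, retention_hours=168):
    """removed_files maps file name -> time it was marked removed."""
    cutoff = now - timedelta(hours=retention_hours)
    return {f for f, removed_at in removed_files.items() if removed_at < cutoff}

now = datetime(2020, 6, 10)
removed = {
    "part-0000.parquet": now - timedelta(days=8),  # older than retention
    "part-0001.parquet": now - timedelta(days=1),  # still retained
}
print(sorted(vacuum_candidates(removed, now)))
# ['part-0000.parquet']
print(sorted(vacuum_candidates(removed, now, retention_hours=0)))
# ['part-0000.parquet', 'part-0001.parquet']
```

The seven-day default exists so that long-running queries and time travel to recent versions still find the files they reference; shortening it, as in the demo, trades that safety for immediate physical deletion.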

So that's all I wanted to get through today. I hope this was helpful for you all. Thank you so much for listening, and I'm excited to hear everyone's questions.

About Palla Lentz


Palla is a member of the Resident Solutions Architect team at Databricks. She works directly with customers to build robust, end-to-end Spark solutions and help guide users towards Spark best practices. She received a BS in Computer Science from San Diego State University and has a background in Application Engineering, Data Warehousing and Data Engineering. Palla has lived in a few different states across the US and studied abroad in New Zealand.

About Jake Therianos


Jake is a Customer Success Engineer at Databricks and helps both end users and platform owners/admins tackle their technical challenges related to Spark and the Databricks ecosystem. He has received certifications in both Spark and AWS. Jake has a background in Technology Consulting with a specialization in Advanced Analytics. He graduated from Virginia Tech and currently resides in Washington DC.