Bring reliability, performance and security to your data lake
Delta Lake: The Foundation of Your Lakehouse
As an open format storage layer, Delta Lake delivers reliability, security and performance to data lakes. Customers have seen 48x faster data processing, leading to 50% faster time to insight, after implementing Delta Lake.
Watch a live demo and learn how Delta Lake:
Solves the challenges of traditional data lakes — giving you better data reliability, support for advanced analytics and lower total cost of ownership
Provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture
Offers auditing and governance features to streamline GDPR compliance
Has dramatically simplified data engineering for our customers
Product Management, Databricks
Product Marketing, Databricks
Product Marketing, Databricks
Software Architect, Comcast
Sam Steiny: Hi, and welcome to the Databricks event, Delta Lake, the foundation of your lakehouse. My name is Sam Steiny and I work in product marketing at Databricks, focusing specifically on data engineering and on Delta Lake. I’m excited to be here today. I get to be the MC for today’s event, and I will be guiding you through today’s sessions. More and more, we’ve seen the term lakehouse referenced in the news, at events, in tech blogs, and in thought leadership. And beyond our work at Databricks, organizations across industries have increasingly turned to this idea of a lakehouse as the future for unified analytics, data science and machine learning.
Sam Steiny: In today’s event, we’ll see an overview of Delta Lake, which is the secure data storage and management layer for your data lake that really forms the foundation of a lakehouse. We’ll see a demo of Delta Lake in action, and we’ll actually hear how Comcast has leveraged Delta Lake to bring reliability, performance, and security to their data. We’ll finish today’s event with a live Q and A. So, come prepared with your questions and we’ll do our best to answer as many as possible. So, before we start just some quick housekeeping, today’s session is being recorded. So, it’ll be available on demand to anyone who has registered.
Sam Steiny: And then also, if you have any questions throughout the event, please feel free to add them to the Q and A box. We’ll do our best to actually answer them in real time there. But we’ll also answer the leftover questions as well as any additional ones in the live Q and A at the end of the session. So, now before we get to our speakers, I wanted to share a quick overview of Delta Lake in a video we recently launched. This’ll give you a high level understanding of what Delta Lake is, before Himanshu who is the Delta Lake product manager, will go into more detail about Delta Lake and how it forms the foundation of a lakehouse.
Speaker 3: Businesses today have the ability to collect more data than ever before. And that data contains valuable insights into your business and your customers, if you can unlock it. As most organizations have discovered, it’s no simple task to turn data into insights. Today’s data comes in a variety of formats: video, audio, and text. Data lakes have become the de facto solution because they can store these different formats at a low cost and don’t lock businesses into a particular vendor like a data warehouse does. But traditional data lakes have challenges. As data lakes accumulate data in different formats, maintaining reliable data is challenging and can often lead to inaccurate query results.
Speaker 3: The growing data volume also impacts performance, slowing down analysis and decision-making, and with few auditing and governance features, data lakes are very hard to properly secure and govern. With all of these challenges, as much as 73% of company data goes unused for analytics and decision-making, and the value in it is never realized. Delta Lake solves these challenges. Delta Lake is a data storage and management layer for your data lake that enables you to scale insights throughout your organization with a reliable single source of truth for all data workloads, both batch and streaming, and to increase productivity by optimizing for speed at scale with performance features like advanced indexing and schema enforcement.
Speaker 3: It lets you operate with flexibility in an open source environment, with data stored in Apache Parquet format, and reduce risk by quickly and accurately updating data in your data lake for compliance while maintaining better data governance through audit logging. By unlocking your data with Delta Lake, you can dramatically simplify data engineering by performing ETL processes directly on the data lake, make new real-time data instantly available for data analysis, data science and machine learning, and gain confidence in your ability to reliably meet compliance standards like GDPR and CCPA.
Speaker 3: Delta Lake on Databricks brings reliability, performance and security to your data all in an open format, making it the perfect foundation for a cost-effective highly scalable lakehouse architecture. Delta Lake, the open, reliable, performant and secure foundation of your lakehouse.
Sam Steiny: Great. So, with that high level view, now you have an understanding of Delta Lake and I’m going to now pass it over to Himanshu Raja, who’s the product manager for Delta Lake at Databricks. He’s going to do a deeper dive into Delta Lake and explain how it really enables a lakehouse for our customers. Over to you, Himanshu.
Himanshu Raja: Thank you, Sam. I’m super excited to be here and talk to you about Delta Lake and why it is the right foundation for a lakehouse. In today’s session, I will cover the challenges of building a data analytics stack, why the lakehouse is the only future-proof solution, what Delta Lake is, and why it is the best foundation for your lakehouse. Brenner will then jump into the most exciting part of the session and do a demo. After the session, you will have enough context and links to the supporting material to get started and build your first Delta Lake.
Himanshu Raja: Every company is feeling the pull to become a data company, because when large amounts of data are applied to even simple models, the improvements on use cases are exponential. And here at Databricks, our entire focus is on helping customers apply data to their toughest problems. I’ll give examples of two such customers, Comcast and Nationwide. Comcast is a great example of a media company that has successfully adopted data and machine learning to create new experiences for their viewers that help improve satisfaction and retention.
Himanshu Raja: They have built a voice-activated remote control that allows you to speak into the remote, ask it a question, and it will provide some really relevant results, leveraging things like natural language processing and deep learning. And they’ve built all of this on top of the Databricks platform. Nationwide is one of the largest insurance providers in the U.S. Nationwide saw that the explosive growth in data availability and increasing market competition was challenging them to provide better pricing to their customers. With hundreds of millions of insurance records to analyze for downstream ML, Nationwide realized that their legacy batch analysis process was slow and inaccurate, providing limited insights to predict the frequency and severity of claims.
Himanshu Raja: With Databricks, they have been able to employ deep learning models at scale to provide more accurate pricing predictions, resulting in more revenue from claims. Because of this potential, it’s not surprising that 83% of CEOs say AI is a strategic priority, according to a report published by MIT Sloan Management Review, or that Gartner predicts AI will generate almost trillion dollars in business value in only a couple of years. But it is very hard to get right. Gartner says 85% of big data projects will fail. VentureBeat published a report that said 87% of data science projects never make it into production. So, while some companies are having success, most still struggle.
Himanshu Raja: So, the story starts with data warehouses, which, it is hard to believe, will soon celebrate their 40th birthday. Data warehouses came around in the 80s and were purpose-built for BI and reporting. Over time they have become essential, and today every enterprise on the planet has many of them. However, they weren’t built for modern data use cases. They have no support for data like video or audio or text, datasets that are crucial for modern use cases. The data has to be very structured and queryable only with SQL. As a result, there is no viable support for data science or machine learning. In addition, there is no support for real-time streaming. They are great for batch processing, but either do not support streaming or can be cost prohibitive.
Himanshu Raja: And because they are closed and proprietary systems, they force you to lock your data in, so you cannot easily move data around. So, today the result of all of that is that most organizations will first store all of their data in data lakes and blob stores, and then move subsets of it into the data warehouse. So, then the thinking was that potentially data lakes could be the answer to all our problems. Data lakes came around about 10 years ago and they were great because they could indeed handle all your data. And they were therefore good for data science and machine learning use cases. And data lakes serve as a great starting point for a lot of enterprises.
Himanshu Raja: However, they aren’t able to support the data warehousing or BI use cases. Data lakes are actually more complex to set up than a data warehouse. A warehouse has a lot of familiar semantics, like ACID transactions. With data lakes, you are just dealing with files. So, those abstractions are not provided and you really have to build them yourself. And they’re very complex to set up. And even after you do all of that, the performance is not great. You’re just dealing with files, in the end. In most cases, customers end up with a lot of small files, and even the simplest queries will require you to list all those files. That takes time.
Himanshu Raja: And then lastly, when it comes to reliability, they are not that great either. We actually have a lot more data in the data lakes than in the warehouse, but is the data reliable? Can I actually guarantee that the schema is going to stay the same? How easy is it for an analyst to merge a bunch of different schemas together? As a result of all of these problems, data lakes have sort of turned into these unreliable data swamps where you have all the data, but it’s very difficult to make any sense of it. So, understandably, in the absence of a better alternative, what we are seeing with most organizations is a strategy of coexistence.
Himanshu Raja: So, this is what a data swamp looks like. There are tons of different tools to power each architecture required by a business unit or the organization. It’s a whole slew of different open source tools that you have to connect. In the data warehousing stack, on the left side, you are often dealing with proprietary data formats. And if you want to enable advanced use cases, you have to move the data across to other stacks. It ends up being expensive and resource intensive to manage. And what does it result in? Because the systems are siloed, the teams become siloed too. Communication slows down, hindering innovation and speed.
Himanshu Raja: Different teams often end up with different versions of the truth. The result is multiple copies of data, no consistent security and governance model, closed systems and disconnected, less productive data teams. So, how do we get the best of both worlds? We want some things from the data warehouse, and we want some things from the data lakes. We want the performance and reliability of the data warehouses, and we want the flexibility and the scalability of the data lakes. This is what we call the lakehouse paradigm. And the idea here is that the data is in the data lake, but now we are going to add some components so that we can do all the BI and reporting from the warehouse and all the data science and machine learning from data lakes, and also support streaming analytics. So, let’s build a lakehouse. What are the things we need to build a lakehouse?
Himanshu Raja: We said that we want all our data to be in a really scalable storage layer. And we want a unified platform where we can achieve multiple use cases. So, we need some kind of transactional layer on top of that data storage layer. What you really need is something like ACID compliance, so that when you write data, it either fully succeeds or fully fails and things are consistent. That structured transactional layer is Delta Lake. And then the other requirement we talked about was performance. To support the different types of use cases, we need it to be really fast. We have a lot of data that we want to work with. So, there is the Delta Engine, which is a high-performance query engine that Databricks has created in order to support different types of use cases, whether it is SQL, data science, ETL, BI reporting or streaming, all of that on top of the engine to make it really, really fast.
Himanshu Raja: So, let’s do a deep dive on what Delta Lake is. Delta Lake is an open, reliable, performant and secure data storage and management layer for your data lake that enables you to create a true single source of truth. Since it’s built upon Apache Spark, you are able to build high-performance data pipelines to clean your data from raw ingestion to business-level aggregates. And given the open format, it allows you to avoid unnecessary replication and proprietary lock-in. Ultimately, Delta Lake provides the reliability, performance, and security you need to solve your downstream data use cases. Next, I’m going to talk about each of those benefits of Delta Lake. The first and foremost benefit that you get with Delta Lake is high-quality, reliable data in your analytics stack.
Himanshu Raja: Let me just talk about three key things here. The first is ACID transactions. The second is schema enforcement and schema evolution. And the third is unified batch and streaming. So, on ACID transactions, Delta employs an all-or-nothing ACID transaction approach to guarantee that any operation you do on your data lake either fully succeeds or gets aborted so that it can be rerun. On schema enforcement, Delta Lake uses schema validation on write, which means that all new writes to a table are checked for compatibility with the target table’s schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether, no data is written, and it raises an exception to let the user know about the mismatch.
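To make the all-or-nothing behaviour of schema validation on write concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Delta Lake’s implementation: the `ToyTable` class, the `SchemaMismatchError` exception and the column names are all hypothetical and exist only for illustration.

```python
class SchemaMismatchError(Exception):
    """Raised when incoming rows do not match the table schema."""


class ToyTable:
    """In-memory stand-in for a table that validates the schema on write."""

    def __init__(self, schema):
        # schema maps column name -> Python type, e.g. {"loan_id": int}
        self.schema = dict(schema)
        self.rows = []

    def append(self, new_rows):
        new_rows = list(new_rows)
        # Validate every row before touching the table (all or nothing).
        for row in new_rows:
            if set(row) != set(self.schema):
                raise SchemaMismatchError(
                    f"columns {sorted(row)} do not match {sorted(self.schema)}"
                )
            for column, expected_type in self.schema.items():
                if not isinstance(row[column], expected_type):
                    raise SchemaMismatchError(
                        f"column {column!r} expects {expected_type.__name__}"
                    )
        # Only reached when the whole batch is valid.
        self.rows.extend(new_rows)


table = ToyTable({"loan_id": int, "amount": float})
table.append([{"loan_id": 1, "amount": 1000.0}])

try:
    # The second row carries an unexpected column, so the whole batch is rejected.
    table.append([
        {"loan_id": 2, "amount": 500.0},
        {"loan_id": 3, "amount": 250.0, "credit_score": 700},
    ])
except SchemaMismatchError as error:
    print("write rejected:", error)

print(len(table.rows))
```

Because validation happens before any row is stored, the valid first row of the rejected batch is not written either, which mirrors the "fully succeeds or gets aborted" guarantee described above.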
Himanshu Raja: We have very recently introduced capabilities to also do schema evolution, where we can evolve the schema on the fly as the data is coming in, especially in cases where the data is semi-structured or unstructured and you may not know what the data types are or even, in a lot of cases, what columns are coming in. The third thing I would like to talk about is unified batch and streaming. Delta is able to handle both batch and streaming data, including the ability to concurrently write batch and streaming data to the same table. Delta Lake directly integrates with Spark Structured Streaming for low-latency updates.
Himanshu Raja: Not only does this result in a simpler system architecture by not requiring you to build a Lambda architecture anymore, it also results in a shorter time from data ingest to query results. The second key advantage of Delta Lake is performance, lightning-fast performance. There are two aspects to performance in a data analytics stack. One is how the data is stored, and the other is performance during the query, at run time. So, let’s talk about how the data is stored and how Delta optimizes the data storage format itself. Delta comes with out-of-the-box capabilities to store the data optimally for querying. Z-ordering, where the data is automatically structured along multiple dimensions for fast query performance, is one. Delta also has data skipping, where Delta maintains file statistics so that only the data subsets relevant to the query are read instead of entire tables.
Himanshu Raja: We don’t have to go and read all the files. Files can be skipped based on the statistics. And then auto-optimize is a set of features that automatically compacts small files into fewer, larger files so that query performance is great out of the box. It pays a small cost during writes in exchange for a really great benefit when the tables are queried. So, that’s the part about how the data is stored. Now, let’s talk about the Delta Engine, which comes into play when you actually query that data. Delta Engine has three key components to provide super fast performance: Photon, the query optimizer and caching. Photon is a native vectorized engine, fully compatible with Apache Spark, built to accelerate all structured and semi-structured workloads by more than 20X compared to Spark 2.4.
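The data-skipping idea mentioned above (keep per-file statistics and read only the files that can contain matching rows) can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not Delta Lake’s implementation: the `FileStats` type, the file names and the `amount` column are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Minimum and maximum of one column, recorded per data file."""
    path: str
    min_amount: float
    max_amount: float


def files_to_scan(files, lower, upper):
    """Return only the files whose [min, max] range overlaps [lower, upper]."""
    return [
        f.path
        for f in files
        if f.max_amount >= lower and f.min_amount <= upper
    ]


# Hypothetical statistics for three data files.
files = [
    FileStats("part-000.parquet", 0.0, 999.0),
    FileStats("part-001.parquet", 1000.0, 4999.0),
    FileStats("part-002.parquet", 5000.0, 20000.0),
]

# A query for amounts between 1200 and 3000 only needs one of the three files.
selected = files_to_scan(files, 1200.0, 3000.0)
print(selected)
```

The benefit grows when similar values are stored close together, since the per-file ranges then overlap less and more files can be skipped.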
Himanshu Raja: The second key component of Delta Engine is the query optimizer. The query optimizer extends Spark’s cost-based optimizer and adaptive query execution with advanced statistics to provide up to 18X faster query performance for data warehousing workloads than Spark 3.0. And then the third key component of Delta Engine is caching. Delta Engine automatically caches IO data and transcodes it into a more CPU-efficient format to take advantage of NVMe SSDs, providing up to 5X faster performance for table scans than Spark 3.0. It also includes a second cache for query results to instantly provide results for any subsequent runs. This improves performance for repeated queries like dashboards, where the underlying tables are not changing frequently.
Himanshu Raja: So, let me talk about the third main benefit of Delta Lake, which is to provide security and compliance at scale. Delta Lake reduces risk by enabling you to quickly and accurately update data in your data lake to comply with regulations like GDPR, and to maintain better data governance through audit logging. Let me talk about two specific features: time travel, and table- and role-based access controls. With time travel, Delta automatically versions the big data that you store in your data lake and enables you to access any historical version of that data. This temporal data management simplifies your data pipeline by making it easy to audit, roll back data in case of accidental bad writes or deletes, and reproduce experiments and reports.
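One way to picture time travel is as an append-only log of commits that can be replayed up to any version. The sketch below is a toy model in plain Python; the `ToyLog` class and the row identifiers are hypothetical, and the real transaction log tracks data files rather than individual rows, so treat this only as an illustration of the idea.

```python
class ToyLog:
    """Append-only list of commits; each commit is one table version."""

    def __init__(self):
        self.commits = []

    def commit(self, operation, added=(), removed=()):
        """Record one transaction and return its version number."""
        self.commits.append(
            {"operation": operation, "added": list(added), "removed": list(removed)}
        )
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay commits 0..version and return the rows visible at that version."""
        if version is None:
            version = len(self.commits) - 1
        rows = []
        for entry in self.commits[: version + 1]:
            rows = [r for r in rows if r not in entry["removed"]]
            rows.extend(entry["added"])
        return rows


log = ToyLog()
v0 = log.commit("WRITE", added=["loan-1", "loan-2"])
v1 = log.commit("APPEND", added=["loan-3"])
v2 = log.commit("DELETE", removed=["loan-1"])

print(log.snapshot())           # latest version
print(log.snapshot(version=0))  # the table as it was after the first write
```

Since old commits are never rewritten, auditing and rollback both reduce to reading or replaying a prefix of the log.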
Himanshu Raja: Your organization can finally standardize on a clean, centralized, versioned big data repository in your own cloud storage for your analytics. The second feature I would love to talk about is table- and role-based access controls. With Delta Lake, you can programmatically grant and revoke access to your data based on a specific workspace or role to ensure that your users can only access the data that you want them to. And through Databricks’ extensive ecosystem of partners, customers can enable a variety of security and governance functionality based on their individual needs.
Himanshu Raja: Last, but one of the most important benefits of Delta Lake, is that it’s open and agile. Delta Lake is an open format that works with other open source technologies, avoiding vendor lock-in and opening up an entire community and ecosystem of tools. All the data in Delta Lake is stored in the open Apache Parquet format, allowing data to be read by any compatible reader. Developers can use Delta Lake with their existing data pipelines with minimal changes, as it is fully compatible with Spark, the most commonly used big data processing engine. Delta Lake also supports SQL DML out of the box to enable customers to migrate SQL workloads to Delta simply and easily.
Himanshu Raja: So, let’s talk about how we have seen customers leverage Delta Lake for a number of use cases. Primary among them are improving data pipelines and doing ETL at scale; unifying batch and streaming, with direct integration with Apache Spark Structured Streaming to run both batch and streaming workloads instead of building a Lambda architecture; and doing BI on your data lake with our Delta Engine’s super fast query performance. You don’t need to choose between a data lake and a data warehouse. As we talked about with the lakehouse, you can do BI directly on your data lake. And then there is meeting regulatory needs with standards like GDPR by keeping a record of historical data changes. And who are these users?
Himanshu Raja: Delta Lake is being used by some of the largest Fortune 100 companies in the world. We have customers like Comcast, Viacom, Condé Nast, McAfee and Edmunds. In fact, here at Databricks, all the data analytics is done using Delta Lake. So, I would love to do a deep dive and talk about the Starbucks use case to give you an idea of how our customers have used Delta Lake in their ecosystem. Starbucks today does demand forecasting and personalizes the experiences of their customers on their app. Their architecture was struggling to handle petabytes of data ingested for downstream ML and analytics, and they needed a scalable platform to support multiple use cases across the organization.
Himanshu Raja: And with Azure Databricks and Delta Lake, their data engineers are able to build pipelines that support batch and real-time workloads on the same platform. They have enabled their data science teams to blend various data sets to create new models that improve the customer experience. And most importantly, data processing performance has improved dramatically, allowing them to deploy environments and deliver insights in minutes. So, let me wrap up by summarizing what Delta Lake can do for you and why it is the right foundation for your lakehouse. As we discussed, with Delta Lake you can improve analytics, data science and machine learning throughout your organization by enabling teams to collaborate and ensuring that they are working on reliable data, which improves the speed with which they make decisions.
Himanshu Raja: You can simplify data engineering, reduce infrastructure and maintenance costs with the best price performance, and you can enable a multi-cloud, secure infrastructure platform with Delta Lake. So, how do you get started on Delta Lake? It’s actually really easy. If you have a Databricks deployment already on Azure or AWS, and now GCP, and you deploy a cluster with DBR, which is the Databricks Runtime, release version 8.0 or higher, you actually do not need to do anything. Delta is now the default format for all created tables and the DataFrame APIs. But we also have plenty of resources for you to try out the product and learn.
Himanshu Raja: It’s actually a lot of fun to deploy your first Delta Lake and build a really cool dashboard using notebooks. If you have not tried Databricks before, you can sign up for a free trial account and then follow our getting started guide. And Brenner will do a demo very shortly to showcase the capabilities that we talked about. So, with that, over to you, Sam.
Sam Steiny: Awesome. Thank you, Himanshu. That was great. Now, I’ll pass the stage over to Brenner Heintz, and Brenner is going to take us through a demo that really brings Delta Lake to life. Now that you’ve heard what it is and how powerful it can be, let’s see it in action. So, over to you, Brenner.
Brenner Heintz: My name is Brenner Heintz. I am a technical PMM at Databricks, and today I’m going to show you how Delta Lake provides the perfect foundation for your lakehouse architecture. We’re going to do a demo, and I’m going to show you how it works from a practitioner’s perspective. Before we do so, I want to highlight the Delta Lake cheat sheet. I’ve worked on this with several of my colleagues, and the idea here is to provide a resource for practitioners like yourself to quickly and easily get up to speed with Delta Lake and be productive with it very, very quickly. We’ve provided most, if not all, of the commands in this notebook as part of the cheat sheet. So, I highly encourage you to download this notebook. You can click directly on this image and it’ll take you directly to the cheat sheet, which provides a one-pager for Delta Lake with Python and a one-pager for Delta Lake with Spark SQL.
Brenner Heintz: So, first, in order to use Delta Lake, you need to be able to convert your data to Delta Lake format. And the way that we’re able to do that is, instead of saying parquet as part of your create table or your Spark DataFrame writer command, all you have to do is replace that with the word delta to be able to start using Delta Lake right away. So, here’s a look at what that looks like. With Python, we can use Spark to read in our data in Parquet format. You could also read in your data in CSV or other formats, for example. Spark is very flexible in that way. And then we simply write it out in Delta format by indicating delta here.
Brenner Heintz: And we’re going to save our data in the loans Delta table. We can do the same thing with SQL. We can use a create table command using Delta to then save our table in Delta format. And finally, the convert to Delta command makes it really easy to convert our data to Delta Lake format in place. So, now that we have shown you how to convert your data to Delta Lake format, let’s take a look at a Delta Lake table and what that looks like. So, I’ve run the cell already. We have 14,705 batch records in our loans Delta table. Today, we’re working with some data from Lending Club, and you can see the columns that are currently part of our table here.
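The three conversion routes described above can be sketched as follows. The demo notebook itself is not reproduced in this transcript, so this is an assumption-laden illustration rather than the presenter’s code: it assumes an existing `SparkSession` with Delta Lake available, and the table name `loans_delta` and the `/tmp/...` paths are hypothetical. The SQL is built as strings so the statements can be inspected without a cluster and then passed to `spark.sql(...)`.

```python
def parquet_to_delta(spark, parquet_path, delta_path):
    """Read Parquet data with Spark and write it back out in Delta format."""
    df = spark.read.format("parquet").load(parquet_path)
    # The only change from a Parquet write is the format name.
    df.write.format("delta").mode("overwrite").save(delta_path)


def create_delta_table_sql(table_name, parquet_path):
    """SQL that creates a Delta table from existing Parquet files."""
    return (
        f"CREATE TABLE {table_name} USING DELTA "
        f"AS SELECT * FROM parquet.`{parquet_path}`"
    )


def convert_in_place_sql(parquet_path):
    """SQL that converts a Parquet directory to Delta format in place."""
    return f"CONVERT TO DELTA parquet.`{parquet_path}`"


# Hypothetical names and paths, for illustration only.
print(create_delta_table_sql("loans_delta", "/tmp/loans_parquet"))
print(convert_in_place_sql("/tmp/loans_parquet"))
```

On a cluster, a call such as `spark.sql(convert_in_place_sql("/tmp/loans_parquet"))` would run the in-place conversion; `parquet_to_delta` instead writes a new copy at the target path.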
Brenner Heintz: So, I went ahead and kicked off a couple of write streams to our table. And the idea here was to show you that Delta Lake tables are able to handle batch and streaming data, and they’re able to integrate those straight out of the box without any additional configuration or anything else that’s needed. You don’t need to build a Lambda architecture, for example, to integrate both batch and real-time data. Delta Lake tables can easily manage both at once. So, as you can see, we’re writing about 500 records per second into our existing Delta Lake table. And we’re doing so with two different writers, just to show you that you can concurrently both read from and write to Delta Lake tables consistently with ACID transactions, ensuring that you never deal with a pipeline breakage that corrupts the state of your table, for example.
Brenner Heintz: Everything in Delta Lake is a transaction. And so this allows us to create isolation between different readers and writers. And that’s really powerful. It saves us a lot of headache and a lot of time undoing mistakes that we may have made if we didn’t have ACID transactions. So, as I promised, those two streaming writes are running, and I’ve also created two streaming reads to show you what’s happening in the table in near real time. So, we had those initial 14,705 batch records here. But since then, about 124,000 streaming records have entered our table.
Brenner Heintz: This is essentially the same chart, but showing you what’s happening over each 10-second window. Each of these bars represents a 10-second window, over which, as you can see, since our streams began, we have about 5,000 records per stream written to our table at any time. So, all of this is just to say that Delta Lake is a very powerful tool that allows you to easily integrate batch and streaming data straight out of the box. It’s very easy to use, and you can get started right away. To put the cherry on top, we added a batch query just for good measure, and we plotted it using Databricks’ built-in visualization tools, which are very easy and allow you to visualize things very quickly.
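The per-window chart described above boils down to bucketing records into fixed, non-overlapping 10-second windows and counting each bucket. The sketch below shows only that bucketing logic in plain Python with made-up arrival times; it is not the demo’s code. In Spark Structured Streaming the comparable aggregation would normally group by `pyspark.sql.functions.window` over a timestamp column.

```python
from collections import Counter


def count_per_window(event_times, window_seconds=10):
    """Count events in fixed, non-overlapping windows keyed by window start."""
    counts = Counter()
    for t in event_times:
        # Floor each timestamp to the start of the window it falls into.
        window_start = int(t // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical arrival times, in seconds since the stream started.
arrivals = [0.5, 3.2, 9.9, 10.0, 14.1, 27.8]
windows = count_per_window(arrivals)
print(windows)
```

At the demo’s stated rate of about 500 records per second per stream, each 10-second bucket would hold about 5,000 records per stream, which matches the bar heights described.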
Brenner Heintz: So, now that we’ve shown you how easy it is to integrate batch and streaming data with Delta Lake, let’s talk about data quality. You need tools like schema enforcement and schema evolution in order to enforce quality in your tables. And the reason for that is that what you don’t want are upstream data sources adding additional columns, removing columns, or otherwise changing your schema without you knowing about it. Because what that can cause is a pipeline breakage that then affects all of your downstream data tables. So, to avoid that, we can use schema enforcement first and foremost. So, here I’ve created this new data DataFrame that contains a new column, the credit score column, which is not present in our current table.
Brenner Heintz: So, because Delta Lake offers schema enforcement, when we run this command, we get an exception because the schema mismatch has been detected by Delta Lake. And that’s a good thing. We don’t want our data to successfully write to our Delta Lake table because it doesn’t match what we expect. However, as long as we’re aware and we want to intentionally migrate our schema, we can do so by adding a single option to our write command: we include the mergeSchema option. And now that extra column is successfully written to our table, and we’re also able to evolve our schema. So, now, when we try to select the records that were in our new data, you can see that those records were in fact successfully written to the table, and that new credit score column is now present in the schema of our table as well.
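A hedged sketch of the append described above: the same write is attempted with and without the `mergeSchema` writer option. The helper names and the idea of passing a `path` are assumptions for illustration; the demo’s actual notebook cell is not shown in the transcript. `df` is assumed to be a Spark DataFrame on a cluster with Delta Lake available.

```python
def delta_write_options(merge_schema=False):
    """Build writer options; mergeSchema opts in to schema evolution."""
    options = {}
    if merge_schema:
        options["mergeSchema"] = "true"
    return options


def append_to_delta(df, path, merge_schema=False):
    """Append a Spark DataFrame to the Delta table stored at `path`.

    Without merge_schema, a DataFrame with an extra column is expected to be
    rejected by schema enforcement; with it, the new column is added.
    """
    (
        df.write.format("delta")
        .mode("append")
        .options(**delta_write_options(merge_schema))
        .save(path)
    )


print(delta_write_options())
print(delta_write_options(merge_schema=True))
```

Keeping schema evolution behind an explicit flag reflects the point made in the demo: a schema change should be an intentional decision, not a side effect of an upstream source changing.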
Brenner Heintz: So, these tools are very powerful, and they allow you to enforce your data quality the way that you need to in order to transition your data from raw, unstructured data to high-quality, structured data that’s ready for downstream apps and users over time. So, now that we’ve talked about schema enforcement and schema evolution, I want to move on to Delta Lake time travel. Time travel is a really powerful feature of Delta Lake. Because everything in Delta Lake is a transaction, and we’re tracking all of the transactions that are made to our Delta Lake tables over time in the transaction log, that allows us to go back in time and recreate the state of our Delta Lake table at any point in time.
Brenner Heintz: First, let’s look at what that looks like. So, at any point, we can access the transaction log by running this describe history command. And as you can see, each of these versions of our table represents some sort of transaction, some sort of change that was made to our table. So, our most recent change was that we appended those brand new records with a new column to our Delta Lake table. So, you can see that transaction here. Before that, we had some streaming updates. All of those writes that were occurring to our table were added as transactions. And basically this allows you to then go back and use the version number or timestamp to query historical versions of your Delta Lake tables at any point. That’s really powerful because you can even do creative things like compare your current version of a table to a previous version to see what has changed since then, and do other sorts of things along those lines.
Brenner Heintz: So, let’s go ahead and do that. We’ll use time travel to view the original version of our table, which was version zero. And this should include just those 14,705 records that we started with, because at that point, version zero of our table, we hadn’t streamed any new records into our table at all. And as you can see, in the original version, those 14,705 records are the only records that are present in version as of zero. And there is no credit score column either, because of course, back in version zero, we had not yet evolved our Delta Lake table’s schema.
Brenner Heintz: So, compare that 14,705 records to the current number of records in our table, which is over 326,000. Finally, another thing you can do with Delta Lake time-travel is restore a previous version of your tables at any given point in time. So, this is really powerful, if you accidentally delete a column you didn’t mean to, or delete some records you didn’t mean to, you can always go back and use the restore command to then have the current version of your table restored exactly the way that your data was at that given timestamp or version number. So, as you can see, when we run this command to restore our table to its original state version as of zero, we have been able to do so successfully. Now, when we query it, we only get those 14,705 records as part of the table.
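The time travel and restore behavior described in the demo can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not Delta Lake's implementation: the class `ToyVersionedTable` and its methods are hypothetical names, and it stores a full snapshot per version for simplicity, whereas a real transaction log records changes rather than whole copies of the table.

```python
from copy import deepcopy


class ToyVersionedTable:
    """Hypothetical toy model of a table backed by an append-only transaction log."""

    def __init__(self, rows):
        # Version 0 is the initial write. Each entry keeps a full snapshot (a simplification).
        self._log = [("WRITE", deepcopy(rows))]

    def commit(self, operation, rows):
        """Record a new transaction and return its version number."""
        self._log.append((operation, deepcopy(rows)))
        return len(self._log) - 1

    def history(self):
        """List (version, operation) pairs, newest first."""
        return [(v, op) for v, (op, _) in reversed(list(enumerate(self._log)))]

    def read(self, version=None):
        """Read the latest state, or the state as of an earlier version."""
        if version is None:
            version = len(self._log) - 1
        return deepcopy(self._log[version][1])

    def restore(self, version):
        """Restoring is itself a new transaction, so earlier history is preserved."""
        return self.commit("RESTORE", self.read(version))


# Two starting rows, then an append that also introduces a new column.
table = ToyVersionedTable([{"user_id": 1}, {"user_id": 2}])
table.commit("APPEND", table.read() + [{"user_id": 3, "credit_score": 700}])

original_count = len(table.read(version=0))  # state as of version zero
current_count = len(table.read())            # latest state
table.restore(0)
restored_count = len(table.read())           # latest state after the restore
```

Note that in this sketch the restore shows up in `history()` as its own entry, which mirrors what the demo shows later when the restore appears in the describe history output.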
Brenner Heintz: Next, one of the features that I think developers, data engineers and other data practitioners are really looking for when they’re building their lakehouse is the ability to run simple DML commands with just one or two lines of code, to be able to do operations like deletes, updates, merges, inserts, et cetera. On a traditional data lake, those simply aren’t possible. With Delta Lake, you can run those commands and they simply work, and they do so transactionally. And they’re very, very simple. So, managing change data becomes much, much easier when you have these simple commands at your disposal.
Brenner Heintz: So, let’s take a look. We’ll choose user ID 4420 as our test case here, and we’ll modify their data specifically to show you what Delta Lake can do. As you can see, they are currently present in our table, but if we run this delete command and we specify that specific user, when we run the command and then we select all from our table, we now have no results. The delete has occurred successfully. Next, when we look at the describe history command, the transaction log, you can see the delete that we just carried out is now present in our table history. And you can also see that the restore that we did to jump back to the original version of our table, version zero, is also present. We can also do things like insert records directly back into our table if we want to do so.
Brenner Heintz: Here, we’re going to use time travel to look at version as of zero, the original version of our table before this user was deleted, and then insert that user’s data back in. So, now when we run the select all command, the user is again present in our table. The insert into command works great. Next, there’s the update command. Updates are really useful if you have row level changes that you need to make. Here, we’re going to change this user’s funded amount to 22,000. Actually let’s make it 25,000, it looks like it was already 22,000 before.
Brenner Heintz: So, we’ll update that number and then when we query our table, now, in fact, the user’s funded amount has been updated successfully. Finally, in Delta Lake you have the ability to do really, really powerful merges. You can have a table full of change data that for example represents inserts and updates to your Delta Lake table. And with Delta Lake, you can do upsert. In just one single step you can… for each row in your data frame that you want to write to your Delta Lake table, if that row is already present in your table, you can simply update the values that are in that row. Whereas if that row is not present in your table, you can insert it.
Brenner Heintz: So, that’s what’s known as an upsert, and those are completely possible and they’re very, very easy in Delta Lake. They make managing your Delta Lake very, very simple. So, first we create a quick data frame with just two records in it. We want to add user 4420’s data back into our table. And then we also created a user whose user ID is one under 1 million. So, it’s 999,999. And this user is not currently present in our table. We want to insert them. So, this is what our little data frame looks like. And as you can see, we have these as an update or an insert. And when we run our merge into command, Delta Lake is able to identify the rows that already exist, like user 4420, and those that don’t already exist. And when they don’t exist, we simply insert them.
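The upsert semantics described here (update the row when its key already exists, insert it otherwise) can be sketched in plain Python. This is a minimal illustration of the logic only, not the merge into command itself; the function name `upsert`, the `funded_amt` column name, and the sample values are assumptions made for the example.

```python
def upsert(target, changes, key="user_id"):
    """Merge change rows into target rows: update on key match, insert otherwise."""
    # Copy each row so the original target list is left unmodified.
    merged = {row[key]: dict(row) for row in target}
    for change in changes:
        if change[key] in merged:
            merged[change[key]].update(change)  # matched: update the existing row
        else:
            merged[change[key]] = dict(change)  # not matched: insert a new row
    return list(merged.values())


# Hypothetical sample data: one existing user to update, one new user to insert.
table = [
    {"user_id": 4420, "funded_amt": 25000},
    {"user_id": 17, "funded_amt": 5000},
]
changes = [
    {"user_id": 4420, "funded_amt": 22000},    # already present, so updated
    {"user_id": 999999, "funded_amt": 1000},   # not present, so inserted
]
result = upsert(table, changes)
```

The key point of the sketch is that both branches happen in a single pass over the change data, which is what makes an upsert one step rather than a separate update followed by an insert.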
Brenner Heintz: So, as you can see, these updates and inserts occurred successfully, and Delta Lake has no problem with upserts. Finally, the last thing I want to point out is some specific performance enhancements that are offered as part of Delta Lake, as well as some that are part of Databricks Delta Lake only. We have a couple of commands that are Databricks Delta Lake only at the moment. First there’s the vacuum command. The vacuum command takes a look at the files that are currently a part of your table, and it removes any files that aren’t currently part of your table and that have been around for longer than a retention period that you specify. So, this allows you to clean up the old versions of your table that are older than a specific retention period, and save on cloud costs that way.
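The selection rule behind vacuum, as described above, can be sketched as a simple filter: a file is a candidate for removal only if it is not part of the current table and it is older than the retention period. The function name `files_to_vacuum`, the file names, and the dates are hypothetical; this is a sketch of the rule, not of how the actual command inspects storage.

```python
from datetime import datetime, timedelta


def files_to_vacuum(all_files, live_files, now, retention):
    """Return files not referenced by the current table version and older than retention."""
    cutoff = now - retention
    return sorted(
        name
        for name, modified in all_files.items()
        if name not in live_files and modified < cutoff
    )


now = datetime(2021, 1, 31)
all_files = {
    "part-0.parquet": datetime(2021, 1, 1),   # old and no longer referenced
    "part-1.parquet": datetime(2021, 1, 30),  # unreferenced, but within retention
    "part-2.parquet": datetime(2021, 1, 2),   # old, but still part of the table
}
removable = files_to_vacuum(
    all_files,
    live_files={"part-2.parquet"},
    now=now,
    retention=timedelta(days=7),
)
```

One consequence worth keeping in mind: once old files are removed, versions of the table that depended on them can no longer be reconstructed through time travel, so the retention period is a trade-off between storage cost and how far back you can go.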
Brenner Heintz: Another thing you can do on Databricks Delta Lake is you can cache the results of specific commands in memory. So, if you have a specific table that your downstream analysts tend to always group by a specific dimension, you can cache that SQL command, and its results will always come back much quicker, because that way it’s able to avoid doing a full read of your data, for example. You also have the ability to use the Z-order optimize command, which is really powerful. Z-order optimize essentially looks at the layout of your data tables and it figures out the perfect way to locate your data in different files. It essentially lays out your files in an optimized fashion, and that allows you to save on cloud storage costs because the way that it lays them out is typically much more compact than it would be when you start. And it also then optimizes those tables for read and write throughput.
Brenner Heintz: So, it’s very powerful. It speeds up the results of your queries and saves you on storage and compute costs ultimately. So, that’s the demo. I hope you’ve enjoyed this demo. Again, take a look at the Delta Lake cheat sheet that we will post as part of the description or in the chat that is part of the presentation below. So, thanks so much. I hope you’ve enjoyed this demonstration. Check out Delta Lake and join us on GitHub, Slack, or as part of our mailing list. Thanks so much.
Sam Steiny: Awesome. Thanks, Brenner. That was really, really great. I’m so excited now to be joined by Barbara Eckman. Barbara is a senior principal software architect at Comcast, and she’s going to be sharing her experience with Delta Lake and how working with Databricks has really made an impact on her day-to-day and on the Comcast business. So, thanks so much for being here, Barbara. We’re super excited to have you.
Barbara Eckman: Hi, everybody. Really glad to be here. Hope you’re all doing well. I’m here to talk about hybrid cloud access control in a self-service compute environment here at Comcast. I want to just real briefly mention that Comcast takes very seriously its commitment to our customers to protect their data. I’m part of what we call the Comcast data experience big data group. And big data in this case means not only public cloud, but also on-prem data. So, we have a heterogeneous data set, which offers some challenges, and challenges are fun, right? Our vision is that data is treated as an enterprise asset. This is not a new idea, but it’s an important one.
Barbara Eckman: And our mission is to power the Comcast enterprise through self-service platforms, data discovery, lineage, stewardship, governance, engineering services, all those important things that enable people to really use the data in important ways. And we know, as many do, that the most powerful business insights come from models that integrate data that span silos. Insights for improving the customer experience as well as business value. So, here are some examples of what this means for the business. Basically, this is based on the tons of telemetry data that we capture from sensors in Comcast’s network. We capture things like latency, traffic, signal to noise ratio, downstream and upstream, error rates and other things that I don’t even know what they mean.
Barbara Eckman: But this enables us to do things that improve customer experience, like plan the network topology. If there’s a region that has a ton of traffic, we might change the policy to support that. Minimizing truck rolls: truck rolls are what we call it when the Comcast cable guy, or cable woman, comes to your house. And in these COVID times, we really would like to minimize that even more. And if we can analyze the data ahead of time, we can perhaps make any adjustments or suggest adjustments that the user can make to minimize the need for people to come to their house.
Barbara Eckman: We can monitor, predict problems and remedy them often before the user even knows because of this data and this involves both the telemetry data and integrating it with other kinds of data across the enterprise. And then optimizing network performance for region or for the whole household. So, now this is really important stuff and it really helps the customers. And we’re working to make this even more prevalent. So, what makes your life hard? This is a professional statement. If you want to talk about personally, what makes your life hard? We can do that later, but what makes your life harder as a data professional?
Barbara Eckman: People usually say, “I need to find the data. So if I’m going to be integrating data across silos, I need to find it. I know where it is in my own silo, but maybe not elsewhere.” And the way we do that is metadata search and discovery, which we do through Elasticsearch. Then once I find the data that might be of interest to me, I need to understand what it means. So, what someone calls an account ID might not be the same account ID that you are used to calling an account ID. Billing IDs or back office account IDs, you need to know what they mean in order to be able to join them in a way that makes sense, as opposed to Frankenstein data, monster data that isn’t really appropriately joined. We need to know who produced it. Did it come from a set-top box? Did it come from a third party? Who touched it while it was journeying through Comcast, through Tenet, through Kafka or Kinesis? Maybe someone aggregated it and then maybe somebody else enriched it with other data.
Barbara Eckman: And then it landed in a data lake. The user of the data in the data lake wants to know where the data came from, and who added what piece. And you could see this as both the publisher looks at the data in the data lake and says, “This looks screwy, what’s wrong with this? Who messed up my data?” He could also say, or they could say, “Wow, this is enriched really great. I want to thank that person.” And also someone who’s just using the data wants to know who to ask questions. What did you enrich this with? Where did that data come from, that kind of thing? So, and all that really is helpful when you’re doing this integration. That’s data governance and lineage, which we do in Apache Atlas.
Barbara Eckman: That’s our metadata and lineage repository. Then once you’ve found data and understood it, you have to be able to access it. And we do that through Apache Ranger and its extension that’s provided by Privacera. Once you have access to it, you need to be able to integrate it and analyze it across the enterprise. So, finally, now we get to the good stuff, to be able to actually get our hands on the data. And we do that with self-service compute using Databricks. And Databricks is a really powerful tool for that. And finally we find that we do really need ACID compliance for important operations. And we do that with Delta Lake. So, I can talk about this in more detail as this talk goes on or in the question session.
Barbara Eckman: I’m an architect. So, I have to have box and line diagrams. So, this is a high-level view of our hybrid cloud solution. So, in Comcast’s on-prem data centers, we have a Hadoop data lake that involves Hadoop Ranger and Apache Atlas working together. We are, as many companies are, kind of phasing that out, but not immediately, it takes a while. We have a Teradata enterprise data warehouse. Similarly, we are thinking to move that, not necessarily to the cloud entirely, but maybe to another on-prem source, like the object store. We use MinIO, and basically that makes this object store look like S3. So, the Spark jobs that we like to use on S3 can also run on our on-prem data store.
Barbara Eckman: And that’s a big plus of course. And for that, we have a Ranger data service that helps with access control there. Up in the cloud, we use AWS, though Azure also has a big footprint in Comcast. And Databricks compute is kind of the center here. We use it to access Kinesis. Redshift, we’re only just starting with. We use Delta Lake and the S3 object store, and we have a Ranger plugin that the Databricks folks worked carefully with Privacera to create, so that our self-service Databricks environment can have all the init scripts and the configurations that it needs to run the access control that Privacera provides.
Barbara Eckman: We also use Presto for our federated query capability, and it also has a Ranger plugin. And all the tags that are applied to metadata, on which policies are built, are housed in Apache Atlas, and Ranger and Atlas sync together. And that’s how Ranger knows what policies to apply to what data. And in the question session, if you want to dig deeper into any of this, I’d be very happy to do it. So, this is very exciting to me, we’re just rolling this out and it’s so elegant, and I didn’t create it so I can say that. So, Ranger and Atlas together provide declarative, policy-based access control. And as I said, Privacera extends Ranger, which originally only worked in Hadoop, to AWS through plugins and proxies. And one of the key ones that we use, of course, is Databricks on all three of these environments. And basically what I like about this is we really have one Ranger to rule them all, and Atlas is his little buddy, because he, or she, provides the tags that really power our access control.
Barbara Eckman: So, here’s again a diagram. And we have a portal that we built for our self-service applications, and the user tags the metadata with tags, like this is PII, this is video domain, that kind of stuff. That goes into Atlas. The tags and the metadata associations are synced with Ranger, along with the policies based on them. So, who gets to see PII? Who gets to see video domain data? Those are synced and cached in the Ranger plugins. And then when a user calls an application, whether it’s a cloud application in Databricks, or even an on-prem application, the application asks Ranger, “Does this user have the access to do what they’re asking to do on this data?” And it’s very fast, because these are plugins. If the answer is yes, they get access.
Barbara Eckman: If no, then they get an error message. And we can also do masking and still show the data: if someone has access to many columns, but not all columns, of, say, a Glue table, we can mask out the ones that they don’t have access to and still give them what data they are allowed to see. Recently we’ve really needed ACID compliance. So, traditionally big data lakes are write once, read many. We have things streaming in from set-top boxes in the cable world, and that’s not transactional data. That’s what we’re used to, but now increasingly we are finding that we need to delete specific records from our Parquet files or whatever. We can do this in Spark, but it is not terribly performant. It certainly can be done, but it turns out Delta Lake does it much better.
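The tag-based access check and column masking described above can be sketched in plain Python. This is a toy illustration under stated assumptions and does not use or reflect the Apache Ranger, Apache Atlas, or Privacera APIs: the function names, the `policies` and `column_tags` structures, the user names, and the mask string are all hypothetical.

```python
MASK = "****"


def allowed_tags_for(user, policies):
    """Collect the tags this user may read, according to the policy table."""
    return {tag for tag, users in policies.items() if user in users}


def read_row(user, row, column_tags, policies):
    """Return the row, masking columns whose tag the user is not allowed to see."""
    allowed = allowed_tags_for(user, policies)
    visible = {}
    for column, value in row.items():
        tag = column_tags.get(column)  # untagged columns are treated as open here
        visible[column] = value if tag is None or tag in allowed else MASK
    # If nothing at all is visible, deny the request instead of returning only masks.
    if all(value == MASK for value in visible.values()):
        raise PermissionError(f"{user} has no access to this data")
    return visible


# Tags attached to column metadata (the role the tag store plays in the talk).
column_tags = {"account_id": "PII", "channel": "video_domain"}
# Policies built on those tags: which users may see data carrying each tag.
policies = {"PII": {"alice"}, "video_domain": {"alice", "bob"}}

row = {"account_id": "A-123", "channel": "news"}
alice_view = read_row("alice", row, column_tags, policies)  # sees both columns
bob_view = read_row("bob", row, column_tags, policies)      # account_id is masked
```

The design point the sketch tries to capture is that policies are written against tags rather than against individual tables or columns, so tagging a new column as PII is enough for the existing policy to cover it.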
Barbara Eckman: The deletes are much more performant and you get to view snapshots of past data lake states, which is really pretty awesome. So, we’re really moving toward, I love this word, a lakehouse, being able to do write once, read many and ACID all in one place. And that is largely thanks to Delta Lake. So, this is me. Please reach out to me by email if you wish. And I’ll be happy to answer questions in the live session if you have any. So, thank you very much for listening.
Sam Steiny: Thank you for joining this event, Barbara. That was so awesome. It’s great to hear the Comcast story. So, with that, let’s get to some questions. We’re going to move over to live Q&A. So, please add your questions to that Q&A.