As the size and complexity of data grows, building reliable data pipelines is increasingly important, but also complex and challenging. Learn how Delta Live Tables simplifies the ETL lifecycle on Delta Lake, helping data engineering teams streamline ETL development and ongoing management to improve data quality, scale operations, and reduce cost.
Bryce Bartmann, Senior Data Engineer, Shell
Awez Sayed: Okay. Welcome everyone, glad you could join us today. In this session we’ll talk about a brand new service we are launching, which will change the way you do ETL with Databricks. Delta Live Tables is a new managed service which makes it radically simple to build, deploy and manage your ETL pipelines. Later in this session, I’m joined by Bryce Bartmann from Shell, who’s been an early user of Delta Live Tables and has given us invaluable advice as we have built and evolved this service over the last few months. So with that, let’s dive right into Delta Live Tables.
Now, most of you are familiar with Databricks’ Lakehouse Platform, right? At the heart of our platform is Delta Lake, the data layer which gives you all the benefits of a relational database, but with the economics of an open, file-based store. Delta Lake gives you all those benefits in terms of performance, transactionality and time travel, and it’s a great data layer for multiple use cases and scenarios, be it data engineering, data science, machine learning or straight-up analytics.
Now, using those capabilities of Delta Lake, we wanted to make data engineering, the building and operating of your pipelines, far simpler than what we are used to. So let’s talk about why we started going down this path. For most of you here, this is not a new thing. You’re all used to the pattern of building data pipelines within your data lake, right? You typically have these multiple zones in your data lakes.
You start by bringing the data in, then you have a raw layer, which we sometimes call the bronze layer, and then you do a series of transformations. You’re cleaning your data, you’re augmenting it, you’re standardizing it in what’s typically called a silver layer. And then, finally, you build your business-level aggregations. You add the friendly field names and just make it ready for consumption by different applications.
Now, all of this is really modeled as a pipeline, right? All the way from your source to the target. The complexity in data lake solutions comes from the fact that the sources themselves are complex. First of all, you have a large volume of data arriving, and it comes from multiple systems. Your relational databases bring a lot of initial-load data. You have change data coming from your transactional systems, and then you get data that’s coming from files, and these could be rapidly arriving files. They can come in batch or they can come in streaming mode. And all this data has to be processed. Compounding that is the structure of the data itself.
Sometimes you get well-structured relational data. Sometimes it is JSON files, and sometimes just unstructured data, like images and sound files and things like that. Now all of this needs to be handled properly within the data lake. In this diagram it looks simple enough, but once you actually start scaling it in the data lake, you very soon realize that you have dozens, if not hundreds or even thousands of tables that are built and managed within the data lake.
Now, managing the whole process of your pipelines across all these hundreds of tables can become very tedious and very complex. Typically, we have seen users hit problems like, “Hey, one of my jobs failed. What is the impact of that? Did my data arrive at the right time? Can the rest of the processing happen? Is the quality of the data correct? Who’s checked it?” And then failure and recovery is another issue. Something stops in the middle of your flow, how do you actually manage it, right? And what is the downstream impact? And then lineage is another problem.
So all of these things come into the picture when you’re actually thinking of managing enterprise-level data pipelines for the entire organization. Our whole idea was, “What do we do to make this simple?” So we started talking to many of you, data engineers and data analysts who were trying to get the data right and prepared for different applications, and we came across a set of questions and issues that were recurring.
The first thing was that users found it hard to maintain dependencies between tables. Now you might say that, yes, we can do that with third-party orchestration. Some of this can be done with Airflow, but all these orchestration tools actually control the processes. They’re a workflow solution, right? So you know process one ran, and process two can run after that. But they don’t actually know about the data that each step, each process, is touching, whether that was successful, how many rows got processed. So it’s not really modeling a data flow. It’s really just a control-flow mechanism.
And then there was this other issue of batch versus stream processing. A lot of times, as your data volumes grow, you want to switch to more stream-based processing, while initially you might have been doing batch, a much more scheduled way of processing data.
Then the second set of problems was around data quality. How do you monitor the quality of the data that is flowing through your pipelines? How do you actually understand the lineage of the data, so the business users consuming the data can trust what they are using in their analytics, in their machine learning models, what have you?
The third set of problems was around observability, just having much better monitoring capability for those pipelines, right? Rather than just looking at whether the process has succeeded, you want to understand, at a data level, what is going on with the pipelines. And then users wanted the system to be much more robust. Not just stop right away when there’s an error and require manual intervention to handle the problem. So the whole process of error handling and recovery is another issue that we found users were facing.
So, keeping all of that in mind, that’s what we built Delta Live Tables for. Delta Live Tables is a managed service which runs your pipelines for you. You very simply, declaratively, give the logic that you want to apply to transform your data and to ensure its quality, and Delta Live Tables takes that information and runs the pipelines for you. It also abstracts away a lot of the complexity, in terms of monitoring, maintaining state and so on, and makes it very simple to build, run and operate your pipelines.
Now, we added a ton of new functionality for Delta Live Tables, and it would take me at least a couple of hours to go through all of it, so I will just touch upon the most salient pieces. It’s easy to think of them in buckets, right? We have built one set of capabilities to make it simple to develop your data pipelines. The second is around how you can monitor and enforce data quality in your pipelines. And the third is about how we provide the ability for you to scale your operations, to add new pipelines without having to increase the administrative overhead of your ETL flows.
So we’ll talk about the features in each of these buckets, and then I’ll give a little bit of a demo towards the end, in terms of what it looks like when you actually deploy these pipelines. Right, so in the first area, we have made it much simpler to specify this data flow through your lake. The way we have done it is to provide a very simple API which allows you to chain your logic. So you have a set of transformation logic that you want to apply to your data as it flows from its source to the target.
Along the way, you may persist tables. You may create a bronze table, a silver table, a gold table, and in between you might create temporary views to simplify your logic as the data flows from source to target. So what we have done is allow you to create these simple chains of transformations from your source to the target. And we support all the languages, SQL, Python and Scala, so it’s very easy for you to link together the dependencies between the tables.
So, for example, here you first create a temporary view on the raw data that’s arriving. Then this view feeds your bronze table, and that bronze table feeds the silver table. So now you have specified the dependencies between these tables, and you can let Delta Live Tables ensure that the data flows in the right order. If there’s an error or a hiccup in your pipeline, it knows how to recover from it and ensure that the data is delivered once and only once.
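The view-to-bronze-to-silver chain just described can be sketched in plain Python. This is only a toy model of the idea of declaring dependencies and letting a runner execute them in order; it is not the actual Delta Live Tables API, and the `run_pipeline` helper and step names are invented for illustration:

```python
# Toy sketch: each step declares which upstream steps it reads from,
# and the runner executes steps in dependency order.

def run_pipeline(steps, source_rows):
    """steps: {name: (list_of_upstream_names, transform_fn)}.
    A step with no upstreams receives the raw source rows."""
    results, remaining = {}, dict(steps)
    while remaining:
        for name, (deps, fn) in list(remaining.items()):
            if all(d in results for d in deps):
                inputs = [results[d] for d in deps] or [source_rows]
                results[name] = fn(*inputs)
                del remaining[name]
    return results

# Raw view -> bronze -> silver, as in the spoken example.
steps = {
    "raw_view": ([], lambda rows: rows),
    "bronze":   (["raw_view"], lambda rows: [r for r in rows if r is not None]),
    "silver":   (["bronze"],   lambda rows: [{**r, "clean": True} for r in rows]),
}
out = run_pipeline(steps, [{"id": 1}, None, {"id": 2}])
print(out["silver"])  # two cleaned rows
```

The point is that you only declare what reads from what; the order of execution falls out of the declared dependencies rather than being hand-scheduled.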
So you don’t have to manage the dependencies between the tables; Delta Live Tables manages all of this. It makes it far simpler to express the dependencies between these tables. Now, the other interesting thing we have done: before, when you specified your pipelines, you would probably start building your initial pipelines as a batch process, right? You’d say, “Hey, it’s enough if I process data once a day.” And then sometimes there is data that has to be processed in a much faster fashion. In fact, it may be arriving much more rapidly.
To do that, you might almost have to rethink your logic, right? Switching from batch to streaming can be complex. So we have abstracted all of those problems for you. Literally, if you want to switch from processing data in a batch, scheduled mode to an always-on, continuous mode, all you need to do is set a configuration saying, “Hey, this pipeline runs in continuous mode.”
Once you do that, Delta Live Tables makes sure it handles all of that data in a streaming mode, and you don’t have to worry about it. You don’t have to worry about checkpointing, you don’t have to worry about recovery. All of that is hidden from you, because Delta Live Tables handles it. Another interesting thing we have added is the ability to do either incremental or complete computation for each table.
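The batch-versus-continuous switch can be illustrated with a small sketch. This is hypothetical code, not the real pipeline settings schema: the same transformation runs either once or in a repeated, always-on fashion based on a single flag (here the loop is bounded so the example terminates):

```python
# Toy sketch: one flag flips the same code between one-shot batch
# processing and repeated continuous processing.

def run(transform, get_new_rows, config, max_cycles=3):
    processed = []
    cycles = max_cycles if config.get("continuous") else 1
    for _ in range(cycles):          # a real service would loop indefinitely
        batch = get_new_rows()
        processed.extend(transform(r) for r in batch)
    return processed

feed = iter([[1, 2], [3], [4, 5]])   # simulated micro-batches arriving
rows = run(lambda r: r * 10, lambda: next(feed, []), {"continuous": True})
print(rows)  # [10, 20, 30, 40, 50]
```

The transformation itself never changes; only the configuration decides whether it runs once or keeps picking up new data.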
Most of the data that you process in a data lake is incremental, meaning you only want the changed records coming into the system to be read and processed, because you’ve already processed the earlier data. You don’t want to recompute everything. Now, that’s great for the initial part of your pipeline, as the data is flowing through, maybe even to your silver tables. But maybe when you get to the gold table, you want to recompute all the aggregates you have there, so you have a consistent view, so your dashboards reflect consistent data.
For that, you might do a complete computation, which means that particular table will read all the data from the preceding table and recreate all the calculations. So now you can have part of your pipeline processing data incrementally, which is basically changed data flowing through your pipe, and part of your pipeline doing complete computation.
So it is very simple now to mix and match. Before, it was tedious. You’d have to figure out how to do CDC processing and then completely rethink how you do full processing, right? This makes it far simpler to specify.
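The incremental-versus-complete distinction can be sketched in a few lines. Again this is a toy illustration with invented names, not how Delta Live Tables is implemented: the silver step only processes rows it hasn’t seen yet, while the gold aggregate recomputes over everything upstream each time:

```python
# Toy sketch: a source table that can hand out only its unseen rows.
class IncrementalTable:
    def __init__(self):
        self.rows, self._cursor = [], 0

    def append_source(self, new_rows):
        self.rows.extend(new_rows)

    def new_rows(self):
        out = self.rows[self._cursor:]
        self._cursor = len(self.rows)
        return out

source = IncrementalTable()
silver = []                                        # maintained incrementally
source.append_source([3, 5])
silver.extend(x * 2 for x in source.new_rows())    # processes 3 and 5
source.append_source([7])
silver.extend(x * 2 for x in source.new_rows())    # processes only 7

gold_total = sum(silver)    # complete recomputation over the whole input
print(silver, gold_total)   # [6, 10, 14] 30
```

The silver list never reprocesses old rows, while the gold aggregate is rebuilt from scratch so it always reflects a consistent view.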
Then, you can think of Delta Live Tables as literally a template, right? You can instantiate your pipeline with configurations. Using variables, you can take the same code that you have specified for your pipeline and make it dynamic. So in this particular example, in one instance you might be reading data from one file, and in the second configuration you can switch it, using the same code, to read from a different source.
So it makes your whole pipeline dynamic. It reduces the amount of code you need, and it makes it much simpler to deploy the same code in multiple environments, say dev versus production. It becomes far simpler to do it that way.
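The pipeline-as-template idea can be sketched like this. The config keys and paths here are made up for illustration; the point is simply that one pipeline definition is instantiated with different configurations per environment:

```python
# Toy sketch: the same pipeline code, parameterized by configuration.
def build_pipeline(config):
    def pipeline(rows):
        limit = config["max_value"]
        return [r for r in rows if r <= limit]
    return pipeline, config["source_path"]

dev  = {"source_path": "/data/dev/events",  "max_value": 10}
prod = {"source_path": "/data/prod/events", "max_value": 100}

dev_pipe, dev_src   = build_pipeline(dev)
prod_pipe, prod_src = build_pipeline(prod)
print(dev_pipe([5, 50]), prod_pipe([5, 50]))  # [5] [5, 50]
```

Only the configuration differs between deployments, so the transformation logic is written and tested once.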
Now, of course, some of you out there love notebooks. So we have full support for notebooks, if you want to build your pipeline in Scala, Python or SQL using a notebook environment and deploy and run these pipelines from there. Or you can build your Delta Live Tables pipelines in your IDE with local development and then deploy them into Databricks that way. So we give you complete flexibility in how you want to do the development itself.
Right, now let’s switch to the second set of functionality we’ve added. This is all around data quality. We have heard from many users that data quality is usually an afterthought. You build your pipelines, the data gets delivered, and then maybe there’s a process you run outside of your pipelines which monitors the data in your tables. Maybe you do some checks occasionally to see whether the data meets the business requirements that you have.
But rather than doing this as a separate process, we wanted to make this a key part of the pipeline itself. Right? So Delta Live Tables has a capability called Delta Expectations, where you can set rules for your pipelines. It’s very simple, actually. You can set a constraint that says, “Hey, my data needs to meet this particular rule,” and the rule can be very simple. It can be any Boolean logic, as you can see here. Or it can be arbitrarily complex. You can have user-defined functions. You can keep these rules in external tables, and you’ll be able to dynamically apply them as the data is flowing through your pipeline.
Now, if any record does not meet one of the rules you have specified, out of the box we provide you with error-handling policies. These policies include fail, meaning stop the pipeline if this is a serious error; drop, where you drop the record before it hits the target table; or we simply log it. Support for quarantining the record is coming a little ways down the road. And while doing all of this, it’s tracking the metrics for your quality tests. So as each row goes through, we know how many rules are being met and how many are failing, and we’re capturing all of this information in what we call the event log table.
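The idea of expectations with on-violation policies can be sketched in plain Python. This is not the Delta Live Tables syntax; the rule names, policy constants, and event-log shape below are invented for illustration:

```python
# Toy sketch: rules are boolean predicates over rows, each with a
# policy (fail / drop / log), and every check is recorded in a log.
FAIL, DROP, LOG = "fail", "drop", "log"

def apply_expectations(rows, rules, event_log):
    kept = []
    for row in rows:
        ok = True
        for name, predicate, policy in rules:
            passed = predicate(row)
            event_log.append({"rule": name, "passed": passed})
            if not passed:
                if policy == FAIL:
                    raise ValueError(f"expectation {name!r} failed: {row}")
                if policy == DROP:
                    ok = False      # drop the row, keep processing others
                # LOG: record the metric but keep the row
        if ok:
            kept.append(row)
    return kept

rules = [
    ("valid_id", lambda r: r["id"] is not None, DROP),
    ("recent",   lambda r: r["year"] >= 2020,   LOG),
]
log = []
out = apply_expectations(
    [{"id": 1, "year": 2021}, {"id": None, "year": 2021}], rules, log)
print(out)                      # only the row with a valid id survives
print(sum(e["passed"] for e in log), "of", len(log), "checks passed")
```

Because every check is appended to the log regardless of outcome, quality metrics come for free as a by-product of running the pipeline.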
Now you can look at the quality scores of your pipeline through the Delta Live Tables UI, as you can see here. Or, since we are capturing all these metrics in what is just a Delta table, you can build your own quality dashboards, if you will, and see how the data is performing over time, whether the quality of your data is improving or degrading. This information can be used to build a dashboard that you can show to your business users, in terms of trust in the data, which quality rules have been met and which haven’t.
So it gives you far better visibility into the quality of your data. And some of you may want to integrate this with catalog tools like Collibra, to surface it for the business objects you may be most concerned about.
The other thing we have heard often is lineage: understanding where the data comes from for a particular table, and also being able to do impact analysis, right? If there’s a change in schema, or your initial source table changes, you’d like to understand the downstream impact of it. Now, because of the way you’ve specified your Delta Live Tables, lineage is inherent, and it’s high-quality, high-fidelity lineage, so you can be sure that you understand all the downstream impact of a particular table or field. And that’s visible to you. You can use the UIs we have in Delta Live Tables to get a good sense of the flow of data within your pipeline.
But also, just like we did with data quality scores, we capture this lineage information in a table. So all you have to do is run a select statement and get the lineage for any particular table. You can look at a target and understand its sources, and vice versa. So this lineage is available at your fingertips. And if you want to take this lineage and put it into your own catalog, that’s simple enough to do. So you can have your enterprise-wide lineage if you’d like, and then inspect how your data is flowing throughout your environment.
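Querying lineage captured as rows in a table can be sketched like this. The column names and table names are invented, not the real event-log schema; the sketch just shows a transitive walk over (source, target) edges to find everything upstream of a table:

```python
# Toy sketch: lineage stored as edge rows; walk them to find all
# upstream sources of a given target table.
lineage = [
    {"source": "raw_events", "target": "bronze"},
    {"source": "bronze",     "target": "silver"},
    {"source": "ref_data",   "target": "silver"},
    {"source": "silver",     "target": "gold_daily"},
]

def upstream(table, edges):
    found, frontier = set(), {table}
    while frontier:
        nxt = {e["source"] for e in edges if e["target"] in frontier}
        frontier = nxt - found
        found |= nxt
    return found

print(sorted(upstream("gold_daily", lineage)))
# every table that feeds gold_daily, directly or indirectly
```

Reversing the walk (matching on `source` instead of `target`) gives the downstream impact of a table, which is the impact-analysis case mentioned above.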
Now, let’s talk about the third set of functionality, which is all about operating these pipelines. There are several key things here, and this is where a lot of the benefit of Delta Live Tables comes in, because Delta Live Tables is a service that runs your pipeline. So it is inherently watching and managing your pipelines and making sure they’re running smoothly and correctly.
What happens is, once you deploy your pipeline, the service kicks in, making sure the pipeline is up all the time. If there are any errors, it has automatic error handling. We have several levels of error handling built in, and a lot of it is retries. Nine out of ten times, the errors that you hit in your pipelines are transitory. They are temporary, and once you’ve retried, they are fixed. Not all of them, but the majority.
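The retry behavior for transient errors can be sketched as follows. The names are invented and a real service would add backoff, jitter, and alerting when retries are exhausted; this just shows the retry-then-escalate shape:

```python
# Toy sketch: retry a step a few times before giving up and
# surfacing the error for manual attention.
def run_with_retries(step, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except RuntimeError:            # treated as transient here
            if attempt == max_attempts:
                raise                   # exhausted: needs intervention

attempts = {"n": 0}
def flaky_step():
    attempts["n"] += 1
    if attempts["n"] < 3:               # fails twice, then succeeds
        raise RuntimeError("transient connection error")
    return "loaded 100 rows"

result = run_with_retries(flaky_step)
print(result)  # succeeds on the third attempt
```

Because most pipeline errors are transitory, this kind of automatic retry resolves the majority of failures without anyone being paged.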
So Delta Live Tables does all the automatic error handling for you, reducing the time you spend responding to broken pipelines or trying to fix them manually. The other thing we’ve added is the ability to replay the logic of your pipeline. Many times you have to change the transformations you are doing; maybe there’s a different calculation you want to apply to the data as it flows through the pipeline. Now that is very simple. With literally one click, you deploy the new version of your pipeline and then do what we call a full refresh.
So you can rerun the entire pipeline with all the data from the source, and all the tables along the way get recomputed, if you want them to be. If the change affects only certain tables, you can recompute those tables alone and not the entire set of tables within your pipeline. It makes it simple for you to deploy and reprocess parts of your pipeline without having to manually figure out which tables need to be reprocessed and which are fine to be left as they are after the logic changes.
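The selective-refresh idea can be sketched in a few lines. The dependency map and helper name are invented for illustration: after a logic change to one table, only that table and everything downstream of it needs recomputing:

```python
# Toy sketch: given which tables read from which, find the set of
# tables made stale by a change to one table.
deps = {                       # table -> tables it reads from
    "bronze": ["raw"],
    "silver": ["bronze"],
    "gold_a": ["silver"],
    "gold_b": ["bronze"],
}

def tables_to_refresh(changed, deps):
    stale, frontier = {changed}, {changed}
    while frontier:
        frontier = {t for t, srcs in deps.items()
                    if set(srcs) & frontier} - stale
        stale |= frontier
    return stale

print(sorted(tables_to_refresh("silver", deps)))  # ['gold_a', 'silver']
```

Here a change to the silver logic leaves bronze and gold_b untouched, which is exactly the manual bookkeeping the service spares you from.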
So it really makes it far simpler to do reprocessing, to do error handling, and just maintenance of the pipelines themselves. Another key thing: Delta Live Tables is built on Delta tables, obviously, so one of the advantages you get is that Delta Live Tables has a full understanding of the tables, how they’re being updated and how often the data is coming through. So it’s able to do a lot of optimizations itself.
Today, you might set auto optimize on a table, you might be compacting the table, you might be optimizing it on a schedule. You don’t have to do any of that anymore, because Delta Live Tables eliminates it and automatically optimizes and tunes the tables for performance. You can, of course, still set additional Z-order indexing if you’d like to, but the idea is that we want to remove the complexity of having to tune and manage your Delta Live Tables; the system takes care of it. Okay. Finally, probably the most important thing is the observability you get into the operations of your pipeline.
So now you get a sense of how the pipeline is behaving at a row-by-row level: how many rows have been processed, what the throughput of the pipeline is. Before, this used to be a black box, in a way, in terms of understanding how the data is flowing through your pipeline. You would probably know whether the whole process succeeded or not, but now, at a row-by-row level, you get to see how the transforms are working, which tables have been updated, how long it has been since the last update. You understand the throughput and you can track your performance over time as the data evolves.
So this is just an incredible amount of benefit that you get in terms of manageability and observability with Delta Live Tables. With that in mind, let me do a very short demo to give you a sense of what it looks like within Databricks itself. Here’s a very simple SQL-based pipeline. We saw a little bit of a code snippet earlier, but as you can see here, what you are doing is really creating a set of tables and views, and you’re linking those tables and views together by referring to each other, right? In this particular case we are creating views on customers and orders, simple enough, and then an intermediate table where you are joining your stores and customers, and then branching out into different tables in your gold layer, where you are doing additional processing and transformation.
It’s very simple: once you have specified it this way, in your notebook or your IDE, you can just switch into the Pipelines tab here, do a create pipeline, give it a name, which is trivial enough, choose a particular pipeline, and say create, and it goes off and creates the pipeline for you. While it’s creating that, here’s one that I’ve already deployed, and this is what it looks like once it’s deployed. You get to see the structure of your pipeline: the tables it’s reading from, the joins that are happening, and the final target tables it’s creating.
So you have a good sense of your entire pipeline in terms of the tables involved and the dependencies between them. Now, if we look at a slightly more complicated pipeline, here’s one that’s written in Python, and again, the same process applies. You’re creating a product master and a customer master in this particular case, which are both views. Then you join those, apply the transformations you want, your filters, your data quality checks, what have you, and when you deploy it, you again get to see the behavior of the pipeline. In this particular case it’s running. You can see the entire flow here, and it can get arbitrarily complex.
And, by the way, you don’t even have to link all of this together into a single flow. You might have independent flows within the same pipeline that you want to look at. All right, and the other thing, as we were talking about configuring and customizing your pipeline: you can set a lot of parameters here, and you can change the way the pipeline behaves by setting those parameters.
In this particular case, you can see we have set the continuous flag to false, so this is a batch pipeline. It’ll run, it’ll read all the data from the source, and then when it’s finished, it shuts down. So it’s not always up and running. A different kind of pipeline is here, for example. This is a slightly more complicated version. Here we are reading from a Kinesis source, and when we’re talking about a Kinesis source, it’s always up, so it’s a continuously processing pipeline. You can see the data flowing through it, and if you look at the settings, you’ll see that it is a continuously operating pipeline. It’s always streaming, picking up the data from your Kinesis source, transforming it and applying it.
You can set a whole bunch of parameters if you want to. Right now you get to set the compute, but the idea is that down the road we will abstract all of that away, so you don’t have to worry about the compute. The pipeline itself, based on the history of the data, will know what kind of compute to use and how to auto-scale, all of that, right? So it makes it far simpler.
Now, if we go back to the graph here, since this is a continuously running pipeline, I have all the statistics available, so I can click on a table. Of course I see the schema of the table, but I can also see the stream that is running. I can see it’s processing rows, and how long it took to process the last set of rows in this particular system.
You can get the same visibility into the different tables here. For example, if you go to the Delta operations table that we are looking at here, you will see that there is actually a data quality rule assigned to it, which shows that 90% of the records that arrive in this particular table meet the quality rules. Then, if you want, you can click on it and see the details of the particular quality rules that have been set on that table.
In this particular case, we’ve got three basic rules that we’re just checking for in this instance. But you get a good sense of how good your data is as it flows into that particular target table. You still have 9% or so of the data that didn’t meet those rules, so you might want to take corrective action.
Now, as I mentioned, the most important thing for us was to make all these statistics, all the operational metadata of your pipelines, easily available. All of this is just stored in a Delta table. You can write your own queries, build your own dashboards with your own select statements, and you can set your own alerts if you use SQL Analytics. You can set an alert like, “Hey, I’m expecting the data to be processed within a few minutes; if not, raise an alert.” So you can build your own monitoring system on top of it, or export it out to any system you may already have. You can also look at the quality rules and scores and analyze them, so you have a much better sense of how the data is proceeding through your pipeline, right?
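A freshness alert built on that operational metadata can be sketched like this. The event-log fields, threshold, and function name are invented for illustration; the idea is just to compare each table’s last update time against an expected interval:

```python
# Toy sketch: flag tables whose last successful update is too old.
import datetime as dt

def freshness_alerts(event_log, max_age_minutes, now):
    """Return messages for tables that have gone stale."""
    alerts = []
    for table, last_update in event_log.items():
        age = (now - last_update).total_seconds() / 60
        if age > max_age_minutes:
            alerts.append(f"{table}: no update for {age:.0f} min")
    return alerts

now = dt.datetime(2021, 5, 26, 12, 0)
event_log = {
    "silver": dt.datetime(2021, 5, 26, 11, 58),   # fresh
    "gold":   dt.datetime(2021, 5, 26, 10, 30),   # stale
}
print(freshness_alerts(event_log, max_age_minutes=15, now=now))
```

In practice you would run this kind of check as a scheduled query against the event log table and route the messages to whatever alerting system you already use.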
In addition to data quality, you can also see your operational metrics here. So, in a way, you get a level of visibility that was impossible to get before.
All right, in summary: with Delta Live Tables, just imagine a completely different way of doing ETL. It’s a much simpler, easier way to do it. Development is far less complicated than it was before, in terms of how you specify the pipeline and how you handle errors and retries. You’ve got built-in data quality, and finally you’ve got a lot better observability, manageability and error handling that Delta Live Tables provides out of the box.
So, with that, let me turn it over to Bryce from Shell. Bryce has been a great user of Delta Live Tables. He’s given us a lot of feedback over the months in terms of meeting his use cases and scenarios. So, Bryce, I’d love to hand this over to you to talk about Shell and your experiences with Delta Live Tables.
Bryce Bartmann: Great, thanks Awez, and thanks for introducing me. As Awez said, I’m Bryce Bartmann, principal AI engineer in Shell’s data science organization, and I want to talk a little bit today about intelligent pipelining of data, some of the challenges that we face as an organization in ingesting large amounts of data, and how we’re using Delta Live Tables to solve some of those problems. As you are probably quite aware, we are in the midst of rapid change in the energy business, where we are facing not only a transition from hydrocarbon fuels into new energies; digitalization is also having a large impact on that.
And so, today I want to talk a little bit about our journey, what Shell is doing in this space, and how we are working with Databricks to solve some of these problems. Just a cautionary note: some of the things I’m about to say are forward-looking. Please treat them with caution; that’s just something we need to make sure you’re aware of as part of any of our communications.
All right, so let’s get into it. The pace of digital adoption is dramatically increasing at Shell. We are on a journey to move into the digital space, and we’ve been investing very heavily here in terms of resources, people and processes. We are making strides in this area, and here are some examples of what we are starting to achieve.
In 2020, we had 63 AI-powered applications. We are also ingesting vast amounts of sensor data from our assets; to date, about 1.6 trillion events have been ingested. We have 6,000 pieces of equipment being monitored using AI. We’ve flown over 1,200 drone missions. We have robots capturing video footage around our assets, and we’re seeing a 10x increase in the use of virtual rooms due to COVID. We have one and a half million registered users of our Go Plus application, our loyalty scheme. And we’re also making big strides in terms of investing in people at Shell. We have over 350 people in our digital Center of Excellence, and around 800 citizen data scientists working at Shell.
So we’re on a journey, we are making great strides, and we have a number of use cases and scenarios at Shell where data is making a big impact. So let’s talk about a few of them.
Digitalization and AI are driving efficiency across a number of our businesses. We use proactive technical monitoring to deliver around 130 million dollars of value to the business through finding small and fixing small. We are now supplementing this with AI, and we have a predictive maintenance program that is using that large amount of event data I mentioned earlier, about 1.6 trillion data points, in machine learning models to detect anomalies in our data.
We also have real-time use cases. Could we start optimizing some of our production in real time, looking at real-time feeds coming from our assets? We’ve deployed a real-time production optimization application at our liquefied natural gas sites, and it’s achieving 1 to 2% production increases while, at the same time, reducing some of our CO2 emissions.
And then another example is digital twins. What a digital twin does is create a virtual environment of a real asset. These are data-hungry applications. We’ve partnered with Kongsberg on this. And we need to pipeline vast amounts of data for each asset into the digital twin to make it useful. For example, we need to ingest sensor data, typically from hundreds or thousands of sensors. We need to ingest video footage, audio footage, drawings, anything to make that virtual reality more meaningful and more representative of real life.
And we’ve rolled this out to one of our plants, in Nyhamna, and we’ve got five other assets that we’re looking at deploying to in 2021. So in our existing business, there is a lot of work we need to do to deliver data to these applications. Additionally, we also need to look at the next generation of clean energy and where we’re going with that. Shell offers customers over 165,000 charge points around the world, which makes Shell one of the largest electric transport solution providers in the world. That generates large amounts of data, and we’re using it to try and optimize how vehicles are charged by understanding and learning people’s interactions with these charge points.
How can we make it more efficient for the user, and how can we also protect the grid at the same time by making sure we deploy the energy where it needs to be at the optimal time? Additionally, we are also investing and working very hard on how we can start leveraging hydrogen more efficiently in the energy transition. Our projects include developing methods of production, storage and transportation. And converting existing industries, such as the cement and steel industries, to start using more clean energy is complex. We can't just try out every type of scenario in the real world, so we're using data to build data-driven simulation models, combined with physics-based models, to optimize traditional experimentation.
And lastly, we're also doing a lot of work around using data in wind farms. Building wind farms is complex. How we do it, where we build them and how we should set it all up can be optimized, so we're using data and AI to calculate how to do all of that. Now, to achieve all of this, we've obviously built out a capability within Shell. We're very proud of what we've done in terms of AI and data science. But we obviously cannot do any of this without working with our partners, and that is why I'm talking to you today: we've been working really closely with Databricks, and as Awez mentioned, I've also been an early user of Delta Live Tables.
And I want to talk a little bit about the challenges that we face as an organization in delivering data for all of the kinds of applications I've just mentioned, and also what we need to build production-grade pipelines that we can deploy around the world at scale, delivering everything we need from one corner of the globe to the other.
So, let's talk about what I consider some of the key data pipelining requirements we need to achieve. Firstly, connectors. We need to be able to connect to many types of data sources, whether it's a file system, databases, APIs or streaming sources in the pub/sub domain. Cross-cloud, we need to be able to connect to as many data sources as we can, as well as being able to write to different locations.
Secondly, we need transformation and AI as part of our data pipelines. We need to be able to perform simple and complex transformations on the data, but what is also very important these days is to start factoring AI into your data ingestion. Could you do things like face detection to conform to privacy requirements? Could you do machine learning on the file, or anomaly detection, while you're ingesting data?
These are all becoming important features of what we're developing at Shell. Thirdly, quality. We need to make sure we can be confident in the data we're looking at. Do we know the standards of the data that we've got, or are we just ingesting whatever we're receiving? Can we control it, quarantine it, or even drop it if we don't need it?
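The control/quarantine/drop idea is what Delta Live Tables expectations give you declaratively; the routing logic behind it can be sketched in plain Python. The rule names, fields and thresholds below are illustrative assumptions, not Shell's actual checks:

```python
# Illustrative sketch: route incoming records by data-quality rules.
# "keep"       -> record passes all checks
# "quarantine" -> record fails a soft rule and is set aside for review
# "drop"       -> record fails a hard rule and is discarded

def classify(record):
    # Hard rule (assumption): a record without an id is unusable.
    if record.get("sensor_id") is None:
        return "drop"
    # Soft rule (assumption): out-of-range readings are suspicious,
    # so keep them aside rather than discard them.
    if not (-50 <= record.get("temperature", 0) <= 150):
        return "quarantine"
    return "keep"

def route(records):
    buckets = {"keep": [], "quarantine": [], "drop": []}
    for r in records:
        buckets[classify(r)].append(r)
    return buckets

batch = [
    {"sensor_id": "a1", "temperature": 21.5},
    {"sensor_id": "a2", "temperature": 900.0},  # fails the range rule
    {"sensor_id": None, "temperature": 20.0},   # fails the hard rule
]
buckets = route(batch)
```

In Delta Live Tables itself the same intent is expressed with decorators such as `@dlt.expect_or_drop` on a table definition, and the pipeline tracks how many records each expectation caught.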
Recovery is also really important. When we deploy pipelines at scale, we want to make sure they can recover automatically. When we're thinking about a pipeline running across the globe, we can't rely on manual interventions to restart it. So recovery is really important.
And in a similar vein, scaling needs to happen as quickly as it is needed. We may go through periods where there is low demand on some of our pipelines, and we want to use the elasticity of the cloud to reduce our cost where we can, but we also want to scale up where we need to, to achieve some of the SLAs we have around our data.
And finally, monitoring. We want to make sure we have a good picture of what is going on in our pipelines. One of the things that is often really useful in a streaming pipeline is for it to alert you when it's seeing numbers much lower than expected. We also want to be able to identify where we are seeing challenges with performance and what the metrics are for all of these pipelines.
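That "numbers much lower than expected" alert can be sketched as a simple threshold check over recent batch counts. The window size and drop factor below are made-up illustrative values, not anything Shell or Databricks prescribes:

```python
# Illustrative sketch: flag a streaming pipeline whose recent throughput
# falls well below its historical average (window/factor are assumptions).

def should_alert(batch_counts, window=5, factor=0.5):
    """Alert if the mean of the last `window` batch counts drops below
    `factor` times the mean of all earlier batches."""
    if len(batch_counts) <= window:
        return False  # not enough history to judge
    history = batch_counts[:-window]
    recent = batch_counts[-window:]
    baseline = sum(history) / len(history)
    current = sum(recent) / len(recent)
    return current < factor * baseline

steady = [1000, 990, 1010, 1005, 995, 1000, 1002, 998, 1001, 999]
dropped = steady[:5] + [100, 120, 90, 110, 105]  # throughput collapses
```

A real deployment would feed this from the pipeline's event metrics and wire the alert into whatever paging or notification system the team already uses.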
So, in my eyes, these are the six key features or requirements we need to build production-grade, fault-resilient pipelines at Shell. So, with that, when Databricks pitched the idea of Delta Live Tables to me a while ago, I was really excited to see that it was tackling some of these problems.
So let's talk about Delta Live Tables and what I see as the key features that make it really exciting for us to use at Shell. Firstly, it's code-driven. We have found that if you want to build production-grade pipelines, you need to take into account all six of the key requirements I mentioned earlier. They are not simple. They start off simple, but they don't finish simple. They get complex, and there is a lot of functionality they need to deliver. They need to be sending metrics to certain locations. They need to be doing transformations. They may even need to send the data off to a separate vendor at some point and then come back and carry on.
So we really like the idea of Delta Live Tables being code-driven, using Spark as the language. We've also seen that it drives much better software engineering standards around how we build our pipelines. Secondly, Spark is obviously a great framework for building data pipelines, and in this case Delta Live Tables. Spark has many built-in transformations we can use. It also scales, and the connectors we find in Spark are great. So it's a very good fit for what we want to achieve.
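In Delta Live Tables a pipeline is declared as Python (or SQL) functions decorated with `@dlt.table`, which only runs inside a Databricks pipeline. The bronze-to-silver-to-gold flow described earlier can still be sketched outside that runtime as plain composable functions; the field names and CSV-ish input are invented for illustration:

```python
# Illustrative sketch of a bronze -> silver -> gold flow as plain functions.
# In Delta Live Tables each stage would be an @dlt.table function reading
# from the previous table; here we use lists of dicts instead.

def bronze(raw_lines):
    # Land the raw data as-is, one record per input line (assumed "id,temp").
    return [dict(zip(["sensor_id", "temperature"], line.split(",")))
            for line in raw_lines]

def silver(bronze_rows):
    # Clean and standardize: cast types, drop malformed rows.
    out = []
    for row in bronze_rows:
        try:
            out.append({"sensor_id": row["sensor_id"].strip(),
                        "temperature": float(row["temperature"])})
        except (KeyError, ValueError):
            continue  # malformed rows are dropped at the silver stage
    return out

def gold(silver_rows):
    # Business-level aggregation: average temperature per sensor.
    totals = {}
    for row in silver_rows:
        t = totals.setdefault(row["sensor_id"], [0.0, 0])
        t[0] += row["temperature"]
        t[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}

raw = ["a1,20.0", "a1,22.0", "a2,oops", "a2,30.0"]
report = gold(silver(bronze(raw)))
```

The point of the code-driven approach is that each stage is an ordinary function you can unit-test, which is exactly what makes the software engineering standards easier to enforce.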
Thirdly, it's Delta-centric. As Awez mentioned a little earlier, we find that developers often like to write their data into Delta, but they forget the small things that actually become very significant when it comes to performance. Am I running OPTIMIZE on my tables? Have I vacuumed my tables? Have I thought about Z-ordering? Delta Live Tables brings all of that into the thought process of our developers, and the maintenance jobs it automatically schedules to perform those tasks are actually key to making sure your Delta Live Tables perform as you'd expect.
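Those maintenance tasks correspond to Delta Lake SQL commands that Delta Live Tables schedules on your behalf. Run by hand they look roughly like the statements below; the table name, Z-order column and retention period are illustrative assumptions. The helper only formats the SQL text — in a real notebook each string would be passed to `spark.sql(...)`:

```python
# Illustrative sketch: the Delta maintenance statements that Delta Live
# Tables schedules automatically. This helper only builds the SQL strings.

def maintenance_statements(table, zorder_col, retain_hours=168):
    return [
        # Compact small files, clustering by a frequently filtered column.
        f"OPTIMIZE {table} ZORDER BY ({zorder_col})",
        # Remove data files no longer referenced by the transaction log,
        # keeping enough history for time travel (retention is an assumption).
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]

stmts = maintenance_statements("events.sensor_readings", "sensor_id")
```

Forgetting either of these is exactly the "small thing" that degrades query performance over time, which is why having them scheduled automatically matters.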
As Awez also demonstrated, the UI monitoring is really useful for us. Being able to see how our data and its lineage are formed, and how the data is flowing through each of those points, is really useful information. We get to see what the performance is and how many rows are flowing through the system, and that's a key component we've been missing until now that we really like about Delta Live Tables.
Lakehouse. One of the big steps, and I'm sure you'll be hearing about many of the innovations going on around Delta, is the Lakehouse environment. What we like most about Delta is that it gives our data lake data warehouse-like capabilities. The more often we move data between different storage components in a pipeline, the more faults we introduce into the pipeline itself. The more we can reduce the number of hops the data makes, the more accurate and reliable we find our pipelines to be.
Lakehouse achieves this by making sure we don't move data between different storage components just for the sake of doing so. Aggregating data in another storage component doesn't really need to happen anymore once we start looking at Lakehouse. SQL endpoints are the other component that makes Lakehouse a very attractive option: they give us the compute to query all of this data we're writing into Delta and show it in tools like Power BI and other BI tools.
And lastly, the innovation. What we've really enjoyed about Delta Live Tables is how Databricks has engaged with Shell to build out its capabilities at every point. Literally on a weekly basis, we've been working with Databricks and the Delta Live Tables product team to explain how we're using Delta Live Tables, what we like about it, what we think needs to be improved and what we want to see in the future.
We also see the benefit of doing this in providing Databricks with real-world scenarios of how we use Delta Live Tables, and how our development teams and the organizations they work in would use it in those scenarios. One example of where we've deployed Delta Live Tables at Shell is ingesting IoT sensor data directly from the cloud, or over 4G and 5G signals, into Delta; we've also built SQL endpoints on top of that so people can query the data and view it in real time.
Another is that we've started ingesting video feeds from some of our assets and using Delta Live Tables, with some intelligence, to do facial detection on that video before we store it in our data lake.
So we've been having a great experience with Delta Live Tables, and I wanted to give some highlights of the benefits Shell is seeing from using it. Some of our developers are calling it the new gold standard for data pipelines. Delta Live Tables makes it easier for us to build intelligence into our data ingestion processes. Having PySpark as part of your Delta Live Tables pipeline opens up all the ML libraries available in PySpark, so you can actually start thinking about intelligence as part of your ingestion process.
Delta maintenance tasks are no longer an afterthought for developers. As I mentioned, Delta Live Tables automatically sets up your maintenance job and runs it on a daily basis for you, performing all the tasks needed to keep your tables performing properly. And expectations have obviously been a very good feature that we've enjoyed seeing included in Delta Live Tables; some people are saying they're starting to trust their data more because of what expectations bring to their pipelines.
So, in all, we’re really excited about what we’re seeing with Delta Live Tables. We’ve really enjoyed working with Databricks on it and we look forward to all of the new functionality that is coming on the road map and to deploying more with Delta Live Tables at Shell. So back to you Awez.
Awez Sayed: Thank you, Bryce. It's been a great pleasure working and collaborating with you. Thank you for all the input you've given us; it has helped us make this product very suitable for Shell and, hopefully, for many other customers in the audience today. Thank you everyone for joining us and taking the time to learn about the exciting new service we are launching. We hope to be working with many of you as you start trying out Delta Live Tables going forward.
Awez is VP of Product Management for the data platform at Databricks helping customers build open, performant and effective data engineering and analytics solutions on Delta Lake. Previously Awez was ...
Bryce is a Senior Data Engineer in Shell’s Data Science COE overseeing cloud technologies, including Databricks and Azure. After 15 years of working on data in global SAP ERP implementations, Bryce ...