Since the availability of Delta Live Tables (DLT) on all clouds in April (announcement), we’ve introduced new features to make development easier, enhanced automated infrastructure management, announced a new optimization layer called Project Enzyme to speed up ETL processing, and enabled several enterprise capabilities and UX improvements.
DLT enables analysts and data engineers to quickly create production-ready streaming or batch ETL pipelines in SQL and Python. DLT simplifies ETL development by letting you define your data processing pipeline declaratively. DLT understands your pipeline’s dependencies and automates nearly all of the operational complexity.
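To make the idea of dependency-aware execution concrete, here is a minimal sketch in plain Python. It is not DLT code: the table names and the `execution_order` helper are hypothetical, and the standard-library `graphlib` module stands in for whatever DLT does internally. It only shows how an execution order can be derived from declared inputs rather than written by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each table maps to the set of tables it reads from.
pipeline = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "orders_by_customer": {"clean_orders", "raw_customers"},
}

def execution_order(tables):
    """Return table names ordered so every table comes after its inputs."""
    return list(TopologicalSorter(tables).static_order())

order = execution_order(pipeline)
print(order)
```

Because the order is computed from the declared inputs, adding a new table only requires stating what it reads from.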
Since its inception, Delta Live Tables has grown to power production ETL use cases at leading companies all over the world. DLT is used by over 1,000 companies ranging from startups to enterprises, including ADP, Shell, H&R Block, Jumbo, Bread Finance, and JLL.
With DLT, engineers can concentrate on delivering data rather than operating and maintaining pipelines. Recent additions include support for Change Data Capture (CDC), which makes it easy to efficiently capture continually arriving data, and a preview of Enhanced Autoscaling, which provides superior performance for streaming workloads. Let’s look at the improvements in detail:
Make development easier
We have extended our UI to make it easier to manage the end-to-end lifecycle of ETL.
UX improvements. The DLT UI now makes it easier to manage pipelines, view errors, and grant access to team members with rich pipeline ACLs. We have also added an observability UI that shows data quality metrics in a single view, and made it easier to schedule pipelines directly from the UI. Learn more.
Schedule Pipeline button. DLT lets you run ETL pipelines continuously or in triggered mode. Continuous pipelines process new data as it arrives, and are useful in scenarios where data latency is critical. However, many customers choose to run DLT pipelines in triggered mode to control pipeline execution and costs more closely. To make it easy to trigger DLT pipelines on a recurring schedule with Databricks Jobs, we have added a ‘Schedule’ button in the DLT UI to enable users to set up a recurring schedule with only a few clicks without leaving the DLT UI. You can also see a history of runs and quickly navigate to your Job detail to configure email notifications. Learn more.
Change Data Capture (CDC). With DLT, data engineers can easily implement CDC with a new declarative APPLY CHANGES INTO API, in either SQL or Python. This new capability lets ETL pipelines easily detect source data changes and apply them to datasets throughout the lakehouse. DLT processes data changes into Delta Lake incrementally, flagging records to insert, update, or delete when handling CDC events. Learn more.
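The semantics of applying CDC events can be sketched in plain Python. This is an illustration of the insert, update, and delete handling described above, not the APPLY CHANGES INTO API itself: the `apply_changes` function, its parameters, and the event fields (`id`, `op`, `seq`) are all assumptions made for the example.

```python
def apply_changes(target, events, key, sequence_by, is_delete):
    """Apply CDC events to `target` (a dict keyed by primary key), in place.

    Events are processed in `sequence_by` order. An event that is not newer
    than the version already stored for its key is ignored. This sketch
    keeps no tombstones, so it does not guard against a late event that
    arrives after its key was deleted.
    """
    for event in sorted(events, key=lambda e: e[sequence_by]):
        k = event[key]
        current = target.get(k)
        if current is not None and current[sequence_by] >= event[sequence_by]:
            continue  # stale or duplicate event
        if is_delete(event):
            target.pop(k, None)
        else:
            target[k] = event  # insert or update
    return target

def is_delete(event):
    return event["op"] == "DELETE"

events = [
    {"id": 1, "name": "Ann", "op": "INSERT", "seq": 1},
    {"id": 2, "name": "Bob", "op": "INSERT", "seq": 2},
    {"id": 1, "name": "Anna", "op": "UPDATE", "seq": 4},
    {"id": 2, "name": "Bob", "op": "DELETE", "seq": 3},
    {"id": 1, "name": "Ann B.", "op": "UPDATE", "seq": 3},  # arrives out of order
]

target = apply_changes({}, events, key="id", sequence_by="seq", is_delete=is_delete)

# A late event for key 1 arrives in a later batch and is ignored.
late = [{"id": 1, "name": "Old name", "op": "UPDATE", "seq": 2}]
apply_changes(target, late, key="id", sequence_by="seq", is_delete=is_delete)
print(target)
```

The sequencing column is what lets out-of-order events resolve to the latest version of each record.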
CDC Slowly Changing Dimensions (Type 2). When dealing with changing data (CDC), you often need to update records to keep track of the most recent data. SCD Type 2 is a way to apply updates to a target so that the original data is preserved. For example, if a user entity in the database moves to a different address, we can store all previous addresses for that user. DLT supports SCD Type 2 for organizations that need to maintain an audit trail of changes. SCD Type 2 retains a full history of values: when the value of an attribute changes, the current record is closed, a new record is created with the changed data values, and this new record becomes the current record. Learn more.
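The close-and-append behavior described above can be sketched in a few lines of plain Python. This is a toy model rather than DLT code: the `apply_scd2` function and the `start_at` and `end_at` field names are hypothetical and chosen only for illustration.

```python
def apply_scd2(history, key, new_values, effective_at):
    """Record a change for `key` in `history` (a list of row dicts).

    The current row for the key (the one whose end_at is None) is closed
    and a new current row is appended. Unchanged values are a no-op.
    """
    current = next(
        (r for r in history if r["key"] == key and r["end_at"] is None), None
    )
    if current is not None:
        if current["values"] == new_values:
            return history
        current["end_at"] = effective_at  # close the current record
    history.append(
        {"key": key, "values": new_values, "start_at": effective_at, "end_at": None}
    )
    return history

history = []
apply_scd2(history, "user-1", {"address": "1 Main St"}, effective_at=1)
apply_scd2(history, "user-1", {"address": "9 Oak Ave"}, effective_at=5)
for row in history:
    print(row)
```

After the second call, the first address is still present with a closed validity interval, and the row with `end_at` set to `None` is the current record.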
Automated Infrastructure Management
Enhanced Autoscaling (preview). Manually sizing clusters for optimal performance is challenging when data volumes are changing and unpredictable, as with streaming workloads, and it can lead to overprovisioning. Current cluster autoscaling is unaware of streaming SLOs: it may not scale up quickly even when processing is falling behind the data arrival rate, and it may not scale down when load is low. DLT employs an enhanced autoscaling algorithm purpose-built for streaming. DLT’s Enhanced Autoscaling optimizes cluster utilization while ensuring that overall end-to-end latency is minimized. It does this by detecting fluctuations in streaming workloads, including data waiting to be ingested, and provisioning the right amount of resources (up to a user-specified limit). In addition, Enhanced Autoscaling gracefully shuts down clusters whenever utilization is low while guaranteeing the evacuation of all tasks to avoid impacting the pipeline. As a result, workloads using Enhanced Autoscaling save on costs because fewer infrastructure resources are used. Learn more.
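As a rough intuition for backlog-aware scaling, consider the toy heuristic below. It is purely illustrative and is not the Enhanced Autoscaling algorithm, whose details are not described here. The `target_workers` function, its parameters, and the assumption that throughput scales linearly with worker count are all hypothetical.

```python
import math

def target_workers(backlog_records, arrival_rate, per_worker_rate,
                   drain_seconds, min_workers, max_workers):
    """Workers needed to keep up with arrivals and drain the backlog in time.

    Rates are in records per second. The result is clamped to the
    user-specified [min_workers, max_workers] range.
    """
    required_rate = arrival_rate + backlog_records / drain_seconds
    needed = math.ceil(required_rate / per_worker_rate)
    return max(min_workers, min(max_workers, needed))

# Backlog is building up: scale out.
busy = target_workers(600_000, 5_000, 1_000, 300, min_workers=1, max_workers=10)
# Almost idle: scale in to the minimum.
idle = target_workers(0, 200, 1_000, 300, min_workers=1, max_workers=10)
# Demand exceeds the limit: stay at the user-specified maximum.
capped = target_workers(6_000_000, 5_000, 1_000, 300, min_workers=1, max_workers=10)
print(busy, idle, capped)
```

The point of the sketch is that the decision uses data waiting to be processed, not only current utilization, and never exceeds the configured limit.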
Automated Upgrade & Release Channels. DLT clusters use a DLT runtime based on Databricks Runtime (DBR). Databricks upgrades the DLT runtime about every 1-2 months, automatically and without end-user intervention, and monitors pipeline health after the upgrade. If DLT detects that a DLT pipeline cannot start because of a DLT runtime upgrade, it reverts the pipeline to the previous known-good version. You can get early warning of breaking changes to init scripts or other DBR behavior by using DLT channels to test the preview version of the DLT runtime, and be notified automatically if there is a regression. Databricks recommends using the CURRENT channel for production workloads. Learn more.
Announcing Enzyme, a new optimization layer designed specifically to speed up the process of doing ETL
Transforming data to prepare it for downstream analysis is a prerequisite for most other workloads on the Databricks platform. While SQL and DataFrames make it relatively easy for users to express their transformations, the input data constantly changes. This requires recomputation of the tables produced by ETL. Recomputing the results from scratch is simple, but often cost-prohibitive at the scale many of our customers operate.
We are pleased to announce that we are developing Project Enzyme, a new optimization layer for ETL. Enzyme efficiently keeps a materialization of the results of a given query, stored in a Delta table, up to date. It uses a cost model to choose between various techniques, including techniques used in traditional materialized views, delta-to-delta streaming, and manual ETL patterns commonly used by our customers.
Get started with Delta Live Tables on the Lakehouse
Watch the demo below to discover the ease of use of DLT for data engineers and analysts alike:
If you are a Databricks customer, simply follow the guide to get started. Read the release notes to learn more about what’s included in this GA release. If you are not an existing Databricks customer, sign up for a free trial. You can view our detailed DLT pricing here.
Join the conversation in the Databricks Community where data-obsessed peers are chatting about Data + AI Summit 2022 announcements and updates. Learn. Network. Celebrate.