Materialized views

Delta Pipelines / Materialized Views in Databricks Delta

Intro

Delta Pipelines provides a set of APIs and a UI for managing the data pipeline lifecycle. This open-source framework helps data engineering teams simplify ETL development, improve data reliability, and scale operations. Instead of hand-coding transformations and scheduling jobs, you declare the end state you want for your data by building declarative pipelines. You can chain dependencies between tasks so that updating one table automatically triggers all downstream reprocessing. With Delta Lake, you get reliable data lakes; with Delta Pipelines, you get reliable data pipelines for your data workflows.

Delta Pipelines Datasets

The primary abstraction in Delta Pipelines is the Dataset. A Dataset is essentially a query or view over an underlying table, expressed through the Delta Pipelines APIs. Below is an example using the Scala API:


dataset("events")
   .query {
       input("base_table")
       .select(...)
       .filter(...)
   }
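
To make the dependency chaining mentioned in the intro concrete, below is a hedged sketch of two chained Datasets, where the second consumes the first. The table name “page_views”, the column names, and the string arguments to select and filter are all hypothetical, since the exact signatures of those operators are not shown above; treat this as an illustration of the chaining idea rather than exact API usage:

// Hypothetical sketch: "page_views" names "events" as its input, so
// Delta Pipelines knows that any update to "events" must trigger
// reprocessing of "page_views" downstream.
dataset("events")
  .query {
    input("base_table")
      .select("timestamp", "user_id", "event_type")   // hypothetical columns
      .filter("event_type IS NOT NULL")               // hypothetical predicate
  }

dataset("page_views")
  .query {
    input("events")                                   // depends on the dataset above
      .filter("event_type = 'page_view'")             // hypothetical predicate
  }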

Creating Materialized Views of Datasets

Delta Pipelines also lets you create a materialized view of the data by chaining the .materialize() operator onto the dataset definition, like so:


dataset("events")
  .query {
     input("base_table")
     .select(...)
     .filter(...)
  }
     .materialize()

This means that the “events” materialized view is continuously updated as data flows through the Delta Pipeline. As new data arrives and is committed to the base table via ACID transactions, the materialized view is incrementally updated for each of those transactions as well. Users need not worry about when or how frequently to refresh the materialized view, or even about running OPTIMIZE or VACUUM commands; it is all handled automatically in the background.
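
Downstream consumers can then read the materialized view like any other table. As a minimal sketch, assuming the “events” materialized view is published as a table of the same name, a separate Spark job could query it with standard Spark APIs and always see the incrementally maintained result:

import org.apache.spark.sql.SparkSession

// Hypothetical consumer job; the application name is an assumption.
val spark = SparkSession.builder()
  .appName("events-consumer")
  .getOrCreate()

// Read the materialized view as an ordinary table. No manual refresh,
// OPTIMIZE, or VACUUM is needed here, because Delta Pipelines maintains
// the view in the background.
val events = spark.table("events")
events.show()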
