
Data Vault Best practice & Implementation on the Lakehouse

Leo Mao
Sumit Prakash
Soham Bhatt

February 24, 2023 in Platform Blog

In the previous article, Prescriptive Guidance for Implementing a Data Vault Model on the Databricks Lakehouse Platform, we explained the core concepts of Data Vault and provided guidance on using it on Databricks. Many customers in the field have been asking for examples and an easy way to implement Data Vault on the Lakehouse.

In this article, we dive deeper into how to implement a Data Vault on the Databricks Lakehouse Platform and provide a live example of loading an EDW Data Vault model in real time using Delta Live Tables.

Here are the high-level topics we will cover in this blog:

  1. Why Data Vault
  2. Data Vault in Lakehouse
  3. Implementing a Data Vault Model in Databricks Lakehouse
  4. Conclusion

1. Why Data Vault

The main goal of Data Vault is to build a scalable modern data warehouse. At its core, it uses hubs, satellites, and links to model the business world, yielding a data model and architecture that are stable (hubs) yet flexible (satellites) and resilient to environmental change. Hubs contain business keys that are unlikely to change unless the core business changes, and the associations between hubs form the skeleton of the Data Vault model, while satellites contain the contextual attributes of a hub and can be created and extended very easily.

Please refer to the high-level design of the Data Vault model below, which delivers three key benefits by design:

  1. Enables efficient parallel loading of the enterprise data warehouse, as there is little dependency between the tables of the model; as shown below, the hubs and satellites for customer, product, and order can all be loaded in parallel.
  2. Preserves a single version of the facts in the raw vault, as the pattern recommends insert-only loads and keeps the source metadata in the table.
  3. New hubs or satellites can easily be added to the model incrementally, enabling faster time to market for data asset delivery.
Data Vault Model core components: Hubs, Links, and Satellites
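As a minimal illustration of the three component types, here is a sketch of a hub, a satellite, and a link for the customer/order example, run as plain SQLite DDL from Python. The column names are simplified assumptions, not the exact pipeline schema used later in this article.

```python
import sqlite3

# Simplified DDL for the three Data Vault component types
ddl = """
CREATE TABLE hub_customer (          -- Hub: stable business keys only
    sha1_hub_custkey TEXT PRIMARY KEY,
    c_custkey        INTEGER NOT NULL,
    load_ts          TEXT,
    source           TEXT
);
CREATE TABLE sat_customer (          -- Satellite: descriptive, insert-only context
    sha1_hub_custkey TEXT NOT NULL,
    c_name           TEXT,
    c_address        TEXT,
    hash_diff        TEXT NOT NULL,
    load_ts          TEXT
);
CREATE TABLE lnk_customer_orders (   -- Link: association between two hubs
    sha1_lnk_customer_order_key TEXT PRIMARY KEY,
    sha1_hub_custkey            TEXT NOT NULL,
    sha1_hub_orderkey           TEXT NOT NULL,
    load_ts                     TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)  # ['hub_customer', 'lnk_customer_orders', 'sat_customer']
```

Notice that the satellite carries all the descriptive attributes and a hash_diff for change detection, while the hub and link carry only keys and load metadata; this separation is what lets the model absorb new attributes without touching the skeleton.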

2. Data Vault in Lakehouse

The Databricks Lakehouse Platform supports the Data Vault model very well; see below for a high-level architecture of a Data Vault model on the Lakehouse. The robust and scalable Delta Lake storage format enables customers to build a raw vault, where unmodified data is stored, and a business vault, where business rules and transformations are applied if required. Both align to the design above and hence inherit the benefits of the Data Vault model.

Data Vault Model on the Lakehouse

3. Implementing a Data Vault Model in Databricks Lakehouse

Based on the design in the previous section, loading the hub, satellite, and link tables is straightforward. All ETL loads can happen in parallel as they don't depend on each other; for example, the customer and product hub tables can be loaded simultaneously as they each have their own business keys. The customer_product_link table, customer satellite, and product satellite can be loaded in parallel as well, since they have all the required attributes from the source.

Overall Data Flow

Please refer to the high-level data flow demonstrated in the Delta Live Tables pipeline below. For our example we use the TPCH data set, which is commonly used for decision-support benchmarks. The data is loaded into the bronze layer first and stored in Delta format; it is then used to populate the Raw Vault objects (e.g. hubs and satellites of customer, orders, etc.). Business Vault objects are built on the objects from the Raw Vault, and data mart objects (e.g. dim_customer, dim_orders, fact_customer_order) are built for reporting and analytics consumption.

Overall data flow of the Delta Live Tables pipeline

Raw Vault

The Raw Vault is where we store the hub, satellite, and link tables, which contain the raw data and maintain a single version of the truth. As shown below, we create a view raw_customer_vw based on raw_customer and use the hash function sha1(UPPER(TRIM(c_custkey))) to create hash columns for existence checks or comparisons where required.

-- create raw customer view and add hash columns for checking existence or comparison 
CREATE STREAMING LIVE VIEW raw_customer_vw
COMMENT "RAW Customer Data View"
AS  SELECT
        sha1(UPPER(TRIM(c_custkey))) as sha1_hub_custkey,
        sha1(concat(UPPER(TRIM(c_name)),UPPER(TRIM(c_address)),UPPER(TRIM(c_phone)),UPPER(TRIM(c_mktsegment)))) as hash_diff,
        current_timestamp as load_ts,
        "Customer Source" as source,
        c_custkey,
        c_name,
        c_address,
        c_nationkey,
        c_phone,
        c_acctbal,
        c_mktsegment,
        c_comment
    FROM STREAM(LIVE.raw_customer)
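Outside the pipeline, the hash-column logic can be sketched in plain Python using the standard hashlib module (the sample values are hypothetical; Spark's sha1 returns the same lowercase hex digest as hashlib):

```python
import hashlib

def sha1_key(value) -> str:
    """Mimic sha1(UPPER(TRIM(value))): normalize the business key, then hash."""
    return hashlib.sha1(str(value).strip().upper().encode("utf-8")).hexdigest()

def hash_diff(*columns) -> str:
    """Mimic sha1(concat(UPPER(TRIM(c1)), UPPER(TRIM(c2)), ...)) for change detection."""
    joined = "".join(str(c).strip().upper() for c in columns)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

# Hypothetical customer row
sha1_hub_custkey = sha1_key(" 12345 ")   # whitespace- and case-insensitive key hash
diff = hash_diff("Customer#000012345", "123 Main St", "25-989-741-2988", "BUILDING")

# The same key, differently formatted, hashes identically
assert sha1_hub_custkey == sha1_key("12345")
# Any attribute change produces a different hash_diff
assert diff != hash_diff("Customer#000012345", "456 Oak Ave", "25-989-741-2988", "BUILDING")
```

One design caveat worth knowing: concatenation without a delimiter can, in principle, collide across column boundaries ("ab" + "c" equals "a" + "bc"), so many Data Vault implementations insert a delimiter between columns before hashing.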

Once the raw customer view is created, we use it to create the hub and satellite customer tables with the code examples below. In Delta Live Tables, you can also easily set up a data quality expectation (e.g. CONSTRAINT valid_sha1_hub_custkey EXPECT (sha1_hub_custkey IS NOT NULL) ON VIOLATION DROP ROW) and use it to define how the pipeline handles the data quality issues the expectation describes. Here we drop any row that does not have a valid business key.

-- create hub customer table from the raw customer view
CREATE OR REFRESH STREAMING LIVE TABLE hub_customer(
  sha1_hub_custkey        STRING     NOT NULL,
  c_custkey               BIGINT     NOT NULL,
  load_ts                 TIMESTAMP,
  source                  STRING,
  CONSTRAINT valid_sha1_hub_custkey EXPECT (sha1_hub_custkey IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_custkey EXPECT (c_custkey IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "HUB CUSTOMER TABLE"
AS SELECT
      sha1_hub_custkey,
      c_custkey,
      load_ts,
      source
   FROM
      STREAM(live.raw_customer_vw)
-- create satellite customer table from raw customer view 
CREATE OR REFRESH STREAMING LIVE TABLE sat_customer(
  sha1_hub_custkey        STRING    NOT NULL,
  c_name                  STRING,
  c_address               STRING,
  c_nationkey             BIGINT,
  c_phone                 STRING,
  c_acctbal               DECIMAL(18,2),
  c_mktsegment            STRING,
  hash_diff               STRING    NOT NULL,
  load_ts                 TIMESTAMP,
  source                  STRING    NOT NULL,
  CONSTRAINT valid_sha1_hub_custkey EXPECT (sha1_hub_custkey IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "SAT CUSTOMER TABLE"
AS SELECT
      sha1_hub_custkey,
      c_name,
      c_address,
      c_nationkey,
      c_phone,
      c_acctbal,
      c_mktsegment,
      hash_diff,
      load_ts,
      source
   FROM
      STREAM(live.raw_customer_vw)
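The ON VIOLATION DROP ROW behavior of those expectations can be illustrated with a plain-Python filter (a sketch of the semantics only, not the DLT implementation; the rows are hypothetical):

```python
# Hypothetical rows arriving from raw_customer_vw
rows = [
    {"sha1_hub_custkey": "8cb2237d0679ca88db6464eac60da96345513964", "c_custkey": 1},
    {"sha1_hub_custkey": None, "c_custkey": 2},  # violates valid_sha1_hub_custkey
    {"sha1_hub_custkey": "f7c3bc1d808e04732adf679965ccc34ca7ae3441",
     "c_custkey": None},                         # violates valid_custkey
]

# Each expectation is a predicate; DROP ROW keeps only rows that pass all of them
expectations = {
    "valid_sha1_hub_custkey": lambda r: r["sha1_hub_custkey"] is not None,
    "valid_custkey": lambda r: r["c_custkey"] is not None,
}
kept = [r for r in rows if all(check(r) for check in expectations.values())]
print([r["c_custkey"] for r in kept])  # [1]
```

In the real pipeline, DLT also records how many rows each expectation dropped, which makes data quality observable in the pipeline UI.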

Hubs and satellites of other objects are loaded in a similar way. For link tables, here is an example that populates lnk_customer_orders based on raw_orders_vw.

-- create customer orders table from the raw orders view
CREATE OR REFRESH STREAMING LIVE TABLE lnk_customer_orders
(
  sha1_lnk_customer_order_key     STRING     NOT NULL ,  
  sha1_hub_orderkey               STRING ,
  sha1_hub_custkey                STRING ,
  load_ts                         TIMESTAMP  NOT NULL,
  source                          STRING     NOT NULL 
)
COMMENT "LNK CUSTOMER ORDERS TABLE"
AS SELECT
      sha1_lnk_customer_order_key,
      sha1_hub_orderkey,
      sha1_hub_custkey,
      load_ts,
      source
   FROM
       STREAM(live.raw_orders_vw)
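A link table's own hash key is typically derived from the combined business keys of the hubs it connects. Here is a plain-Python sketch of that pattern (the delimiter and the input values are assumptions, since the definition of raw_orders_vw is not shown above):

```python
import hashlib

def sha1_key(value) -> str:
    """Hash a single business key, as for a hub."""
    return hashlib.sha1(str(value).strip().upper().encode("utf-8")).hexdigest()

def link_key(*business_keys, delim: str = "||") -> str:
    """Hash the delimited concatenation of the parent business keys."""
    joined = delim.join(str(k).strip().upper() for k in business_keys)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

# Hypothetical order row for customer 12345 and order 67890
sha1_hub_custkey = sha1_key(12345)
sha1_hub_orderkey = sha1_key(67890)
sha1_lnk_customer_order_key = link_key(12345, 67890)
```

The delimiter keeps distinct key pairs from colliding, e.g. (1, 23) and (12, 3) hash differently because "1||23" and "12||3" are different strings.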

Business Vault

Once the hubs, satellites, and links are populated in the Raw Vault, Business Vault objects can be built on top of them. This applies additional business or transformation rules to the data objects and prepares them for easier consumption at a later stage. Here is an example of building sat_orders_bv, in which order_priority_tier is added as enrichment information for the orders object in the Business Vault.

-- create satellite order table in business vault from the satellite orders table in raw vault 
CREATE OR REFRESH LIVE TABLE sat_orders_bv
(
  sha1_hub_orderkey         STRING     NOT NULL ,  
  o_orderstatus             STRING ,
  o_totalprice              decimal(18,2) ,
  o_orderdate               DATE,
  o_orderpriority           STRING,
  o_clerk                   STRING,
  o_shippriority            INT,
  order_priority_tier       STRING,
  source                    STRING    NOT NULL
)
COMMENT "SAT ORDER BUSINESS VAULT TABLE"
AS SELECT
      sha1_hub_orderkey,
      o_orderstatus,
      o_totalprice,
      o_orderdate,
      o_orderpriority,
      o_clerk,
      o_shippriority,
      CASE WHEN o_orderpriority IN ('2-HIGH', '1-URGENT') AND o_totalprice >= 225000 THEN 'Tier-1'
           WHEN o_orderpriority IN ('3-MEDIUM', '2-HIGH', '1-URGENT') AND o_totalprice BETWEEN 120000 AND 225000 THEN 'Tier-2'
           ELSE 'Tier-3'
      END AS order_priority_tier,
      source
   FROM
      live.sat_orders
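The tiering rule in the CASE expression can be mirrored as a small Python function, which is handy for unit-testing the business rule outside the pipeline (a sketch using the same thresholds as above):

```python
def order_priority_tier(o_orderpriority: str, o_totalprice: float) -> str:
    """Mirror the CASE expression in sat_orders_bv (SQL BETWEEN is inclusive)."""
    if o_orderpriority in ("2-HIGH", "1-URGENT") and o_totalprice >= 225000:
        return "Tier-1"
    if (o_orderpriority in ("3-MEDIUM", "2-HIGH", "1-URGENT")
            and 120000 <= o_totalprice <= 225000):
        return "Tier-2"
    return "Tier-3"

print(order_priority_tier("1-URGENT", 250000))   # Tier-1
print(order_priority_tier("3-MEDIUM", 150000))   # Tier-2
print(order_priority_tier("5-LOW", 999999))      # Tier-3
```

As in the SQL, branch order matters: a "2-HIGH" order at exactly 225000 matches the first branch and lands in Tier-1, because the Tier-2 branch is only evaluated afterwards.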

Data Mart

Finally, we see customers loading Data Vault point-in-time views and data marts for easy consumption in the last layer. Here the main focus is ease of use and good read performance. For most simple tables, creating views on top of the hubs or satellites will suffice, or you can even load a proper star-schema-like dimensional model in the final layer. Here is an example that creates a customer dimension as a view, dim_customer, which others can use to simplify their queries.

-- create customer dimension as view in data mart from the hub and satellite customer table, ref nation and ref region table
CREATE LIVE VIEW dim_customer
AS
SELECT
      sat.sha1_hub_custkey      AS dim_customer_key,
      sat.source                AS source,
      sat.c_name                AS c_name,
      sat.c_address             AS c_address,
      sat.c_phone               AS c_phone,
      sat.c_acctbal             AS c_acctbal,
      sat.c_mktsegment          AS c_mktsegment,
      sat.c_nationkey           AS c_nationkey,
      sat.load_ts               AS c_effective_ts,
      -- derived
      nation.n_name             AS nation_name,
      region.r_name             AS region_name
   FROM LIVE.hub_customer hub
   INNER JOIN LIVE.sat_customer sat
      ON hub.sha1_hub_custkey = sat.sha1_hub_custkey
   LEFT OUTER JOIN LIVE.ref_nation nation
      ON sat.c_nationkey = nation.n_nationkey
   LEFT OUTER JOIN LIVE.ref_region region
      ON nation.n_regionkey = region.r_regionkey

One common issue with Data Vault is that queries can end up with too many joins, especially complex queries or facts that require attributes from many tables. The recommendation from Databricks is to pre-join the tables and store calculated metrics where required, so they don't have to be rebuilt on the fly again and again. Here is an example of creating a fact table, fact_customer_order, based on multiple joins and storing it as a table for repeatable queries from business users.

-- create fact customer order table in data mart from the lnk_customer_orders, dim_order, dim_customer, ref_nation and ref_region
CREATE OR REFRESH LIVE TABLE fact_customer_order
       AS
       SELECT 
           dim_customer.dim_customer_key,
           dim_orders.dim_order_key,
           nation.n_nationkey     AS dim_nation_key,
           region.r_regionkey     AS dim_region_key,
           dim_orders.o_totalprice AS total_price,
           dim_orders.o_orderdate  AS order_date
       FROM LIVE.lnk_customer_orders lnk
       INNER JOIN LIVE.dim_orders dim_orders
           ON lnk.sha1_hub_orderkey = dim_orders.dim_order_key
       INNER JOIN LIVE.dim_customer dim_customer
           ON lnk.sha1_hub_custkey = dim_customer.dim_customer_key
       LEFT OUTER JOIN LIVE.ref_nation nation
           ON dim_customer.c_nationkey = nation.n_nationkey
       LEFT OUTER JOIN LIVE.ref_region region
           ON nation.n_regionkey = region.r_regionkey

Delta Live Tables Pipeline Setup

All of the code above can be found here. Customers can easily orchestrate the whole data flow with a Delta Live Tables pipeline; the configuration below is how I set up the pipeline in my environment. See DLT Configuration for more details on how to set up a Delta Live Tables pipeline in your workflow if required.

{
    "id": "6835c6ad-42a2-498d-9037-25c9d990b380",
    "clusters": [
        {
            "label": "default",
            "autoscale": {
                "min_workers": 1,
                "max_workers": 5,
                "mode": "ENHANCED"
            }
        }
    ],
    "development": true,
    "continuous": false,
    "channel": "CURRENT",
    "edition": "ADVANCED",
    "photon": false,
    "libraries": [
        {
            "notebook": {
                "path": "/Repos/prod/databricks-lakehouse/lakehouse-buildout/data-vault/TPC-DLT-Data-Vault-2.0"
            }
        }
    ],
    "name": "DLT Data Vault",
    "storage": "dbfs:/pipelines/6835c6ad-42a2-498d-9037-25c9d990b380",
    "configuration": {
        "pipelines.enzyme.mode": "advanced",
        "pipelines.enzyme.enabled": "true"
    },
    "target": "leo_lakehouse"
}

4. Conclusion

In this blog, we learned about core Data Vault modeling concepts and how to implement them using Delta Live Tables. The Databricks Lakehouse Platform supports various modeling methods in a reliable, efficient, and scalable way, while Databricks SQL - our serverless data warehouse - lets you run all your BI and SQL applications on the Lakehouse. To see all of the above examples in a complete workflow, please look at this example.

Please also check out our related blogs:

Get started on building your Dimensional Models in the Lakehouse

