
Implementing a Dimensional Data Warehouse with Databricks SQL: Part 2

Building the Dimension ETL Workflows


Summary

  • Dimension ETL Pipelines: Covers extracting, cleansing, and maintaining dimension tables.
  • Type-1 SCD: In-place updates without tracking historical changes.
  • Type-2 SCD: Tracks history by versioning records with surrogate keys and metadata.

As organizations consolidate analytics workloads to Databricks, they often need to adapt traditional data warehouse techniques. This series explores how to implement dimensional modeling—specifically, star schemas—on Databricks. The first blog focused on schema design. This blog walks through ETL pipelines for dimension tables, including Slowly Changing Dimensions (SCD) Type-1 and Type-2 patterns. The final blog will show how to build ETL pipelines for fact tables.

Slowly Changing Dimensions (SCD)

In the last blog, we defined our star schema, including a fact table and its related dimensions.  We highlighted one dimension table in particular, DimCustomer, as shown here (with some attributes removed to conserve space):
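
A minimal DDL sketch of this table might look as follows; the dwh.gold catalog and schema, the abbreviated attribute list, and the identity-based surrogate key are illustrative assumptions rather than the exact definition:

```sql
-- Illustrative sketch of the DimCustomer definition (attributes abbreviated);
-- dwh.gold is a placeholder for your own catalog and schema
CREATE TABLE IF NOT EXISTS dwh.gold.DimCustomer (
  CustomerKey BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key
  CustomerAlternateKey STRING NOT NULL,             -- natural (business) key
  MaritalStatus STRING,
  YearlyIncome DECIMAL(19, 4),
  NumberChildrenAtHome INT,
  -- ... additional customer attributes ...
  StartDate TIMESTAMP,      -- start of the period this version is valid
  EndDate TIMESTAMP,        -- end of the period this version is valid (NULL = current)
  IsLateArriving BOOLEAN    -- flags records first seen during fact ETL
)
CLUSTER BY (CustomerKey);   -- liquid clustering on the surrogate key
```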

The last three fields in this table, i.e., StartDate, EndDate and IsLateArriving, represent metadata that assists us with versioning records.  As a given customer’s income, marital status, home ownership, number of children at home, or other characteristics change, we will want to create new records for that customer so that facts such as our online sales transactions in FactInternetSales are associated with the right representation of that customer.  The natural (aka business) key, CustomerAlternateKey, will be the same across these records, but the versioning metadata will differ, telling us the period for which each version of the customer was valid. The surrogate key, CustomerKey, will also differ between versions, allowing our facts to link to the right one.

NOTE: Because the surrogate key is commonly used to link facts and dimensions, dimension tables are often clustered based on this key. Unlike traditional relational databases that utilize b-tree indexes on sorted records, Databricks implements a unique clustering method known as liquid clustering. While the specifics of liquid clustering are outside the scope of this blog, we consistently use the CLUSTER BY clause on the surrogate key of our dimension tables during their definition to leverage this feature effectively.

This pattern of versioning dimension records as attributes change is known as the Type-2 Slowly Changing Dimension (or simply Type-2 SCD) pattern. The Type-2 SCD pattern is preferred for recording dimension data in the classic dimensional methodology. However, there are other ways to deal with changes in dimension records.

One of the most common ways to deal with changing dimension values is to update existing records in place.  Only one version of the record is ever created, so the business key remains the unique identifier for the record.  For various reasons, not the least of which are performance and consistency, we still implement a surrogate key and link our fact records to these dimensions on those keys. However, the StartDate and EndDate metadata fields that describe the time intervals over which a given dimension record is considered active are not needed. This is known as the Type-1 SCD pattern.  The Promotion dimension in our star schema provides a good example of a Type-1 dimension table implementation:
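
A comparable sketch of DimPromotion, again with placeholder names and an assumed identity-based surrogate key, shows that the versioning metadata columns are simply absent:

```sql
-- Illustrative sketch of a Type-1 DimPromotion definition; no StartDate/EndDate
-- columns are needed because records are updated in place
CREATE TABLE IF NOT EXISTS dwh.gold.DimPromotion (
  PromotionKey BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key
  PromotionAlternateKey INT NOT NULL,                -- natural (business) key
  EnglishPromotionName STRING,
  DiscountPct DECIMAL(5, 2),
  PromotionType STRING
  -- ... additional promotion attributes ...
)
CLUSTER BY (PromotionKey);
```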

But what about the IsLateArriving metadata field seen in the Type-2 Customer dimension but missing from the Type-1 Promotion dimension? This field is used to flag records as late arriving.  A late arriving record is one for which the business key shows up during a fact ETL cycle, but there is no record for that key located during prior dimension processing.  In the case of the Type-2 SCDs, this field is used to denote that when the data for a late arriving record is first observed in a dimension ETL cycle, the record should be updated in place (just like in a Type-1 SCD pattern) and then versioned from that point forward.  In the case of the Type-1 SCDs, this field isn’t necessary because the record will be updated in place regardless.

NOTE: The Kimball Group recognizes additional SCD patterns, most of which are variations and combinations of the Type-1 and Type-2 patterns. Because the Type-1 and Type-2 SCDs are the most frequently implemented of these patterns and the techniques used with the others are closely related to what’s employed with these, we are limiting this blog to just these two dimension types. For more information about the eight types of SCDs recognized by the Kimball Group, please see the Slowly Changing Dimension Techniques section of this document.

Implementing the Type-1 SCD Pattern

With data being updated in place, the Type-1 SCD workflow is the more straightforward of the two dimension ETL patterns. To support these types of dimensions, we simply:

  1. Extract the required data from our operational system(s)
  2. Perform any required data cleansing operations
  3. Compare our incoming records to those already in the dimension table
  4. Update any existing records where incoming attributes differ from what’s already recorded
  5. Insert any incoming records that do not have a corresponding record in the dimension table

To illustrate a Type-1 SCD implementation, we’ll define the ETL for the ongoing population of the DimPromotion table.

Step 1: Extract data from an operational system

Our first step is to extract the data from our operational system.  As our data warehouse is patterned after the AdventureWorksDW sample database provided by Microsoft, we are using the closely associated AdventureWorks (OLTP) sample database as our source. This database has been deployed to an Azure SQL Database instance and made accessible within our Databricks environment via a federated query.  Extraction is then facilitated with a simple query (with some fields redacted to conserve space), with the query results persisted in a table in our staging schema (that is made accessible only to the data engineers in our environment through permission settings not shown here). This is but one of many ways we can access source system data in this environment:
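
The statement below is an illustrative sketch rather than the original query; it assumes the AdventureWorks OLTP database is registered as a foreign catalog named adventureworks and that the staging schema is dwh.staging:

```sql
-- Extract promotions from the federated AdventureWorks source into staging;
-- catalog and schema names are placeholders
CREATE OR REPLACE TABLE dwh.staging.promotion_extract AS
SELECT
  SpecialOfferID AS PromotionAlternateKey,  -- natural (business) key in the source
  Description    AS EnglishPromotionName,
  DiscountPct,
  Type           AS PromotionType
  -- ... additional fields redacted to conserve space ...
FROM adventureworks.sales.specialoffer;
```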

Step 2: Compare incoming records to those in the table

Assuming we have no additional data cleansing steps to perform (which we could implement with an UPDATE or another CREATE TABLE AS statement),  we can then tackle our dimension data update/insert operations in a single step using a MERGE statement, matching our staged data and dimension data on the business key:
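
A sketch of that MERGE, using the placeholder table and column names from the earlier examples, might look like this:

```sql
-- Type-1 upsert: match staged and published promotion data on the business key
MERGE INTO dwh.gold.DimPromotion AS tgt
USING dwh.staging.promotion_extract AS src
  ON tgt.PromotionAlternateKey = src.PromotionAlternateKey
WHEN MATCHED THEN
  UPDATE SET
    tgt.EnglishPromotionName = src.EnglishPromotionName,
    tgt.DiscountPct = src.DiscountPct,
    tgt.PromotionType = src.PromotionType
WHEN NOT MATCHED THEN
  INSERT (PromotionAlternateKey, EnglishPromotionName, DiscountPct, PromotionType)
  VALUES (src.PromotionAlternateKey, src.EnglishPromotionName, src.DiscountPct, src.PromotionType);
```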

One important thing to note about the statement as written here is that we update any existing records whenever a match is found between the staged and published dimension table data. We could add criteria to the WHEN MATCHED clause to limit updates to those instances where a record in staging holds different information from what is found in the dimension table, but given the relatively small number of records in this particular table, we’ve elected to employ the leaner logic shown here.  (We will use the additional WHEN MATCHED logic with DimCustomer, which contains far more data.)

Implementing the Type-2 SCD Pattern

The Type-2 SCD pattern is a bit more complex. To support these types of dimensions, we must:

  1. Extract the required data from our operational system(s)
  2. Perform any required data cleansing operations
  3. Update any late-arriving member records in the target table
  4. Expire any existing records in the target table for which new versions are found in staging
  5. Insert any new (or new versions) of records into the target table

Step 1: Extract and cleanse data from a source system

As in the Type-1 SCD pattern, our first steps are to extract and cleanse data from the source system.  Using the same approach as above, we issue a federated query and persist the extracted data to a table in our staging schema:
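
As with the promotions example, the following is an illustrative sketch assuming the adventureworks foreign catalog and dwh.staging schema; the join and column list stand in for the full extraction logic:

```sql
-- Extract customer data from the federated AdventureWorks source into staging
CREATE OR REPLACE TABLE dwh.staging.customer_extract AS
SELECT
  c.AccountNumber AS CustomerAlternateKey,  -- natural (business) key
  p.FirstName,
  p.LastName
  -- ... demographic attributes redacted to conserve space ...
FROM adventureworks.sales.customer AS c
  INNER JOIN adventureworks.person.person AS p
    ON c.PersonID = p.BusinessEntityID;
```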

Step 2: Compare to a dimension table

With this data landed, we can now compare it to our dimension table in order to make any required data modifications.  The first of these is to update in place any records flagged as late arriving from prior fact table ETL processes.  Please note that these updates are limited to those records flagged as late arriving and the IsLateArriving flag is being reset with the update so that these records behave as normal Type-2 SCDs moving forward:
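
A sketch of this update, with the column list abbreviated and the placeholder names used above, might look like this:

```sql
-- Update late-arriving customer records in place and reset the flag so they
-- behave as normal Type-2 records going forward
MERGE INTO dwh.gold.DimCustomer AS tgt
USING dwh.staging.customer_extract AS src
  ON tgt.CustomerAlternateKey = src.CustomerAlternateKey
 AND tgt.IsLateArriving = true          -- only touch late-arriving records
WHEN MATCHED THEN
  UPDATE SET
    tgt.FirstName = src.FirstName,
    tgt.LastName = src.LastName,
    -- ... remaining tracked attributes ...
    tgt.IsLateArriving = false;         -- reset the flag
```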

Step 3: Expire versioned records

The next set of data modifications is to expire any records that need to be versioned.  It’s important that the EndDate value we set for these matches the StartDate of the new record versions we will implement in the next step.  For that reason, we will set a timestamp variable to be used between these two steps:
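
A sketch of the variable declaration and the expiration MERGE follows; the variable syntax assumes a recent Databricks SQL release that supports session variables, and the table, column, and variable names are illustrative:

```sql
-- Capture a single timestamp shared by the expire and insert steps
DECLARE OR REPLACE VARIABLE scd_change_ts TIMESTAMP DEFAULT current_timestamp();

-- Expire the current version of any customer whose tracked attributes have changed
MERGE INTO dwh.gold.DimCustomer AS tgt
USING dwh.staging.customer_extract AS src
  ON tgt.CustomerAlternateKey = src.CustomerAlternateKey
 AND tgt.EndDate IS NULL                      -- match only the current version
WHEN MATCHED
 AND NOT (
       equal_null(tgt.FirstName, src.FirstName)
   AND equal_null(tgt.LastName, src.LastName)
   -- ... one equal_null() comparison per tracked attribute ...
 ) THEN
  UPDATE SET tgt.EndDate = scd_change_ts;     -- end-date the outgoing version
```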

NOTE: Depending on the data available to you, you may elect to employ an EndDate value originating from the source system, at which point you would not necessarily declare a variable as shown here.

Please note the additional criteria used in the WHEN MATCHED clause.  Because we are only performing one operation with this statement, this logic could have been moved to the ON clause, but for clarity and maintainability we kept it separate from the core matching logic, where we match to the current version of the dimension record.

As part of this logic, we are making heavy use of the equal_null() function.  This function returns TRUE when the first and second values are the same or both NULL; otherwise, it returns FALSE.  This provides an efficient way to look for changes on a column-by-column basis.  For more details on how Databricks supports NULL semantics, please refer to this document.
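
A quick illustration of that behavior:

```sql
SELECT
  equal_null('M', 'M')   AS same_values,  -- true
  equal_null(NULL, NULL) AS both_null,    -- true
  equal_null('M', NULL)  AS one_null;     -- false
```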

At this stage, any prior versions of records in the dimension table that have expired have been end-dated.  

Step 4: Insert new records

We can now insert new records, both truly new and newly versioned:
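
A sketch of this step, reusing the scd_change_ts variable and placeholder names from the earlier examples, might look like this:

```sql
-- Insert staged records that have no current (unexpired) match in the dimension,
-- using the same timestamp that expired their predecessors
MERGE INTO dwh.gold.DimCustomer AS tgt
USING dwh.staging.customer_extract AS src
  ON tgt.CustomerAlternateKey = src.CustomerAlternateKey
 AND tgt.EndDate IS NULL                 -- only unexpired records count as a match
WHEN NOT MATCHED THEN
  INSERT (CustomerAlternateKey, FirstName, LastName, /* ... */ StartDate, EndDate, IsLateArriving)
  VALUES (src.CustomerAlternateKey, src.FirstName, src.LastName, /* ... */ scd_change_ts, NULL, false);
```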

As before, this could have been implemented using an INSERT statement instead; the result would be the same.  With this statement, we have identified any records in the staging table that don’t have an unexpired corresponding record in the dimension table. These records are simply inserted with a StartDate value consistent with any expired records that may exist in this table.

Next steps: implementing the fact table ETL

With the dimensions implemented and populated with data, we can now focus on the fact tables. In the next blog, we will demonstrate how the ETL for these tables can be implemented.

To learn more about Databricks SQL, visit our website or read the documentation. You can also check out the product tour for Databricks SQL. If you want to migrate your existing warehouse to a high-performance, serverless data warehouse with a great user experience and lower total cost, Databricks SQL is the solution. Try it for free.
