What is a data vault?
A data vault is a data modeling design pattern used to build a data warehouse for enterprise-scale analytics. The data vault has three types of entities: hubs, links, and satellites.
Hubs represent core business concepts, links represent relationships between hubs, and satellites store information about hubs and relationships between them.
The data vault is a data model that is well-suited to organizations that are adopting the lakehouse paradigm.
Data vault modeling: Hubs, links, and satellites
- Hubs - Each hub represents a core business concept, such as they represent Customer Id/Product Number/Vehicle identification number (VIN). Users will use a business key to get information about a Hub. The business key may have a combination of business concept ID and sequence ID, load date, and other metadata information.
- Links - Links represent the relationship between Hub entities.
- Satellites - Satellites fill the gap in answering the missing descriptive information on core business concepts. Satellites store information that belongs to Hub and relationships between them.
- A satellite cannot have a direct connection to another satellite.
- A hub or link may have one or more satellites.
Data vault benefits
- Structured, with flexibility for refactoring
- Extremely scalable, up to PBs volumes
- Uses patterns that support ETL code generation
- Familiar architecture: data layers, ETL, star schemas
Data vaults are based on agile methodologies and techniques, which means that they can adapt to fast-paced changing business requirements. One of the major advantages of using the Data Vault methodology is that ETL jobs need less refactoring when the model changes.
Modeling techniques by lakehouse layer
With these concepts in mind, let's explore how Data Vault fits into our Bronze, Silver and Gold data layers where data goes from a raw to a refined state that is ready for analytics. In this multi-hop architecture, raw data gets stored in a Bronze layer with minimum transformation and data structure as close to the source system. The Data Vault methodology can be applied to the Silver layer where data is transformed into Hubs, links and satellites.
In the Gold layer, multiple data marts/data warehouses can be built as per dimensional modeling/Kimball methodology. As discussed earlier, the Gold layer is for reporting and uses more de-normalized and read-optimized data models with fewer joins. Sometimes tables in the Gold Layer can be completely denormalized, typically if the Data Scientists want it that way to feed their algorithms for feature engineering.
If a Data Vault model is used in the Silver layer it simplifies and significantly reduces the changes needed to perform ETL into the Data Marts and Data Warehouses, as Hubs make key management (surrogate key / natural keys) easier. Satellites make loading dimensions easier because they have all the attributes, and links make loading fact tables quite straightforward because they have all the relationships.
- Plateforme Data Lakehouse par Databricks
- Databricks SQL product page
- Databricks Blog: Different Data Warehousing Modeling Techniques and How to Implement them on the Databricks Lakehouse Platform
- Databricks Blog: What’s a Data Vault and How to Implement It on the Databricks Lakehouse Platform
Back to glossary