Data Warehouse Architecture
Challenges and trade-offs in choosing your data warehouse architecture

What is data warehouse architecture?
A data warehouse is a data management system that stores current and historical data from multiple sources in a business-friendly manner for insights and reporting.
A data warehouse differs from a database. A database is a general collection of structured data, which can extend beyond text and numbers to images, videos and more, while a data warehouse is a structured repository built specifically to provide data for business intelligence and analytics.
Data warehouse architecture refers to the framework that governs how a data warehouse is organized, structured and implemented, including components and processes.
According to Bill Inmon, who wrote Building the Data Warehouse, the groundbreaking book that launched the data warehouse industry, data warehouse architecture refers to a “subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management’s decision-making process.”
Breaking this down:
- Subject-oriented — Data is organized and structured around specific business topics or themes, such as sales, marketing or distribution data
- Integrated — The data warehouse integrates data from multiple sources in a unified way
- Time-variant — Data represents a historical snapshot, capturing changes over time, allowing for trend analysis and reporting on how data has evolved
- Nonvolatile — Data in the warehouse is read-only; it will not be changed or overwritten, keeping historical data intact and reliable for analysis (the sketch after this list illustrates these last two properties)
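To make the last two properties concrete, here is a minimal sketch, in Python with hypothetical field names, of a warehouse treating data as append-only, dated snapshots:

```python
# A minimal sketch of "time-variant" and "nonvolatile" behavior: each load
# appends a dated snapshot, and existing rows are never updated or deleted,
# so history stays intact. Field names are illustrative.
from datetime import date

warehouse: list[dict] = []

def load_snapshot(snapshot_date: date, customer: str, status: str) -> None:
    # Append-only: rows already in the warehouse are never mutated.
    warehouse.append(
        {"snapshot_date": snapshot_date, "customer": customer, "status": status}
    )

load_snapshot(date(2024, 1, 1), "ACME", "prospect")
load_snapshot(date(2024, 6, 1), "ACME", "active")  # change captured as a new row

# Time-variant: ask what the data looked like at any point in time.
as_of = date(2024, 3, 1)
history = [r for r in warehouse if r["snapshot_date"] <= as_of]
print(history)  # ACME still shows as a prospect as of March 2024
```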
When might you use data warehouse architecture?
Effective data warehouse architecture enables organizations to quickly and easily analyze consolidated business data uploaded from operational systems such as point-of-sale systems, inventory management systems, or marketing or sales databases. Data in a well-designed data warehouse is consistent, stored efficiently and easily accessible to enhance decision-making.
Data warehouses play a crucial role in business intelligence (BI), analytics, reporting, data applications and preparing data for machine learning (ML). Modern data warehouses have evolved to support unstructured data like images and text, and many now include embedded AI capabilities for advanced analytics and automation.
Use cases include:
- Customer segmentation
- Financial reporting
- Historical trend analysis
- Supply chain optimization
- Sales and marketing performance
Data warehouses can handle large volumes of data and are designed to store historical data and enable businesses to analyze trends over time. They can also facilitate analysis of information that's difficult to analyze directly from transactional databases. For example, an organization could use a data warehouse to analyze total revenues generated by each salesperson on a monthly basis for each product category, which may not be captured by a transactional database.
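As a minimal, self-contained sketch of that analysis, the example below uses Python's sqlite3 as a stand-in for a warehouse engine; the table and column names (sales_fact, salesperson_dim, product_dim) are hypothetical.

```python
# Hypothetical warehouse tables, populated with a few sample rows so the
# query below runs end to end.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salesperson_dim (salesperson_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product_dim     (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sales_fact      (salesperson_id INT, product_id INT,
                              sale_date TEXT, revenue REAL);
INSERT INTO salesperson_dim VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO product_dim     VALUES (10, 'Hardware'), (20, 'Software');
INSERT INTO sales_fact      VALUES (1, 10, '2024-01-05', 500.0),
                                   (1, 20, '2024-01-20', 300.0),
                                   (2, 10, '2024-02-11', 450.0);
""")

# Total revenue per salesperson, per month, per product category.
query = """
SELECT s.name,
       p.category,
       strftime('%Y-%m', f.sale_date) AS sale_month,
       SUM(f.revenue)                 AS total_revenue
FROM sales_fact f
JOIN salesperson_dim s ON f.salesperson_id = s.salesperson_id
JOIN product_dim p     ON f.product_id     = p.product_id
GROUP BY s.name, p.category, sale_month;
"""
for row in conn.execute(query):
    print(row)
```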
Types of data warehouse architecture
Data warehouse architecture varies in complexity, depending on structure and purpose.
Single tier
In this simple form, the data warehouse acts as a centralized repository for all data and as the platform for analysis and querying. Single-tier data warehouse architecture is suited for small organizations with a limited number of data sources, simple reporting needs and smaller budgets.
Two tier
This model separates source systems from the data warehouse, creating two layers. The data warehouse is the platform for both storage and querying. Two-tier architecture offers greater scalability and improved performance and allows for more complex transformations than single-tier architecture.
Three tier
In three-tier data warehousing architecture, the bottom tier comprises data sources and data storage, data access methods and data ingestion or extraction. The middle tier is an online analytical processing (OLAP) server. The top tier includes the front-end clients for querying, BI, dashboarding, reporting and analysis. This is the most complex type of data warehouse architecture and provides high performance and scalability, integrates with analytics tools and supports complex querying and analytics.
Data warehouse layers
Data warehouse architecture is built on a layered structure that facilitates the efficient flow, transformation and consumption of data for analytics and decision-making. Each layer plays a role in ensuring data meets business needs.
Source layer
The source layer is the foundation of data warehouse architecture and the entry point for data. It contains raw data from various data sources, such as point-of-sale, marketing automation, CRM or ERP systems, third-party sources and more.
Staging layer
The staging layer temporarily stores data as it’s consolidated, cleaned and transformed, preparing the data for efficient loading into the warehouse. The staging layer acts as a buffer between the source layer and the warehouse layer, ensuring that errors in source data are addressed before further processing.
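As a rough illustration, the sketch below shows the kind of consolidation and cleaning a staging layer performs before data moves on; the record fields and formats are hypothetical:

```python
# A minimal staging-layer sketch: reject bad records, standardize types
# and normalize dates before loading into the warehouse layer.
from datetime import datetime

def stage_record(raw: dict) -> dict | None:
    """Clean one raw source record; return None to reject it."""
    # Address errors in source data here, before further processing.
    if not raw.get("order_id") or raw.get("amount") in (None, ""):
        return None
    return {
        "order_id": str(raw["order_id"]).strip(),
        # Source systems may send amounts as strings; standardize to float.
        "amount": float(raw["amount"]),
        # Normalize inconsistent date formats to ISO 8601.
        "order_date": datetime.strptime(raw["order_date"], "%m/%d/%Y")
                              .date().isoformat(),
        "source_system": raw.get("source", "unknown"),
    }

raw_batch = [
    {"order_id": " 1001 ", "amount": "19.99", "order_date": "03/15/2024", "source": "pos"},
    {"order_id": None, "amount": "5.00", "order_date": "03/16/2024"},  # rejected
]
staged = [r for r in (stage_record(x) for x in raw_batch) if r is not None]
print(staged)
```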
Warehouse layer
The warehouse layer is where all processed, cleaned and structured data is stored for long-term use. Data in this layer is often organized into schemas that are optimized for querying and analysis. This layer also enforces governance policies, such as data lineage and access controls, to maintain data integrity and security.
Consumption layer
The consumption layer ensures that data is accessible and relevant to business users. This layer includes BI tools, dashboards, data visualization platforms and APIs that provide user-friendly interfaces. Data in this layer is often aggregated or preprocessed into summary tables or cubes for faster query performance.
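For example, here is a minimal sketch of that preaggregation pattern, with sqlite3 standing in for a real warehouse and illustrative table names:

```python
# Materialize a daily summary table so dashboards read a few aggregated
# rows instead of scanning detail records.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_detail (sale_date TEXT, region TEXT, revenue REAL);
INSERT INTO sales_detail VALUES
    ('2024-03-01', 'EMEA', 120.0),
    ('2024-03-01', 'EMEA', 80.0),
    ('2024-03-01', 'APAC', 200.0);

-- Preaggregated summary table served to BI tools and dashboards.
CREATE TABLE daily_sales_summary AS
SELECT sale_date, region,
       SUM(revenue) AS total_revenue,
       COUNT(*)     AS order_count
FROM sales_detail
GROUP BY sale_date, region;
""")
print(list(conn.execute("SELECT * FROM daily_sales_summary")))
```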
Data warehouse components
Data warehouse architecture is composed of key components that work together to ensure seamless data management and analytics. Core components include the data lakehouse architecture, data integration tools, metadata and data access tools, but organizations may use additional components as needed.
Data lakehouse architecture
The data lakehouse serves as the unified platform for storing and processing all types of data, combining the flexibility of data lakes with the management features of traditional warehouses. It handles both structured and unstructured data, supporting everything from SQL analytics to machine learning workloads, while maintaining data quality and performance.
Data integration tools
Data integration tools support two main approaches to data management: direct integration and data virtualization. Direct integration tools pull data into the central database and transform it into a unified format for analysis by using methods like ETL (extract, transform, load), ELT (extract, load, transform) and real-time and bulk-load processing. Data virtualization enables querying data where it resides through federation, creating a unified view across distributed data sources without moving the data. These approaches can be used together, supported by automation, orchestration, data quality and enrichment. Increasingly, organizations report directly from the data warehouse to avoid moving data.
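The practical difference between ETL and ELT is where the transform step runs. The sketch below uses trivial stand-in functions rather than a real connector API, purely to show the ordering:

```python
def extract(source: list[dict]) -> list[dict]:
    # Pull rows from the source system (here, just a list of dicts).
    return list(source)

def transform(rows: list[dict]) -> list[dict]:
    # Clean and shape: standardize the amount field to float.
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(warehouse: list[dict], rows: list[dict]) -> None:
    # Land rows in the (stand-in) warehouse.
    warehouse.extend(rows)

source = [{"id": 1, "amount": "9.99"}]

# ETL: transform before loading, outside the warehouse.
etl_target: list[dict] = []
load(etl_target, transform(extract(source)))

# ELT: land raw data first, then transform inside the warehouse.
elt_target: list[dict] = []
load(elt_target, extract(source))
elt_target[:] = transform(elt_target)

print(etl_target, elt_target)  # both end with amount as a float
```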
Metadata
Metadata is data about data, essential for data governance and management. It provides context and details about the data, such as its origins, transformations, structure, relationships and usage. Technical metadata details schemas, data types and lineage, while business metadata describes data for nontechnical users.
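As an illustration, a catalog entry might hold both kinds of metadata side by side. The sketch below uses hypothetical fields, not any specific catalog's API:

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # Technical metadata: schema, data types and lineage.
    table_name: str
    columns: dict[str, str]  # column name -> data type
    lineage: list[str]       # upstream sources feeding this table
    # Business metadata: meaning for nontechnical users.
    description: str
    owner: str
    tags: list[str] = field(default_factory=list)

orders = TableMetadata(
    table_name="warehouse.sales.orders",
    columns={"order_id": "STRING", "amount": "DECIMAL(10,2)", "order_date": "DATE"},
    lineage=["pos_system.raw_orders", "crm.customers"],
    description="One row per completed customer order, refreshed nightly.",
    owner="sales-analytics-team",
    tags=["sales", "finance"],
)
print(orders)
```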
Data access tools
Data access tools allow users to query, analyze and visualize data stored in the data warehouse, bridging the gap between raw data and decision-makers. These tools include reporting software, BI platforms, OLAP tools, data mining tools, application development tools and APIs that make data accessible to both technical and nontechnical users.
Embedded AI and ML capabilities
Modern data warehouses often include built-in AI and ML capabilities that enable automatic data processing, pattern detection, anomaly detection and predictive analytics directly within the warehouse environment, eliminating the need for separate AI/ML systems.
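As a rough illustration of the idea only (real warehouses expose this through built-in SQL or ML functions rather than hand-rolled code), anomaly detection over a metric can be as simple as a deviation test:

```python
# Flag values that deviate sharply from the mean of a warehouse metric.
from statistics import mean, stdev

daily_revenue = [100.0, 104.0, 98.0, 101.0, 99.0, 410.0, 102.0]
mu, sigma = mean(daily_revenue), stdev(daily_revenue)

# Anything more than 2 sample standard deviations from the mean is flagged.
anomalies = [x for x in daily_revenue if abs(x - mu) > 2 * sigma]
print(anomalies)  # [410.0]
```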
Interactive dashboards
Visual analytics dashboards provide real-time access to data insights through interactive charts, graphs and reports. These self-service interfaces allow both technical and nontechnical users to explore data, create visualizations and generate insights without writing complex queries.
Governance framework
A comprehensive governance framework manages data access controls, security policies, compliance requirements and data quality standards. This includes tools for data lineage tracking, audit logging, privacy protection and regulatory compliance management across the entire data warehouse environment.
Data warehousing concepts: Inmon vs. Kimball
Bill Inmon and Ralph Kimball, early leaders of data warehousing, have different approaches to data warehouse design. Inmon’s approach starts with the data warehouse as the central repository for enterprise data and is known as the top-down approach.
Kimball’s model, known as the bottom-up approach, focuses on first creating data marts — specialized databases tailored to particular business units or departments — and integrating them into a larger data warehouse.
Inmon approach
Inmon’s top-down model envisions a centralized, enterprise-wide data warehouse that acts as a single source of truth for the whole business. In this approach, data is pulled from source systems, cleaned and stored in a normalized format within the central data warehouse. The normalization ensures data consistency, minimizes redundancy and facilitates integration across diverse datasets. Data marts focusing on specific business areas are created as subsets of the main data warehouse. These are derived from the central repository, ensuring consistency with the enterprise’s overall data architecture.
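A minimal sketch of what that normalized storage looks like, with illustrative table names and sqlite3 standing in for the warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    category   TEXT NOT NULL
);
-- Orders reference customers and products by key instead of repeating
-- their attributes, minimizing redundancy across the warehouse.
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product_id  INTEGER REFERENCES product(product_id),
    amount      REAL NOT NULL
);
""")
```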
Kimball approach
Kimball’s bottom-up method focuses on building data marts, which directly address specific business questions and reporting requirements. These are combined to create the data warehouse, which facilitates analysis and reporting. Kimball’s approach uses a dimensional model with "fact" tables containing numerical measures and "dimension" tables with descriptive attributes, often structured in a star schema, which simplifies querying and analysis. Data is denormalized, which speeds up the initial phase of data warehousing design. Because each data mart focuses on an individual business area rather than the whole business, it takes up less storage, making system management easier.
Choosing the right approach
Organizations need to choose the data warehouse architecture approach that best fits their needs, which can include blending the Inmon and Kimball approaches into a hybrid model.
In general, the Inmon approach offers a comprehensive, scalable solution for managing large, enterprise-wide datasets. It ensures consistent, reliable analytics across the organization while emphasizing data quality and governance. However, users need advanced and specialized tools for querying and analysis, and this method requires significant investment of time, resources and technical expertise to build the data warehouse.
In contrast, the Kimball approach provides flexible and fast data delivery. It enables end users to query and analyze data directly from data marts using familiar tools and self-serve models, simplifying discovery and analysis even for users without specialized skills or advanced tools. When organizations need user-friendly, fast reporting and analytics, or when budget and resources are more limited, the Kimball method may work best.
Structuring the data warehouse
Organizations use schemas to describe different logical arrangements of data, represented by objects such as indexes and tables, within a data warehouse. These schemas act as a blueprint for how data will be stored and managed, including definitions of terms and relationships and their arrangement. Enterprises use three types of schema to structure a data warehouse.
Star schema
A star schema is a multidimensional data model used to organize data in a database so that it’s easy to understand and analyze. The star schema is the simplest data warehouse schema, optimized for querying large datasets. It has a single fact table in the center, connecting to multiple dimension tables. Star schemas enable users to slice and dice the data however they see fit, typically by joining the fact table to one or more dimension tables.
Star schemas are used to denormalize business data into dimensions (such as time and product) and facts (such as transactions in amounts and quantities). Denormalized data models have more data redundancy (data duplication), which speeds up queries at the cost of extra storage.
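Here is a minimal, self-contained sketch of a star schema and a typical slice-and-dice query over it, with illustrative names and sqlite3 standing in for a warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimensions: descriptive attributes kept in one wide table.
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INT, month INT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- Central fact table: numeric measures plus a key into each dimension.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO dim_date    VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware');
INSERT INTO fact_sales  VALUES (1, 10, 2, 40.0), (2, 10, 5, 100.0);
""")

# Slice by year, dice by category: a typical star-schema aggregate query.
for row in conn.execute("""
    SELECT d.year, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key    = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
"""):
    print(row)  # (2024, 'Hardware', 140.0)
```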
Snowflake schema
A snowflake schema is an extension of a star schema where dimension tables are broken down into subdimensions. This makes the data model more complex, but it can be easier for analysts to work with for certain data types.
The main difference between star schemas and snowflake schemas is that snowflake schemas normalize the data. Snowflake schemas offer more storage efficiency, due to their tighter adherence to high normalization standards, but query performance isn't as good as with denormalized data models. Snowflake schemas are commonly used for business intelligence and reporting in OLAP data warehouses, data marts and relational databases.
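A minimal sketch of that normalization, snowflaking the product dimension from the star schema example above into a category subdimension (names illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Subdimension: category attributes stored once, not repeated per product.
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT,
    department    TEXT
);
-- The product dimension now references the subdimension by key instead of
-- duplicating category_name and department on every row.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
""")
```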
Galaxy schema
A galaxy schema uses multiple fact tables connected with shared normalized dimension tables, as opposed to star and snowflake schemas, which only use one fact table. A galaxy schema is interlinked and normalized, nearly eliminating redundancy and inconsistency of data. Galaxy schemas are known for high data accuracy and data quality and provide a basis for effective analytics and reporting, making them a strong choice for complex database systems.
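A minimal sketch, reusing the illustrative names from the earlier examples: two fact tables (sales and inventory) share the same conformed dimension tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INT, month INT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);

-- Fact table 1: sales transactions.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);
-- Fact table 2: inventory snapshots, reusing the same shared dimensions.
CREATE TABLE fact_inventory (
    date_key         INTEGER REFERENCES dim_date(date_key),
    product_key      INTEGER REFERENCES dim_product(product_key),
    quantity_on_hand INTEGER
);
""")
```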
Challenges for data warehouse architecture
Designing and maintaining data warehouse architecture brings several challenges that can impact efficiency and effectiveness.
Unstructured data
Unstructured data, such as images, videos, text files and logs, is important for improvement, innovation and creativity because it creates opportunities to discover new patterns and insights from diverse data sources. However, traditional data warehouse architectures are designed for structured data, so organizations may need advanced tools to realize value from unstructured data. The volume of unstructured data can also pose challenges with storage and efficient data management.
Scalability
As organizations grow, exponential data volume expansion challenges the scalability of data warehouse architectures. Traditional on-premises systems may struggle to handle large-scale datasets, high query loads or real-time processing demands. Cloud-based data warehouses offer elastic scalability but require careful planning to optimize resources and costs.
Cost
Building and maintaining a data warehouse requires significant investment in infrastructure and skilled personnel. On-premises systems are expensive to establish, while cloud-based solutions can be expensive to operate. Costs rise with growing data volumes, increased user demands and the integration of advanced analytics or AI capabilities.
Performance and efficiency
Data warehouse performance and efficiency are crucial for business operations, especially when handling large datasets and complex queries. Slow query response times and inefficient data processing pipelines can affect user productivity and hinder decision-making. Achieving optimal performance often requires increasing the complexity of system design and management.
Nontechnical usage
Nontechnical users need to be able to access and analyze data, but traditional data warehouses often require technical expertise in SQL or other technical tools. This creates a slow, inefficient system in which users must submit requests to data teams and wait for the data to be delivered. This results in bottlenecks and delays, which are amplified in larger organizations.
Separate systems for AI and ML
Traditional data warehouses are designed for workloads such as historical reporting, BI and querying, but they were never intended to support machine learning workloads. Using additional data pipelines to transfer data between the warehouse and specialized AI/ML environments increases complexity and latency. Integrating AI and ML capabilities directly into the data warehouse or leveraging hybrid platforms can address these challenges.
Separate systems for BI
Traditional architectures often require separate specialized systems for business intelligence and analytics, creating data silos and requiring complex data movement between systems. This separation can lead to data inconsistency, increased maintenance overhead and delayed insights. Modern integrated platforms that combine data storage and BI capabilities in one environment help organizations streamline their analytics workflows and ensure data consistency.
Separate systems for data governance
Having disconnected systems for data governance creates challenges in maintaining consistent policies, access controls and compliance standards across the data ecosystem. Organizations often struggle with fragmented governance tools that don’t communicate effectively with each other, leading to security vulnerabilities and compliance risks. A unified governance framework integrated within the data platform helps ensure consistent policy enforcement and simplified compliance management.