What is a data warehouse?

A data warehouse is a data management system that stores current and historical data from multiple sources in a business-friendly manner for easier insights and reporting. Data warehouses are typically used for business intelligence (BI), reporting, and data analysis.

Data warehouses make it possible to quickly and easily analyze business data uploaded from operational systems such as point-of-sale systems, inventory management systems, or marketing or sales databases. Data may pass through an operational data store and require data cleansing to ensure data quality before it can be used in the data warehouse for reporting.

What are data warehouses used for?

Data warehouses are used in BI, reporting, and data analysis to extract and summarize data from operational databases. Information that is difficult to obtain directly from transactional databases can be obtained via data warehouses. For example, management may want to know the total revenue generated by each salesperson per month for each product category. A transactional database may not capture this data directly, but a data warehouse can.
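The monthly-revenue question above boils down to a simple aggregation over a warehouse fact table. A minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse (the table and column names here are hypothetical):

```python
import sqlite3

# In-memory database standing in for a warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        salesperson TEXT,
        category    TEXT,
        sale_date   TEXT,   -- ISO date, e.g. '2024-03-02'
        revenue     REAL
    )
""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Alice", "Hardware", "2024-03-02", 1200.0),
        ("Alice", "Hardware", "2024-03-20",  800.0),
        ("Alice", "Software", "2024-03-05",  300.0),
        ("Bob",   "Hardware", "2024-04-11",  500.0),
    ],
)

# Total revenue per salesperson, per month, per product category.
rows = conn.execute("""
    SELECT salesperson,
           strftime('%Y-%m', sale_date) AS month,
           category,
           SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY salesperson, month, category
    ORDER BY salesperson, month, category
""").fetchall()

for row in rows:
    print(row)
```

In a real warehouse the same `GROUP BY` would run over billions of rows; the point is that the warehouse stores the data in a shape where this question is a single query.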

ETL and ELT on the data warehouse

ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two approaches to moving and preparing data. With ETL, data engineers extract data from different sources, cleanse and structure it, and then load it into the data warehouse. With ELT, data is loaded into the warehouse in its original format first, then cleansed and structured as it is processed.

ETL is typically performed centrally by enterprise data engineering teams, which apply company-wide data cleansing and conformance rules. With ELT, transformations happen at a later stage and are typically specific to a project or business team, enabling self-service analytics.
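The difference is mainly one of control flow: whether the transform runs before or after the data lands. A minimal sketch contrasting the two (all function and variable names here are illustrative stand-ins, not a real pipeline framework):

```python
# Contrast of ETL vs. ELT control flow; everything here is an
# illustrative stand-in, not a real pipeline framework.

def extract(source):
    """Pull raw records from an operational source."""
    return list(source)

def transform(records):
    """Cleanse and conform records (here: normalize names, drop bad rows)."""
    return [
        {**r, "name": r["name"].strip().title()}
        for r in records
        if r.get("amount", 0) > 0
    ]

def load(warehouse, records):
    """Append records to the (stand-in) warehouse."""
    warehouse.extend(records)

source = [
    {"name": "  alice ", "amount": 10},
    {"name": "bob", "amount": 0},  # bad row: zero amount
]

# ETL: transform *before* the data lands in the warehouse.
etl_warehouse = []
load(etl_warehouse, transform(extract(source)))

# ELT: load raw data first; transform later, inside the warehouse,
# typically per project or business team.
elt_warehouse = []
load(elt_warehouse, extract(source))
elt_curated = transform(elt_warehouse)

print(etl_warehouse)
print(elt_curated)
```

Note that in the ELT case the raw, untransformed rows remain available in the warehouse, which is what makes later, team-specific transformations possible.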

Transaction processing (OLTP) vs. analytics processing (OLAP)

An Online Transaction Processing (OLTP) system captures and maintains transactional data in a database. Transactions involve individual database records made up of multiple fields or columns. OLTP databases are commonly used in applications like online banking, ERP systems, or inventory management, enabling quick updates to row-level data that are processed nearly instantly.

An Online Analytical Processing (OLAP) system applies complex queries to large amounts of historical data, aggregated from OLTP databases and other sources, for data mining, analytics, and business intelligence projects. Data warehouses are OLAP systems. OLAP databases and data warehouses give analysts and decision-makers the ability to use custom reporting tools to turn data into information and action. Query failure on an OLAP database does not interrupt or delay transaction processing for customers, but it can delay or impact the accuracy of business intelligence insights.
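The contrast shows up even in a single embedded database: OLTP workloads issue small, targeted row-level writes, while OLAP workloads run read-only aggregates over many rows. A minimal sketch with Python's sqlite3 (the schema is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 100.0), (2, 250.0), (3, 75.0)])

# OLTP-style operation: a small, targeted, row-level write,
# e.g. posting a payment against one account.
conn.execute("UPDATE accounts SET balance = balance - 25.0 WHERE id = 1")

# OLAP-style operation: a read-only aggregate scanning many rows,
# e.g. total balance held across all accounts.
(total,) = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()
print(total)
```

Separating the two means the aggregate scan runs against the warehouse rather than competing for locks and I/O with the row-level updates serving customers.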

Benefits of data warehouses

  • Consolidate data from many sources, acting as a single point of access for all data rather than requiring users to connect to dozens or even hundreds of individual data stores.
  • Historical intelligence. A data warehouse integrates data from many sources to show historic trends.
  • Separate analytics processing from transactional databases, improving the performance of both systems.
  • Data quality, consistency, and accuracy. Data warehouses use a standard set of semantics around data, including consistency in naming conventions, codes for various product types, languages, currencies, and so on.

Challenges with data warehouses

  • No support for unstructured data like images, text, and IoT data, or semi-structured formats like HL7, JSON, and XML. Traditional data warehouses can store only clean, highly structured data, even though Gartner estimates that up to 80% of an organization's data is unstructured. Organizations that want to use their unstructured data to unlock the power of AI have to look elsewhere.
  • No support for AI and machine learning. Data warehouses are purpose-built and optimized for common DWH workloads including historical reporting, BI, and querying — they were never designed for or intended to support machine learning workloads.
  • SQL-only — DWHs typically offer no support for Python or R, the languages of choice for app developers, data scientists, and machine learning engineers.
  • Duplicated data — many enterprises have data warehouses and subject-area (departmental) data marts in addition to a data lake, which results in duplicated data, lots of redundant ETL, and no single source of truth.
  • Tough to keep in sync — keeping two copies of the data synchronized between the lake and the warehouse adds complexity and fragility that is tough to manage. Data drift can cause inconsistent reporting and faulty analysis.
  • Closed, proprietary formats increase vendor lock-in — most enterprise data warehouses use their own proprietary data format, rather than formats based on open source and open standards. This increases vendor lock-in, makes it difficult or impossible to analyze your data with other tools, and makes it more difficult to migrate your data.
  • Expensive — commercial data warehouses charge you for storing your data, and also for analyzing it. Storage and compute costs are therefore still tightly coupled together. Separation of compute and storage with a lakehouse means you can independently scale either as needed.

How lakehouses solve these challenges

Lakehouse architectures address these challenges directly, offering the best of both data lakes and data warehouses. See the value of an open lakehouse architecture on Databricks.

Databricks Lakehouse for Data Warehousing

Databricks Lakehouse Platform Architecture

To build a successful lakehouse, organizations have turned to Delta Lake, an open source, open format data management and governance layer that combines the best of both data lakes and data warehouses. The Databricks Lakehouse Platform uses Delta Lake to give you:

  • World-record data warehouse performance at data lake economics.
  • Serverless SQL compute that removes the need for infrastructure management.
  • Seamless integration with modern data stack tools like dbt, Tableau, Power BI, and Fivetran to ingest, query, and transform data in place.
  • A first-class SQL development experience for every data practitioner across your organization, with ANSI SQL support.
  • Fine-grained governance with data lineage, table/row level tags, role-based access controls and more.
