
Intelligent Data Warehousing on Databricks

This reference architecture illustrates how the Databricks Data Intelligence Platform enables modern data warehousing and BI by combining streaming and batch ingestion, governed storage, scalable SQL analytics and integrated AI on a unified lakehouse.


Architecture Summary

The architecture supports traditional reporting, real-time dashboards, predictive modeling and self-service analytics — all while meeting enterprise standards for security, governance and performance.

This solution demonstrates how the Databricks Data Intelligence Platform, powered by Databricks SQL, helps organizations modernize their data warehousing strategy while meeting the needs of both data teams and business stakeholders.

The architecture starts with an open, governed lakehouse managed by Unity Catalog. Data is ingested from a range of systems — including operational databases, SaaS apps, event streams and file systems — and lands in a central storage layer. The platform’s data intelligence powers everything from ETL and SQL analytics to dashboards and AI use cases. By supporting flexible access through SQL, BI tools and natural language queries, the platform accelerates data product delivery and makes insights accessible across the organization.


Use Cases

Technical Use Cases

  • Ingesting structured, unstructured, batch and streaming data from diverse sources
  • Building robust declarative ETL pipelines
  • Modeling facts, dimensions and data marts using a medallion architecture
  • Running high-concurrency SQL queries for reporting and dashboarding
  • Integrating ML outputs directly into the warehouse for downstream use

Business Use Cases

  • Delivering real-time dashboards on sales, operations or customer metrics
  • Enabling ad hoc exploration through natural language interfaces like Genie
  • Supporting predictive use cases like demand forecasting and churn modeling
  • Sharing governed data products across departments or with partners
  • Providing fast, reliable insights for finance, marketing and product teams


Key Capabilities With Data Intelligence

The data intelligence component of this architecture makes the platform smarter, more adaptive and easier to use across personas and workloads. It applies AI and metadata awareness throughout the system to simplify experiences and automate decision-making:

  • Natural language interface (Genie): Understands business context and lets users ask data questions in plain language
  • Semantic awareness: Recognizes relationships between tables, columns and usage patterns to suggest joins, filters or calculations
  • Predictive optimization: Continuously tunes query performance and compute allocation based on historical workloads (see the sketch after this list)
  • Unified governance: Tags, classifies and tracks usage of data assets, making discovery more intuitive and secure
  • Key capability: A self-optimizing platform that adapts to your data and users
  • Differentiator: Data intelligence is embedded across ingestion, query, governance and visualization — not bolted on
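
To make the predictive optimization capability concrete, here is a minimal sketch, assuming a Databricks notebook where spark is predefined and Unity Catalog is enabled; the catalog and schema names (main.gold) are hypothetical placeholders.

    # Hedged sketch: let the platform manage layout and maintenance (e.g.,
    # OPTIMIZE/VACUUM) for a schema. Names are hypothetical placeholders.
    spark.sql("ALTER SCHEMA main.gold ENABLE PREDICTIVE OPTIMIZATION")

    # DESCRIBE ... EXTENDED should report the inherited optimization setting.
    spark.sql("DESCRIBE SCHEMA EXTENDED main.gold").show(truncate=False)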


Data Flow With Key Capabilities and Differentiators

  1. Data sources: Data is stored in a wide variety of systems, including enterprise apps (e.g., SAP, Salesforce), databases, IoT devices, application logs and external APIs. These sources can produce structured, semi-structured or unstructured data.
  2. Data ingestion: Brings in data through batch jobs, change data capture (CDC) or streaming. These pipelines feed the lakehouse architecture in near real time or on scheduled intervals, depending on the source system and use case; a minimal ingestion sketch follows this step.
    • Key differentiator: Unified ingestion for all modalities — batch, streaming and CDC — without needing separate infrastructure or pipelines
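
As one illustration, a minimal Auto Loader sketch for incremental file ingestion into a Bronze table, assuming a Databricks notebook where spark is predefined; all paths and the target table name are hypothetical.

    # Hedged sketch: incremental JSON ingestion with Auto Loader ("cloudFiles").
    # Paths and the Unity Catalog table name are hypothetical placeholders.
    bronze_stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders")
        .load("/Volumes/main/raw/orders")
    )

    (
        bronze_stream.writeStream
        .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/orders")
        .trigger(availableNow=True)  # process all available files, then stop
        .toTable("main.bronze.orders")
    )
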
  3. Data transformation (ETL with Declarative Pipelines): Once ingested, data is transformed through the medallion architecture and progressively refined from raw to curated data; a pipeline sketch follows this step.
    • Raw zone to Bronze zone: Land data from external source systems as-is; table structures in this layer mirror the source systems, with no transformations or updates applied
    • Bronze zone to Silver zone: Standardize and clean incoming data
    • Silver zone to Gold zone: Apply business logic to create reusable models
    • Facts and dimensions data marts: Aggregate and curate data for downstream analytics
    • Key differentiator: Declarative, production-grade pipelines with built-in lineage, observability and schema evolution
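
A minimal sketch of the medallion flow as a declarative pipeline, assuming Lakeflow Declarative Pipelines (the dlt Python module); the source path, table and column names are hypothetical.

    # Hedged sketch: a three-hop medallion pipeline defined declaratively.
    # Source path, table and column names are hypothetical placeholders.
    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw orders ingested as-is from the landing zone")
    def bronze_orders():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/raw/orders")
        )

    @dlt.table(comment="Standardized orders with basic quality checks")
    @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # quality rule
    def silver_orders():
        return (
            dlt.read_stream("bronze_orders")
            .withColumn("order_ts", F.to_timestamp("order_ts"))
            .withColumn("amount", F.col("amount").cast("double"))
        )

    @dlt.table(comment="Daily revenue by region for downstream marts")
    def gold_daily_revenue():
        return (
            dlt.read("silver_orders")
            .groupBy(F.to_date("order_ts").alias("order_date"), "region")
            .agg(F.sum("amount").alias("revenue"))
        )
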
  4. Curated data for AI use cases: Curated data from data marts can be used to train or apply machine learning models. These models support use cases like demand forecasting, anomaly detection and customer scoring; a batch-scoring sketch follows this step.
    • Model outputs are stored alongside traditional warehouse data for easy access via SQL or dashboards
    • Results can be updated on a schedule or scored in real time, depending on requirements
    • Key differentiator: Colocated analytics and AI workloads on the same platform — no data movement needed. Model outputs are treated as native, queryable governed assets.
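
To illustrate storing model outputs as queryable warehouse tables, a hedged batch-scoring sketch using MLflow's Spark UDF; the registered model URI, alias and table names are hypothetical.

    # Hedged sketch: batch-score with a registered model and persist predictions
    # as a governed Delta table. Model URI and table names are placeholders.
    import mlflow.pyfunc
    from pyspark.sql import functions as F

    churn_udf = mlflow.pyfunc.spark_udf(
        spark, "models:/main.ml.churn_model@champion"
    )

    features = spark.read.table("main.gold.customer_features")
    scored = features.withColumn(
        "churn_score", churn_udf(F.struct(*features.columns))
    )

    # Predictions land next to warehouse tables, queryable via SQL or dashboards.
    scored.write.mode("overwrite").saveAsTable("main.gold.customer_churn_scores")
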
  5. Queries feeding BI and reporting tools: Databricks SQL supports high-concurrency, low-latency querying through serverless compute, and connects easily to popular BI tools; a client-side query sketch follows this step.
    • Built-in query editor and query history
    • Queries return governed, up-to-date results from data marts or enriched model outputs
    • Key differentiator: Databricks SQL allows BI tools to query data directly — without replication — reducing complexity, avoiding additional licensing costs and lowering overall TCO. Combined with serverless compute and intelligent optimization, it delivers warehouse-grade performance with minimal tuning.
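
To show direct, replication-free access from an external client, a sketch using the open source databricks-sql-connector package for Python; the hostname, HTTP path and token are hypothetical placeholders.

    # Hedged sketch: query a SQL warehouse from an external client with the
    # databricks-sql-connector package. Connection details are placeholders.
    from databricks import sql

    with sql.connect(
        server_hostname="dbc-example.cloud.databricks.com",
        http_path="/sql/1.0/warehouses/abc123",
        access_token="<personal-access-token>",
    ) as conn:
        with conn.cursor() as cursor:
            cursor.execute(
                "SELECT region, SUM(revenue) AS revenue "
                "FROM main.gold.daily_revenue GROUP BY region"
            )
            for row in cursor.fetchall():
                print(row.region, row.revenue)
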
  6. Dashboards: Can be built directly in Databricks or in external BI tools like Power BI or Tableau. Users can describe visuals in natural language, and Databricks Assistant will generate the corresponding charts, which can then be refined using a point-and-click interface.
    • Create visualizations using natural language input
    • Modify and explore dashboards interactively with filters and drill-downs
    • Publish and securely share dashboards across the organization, including with users outside the Databricks workspace
    • Key differentiator: Offers a low-code and AI-assisted experience for building and exploring dashboards on governed, real-time data
  7. Serving curated data: Once refined, data can be served beyond dashboards (a Delta Sharing sketch follows this step):
    • Shared with downstream applications or operational databases for transactional decision-making
    • Used in collaborative notebooks for analysis
    • Distributed via Delta Sharing to partners, teams or external consumers with unified governance
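
A recipient-side sketch using the open source delta-sharing Python client; the profile file and the share, schema and table names are hypothetical.

    # Hedged sketch: read a table shared via Delta Sharing as a recipient.
    # The profile file (issued by the provider) and the table coordinates
    # are hypothetical placeholders.
    import delta_sharing

    table_url = "/path/to/config.share#sales_share.gold.daily_revenue"

    df = delta_sharing.load_as_pandas(table_url)  # small results as pandas
    print(df.head())

    # For large tables, delta_sharing.load_as_spark(table_url) returns a
    # distributed Spark DataFrame instead.
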
  8. Natural language query (NLQ): Business users can access governed data using natural language. This conversational experience, powered by generative AI, enables teams to move beyond static dashboards and get real-time, self-service insights. NLQ translates user intent into SQL by leveraging the organization’s semantics and metadata from Unity Catalog; a hedged API sketch follows this step.
    • Supports ad hoc, interactive, real-time questions that aren’t pre-built into dashboards
    • Intelligently adapts to evolving business terminology and context over time
    • Leverages existing data governance and access controls via Unity Catalog
    • Provides auditability and traceability of natural language queries for compliance and transparency
    • Key differentiator: Continuously adapts to evolving business concepts, delivering accurate, context-aware responses without requiring SQL expertise
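
Genie is primarily an interactive experience, but conversations can also be started programmatically. The sketch below assumes the Genie Conversation REST API; the workspace URL, space ID and token are hypothetical placeholders.

    # Hedged sketch: start a Genie conversation via the REST API (assumed
    # endpoint: /api/2.0/genie/spaces/{space_id}/start-conversation).
    # Workspace URL, space ID and token are hypothetical placeholders.
    import requests

    host = "https://dbc-example.cloud.databricks.com"
    space_id = "<genie-space-id>"
    headers = {"Authorization": "Bearer <personal-access-token>"}

    resp = requests.post(
        f"{host}/api/2.0/genie/spaces/{space_id}/start-conversation",
        headers=headers,
        json={"content": "What was revenue by region last quarter?"},
    )
    resp.raise_for_status()
    print(resp.json())  # conversation and message IDs, used to poll for results
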
  9. Platform capabilities (governance, performance, orchestration and open storage): The architecture is underpinned by a set of platform-native capabilities that support security, optimization, automation and interoperability across the entire data lifecycle; a governance sketch follows this list. Key capabilities:
    • Governance: Unity Catalog provides centralized access control, lineage, auditing and data classification across all workloads
    • Performance: Photon engine, intelligent caching and workload-aware optimization deliver fast queries without manual tuning
    • Orchestration: Built-in orchestration manages data pipelines, AI workflows and scheduled jobs across batch and streaming workloads, with native support for dependency management and error handling
    • Open storage: Data is stored in open formats (Delta Lake, Parquet, Iceberg), enabling interoperability across tools, portability between platforms and long-term durability without vendor lock-in
    • Monitoring and auditability: End-to-end visibility into query performance, pipeline execution and user access for better control and cost management
    • Key differentiator: Platform-level services are integrated — not layered on — ensuring governance, automation and performance are consistent across all data workflows, clouds and teams
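
To ground the governance capability, a sketch of standard Unity Catalog grants run from a notebook where spark is predefined; the group and object names are hypothetical.

    # Hedged sketch: standard Unity Catalog GRANT statements. The `analysts`
    # group and the catalog/schema/table names are hypothetical placeholders.
    spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
    spark.sql("GRANT USE SCHEMA ON SCHEMA main.gold TO `analysts`")
    spark.sql("GRANT SELECT ON TABLE main.gold.daily_revenue TO `analysts`")

    # Audit current permissions on the table.
    spark.sql("SHOW GRANTS ON TABLE main.gold.daily_revenue").show(truncate=False)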

Recommended

  • Data Intelligence end-to-end Architecture with Azure Databricks (Reference Architecture)
  • Data Ingestion Reference Architecture (Reference Architecture)
  • Reference Architecture for Credit Loss Forecasting (Industry Architecture)