Data Ingestion Reference Architecture
This data ingestion reference architecture provides a simplified, unified and efficient foundation for loading data from diverse enterprise sources into the Databricks Data Intelligence Platform.

Architecture Summary
The data ingestion reference architecture supports a wide range of ingestion patterns — including batch, change data capture (CDC) and streaming — while ensuring governance, performance and interoperability. Once ingested, data is refined and made available for analytics, AI and secure data sharing across the organization.
This architecture is ideal for organizations looking to modernize and operationalize data pipelines while reducing complexity and integration overhead. It is built around three key principles:
- Simple and low maintenance: Ingestion pipelines are easy to build and manage, enabling faster time to value, fewer operational bottlenecks and broader access to data across teams
- Unified with the lakehouse architecture: Data flows directly into the lakehouse using open formats and governed by Unity Catalog — ensuring consistency across BI, AI and operational use cases
- Efficient end-to-end flow: From ingestion to transformation and delivery, the platform supports efficient, incremental processing that minimizes duplication, latency and resource usage
Use Cases
Technical Use Cases
- Periodic batch ingestion from flat files, exports or APIs into staging zones
- Change data capture (CDC) ingestion to incrementally sync updates from transactional systems like Oracle or PostgreSQL
- Streaming ingestion of real-time events from Kafka or message queues for use in live dashboards or alerting systems
- Harmonizing ingestion across legacy systems, cloud-native databases and enterprise SaaS applications
- Feeding curated and transformed data into data warehouses, AI applications and external APIs
Business Use Cases
- Predicting customer churn by ingesting behavioral, transactional and support data
- Powering executive dashboards with fresh operational metrics from ERP and CRM systems
- Segmenting customers by combining campaign, sales and product usage data
- Conducting HR analytics by integrating data from Workday and productivity platforms
- Performing risk assessment by analyzing transactions and alert feeds in near real time
Data Ingestion Flow and Key Capabilities
- Batch ingestion
  - Loads data at scheduled intervals or on demand from sources like flat files, APIs or database exports
  - Suitable for daily reporting, historical data loads and system-of-record snapshots
  - Supports both full and incremental loads, with native scheduling, retry logic and transformation using SQL or Python (see the batch ingestion sketch after this list)
- Change data capture (CDC) ingestion
  - Captures incremental changes from transactional systems such as Oracle, PostgreSQL and MySQL
  - Keeps lakehouse tables updated without full reloads, improving efficiency and data freshness
  - Enables near real-time data sync for fact tables, audit trails and reporting layers (see the CDC sketch after this list)
- Streaming ingestion
  - Continuously processes data from event sources like Kafka, Kinesis, Pub/Sub or Event Hubs
  - Ideal for real-time dashboards, alerting systems and anomaly detection
  - Structured Streaming manages state, fault tolerance and throughput, reducing operational overhead (see the streaming sketch after this list)
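The following is a minimal batch ingestion sketch, assuming a Databricks notebook or job where spark is predefined and Unity Catalog is enabled. The catalog, schema, table and path names (main.raw.orders_bronze, /Volumes/main/raw/...) are placeholders. It uses Auto Loader with an availableNow trigger so each scheduled run picks up only files that arrived since the previous run:

```python
# Incremental batch load with Auto Loader: each scheduled run ingests only new
# files, then stops (availableNow trigger gives batch semantics).
landing_path = "/Volumes/main/raw/landing/orders"      # hypothetical landing zone
bronze_table = "main.raw.orders_bronze"                # hypothetical target table
checkpoint = "/Volumes/main/raw/_checkpoints/orders"   # tracks already-ingested files

(
    spark.readStream.format("cloudFiles")              # Auto Loader source
        .option("cloudFiles.format", "json")           # flat-file exports in JSON
        .option("cloudFiles.schemaLocation", checkpoint)  # schema inference and evolution
        .load(landing_path)
        .writeStream
        .option("checkpointLocation", checkpoint)
        .trigger(availableNow=True)                    # drain new files, then exit
        .toTable(bronze_table)
)
```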
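One way to express CDC ingestion is with the Lakeflow Declarative Pipelines (Delta Live Tables) Python API. This is a rough sketch under assumptions: a CDC feed has already landed in a table named main.raw.customers_cdc, and the customer_id key, ts sequencing column and op operation flag are hypothetical column names:

```python
import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def customers_cdc_clean():
    # Read the raw CDC feed (landed by a connector or batch export) as a stream
    return spark.readStream.table("main.raw.customers_cdc")

# Target table that is kept in sync without full reloads
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",               # continuously updated lakehouse table
    source="customers_cdc_clean",            # stream of inserts, updates and deletes
    keys=["customer_id"],                    # primary key in the source system
    sequence_by=col("ts"),                   # ordering column to resolve out-of-order events
    apply_as_deletes=expr("op = 'DELETE'"),  # rows flagged as deletes remove the record
    stored_as_scd_type=1,                    # keep only the latest version of each row
)
```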
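The streaming sketch below reads events continuously from Kafka with Structured Streaming and appends them to a Bronze table. The broker address, topic, event schema and table name are placeholders, and spark is again assumed to be predefined:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

# Hypothetical event payload schema; adjust to the actual messages
event_schema = (
    StructType()
    .add("event_id", StringType())
    .add("amount", DoubleType())
    .add("event_time", TimestampType())
)

raw = (
    spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
        .option("subscribe", "payments")                    # placeholder topic
        .option("startingOffsets", "latest")
        .load()
)

# Kafka delivers bytes; parse the value column into structured fields
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
       .select("e.*")
)

(
    events.writeStream
        .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/payments")
        .outputMode("append")
        .toTable("main.raw.payments_bronze")                # placeholder target table
)
```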
Additional Platform Capabilities
- Unified governance
  - Unity Catalog provides unified governance, including access control, lineage and audit tracking
  - Data is stored in open formats using Delta Lake and Apache Iceberg™, ensuring interoperability across tools and environments
  - A centralized orchestration layer manages pipeline scheduling, dependencies, monitoring and recovery
- Lakehouse architecture: Ingested data is transformed and modeled using the medallion architecture (Bronze, Silver and Gold layers), powering high-performance querying in Databricks SQL (see the sketch at the end of this section)
- Orchestration: Built-in orchestration manages data pipelines, AI workflows and scheduled jobs across batch and streaming workloads, with native support for dependency management and error handling
- AI and agent systems: Data feeds into agent systems for preparing features, evaluating models and deploying AI-powered applications
- Downstream consumption:
  - Streaming analytics: Real-time visualization of key metrics and operational signals
  - BI/analytics: Curated datasets served to tools like Power BI, Lakeview and SQL clients
  - AI applications: Governed datasets consumed by training pipelines and inference engines
  - Data sharing and marketplace: Secure internal and external data sharing via Delta Sharing
  - Operational apps: Embedded intelligence and contextual insights in enterprise tools
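To illustrate the medallion flow and governance together, here is a minimal sketch of a Bronze-to-Silver refinement step followed by a Unity Catalog grant. It assumes the Bronze table from the batch example above; the order_id and order_ts columns, the Silver table name and the data-analysts group are placeholders:

```python
from pyspark.sql.functions import col, to_date

# Refine Bronze into a curated Silver table (medallion architecture),
# processing only new records incrementally via Structured Streaming.
silver_query = (
    spark.readStream.table("main.raw.orders_bronze")
        .where(col("order_id").isNotNull())                  # basic data quality filter
        .withColumn("order_date", to_date(col("order_ts")))  # assumed timestamp column
        .writeStream
        .option("checkpointLocation", "/Volumes/main/curated/_checkpoints/orders_silver")
        .trigger(availableNow=True)
        .toTable("main.curated.orders_silver")
)
silver_query.awaitTermination()

# Govern downstream access with Unity Catalog (placeholder group name)
spark.sql("GRANT SELECT ON TABLE main.curated.orders_silver TO `data-analysts`")
```

The Gold layer and downstream consumers (BI, AI applications, Delta Sharing) would read from tables built the same way, with access controlled through the same Unity Catalog grants.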