Data ingestion is the first step in the data engineering lifecycle. It involves gathering data from diverse sources such as databases, SaaS applications, file sources, APIs and IoT devices into a centralized repository like a data lake, data warehouse or lakehouse. This enables organizations to clean and unify the data to leverage analytics and AI for data-driven decision-making.
Traditionally, data ingestion has been handled using a combination of bespoke scripts, open source frameworks like Apache NiFi and Kafka or managed ingestion solutions from cloud providers such as AWS Glue, Google Cloud Dataflow and Azure Data Factory. These methods often require significant engineering effort to maintain, especially when handling schema evolution, data consistency and real-time processing at scale. Many enterprises also rely on separate ingestion, transformation and orchestration tools, leading to increased complexity and data silos.
Unlike ETL (extract, transform, load), which transforms data before loading, data ingestion moves raw data directly into a destination, allowing for faster access and flexibility.
Data ingestion methods vary based on the use case, enabling data collection in scheduled batches, continuous streams or a hybrid of both.
Incremental batch ingestion: Collects data at set intervals, ideal for periodic updates and scenarios where real-time data isn’t essential.
Streaming ingestion: Ingests data continuously as it arrives, supporting real-time scenarios such as IoT monitoring, where applications need immediate access to fresh data.
Hybrid ingestion: Combines batch and streaming, allowing for both scheduled updates and real-time feeds, suited to operations needing both static updates and dynamic tracking.
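The distinction between batch and streaming modes can be sketched in plain Python. This is an illustrative sketch only: the in-memory SOURCE list stands in for a real database or event feed, and the function names are hypothetical, not part of any ingestion product.

```python
from typing import Iterator

# Hypothetical in-memory "source" standing in for a database or event feed.
SOURCE = [{"id": i, "temp": 20 + i} for i in range(6)]

def batch_ingest(source: list, batch_size: int) -> Iterator[list]:
    """Collect records in fixed-size batches, as a scheduled job would."""
    for start in range(0, len(source), batch_size):
        yield source[start:start + batch_size]

def stream_ingest(source: list) -> Iterator[dict]:
    """Hand each record downstream as soon as it 'arrives'."""
    for record in source:
        yield record

batches = list(batch_ingest(SOURCE, batch_size=3))  # two batches of three
events = list(stream_ingest(SOURCE))                # six individual records
```

A hybrid setup simply runs both: scheduled batches for bulk history plus a streaming path for the latest events.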
Different data structures require specific ingestion and processing techniques:
Unstructured data: Data without a predefined format, such as text files, images and video, often requires specialized tools for processing and is commonly ingested through batch or hybrid methods.
Semi-structured data: Data with some structure, like JSON or XML, is suitable for both batch and streaming ingestion and offers flexibility when handling evolving attributes.
Structured data: Data organized in a defined schema (e.g., databases, spreadsheets) can be quickly integrated through batch or streaming ingestion, making it ideal for analysis and reporting.
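The flexibility of semi-structured data can be illustrated with a short, self-contained sketch (the sample records and the "union of attributes" approach are illustrative assumptions, not a specific product behavior):

```python
import json

# Illustrative semi-structured records; the second adds a new attribute.
raw = [
    '{"id": 1, "name": "sensor-a"}',
    '{"id": 2, "name": "sensor-b", "firmware": "2.1"}',
]

records = [json.loads(line) for line in raw]

# A flexible schema is simply the union of attributes seen so far, so a
# newly appearing field like "firmware" is absorbed without manual changes.
schema = sorted({key for rec in records for key in rec})
# schema == ['firmware', 'id', 'name']
```

Structured sources, by contrast, arrive with the schema already fixed, which is what makes them quick to integrate.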
Data ingestion tools range from open source options like Apache NiFi and Kafka, known for flexibility and customization, to commercial platforms like the Databricks Data Intelligence Platform, which combines ingestion, transformation and orchestration into one platform.
Databricks Lakeflow is a unified, intelligent solution for data engineering built on the Data Intelligence Platform. It covers ingestion, data transformation and orchestration of your data.
As part of Lakeflow, Lakeflow Connect offers connectors for diverse data sources, enabling flexible, easy and efficient ways to ingest both structured and unstructured data from enterprise applications, file sources and databases.
Lakeflow Connect enables data ingestion from a variety of different data sources:
Managed connectors: Ingest data with built-in connectors for software-as-a-service (SaaS) applications and databases.
Standard connectors: Ingest data from cloud object storage and streaming sources like Kafka with developer tools.
Files: Ingest files that reside on your local network, have been uploaded to a volume or have been downloaded from an internet location.
Effective ingestion tools streamline data processing with features such as:
Schema evolution: Automatically adapts to changes in data structures, reducing manual intervention.
Data lineage tracking: Traces data origins, supporting governance and compliance requirements.
Error handling and monitoring: Identifies and resolves issues in real time, ensuring reliable data loads.
Scalability: Maintains performance as data volumes grow, essential for large-scale operations.
Data integration: Enables seamless integration with data lakes and warehouses, allowing unified data management.
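Error handling in particular often comes down to retrying transient failures before surfacing a real problem. A minimal sketch, assuming a flaky network load and a simple exponential backoff (all names here are hypothetical):

```python
import time

def load_with_retry(load_fn, max_attempts=3, base_delay=0.1):
    """Retry a flaky load, re-raising the last error if all attempts fail."""
    for attempt in range(1, max_attempts + 1):
        try:
            return load_fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))

# Simulated source that fails twice before succeeding.
calls = {"n": 0}
def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "loaded"

result = load_with_retry(flaky_load)  # succeeds on the third attempt
```

Production tools layer monitoring and alerting on top of this pattern so that persistent failures are flagged rather than silently retried forever.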
Open source tools offer flexibility and control but may require more setup, making them ideal for technical teams. Databricks combines open source foundations with an extensive partner ecosystem. The Databricks Data Intelligence Platform provides managed ingestion with built-in governance and automation, reducing operational costs and complexity.
Data ingestion is usually the first step on the path from data collection to analysis, and it sets up the operations that follow. Its purpose is to collect raw data from multiple sources and transfer it to a data lake, data warehouse or lakehouse. Most organizations need additional steps beyond ingestion because raw data requires further refinement before it becomes useful for analytics and decision-making. Ingestion itself moves data from multiple sources without altering its format, prioritizing speed and flexible availability so that further processing can follow.
Data ingestion brings raw data from various sources into a repository without transformation, prioritizing immediate access to unmodified data.
ETL involves extracting data, transforming it to meet specific requirements and loading it into a target system, focusing on data preparation for analytics. (Learn about the difference between ETL and ELT.)
Data pipelines cover the end-to-end sequence of moving and processing data. A pipeline typically includes several operations beyond ingestion and ETL, such as validation checks, deduplication, machine learning steps and stream processing.
Data ingestion is ideal for cases requiring quick access to raw data, supporting near real-time insights. ETL suits situations that require prepared, structured data for business intelligence and analytics, such as standardized reporting. Data pipelines provide a broader framework for handling complex workflows, integrating multiple steps into a cohesive process.
In modern architectures, data ingestion and ETL often complement each other. For example, data can first be ingested into a lakehouse, where ETL processes later prepare it for deeper analysis and reporting, while a broader data pipeline automates the entire workflow, from ingestion to machine learning and analytics. Databricks Lakeflow integrates these processes, creating a unified workflow for flexibility and comprehensive data management.
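The ingestion-then-ETL pattern described above can be sketched in a few lines. The sample rows, the raw_zone list and the curated transformation are all illustrative assumptions, standing in for a landing zone and a downstream ETL job:

```python
# Raw "source" rows, as they might arrive from an operational system.
source_rows = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "2", "amount": "5.00"},
]

# Ingestion: land the rows unchanged, so they are immediately queryable.
raw_zone = list(source_rows)

# ETL (later step): transform types and shape before loading for analytics.
curated = [
    {"order_id": int(r["order_id"]), "amount": float(r["amount"])}
    for r in raw_zone
]
total = sum(r["amount"] for r in curated)  # roughly 24.99
```

Note that the raw zone still holds the original string-typed values, so the data can be re-processed later if the ETL logic changes.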
Establishing foundational best practices helps ensure efficient, reliable and well-governed ingestion workflows.
Once ingestion processes are established, ongoing optimization helps adapt to evolving business needs and manage increasing data volumes effectively.
