Build reliable data lakes effortlessly at scale
We are excited to announce the open sourcing of the Delta Lake project. Delta Lake is a storage layer that brings reliability to your data lakes built on HDFS and cloud storage. It provides ACID transactions through optimistic concurrency control between writes, and snapshot isolation for consistent reads during writes. Delta Lake also provides built-in data versioning for easy rollbacks and for reproducing reports. The project is available at delta.io to download and use under the Apache License 2.0.
Challenges with Data Lakes
Data lakes are a common element within modern data architectures. They serve as a central ingestion point for the plethora of data that organizations seek to gather and mine. While they are a good step forward in coming to grips with this range of data, they run into the following common problems:
- Reading and writing into data lakes is not reliable. Data engineers often run into the problem of unsafe writes into data lakes that cause readers to see garbage data during writes. They have to build workarounds to ensure readers always see consistent data during writes.
- The data quality in data lakes is low. Dumping unstructured data into a data lake is easy. But this comes at the cost of data quality. Without any mechanisms for validating schema and the data, data lakes suffer from poor data quality. As a consequence, analytics projects that strive to mine this data also fail.
- Poor performance with increasing amounts of data. As the amount of data that gets dumped into a data lake increases, the number of files and directories also increases. Big data jobs and query engines that process the data spend a significant amount of time handling metadata operations. This problem is more pronounced in the case of streaming jobs.
- Updating records in data lakes is hard. Engineers need to build complicated pipelines that read entire partitions or tables, modify the data and write it back. Such pipelines are inefficient and hard to maintain.
Because of these challenges, many big data projects fail to deliver on their vision or sometimes just fail altogether. We need a solution that lets data practitioners make use of their existing data lakes while ensuring data quality.
Introducing the Delta Lake open source project
Delta Lake addresses the above problems to simplify how you build your data lakes. Delta Lake offers the following key functionalities:
- ACID transactions: Delta Lake provides ACID transactions between multiple writes. Every write is a transaction, and there is a serial order for writes recorded in a transaction log. The transaction log tracks writes at file level and uses optimistic concurrency control, which is ideally suited for data lakes since multiple writes trying to modify the same files don’t happen that often. In scenarios where there is a conflict, Delta Lake throws a concurrent modification exception so that users can handle it and retry their jobs. Delta Lake also offers a strong serializable isolation level that allows engineers to continuously keep writing to a directory or table and consumers to keep reading from the same directory or table. Readers will see the latest snapshot that existed at the time the reading started.
- Schema management: Delta Lake automatically validates that the schema of the DataFrame being written is compatible with the schema of the table. Columns that are present in the table but not in the DataFrame are set to null. If there are extra columns in the DataFrame that are not present in the table, the write throws an exception. Delta Lake has DDL to explicitly add new columns and the ability to update the schema automatically.
- Scalable metadata handling: Delta Lake stores the metadata information of a table or directory in the transaction log instead of the metastore. This allows Delta Lake to list files in large directories in constant time and be efficient while reading data.
- Data versioning and time travel: Delta Lake allows users to read a previous snapshot of the table or directory. When files are modified during writes, Delta Lake creates newer versions of the files and preserves the older versions. When users want to read the older versions of the table or directory, they can provide a timestamp or a version number to Apache Spark’s read APIs and Delta Lake constructs the full snapshot as of that timestamp or version based on the information in the transaction log. This allows users to reproduce experiments and reports and also revert a table to its older versions, if needed.
- Unified batch and streaming sink: Apart from batch writes, Delta Lake can also be used as an efficient streaming sink with Apache Spark’s Structured Streaming. Combined with ACID transactions and scalable metadata handling, the efficient streaming sink now enables many near real-time analytics use cases without having to maintain a complicated streaming and batch pipeline.
- Record update and deletion (Coming soon): Delta Lake will support merge, update and delete DML commands. This allows engineers to easily upsert and delete records in data lakes and simplify their change data capture and GDPR use cases. Since Delta Lake tracks and modifies data at file-level granularity, it is much more efficient than reading and overwriting entire partitions or tables.
- Data expectations (Coming soon): Delta Lake will also support a new API to set data expectations on tables or directories. Engineers will be able to specify a boolean condition and tune the severity to handle data expectations. When Apache Spark jobs write to the table or directory, Delta Lake will automatically validate the records and when there is a violation, it will handle the records based on the severity provided.
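To make the ACID transactions and time travel items above more concrete, here is a toy, in-memory model of a file-level transaction log in plain Python. It is only a sketch of the idea, not Delta Lake's implementation or API: the names ToyTransactionLog, Commit and ConcurrentModificationError are hypothetical, the conflict rule (two commits touching the same file name) is a simplifying assumption, and timestamps are caller-supplied integers assumed to be non-decreasing.

```python
from dataclasses import dataclass, field


class ConcurrentModificationError(Exception):
    """A commit conflicts with one that landed after the writer's read."""


@dataclass(frozen=True)
class Commit:
    version: int
    timestamp: int          # logical clock supplied by the caller (assumption)
    added: frozenset        # file names added by this commit
    removed: frozenset      # file names logically removed by this commit


@dataclass
class ToyTransactionLog:
    commits: list = field(default_factory=list)

    def latest_version(self):
        return len(self.commits) - 1  # -1 means no commit yet

    def commit(self, read_version, timestamp, added=(), removed=()):
        """Append a commit that was prepared against `read_version`."""
        added, removed = frozenset(added), frozenset(removed)
        # Optimistic check: only commits that landed after our read matter.
        for later in self.commits[read_version + 1:]:
            # Simplified conflict rule: both writers touched the same file.
            if (later.added | later.removed) & (added | removed):
                raise ConcurrentModificationError(
                    f"conflict with version {later.version}; retry the job"
                )
        version = self.latest_version() + 1
        self.commits.append(Commit(version, timestamp, added, removed))
        return version

    def snapshot(self, version=None, timestamp=None):
        """Replay the log to get the live files as of a version or timestamp."""
        if version is None and timestamp is not None:
            eligible = [c.version for c in self.commits if c.timestamp <= timestamp]
            if not eligible:
                raise ValueError("no commit at or before that timestamp")
            version = eligible[-1]
        elif version is None:
            version = self.latest_version()
        files = set()
        for c in self.commits[: version + 1]:
            files -= c.removed
            files |= c.added
        return files


log = ToyTransactionLog()
log.commit(-1, 100, added={"part-0", "part-1"})                # version 0
# Two writers both start from version 0 and rewrite the same file.
log.commit(0, 200, added={"part-0-v2"}, removed={"part-0"})    # version 1
try:
    log.commit(0, 210, added={"part-0-v3"}, removed={"part-0"})
except ConcurrentModificationError:
    pass  # the second writer would re-read the table and retry
# A pure append prepared against version 0 does not conflict.
log.commit(0, 220, added={"part-2"})                           # version 2
```

In this model, log.snapshot(version=0) or log.snapshot(timestamp=150) rebuilds the original two-file table, while log.snapshot() returns the latest state. Old files are never deleted from the log's point of view, which is what makes reading an older version possible.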
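The schema management rules in the list above can be sketched in a similar way. The function below is a plain-Python illustration, not Delta Lake code: align_to_table_schema and SchemaMismatchError are hypothetical names, rows are modeled as dictionaries rather than DataFrames, and treating "compatible" as a simple isinstance check is an assumption.

```python
class SchemaMismatchError(Exception):
    """The incoming rows do not fit the table schema."""


def align_to_table_schema(rows, table_schema):
    """Validate rows (dicts) against table_schema ({column: type}).

    Columns missing from a row are filled with None (null); columns that
    are not in the table schema cause an exception.
    """
    aligned = []
    for row in rows:
        extra = set(row) - set(table_schema)
        if extra:
            raise SchemaMismatchError(
                f"columns not in table schema: {sorted(extra)}"
            )
        out = {}
        for col, col_type in table_schema.items():
            value = row.get(col)  # None when the column is absent
            # Assumed compatibility rule: non-null values must match the type.
            if value is not None and not isinstance(value, col_type):
                raise SchemaMismatchError(
                    f"column {col!r} expects {col_type.__name__}"
                )
            out[col] = value
        aligned.append(out)
    return aligned


table_schema = {"id": int, "name": str, "country": str}
rows = align_to_table_schema([{"id": 1, "name": "Ada"}], table_schema)
```

Here the missing country column comes back as None, while a row such as {"id": 2, "email": "x@example.com"} would be rejected because email is not part of the table schema.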
Apache Spark transformed the big data processing landscape and allowed engineers to build efficient data pipelines. However, we found a critical gap in how engineers manage their storage layer with big data, both on-premises and in the cloud. They had to go through workarounds and build complicated data pipelines to deliver data to consumers. With the advent of Delta Lake, we are seeing Databricks customers building reliable data lakes effortlessly at scale. Now we are open sourcing the Delta Lake project for the broader community to benefit as well.
The Delta Lake project is available to download at delta.io. We also welcome contributions and are excited to work with the community to make it even better. You can join our mailing list or Slack channel for discussions with the community. To try out Delta Lake in action in the cloud, sign up for a free trial in Databricks (Azure | AWS).