Introduction to Data Lakes
Data lakes provide a complete and authoritative data store that can power data analytics, business intelligence and machine learning
Introduction to data lakes
What is a data lake?
A data lake is a central location that holds a large amount of data in its native, raw format. Compared to a hierarchical data warehouse, which stores data in files or folders, a data lake uses a flat architecture and object storage to store the data. Object storage stores data with metadata tags and a unique identifier, which makes it easier to locate and retrieve data across regions, and improves performance. By leveraging inexpensive object storage and open formats, data lakes enable many applications to take advantage of the data.
Data lakes were developed in response to the limitations of data warehouses. While data warehouses provide businesses with highly performant and scalable analytics, they are expensive and proprietary and can’t handle the modern use cases most companies are looking to address. Data lakes are often used to consolidate all of an organization’s data in a single, central location, where it can be saved “as is,” without the need to impose a schema (i.e., a formal structure for how the data is organized) up front like a data warehouse does. Data in all stages of the refinement process can be stored in a data lake: raw data can be ingested and stored right alongside an organization’s structured, tabular data sources (like database tables), as well as intermediate data tables generated in the process of refining raw data. Unlike most databases and data warehouses, data lakes can process all data types — including unstructured and semi-structured data like images, video, audio and documents — which are critical for today’s machine learning and advanced analytics use cases.
Why would you use a data lake?
First and foremost, data lakes are open format, so users avoid lock-in to a proprietary system like a data warehouse, which has become increasingly important in modern data architectures. Data lakes are also highly durable and low cost, because of their ability to scale and leverage object storage. Additionally, advanced analytics and machine learning on unstructured data are some of the most strategic priorities for enterprises today. The unique ability to ingest raw data in a variety of formats (structured, unstructured, semi-structured), along with the other benefits mentioned, makes a data lake the clear choice for data storage.
When properly architected, data lakes enable the ability to:
Data lakes allow you to transform raw data into structured data that is ready for SQL analytics, data science and machine learning with low latency. Raw data can be retained indefinitely at low cost for future use in machine learning and analytics.
A centralized data lake eliminates problems with data silos (like data duplication, multiple security policies and difficulty with collaboration), offering downstream users a single place to look for all sources of data.
Any and all data types can be collected and retained indefinitely in a data lake, including batch and streaming data, video, image, binary files and more. And since the data lake provides a landing zone for new data, it is always up to date.
Data lake challenges
Despite their pros, many of the promises of data lakes have not been realized due to the lack of some critical features: no support for transactions, no enforcement of data quality or governance, and poor performance optimizations. As a result, most of the data lakes in the enterprise have become data swamps.
Without the proper tools in place, data lakes can suffer from data reliability issues that make it difficult for data scientists and analysts to reason about the data. These issues can stem from difficulty combining batch and streaming data, data corruption and other factors.
As the size of the data in a data lake increases, the performance of traditional query engines has traditionally gotten slower. Some of the bottlenecks include metadata management, improper data partitioning and others.
Data lakes are hard to properly secure and govern due to the lack of visibility and ability to delete or update data. These limitations make it very difficult to meet the requirements of regulatory bodies.
For these reasons, a traditional data lake on its own is not sufficient to meet the needs of businesses looking to innovate, which is why businesses often operate in complex architectures, with data siloed away in different storage systems: data warehouses, databases and other storage systems across the enterprise. Simplifying that architecture by unifying all your data in a data lake is the first step for companies that aspire to harness the power of machine learning and data analytics to win in the next decade.
How a lakehouse solves those challenges
The answer to the challenges of data lakes is the lakehouse, which adds a transactional storage layer on top. A lakehouse that uses similar data structures and data management features as those in a data warehouse but instead runs them directly on cloud data lakes. Ultimately, a lakehouse allows traditional analytics, data science and machine learning to coexist in the same system, all in an open format.
A lakehouse enables a wide range of new use cases for cross-functional enterprise-scale analytics, BI and machine learning projects that can unlock massive business value. Data analysts can harvest rich insights by querying the data lake using SQL, data scientists can join and enrich data sets to generate ML models with ever greater accuracy, data engineers can build automated ETL pipelines, and business intelligence analysts can create visual dashboards and reporting tools faster and easier than before. These use cases can all be performed on the data lake simultaneously, without lifting and shifting the data, even while new data is streaming in.
Building a lakehouse with Delta Lake
To build a successful lakehouse, organizations have turned to Delta Lake, an open format data management and governance layer that combines the best of both data lakes and data warehouses. Across industries, enterprises are leveraging Delta Lake to power collaboration by providing a reliable, single source of truth. By delivering quality, reliability, security and performance on your data lake — for both streaming and batch operations — Delta Lake eliminates data silos and makes analytics accessible across the enterprise. With Delta Lake, customers can build a cost-efficient, highly scalable lakehouse that eliminates data silos and provides self-serving analytics to end users.
data lakes vs. data lakehouses vs. data warehouses
Types of dataCostFormatScalabilityIntended usersReliabilityEase of usePerformance
Data lakeAll types: Structured data, semi-structured data, unstructured (raw) data$Open formatScales to hold any amount of data at low cost, regardless of typeLimited: Data scientistsLow quality, data swampDifficult: Exploring large amounts of raw data can be difficult without tools to organize and catalog the dataPoor
Data lakehouseAll types: Structured data, semi-structured data, unstructured (raw) data$Open formatScales to hold any amount of data at low cost, regardless of typeUnified: Data analysts, data scientists, machine learning engineersHigh quality, reliable dataSimple: Provides simplicity and structure of a data warehouse with the broader use cases of a data lakeHigh
Data warehouseStructured data only$$$Closed, proprietary formatScaling up becomes exponentially more expensive due to vendor costsLimited: Data analystsHigh quality, reliable dataSimple: Structure of a data warehouse enables users to quickly and easily access data for reporting and analyticsHigh
Lakehouse best practices
Save all of your data into your data lake without transforming or aggregating it to preserve it for machine learning and data lineage purposes.
Personally identifiable information (PII) must be pseudonymized in order to comply with GDPR and to ensure that it can be saved indefinitely.
Adding view-based ACLs (access control levels) enables more precise tuning and control over the security of your data lake than role-based controls alone.
The nature of big data has made it difficult to offer the same level of reliability and performance available with databases until now. Delta Lake brings these important features to data lakes.
Use data catalog and metadata management tools at the point of ingestion to enable self-service data science and analytics.
Shell has been undergoing a digital transformation as part of our ambition to deliver more and cleaner energy solutions. As part of this, we have been investing heavily in our data lake architecture. Our ambition has been to enable our data teams to rapidly query our massive data sets in the simplest possible way. The ability to execute rapid queries on petabyte scale data sets using standard BI tools is a game changer for us.
—Dan Jeavons, GM Data Science, Shell
History and evolution of data lakes
The early days of data management: databases
In the early days of data management, the relational database was the primary method that companies used to collect, store and analyze data. Relational databases, also known as relational database management systems (RDBMSes), offered a way for companies to store and analyze highly structured data about their customers using Structured Query Language (SQL). For many years, relational databases were sufficient for companies’ needs: the amount of data that needed to be stored was relatively small, and relational databases were simple and reliable. To this day, a relational database is still an excellent choice for storing highly structured data that’s not too big. However, the speed and scale of data was about to explode.
The rise of the internet, and data silos
With the rise of the internet, companies found themselves awash in customer data. To store all this data, a single database was no longer sufficient. Companies often built multiple databases organized by line of business to hold the data instead. As the volume of data grew and grew, companies could often end up with dozens of disconnected databases with different users and purposes.
On the one hand, this was a blessing: with more and better data, companies were able to more precisely target customers and manage their operations than ever before. On the other hand, this led to data silos: decentralized, fragmented stores of data across the organization. Without a way to centralize and synthesize their data, many companies failed to synthesize it into actionable insights. This pain led to the rise of the data warehouse.data silos
Data warehouses are born to unite companies’ structured data under one roof
With so much data stored in different source systems, companies needed a way to integrate them. The idea of a “360-degree view of the customer” became the idea of the day, and data warehouses were born to meet this need and unite disparate databases across the organization.
Data warehouses emerged as a technology that brings together an organization’s collection of relational databases under a single umbrella, allowing the data to be queried and viewed as a whole. At first, data warehouses were typically run on expensive, on-premises appliance-based hardware from vendors like Teradata and Vertica, and later became available in the cloud. Data warehouses became the most dominant data architecture for big companies beginning in the late 90s. The primary advantages of this technology included:
- Integration of many data sources
- Data optimized for read access
- Ability to run quick ad hoc analytical queries
- Data audit, governance and lineage
Data warehouses served their purpose well, but over time, the downsides to this technology became apparent.
- Inability to store unstructured, raw data
- Expensive, proprietary hardware and software
- Difficulty scaling due to the tight coupling of storage and compute power
Apache Hadoop™ and Spark™ enable unstructured data analysis, and set the stage for modern data lakes
With the rise of “big data” in the early 2000s, companies found that they needed to do analytics on data sets that could not conceivably fit on a single computer. Furthermore, the type of data they needed to analyze was not always neatly structured — companies needed ways to make use of unstructured data as well. To make big data analytics possible, and to address concerns about the cost and vendor lock-in of data warehouses, Apache Hadoop™ emerged as an open source distributed data processing technology.
What is Hadoop?
Apache Hadoop™ is a collection of open source software for big data analytics that allows large data sets to be processed with clusters of computers working in parallel. It includes Hadoop MapReduce, the Hadoop Distributed File System (HDFS) and YARN (Yet Another Resource Negotiator). HDFS allows a single data set to be stored across many different storage devices as if it were a single file. It works hand-in-hand with the MapReduce algorithm, which determines how to split up a large computational task (like a statistical count or aggregation) into much smaller tasks that can be run in parallel on a computing cluster.
The introduction of Hadoop was a watershed moment for big data analytics for two main reasons. First, it meant that some companies could conceivably shift away from expensive, proprietary data warehouse software to in-house computing clusters running free and open source Hadoop. Second, it allowed companies to analyze massive amounts of unstructured data in a way that was not possible before. Prior to Hadoop, companies with data warehouses could typically analyze only highly structured data, but now they could extract value from a much larger pool of data that included semi-structured and unstructured data. Once companies had the capability to analyze raw data, collecting and storing this data became increasingly important — setting the stage for the modern data lake.
Early data lakes were built on Hadoop
Early data lakes built on Hadoop MapReduce and HDFS enjoyed varying degrees of success. Many of these early data lakes used Apache Hive™ to enable users to query their data with a Hadoop-oriented SQL engine. Some early data lakes succeeded, while others failed due to Hadoop’s complexity and other factors. To this day, many people still associate the term “data lake” with Hadoop because it was the first framework to enable the collection and analysis of massive amounts of unstructured data. Today, however, many modern data lake architectures have shifted from on-premises Hadoop to running Spark in the cloud. Still, these initial attempts were important as these Hadoop data lakes were the precursors of the modern data lake. Over time, Hadoop’s popularity leveled off somewhat, as it has problems that most organizations can’t overcome like slow performance, limited security and lack of support for important use cases like streaming.
Apache Spark: Unified analytics engine powering modern data lakes
Shortly after the introduction of Hadoop, Apache Spark was introduced. Spark took the idea of MapReduce a step further, providing a powerful, generalized framework for distributed computations on big data. Over time, Spark became increasingly popular among data practitioners, largely because it was easy to use, performed well on benchmark tests, and provided additional functionality that increased its utility and broadened its appeal. For example, Spark’s interactive mode enabled data scientists to perform exploratory data analysis on huge data sets without having to spend time on low-value work like writing complex code to transform the data into a reliable source. Spark also made it possible to train machine learning models at scale, query big data sets using SQL, and rapidly process real-time data with Spark Streaming, increasing the number of users and potential applications of the technology significantly.
Since its introduction, Spark’s popularity has grown and grown, and it has become the de facto standard for big data processing, in no small part due to a committed base of community members and dedicated open source contributors. Today, many modern data lake architectures use Spark as the processing engine that enables data engineers and data scientists to perform ETL, refine their data, and train machine learning models.
What are the challenges with data lakes?
Challenge #1: Data reliability
Without the proper tools in place, data lakes can suffer from reliability issues that make it difficult for data scientists and analysts to reason about the data. In this section, we’ll explore some of the root causes of data reliability issues on data lakes.
Reprocessing data due to broken pipelines
With traditional data lakes, the need to continuously reprocess missing or corrupted data can become a major problem. It often occurs when someone is writing data into the data lake, but because of a hardware or software failure, the write job does not complete. In this scenario, data engineers must spend time and energy deleting any corrupted data, checking the remainder of the data for correctness, and setting up a new write job to fill any holes in the data.
Delta Lake solves the issue of reprocessing by making your data lake transactional, which means that every operation performed on it is atomic: it will either succeed completely or fail completely. There is no in between, which is good because the state of your data lake can be kept clean. As a result, data scientists don’t have to spend time tediously reprocessing the data due to partially failed writes. Instead, they can devote that time to finding insights in the data and building machine learning models to drive better business outcomes.
Data validation and quality enforcement
When thinking about data applications, as opposed to software applications, data validation is vital because without it, there is no way to gauge whether something in your data is broken or inaccurate which ultimately leads to poor reliability. With traditional software applications, it’s easy to know when something is wrong — you can see the button on your website isn’t in the right place, for example. With data applications, however, data quality problems can easily go undetected. Edge cases, corrupted data, or improper data types can surface at critical times and break your data pipeline. Worse yet, data errors like these can go undetected and skew your data, causing you to make poor business decisions.
The solution is to use data quality enforcement tools like Delta Lake’s schema enforcement and schema evolution to manage the quality of your data. These tools, alongside Delta Lake’s ACID transactions, make it possible to have complete confidence in your data, even as it evolves and changes throughout its lifecycle and ensure data reliability. Learn more about Delta Lake.
Combining batch and streaming data
With the increasing amount of data that is collected in real time, data lakes need the ability to easily capture and combine streaming data with historical, batch data so that they can remain updated at all times. Traditionally, many systems architects have turned to a lambda architecture to solve this problem, but lambda architectures require two separate code bases (one for batch and one for streaming), and are difficult to build and maintain.
With Delta Lake, every table can easily integrate these types of data, serving as a batch and streaming source and sink. Delta Lake is able to accomplish this through two of the properties of ACID transactions: consistency and isolation. These properties ensure that every viewer sees a consistent view of the data, even when multiple users are modifying the table at once, and even while new data is streaming into the table all at the same time.
Bulk updates, merges and deletes
Data lakes can hold a tremendous amount of data, and companies need ways to reliably perform update, merge and delete operations on that data so that it can remain up to date at all times. With traditional data lakes, it can be incredibly difficult to perform simple operations like these, and to confirm that they occurred successfully, because there is no mechanism to ensure data consistency. Without such a mechanism, it becomes difficult for data scientists to reason about their data.
One common way that updates, merges and deletes on data lakes become a pain point for companies is in relation to data regulations like the CCPA and GDPR. Under these regulations, companies are obligated to delete all of a customer’s information upon their request. With a traditional data lake, there are two challenges with fulfilling this request. Companies need to be able to:
- Query all the data in the data lake using SQL
- Delete any data relevant to that customer on a row-by-row basis, something that traditional analytics engines are not equipped to do
Delta Lake solves this issue by enabling data analysts to easily query all the data in their data lake using SQL. Then, analysts can perform updates, merges or deletes on the data with a single command, owing to Delta Lake’s ACID transactions. Read more about how to make your data lake CCPA compliant with a unified approach to data and analytics.
Challenge #2: Query performance
Query performance is a key driver of user satisfaction for data lake analytics tools. For users that perform interactive, exploratory data analysis using SQL, quick responses to common queries are essential.
Data lakes can hold millions of files and tables, so it’s important that your data lake query engine is optimized for performance at scale. Some of the major performance bottlenecks that can occur with data lakes are discussed below.
Having a large number of small files in a data lake (rather than larger files optimized for analytics) can slow down performance considerably due to limitations with I/O throughput. Delta Lake uses small file compaction to consolidate small files into larger ones that are optimized for read access.
Unnecessary reads from disk
Repeatedly accessing data from storage can slow query performance significantly. Delta Lake uses caching to selectively hold important tables in memory, so that they can be recalled quicker. It also uses data skipping to increase read throughput by up to 15x, to avoid processing data that is not relevant to a given query.
On modern data lakes that use cloud storage, files that are “deleted” can actually remain in the data lake for up to 30 days, creating unnecessary overhead that slows query performance. Delta Lake offers the VACUUM command to permanently delete files that are no longer needed.
Data indexing and partitioning
For proper query performance, the data lake should be properly indexed and partitioned along the dimensions by which it is most likely to be grouped. Delta Lake can create and maintain indices and partitions that are optimized for analytics.
Data lakes that grow to become multiple petabytes or more can become bottlenecked not by the data itself, but by the metadata that accompanies it. Delta Lake uses Spark to offer scalable metadata management that distributes its processing just like the data itself.
Challenge #3: Governance
Data lakes traditionally have been very hard to properly secure and provide adequate support for governance requirements. Laws such as GDPR and CCPA require that companies are able to delete all data related to a customer if they request it. Deleting or updating data in a regular Parquet Data Lake is compute-intensive and sometimes near impossible. All the files that pertain to the personal data being requested must be identified, ingested, filtered, written out as new files, and the original ones deleted. This must be done in a way that does not disrupt or corrupt queries on the table. Without easy ways to delete data, organizations are highly limited (and often fined) by regulatory bodies.
Data lakes also make it challenging to keep historical versions of data at a reasonable cost, because they require manual snapshots to be put in place and all those snapshots to be stored.
Data lake best practices
As shared in an earlier section, a lakehouse is a platform architecture that uses similar data structures and data management features to those in a data warehouse but instead runs them directly on the low-cost, flexible storage used for cloud data lakes. Advanced analytics and machine learning on unstructured data is one of the most strategic priorities for enterprises today, and with the ability to ingest raw data in a variety of formats (structured, unstructured, semi-structured), a data lake is the clear choice for the foundation for this new, simplified architecture. Ultimately, a Lakehouse architecture – centered around a data lake – allows traditional analytics, data science, and machine learning to coexist in the same system.
Use the data lake as a foundation and landing zone for raw data
As you add new data into your data lake, it’s important not to perform any data transformations on your raw data (with one exception for personally identifiable information — see below). Data should be saved in its native format, so that no information is inadvertently lost by aggregating or otherwise modifying it. Even cleansing the data of null values, for example, can be detrimental to good data scientists, who can seemingly squeeze additional analytical value out of not just data, but even the lack of it.
However, data engineers do need to strip out PII (personally identifiable information) from any data sources that contain it, replacing it with a unique ID, before those sources can be saved to the data lake. This process maintains the link between a person and their data for analytics purposes, but ensures user privacy, and compliance with data regulations like the GDPR and CCPA. Since one of the major aims of the data lake is to persist raw data assets indefinitely, this step enables the retention of data that would otherwise need to be thrown out.
Secure your lakehouse with role- and view-based access controls
Traditional role-based access controls (like IAM roles on AWS and Role-Based Access Controls on Azure) provide a good starting point for managing data lake security, but they’re not fine-grained enough for many applications. In comparison, view-based access controls allow precise slicing of permission boundaries down to the individual column, row or notebook cell level, using SQL views. SQL is the easiest way to implement such a model, given its ubiquity and easy ability to filter based upon conditions and predicates.
View-based access controls are available on modern unified data platforms, and can integrate with cloud native role-based controls via credential pass-through, eliminating the need to hand over sensitive cloud-provider credentials. Once set up, administrators can begin by mapping users to role-based permissions, then layer in finely tuned view-based permissions to expand or contract the permission set based upon each user’s specific circumstances. You should review access control permissions periodically to ensure they do not become stale.
Build reliability and ACID transactions into your lakehouse by using Delta Lake
Until recently, ACID transactions have not been possible on data lakes. However, they are now available with the introduction of open source Delta Lake, bringing the reliability and consistency of data warehouses to data lakes.
ACID properties (atomicity, consistency, isolation and durability) are properties of database transactions that are typically found in traditional relational database management systems systems (RDBMSes). They’re desirable for databases, data warehouses and data lakes alike because they ensure data reliability, integrity and trustworthiness by preventing some of the aforementioned sources of data contamination.
Delta Lake builds upon the speed and reliability of open source Parquet (already a highly performant file format), adding transactional guarantees, scalable metadata handling, and batch and streaming unification to it. It’s also 100% compatible with the Apache Spark API, so it works seamlessly with the Spark unified analytics engine. Learn more about Delta Lake with Michael Armbrust’s webinar entitled Delta Lake: Open Source Reliability for Data Lakes, or take a look at a quickstart guide to Delta Lake here.
Catalog the data in your lakehouse
In order to implement a successful lakehouse strategy, it’s important for users to properly catalog new data as it enters your data lake, and continually curate it to ensure that it remains updated. The data catalog is an organized, comprehensive store of table metadata, including table and column descriptions, schema, data lineage information and more. It is the primary way that downstream consumers (for example, BI and data analysts) can discover what data is available, what it means, and how to make use of it. It should be available to users on a central platform or in a shared repository.
At the point of ingestion, data stewards should encourage (or perhaps require) users to “tag” new data sources or tables with information about them — including business unit, project, owner, data quality level and so forth — so that they can be sorted and discovered easily. In a perfect world, this ethos of annotation swells into a company-wide commitment to carefully tag new data. At the very least, data stewards can require any new commits to the data lake to be annotated and, over time, hope to cultivate a culture of collaborative curation, whereby tagging and classifying the data becomes a mutual imperative.
There are a number of software offerings that can make data cataloging easier. The major cloud providers offer their own proprietary data catalog software offerings, namely Azure Data Catalog and AWS Glue. Outside of those, Apache Atlas is available as open source software, and other options include offerings from Alation, Collibra and Informatica, to name a few.
Get started with a lakehouse
Now that you understand the value and importance of building a lakehouse, the next step is to build the foundation of your lakehouse with Delta Lake. Check our our website to learn more or try Databricks for free.
Ready to get started?