Introduction to data lakes

A data lake provides a complete and reliable data store that powers data analytics, business intelligence and machine learning.


What is a data lake?

A data lake is a repository that collects vast amounts of data and stores it in its raw, native format. In contrast to a hierarchical data warehouse, which stores data in files and folders, a data lake uses a flat architecture and object storage. Object storage keeps data with metadata tags and a unique identifier, which makes it easy to locate and retrieve data across regions and improves performance. By leveraging inexpensive object storage and open formats, data lakes enable many applications to take advantage of the data.

Data lakes were developed in response to the limitations of data warehouses. While data warehouses provide businesses with highly performant and scalable analytics, they are expensive and proprietary and cannot handle the modern use cases most companies are looking to address. Most organizations use a data lake to consolidate their data in a single, central location, where it can be stored "as is," without having to impose a schema (a formal structure for how the data is organized) up front the way a data warehouse requires. Data at any stage of the refinement process can be stored in a data lake: raw ingested data can sit alongside an organization's structured, tabular data sources such as database tables, as well as the intermediate data tables generated in the course of refining that raw data. Unlike most databases and data warehouses, data lakes can handle all data types, including the unstructured and semi-structured data (images, video, audio, documents) that is critical for today's machine learning and advanced analytics use cases.

Advantages of a data lake

Because data lakes use open formats, users are not locked into a proprietary system the way they are with a data warehouse, a concern of growing importance in modern data architectures. Data lakes are also highly durable and low cost, because they scale on object storage. For today's enterprises, applying advanced analytics and machine learning to unstructured data is one of the most strategic priorities. The unique ability to ingest raw data in a variety of formats (structured, unstructured, semi-structured), along with the other benefits described here, makes the data lake the clear choice for data storage.

A well-designed data lake makes it possible to:

Power data science and machine learning

With a data lake, you can transform raw data into structured data that is ready for SQL analytics, data science and machine learning, with low latency. Raw data can be stored indefinitely at low cost for future use in machine learning and analytics.

Centralize, consolidate and catalog your data

A centralized data lake eliminates problems caused by data silos, such as data duplication, multiple security policies and difficulty collaborating. Downstream users get a single place where all of the data is consolidated and can be explored.

Quickly and seamlessly integrate diverse data sources and formats

Any data type, including batch and streaming data, video, images and binary files, can be stored in the data lake indefinitely. And because the data lake provides a landing zone for new data, it is always up to date.

Democratize your data by offering self-service tools

Data lakes are extremely flexible, enabling users with different skills who work with different tools and languages to run their own analytics tasks at the same time.

Data lake challenges

Despite their many benefits, data lakes lack critical features: they do not support transactions, do not enforce data quality, and lack governance and performance optimizations. For these reasons, data lakes have often failed to deliver on their original promise and have instead become data swamps, stagnant pools of accumulated data that cannot be put to use.

Lack of reliability

Without the proper tools in place, data in a data lake can suffer from reliability issues that make it difficult for data scientists and analysts to reason about it. These issues stem from the complexity of combining batch and streaming data, data corruption and other factors.

Slow performance

As the volume of data in a data lake grows, the performance of traditional query engines slows down. Bottlenecks have included metadata management and improper data partitioning.

Lack of security features

Because data lakes lack visibility and the ability to delete or update data, it is difficult to secure and govern them properly. These limitations make it extremely difficult to meet regulatory requirements.

For these reasons, a traditional data lake on its own is not enough to meet the needs of businesses looking to transform. As a result, companies end up operating complex architectures that span data warehouses, databases and other storage systems across the enterprise. For companies that want to succeed with machine learning and data analytics over the next decade, the first step is to unify all of the data in their data lake and simplify their architecture.

Solving those challenges with the lakehouse

The answer to the challenges of data lakes is the lakehouse, which adds a transactional storage layer on top of the data lake. A lakehouse uses data structures and data management features similar to those of a data warehouse, but runs them directly on cloud data lakes. It unifies traditional analytics, data science and machine learning in the same system, all in an open format.

A lakehouse unlocks tremendous business value by enabling a wide range of new use cases for cross-functional, enterprise-scale analytics, BI and machine learning projects. Data analysts can query the data in the data lake with SQL to extract meaningful insights. Data scientists can join and enrich data sets to build machine learning models with greater accuracy than ever before. Data engineers can build automated ETL pipelines, and BI analysts can create visual dashboards and reporting tools faster and more easily. All of these use cases can run on the data lake at the same time, without lifting or moving the data, even while new data is streaming in.

Building a lakehouse with Delta Lake

To build an effective lakehouse, organizations are turning to Delta Lake, an open format data management and governance layer that combines the best of both data lakes and data warehouses. Across industries, enterprises use Delta Lake as a single, reliable source of truth to power collaboration. Delta Lake brings quality, reliability, security and performance to the data lake, for both streaming and batch operations. It breaks down data silos, giving business units across the enterprise access to analytics. With Delta Lake, you can build a cost-efficient, highly scalable lakehouse that enables self-service analytics for end users.

Learn more about Delta Lake →

Comparing data lakes, data lakehouses and data warehouses (DWH)

The comparison covers data type, cost, format, scalability, users, reliability, ease of use and performance.

Data lake
  Data type: All types: structured, semi-structured, unstructured (raw) data
  Cost: $
  Format: Open
  Scalability: Scales to hold any type or volume of data at low cost
  Users: Limited: data scientists
  Reliability: Low quality, data swamp
  Ease of use: Difficult: exploring large amounts of raw data requires tools to organize and catalog the data

Lakehouse
  Data type: All types: structured, semi-structured, unstructured (raw) data
  Cost: $
  Format: Open
  Scalability: Scales to hold any type or volume of data at low cost
  Users: Unified: all types of users
  Reliability: Breaks away from the data swamp
  Ease of use: Simple: provides the simplicity and structure of a DWH with the broader use cases of a data lake

DWH
  Data type: Structured data only
  Cost: $$$
  Format: Closed, proprietary
  Scalability: Scaling drives exponential growth in vendor costs
  Users: Limited: data analysts
  Reliability: Breaks away from the data swamp
  Ease of use: Simple: the structure of a DWH gives users quick and easy access to data for reporting and analysis

Lakehouse best practices

Use the data lake as a landing zone for all of your data

All of your data can be kept in the data lake without transforming or aggregating it, preserving it for machine learning and data lineage purposes.

Anonymize data containing personal information before storing it in the data lake

To comply with the GDPR and make it possible to retain data indefinitely, personally identifiable information (PII) must be anonymized.

Secure your data lake with role- and view-based access controls

Adding view-based ACLs (access control lists) enables finer-grained tuning and control over data lake security than role-based controls alone can provide.

Build reliability and performance into your data lake with Delta Lake

The nature of big data has made it difficult to offer the same level of reliability and performance found in databases, but Delta Lake brings these capabilities to the data lake.

Catalog the data in your data lake

Use a data catalog and metadata management tools at the point of ingestion to enable self-service data science and analytics.

Read the data lake best practices guide →

"We have been on a digital transformation journey as part of our goal of delivering cleaner energy solutions, and we have invested heavily in a data lake architecture. We wanted to make querying our petabyte-scale data sets as simple and as fast as possible. Being able to run queries quickly with standard BI tools is a game changer for us."

Dan Jeavons, General Manager, Data Science, Shell

Read the case study →

History and evolution of data lakes

The early days of data management: databases

In the early days of data management, the relational database was the primary method that companies used to collect, store and analyze data. Relational databases, also known as relational database management systems (RDBMSes), offered a way for companies to store and analyze highly structured data about their customers using Structured Query Language (SQL). For many years, relational databases were sufficient for companies’ needs: the amount of data that needed to be stored was relatively small, and relational databases were simple and reliable. To this day, a relational database is still an excellent choice for storing highly structured data that’s not too big. However, the speed and scale of data was about to explode.

The rise of the internet, and data silos

With the rise of the internet, companies found themselves awash in customer data. To store all this data, a single database was no longer sufficient. Companies often built multiple databases organized by line of business to hold the data instead. As the volume of data grew and grew, companies could often end up with dozens of disconnected databases with different users and purposes.

On the one hand, this was a blessing: with more and better data, companies were able to more precisely target customers and manage their operations than ever before. On the other hand, this led to data silos: decentralized, fragmented stores of data across the organization. Without a way to centralize and synthesize their data, many companies failed to synthesize it into actionable insights. This pain led to the rise of the data warehouse.

Data warehouses are born to unite companies’ structured data under one roof

With so much data stored in different source systems, companies needed a way to integrate them. The idea of a “360-degree view of the customer” became the idea of the day, and data warehouses were born to meet this need and unite disparate databases across the organization.

Data warehouses emerged as a technology that brings together an organization’s collection of relational databases under a single umbrella, allowing the data to be queried and viewed as a whole. At first, data warehouses were typically run on expensive, on-premises appliance-based hardware from vendors like Teradata and Vertica, and later became available in the cloud. Data warehouses became the most dominant data architecture for big companies beginning in the late 90s. The primary advantages of this technology included:

  • Integration of many data sources
  • Data optimized for read access
  • Ability to run quick ad hoc analytical queries
  • Data audit, governance and lineage

Data warehouses served their purpose well, but over time, the downsides to this technology became apparent.

  • Inability to store unstructured, raw data
  • Expensive, proprietary hardware and software
  • Difficulty scaling due to the tight coupling of storage and compute power

Apache Hadoop™ and Spark™ enable unstructured data analysis, and set the stage for modern data lakes

With the rise of “big data” in the early 2000s, companies found that they needed to do analytics on data sets that could not conceivably fit on a single computer. Furthermore, the type of data they needed to analyze was not always neatly structured — companies needed ways to make use of unstructured data as well. To make big data analytics possible, and to address concerns about the cost and vendor lock-in of data warehouses, Apache Hadoop™ emerged as an open source distributed data processing technology.

What is Hadoop?

Apache Hadoop™ is a collection of open source software for big data analytics that allows large data sets to be processed with clusters of computers working in parallel. It includes Hadoop MapReduce, the Hadoop Distributed File System (HDFS) and YARN (Yet Another Resource Negotiator). HDFS allows a single data set to be stored across many different storage devices as if it were a single file. It works hand-in-hand with the MapReduce algorithm, which determines how to split up a large computational task (like a statistical count or aggregation) into much smaller tasks that can be run in parallel on a computing cluster.

The introduction of Hadoop was a watershed moment for big data analytics for two main reasons. First, it meant that some companies could conceivably shift away from expensive, proprietary data warehouse software to in-house computing clusters running free and open source Hadoop. Second, it allowed companies to analyze massive amounts of unstructured data in a way that was not possible before. Prior to Hadoop, companies with data warehouses could typically analyze only highly structured data, but now they could extract value from a much larger pool of data that included semi-structured and unstructured data. Once companies had the capability to analyze raw data, collecting and storing this data became increasingly important — setting the stage for the modern data lake.

Early data lakes were built on Hadoop

Early data lakes built on Hadoop MapReduce and HDFS enjoyed varying degrees of success. Many of these early data lakes used Apache Hive™ to enable users to query their data with a Hadoop-oriented SQL engine. Some early data lakes succeeded, while others failed due to Hadoop’s complexity and other factors. To this day, many people still associate the term “data lake” with Hadoop because it was the first framework to enable the collection and analysis of massive amounts of unstructured data. Today, however, many modern data lake architectures have shifted from on-premises Hadoop to running Spark in the cloud. Still, these initial attempts were important as these Hadoop data lakes were the precursors of the modern data lake. Over time, Hadoop’s popularity leveled off somewhat, as it has problems that most organizations can’t overcome, such as slow performance, limited security and a lack of support for important use cases like streaming.

Apache Spark: Unified analytics engine powering modern data lakes

Shortly after the introduction of Hadoop, Apache Spark was introduced. Spark took the idea of MapReduce a step further, providing a powerful, generalized framework for distributed computations on big data. Over time, Spark became increasingly popular among data practitioners, largely because it was easy to use, performed well on benchmark tests, and provided additional functionality that increased its utility and broadened its appeal. For example, Spark’s interactive mode enabled data scientists to perform exploratory data analysis on huge data sets without having to spend time on low-value work like writing complex code to transform the data into a reliable source. Spark also made it possible to train machine learning models at scale, query big data sets using SQL, and rapidly process real-time data with Spark Streaming, increasing the number of users and potential applications of the technology significantly.

Since its introduction, Spark’s popularity has grown and grown, and it has become the de facto standard for big data processing, in no small part due to a committed base of community members and dedicated open source contributors. Today, many modern data lake architectures use Spark as the processing engine that enables data engineers and data scientists to perform ETL, refine their data, and train machine learning models.

What are the challenges with data lakes?

Challenge #1: Data reliability

Without the proper tools in place, data lakes can suffer from reliability issues that make it difficult for data scientists and analysts to reason about the data. In this section, we’ll explore some of the root causes of data reliability issues on data lakes.

Reprocessing data due to broken pipelines

With traditional data lakes, the need to continuously reprocess missing or corrupted data can become a major problem. It often occurs when someone is writing data into the data lake, but because of a hardware or software failure, the write job does not complete. In this scenario, data engineers must spend time and energy deleting any corrupted data, checking the remainder of the data for correctness, and setting up a new write job to fill any holes in the data.

Delta Lake solves the issue of reprocessing by making your data lake transactional, which means that every operation performed on it is atomic: it will either succeed completely or fail completely. There is no in between, which is good because the state of your data lake can be kept clean. As a result, data scientists don’t have to spend time tediously reprocessing the data due to partially failed writes. Instead, they can devote that time to finding insights in the data and building machine learning models to drive better business outcomes.
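
To make this concrete, here is a minimal sketch of an atomic write using the open source Delta Lake Python APIs; it assumes PySpark and the delta-spark package are installed, and the table path and column names are illustrative rather than taken from the text.

```python
# Minimal sketch of an atomic Delta Lake write.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("atomic-writes")
    # Register Delta Lake's SQL extension and catalog implementation.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.range(0, 1000).withColumnRenamed("id", "event_id")

# Each append is a single ACID transaction: if the job fails partway through,
# readers never see partial files and there is nothing to clean up or reprocess.
events.write.format("delta").mode("append").save("/tmp/events_delta")
```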

Data validation and quality enforcement

When thinking about data applications, as opposed to software applications, data validation is vital because without it, there is no way to gauge whether something in your data is broken or inaccurate which ultimately leads to poor reliability. With traditional software applications, it’s easy to know when something is wrong — you can see the button on your website isn’t in the right place, for example. With data applications, however, data quality problems can easily go undetected. Edge cases, corrupted data, or improper data types can surface at critical times and break your data pipeline. Worse yet, data errors like these can go undetected and skew your data, causing you to make poor business decisions.

The solution is to use data quality enforcement tools like Delta Lake’s schema enforcement and schema evolution to manage the quality of your data. These tools, alongside Delta Lake’s ACID transactions, make it possible to have complete confidence in your data, even as it evolves and changes throughout its lifecycle and ensure data reliability. Learn more about Delta Lake.
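
As an illustration, the sketch below (continuing the session and illustrative table path from the previous example; the extra column is hypothetical) shows schema enforcement rejecting an unexpected column unless schema evolution is explicitly requested.

```python
# Sketch of Delta Lake schema enforcement vs. schema evolution.
from pyspark.sql.functions import lit

new_events = (
    spark.range(5)
    .withColumnRenamed("id", "event_id")
    .withColumn("country", lit("JP"))  # column not present in the table schema
)

# Schema enforcement: this write fails with an AnalysisException because
# "country" is not part of the table's schema, so bad columns cannot slip in.
# new_events.write.format("delta").mode("append").save("/tmp/events_delta")

# Schema evolution: opt in explicitly to evolve the schema and add the column.
(new_events.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/events_delta"))
```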

Combining batch and streaming data

With the increasing amount of data that is collected in real time, data lakes need the ability to easily capture and combine streaming data with historical, batch data so that they can remain updated at all times. Traditionally, many systems architects have turned to a lambda architecture to solve this problem, but lambda architectures require two separate code bases (one for batch and one for streaming), and are difficult to build and maintain.

With Delta Lake, every table can easily integrate these types of data, serving as a batch and streaming source and sink. Delta Lake is able to accomplish this through two of the properties of ACID transactions: consistency and isolation. These properties ensure that every viewer sees a consistent view of the data, even when multiple users are modifying the table at once, and even while new data is streaming into the table all at the same time.
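
A rough sketch of what this looks like with Structured Streaming, reusing the table from the earlier examples (the checkpoint and sink paths are illustrative):

```python
# Sketch: the same Delta table serves as a streaming source, while another
# Delta table acts as the streaming sink and can be queried in batch.
stream = spark.readStream.format("delta").load("/tmp/events_delta")

query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/events_copy/_checkpoint")
    .outputMode("append")
    .start("/tmp/events_copy")
)

# A concurrent batch query sees a consistent snapshot of the sink table, even
# while the stream above keeps appending new data to it.
print(spark.read.format("delta").load("/tmp/events_copy").count())
```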

Bulk updates, merges and deletes

Data lakes can hold a tremendous amount of data, and companies need ways to reliably perform update, merge and delete operations on that data so that it can remain up to date at all times. With traditional data lakes, it can be incredibly difficult to perform simple operations like these, and to confirm that they occurred successfully, because there is no mechanism to ensure data consistency. Without such a mechanism, it becomes difficult for data scientists to reason about their data.

One common way that updates, merges and deletes on data lakes become a pain point for companies is in relation to data regulations like the CCPA and GDPR. Under these regulations, companies are obligated to delete all of a customer’s information upon their request. With a traditional data lake, there are two challenges with fulfilling this request. Companies need to be able to:

  1. Query all the data in the data lake using SQL
  2. Delete any data relevant to that customer on a row-by-row basis, something that traditional analytics engines are not equipped to do

Delta Lake solves this issue by enabling data analysts to easily query all the data in their data lake using SQL. Then, analysts can perform updates, merges or deletes on the data with a single command, owing to Delta Lake’s ACID transactions. Read more about how to make your data lake CCPA compliant with a unified approach to data and analytics.
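
For example, a row-level delete might look like the following sketch; the table path, column and predicate are hypothetical, and the equivalent SQL statement is shown as a comment.

```python
# Sketch of a row-level delete on a Delta table, e.g. for a GDPR/CCPA request.
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/tmp/events_delta")

# Remove every row belonging to one customer in a single ACID transaction.
table.delete("event_id = 42")

# The same operation against a registered table, expressed in SQL:
# spark.sql("DELETE FROM events WHERE customer_id = '42'")
```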

Challenge #2: Query performance

Query performance is a key driver of user satisfaction for data lake analytics tools. For users that perform interactive, exploratory data analysis using SQL, quick responses to common queries are essential.

Data lakes can hold millions of files and tables, so it’s important that your data lake query engine is optimized for performance at scale. Some of the major performance bottlenecks that can occur with data lakes are discussed below.

Small files

Having a large number of small files in a data lake (rather than larger files optimized for analytics) can slow down performance considerably due to limitations with I/O throughput. Delta Lake uses small file compaction to consolidate small files into larger ones that are optimized for read access.
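
In recent open source Delta Lake releases (and on Databricks), compaction can also be triggered explicitly; a sketch, with an illustrative table path:

```python
# Sketch: rewrite many small files into fewer, larger files optimized for reads.
spark.sql("OPTIMIZE delta.`/tmp/events_delta`")
```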

Unnecessary reads from disk

Repeatedly accessing data from storage can slow query performance significantly. Delta Lake uses caching to selectively hold important tables in memory, so that they can be recalled more quickly. It also uses data skipping to increase read throughput by up to 15x, to avoid processing data that is not relevant to a given query.

Deleted files

On modern data lakes that use cloud storage, files that are “deleted” can actually remain in the data lake for up to 30 days, creating unnecessary overhead that slows query performance. Delta Lake offers the VACUUM command to permanently delete files that are no longer needed.
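
A sketch of the command, with an illustrative path; by default Delta Lake keeps unreferenced files for seven days before they become eligible for removal:

```python
# Sketch: permanently remove data files that are no longer referenced by the
# table and are older than the retention window (168 hours = 7 days).
spark.sql("VACUUM delta.`/tmp/events_delta` RETAIN 168 HOURS")
```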

Data indexing and partitioning

For proper query performance, the data lake should be properly indexed and partitioned along the dimensions by which it is most likely to be grouped. Delta Lake can create and maintain indices and partitions that are optimized for analytics.
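
For instance, a table that is usually filtered by a country column (a hypothetical example) could be written partitioned on that column, as in this sketch:

```python
# Sketch: partition a Delta table on a column it is frequently filtered or
# grouped by, so queries can skip whole partitions of files.
from pyspark.sql.functions import expr

events = spark.range(0, 1000).withColumn(
    "country", expr("CASE WHEN id % 2 = 0 THEN 'JP' ELSE 'US' END")
)

(events.write.format("delta")
    .mode("overwrite")
    .partitionBy("country")
    .save("/tmp/events_by_country"))
```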

Metadata management

Data lakes that grow to become multiple petabytes or more can become bottlenecked not by the data itself, but by the metadata that accompanies it. Delta Lake uses Spark to offer scalable metadata management that distributes its processing just like the data itself.

Challenge #3: Governance

Data lakes traditionally have been very hard to properly secure and provide adequate support for governance requirements. Laws such as GDPR and CCPA require that companies are able to delete all data related to a customer if they request it. Deleting or updating data in a regular Parquet Data Lake is compute-intensive and sometimes near impossible. All the files that pertain to the personal data being requested must be identified, ingested, filtered, written out as new files, and the original ones deleted. This must be done in a way that does not disrupt or corrupt queries on the table. Without easy ways to delete data, organizations are highly limited (and often fined) by regulatory bodies.

Data lakes also make it challenging to keep historical versions of data at a reasonable cost, because they require manual snapshots to be put in place and all those snapshots to be stored.

Data lake best practices

As shared in an earlier section, a lakehouse is a platform architecture that uses similar data structures and data management features to those in a data warehouse but instead runs them directly on the low-cost, flexible storage used for cloud data lakes. Advanced analytics and machine learning on unstructured data is one of the most strategic priorities for enterprises today, and with the ability to ingest raw data in a variety of formats (structured, unstructured, semi-structured), a data lake is the clear choice for the foundation for this new, simplified architecture. Ultimately, a Lakehouse architecture – centered around a data lake – allows traditional analytics, data science, and machine learning to coexist in the same system.

Use the data lake as a foundation and landing zone for raw data

As you add new data into your data lake, it’s important not to perform any data transformations on your raw data (with one exception for personally identifiable information — see below). Data should be saved in its native format, so that no information is inadvertently lost by aggregating or otherwise modifying it. Even cleansing the data of null values, for example, can be detrimental to good data scientists, who can seemingly squeeze additional analytical value out of not just data, but even the lack of it.

However, data engineers do need to strip out PII (personally identifiable information) from any data sources that contain it, replacing it with a unique ID, before those sources can be saved to the data lake. This process maintains the link between a person and their data for analytics purposes, but ensures user privacy, and compliance with data regulations like the GDPR and CCPA. Since one of the major aims of the data lake is to persist raw data assets indefinitely, this step enables the retention of data that would otherwise need to be thrown out.
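
One simple way to do this is to replace the identifier with a keyed hash before the write. The sketch below is illustrative only: the column names and salt are hypothetical, and a salted hash is strictly pseudonymization rather than full anonymization.

```python
# Sketch: replace a PII column with a stable pseudonymous key before the data
# lands in the lake (column names, salt and path are illustrative).
from pyspark.sql.functions import sha2, concat, lit, col

raw = spark.createDataFrame(
    [("alice@example.com", 120.0), ("bob@example.com", 75.5)],
    ["email", "purchase_amount"],
)

anonymized = (
    raw.withColumn("customer_key",
                   sha2(concat(lit("some-secret-salt"), col("email")), 256))
       .drop("email")  # the raw identifier never reaches the data lake
)

anonymized.write.format("delta").mode("append").save("/tmp/purchases_delta")
```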

Secure your lakehouse with role- and view-based access controls

Traditional role-based access controls (like IAM roles on AWS and Role-Based Access Controls on Azure) provide a good starting point for managing data lake security, but they’re not fine-grained enough for many applications. In comparison, view-based access controls allow precise slicing of permission boundaries down to the individual column, row or notebook cell level, using SQL views. SQL is the easiest way to implement such a model, given its ubiquity and easy ability to filter based upon conditions and predicates.

View-based access controls are available on modern unified data platforms, and can integrate with cloud native role-based controls via credential pass-through, eliminating the need to hand over sensitive cloud-provider credentials. Once set up, administrators can begin by mapping users to role-based permissions, then layer in finely tuned view-based permissions to expand or contract the permission set based upon each user’s specific circumstances. You should review access control permissions periodically to ensure they do not become stale.
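
As a sketch of the idea (the table, view and column names are hypothetical, and the GRANT syntax for exposing the view to a role varies by platform, so it is left as a comment):

```python
# Sketch: a SQL view that exposes only non-sensitive columns and a subset of
# rows; access is then granted on the view rather than the underlying table.
spark.sql("""
    CREATE OR REPLACE VIEW purchases_eu_analysts AS
    SELECT customer_key, purchase_amount
    FROM purchases
    WHERE region = 'EU'
""")

# Platform-specific, e.g.:
# GRANT SELECT ON VIEW purchases_eu_analysts TO `analysts`
```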

Build reliability and ACID transactions into your lakehouse by using Delta Lake

Until recently, ACID transactions have not been possible on data lakes. However, they are now available with the introduction of open source Delta Lake, bringing the reliability and consistency of data warehouses to data lakes.

ACID properties (atomicity, consistency, isolation and durability) are properties of database transactions that are typically found in traditional relational database management systems (RDBMSes). They’re desirable for databases, data warehouses and data lakes alike because they ensure data reliability, integrity and trustworthiness by preventing some of the aforementioned sources of data contamination.

Delta Lake builds upon the speed and reliability of open source Parquet (already a highly performant file format), adding transactional guarantees, scalable metadata handling, and batch and streaming unification to it. It’s also 100% compatible with the Apache Spark API, so it works seamlessly with the Spark unified analytics engine. Learn more about Delta Lake with Michael Armbrust’s webinar entitled Delta Lake: Open Source Reliability for Data Lakes, or take a look at a quickstart guide to Delta Lake here.
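
Because a Delta table is Parquet files plus a transaction log, an existing Parquet directory can be converted in place; a sketch, with an illustrative path:

```python
# Sketch: convert an existing Parquet directory into a Delta table in place.
from delta.tables import DeltaTable

DeltaTable.convertToDelta(spark, "parquet.`/tmp/legacy_parquet_table`")
```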

Catalog the data in your lakehouse

In order to implement a successful lakehouse strategy, it’s important for users to properly catalog new data as it enters your data lake, and continually curate it to ensure that it remains updated. The data catalog is an organized, comprehensive store of table metadata, including table and column descriptions, schema, data lineage information and more. It is the primary way that downstream consumers (for example, BI and data analysts) can discover what data is available, what it means, and how to make use of it. It should be available to users on a central platform or in a shared repository.

At the point of ingestion, data stewards should encourage (or perhaps require) users to “tag” new data sources or tables with information about them — including business unit, project, owner, data quality level and so forth — so that they can be sorted and discovered easily. In a perfect world, this ethos of annotation swells into a company-wide commitment to carefully tag new data. At the very least, data stewards can require any new commits to the data lake to be annotated and, over time, hope to cultivate a culture of collaborative curation, whereby tagging and classifying the data becomes a mutual imperative.
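
Some of this tagging can also be captured alongside the tables themselves; the sketch below uses Spark SQL table properties and comments, with a hypothetical table name and tag keys, as a lightweight complement to a dedicated catalog tool.

```python
# Sketch: attach lightweight tags and a description to a table at ingestion
# time so downstream users can discover and assess it.
spark.sql("""
    ALTER TABLE purchases SET TBLPROPERTIES (
        'owner' = 'growth-team',
        'project' = 'churn-analysis',
        'quality' = 'bronze'
    )
""")

spark.sql("COMMENT ON TABLE purchases IS 'Raw purchase events, ingested daily'")
```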

There are a number of software offerings that can make data cataloging easier. The major cloud providers offer their own proprietary data catalog software offerings, namely Azure Data Catalog and AWS Glue. Outside of those, Apache Atlas is available as open source software, and other options include offerings from Alation, Collibra and Informatica, to name a few.

Get started with a lakehouse

Now that you understand the value and importance of building a lakehouse, the next step is to build the foundation of your lakehouse with Delta Lake. Check out our website to learn more or try Databricks for free.

Start a free trial, or contact us with any questions.