What Is a Lakehouse?

Published: January 30, 2020

by Ben Lorica, Michael Armbrust, Reynold Xin, Matei Zaharia, and Ali Ghodsi

Over the past few years at Databricks, we've seen a new data management architecture that emerged independently across many customers and use cases: the lakehouse. In this post we describe this new architecture and its advantages over previous approaches.

Data warehouses have a long history in decision support and business intelligence applications. Since its inception in the late 1980s, data warehouse technology has continued to evolve, and MPP architectures have led to systems that can handle larger data sizes. But while warehouses are great for structured data, many modern enterprises have to deal with unstructured data, semi-structured data, and data with high variety, velocity, and volume. Data warehouses are not suited for many of these use cases, and they are certainly not the most cost-efficient.

As companies began to collect large amounts of data from many different sources, architects began envisioning a single system to house data for many different analytic products and workloads. About a decade ago companies began building data lakes: repositories for raw data in a variety of formats. While suitable for storing data, data lakes lack some critical features: they do not support transactions, they do not enforce data quality, and their lack of consistency and isolation makes it almost impossible to mix appends and reads, or batch and streaming jobs. For these reasons, many of the promises of data lakes have not materialized, and in many cases companies have lost many of the benefits of data warehouses.

The need for a flexible, high-performance system hasn't abated. Companies require systems for diverse data applications including SQL analytics, real-time monitoring, data science, and machine learning. Most of the recent advances in AI have been in better models to process unstructured data (text, images, video, audio), but these are precisely the types of data that a data warehouse is not optimized for. A common approach is to use multiple systems: a data lake, several data warehouses, and other specialized systems such as streaming, time-series, graph, and image databases. Having a multitude of systems introduces complexity and, more importantly, delay, because data professionals invariably need to move or copy data between different systems.

What is a lakehouse?

New systems are beginning to emerge that address the limitations of data lakes. A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design: implementing data structures and data management features similar to those in a data warehouse directly on top of low-cost cloud storage in open formats. They are what you would get if you had to redesign data warehouses in the modern world, now that cheap and highly reliable storage (in the form of object stores) is available.

A lakehouse has the following key features:

Transaction support: In an enterprise lakehouse many data pipelines will often be reading and writing data concurrently. Support for ACID transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL.
Schema enforcement and governance: A lakehouse should support schema enforcement and evolution, including DW schema architectures such as star and snowflake schemas. The system should be able to reason about data integrity, and it should have robust governance and auditing mechanisms.
BI support: Lakehouses enable using BI tools directly on the source data. This reduces staleness and latency, and it removes the cost of operationalizing two copies of the data in both a data lake and a warehouse.
Storage is decoupled from compute: In practice this means storage and compute use separate clusters, so these systems can scale to many more concurrent users and larger data sizes. Some modern data warehouses also have this property.
Openness: The storage formats they use are open and standardized, such as Parquet, and they provide an API so a variety of tools and engines, including machine learning and Python/R libraries, can efficiently access the data directly.
Support for diverse data types ranging from unstructured to structured data: The lakehouse can be used to store, refine, analyze, and access data types needed for many new data applications, including images, video, audio, semi-structured data, and text.
Support for diverse workloads: These include data science, machine learning, and SQL and analytics. Multiple tools might be needed to support all these workloads, but they all rely on the same data repository.
End-to-end streaming: Real-time reports are the norm in many enterprises. Support for streaming eliminates the need for separate systems dedicated to serving real-time data applications.
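
To make the first two features in the list above concrete, here is a minimal sketch of how transaction support and schema enforcement could be layered on plain files. It is a toy under stated assumptions, not the design of any particular lakehouse product: the names ToyTable, SchemaMismatch, CommitConflict, and the _log directory are all hypothetical, and a local directory stands in for cloud object storage. Writers publish a data file by atomically creating the next numbered log entry, so two concurrent writers cannot both claim the same version, and readers only see files that a log entry references.

```python
import json
import os
import tempfile
import uuid


class SchemaMismatch(Exception):
    """Raised when a row does not match the table schema."""


class CommitConflict(Exception):
    """Raised when another writer already claimed the same version."""


class ToyTable:
    """Hypothetical table: data files plus an ordered commit log in one directory."""

    def __init__(self, root, schema):
        self.root = root
        self.schema = schema  # column name -> expected Python type
        os.makedirs(os.path.join(root, "_log"), exist_ok=True)

    def latest_version(self):
        """Return the highest committed version, or -1 for an empty table."""
        entries = os.listdir(os.path.join(self.root, "_log"))
        return max((int(name.split(".")[0]) for name in entries), default=-1)

    def _validate(self, rows):
        for row in rows:
            if set(row) != set(self.schema):
                raise SchemaMismatch(f"columns {sorted(row)} != {sorted(self.schema)}")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise SchemaMismatch(f"{column!r} must be {expected.__name__}")

    def append(self, rows, read_version):
        """Write a data file, then publish it by creating the next log entry."""
        self._validate(rows)
        version = read_version + 1
        data_name = f"part-{uuid.uuid4().hex}.json"
        data_path = os.path.join(self.root, data_name)
        with open(data_path, "w") as handle:
            json.dump(rows, handle)
        log_path = os.path.join(self.root, "_log", f"{version:08d}.json")
        try:
            # Mode "x" fails if the file exists, so only one writer wins a version.
            with open(log_path, "x") as handle:
                json.dump({"add": data_name}, handle)
        except FileExistsError:
            os.remove(data_path)  # the unpublished file was never visible to readers
            raise CommitConflict(f"version {version} already committed")
        return version

    def read(self, version=None):
        """Return all rows visible at a version (a snapshot read)."""
        if version is None:
            version = self.latest_version()
        rows = []
        for v in range(version + 1):
            with open(os.path.join(self.root, "_log", f"{v:08d}.json")) as handle:
                entry = json.load(handle)
            with open(os.path.join(self.root, entry["add"])) as handle:
                rows.extend(json.load(handle))
        return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        table = ToyTable(root, {"id": int, "name": str})
        table.append([{"id": 1, "name": "ada"}], read_version=table.latest_version())
        print(table.read())
```

A writer that loses the race gets a CommitConflict and can re-read the latest version and retry, which is a simple form of optimistic concurrency. A real system would also need durable, atomic log writes on the object store, deletes and updates, and schema evolution, none of which this sketch attempts.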

These are the key attributes of lakehouses. Enterprise-grade systems require additional features. Tools for security and access control are basic requirements. Data governance capabilities including auditing, retention, and lineage have become essential, particularly in light of recent privacy regulations. Tools that enable data discovery, such as data catalogs and data usage metrics, are also needed. With a lakehouse, such enterprise features only need to be implemented, tested, and administered for a single system.

Read the full research paper on the inner workings of the Lakehouse.

Some early examples

The Databricks Lakehouse Platform has the architectural features of a lakehouse. Microsoft's Azure Synapse Analytics service, which integrates with Azure Databricks, enables a similar lakehouse pattern. Other managed services such as BigQuery and Redshift Spectrum have some of the lakehouse features listed above, but they focus primarily on BI and other SQL applications. Companies that want to build and implement their own systems have access to open source file formats (Delta Lake, Apache Iceberg, Apache Hudi) that are suitable for building a lakehouse.

Merging data lakes and data warehouses into a single system means that data teams can move faster, as they are able to use data without needing to access multiple systems. The level of SQL support and integration with BI tools among these early lakehouses is generally sufficient for most enterprise data warehouses. Materialized views and stored procedures are available, but users may need to employ other mechanisms that aren't equivalent to those found in traditional data warehouses. The latter is particularly important for "lift and shift" scenarios, which require systems that achieve semantics almost identical to those of older, commercial data warehouses.

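What about support for other types of data applications? Users of a lakehouse have access to a variety of standard tools (Spark, Python, R, machine learning libraries) for non-BI workloads such as data science and machine learning. Data exploration and refinement are standard for many analytic and data science applications. Delta Lake is designed to let users incrementally improve the quality of data in their lakehouse until it is ready for consumption.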
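
As a rough illustration of incremental refinement, the sketch below passes records through three stages: raw ingestion, cleaning, and a consumption-ready aggregate. It uses only plain Python lists and dictionaries and does not use Delta Lake itself; the function names and the "user,amount" record layout are assumptions made up for this example.

```python
from collections import defaultdict


def ingest_raw(lines):
    """Stage 1: keep every record as received, tagged with its line number."""
    return [{"line": n, "payload": text} for n, text in enumerate(lines, start=1)]


def clean(raw_records):
    """Stage 2: parse 'user,amount' payloads and drop rows that fail validation."""
    cleaned = []
    for record in raw_records:
        parts = record["payload"].split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        user, amount = parts[0].strip(), parts[1].strip()
        try:
            value = float(amount)
        except ValueError:
            continue  # amount is not numeric
        if user:
            cleaned.append({"user": user, "amount": value, "line": record["line"]})
    return cleaned


def aggregate(cleaned_records):
    """Stage 3: consumption-ready totals per user."""
    totals = defaultdict(float)
    for record in cleaned_records:
        totals[record["user"]] += record["amount"]
    return dict(totals)


if __name__ == "__main__":
    lines = ["alice,10", "bob,2.5", "broken", "alice,not-a-number", ",3", "alice, 5"]
    raw = ingest_raw(lines)
    print(len(raw), len(clean(raw)), aggregate(clean(raw)))
```

Keeping the raw stage untouched means the cleaning rules can be tightened later and the downstream stages recomputed, which is the point of improving quality in steps instead of discarding data at ingestion.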

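A note about technical building blocks: while distributed file systems can be used for the storage layer, object stores are more commonly used in lakehouses. Object stores provide low-cost, highly available storage that excels at massively parallel reads, an essential requirement for modern data warehouses.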
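
The parallel-read pattern can be sketched as follows. InMemoryObjectStore is a hypothetical stand-in for a real object store (it is not any vendor's API), with an artificial delay per request to mimic a network round trip; count_rows_parallel is likewise an illustrative helper. Because each object is fetched independently by key, many requests can be in flight at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store: blobs addressed by key."""

    def __init__(self, latency_seconds=0.05):
        self._objects = {}
        self._latency = latency_seconds

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        time.sleep(self._latency)  # simulate one network round trip per request
        return self._objects[key]

    def list(self, prefix):
        return sorted(key for key in self._objects if key.startswith(prefix))


def count_rows_parallel(store, prefix, max_workers=8):
    """Fetch every object under a prefix concurrently and count newline-delimited rows."""
    keys = store.list(prefix)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        blobs = list(pool.map(store.get, keys))
    return sum(len(blob.splitlines()) for blob in blobs)


if __name__ == "__main__":
    store = InMemoryObjectStore()
    for i in range(8):
        store.put(f"events/part-{i}.csv", b"a\nb\nc\n")
    print(count_rows_parallel(store, "events/"))
```

With eight workers, the eight simulated requests overlap instead of running back to back, so the total wait is close to one round trip. Real object stores have their own request limits and client libraries, which this sketch does not model.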

From BI to AI

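The lakehouse is a new data management architecture that radically simplifies enterprise data infrastructure and accelerates innovation in an age when machine learning is poised to disrupt every industry. In the past, most of the data that went into a company's products or decision making was structured data from operational systems, whereas today many products incorporate AI in the form of computer vision and speech models, text mining, and others. Why use a lakehouse instead of a data lake for AI? A lakehouse gives you the data versioning, governance, security, and ACID properties that are needed even for unstructured data.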

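Current lakehouses reduce cost, but their performance can still lag behind specialized systems (such as data warehouses) that have years of investment and real-world deployments behind them. Users may favor certain tools (BI tools, IDEs, notebooks) over others, so lakehouses will also need to improve their UX and their connectors to popular tools in order to appeal to a variety of personas. These and other issues will be addressed as the technology continues to mature and develop. Over time, lakehouses will close these gaps while retaining their core properties of being simpler, more cost-efficient, and more capable of serving diverse data applications.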

Read the FAQ on Data Lakehouse for more details.

