What is an open lakehouse? Open data standards, explained.

The open lakehouse, end to end: a definition, a reference architecture, and an open-source stack you can clone and run. Open formats, open engines, unified governance, and no vendor lock-in.

by Lisa Cao

  • An open lakehouse is a data lakehouse in which every layer (open formats, open engines, and unified governance) is built on open standards, so you can assemble, run, and own the whole stack with no vendor lock-in.
  • "Open" is only a label until you can show the commit history: Databricks was founded by the original creators of Apache Spark, created Delta Lake, Unity Catalog, and MLflow, and is a major contributor to Apache Iceberg.
  • We’ve developed one reference architecture that scales from two-person ETL to multi-cloud governance and production AI agents. Every layer is self-hostable and interchangeable, with no lock-in.

An open lakehouse is a data lakehouse in which every layer (storage, table format, engine, catalog, and the ML and AI tools on top) is built on open standards, so no layer is locked to a single vendor.

A data lakehouse combines the low-cost, scalable storage of a data lake with the management features and transactional guarantees of a data warehouse. An open lakehouse adds a further condition: every layer of the architecture is built on open standards. The data, the engines that process it, the catalog that governs it, and the tools that build models and applications on top of it are all open source, so none of them depends on a single vendor.

The word "open" gets thrown around a lot, and not always honestly. That is why it is worth pinning down. A format can be called open and still be practical to use through only one engine. A catalog can be called open and still travel with one platform. Storage stays portable right up until the egress charges make moving it expensive. An open lakehouse is what you get when those constraints are removed at every level. More recently, that same openness has extended to the AI and agent workloads now being built on the data.

How is an open lakehouse different from a data lake and a data warehouse?

An open lakehouse combines a data warehouse's reliability and a data lake's low-cost storage in one architecture, then adds one rule the other two lack: every layer must be open and interchangeable. A data warehouse holds clean, structured tables for reporting, with strong governance but high cost and little room for unstructured data. A data lake holds everything else, such as raw files, logs, and images, cheaply and at scale, but without transactions, schema guarantees, or much governance.

The lakehouse did not appear from nowhere. For years, teams kept the two systems side by side, copying data between them and reconciling two versions of the truth. The data lakehouse merged them: warehouse-style management and transactions applied directly to low-cost lake storage. The open lakehouse is the next turn of that idea. It keeps the combined architecture and adds the rule above, so the reliability of a warehouse and the economics of a lake arrive without binding any layer to a single vendor.

| Capability | Data warehouse | Data lake | Open lakehouse |
| --- | --- | --- | --- |
| Storage cost | High | Low | Low |
| ACID transactions | Yes | No | Yes |
| Governance and schema | Strong | Weak | Strong |
| Open formats, engine choice | No | Partial | Yes |
| BI, ML, and AI on one copy | BI mainly | ML mainly | BI, ML, and AI |

Open vs. proprietary lakehouse: what's the difference?

The difference between an open and a proprietary lakehouse comes down to one question: who can read your data and which engines can run on it. A proprietary lakehouse stores data in formats only one vendor can read, so switching tools can require rewriting or re-exporting all of your data. An open lakehouse stores data in open formats that any compatible engine can read, so you can add, swap, or remove a query engine without rewriting your data.

| Factor | Open lakehouse | Proprietary lakehouse |
| --- | --- | --- |
| Data formats | Open standards | Vendor-specific |
| Engine choice | Any compatible engine | Vendor's engine only |
| Vendor lock-in | Low | High |
| Catalog | Open, portable | Proprietary |
| Cost control | Multi-engine flexibility | Tied to one vendor |

What makes a lakehouse open?

Three things separate an open lakehouse from a proprietary one: open formats, open engines, and unified governance.

Start with open formats. The data lives in open table formats, Delta Lake and Apache Iceberg®, sitting on open file formats like Apache Parquet®. The specs are public, so any engine that implements them can read and write the data, not just one vendor’s runtime.

Open engines come next: the systems doing the processing are open source too. Apache Spark® covers batch, streaming, SQL, and machine learning in one runtime, and because the tables underneath are open, engines like DuckDB, Trino, or PyIceberg can work the same data without a second copy.

Unified governance is the part teams underestimate. One catalog handles access control, lineage, and auditing for every format and every engine, so governance attaches to the data itself instead of being rebuilt inside each tool that touches it. Unity Catalog plays that role here.

Put those together and the result is open from the storage layer to the serving layer: object storage underneath, an open table format above it, an open processing engine, an open catalog for governance, and open tooling for analytics, machine learning, and AI applications on top.

Is "open" the same as open-source?

Not quite, and the difference matters. Open source describes the license on a project's code. Open, in the lakehouse sense, is broader: it covers open file and table formats, open APIs and standard ways for tools to connect, and an ecosystem where many engines work on the same data. A platform can be built from open-source projects and still trap your data, for example by storing it in a layout only its own engine reads well. The honest test of openness is simple: you can read your data with any compatible tool, and move between tools without rewriting it.

Open table format vs. open lakehouse: what's the difference?

These two terms get used interchangeably, and they should not be. An open table format is one layer: it adds database-like features (reliable updates, schema changes, and history) on top of files in object storage. An open lakehouse is everything around that layer, the storage beneath it and the compute, catalog, and governance on top. The table format is a component. The lakehouse is the whole stack.

| Aspect | Open table format | Open lakehouse |
| --- | --- | --- |
| Scope | One layer | Full architecture |
| Role | Adds tables, updates, history to files | Combines storage, formats, compute, governance |
| Examples | Apache Iceberg, Delta Lake | A full platform built on those formats |
| Provides | ACID, schema evolution, time travel | Analytics, BI, ML, governance on one copy of data |

Two terms from the table: ACID (atomicity, consistency, isolation, durability) means each change to a table either completes fully or not at all, so concurrent or failed writes do not corrupt it, and time travel means you can view or restore the data as it looked at an earlier point.
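
To make those two terms concrete, here is a minimal sketch in plain Python of the idea behind a table format's transaction log. It is a toy model under stated assumptions, not how Delta Lake or Apache Iceberg are actually implemented: `ToyTable`, `commit`, and `snapshot` are hypothetical names, and real formats keep the log and the data as files in object storage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """A toy table: an append-only log of commits, each adding or removing data files."""
    log: list = field(default_factory=list)  # one entry per committed version

    def commit(self, expected_version: int, add=(), remove=()) -> int:
        # Optimistic concurrency: the commit is accepted only if nobody else
        # committed since this writer last read the table. It is all or nothing.
        if expected_version != len(self.log):
            raise RuntimeError("conflict: table changed, retry on the latest version")
        self.log.append({"add": set(add), "remove": set(remove)})
        return len(self.log)  # the new version number

    def snapshot(self, version=None) -> set:
        # Replay the log up to `version` to get the set of live data files.
        version = len(self.log) if version is None else version
        files = set()
        for entry in self.log[:version]:
            files |= entry["add"]
            files -= entry["remove"]
        return files


table = ToyTable()
v1 = table.commit(0, add={"part-0.parquet"})
v2 = table.commit(v1, add={"part-1.parquet"}, remove={"part-0.parquet"})

latest = table.snapshot()      # current state of the table
as_of_v1 = table.snapshot(v1)  # time travel: the table as it looked at version 1
```

A writer that still believes the table is at version 0 gets a `RuntimeError` instead of silently overwriting version 2, which is the behavior the "reliable updates" row in the table refers to.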

Which open standards power an open lakehouse?

This reference architecture is built from five open-source standards, each governed by a neutral foundation (the Apache Software Foundation or the Linux Foundation) and each owning a different layer of the stack. An open lakehouse is only as trustworthy as the projects it is built from, and these are not niche projects: they are standards much of the industry already runs on. Three of the five (Delta Lake, Unity Catalog, and MLflow) were created at Databricks, and a fourth, Apache Spark, was created by the team that went on to found it. The adoption is checkable: as of early 2026, Apache Spark is used by around 80% of the Fortune 500 and is the most widely adopted engine for large-scale data processing. MLflow passes 30 million downloads a month. Delta Lake and Apache Iceberg together cover the vast majority of lakehouse tables in production, with Delta Lake holding the largest installed base.

Together, that is roughly 90,000 GitHub stars and tens of millions of downloads a month. A snapshot of where each project stands (GitHub stars as of early 2026):

| Project | Layer | Adoption |
| --- | --- | --- |
| Apache Spark® | Processing engine | 43k+ GitHub stars; used by about 80% of the Fortune 500 |
| Delta Lake | Table format | 8k+ GitHub stars; the largest installed base of any open table format |
| Apache Iceberg | Table format | 8k+ GitHub stars; REST catalog adopted across the industry |
| Unity Catalog | Governance | 3k+ GitHub stars; donated to the LF AI & Data Foundation |
| MLflow | ML and AI | 26k+ GitHub stars; 30M+ downloads a month |

Apache Spark

Apache Spark® is the processing engine. It was created at the UC Berkeley AMPLab in 2009 by the team that went on to found Databricks, and was donated to the Apache Software Foundation, where it became one of the most widely used engines for large-scale data processing. One Spark runtime handles batch jobs, streaming, SQL, and machine learning, which is why a team can run a single engine rather than maintaining a different system for each kind of workload. In the lakehouse, Spark reads raw data, refines it in stages, and writes the results back as open tables.

Delta Lake

Delta Lake is a table format that makes object storage behave like a database instead of a pile of files. On top of ordinary Parquet files it adds ACID transactions, schema enforcement, and time travel, so concurrent jobs do not corrupt each other’s writes and a table can be queried as it looked at an earlier point in time. A companion library, Delta Kernel, packages the read and write logic into an engine-agnostic component, which makes it simpler for engines other than Spark to support the format. Delta Lake was created at Databricks, which remains its primary contributor, and is governed as a Linux Foundation project.

Apache Iceberg

Apache Iceberg is a second table format, built for very large analytic tables and for moving cleanly between engines. It came out of Netflix rather than Databricks, and is now a broadly adopted Apache project. Databricks is a major contributor, and its contributors include members of the Iceberg founding team, who joined through the Tabular acquisition. Iceberg's table specification and REST catalog make it easy for several engines to share the same tables, which is why it turns up wherever more than one query engine is in play. Supporting both Delta Lake and Iceberg means a team does not have to commit to one format on day one and live with that choice forever.

Unity Catalog

Unity Catalog is the governance layer. It keeps access policies, credential vending, and lineage in one place, and engines reach data through it rather than around it. Because the rules live in the catalog instead of inside any one engine, access control and lineage stay consistent whether a query comes from Spark, DuckDB, or another client. Unity Catalog was created at Databricks and is now an open-source project under the LF AI & Data Foundation, with a managed version available on Databricks. The open release is newer than the managed product, and a few governance features are still maturing in it, so it is worth checking the open project against your requirements.

Unity Catalog is not the only catalog option. Apache Polaris, Project Nessie, and the Hive Metastore are open-source catalogs that fill the same role, as does the managed AWS Glue Data Catalog, and the Iceberg REST catalog is emerging as a shared interface across them. An open lakehouse can use any of them. This reference architecture uses Unity Catalog because it governs data, ML, and AI assets together under one model, but the catalog layer is genuinely swappable.

MLflow

MLflow is the layer for machine learning and AI. It handles experiment tracking, model packaging, a model registry, evaluation, and serving, and that same machinery now reaches AI agents: tracing what an agent did, scoring its output against evaluators, and placing a gateway with budget limits and guardrails in front of it. Running models and agents on the same open platform that governs the data, rather than in a separate stack off to the side, is a large part of what makes this version of the lakehouse different. MLflow was created at Databricks, which remains its primary contributor, and is a Linux Foundation project.

How do the layers of an open lakehouse work together?

The layers connect through open interfaces, so they fit together without proprietary glue and any one of them can be swapped without disturbing the others. That is what turns five projects into a single architecture.

Start with the data. Spark writes tables to object storage as Delta Lake or Iceberg. Because those formats are open, and because Delta Kernel and the Iceberg REST catalog expose them in a neutral way, other engines read the same files directly. Nothing has to be copied into a proprietary store first, and there is no export step on the way out.

Governance sits across all of it. Every engine reaches data through Unity Catalog, so a policy written once is enforced everywhere. Bringing a new query engine into the picture means pointing it at the catalog, not re-creating access rules and lineage for it from scratch.
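
The "written once, enforced everywhere" idea can be sketched in a few lines of plain Python. This is an illustration only, not Unity Catalog's API: `ToyCatalog` and `ToyEngine` are hypothetical stand-ins, and the point is simply that the engines hold no access rules of their own.

```python
class ToyCatalog:
    """Holds grants in one place; every engine asks it before reading."""

    def __init__(self):
        self._grants = set()  # (principal, table) pairs

    def grant(self, principal: str, table: str) -> None:
        self._grants.add((principal, table))

    def revoke(self, principal: str, table: str) -> None:
        self._grants.discard((principal, table))

    def can_read(self, principal: str, table: str) -> bool:
        return (principal, table) in self._grants


class ToyEngine:
    """Stand-in for any query engine; it defers every access decision to the catalog."""

    def __init__(self, name: str, catalog: ToyCatalog):
        self.name = name
        self.catalog = catalog

    def read(self, principal: str, table: str) -> str:
        if not self.catalog.can_read(principal, table):
            raise PermissionError(f"{principal} may not read {table}")
        return f"{self.name} read {table} for {principal}"


catalog = ToyCatalog()
catalog.grant("analyst", "sales.gold_orders")

# Two different engines pointed at the same catalog see the same policy.
engines = [ToyEngine("spark", catalog), ToyEngine("duckdb", catalog)]
results = [engine.read("analyst", "sales.gold_orders") for engine in engines]

# Revoking once in the catalog takes effect for every engine at the same time.
catalog.revoke("analyst", "sales.gold_orders")
```

Adding a third engine here means constructing one more `ToyEngine` with the same catalog; no grants are copied or re-created.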

Models and agents draw on the same governed data. A model trained in MLflow reads the Gold tables Spark produced. An agent answering a question queries through Unity Catalog under the same policies a human analyst would. The lineage connecting raw input to a deployed model or an agent’s answer is recorded along the way. The AI layer is not bolted on at the end; it reads and writes through the same governed surface as everything else.

The practical payoff is independence between layers. A team can change processing engines, add a query engine, adopt a second table format, or swap in a different model framework without re-platforming the layers above or below, because each layer depends only on its neighbor’s open interface.
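
One way to picture that independence is code written against an interface rather than against a specific implementation. The sketch below uses Python's `typing.Protocol`; the `TableReader` interface and both reader classes are hypothetical and return hard-coded rows, so this shows only the shape of the dependency, not any real Delta Lake or Iceberg client.

```python
from typing import Protocol


class TableReader(Protocol):
    """The only thing the layer above needs from the table-format layer."""

    def rows(self, table: str) -> list[dict]: ...


class ToyDeltaReader:
    def rows(self, table: str) -> list[dict]:
        return [{"table": table, "format": "delta"}]


class ToyIcebergReader:
    def rows(self, table: str) -> list[dict]:
        return [{"table": table, "format": "iceberg"}]


def count_rows(reader: TableReader, table: str) -> int:
    # Written against the interface, so it does not change when the reader is swapped.
    return len(reader.rows(table))


delta_count = count_rows(ToyDeltaReader(), "gold.orders")
iceberg_count = count_rows(ToyIcebergReader(), "gold.orders")
```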

What are the benefits of an open lakehouse?

The main benefit of an open lakehouse is optionality: teams keep their choices open as data and AI needs grow, because no single layer is locked to a vendor. The specific benefits:

  • No lock-in: open formats let teams change tools without migrating or rewriting data.
  • Lower cost: one copy of data in cheap storage avoids duplicating it across systems.
  • Multi-engine: many engines can run on the same data for BI, SQL, and machine learning.
  • Unified governance: a single catalog applies consistent access controls and lineage.
  • AI-ready: trusted, unified data supports model training, feature stores, and AI agents.
  • Cloud freedom: workloads can run across multiple clouds without rebuilding.

How does an open lakehouse scale as you grow?

An open lakehouse scales by adding layers, not by re-platforming: you turn on more layers as the work calls for them, and the early layers stay in place as the later ones arrive. The same shape fits a two-person team and a company spread across regions; the difference is just how many layers are turned on.

  • Basic ETL. Object storage, Delta tables, and Spark. Raw data lands in Bronze, gets cleaned into Silver, and is served from Gold. That medallion setup is a complete open stack on its own.
  • Streaming and batch together. When data has to be processed as it arrives, Spark Structured Streaming and Real-Time Mode run streaming next to batch on the same engine, with Spark Declarative Pipelines describing the transformations. A second streaming system is not required.
  • A shared catalog, and the first agents. Once analysts, applications, dashboards, and machine learning all need the same data, Unity Catalog becomes the common governance layer they read through. This is usually where the first agents show up too, monitoring data quality, drafting pipelines, and exploring lineage, all governed through the catalog like any other consumer.
  • Enterprise scale. Across regions and clouds, Unity Catalog adds credential vending, fine-grained policies, Lakehouse Federation, and Delta Sharing, and Spark runs as a managed fleet on Kubernetes. The foundation is the same; there is just more governance and more compute.
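
The Bronze, Silver, Gold flow in the first step can be sketched without any engine at all. The example below uses plain Python lists in place of Spark and Delta tables, and the record fields (`order_id`, `amount`, `country`) are made up for illustration; in the real stack each stage would be a table written by Spark.

```python
raw_events = [  # Bronze: data as it landed, duplicates and bad rows included
    {"order_id": "1", "amount": "10.50", "country": "us"},
    {"order_id": "1", "amount": "10.50", "country": "us"},  # duplicate
    {"order_id": "2", "amount": "oops", "country": "DE"},   # unparseable amount
    {"order_id": "3", "amount": "4.00", "country": "de"},
]


def to_silver(bronze):
    """Clean: drop rows that fail parsing, normalize types, de-duplicate by order_id."""
    seen, silver = set(), []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if row["order_id"] in seen:
            continue  # skip duplicates
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "amount": amount,
            "country": row["country"].upper(),
        })
    return silver


def to_gold(silver):
    """Aggregate: revenue per country, ready to serve to BI or ML."""
    totals = {}
    for row in silver:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


silver = to_silver(raw_events)
gold = to_gold(silver)
```

Later steps in the list add layers around this flow (streaming, a catalog, agents) without changing its shape.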

How does AI fit into an open lakehouse?

In an open lakehouse, AI applications and agents are first-class workloads governed exactly like any other data consumer, not a separate system bolted on afterward. Most lakehouse explanations stop at business intelligence and machine learning, which by now reads like a 2021 diagram with an AI box stapled on the side. Treating agents as governed consumers of the same data is what actually makes this version current.

Because MLflow runs on the platform that governs the data, an agent is held to the same rules as everyone else. It reads through Unity Catalog, so it sees only what its permissions allow. Its activity is traced and its answers are scored by evaluators in MLflow, the same way a model’s quality is tracked, and the gateway in front of it can cap spend and apply guardrails. None of that requires a separate AI stack with its own copy of the data and its own weaker governance, which is the usual alternative.

Concretely:

  • Attribution: an agent runs as its own identity, so every action is traced to a principal rather than hidden behind a shared account.
  • Scoped credentials: the catalog vends short-lived, narrow credentials instead of long-lived keys, so an agent's access can be limited and revoked like anyone else's.
  • Lineage: an agent's reads and writes are recorded, so the path from source data through retrieval to an answer can be reconstructed for an audit.
  • Spend and guardrails: budget limits and content controls live at the gateway, outside the agent's own code.
  • Failure mode: because access runs through the catalog, what happens when the catalog is unavailable is an explicit design choice, such as denying access rather than falling back to direct, ungoverned reads.
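
Two of the points above, scoped short-lived credentials and an explicit failure mode, can be sketched together. This is a toy model in plain Python, not Unity Catalog's credential-vending API: `VendingCatalog`, `ToyCredential`, `read_table`, and the 15-minute default lifetime are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCredential:
    principal: str
    table: str         # scope: valid for this one table only
    expires_at: float  # epoch seconds

    def allows(self, table: str, now: float) -> bool:
        return table == self.table and now < self.expires_at


class VendingCatalog:
    def __init__(self, grants, available=True):
        self.grants = grants        # set of (principal, table) pairs
        self.available = available  # simulates a catalog outage when False

    def vend(self, principal, table, ttl_seconds=900, now=None):
        if not self.available:
            raise ConnectionError("catalog unavailable")
        if (principal, table) not in self.grants:
            raise PermissionError(f"{principal} has no grant on {table}")
        now = time.time() if now is None else now
        return ToyCredential(principal, table, now + ttl_seconds)


def read_table(catalog, principal, table):
    """Fail closed: if the catalog cannot be reached, deny instead of reading directly."""
    try:
        credential = catalog.vend(principal, table)
    except ConnectionError:
        return "denied: catalog unavailable"
    return f"read {table} as {credential.principal}"


catalog = VendingCatalog({("agent-1", "gold.orders")})
credential = catalog.vend("agent-1", "gold.orders", ttl_seconds=60, now=1000.0)
outage_result = read_table(VendingCatalog(set(), available=False), "agent-1", "gold.orders")
```

The credential stops working 60 seconds after it is issued and never works for another table, and the outage path returns a denial rather than falling back to an ungoverned read.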

An agent's identity can be its own service principal or a delegation of the user who invoked it (an on-behalf-of model), and that choice decides how its actions show up in the audit log. Note also that the standalone MLflow path described later gives you tracing and evaluation with no lakehouse, but the catalog-enforced access control above applies only once Unity Catalog is in place.

Here is what that looks like in practice. An agent requests a column it has not been granted; Unity Catalog denies the read, and the audit log records the agent's identity, the table and column it asked for, the time, and the deny decision. For the queries it is allowed to run, lineage links its answer back to the specific Gold tables it read.
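
That deny-and-record flow can be sketched as follows. The class, the field names in the audit record, and the `support-agent` principal are hypothetical; a real catalog's audit schema will differ, but the record carries the same items the paragraph lists.

```python
from datetime import datetime, timezone


class AuditedCatalog:
    """Toy catalog with column-level grants and an audit trail of every decision."""

    def __init__(self, column_grants):
        self.column_grants = column_grants  # {(principal, table): {allowed columns}}
        self.audit_log = []

    def authorize(self, principal, table, column):
        allowed = column in self.column_grants.get((principal, table), set())
        # Record the decision whether it was an allow or a deny.
        self.audit_log.append({
            "principal": principal,
            "table": table,
            "column": column,
            "time": datetime.now(timezone.utc).isoformat(),
            "decision": "ALLOW" if allowed else "DENY",
        })
        return allowed


catalog = AuditedCatalog({("support-agent", "gold.customers"): {"customer_id", "plan"}})
ok = catalog.authorize("support-agent", "gold.customers", "plan")
denied = catalog.authorize("support-agent", "gold.customers", "email")
last_record = catalog.audit_log[-1]
```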

The point is not that agents replace the rest of the architecture. It is that they slot into it. An agent is one more consumer of governed data, built and watched with the same open tools as the pipelines and models around it.

Do you actually need an open lakehouse?

Not always: an open lakehouse is the right call when openness and scale earn their keep, and overkill when they don't. It is a strong fit when you have many data types, multiple teams or engines sharing the same data, multi-cloud requirements, or a clear goal of avoiding lock-in. It can be overkill for a small, single-tool workload with simple, structured data and no need to move between tools.

A practical approach is to start with a limited-scope pilot tied to a real requirement, for example federating one existing source or moving one pipeline, before committing to a full migration. The architecture is built to be adopted one layer at a time, so the choice is not between everything and nothing.

Can you migrate to an open lakehouse incrementally?

Yes. An open lakehouse is designed to fit next to what you already run, one layer at a time, instead of asking you to tear down your existing stack first. Few teams begin with nothing; most already run a cloud data warehouse, or EMR, or a data catalog, or a half-finished Iceberg migration.

  • On an existing cloud data warehouse: federate it through Unity Catalog so governance and lineage reach it, and leave the data where it is.
  • On EMR Spark: move only the pipeline-authoring layer to Spark Declarative Pipelines and keep the clusters you already run.
  • On a separate catalog: have Unity Catalog emit open lineage events your existing catalog ingests, so the two work side by side.
  • Partway into Iceberg: put Unity Catalog on top of the catalog you have already set up and change nothing else.

The move is the same every time: replace one layer with an open version and leave the rest alone until there is a reason to touch it.

What is genuinely hard about an open lakehouse?

An open lakehouse has real sharp edges, and an honest account should name them. Open table formats accumulate small files and need regular compaction and maintenance. Writing to the same tables from several engines at once is still less mature than reading from them, so multi-engine writes need care. The open releases of some components trail their managed versions on a few features. And self-hosting the whole stack is real operational work. Openness at the format and catalog layers also does not erase every form of lock-in. Managed runtimes, proprietary query accelerators, and compute pricing are where platforms still capture you, and they are worth evaluating separately from the open layers. None of this argues against the architecture. It is the honest cost of owning every layer.
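
To give a feel for the small-file problem, here is a minimal sketch of how a compaction job might decide which files to rewrite together. It is a simple greedy grouping written for illustration, with a made-up 128 MB target; it is not the algorithm used by Delta Lake's `OPTIMIZE` command or Iceberg's `rewrite_data_files` procedure, which are the tools you would run in practice.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches of at most target_mb each.

    Files already at or above the target are left alone.
    """
    small = sorted(size for size in file_sizes_mb if size < target_mb)
    batches, current = [], []
    for size in small:
        # Close the current batch when adding this file would exceed the target.
        if current and sum(current) + size > target_mb:
            batches.append(current)
            current = []
        current.append(size)
    batches.append(current)
    # A batch of one file would be rewritten for no gain, so drop those.
    return [batch for batch in batches if len(batch) > 1]


plan = plan_compaction([4, 8, 200, 60, 60, 16, 130], target_mb=128)
```

In this example the four smallest files (4, 8, 16, and 60 MB) are merged into one 88 MB file, while the remaining 60 MB file and the two large files are left as they are.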

How do you run an open lakehouse yourself?

Because every layer is open source, you can run the whole thing yourself. There is an open reference implementation, at github.com/open-lakehouse/open-lakehouse, that stands the layers up together.

It starts Apache Spark, Apache Kafka®, Apache Airflow®, Apache Iceberg, Delta Lake, Unity Catalog, and MLflow locally under Docker, and includes configurations for deploying to the cloud. If all you need at first is the AI layer, you can start with MLflow on its own: a pip install and a few lines in an existing application are enough, and you can add the rest later.

For example, you can point MLflow at an existing agent with no lakehouse underneath. Your LangGraph or OpenAI agent runs unchanged, and its traces, prompts, and tool calls show up in MLflow. Your Postgres, vector store, and model provider stay where they are. You add the governed lakehouse underneath only when your agents need governed enterprise data.

The reference repository is Apache 2.0 licensed and maintained by the Databricks developer relations team. It is not new technology: it is the five proven projects above, wired together with Docker for when you want the whole stack in one place. Each layer also stands on its own; the bundle is a convenience, not a dependency. Running an open lakehouse yourself is realistic, but it is real work, and the repository treats it that way: it ships high-availability patterns, deployment configs, and observability so that self-hosting is a supported path and not just a claim.

What does an open lakehouse change for your role?

  • If you write SQL or dbt: Spark Declarative Pipelines gives you the same declarative, SQL-first authoring you know from dbt, and adds streaming alongside batch and full SQL and Python on one engine, all against governed open tables you can still query from your existing BI tool.
  • If you build data pipelines: you author them once, for example with Spark Declarative Pipelines, and run batch and streaming on a single engine.
  • If you build models or agents: you build, evaluate, and serve them in MLflow against the same governed data, under the same access rules as everyone else.
  • If you run the platform: you add layers as the business grows, and swap any one of them without re-platforming the rest.

Which platforms support open lakehouses?

Several platforms now implement open lakehouse principles. Databricks, Snowflake, Google Cloud, Microsoft Fabric, Cloudera, Dremio, Starburst, and Qlik all offer products in this space. These are products built on open lakehouse ideas, not the architecture itself, and they differ in how open each layer really is.

Databricks created the lakehouse category and delivers warehouse-grade performance on an open foundation, using Delta Lake and Apache Iceberg for storage and Unity Catalog for governance. The Databricks Platform offers this as a managed service for teams that prefer not to run the stack themselves.

Frequently asked questions

Is an open lakehouse the same as a data lakehouse?

A data lakehouse is the architecture: warehouse-style management on lake storage. An open lakehouse is a data lakehouse in which every layer (storage, table format, engine, catalog, and the ML and AI tools on top) is open source and interchangeable, so no layer depends on a single vendor.

Iceberg or Delta Lake: which should I use?

Both are open table formats with ACID transactions, schema evolution, and time travel, and an open lakehouse can use either or both. Delta Lake pairs closely with Spark and, through Delta Kernel, is increasingly readable by other engines; Iceberg is built for broad multi-engine access through its REST catalog. Supporting both means the choice does not have to be made up front.

Do I need all five projects to have an open lakehouse?

No. The architecture is meant to grow. A basic open lakehouse is just object storage, a table format, and Spark. Unity Catalog and MLflow are added when governance across many consumers, and machine learning or AI workloads, become part of the picture.

Can I run an open lakehouse without Databricks?

Yes. Every layer is open source and runs on your own infrastructure. The reference implementation starts the full stack locally with Docker, and the machine learning layer can be used on its own with a single pip install. Databricks offers managed versions of these open components, but self-hosting is a supported path.

Where do AI agents fit in an open lakehouse?

Agents are treated as governed consumers of data, not a separate system. They read through Unity Catalog under the same policies as people and other engines, and they are built, traced, and evaluated in MLflow alongside models, which keeps the AI layer inside the same open, governed architecture rather than bolted on.

How much does an open lakehouse cost?

The open-source components are free to use under the Apache 2.0 license; your costs are the object storage, the compute you run, and the operational effort to maintain the stack. Running it yourself trades licensing cost for engineering time, while a managed platform such as Databricks trades engineering time for a subscription, on the same open foundation.

Is an open lakehouse production-ready?

Yes. The five core projects are mature standards already running in production at large scale, for example Apache Spark across roughly 80% of the Fortune 500 and MLflow at more than 30 million downloads a month. The main production considerations are operational: table maintenance and compaction, care with multi-engine writes, and the work of self-hosting if you do not use a managed service.

Apache, Apache Spark, Apache Iceberg, Apache Kafka, Apache Airflow, Apache Parquet, and the Apache feather logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. Delta Lake, MLflow, and Unity Catalog are trademarks of LF Projects, LLC. All other marks are the property of their respective owners.

To see the whole thing working, clone the reference architecture and run it end to end at github.com/open-lakehouse/open-lakehouse. If you are starting from the AI side, MLflow on its own is a good entry point, and the rest of the stack is there when your models and agents need governed data behind them. Prefer a fully managed path? The same open foundation powers the Databricks lakehouse platform.
