The open lakehouse, end to end: a definition, a reference architecture, and an open-source stack you can clone and run. Open formats, open engines, unified governance, and no vendor lock-in.
by Lisa Cao
An open lakehouse is a data lakehouse in which every layer (storage, table format, engine, catalog, and the ML and AI tools on top) is built on open standards, so no layer is locked to a single vendor.
A data lakehouse combines the low-cost, scalable storage of a data lake with the management features and transactional guarantees of a data warehouse. An open lakehouse adds a further condition: every layer of the architecture is built on open standards. The data, the engines that process it, the catalog that governs it, and the tools that build models and applications on top of it are all open source, so none of them depends on a single vendor.
The word "open" gets thrown around a lot, and not always honestly. That is why it is worth pinning down. A format can be called open and still be practical to use through only one engine. A catalog can be called open and still travel with one platform. Storage stays portable right up until the egress charges make moving it expensive. An open lakehouse is what you get when those constraints are removed at every level. More recently, that same openness has extended to the AI and agent workloads now being built on the data.
An open lakehouse combines a data warehouse's reliability and a data lake's low-cost storage in one architecture, then adds one rule the other two lack: every layer must be open and interchangeable. A data warehouse holds clean, structured tables for reporting, with strong governance but high cost and little room for unstructured data. A data lake holds everything else, such as raw files, logs, and images, cheaply and at scale, but without transactions, schema guarantees, or much governance.
The lakehouse did not appear from nowhere. For years, teams kept the two systems side by side, copying data between them and reconciling two versions of the truth. The data lakehouse merged them: warehouse-style management and transactions applied directly to low-cost lake storage. The open lakehouse is the next turn of that idea. It keeps the combined architecture and adds the rule above, so the reliability of a warehouse and the economics of a lake arrive without binding any layer to a single vendor.
| Capability | Data warehouse | Data lake | Open lakehouse |
|---|---|---|---|
| Storage cost | High | Low | Low |
| ACID transactions | Yes | No | Yes |
| Governance and schema | Strong | Weak | Strong |
| Open formats, engine choice | No | Partial | Yes |
| BI, ML, and AI on one copy | BI mainly | ML mainly | BI, ML, and AI |
The difference between an open and a proprietary lakehouse comes down to who can read your data and which engines can run on it. A proprietary lakehouse stores data in formats only one vendor can read, so switching tools can require rewriting or re-exporting all of your data. An open lakehouse stores data in open formats that any compatible engine can read, so you can add, swap, or remove a query engine without rewriting your data.
| Factor | Open lakehouse | Proprietary lakehouse |
|---|---|---|
| Data formats | Open standards | Vendor-specific |
| Engine choice | Any compatible engine | Vendor's engine only |
| Vendor lock-in | Low | High |
| Catalog | Open, portable | Proprietary |
| Cost control | Multi-engine flexibility | Tied to one vendor |
Three things separate an open lakehouse from a proprietary one: open formats, open engines, and unified governance.
Start with open formats. The data lives in open table formats, Delta Lake and Apache Iceberg®, sitting on open file formats like Apache Parquet®. The specs are public, so any engine that implements them can read and write the data, not just one vendor’s runtime.
Open engines come next: the systems doing the processing are open source too. Apache Spark® covers batch, streaming, SQL, and machine learning in one runtime, and because the tables underneath are open, other engines and clients such as DuckDB, Trino, or PyIceberg can work on the same data without a second copy.
Unified governance is the part teams underestimate. One catalog handles access control, lineage, and auditing for every format and every engine, so governance attaches to the data itself instead of being rebuilt inside each tool that touches it. Unity Catalog plays that role here.
Put those together and the result is open from the storage layer to the serving layer: object storage underneath, an open table format above it, an open processing engine, an open catalog for governance, and open tooling for analytics, machine learning, and AI applications on top.
Open and open source are not quite the same thing, and the difference matters. Open source describes the license on a project's code. Open, in the lakehouse sense, is broader: it covers open file and table formats, open APIs and standard ways for tools to connect, and an ecosystem where many engines work on the same data. A platform can be built from open-source projects and still trap your data, for example by storing it in a layout only its own engine reads well. The honest test of openness is simple: you can read your data with any compatible tool, and move between tools without rewriting it.
The terms "open table format" and "open lakehouse" get used interchangeably, and they should not be. An open table format is one layer: it adds database-like features (reliable updates, schema changes, and history) on top of files in object storage. An open lakehouse is everything around that layer, the storage beneath it and the compute, catalog, and governance on top. The table format is a component. The lakehouse is the whole stack.
| Aspect | Open table format | Open lakehouse |
|---|---|---|
| Scope | One layer | Full architecture |
| Role | Adds tables, updates, history to files | Combines storage, formats, compute, governance |
| Examples | Apache Iceberg, Delta Lake | A full platform built on those formats |
| Provides | ACID, schema evolution, time travel | Analytics, BI, ML, governance on one copy of data |
Two terms from the table: ACID means data changes complete reliably without corrupting the table, and time travel means you can view or restore the data as it looked at an earlier point.
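To make those two terms concrete, here is a toy model in Python of the idea behind a table format's commit log. It is a sketch for illustration only, not Delta Lake's or Iceberg's actual protocol: `ToyTable` and its methods are hypothetical names, everything lives in memory, and real formats persist their log in object storage next to the data files.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTable:
    """In-memory stand-in for a table with an append-only commit log."""

    # Each entry is the full, immutable set of rows visible at that version.
    # Version 0 is the empty table.
    _versions: list = field(default_factory=lambda: [()])

    @property
    def latest_version(self) -> int:
        return len(self._versions) - 1

    def commit(self, new_rows: list, read_version: int) -> int:
        """Append rows as one all-or-nothing commit.

        The commit is rejected if another writer published a version after
        this writer read the table (optimistic concurrency).
        """
        if read_version != self.latest_version:
            raise RuntimeError("conflict: table changed since it was read")
        snapshot = self._versions[-1] + tuple(dict(r) for r in new_rows)
        self._versions.append(snapshot)  # one step publishes the new version
        return self.latest_version

    def read(self, version: Optional[int] = None) -> list:
        """Read the latest snapshot, or time travel to an earlier version."""
        v = self.latest_version if version is None else version
        return [dict(r) for r in self._versions[v]]


table = ToyTable()
v1 = table.commit([{"id": 1, "amount": 10}], read_version=0)
v2 = table.commit([{"id": 2, "amount": 25}], read_version=v1)

print(table.read())            # rows at the latest version
print(table.read(version=v1))  # rows as the table looked after the first commit
```

A writer that still holds `v1` after `v2` has landed gets a conflict instead of silently overwriting the newer data, which is the property the ACID row in the table refers to.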
This reference architecture is built from five open-source projects, each governed by a neutral foundation (the Apache Software Foundation or the Linux Foundation) and each owning a different layer of the stack. An open lakehouse is only as trustworthy as the projects it is built from, and these are not niche projects: they are standards much of the industry already runs on. Three of them (Delta Lake, Unity Catalog, and MLflow) were created at Databricks, and Apache Spark was created by the team that went on to found it. That is not a humblebrag; it is checkable. As of early 2026, Apache Spark is used by around 80% of the Fortune 500 and is the most widely adopted engine for large-scale data processing. MLflow passes 30 million downloads a month. Delta Lake and Apache Iceberg together cover the vast majority of lakehouse tables in production, with Delta Lake holding the largest installed base.
Together, that is roughly 90,000 GitHub stars and tens of millions of downloads a month. A snapshot of where each project stands (GitHub stars as of early 2026):
| Project | Layer | Adoption |
|---|---|---|
| Apache Spark® | Processing engine | 43k+ GitHub stars; used by about 80% of the Fortune 500 |
| Delta Lake | Table format | 8k+ GitHub stars; the largest installed base of any open table format |
| Apache Iceberg | Table format | 8k+ GitHub stars; REST catalog adopted across the industry |
| Unity Catalog | Governance | 3k+ GitHub stars; donated to the LF AI & Data Foundation |
| MLflow | ML and AI | 26k+ GitHub stars; 30M+ downloads a month |
Apache Spark® is the processing engine. It was created at the UC Berkeley AMPLab in 2009 by the team that went on to found Databricks, and was donated to the Apache Software Foundation, where it became one of the most widely used engines for large-scale data processing. One Spark runtime handles batch jobs, streaming, SQL, and machine learning, which is why a team can run a single engine rather than maintaining a different system for each kind of workload. In the lakehouse, Spark reads raw data, refines it in stages (commonly labeled Bronze, Silver, and Gold), and writes the results back as open tables.
Delta Lake is a table format that makes object storage behave like a database instead of a pile of files. On top of ordinary Parquet files it adds ACID transactions, schema enforcement, and time travel, so concurrent jobs do not corrupt each other’s writes and a table can be queried as it looked at an earlier point in time. A companion library, Delta Kernel, packages the read and write logic into an engine-agnostic component, which makes it simpler for engines other than Spark to support the format. Delta Lake was created at Databricks, which remains its primary contributor, and is governed as a Linux Foundation project.
Iceberg is a second table format, built for very large analytic tables and for moving cleanly between engines. It came out of Netflix rather than Databricks, and is now a broadly adopted Apache project. Databricks is a major contributor, including the Iceberg founding team who joined through the Tabular acquisition. Its table specification and REST catalog make it easy for several engines to share the same tables, which is why it turns up wherever more than one query engine is in play. Supporting both Delta Lake and Iceberg means a team does not have to commit to one format on day one and live with that choice forever.
Unity Catalog is the governance layer. It keeps access policies, credential vending, and lineage in one place, and engines reach data through it rather than around it. Because the rules live in the catalog instead of inside any one engine, access control and lineage stay consistent whether a query comes from Spark, DuckDB, or another client. Unity Catalog was created at Databricks and is now an open-source project under the LF AI & Data Foundation, with a managed version available on Databricks. The open release is newer than the managed product, and a few governance features are still maturing in it, so it is worth checking the open project against your requirements.
Unity Catalog is not the only catalog that can fill this layer. Apache Polaris, Project Nessie, the Hive Metastore, and AWS Glue play the same role (AWS Glue as a managed cloud service rather than an open-source project), and the Iceberg REST catalog is emerging as a shared interface across them. An open lakehouse can use any of them. This reference architecture uses Unity Catalog because it governs data, ML, and AI assets together under one model, but the catalog layer is genuinely swappable.
MLflow is the layer for machine learning and AI. It handles experiment tracking, model packaging, a model registry, evaluation, and serving, and that same machinery now reaches AI agents: tracing what an agent did, scoring its output against evaluators, and placing a gateway with budget limits and guardrails in front of it. Running models and agents on the same open platform that governs the data, rather than in a separate stack off to the side, is a large part of what makes this version of the lakehouse different. MLflow was created at Databricks, which remains its primary contributor, and is a Linux Foundation project.
The layers connect through open interfaces, so they fit together without proprietary glue and any one of them can be swapped without disturbing the others. That is what turns five projects into a single architecture.

Start with the data. Spark writes tables to object storage as Delta Lake or Iceberg. Because those formats are open, and because Delta Kernel and the Iceberg REST catalog expose them in a neutral way, other engines read the same files directly. Nothing has to be copied into a proprietary store first, and there is no export step on the way out.
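The hand-off described above can be sketched in Python. This is a minimal sketch under assumptions, not the reference implementation's code: it assumes the `delta-spark` package for the writer and the `deltalake` (delta-rs) and `pyarrow` packages for the reader, and the function names, table path, and sample rows are hypothetical. The imports sit inside the functions so the definitions load even where those packages are not installed.

```python
# Hypothetical sample data: a tiny refined "orders" table.
COLUMNS = ["order_id", "region", "amount"]
ORDERS = [(1, "EU", 10.0), (2, "US", 25.0)]


def write_with_spark(path: str) -> None:
    """Write the sample rows as a Delta table using Spark."""
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("open-lakehouse-sketch")
        # Standard settings that enable Delta Lake in a Spark session.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()
    try:
        df = spark.createDataFrame(ORDERS, COLUMNS)
        df.write.format("delta").mode("overwrite").save(path)
    finally:
        spark.stop()


def read_without_spark(path: str):
    """Read the same files with delta-rs: no Spark and no export step."""
    from deltalake import DeltaTable

    return DeltaTable(path).to_pyarrow_table()


# Usage, assuming the packages above are installed and the path is writable:
#   write_with_spark("/tmp/lakehouse/gold_orders")
#   print(read_without_spark("/tmp/lakehouse/gold_orders").num_rows)
```

The second function never touches Spark. It reads the Parquet files and the Delta log that the first function wrote, which is the "no second copy" property in practice.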

Governance sits across all of it. Every engine reaches data through Unity Catalog, so a policy written once is enforced everywhere. Bringing a new query engine into the picture means pointing it at the catalog, not re-creating access rules and lineage for it from scratch.

Models and agents draw on the same governed data. A model trained in MLflow reads the Gold tables Spark produced. An agent answering a question queries through Unity Catalog under the same policies a human analyst would. The lineage connecting raw input to a deployed model or an agent’s answer is recorded along the way. The AI layer is not bolted on at the end; it reads and writes through the same governed surface as everything else.

The practical payoff is independence between layers. A team can change processing engines, add a query engine, adopt a second table format, or swap in a different model framework without re-platforming the layers above or below, because each layer depends only on its neighbor’s open interface.
The main benefit of an open lakehouse is optionality: teams keep their choices open as data and AI needs grow, because no single layer is locked to a vendor.
An open lakehouse scales by adding layers, not by re-platforming: you turn on more layers as the work calls for them, and the early layers stay in place as the later ones arrive. The same shape fits a two-person team and a company spread across regions; the difference is just how many layers are turned on.
In an open lakehouse, AI applications and agents are first-class workloads governed exactly like any other data consumer, not a separate system bolted on afterward. Most lakehouse explanations stop at business intelligence and machine learning, which by now reads like a 2021 diagram with an AI box stapled on the side. Treating agents as governed consumers of the same data is what actually makes this version current.
Because MLflow runs on the platform that governs the data, an agent is held to the same rules as everyone else. It reads through Unity Catalog, so it sees only what its permissions allow. Its activity is traced and its answers are scored by evaluators in MLflow, the same way a model’s quality is tracked, and the gateway in front of it can cap spend and apply guardrails. None of that requires a separate AI stack with its own copy of the data and its own weaker governance, which is the usual alternative.
Two details matter here. An agent's identity can be its own service principal or a delegation of the user who invoked it (an on-behalf-of model), and that choice decides how its actions show up in the audit log. The standalone MLflow path described later gives you tracing and evaluation with no lakehouse, but the catalog-enforced access control above applies only once Unity Catalog is in place.
In practice, it looks like this. An agent requests a column it has not been granted; Unity Catalog denies the read, and the audit log records the agent's identity, the table and column it asked for, the time, and the deny decision. For the queries it is allowed to run, lineage links its answer back to the specific Gold tables it read.
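The deny-and-audit flow can be modeled in a few lines of Python. This is a toy sketch, not Unity Catalog's API: `ToyCatalog`, `AuditRecord`, and the principal, table, and column names are all hypothetical, and the rule that a delegated request is judged by the invoking user's grants is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class AuditRecord:
    principal: str               # identity the request ran as
    on_behalf_of: Optional[str]  # invoking user, if the agent was delegated
    table: str
    column: str
    decision: str                # "ALLOW" or "DENY"
    at: datetime


class ToyCatalog:
    """Hypothetical stand-in for a catalog that enforces column grants."""

    def __init__(self) -> None:
        self._grants: Set[Tuple[str, str, str]] = set()
        self.audit_log: List[AuditRecord] = []

    def grant(self, principal: str, table: str, column: str) -> None:
        self._grants.add((principal, table, column))

    def authorize(
        self,
        principal: str,
        table: str,
        column: str,
        on_behalf_of: Optional[str] = None,
    ) -> bool:
        # Assumption: with delegation the invoking user's grants decide;
        # otherwise the agent's own service principal does.
        effective = on_behalf_of or principal
        allowed = (effective, table, column) in self._grants
        # Every decision is recorded, allowed or not.
        self.audit_log.append(
            AuditRecord(
                principal=principal,
                on_behalf_of=on_behalf_of,
                table=table,
                column=column,
                decision="ALLOW" if allowed else "DENY",
                at=datetime.now(timezone.utc),
            )
        )
        return allowed


catalog = ToyCatalog()
catalog.grant("svc-support-agent", "gold.orders", "amount")

ok = catalog.authorize("svc-support-agent", "gold.orders", "amount")
denied = catalog.authorize("svc-support-agent", "gold.orders", "customer_email")
last = catalog.audit_log[-1]
print(ok, denied, last.decision, last.column)
```

Because the check and the log live in the catalog object and not in the caller, any engine or agent routed through it gets the same decision and leaves the same trail.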
The point is not that agents replace the rest of the architecture. It is that they slot into it. An agent is one more consumer of governed data, built and watched with the same open tools as the pipelines and models around it.
An open lakehouse is not always the right call: it fits when openness and scale earn their keep, and it is overkill when they don't. It is a strong fit when you have many data types, multiple teams or engines sharing the same data, multi-cloud requirements, or a clear goal of avoiding lock-in. It can be overkill for a small, single-tool workload with simple, structured data and no need to move between tools.
A practical approach is to start with a limited-scope pilot tied to a real requirement, for example federating one existing source or moving one pipeline, before committing to a full migration. The architecture is built to be adopted one layer at a time, so the choice is not between everything and nothing.
Yes, you can adopt it alongside your existing stack. An open lakehouse is designed to fit next to what you already run, one layer at a time, instead of asking you to tear down your existing stack first. Few teams begin with nothing; most already run a cloud data warehouse, or EMR, or a data catalog, or a half-finished Iceberg migration.
The move is the same every time: replace one layer with an open version and leave the rest alone until there is a reason to touch it.
An open lakehouse has real sharp edges, and an honest account should name them. Open table formats accumulate small files and need regular compaction and maintenance. Writing to the same tables from several engines at once is still less mature than reading from them, so multi-engine writes need care. The open releases of some components trail their managed versions on a few features. And self-hosting the whole stack is real operational work. Openness at the format and catalog layers also does not erase every form of lock-in. Managed runtimes, proprietary query accelerators, and compute pricing are where platforms still capture you, and they are worth evaluating separately from the open layers. None of this argues against the architecture. It is the honest cost of owning every layer.
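Of those costs, compaction is the easiest to picture in code. In practice this work is done by features such as Delta Lake's `OPTIMIZE` command or Iceberg's `rewrite_data_files` procedure; the sketch below is only a toy planner showing the underlying idea of grouping small files into larger rewrites. The function name, thresholds, and file sizes are hypothetical.

```python
from typing import List


def plan_compaction(
    file_sizes_mb: List[int], target_mb: int = 128, small_mb: int = 32
) -> List[List[int]]:
    """Group small files into rewrite batches of at most target_mb.

    Uses first-fit decreasing: take files largest first and place each one
    in the first batch that still has room.
    """
    small = sorted((s for s in file_sizes_mb if s < small_mb), reverse=True)
    batches: List[List[int]] = []
    for size in small:
        for batch in batches:
            if sum(batch) + size <= target_mb:
                batch.append(size)
                break
        else:
            batches.append([size])
    # Rewriting a batch of one file gains nothing, so drop those.
    return [b for b in batches if len(b) > 1]


sizes = [4, 8, 16, 200, 30, 2, 150, 31, 20]
for batch in plan_compaction(sizes, target_mb=64):
    print(f"rewrite {len(batch)} files ({sum(batch)} MB) into one")
```

Files that are already large (200 MB and 150 MB here) are left alone; only the small ones are merged, which is why compaction is routine maintenance and not a full rewrite of the table.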
Because every layer is open source, you can run the whole thing yourself. There is an open reference implementation, at github.com/open-lakehouse/open-lakehouse, that stands the layers up together.
It starts Apache Spark, Apache Kafka®, Apache Airflow®, Apache Iceberg, Delta Lake, Unity Catalog, and MLflow locally under Docker, with configurations for deploying to the cloud. If all you need at first is the AI layer, you can start with MLflow on its own: a pip install and a few lines in an existing application are enough, and you can add the rest later.
For example, you can point MLflow at an existing agent with no lakehouse underneath:
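A minimal sketch of what that can look like, using MLflow's `mlflow.trace` decorator. The agent here is a hypothetical stub (`lookup_order` and `answer` are invented for illustration, and the model call is faked), and the `try`/`except` fallback exists only so the sketch still runs where MLflow is not installed; a real application would simply `import mlflow`.

```python
try:
    import mlflow  # pip install mlflow

    trace = mlflow.trace
except (ImportError, AttributeError):
    # Fallback so this sketch runs without MLflow (or with an old version).
    def trace(fn):
        return fn


# For an OpenAI or LangGraph agent, MLflow's autologging can capture calls
# without decorators, e.g. mlflow.openai.autolog() or mlflow.langchain.autolog().


@trace
def lookup_order(order_id: str) -> dict:
    """Hypothetical tool call; a real agent would query Postgres or an API."""
    return {"order_id": order_id, "status": "shipped"}


@trace
def answer(question: str, order_id: str) -> str:
    """Hypothetical agent entry point; the model call is stubbed out."""
    order = lookup_order(order_id)
    return f"Order {order['order_id']} is {order['status']}."


print(answer("Where is my order?", "A-1001"))
```

With MLflow installed, each call to `answer` is recorded as a trace with `lookup_order` as a nested span. Where the traces are stored depends on your tracking configuration; `mlflow.set_tracking_uri(...)` points the client at a specific server.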
Your LangGraph or OpenAI agent runs unchanged, and its traces, prompts, and tool calls show up in MLflow. Your Postgres, vector store, and model provider stay where they are. You add the governed lakehouse underneath only when your agents need governed enterprise data.
The reference repository is Apache 2.0 licensed and maintained by the Databricks developer relations team. It is not new technology: it is the five proven projects above, wired together with Docker for when you want the whole stack in one place. Each layer also stands on its own; the bundle is a convenience, not a dependency. Running an open lakehouse yourself is realistic, but it is real work, and the repository treats it that way: it ships high-availability patterns, deployment configs, and observability so that self-hosting is a supported path and not just a claim.
Several platforms now implement open lakehouse principles. Databricks, Snowflake, Google Cloud, Microsoft Fabric, Cloudera, Dremio, Starburst, and Qlik all offer products in this space. These are products built on open lakehouse ideas, not the architecture itself, and they differ in how open each layer really is.
Databricks created the lakehouse category and delivers warehouse-grade performance on an open foundation, using Delta Lake and Apache Iceberg for storage and Unity Catalog for governance. The Databricks Platform offers this as a managed service for teams that prefer not to run the stack themselves.
A data lakehouse is the architecture: warehouse-style management on lake storage. An open lakehouse is a data lakehouse in which every layer (storage, table format, engine, catalog, and the ML and AI tools on top) is open source and interchangeable, so no layer depends on a single vendor.
Delta Lake and Apache Iceberg are both open table formats with ACID transactions, schema evolution, and time travel, and an open lakehouse can use either or both. Delta Lake pairs closely with Spark and, through Delta Kernel, is increasingly readable by other engines; Iceberg is built for broad multi-engine access through its REST catalog. Supporting both means the choice does not have to be made up front.
No, you do not need every layer on day one. The architecture is meant to grow. A basic open lakehouse is just object storage, a table format, and Spark. Unity Catalog is added when governance across many consumers becomes part of the picture, and MLflow when machine learning or AI workloads do.
Yes, you can run it yourself. Every layer is open source and runs on your own infrastructure. The reference implementation starts the full stack locally with Docker, and the machine learning layer can be used on its own with a single pip install. Databricks offers managed versions of these open components, but self-hosting is a supported path.
Agents are treated as governed consumers of data, not a separate system. They read through Unity Catalog under the same policies as people and other engines, and they are built, traced, and evaluated in MLflow alongside models, which keeps the AI layer inside the same open, governed architecture rather than bolted on.
The open-source components are free to use under the Apache 2.0 license; your costs are the object storage, the compute you run, and the operational effort to maintain the stack. Running it yourself trades licensing cost for engineering time, while a managed platform such as Databricks trades engineering time for a subscription, on the same open foundation.
Yes, the stack is production-ready. The five core projects are mature standards already running in production at large scale, for example Apache Spark across roughly 80% of the Fortune 500 and MLflow at more than 30 million downloads a month. The main production considerations are operational: table maintenance and compaction, care with multi-engine writes, and the work of self-hosting if you do not use a managed service.
Apache, Apache Spark, Apache Iceberg, Apache Kafka, Apache Airflow, Apache Parquet, and the Apache feather logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. Delta Lake, MLflow, and Unity Catalog are trademarks of LF Projects, LLC. All other marks are the property of their respective owners.
To see the whole thing working, clone the reference architecture and run it end to end at github.com/open-lakehouse/open-lakehouse. If you are starting from the AI side, MLflow on its own is a good entry point, and the rest of the stack is there when your models and agents need governed data behind them. Prefer a fully managed path? The same open foundation powers the Databricks lakehouse platform.