Introducing Data Cleanrooms for the Lakehouse

Matei Zaharia
Itai Weiss
Steve Mahoney
Sachin Thakur
Dan Morris
Jay Bhankharia

We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language - Python, R, SQL, Java, and Scala - on the data while maintaining data privacy.

With the demand for external data greater than ever, organizations are looking for ways to securely exchange their data and consume external data to foster data-driven innovation. Historically, organizations have used data sharing solutions to share data with their partners and relied on mutual trust to preserve data privacy. But organizations relinquish control over the data once it is shared and have little to no visibility into how their partners consume it across various platforms. This exposes them to potential data misuse and data privacy breaches. With stringent data privacy regulations, it is imperative for organizations to have control over, and visibility into, how their sensitive data is consumed. As a result, organizations need a secure, controlled, and private way to collaborate on data, and this is where data cleanrooms come into the picture.

This blog will discuss data cleanrooms, the demand for data cleanrooms, and our vision for a scalable data cleanroom on the Databricks Lakehouse Platform.

What is a Data Cleanroom and why does it matter for your business?

A data cleanroom provides a secure, governed, and privacy-safe environment, in which multiple participants can join their first-party data and perform analysis on the data, without the risk of exposing their data to other participants. Participants have full control of their data and can decide which participants can perform what analysis on their data without exposing any sensitive data such as personally identifiable information (PII).
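
As a rough illustration of the idea above (and not a description of how any particular cleanroom product is implemented), the following sketch joins two participants' records on a salted hash of a shared identifier and releases only aggregates. The function names, the shared salt, and the minimum group size are all hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict

# Hypothetical threshold: groups smaller than this are suppressed so that
# an aggregate cannot be traced back to one or two individuals.
MIN_GROUP_SIZE = 3


def hash_id(email: str, salt: str) -> str:
    """Replace a raw identifier with a salted SHA-256 digest."""
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def cleanroom_sales_by_campaign(ad_exposures, pos_sales, salt,
                                min_group_size=MIN_GROUP_SIZE):
    """Join two parties' data on hashed IDs and return aggregates only.

    ad_exposures: iterable of (email, campaign) from the advertiser
    pos_sales:    iterable of (email, amount) from the retailer
    """
    # Total spend per hashed customer (retailer side).
    spend = defaultdict(float)
    for email, amount in pos_sales:
        spend[hash_id(email, salt)] += amount

    # Hashed customers who saw each campaign and also made a purchase.
    buyers = defaultdict(set)
    for email, campaign in ad_exposures:
        key = hash_id(email, salt)
        if key in spend:
            buyers[campaign].add(key)

    # Release only campaign-level totals, and only for large enough groups.
    result = {}
    for campaign, keys in buyers.items():
        if len(keys) >= min_group_size:
            result[campaign] = {
                "matched_customers": len(keys),
                "total_sales": round(sum(spend[k] for k in keys), 2),
            }
    return result
```

Neither participant receives the other's row-level records here: the output contains only per-campaign counts and totals, and campaigns with too few matched customers are dropped entirely. Note that salted hashing on its own is not a strong privacy guarantee; in this sketch the protection comes from running the join in an environment neither party can inspect and from releasing only thresholded aggregates.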

Data cleanrooms open a broad array of use cases across industries. For example, consumer packaged goods (CPG) companies can see sales uplift by joining their first-party advertisement data with the point of sale (POS) transactional data of their retail partners. In the media industry, advertisers and marketers can deliver more targeted ads, with broader reach, better segmentation, and greater transparency into ad effectiveness, while safeguarding data privacy. Financial services companies can collaborate across the value chain to establish proactive fraud detection or anti-money laundering strategies. In fact, IDC predicts that by 2024, 65% of G2000 Enterprises will form data-sharing partnerships with external stakeholders via data cleanrooms to increase interdependence while safeguarding data privacy.

Privacy-safe data cleanroom

Let’s look at some of the compelling reasons driving the demand for cleanrooms:

Rapidly changing security, compliance, and privacy landscape: Stringent data privacy regulations such as GDPR and CCPA, along with sweeping changes in third-party measurement, have transformed how organizations collect, use, and share data, particularly for advertising and marketing use cases. For example, Apple’s App Tracking Transparency Framework (ATT) provides users of Apple devices the freedom and flexibility to easily opt out of app tracking. Google also plans to phase out support for third-party cookies in Chrome by late 2023. As these privacy laws and practices evolve, the demand for data cleanrooms is likely to rise as the industry moves to new identifiers that are PII-based, such as UID 2.0. Organizations will try to find new solutions to join data with their partners in a privacy-centric way to achieve their business objectives in the cookie-less reality.

Collaboration in a fragmented data ecosystem: Today, consumers have more options than ever before when it comes to where, when, and how they engage with content. As a result, the digital footprint of consumers is fragmented across different platforms, necessitating that companies collaborate with their partners to create a unified view of their customers’ needs and requirements. To facilitate collaboration across organizations, cleanrooms provide a secure and private way to combine their data with other data to unlock new insights or capabilities.

New ways to monetize data: Most organizations either already have or are looking to develop monetization strategies for their existing data or IP. With today’s privacy laws, companies will try to find any possible advantages to monetize their data without the risk of breaking privacy rules. This creates an opportunity for data vendors or publishers to join data for big data analytics without having direct access to the data.

Existing data cleanroom solutions come with big drawbacks

As organizations explore various cleanroom solutions, they run into some glaring shortcomings: existing solutions neither realize the full potential of cleanrooms nor meet the business requirements of organizations.

Data movement and replication: Existing data cleanroom vendors require participants to move their data into the vendor platforms, which results in platform lock-in and added data storage costs for the participants. Additionally, it is time-consuming for participants to prepare the data in a standardized format before performing any analysis on the aggregated data. Furthermore, participants have to replicate the data across different clouds and regions to facilitate collaboration with participants on different clouds and regions, resulting in operational and cost overhead.

Restricted to SQL: Existing cleanroom solutions don’t provide much flexibility to run arbitrary workloads and analyses and are often restricted to simple SQL statements. While SQL is powerful and absolutely needed for cleanrooms, there are times when you require complex computations such as machine learning, integration with APIs, or other analysis workloads where SQL just won’t cut it.

Hard to scale: Most of the existing cleanroom solutions are tied to a single vendor and are not scalable to expand collaboration beyond two participants at a time. For example, an advertiser might want to get a detailed view of their ad performance across different platforms, which requires analysis of the aggregated data from multiple data publishers. With collaboration limited to just two participants, organizations get partial insights on one cleanroom platform and end up moving their data to another cleanroom vendor, incurring the operational overhead of manually collating partial insights.

Deploy a scalable and flexible data cleanroom solution with the Databricks Lakehouse Platform

The Databricks Lakehouse Platform provides a comprehensive set of tools to build, serve, and deploy a scalable and flexible data cleanroom based on your data privacy and governance requirements.

Secure data sharing with no replication: With Delta Sharing, cleanroom participants can securely share data from their data lakes with other participants without any data replication across clouds or regions. Your data stays with you and it is not locked into any platform. Additionally, cleanroom participants can centrally audit and monitor the usage of their data.

Full support to run arbitrary workloads and languages: The Databricks Lakehouse Platform gives cleanroom participants the flexibility to run any complex computations, such as machine learning or data workloads, on the data in any language (SQL, R, Scala, Java, Python).

Easily scalable with guided on-boarding experience: Cleanrooms on the Databricks Lakehouse Platform are easily scalable to multiple participants on any cloud or region. It is easy to get started and guide participants through common use cases using predefined templates (e.g., jobs, workflows, dashboards), reducing time to insights.

Privacy-safe with fine-grained access controls: With Unity Catalog, you can enable fine-grained access controls on the data and meet your privacy requirements. Integrated governance allows participants to have full control over queries or jobs that can be executed on their data. All the queries or jobs on the data are executed on Databricks-hosted trusted compute. Participants never get access to the raw data of other participants, ensuring data privacy. Participants can also leverage open source or third-party differential privacy frameworks, making your cleanroom future-proof.
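
To give a sense of what a differential privacy framework does, here is a minimal sketch of the Laplace mechanism applied to a counting query, written with the Python standard library only. It is an illustration of the general technique, not the API of any specific framework or of Databricks; the function names and the choice of epsilon are assumptions made for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon: float, rng=None) -> float:
    """Return a noisy count of the values matching `predicate`.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller epsilon adds more noise and gives stronger privacy; a larger epsilon gives more accurate answers. Production frameworks additionally track the cumulative privacy budget across queries and guard against floating-point side channels, which this sketch does not attempt.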

To learn more about data cleanrooms on the Databricks Lakehouse Platform, please reach out to your Databricks account representative.
