Building High-Quality and Trusted Data Products with Databricks

Published: May 6, 2024

by Amr Ali, Bernhard Walter, Fran Medina Castro, Glenn Wiebe, Karthik Subbarao, Lexy Kassan, Magnus Pierre and Pawarit Laosunthara

Introduction

Organizations aiming to become AI and data-driven often need to provide their internal teams with high-quality and trusted data products. Building such data products ensures that organizations establish standards and a trustworthy foundation of business truth for their data and AI objectives. One approach for putting quality and usability at the forefront is through the use of the data mesh paradigm to democratize the ownership and management of data assets. Our blog posts (Part 1, Part 2) offer guidance on how customers can leverage Databricks in their enterprise to address data mesh's foundational pillars, one of which is "data as a product".

Though the idea of treating data as products may have gained popularity with the emergence of data mesh, we have observed that applying product thinking resonates even with customers who haven't chosen to embrace data mesh. Regardless of organizational structure or data architecture, data-driven decision-making remains a universal guiding principle. Data quality and usability are paramount to ensure these data-driven decisions are made on valid information. This blog will outline some of our recommendations for building enterprise-ready data products, both generally and specifically with Databricks.

Data products ultimately deliver value when users and applications have the right data at the right time, with the right quality, in the right format. While this value has traditionally been realized in the form of more efficient operations through lower costs, faster processes and mitigated risks, modern data products can also pave the way for new value-adding offerings and data sharing opportunities within an organization's industry or partner ecosystem.

Data Products

While data products can be defined in various ways, they typically align with the definition found in DJ Patil's Data Jujitsu: The Art of Turning Data into Product: "To start, ..., a good definition of a data product is a product that facilitates an end goal through the use of data". As such, data products are not restricted to tabular data; they can also be ML models, dashboards, etc. To apply such product thinking to data, it is strongly recommended that each data product should have a data product owner.

Data product owners manage the development and monitor the use and performance of their data products. To do so, they must understand the underlying business and be able to translate the requirements of data consumers into a design for a high-quality, easy-to-use data product. Together with others in the organization, they bridge the gap between business and technical colleagues like data engineers. The data product owner is accountable for ensuring that the products in their portfolio align with organizational standards across characteristics of trustworthiness.

There are five key characteristics that a data product must meet:

Quality and Observability: Data quality includes accuracy, consistency, reliability, timeliness, as well as clarity of documentation. Defined quality metrics about the data product can be monitored and exposed to ensure that the expected data quality is maintained over time. The overall goal is to make the data product a trusted source for data consumers.
Semantic consistency: The goal of a lakehouse architecture is to make working with data easy. Therefore, data products that are meant to be used together should be semantically consistent. In other words, they should follow the agreed governance rules and have shared definitions of terminology in order for consumers to combine these data products in a meaningful and correct way.
Privacy: Privacy is about the confidentiality and security of information, concerning how data is collected, shared, and used. Data privacy is typically governed by regulations and laws (e.g. GDPR, CCPA). Complying with data privacy rules can include topics such as anonymization, encryption, data residency, data tagging (e.g. PII), limiting storage to specific environments, and minimizing access to a small number of employees.
Security: In addition to having an infosec-approved data platform in place, data product owners still need to define, for example, access permissions (who can access the data, which partners can the data be shared with, etc.) and acceptable use policies for their data products.
Discoverability: Data products need to be published in a way that everyone in the organization can find them. This can include places such as a central data catalog or an internal data marketplace. Data product owners should include assets with the published product that make it easy to understand the data and how to combine it with other data products (e.g. sample notebooks, dashboards, etc.).

Data Product Lifecycle

A typical data product lifecycle consists of the following phases:

Inception - This is where business value for a desired data product is defined and an owner is assigned. Performance and quality metrics should also be defined for monitoring purposes.
Design - In this phase, concrete details such as the design specification and data contracts are created, ensuring consistency with other data products.
Creation - Creating the actual data product can include schemas, tables, views, models, arbitrary files (volumes), dashboards, etc., along with the pipelines that create them. This phase also includes testing the resulting data product against the defined data contract.
Publish - The creation and publishing of a data product are often treated as the same but they are quite different. This phase includes activities such as the deployment of models, publishing a schema to a shared catalog, managing the access permissions as per the data contract, etc. Publishing should involve release management to version changes to published data products.
Operate and Govern - Operations involve persistent activities like monitoring the quality, permissions, and usage metrics. The governance part includes handling compliance-related requests and auditing data product access etc.
Consume and Value Creation - The data product is used in the business to solve a variety of problems. Consumers may provide feedback to the data product owner based on their experience of using the product and recommend enhancements that could facilitate further value creation in the future.
Retirement - There can be several reasons to retire a data product, such as a lack of usage, the data product being no longer compliant, etc. In any case, the data product should be gracefully retired. This means deprecating the product, informing the consumers, archiving assets, and cleaning up resources. Here, visibility over downstream usage will often be important and is significantly eased if lineage is automatically captured.

Figure 2: Typical lifecycle of a data product

In the figure above, the data product owner is accountable for all of the phases, beginning from the inception until the retirement of a data product. Nevertheless, the responsibility for individual tasks can be shared with other stakeholders such as data stewards, data engineers, etc.

The Databricks Data Intelligence Platform can be leveraged for several of the activities involved in the data product lifecycle:

ETL Pipelines - Delta Live Tables (DLT) can be employed to build robust and quality-controlled data pipelines. Auto Loader and streaming tables can be used to incrementally land data into the Bronze layer for DLT pipelines or Databricks SQL queries.
Governance - Databricks Unity Catalog is feature-rich and built to enable simple and unified governance across an enterprise. Catalog Explorer can be used for data discovery and access control mechanisms facilitate publishing the data products to the intended consumers. Lineage and System Tables are automatically tracked and vital to operational governance.
Monitoring - Lakehouse Monitoring provides a single and unified solution for monitoring the quality of data and AI assets. Such a proactive approach is necessary to satisfy the data contract terms.

For some of the data product lifecycle activities, such as designing the data product and data contract, Databricks does not currently have features to support it. These processes should be done outside of the Databricks Platform and the results then be documented in Unity Catalog once the data product has been published.

Data Contracts

A data contract is a formal way to align the domains and implement federated governance. The data producer should provide it; however, it should be designed with the consumer in mind. The contract should be framed in a way that is consumable by all types of users.

A typical data contract has the following attributes

Data description (name, description, source systems, attribute selection, …)
Data schema (tables, columns, anonymization and encryption info, filter, masks, …) and data formats (semi-structured and unstructured data)
Usage policies (tags, PII, guidelines, data residency, …)
Data quality (applied quality checks and constraints, quality metrics, …)
Security (who is allowed to use the data product)
Data SLAs (last update, expiration dates, retention time, …)
Responsibilities (owner, maintainer, escalation contact, change process, …)

In addition, supporting assets such as notebooks, dashboards, etc. can be provided in order to help the consumer understand and analyze the data product, thus facilitating easier adoption.

Data Governance Team

A data governance team in an enterprise usually consists of representatives from different groups such as business owners, compliance and security experts, and data professionals. This team should act as Center of Excellence (CoE) for compliance and data security topics and support the data product owner who is accountable for the data product. They play a crucial role in framing the data contract by extending the usage policies as well as influencing the decision of who is allowed to use the data product. For large organizations, such a team can help with steering and standardizing the data contract framing process in alignment with global functions such as a data management office.

Publishing and Certification

Despite established data contracts, the governance of data products remains a broad subject, encompassing aspects such as access controls, Personally Identifiable Information (PII) classification, and various usage policies, all of which can differ between organizations. However, one consistent trend we have observed concerns the publication of data products. As consumers encounter an increasing number of datasets, they often require assurance that the data is curated, standardized, and officially approved for use. For instance, a reporting or master data management use case within a large organization might necessitate a high degree of semantic consistency and interoperability between diverse data assets in the enterprise.

This is where the concept of data product 'certification' can become valuable for certain data products. In this process, data producers can first propose a data contract specification, typically subject to review by a data governance steward or team. Upon approval, Continuous Integration/Continuous Deployment (CI/CD) processes can be run to deploy production pipelines that physically write data to the customer's cloud storage accounts. This data can then be published and easily discovered through Unity Catalog tables, views, or even volumes for non-tabular data. In this context, Unity Catalog supports the use of tags as well as markdown to indicate the certification status and details of a data product.

Figure 3: Data product 'certification' process

Some customers may even choose to promote their certified data products by publishing a corresponding private listing in the Databricks Marketplace with comprehensive guides and usage examples. Furthermore, Databricks' REST APIs and integrations with enterprise catalog solutions such as Alation, Atlan, and Collibra also facilitate the easy discoverability of certified data products through multiple channels, even those outside of Databricks.

Conclusion

Formulating data products and data contracts can become intricate exercises within a large enterprise setting. Given the emergence of new technologies for interfacing with data, coupled with modern business and regulatory requirements, specifications for data products and contracts are continuously evolving. Today, Databricks Marketplace and Unity Catalog serve as core components for the data discovery and onboarding experience for data consumers. For data producers, Unity Catalog offers essential enterprise governance functionality including lineage, auditing, and access controls.

As data products extend beyond simple tables or dashboards to encompass AI models, streams, and more, customers can benefit from a unified and consistent governance experience on Databricks for all major user personas.

The key aspects of enterprise data products highlighted in this blog can serve as guiding principles as you approach the topic. To learn more about constructing high-quality data products using the Databricks Data Intelligence Platform, reach out to your Databricks representative.

What's next?

July 30, 2024/4 min read

OKR-Centric Delivery Models for Engineering-Focused Enterprises

November 12, 2024/9 min read