This blog is authored by Anup Segu, Co-Head of Data Engineering at YipitData
YipitData provides daily market insights to its clients by analyzing over 15 petabytes of alternative data stored in its Lakehouse. Its data products and services are trusted by top financial institutions and corporations, thanks to the company's multiple business units that leverage data analytics with clockwork precision and efficiency. The company's internal data platform allows its 200+ data team to independently process thousands of small and large datasets through 2000+ daily ETL pipelines without direct data engineering involvement. To scale its data operations further, YipitData is leveraging Databricks Unity Catalog as a metastore service to provide robust data governance controls while increasing the value and utilization of its 150K+ tables in its Lakehouse. Through Unity Catalog, YipitData has gained visibility into active and inactive datasets, improved performance for BI integrations, and opened new channels to deliver data to clients.
This blog post takes you on our journey from using Hive metastore to adopting Unity Catalog. We'll also discuss the numerous benefits we've experienced since implementing Unity Catalog as the centerpiece of our Lakehouse architecture.
To learn more, tune into YipitData's presentation at the Data + AI Summit 2023, where we discuss how data organizations can take full advantage of Unity Catalog through our own migration and platform architecture with concrete tactics to scale data administration, productize Lakehouse data, and optimize costs.
Using Hive Metastore impeded scalability, increased complexity in data platform administration, and limited effective data sharing with clients
In the last few years, YipitData has grown its organization 3x to cover 100+ companies and sectors through numerous data products. During this time, multiple teams and departments were formed with a spectrum of unique business requirements. Some teams need wide data access to complete its objectives while other teams need isolated environments to produce internal analytics securely. For a centralized data engineering department and data platform, supporting the plethora of business opportunities presented technical challenges.
Before Unity Catalog, YipitData's Lakehouse relied on a Hive metastore with a role-based access model through its cloud provider to manage data assets. The architecture was straightforward at its inception but as the company expanded, the number of roles required to manage the platform grew 20x. Onboarding new teams and projects became a complicated infrastructure deployment process that would stall product development. Additionally, datasets were either unused or duplicated due to permission barriers, which confused platform users and taxed the data platform. With so many tables in the metastore, many BI tools could not function due to underlying metadata operations timing out, inhibiting efforts to create rich, interactive data experiences for our clients.
In addition, many of our clients operate in different clouds or use various data stacks that don't offer workable integrations with Hive metastore. Consequently, our data teams had to duplicate data for different clients in environments outside of spark, and then carry out post-processing there. This led to increased storage costs and performance bottlenecks.
The Data Engineering department at YipitData recognized this architecture was not viable and sought a new data governance solution that would establish:
- Finegrain, object-based access controls that accurately represent business units and projects within the company
- Highly performant metastore that supports hundreds of thousands of tables with millisecond-level response times
- Search capabilities to identify and utilize existing tables and databases throughout the organization
- Robust observability and lineage information for administrators to track data flows throughout the company
- Extensible APIs to enable the development of data applications in the future
- Backward compatibility with the existing hive metastore to facilitate a smooth migration path
- Flexibility to share data with more clients in their preferred cloud or analytics tools
Migrating to Unity Catalog simplified access management, improved data observability, and enhanced data sharing
When reviewing the above architectural needs of the data platform with Databricks, Unity Catalog lined up well in its functionality as it plugs into the existing Lakehouse setup while providing a granular access control API. Starting Unity Catalog was painless for most users of the platform as they could concurrently query Hive metastore and Unity Catalog managed data in the Lakehouse. Over time, teams were fully utilizing Unity Catalog for all data analytics with the data platform gradually cutting over queries to Unity Catalog instead of Hive metastore behind the scene.
Since the migration, teams are operating in a data mesh paradigm with their databases and pipelines assigned to distinct catalogs in Unity Catalog, making ownership and expectations of data assets between business units transparent within the Databricks UI. Unity Catalog is also unblocking our data sharing efforts as we can maintain a single copy of a delta table and connect it to BI tools, DBSQL, and Delta Sharing concurrently. Now our clients can consume our analytics in simpler ways than before while we internally keep the data preparation for those experiences lean, performant, and consistent with open standards.
After completing the migration, we observed the following benefits:
- Data access is provisioned for most new teams or projects without complex cloud deployments required from data engineering.
- Data teams can discover key attributes of databases and tables quickly via the Databricks UI and REST API without launching spark compute or shoulder-tapping teammates.
- Data pipelines are more resilient by utilizing table and column-level lineage data to avoid breaking changes, trace data issues to their source, and delete unused tables.
- Datasets are easier to consume for clients through Delta Sharing and BI tools, which use Unity Catalog to fetch tables and table metadata without noticeable latency.
- Precise audit trail of queried data assets is available to security and data platform admins to verify the enforcement of compliance policies throughout the company.
Streamlined data accessibility resulted in faster insights and improved operational efficiency
Unity Catalog has improved governance, increased the utilization of datasets, and made data accessible to external systems and clients in a thoughtful way. The Unity Catalog metastore service is highly performant, with millisecond-level latency for table metadata, which has enabled the company to organize over 150,000 tables and assign permissions effectively. This has made ~70% of the custom cloud infrastructure resources (roles, buckets, bucket policies) from the previous RBAC architecture obsolete. Data engineers can now provision data access more easily via the Databricks UI and REST API, unblocking business units faster to achieve their goals. Teams now discover datasets in Unity Catalog and view auto-generated lineage graphs, leading to the consolidation of duplicate tables and deletion of unused tables/table columns. This data lake cleanup using lineage data and audit logs in Unity Catalog helped us identify inactive data and reduce our cloud storage costs by 25%.
Furthermore, we have achieved a significant 45% reduction in our SQL data warehouse expenses. Prior to implementing Unity Catalog, our organization maintained separate SQL warehouses for each team, resulting in considerable inefficiencies and unnecessary expenditures. The problem stemmed from the fact that not every team would utilize their designated clusters, leading to unused compute resources. However, Unity Catalog optimized our approach by introducing permissions that transcend cluster boundaries. This allowed us to consolidate clusters and SQL warehouses, enabling multiple teams to concurrently leverage shared resources efficiently. In addition, YipitData's data products can now be accessed through interactive analytics powered by BI tools as well as Delta Sharing, which helps clients gain more value from us without spending significant time on data access and ingestion bottlenecks.
YipitData's data engineering department is excited to continue to invest in its data platform with Unity Catalog as its metastore and access management service, positioning the company for success as it expands its analytics offerings.
Update: Now you can watch our presentation at Data and AI Summit 2023, where we share more about our journey to adopt Unity Catalog, best practices, and what is ahead for our data platform and Lakehouse architecture.