
The Executive’s Guide to Data, Analytics and AI Transformation, Part 4: Democratize access to quality data with governance

Chris D’Agostino
Mimi Park
Usman Zubair

This is part four of a multi-part series to share key insights and tactics with Senior Executives leading data and AI transformation initiatives. You can read part three of the series here.

Effective data and AI solutions rely more on the amount of quality data available than on the sophistication or complexity of the report, model or algorithm. Google's paper "The Unreasonable Effectiveness of Data" demonstrates this point. The takeaway is that organizations should focus their efforts on making sure data citizens have access to the widest selection of relevant and high-quality data to perform their jobs. This will create new opportunities for revenue growth, cost reduction and risk reduction.

The 80/20 data science dilemma

Most existing data environments have their data stored primarily in different operational data stores within a given business unit (BU), and this creates several challenges:

  • Most BUs deploy use cases that are based only on their own data without taking advantage of cross-BU opportunities
  • The schemas are generally not well understood outside of the owning BU or department, with only the database designers and power users able to make efficient use of the data. This is often referred to as the "tribal knowledge" phenomenon
  • The approval process and different system-level security models make it difficult and time-consuming for data scientists to gain the proper access to the data they need

In order to perform analysis, users are forced to log in to multiple systems to collect their data. This is most often done using single-node data science tools and generates unnecessary copies of data stored on local disk drives, various network shares or user-controlled cloud storage. In some cases, the data is copied to "user spaces" within production platform environments. This has the strong potential of degrading the overall performance for true production workloads.

To make matters worse, these copies of data are generally much smaller than the full-size data sets that would be needed in order to get the best model performance for your ML and AI workloads. Small data sets reduce the effectiveness of exploration, experimentation, model development and model training — resulting in inaccurate models when deployed into production and used with full-size data sets.

As a result, data science teams are spending 80% of their time wrangling data sets and only 20% of their time performing analytic work — work that may need to be redone once they have access to the full-size data sets. This is a serious problem for organizations that want to remain competitive and generate game-changing results.

Another factor contributing to reduced productivity is the way in which end users are typically granted access to data. Security policies usually require both coarse-grained and fine-grained data protections: access is granted at the data set level (coarse-grained) but is also restricted to specific rows and columns within the data set (fine-grained).
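To make the distinction concrete, the sketch below shows one common way to layer fine-grained protection on top of coarse-grained table access: a dynamic view that redacts a sensitive column unless the caller belongs to an authorized group. It assumes a Databricks notebook where `spark` is predefined and the built-in `is_member()` SQL function is available; all table, column and group names are hypothetical.

```python
# A minimal sketch of fine-grained (column-level) protection layered on top of
# coarse-grained table access. Assumes a Databricks notebook where `spark` is
# predefined and the built-in is_member() SQL function is available.
# All table, column and group names below are hypothetical.
spark.sql("""
    CREATE OR REPLACE VIEW finance.card_transactions_redacted AS
    SELECT
        transaction_id,
        merchant_category,
        amount,
        CASE
            WHEN is_member('card-transactions-restricted-readers') THEN card_number
            ELSE 'REDACTED'   -- everyone else sees a masked value
        END AS card_number
    FROM finance.card_transactions
""")
```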

Rationalize data access roles

The most common approach to providing coarse-grained and fine-grained access is to use what's known as role-based access control (RBAC). Individual users log on to system-level accounts or via a single sign-on authentication and access control solution.

Users can access data by being added to one or more Lightweight Directory Access Protocol (LDAP) groups. There are different strategies for identifying and creating these groups, but they are typically defined on a system-by-system basis, with a 1:1 mapping for each combination of coarse-grained and fine-grained access controls. This approach to data access usually produces a proliferation of user groups: it is not unusual for large organizations to have several thousand discrete security groups despite having a much smaller number of defined job functions.

This approach creates one of the biggest security challenges in large organizations. When personnel leave the company, it is fairly straightforward to remove them from the various security groups. However, when personnel move around within the organization, their old security group assignments often remain intact and new ones are assigned based on their new job function. This leads to personnel continuing to have access to data that they no longer have a "need to know."

Data classification

Having all your data sets stored in a single, well-managed data lake gives you the ability to use partition strategies to segment your data based on "need to know." Some organizations partition data by the business unit that owns it and by the data's classification. For example, in a financial services company, credit card customers' data could be stored separately from that of debit card customers, and access to GDPR/CCPA-related fields could be handled using classification labels.
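As a rough illustration of this kind of segmentation, the sketch below writes card customer data into a Delta table partitioned by product line, so credit card and debit card records land in separate partitions. The table, path and column names are hypothetical, and a real implementation would layer classification labels and access policies on top.

```python
# A minimal sketch of segmenting lakehouse data along "need to know" lines by
# partitioning on the owning product line (e.g., credit card vs. debit card).
# Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

customers = spark.read.table("raw.card_customers")

(customers
    .write
    .format("delta")
    .partitionBy("product_line")   # 'credit-card' and 'debit-card' records stored separately
    .mode("overwrite")
    .saveAsTable("curated.card_customers"))
```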

The simplest approach to data classification is to use three labels:

  • Public data: Data that can be freely disclosed to the public. This would include your annual report, press releases, etc.
  • Internal data: Data that has low security requirements but should not be shared with the public or competitors. This would include strategy briefings and market or customer segmentation research.
  • Restricted data: Highly sensitive data regarding customers or internal business operations. Disclosure could negatively affect operations and put the organization at financial or legal risk. Restricted data requires the highest level of security protection.

Taking this into account, an organization could implement a streamlined set of roles for RBAC that uses the convention <domain><entity><data set | data asset><classification> where "domain" might be the business unit within an organization, "entity" is the noun that the role is valid for, "data set" or "data asset" is the ID, and "classification" is one of the three values (public, internal, restricted).

A "deny all by default" policy blocks access to any data unless a corresponding role has been assigned, and wildcards can be used within role names to eliminate the need to enumerate every combination.

For example, <credit-card><customers><transactions><restricted> gives a user or a system access to all the data fields that describe a credit card transaction for a customer, including the 16-digit credit card number, whereas <credit-card><customers><transactions><internal> allows access only to nonsensitive data regarding the transaction.

This gives organizations the chance to rationalize their security groups by using a domain naming convention to provide coarse-grained and fine-grained access without creating a proliferation of LDAP groups. It also dramatically eases the administration of granting data access to a given user.
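The sketch below shows how such a convention could be checked in practice: roles are simple structured strings, access is denied unless a matching role exists, and wildcards cover whole groups of assets. The dot delimiter, role strings and helper function are illustrative only, not part of any product API.

```python
# A minimal sketch of the <domain><entity><data set><classification> role
# convention with a deny-all default and wildcard support. The dot delimiter,
# role strings and helper function are illustrative, not a product API.
from fnmatch import fnmatch

def is_authorized(user_roles: list[str], requested_asset: str) -> bool:
    """Deny-all default: access is granted only if some assigned role matches."""
    return any(fnmatch(requested_asset, role) for role in user_roles)

analyst_roles = [
    "credit-card.customers.transactions.internal",  # nonsensitive transaction fields
    "credit-card.*.reference-data.public",          # wildcard across entities
]

print(is_authorized(analyst_roles, "credit-card.customers.transactions.internal"))    # True
print(is_authorized(analyst_roles, "credit-card.customers.transactions.restricted"))  # False
```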

Everyone working from the same view of data

The modern data stack, when combined with a simplified security group approach and a robust data governance methodology, gives organizations an opportunity to rethink how data is accessed — and greatly improves time to market for their analytic use cases. All analytic workloads can now operate from a single, shared view of your data.

Combining this with a sensitive data tokenization strategy can make it straightforward to empower data scientists to do their job and shift the 80/20 ratio in their favor. It's now easier to work with full-size data sets that both obfuscate NPI/PII information and preserve analytic value.
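A hedged sketch of what such tokenization might look like is below: the sensitive column is replaced with a salted hash so analysts can still join and aggregate on it without seeing the raw value. The table and column names are hypothetical, and production-grade tokenization would typically rely on a dedicated vaulted or format-preserving service.

```python
# A minimal sketch of obfuscating a PII column while preserving analytic value:
# the raw card number is replaced with a salted hash that can still be used as
# a join key. Table and column names are hypothetical; a production system
# would typically use a vaulted or format-preserving tokenization service.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

transactions = spark.read.table("silver.card_transactions")

tokenized = (transactions
    .withColumn("card_number_token",
                F.sha2(F.concat(F.col("card_number"), F.lit("per-environment-salt")), 256))
    .drop("card_number"))   # remove the raw value entirely

tokenized.write.format("delta").mode("overwrite").saveAsTable("silver.card_transactions_tokenized")
```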

Now, data discovery is easier because data sets have been registered in the catalog with full descriptions and business metadata; some organizations go as far as showing realistic sample data for a particular data set. If a user does not yet have access to the underlying data files, having the data in one physical location eases the burden of granting it, and it also becomes easier to deploy access-control policies and to collect and analyze audit logs that monitor data usage and flag bad actors.

Data security, validation and curation — in one place

The modern data architecture using the Databricks Lakehouse makes it easy to take a consistent approach to protecting, validating and improving your organization's data. Data governance policies can be enforced during curation using built-in features such as schema validation, data quality "expectations" and pipelines. Databricks enables moving data through well-defined states: Raw → Refined → Curated or, as we refer to it at Databricks, Bronze → Silver → Gold.

The raw data is known as "Bronze-level" data and serves as the landing zone for all your important analytic data. Bronze data functions as the starting point for a series of curation steps that filter, clean and augment the data for use by downstream systems. The first major refinement results in data being stored in "Silver-level" tables within the data lake. Because these tables are recommended to use an open table format (e.g., Delta Lake) for storage, they provide additional benefits such as ACID transactions and time travel. The final step is to produce business-level aggregates, or "Gold-level" tables, that combine data sets from across the organization, such as data used to improve customer service across the full line of products or to identify cross-sell opportunities that increase customer retention.

For the first time, organizations can truly optimize data curation and ETL, eliminating the unnecessary copies of data and duplicated effort that often plague ETL jobs in legacy data ecosystems. This "solve once, access many times" approach speeds time to market, improves the user experience and helps retain talent.
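As a rough sketch of that flow, the example below moves hypothetical transaction data through Bronze, Silver and Gold tables with a simple data quality rule at the Silver step. On Databricks, the same kind of "expectations" can be declared declaratively (for example with Delta Live Tables), but plain PySpark is used here to keep the sketch self-contained; all paths, table names and rules are assumptions.

```python
# A minimal sketch of the Bronze -> Silver -> Gold flow described above, using
# plain PySpark. Paths, table names, schema and the quality rule are
# hypothetical; on Databricks the Silver-step checks could instead be declared
# as Delta Live Tables "expectations".
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw data as-is.
bronze = spark.read.json("/landing/card_transactions/")
bronze.write.format("delta").mode("append").saveAsTable("bronze.card_transactions")

# Silver: validate and clean (a simple expectation: amounts must be positive).
silver = (spark.read.table("bronze.card_transactions")
            .filter(F.col("amount") > 0)
            .dropDuplicates(["transaction_id"]))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.card_transactions")

# Gold: business-level aggregate that can be combined with data from other BUs.
gold = (spark.read.table("silver.card_transactions")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_spend")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.customer_spend")
```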

Extend the impact of your data with secure data sharing

Data sharing is crucial to drive business value in today's digital economy. More and more organizations are now looking to securely share trusted data with their partners/suppliers, internal lines of business or customers to drive collaboration, improve internal efficiency and generate new revenue streams with data monetization. Additionally, organizations are interested in leveraging external data to drive new product innovations and services.
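For instance, a partner that receives data through an open sharing protocol such as Delta Sharing could load it with only a credentials file from the provider, without copying the underlying storage. A minimal sketch is below; the profile path and the share, schema and table names are hypothetical placeholders.

```python
# A minimal sketch of consuming an externally shared data set via the
# delta-sharing Python client. The profile path and the share/schema/table
# names are hypothetical placeholders supplied by the data provider.
import delta_sharing

profile = "/path/to/config.share"                      # credentials file from the provider
table_url = f"{profile}#retail_share.sales.transactions"

# Load the shared table into a pandas DataFrame for analysis.
shared_df = delta_sharing.load_as_pandas(table_url)
print(shared_df.head())
```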

Business executives must establish and promote a data sharing culture in their organizations to build competitive advantage.

Conclusion

Data democratization is a key step on the data and AI transformation journey to enable data citizens across the enterprise irrespective of their technical acumen. At the same time, organizations must have a strong stance on data governance to earn and maintain customer trust, ensure sound data and privacy practices, and protect their data assets.

The Databricks Lakehouse Platform provides a unified governance solution for all your data and AI assets, built-in data quality to streamline data curation, and a rich collaborative environment for data teams to discover new insights. To learn more, please contact us.

Want to learn more? Check out our eBook Transform and Scale Your Organization With Data and AI.
