Find and Protect Your Crown Jewels in Databricks with Privacera and Apache Ranger

Optimize the performance of Databricks infrastructure by adding Privacera security and compliance workflows. Learn how Privacera’s Apache Ranger-based architecture in the cloud integrates with Databricks Delta Lake to enable a secure multi-tenant framework to efficiently find and easily manage sensitive data with centralized, fine-grained access control.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hello, everyone. We are here to talk about data governance in the hybrid Cloud and how Privacera and Databricks can help you achieve that. The title of this talk is Find and Protect Your Crown Jewels in Databricks with Privacera and Apache Ranger. My name is Srikanth Venkat, and I am VP of Product Management at Privacera. I used to manage the security and governance portfolio at Cloudera, including the Shared Data Experience with Apache Ranger, Apache Atlas, et cetera. We are here to talk about how Privacera can help you move your workloads responsibly to the Cloud, work on top of Databricks as a solution, and give you the Apache Ranger functionality for security governance and other data governance capabilities in the Cloud.

Who are we? Privacera was founded by Balaji Ganesan and Bosco Durai in 2016. They were the original creators of Apache Ranger: Hortonworks purchased XA Secure, the company that they founded, and the code was open-sourced as Apache Ranger. Since 2014, a lot of big data deployments have had Apache Ranger as their core authorization and audit engine, and the community has added a lot of functionality there. In 2016, Balaji and Bosco founded Privacera to take the Apache Ranger solution to the Cloud and build capabilities to address platforms such as Databricks, AWS, Microsoft Azure, Google Cloud, et cetera, with the same set of capabilities. Our platform became generally available in 2017, and today we have a number of Fortune 100 customers using our solution at scale. We have a vetted, experienced, and accomplished set of innovators on the team, both in the community and outside, and our mission is to solve the data democratization problem that most enterprises face.

Challenge of Maintaining Data Governance

Most enterprises today, as they're moving to the Cloud, have the dual mandates of maintaining data governance while democratizing their data responsibly. As you can see in this chart, data volumes are exploding, but the amount of usable data that can be responsibly shared, and that meets the compliance guidelines coming either from external regulations or internal mandates, is shrinking. So there's a huge gap between the amount of data that is available and the amount of data that can be shared and used for insights. Our goal at Privacera is to close that gap.

As you can see here, we aim to shrink that gap. Our primary use cases are as follows. We help our enterprise customers, whether they are familiar with Ranger or new to it, and give them the ability to carry over all of the robust security and governance features that are available via Apache Ranger. We help them migrate safely to the Cloud in a compliant fashion, with workflows and automation to support emerging regulations such as CCPA and GDPR. And in doing so, we enable them to democratize their data so that it can be shared securely and widely across their organizations to support better insights and decision making. We cover traditional big data, relational, and other platforms on premises as well as the public Cloud on Azure, AWS, and Google Cloud.

Our Primary Use Cases

In particular, we are first-class partners for Databricks. We have a lot of deep integration on the Spark side, and we work with the community and with Databricks in this regard to make sure that our solution can help our customers smoothly migrate from on-premises to public Cloud deployments, and also support hybrid Cloud deployments while running with the Databricks solution.

What does the Privacera solution offer? Without the Privacera solution, if you look at all of the different Cloud services, each has its own authorization model, its own auditing framework, and its own data governance framework. This is as widespread as the Cloud itself, and even within a single public Cloud provider's services there is a lot of heterogeneity and a lot of duplication of the data and the policies needed to get consistent enforcement. As a result, enterprises are struggling with how to impose a layer of compliance on top while being able to share the data responsibly across the enterprise and get the analytic value out of their data investment by moving into the Cloud. As they migrate to the Cloud, this heterogeneity and stovepiping of all these different services makes it very difficult for them to manage governance and security centrally, in particular authorization, auditing, and data discovery. With Privacera we aim to centralize this: we provide a single pane of glass that can be used to discover where your sensitive data is, classify it, and use that and other capabilities on the platform to provide centralized access control at a very fine-grained level, along with automated compliance workflows that help democratize data while keeping it in compliance with external regulations.

A little bit about our data discovery architecture: we can connect to and scan a variety of Cloud databases, Cloud object stores, and Cloud data warehouses, as well as on-premises databases, warehouses, file systems, and object stores. We leverage Apache Spark and Databricks very heavily in our scanning technology. The scanner agents can go in, query the data in process, and then use a variety of algorithms that we will show you in the demo, everything from regular expressions to dictionaries to machine learning models, to scan this data, pattern match, apply a variety of rules, and then attach different classifications and other metadata. This is stored in our meta store, a Cloud-native meta store that is scalable and relies on the Cloud infrastructure natively. All of that is available through our security envelope with the Apache Ranger functionality, and you can use these discovery-driven classifications or labels to drive security and authorization decisions, as well as to understand the audit, analytics, and usage of the data across your enterprise.
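As a rough illustration of the scanning flow just described, a regex-based content scanner might look like the following sketch. The tag names and patterns here are illustrative assumptions, not Privacera's actual rule definitions.

```python
import re

# Hypothetical detection rules: tag name -> pattern over individual values.
# Real scanners combine content patterns with context (column/file names),
# dictionaries, and ML models, as described in the talk.
PATTERNS = {
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "US_ZIP": re.compile(r"^\d{5}(-\d{4})?$"),
}

def scan_column(values):
    """Return, for each tag, the fraction of sampled values that match."""
    scores = {}
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(str(v)))
        scores[tag] = hits / len(values) if values else 0.0
    return scores
```

A downstream step would turn these match fractions into the 0-100 confidence scores discussed later in the demo.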

Another key component of our platform is the access control architecture, which is predicated on the Apache Ranger framework from the traditional big data world. The Apache Ranger framework works through a plug-in model: for services such as Hive or HDFS on premises, there are plug-ins which run within the service, in the same process, and receive all incoming requests for access. They consult a set of policies that are cached within the plug-in or agent; these get pulled down from a central portal, where users can go and alter the policies, and they are maintained in a centralized database and distributed to the engines so that these plug-ins can enforce them. These local enforcement points inspect incoming requests, match them against the policies, and make a very quick policy decision on whether this user, group, or role, under that context of data usage, can actually access the data. Once the decision is made, the engine, which may be Hive or Spark or another engine, can natively do its own enforcement and optimization, so we only do the check while in process, and we only demarcate a path, not the data. Similarly, we have extended this plug-in model in Privacera to cover Cloud object stores, with agents that run as a lightweight reverse proxy, inspecting the incoming traffic, content, and metadata in order to make the authorization decision. We have also extended another model from Apache Ranger within Privacera for Cloud databases, which does policy synchronization: it takes the policies in a common form, expressed in the Apache Ranger format, and translates them down to the typical grants and revokes that are available in those databases.
In our plug-in architecture for Databricks, we run within the Spark driver: as and when the logical plan is created, we inspect the incoming request, make a quick policy decision consulting the policies that are pulled down from the centralized Apache Ranger servers, and locally enforce that within Spark. Because of this, we are highly performant and are able to provide authorization at multiple levels.
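The local enforcement model described above, with cached policies consulted per request and a default-deny outcome, can be sketched roughly as follows. The policy structure is a simplification for illustration, not Ranger's actual policy format.

```python
# Hypothetical local policy cache, pulled down periodically from a central
# policy server, as in the plug-in model described above.
POLICY_CACHE = [
    {"resource": "sales_db.sales_data", "users": {"emily"},
     "groups": {"sales"}, "permissions": {"select"}},
]

def is_access_allowed(user, groups, resource, permission):
    """Allow only if some cached policy grants this permission on the resource."""
    for policy in POLICY_CACHE:
        if policy["resource"] != resource:
            continue
        if permission not in policy["permissions"]:
            continue
        if user in policy["users"] or groups & policy["groups"]:
            return True
    return False  # default deny when no policy matches
```

Because the cache is local to the driver process, the check adds minimal latency to query planning, which is the point of the plug-in design.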

Privacera Plug-in Deployment for Databricks Clusters

In a Databricks context, this plug-in is loaded via init scripts and is available for the life of the cluster. So essentially the authorization checks and audits are inescapable: we configure the system so that for the lifecycle of the cluster within Databricks, our authorization engine is available.

How do we extend the model in Databricks? We work very closely with Databricks, and for SparkSQL and Python notebooks, in addition to table- and column-level access control, we provide column masking, row filtering, classification-based policies, and dynamic policies, as well as centralized auditing. For your Java and Scala notebooks, credential passthrough scenarios, et cetera, we provide both bucket- or object-level access and file-level access in ADLS, for example. We have additional enhancements around mapping identities in Databricks in various formats, checking the policies for the underlying Cloud object stores, and providing a lot of flexibility in terms of authentication schemes.

Now let's dive into a quick demo to look at Privacera in action. – [Instructor] One of the first challenges that enterprises face as they move more and more data into the Cloud is to determine what data lies where and how much of it is sensitive. The first part of the Privacera platform is a discovery service that helps you precisely understand and locate where sensitive data is across your Cloud footprint: your hybrid on-premises and private Cloud as well as your public Cloud resources. What you see here is a typical enterprise footprint, with multiple Clouds as well as on-premises services; we have services from AWS, Google, Azure, Databricks, et cetera. The Discovery Service helps you scan all of this information automatically and classify the data that is present there. For example, your data could be sitting in files in the Cloud, or in columns in tables in Cloud databases or data warehouses; we scan that entire footprint using the Discovery Service. And we can very precisely map and understand where various types of sensitive data, such as person names, social security numbers, zip codes, phone numbers, credit card numbers, et cetera, are located. You can do this entire mapping across all of these heterogeneous services and get a holistic, comprehensive view of the sensitive data, or toxic data as the case may be, across this entire footprint. Furthermore, you can divide this entire landscape into what are called data zones. Think of data zones as a way to segment your data logically across multiple services; for example, you could designate certain sets of tables and folders from, let's say, S3, Redshift, and Databricks Delta Lake files together as a particular zone. In this case, we have defined several data zones, which represent these logical divisions of the data, to which we can apply various types of access and governance policies.

So what is the benefit of these data zones? You can carve out different assets, look at the composition of the assets within a data zone, and look at the sensitivity attributes of those data zones. You can also have various compliance workflows across these data zones. As an example, you can have policies that alert you whenever sensitive data is moved into a zone or between zones; you can de-identify data as it moves between zones; and you can also alert, for example, when data leaves a particular zone, as exfiltration happens out of that zone. So the Privacera platform gives you a variety of compliance workflows, based on the sensitivity that has been detected and the various operations happening on those data zones, to alert you suitably.

Both of these capabilities will give you, as an example, all of the attempted data movements, and you can drill down into each of these to see where the data came from at a particular time; you can filter this over different scenarios to understand how the data map is constructed, the different functionality available within data zones, the alerts and monitoring that we reviewed, as well as lineage. Let us see how this precise detection is accomplished through the discovery module. The first step is to register the data services and data assets, sitting in Cloud storage or in various Cloud databases or warehouses, with the Discovery Service, so that the scanners, which are high-performing Spark jobs running in Databricks or in Apache Spark, can scan the data and produce the classifications, or labels, needed for sensitivity across this entire footprint. So the first step is to register the data assets for the Discovery Service to scan. Once you've registered these different data services, databases, or platforms, you can set the scope of what you want to scan. The first part is to define the scope of the scan: we can include various resources. As an example, if you look at the Databricks part here, we have two different tables from this customer database that we want to include in our scan. We can similarly exclude various resources, via tables, files, or databases, and so define the scope of the scans. Then, out of the box, we have various types of sensitive data that can be detected and labeled, and this can be extended with additional tags that can be customized according to the customer's requirements. How do we detect and classify this data? We use a variety of techniques in the scanning technologies.
Starting with very simple regex-based patterns, these can be applied both to the data content and to the data context, such as column names or file names. You can apply dictionaries, or inclusion and exclusion lists, to match against column names or column content; this could be useful for content such as person names, where you would look at a common list and find fuzzy or exact matches against it. In addition, you can have your own custom algorithms. As an example, in a credit card scenario, you may want to perform a Luhn check on the data once it has passed the content filters, the column matches, and the dictionary matches for, let's say, column name patterns.
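The Luhn check mentioned for the credit card scenario is a simple checksum that weeds out random digit strings which merely look like card numbers. A minimal implementation might look like this:

```python
def luhn_check(number: str) -> bool:
    """Luhn checksum, used to validate candidate credit card numbers
    after regex and dictionary filters have already matched."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:          # shorter than any real card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:            # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Running it on the well-known Visa test number `4111111111111111` returns `True`, while a number with one digit altered fails the checksum.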

All of these different techniques can be composed into rules, and these can get very sophisticated in terms of inclusion and exclusion, as well as content and context classifications, and can be used to automatically classify the data or to flag it for review. The classification output is provided as a score between zero and 100, with higher values indicating higher confidence. Users can set a tunable threshold here; in this case it is set at 70, where any piece of data that scores above 70 out of 100 is automatically classified, and for anything between 40 and 70, a data steward or other human can review and make any adjustments necessary, for false positives, for example. Once you have chosen the scope of the scan, the techniques, and the threshold, you trigger a scan. These scans can be triggered automatically, based on what we call file watchers or other watching technology that looks for changes or updates happening within the scope of the scans and automatically triggers the scan job. In addition, you can also do a manual trigger or set up a scheduled scan, such as with a cron job. So there is the capability to set up scans to run periodically, as well as the ability to auto-scan based on changes happening in these data environments. The output of the scanner is a classification attached to a particular table, or a traditional metadata annotation attached to a file or folder, whatever the content is that is being scanned in the service. As an example, in Databricks we've gone and scanned this table, and as you can see, the output of the scanner is a set of labels for the data that has been detected; we also show a sample of the data and the detection method that actually produced the score.
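The threshold triage just described, with automatic classification above 70 and human review between 40 and 70, can be sketched as follows. The exact boundary handling (strict versus inclusive comparison) is an illustrative assumption.

```python
# Tunable thresholds as described in the demo; boundary handling (>= vs >)
# is illustrative, not Privacera's exact semantics.
AUTO_THRESHOLD = 70
REVIEW_THRESHOLD = 40

def triage(score: int) -> str:
    """Map a 0-100 confidence score to a classification outcome."""
    if score >= AUTO_THRESHOLD:
        return "auto-classified"      # accepted as a system classification
    if score >= REVIEW_THRESHOLD:
        return "needs-review"         # queued for a data steward
    return "rejected"                 # below the review band, discarded
```

Lowering `AUTO_THRESHOLD` trades steward workload for a higher false-positive rate, which is why the talk describes the score as tunable.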
The output of the classification can be either a system classification, which is automatically approved since the score is above the threshold for automatic classification, or data whose score does not pass that threshold, in this case falling between 40 and 70 as we saw in the threshold setup. A data steward, or another person within the enterprise who has the role to curate this data, can go and accept or reject these classifications, and when they do, they can also give a reason, so we can use this information to improve the classifier over time. The next part of the Privacera platform is the extension of the access management capabilities provided by Apache Ranger in the Hadoop ecosystem, for services such as Hive or HDFS, out to Cloud object stores, databases, warehouses, and engines such as Databricks and Spark running in the Cloud. We will look at how Privacera has extended the Databricks security model to provide richer capabilities for access control, auditing, and key management services related to encryption. The unit of authorization within Ranger, for those of you who are familiar with the Hadoop ecosystem, is a policy. These policies can be role-based or attribute-based, and they are expressed very simply in the Ranger-like UI that we provide within the Privacera environment. A policy simply expresses who has access to what data, under what conditions, and what types of permissions they have on the underlying objects. In the case of Databricks, we can write policies at the database, table, and column level, which is very fine-grained, and also policies that pertain to a particular role, which is a composition of users and groups, as well as to groups or users themselves.
The users and groups can be sourced from enterprise systems such as LDAP and AD, and roles can be composed from user and group aggregations and can be nested within each other. So for example, roles can be nested within one another, and groups can be nested within one another, from your AD, Azure AD, or other directory sources. The permissions can be both positive and negative, and can be scoped down to the absolute minimum that is required. The full range of Ranger policy constructs, including both positive and negative permissioning, is also available by extension to Databricks and to Spark environments. Let us now look at an example of a user accessing data through Spark SQL from Databricks, and see the different capabilities that the Ranger extension from Privacera provides for Databricks environments. The first case here is a very simple access to the sales data table in the sales database. Emily, the user here, who belongs to the sales role, is trying to access this table by issuing a select star, which is "show me all the columns." In case there is no policy that prohibits her access to any of the columns, she will be able to see all of the data. But what if we wanted to protect the column called sales amount, which pertains to the sales for other reps? She is able to see not only reps from her region but also from the UK, and we want to restrict that. If there is a policy that prohibits access, you will see that she gets denied here. When she is denied, we also generate an audit event which collects a lot of metadata on this event, such as who tried to access the data, in this case Emily, what policy actually allowed or denied access for her, and what other policies were in effect when she was denied. In this case, no policy granted her access to this data set, so she was denied access to this table.
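The nested role composition mentioned above means that membership must be resolved transitively before a policy check. A toy sketch, with a hypothetical role structure rather than Ranger's actual model:

```python
# Hypothetical nested role definitions: a role can contain users, groups,
# and other roles, as described above.
ROLES = {
    "sales": {"users": {"emily"}, "groups": {"us-sales"}, "roles": set()},
    "analytics": {"users": set(), "groups": set(), "roles": {"sales"}},
}

def users_in_role(role, seen=None):
    """Resolve all users in a role, following nested roles and
    guarding against cycles."""
    seen = seen if seen is not None else set()
    if role in seen or role not in ROLES:
        return set()
    seen.add(role)
    users = set(ROLES[role]["users"])
    for nested in ROLES[role]["roles"]:
        users |= users_in_role(nested, seen)
    return users
```

Group membership from AD or LDAP would be resolved the same way before the authorization decision is made.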

The next level of fine-grained controls is at the column level. You can write policies not only for the whole table, but also selectively for certain columns. In this case, Emily is only permitted to see these columns: ID, country, region, city, and sales amount; she is not permitted to see the name of the rep. So when she runs this query, a different policy kicks in that provides this scoped-down access, and you'll see that being returned here.

Another thing to notice here: what if you want to restrict Emily to only see the data for her region, without the names of the reps? We can apply a row filter, which gives you row-level security and gives Emily only restricted access to those rows. Applying these row filters is dynamic, which means you don't have to maintain multiple views and the permissioning around those views. It makes it very simple to provide this type of access control policy, restricting the data and showing the right slice of it by dynamically filtering and inserting predicates into the queries. You will see that in this case a row filter policy has been applied, and if you look at the policy construct here, it is very straightforward: for the sales data table within the same database, the role which Emily belongs to is only allowed to see the data belonging to the US. We can apply any valid SQL predicates to restrict the access and filter the rows that Emily gets to see.
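Dynamic row filtering of this kind amounts to injecting a per-role predicate into the user's query at plan time, rather than maintaining per-user views. A toy sketch, in which the filter table, names, and query rewriting are all hypothetical simplifications:

```python
# Hypothetical (table, role) -> SQL predicate mapping, standing in for
# the row filter policies described in the demo.
ROW_FILTERS = {
    ("sales_db.sales_data", "sales"): "country = 'US'",
}

def apply_row_filter(query: str, table: str, role: str) -> str:
    """Wrap the query with the role's predicate, if a filter policy exists."""
    predicate = ROW_FILTERS.get((table, role))
    if predicate is None:
        return query  # no row filter policy for this role on this table
    return f"SELECT * FROM ({query}) t WHERE {predicate}"
```

In the real plug-in the predicate is inserted into Spark's logical plan rather than rewritten as text, but the effect, one policy instead of many views, is the same.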

You can also write coarser-grained permissions at the file level. We authorize the underlying object stores down to the object level, whether it is S3, ADLS, or GCS, and read and write operations can similarly be scoped down. In terms of data protection, we also offer dynamic and static masking as well as static encryption. What you see here is a couple of policies that mask sensitive information such as social security numbers and email: the social security numbers have been hashed, and the emails have been redacted, and the policies that do this live within this environment under the masking area. Masking policies are also fairly straightforward to construct: we choose the column, apply the role or the group or user, and apply the type of mask that we want. Out of the box we provide various types of dynamic masking options, and these can be customized based on the usage. In addition to resource-based policies, which are based on the location or physical attributes of a particular entity, such as a table, column, file, folder, or object, we can also have tag-based policies, which use the classification, business context, or other metadata properties of the object to drive the authorization decision. As we saw in the discovery module, the output is a tag or a classification, and these can be leveraged extensively to write policies across all of the services that we authorize within the Privacera extensions to Ranger in the Cloud. As an example, you can write a policy that restricts access for a particular role, user, group, or some combination, in a particular set of services, to all of the objects that are tagged in a certain way. In this case, anything tagged with sales is only available to this user Emily, and only within Redshift and Snowflake under certain conditions.
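The two masking styles shown in the demo, hashing for social security numbers and redaction for email, might be sketched like this. The specific mask formats are illustrative assumptions, not the product's exact output.

```python
import hashlib
import re

def mask_hash(value: str) -> str:
    """Replace a value with a deterministic SHA-256 hash, preserving
    joinability (equal inputs still compare equal) but hiding the value."""
    return hashlib.sha256(value.encode()).hexdigest()

def mask_redact(value: str) -> str:
    """Replace every alphanumeric character with 'x', keeping the shape
    of the value (separators, length) visible."""
    return re.sub(r"[A-Za-z0-9]", "x", value)
```

Because the masks are applied dynamically at query time, the underlying data is never rewritten; different users can see differently masked views of the same column.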
Through these means we can have very economical policies that apply across a variety of services and across a wide swath of your data landscape, simply by applying the tags and mapping resources to tags, either via discovery or through import from other cataloging tools. Privacera also provides encryption capabilities: you can encrypt the data using a variety of industry-standard techniques, including format-preserving encryption, AES, and other schemes. For Databricks, we provide extended capabilities beyond what the Databricks platform provides, working together with Databricks to authorize at a very fine-grained level, including columns, row-level security, tables, and databases, as well as other usage contexts via Scala or Java. So the variety of languages with which you can interact in Databricks can all be given consistent policies and consistent authorization and auditing, uniformly across the entire space. – As we observed in the demo, we support the entire data access governance lifecycle through four steps.

Data Access Governance Lifecycle

The first step is discovery, where we connect to different Cloud storage and databases, scan the data, and store these annotations within our metadata store. In the second step, we define policies and enforce them through distributed agents that are Cloud native, giving you this functionality for role-based, attribute-based, tag-based, as well as fine-grained access control.

We also saw how you can enforce policies across a variety of services, with a variety of reports and dashboards that provide visibility and give customers the ability to leverage this data from a central pane of glass.

In summary, the business value of our platform is the ability to identify sensitive data across the Cloud footprint, enable secure data sharing in an automated fashion without requiring a lot of human investment, and help customers migrate to the Cloud and work seamlessly in a public Cloud model while staying in compliance with regulations such as GDPR and CCPA, providing streamlined processes with integrated workflows to support all of this from a central pane of glass. That is pretty much a quick overview. Thank you for your attention. Please do visit our booth for a personalized demo, connect with us on Twitter or LinkedIn, and visit our website for additional resources and collateral that will help you understand how we can protect your data in the Cloud and work seamlessly with Databricks to put these controls into effect.

About Dr. Srikanth Venkat


Dr. Srikanth Venkat is VP of Product Management at Privacera. Prior to Privacera, he built out Cloudera's Shared Data Experience (SDX) portfolio, which includes Apache Knox, Apache Ranger, Apache Atlas, and Hortonworks DataPlane Service. His prior career spans a variety of roles at Telefonica, Cisco-Webex, Proofpoint, Trilogy Software, and Hewlett-Packard. Srikanth holds a PhD in Engineering with a focus on Artificial Intelligence and an MBA. He enjoys tinkering with data science, cloud, and AI/ML technologies.