Accelerate Data Science Initiatives: Databricks & Privacera

May 26, 2021 03:50 PM (PT)

Download Slides

Accelerating Data Science Initiatives with Databricks’ Rapid SQL Analytics and Privacera’s Centralized Data Access Governance.

 

Databricks’ SQL Analytics helps data teams consolidate and simplify their data architectures. With SQL Analytics, data teams can perform BI and SQL workloads on the same multi-cloud lakehouse architecture, enabling data scientists to perform advanced analytics on unstructured and large-scale data. This session will explore how Privacera’s advanced security, privacy, and governance capabilities seamlessly integrate with Databricks’ unified SQL Analytics approach to provide single-pane visibility of data analytics from a centralized location. Attendees will learn how to:

 

  • Rapidly access data to run high-fidelity analytics
  • Implement a fully secure solution that ensures productivity, while controlling data access at fine-grained levels (row, column, and file)
  • Easily enable consistent access policies across all systems and applications
  • Support true data transparency across enterprises
  • Comply with stringent industry and privacy regulations like GDPR, LGPD, HIPAA, CCPA, PCI DSS, RTBF, and more with rich auditing and reporting
In this session watch:
Don Bosco Durai, Corporate (CIO, CTO, Chief Data Officer), Privacera

 

Transcript

Don Bosco Durai: Hello, everyone. Today, I’ll be primarily talking about Databricks SQL Analytics, how it is different from the traditional Databricks workspace, its architecture, and how security works in SQL Analytics. A little bit about myself: I am Don Bosco Durai. I’m the CTO and co-founder of Privacera. At Privacera, we do security and governance in the cloud for the open data ecosystem. I’m also a PMC member and committer on the Apache Ranger project.
Privacera is one of the preferred Databricks partners for security. I’ve been working with the Databricks team for a few years now. We originally integrated Apache Ranger with Spark from Databricks, where we provided fine-grained access control for Spark and Spark SQL workloads. When Databricks introduced Delta Lake, we extended our solution to transparently support Delta Lake as well. Recently, when Databricks was working on SQL Analytics, we worked very closely with their team to design the necessary hooks to integrate Privacera seamlessly with SQL Analytics. Some of the things I’ll be talking about today are the outcome of our joint efforts.
Let’s start with personas and use cases. Normally, there are two categories of users. On one side are the business users who want to run queries and are looking for an immediate response. On the other side, we have the data scientists and data wranglers who want to run ML workloads or process gigabytes and terabytes of data. Previously, the same or similar Databricks clusters were used for both use cases. The business users could use Spark SQL, while the data scientists could use ML workflows and Spark jobs. But underneath, it was the same multi-node cluster running Apache Spark.
Obviously, the Scala-based Spark engine is most suitable for large-scale processing. What Databricks did was introduce SQL Analytics, which is a ground-up redesign to specifically address SQL queries. SQL Analytics uses Delta Lake, so you don’t have to move your data around, but you still get very high performance for BI queries.
We’ll talk about security, compliance, and privacy in the next two slides. While SQL Analytics solves the user-experience problem for multiple personas, it also introduces a new challenge. Now you have the same dataset accessed by two different tools, by two different personas, and you have to keep the security and compliance policies consistent across both tools.
Also, now we have two different sets of admins managing the policies. You have one set of admins managing the policies for SQL Analytics using SQL grants and revokes. Then you have another set of admins managing the cluster policies, like the IAM roles for the cluster and the credential passthrough rules. So there’s a possibility of inconsistency in the policies themselves.
Now, when it comes to security and privacy, that’s another thing, right? Generally, security policies are confused with privacy policies. In reality, these are two different sets of policies managed by two different groups. Security policies are primarily about preventing unauthorized access to data. They ensure that users have access only to what they are supposed to have. They are also responsible for encryption, but at a coarser grain, for example, disk, volume, or bucket-level encryption.
Privacy policies, however, are all about how customer PII data is stored and used, and for what purpose. Data scientists and other users might have access to it, but they can only use it based on the consent provided by the customer. There are also privacy regulations that restrict how you can use customer data. For example, let’s say you have a dataset that contains race and gender, and you want to use it for credit scoring, but that is not allowed by compliance regulations.
You have to make sure that your data scientists and users are using the customer data, the PII data in particular, the way consent has been given. Compliance regulations also have additional requirements: you might have to encrypt all your PII and sensitive data and control who can see it. You might have to store it in encrypted form at rest, and at query time, depending on the policy, you may have to decrypt it.
All these different things can lead to governance blind spots, because the privacy team will have their own set of requirements based on state, country, and industry regulations. The security team is responsible for preventing unauthorized access; they will have their own set of policies around preventing data leakage and around data encryption. The owners of the data will have their own concerns; they want to know who is using the data and for what purpose. When you start adding all of them up, you want to make sure they’re not overriding each other’s policies, and that the policies are managed holistically across different systems and tools.
Auditing is another challenging part. With so many different systems and logging mechanisms, it’s very difficult to keep track of who is actually accessing the data and for what purpose. So you should try as much as possible to centralize the audits in one location. This includes SQL access as well as file and object access. There are a lot of advantages to centralizing the audits. First of all, you can see all the access logs in one place, who’s accessing what. Then you can generate different reports based on different requirements.
You may have a compliance team with their own set of requirements. You have the governance team. You have the data owners. They all have different expectations and want to see different things in the reports. And you may have a security team that wants alerting in real time. And if you’re doing classification, that is, tagging your data, you can combine that with these audits and generate more actionable reports, like who’s accessing your PII data, from which system, and for what purpose.
Now, let’s see how we can address this, right? We’ll start with Databricks’ native support for security. Databricks has a pretty rigorous set of security features. For data access, you can have user credential passthrough, you can have IAM roles, and you can have grant/revoke for Spark SQL. We’ll talk a little bit about each. With IAM roles for the cluster, the IAM roles are associated with the EC2 instances where the Databricks driver and the executors are running, right?
So anyone who has access to the cluster has the same level of privileges. With user credential passthrough, you can apply IAM roles at the user level. If you have multiple users using the same cluster, each could have their own IAM role. That way you can control what each user can access. But there is a small downside to IAM roles: you have to set up the policies within a JSON policy document, which can be a maximum of around 30 KB. So you can limit how much they can access, but you will still run into some limits.
If you need more fine-grained permissions, you can use Spark SQL grants and revokes, which provide table-level access control. Additionally, for clusters, you can have cluster policies. With these, you can allow users to create clusters but limit which IAM roles they can associate with the cluster. Databricks also supports permissions at the cluster level, so you can decide who can start and stop a cluster, and it has permissions at the notebook level.
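To give you a rough idea, table-level access control with SQL grants looks something like this. The schema, table, and group names here are just placeholders, not anything from a real workspace:

    -- Illustrative only: schema, table, and group names are hypothetical
    GRANT SELECT ON TABLE sales_db.sales_data TO `sales_role`;

    -- And the permission can be taken away the same way
    REVOKE SELECT ON TABLE sales_db.sales_data FROM `sales_role`;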
Now let’s see what Privacera has done and how we extended Databricks to address this. What we did is we took Databricks’ core security and built on top of it to address some of the more complex enterprise security and compliance use cases. At the same time, we made it easier to manage the policies and keep them consistent across all use cases and tools. While in Databricks you have grant and revoke in SQL, with Privacera, since we use Apache Ranger, you have a REST API and a user interface to manage the policies.
Since we’re using Apache Ranger as the foundation, you also get all the security features that Ranger supports. This includes centralized policy management as well as centralized collection of audits. And since Ranger has been built with big data in mind, it is very scalable. You can support hundreds or thousands of nodes in Databricks without affecting performance.
In traditional Databricks workspaces, our product supports fine-grained access control at various levels. For SQL, we support database-, table-, and column-level access control. We also support dynamic row-level filters. For example, if two users run the same query against the same table, they will get different results based on their permissions and the purpose they’re using the data for. Similarly, we can do column masking in SQL.
Certain sensitive columns can be automatically masked based on what the users are allowed to see and in what format. We also support dynamic decryption. Let’s assume that for security and compliance reasons, you want your PII and sensitive data stored in encrypted form on disk. When you run a query, you want to get the data back in usable form. With Privacera and Ranger, it can be decrypted dynamically based on the access policies. The data is encrypted using keys from Ranger KMS, so you get the benefit of both access control on the data side and access control over who has access to the keys.
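To make that concrete, a masked and filtered query behaves roughly as if it had been rewritten like this. The table, columns, filter value, and the use of sha2 as the masking function are all illustrative; the actual rewrite depends on the policies you define:

    -- What the user submits
    SELECT name, city, sales_amount FROM sales_db.sales_data;

    -- Roughly what gets evaluated for a user whose policy masks city
    -- and restricts rows to one region (illustrative rewrite)
    SELECT name, sha2(city, 256) AS city, sales_amount
    FROM sales_db.sales_data
    WHERE region = 'EMEA';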
In ML workflows, we can also support object- and file-level access control. This works for both reading and writing. Since the policies are defined in Ranger, very similar to HDFS, you can grant access at any level: at the file level or at the folder level. You can have tag-based policies and other ways of managing the policies as well. Similarly, Ranger supports attribute-based access control, role-based access control, and tag-based policies. All these features come for free. With all of these features, without copying or moving data, you can very easily implement consistent policies for GDPR, CCPA, and other compliance regulations across multiple tools and data sources.
Without going into too much depth, at a high level, since this is based on Ranger, in the traditional clusters the Ranger plugin is embedded within the Spark driver itself. So it makes the access decision about who can access which resources and in what form they can see them. It also collects all the audit records and sends them to a central location.
Now, since the plugin runs within the same process, and not as a separate process, it is super efficient, right? What happens is, when you run a query or try to access a file, it does a very quick check of whether the user has permission on the table, the columns, or the objects the user is trying to access, and also whether masking has to be applied, or a row-level filter for that particular query. If so, it rewrites the query accordingly.
Once the check is done and the user has permission, the Ranger plugin gets out of the way and lets Spark do the rest. Spark will then send the request to multiple executors to read and write, and those executors will use the underlying IAM role, if required, to read and write from S3 or ADLS. There’s absolutely no performance overhead.
Now let’s see how we integrated with SQL Analytics. For SQL Analytics, we worked very closely with the Databricks team to provide the same security features as in traditional workspaces. That is, you can do table- and column-level access control, with tag-based and attribute-based policies, dynamic row-level filters, and also dynamic masking.
But we also took it a step further. We did a first-class integration with Privacera Cloud, which is our SaaS offering. With this integration, there’s no need to deploy Ranger or even embed Ranger plugins within Databricks. Instead, what we did was embed the Ranger plugin within our own policy sync process, which runs in our cloud. So you still define the policies in Ranger, which is hosted in our cloud, and the policy sync automatically translates them into grants and revokes in SQL Analytics.
The policies are enforced in real time within Databricks SQL Analytics. Since the data is not going through any proxy and there’s no other external component in the path, there’s absolutely no performance overhead. Now, regardless of who’s accessing the data, how they’re accessing it, or which tool they’re using, the policies are applied consistently across the board. I’ll be showing that in my demo.
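As a rough sketch of what that translation amounts to, when a Ranger policy granting a role select access on a table is enabled or disabled in Privacera, the policy sync issues statements along these lines in SQL Analytics. The object and role names are placeholders:

    -- Policy enabled in Privacera/Ranger: the sync pushes a grant
    GRANT SELECT ON TABLE sales_db.sales_data TO `sales_role`;

    -- Policy disabled: the corresponding grant is revoked
    REVOKE SELECT ON TABLE sales_db.sales_data FROM `sales_role`;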
To summarize, with this integration you get fine-grained access control at the row and column levels. You have single-pane visibility of data across multiple cloud services and SQL Analytics. You also get real-time monitoring of all audit trails, and you can use all of this to comply with industry and privacy regulations like GDPR, CCPA, HIPAA, and LGPD. Now, we’ll go through the demo.
Here, I have three different services running. I have Privacera running out here. The Privacera portal is where you centrally manage all the policies. We support almost all the cloud services, whether it’s Trino, Presto, S3, Athena, Redshift, ADLS, Snowflake, or even Google BigQuery. So you manage the policies in one place, very similar to Ranger. Then I have a traditional Databricks cluster and I have a notebook. And I have one more service, this endpoint, which is SQL Analytics.
For this demo, I’m going to use a single user called Emily, and to keep it simple, I’m going to use only one database and one table. What I’m going to do is change the policies for Emily, or for the role Emily belongs to, and run the queries both on the Databricks cluster and on SQL Analytics. Okay?
Let’s get started. Let me go to Privacera. I’m going to show the first policy. I have a database and a table, sales data, and I have a star, which means all the columns in this particular table. I’m giving the sales role select permission. Now, Emily belongs to the sales role. When Emily runs the query, select star from that table, she will get pretty much all the rows and columns from this table.
If I go to SQL Analytics and run the same query there, I should get the same result. I’m getting the same result, with country, region, city, name, and sales amount, and I’m getting all the rows out here. Okay? Now, what I’m going to do first is go and disable this policy. Okay? Now, with this policy disabled, Emily should not have permission to run the query.
If she goes to the Databricks cluster and tries to run the query, she should get an authorization failure, that is, Emily does not have permission to run a select query on this particular database and table. If she tries to run the same query from SQL Analytics, she will also be rejected, right? You’ll see one difference between the Databricks cluster and SQL Analytics: the error messages are slightly different, because in SQL Analytics, SQL Analytics itself is enforcing the policy, so the messages are native to it, while in the case of the Databricks cluster, it is done by the Ranger plugin from Privacera. So it’s a slightly different message.
There are going to be some other differences as well, which will become clear as we go forward. Now, what I’m going to do is enable one more policy, on the same table, but this time with column-level access control. Here, I’m going to grant access to just four columns; I will not give permission to the name column. And this time I’m going to give it only to the user Emily. Okay? I’m going to save the policy.
If I go to the Databricks cluster and run another query where I’ve explicitly listed the four columns Emily has permission to, this query should go through, and she should be able to see the four columns. If I run the same query in SQL Analytics, I should see the same four columns. Since Emily does not have access to the name column, it would be redacted, but the rest of the data should be identical between SQL Analytics and the Databricks cluster. Okay?
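For reference, the column-restricted query would look something like the following, assuming the column names from earlier; the table name here is a placeholder, not necessarily the exact identifier in the demo:

    -- Naming only the four permitted columns (name is excluded)
    SELECT country, region, city, sales_amount FROM sales_data;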
Now, if we look out here, Emily can see the rows from all the countries and she can see the values for all of them. What I’ll do is enable a masking policy in Privacera for the column city and see what happens. I’m going back to Privacera, to the masking policies. I’m going to enable this policy, on the same table, but for the column city.
What I’m doing out here is, for the role Emily belongs to, I’m setting the hash masking option. You can have different masking types, including your own custom ones, but I’m just going to use the out-of-the-box hash for now. So I’m going to set this. And if I run the same query again against the same database and table, what you’ll see out here is that I still get all four columns, but the city is hashed out. Emily cannot see which city the record belongs to or where the sale was made.
If she runs the same query out here in SQL Analytics, she’ll get pretty much the same result, where the city is hashed out. Now let’s go back, and this time I’m going to do a row-level filter. With the row-level filter, for this same table, I’m going to say that this role only has access to records from the country UK. I’m going to enable this policy.
Since I enabled this policy, if I run the same query again, what you’ll see is that Emily can now only see the data from the country UK. She will see the same behavior in SQL Analytics if she runs the query again; you’ll see that the results are filtered. Now, if you noticed one thing, regardless of which tool she used, the output is consistent across both of them, right?
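Putting the demo policies together, Emily’s query at this point behaves as if it were something like the following. The table name, the sha2 hash, and the literal 'UK' are illustrative of the effect, not the exact rewrite that happens under the hood:

    -- Effective behavior with the masking and row-filter policies enabled
    SELECT country, region, sha2(city, 256) AS city, sales_amount
    FROM sales_data
    WHERE country = 'UK';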
Also, a lot of things are happening dynamically behind the scenes. Even though she’s running the query against the same table, the results change based on what permissions she has, right? So if you have a compliance requirement, say a GDPR requirement where a user can only see data for a given country, or cannot see data for customers who have not given consent, you can automatically filter it out based on which user is running the query and for what purpose.
Similarly, if you have data science use cases where you don’t want to show the data in the clear, you can also go and mask it out. Okay? Just to wrap this up, if I go to the audits, I will see the audits from both Spark, that is the Spark SQL side, as well as from SQL Analytics. Because they are coming from two different sources, in one place we show the full email ID, while the other shows the user differently, but this is more of a technical detail; eventually we will normalize the users as well. Yeah. So, that is all from my side. We’ll open it up for questions.
I hope the session was useful for you. Please feel free to contact me via LinkedIn. And also, don’t hesitate to try out Privacera Cloud for free. Thank you very much.

Don Bosco Durai

Don Bosco Durai (Bosco) is a serial entrepreneur and thought leader in enterprise security. His earlier startup Bharosa built one of the first real-time user and entity behavioral-based fraud detectio...