Data teams face a variety of tasks when migrating Hadoop-based platforms to Databricks. A common pitfall happens during the migration step, where often-overlooked access control policies can block adoption. This session will focus on best practices to migrate and modernize Hadoop-based policies that govern data access (such as those in Apache Ranger or Apache Sentry). Data architects must consider new, fine-grained access control requirements when migrating from Hadoop architectures to Databricks in order to deliver secure access to as many data sets and data consumers as possible. This session will provide guidance across open source, AWS, Azure and partner tools, such as Immuta, on how to scale existing Hadoop-based policies to dynamically support more classes of users, implement fine-grained access control and leverage automation to protect sensitive data while maximizing utility, without manual effort.
Speaker: Steve Touw
– Welcome everybody to my talk on Migrating and Modernizing Hadoop-Based Security Policies for Databricks. My name is Steve Touw, I’m the CTO of Immuta. So what are we gonna talk about today? This is a common question that we get asked, and this is a customer case where they’re either on Cloudera or Hortonworks, or run their own on-prem Spark or Hive deployment: can I just migrate Apache Ranger or Sentry policies directly to Databricks? And I just want to point out that while this talk is focused on Databricks, this could apply to migrating your policies from Ranger and Sentry to Presto, Synapse, Snowflake, Starburst, whatever your compute may be on the cloud. So the short answer is yes, you can migrate those policies right over to Databricks. But if you just do a direct migration, one for one, you are not going to modernize your policies. And so this talk is really gonna be about, how do I get a yes for both? And I’m gonna spend a lot of time talking about what modernization is and why you really need to do it. So why modernize? Looking at Sentry and Ranger, they both started development about eight years ago, Sentry in 2012 and Ranger in 2013. And I don’t wanna make it sound like open-source projects don’t evolve and improve over time, certainly they do. But there are some foundational design decisions that were made at the get-go for these projects that our current environment has made a challenge to keep using. And I’ll talk a bit about that. So the first big thing that’s changed is obviously Hadoop is no longer the center of the universe, I’d argue maybe it never really was. There were always other RDBMSs on-prem that people were using, but suffice it to say, it’s gotten even more complicated ’cause now you potentially have multi-cloud, and then you have multi-compute within each cloud.
And if you are trying to manage your data policies uniquely across all of these different systems, you are going to drive yourself crazy, and it just isn’t feasible. The second big thing is that the data protection laws of the world continue to grow. So this is a screenshot from the DLA Piper law firm website, where they keep track of all the data protection laws. And as you can see, it’s quite expansive across the globe and continually growing. And so what we’re seeing here is what we call a data fuel crisis. Back in the early nineties, if you had data you could pretty much use it, you didn’t have to worry about all this stuff. Then we started introducing regulations like HIPAA, and as we move to the right here and get to our current situation, there are 350-plus privacy and InfoSec bills proposed. It’s getting quite hard to manage all of this. And so the amount of compliant data you have available to use is dropping as the number of regulatory and privacy controls you have to think about increases. And I don’t wanna make it sound like it’s all about the regulatory controls. You want privacy in general, and data privacy is ever more visible to your consumers; Apple has made a business around this: what happens on your iPhone stays on your iPhone. It’s important that your data consumers understand that you’re treating their data appropriately and ethically. So you can see Ranger and Sentry started way back here, before this curve really started to hockey-stick over here on the right. And that has implications. We’ve got this tug of war going on, where you’ve got your legal and compliance team on the left that says, we need to secure our data and meet all these regulations and meet the expectations of our data consumers, our data providers, our customers basically. And on the right, you’ve got the data analysts and data scientists, whose job is to analyze the data you’re collecting, and they want as much of it as they can get.
And they just have that insatiable thirst for all your data, and the poor data platform team, or the data engineering team, is stuck in the middle. They’ve got the compliance and legal team breathing down their neck, and they’ve got the data analyst team yelling at them. So how do they manage this? And the other issue is there’s just continual complexity: these regulations have really changed the definitions of privacy preservation, and it’s no longer just about blocking access to direct identifiers, you need to think about indirect identifiers too. So I stole some language from CCPA that I’m not gonna read here, but we have similar language in the GDPR, and the long and short of it is, if you de-identify or anonymize your data well enough, CCPA doesn’t apply or GDPR doesn’t apply. But nothing in life is free, because PI, personal information as defined, is not only data directly relating to someone, but data that could be reasonably linked to them. And I’ll talk a little bit about this, but this causes a lot more data in your organization to have to be protected than historically has been, and a lot more views into your data. And so that really leads to this privacy versus utility trade-off. Where again, we’ve got this idea on the left of complete privacy, you just hide all your data, which of course is unreasonable. And on the right you just have all your data in the open, which is equally unreasonable. And there’s a lot of momentum and pressure pulling in this direction. So how do you manage this privacy versus utility trade-off, where you can keep both these parties happy while meeting their demands, essentially? You really need to play in this gray area between closed and open. And a lot of these existing tools make very binary decisions from this perspective: you either have access to this table or you don’t, you have access to this column or you don’t. And that coarse granularity causes a lot of problems with this privacy versus utility trade-off.
So combining all those things, the complex data platform ecosystem, more regulatory and privacy concerns, and these new stringent definitions of what privacy preservation really means, we’ve now entered this cloud private data era. And this cloud private data era has created a role tidal wave. And when I say role, I literally mean the technical term role, as in role-based access control. And this isn’t limited to things like Ranger and Sentry, this includes your IAM roles in AWS. And we found that over time our customers’ roles have been exploding. Because of these concerns you have to create all these different views into your data, and this becomes unmanageable for a human or a team of humans to take on. And this is just a role explosion example, from a real customer use case. You can see this is a screenshot of Ranger, but again this could apply to other systems. And you can see here that we’ve got a policy and it’s basically the same policy written over and over again: organization name in, subquery select org name from external table (which I redacted) where role equals R01. The only thing that changes are these roles right here. It’s the same policy over and over again, associated to different users that map to that role. And so, if you’ve got new data you need to expose, you’re gonna have to potentially create a new role and a new policy. This becomes very, very complex. And you get to that explosion problem that I was just talking about. So roles are tied to this idea of role-based access control, an approach almost all legacy RDBMSs take, and some of our newer SaaS database and compute technologies do as well. And RBAC should really be named static-based access control, ’cause at the end of the day, it’s like writing code without being able to use variables. You saw, I was just writing the same thing over and over again for each different role.
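The role explosion above can be sketched in a few lines. Everything here (role names, table names, the filter template) is a hypothetical stand-in for the redacted customer example, just to show how static per-role policies multiply:

```python
# Sketch of the RBAC "role explosion": the same row filter must be written
# once per role, and again for every table it applies to. All names are
# hypothetical stand-ins for the redacted customer example.

ROLES = [f"R0{i}" for i in range(1, 9)]        # 8 roles, R01..R08
TABLES = [f"table_{i}" for i in range(1, 13)]  # 12 tables with the same need

FILTER_TEMPLATE = (
    "org_name IN (SELECT org_name FROM org_lookup WHERE role = '{role}')"
)

def build_rbac_policies(roles, tables):
    """One static policy per (table, role) pair -- no variables allowed."""
    return [
        {"table": t, "role": r, "filter": FILTER_TEMPLATE.format(role=r)}
        for t in tables
        for r in roles
    ]

policies = build_rbac_policies(ROLES, TABLES)
print(len(policies))  # 96 near-identical policies to author and maintain
```

Adding one more table or role grows the policy count multiplicatively, which is exactly the management problem described next.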
And so these two products’ foundations are built upon role-based access control, as well as a world where you just didn’t have this many different views into your data. So they were really conceived before the cloud private data era that I just talked about. So we’ve kind of talked about why you want to modernize. And our argument is that if you don’t both migrate and modernize, you really aren’t going to be able to realize the benefits of the cloud, because of those pressures I just described. And you’re gonna have even more security pressures on the cloud. So, we’ve covered the why, now let’s talk about how to fix each of these. I’m gonna talk about each of these individually; I won’t run through them all on this slide. So we’ll start with this one: the separation of policy from platform. This is pretty straightforward. Just like the big data era required the separation of compute from storage, which everyone knows and loves: you’re able to ephemerally spin up Databricks on top of your data stored in S3, compute on it, spin it down. Similarly you could spin up a Presto cluster to run interactive queries against S3. There’s a lot of power in this, and a lot of flexibility and cost savings. And so for the same reasons, the private data era requires the separation of policy from platform. So this idea of having a single pane of glass abstracted from your actual compute engine, so you’re not uniquely enforcing policy, you have this one place to do things like table access controls, column level controls, row level security, cell level controls. And you can do this in a consistent manner, no matter what your compute is, by separating it. And it’s not just about separating the policy from the platform and the compute, it’s also about separating policy from your physical tables.
So if we think about all these different computes and different metastores that you might have, this almost always ends up being thousands of tables and columns. And if you have thousands of tables and columns and you build policies at the table and column level, you’re gonna have thousands of policies. So if you combine this problem with the RBAC problem I just spoke about, you really get this management explosion and scalability problem with your policies. So what you need to do instead is abstract your physical layout, your physical tables and columns, with logical metadata. Things like PII, PHI, Address, Social Security Number: you can lay this across your physical data structure in a logical way, and just reference those logical tags. And this allows very few, understandable policies to be created. Understandable is a key word here, because now the policies are written such as mask PII, not mask this weirdly named column in this weirdly named table, meaning compliance teams can understand what’s going on. And this logical metadata, Immuta can discover itself, and you can build policies on top of it. Or if you’ve already done this work and it exists in things like BigID or Collibra, it can be pulled in and used to drive policy. And you get a ton of scalability with this approach. So you need to tie that scalability with fixing the RBAC problem. And this is fixed with something called attribute-based access control, and in fact there’s something called policy-based access control as well, which I’ll touch on a little bit. But, if you remember this slide, RBAC should really be called static-based access control. It’s like writing code without being able to use variables. And again, this applies to Ranger and Sentry, and other things like IAM roles. But wouldn’t it be nice if you had something like this?
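The tag-driven idea can be sketched like this; the catalog, tag names, and `apply_tag_policy` helper are illustrative only, not Immuta's actual API:

```python
# A minimal sketch of policy-on-logical-metadata: instead of naming physical
# tables and columns, one policy targets a tag (e.g. "PII") and fans out to
# every column carrying that tag. Table/column names are hypothetical.

CATALOG = {
    "cust_tbl_7a": {"c1": ["PII", "Social Security Number"], "c2": []},
    "txn_raw_19":  {"ssn_hash": ["PII"], "amount": []},
}

def apply_tag_policy(catalog, tag, action):
    """Expand one tag-level policy into per-column enforcement targets."""
    return [
        (table, column, action)
        for table, columns in catalog.items()
        for column, tags in columns.items()
        if tag in tags
    ]

# One understandable policy ("mask PII") instead of one per weirdly named
# physical column; newly registered columns tagged PII pick it up too.
targets = apply_tag_policy(CATALOG, "PII", "mask")
print(targets)
```

The point of the sketch is the indirection: compliance reads "mask PII", while enforcement still lands on every physical column.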
If you could write this policy once, and the role was actually dynamically attached at runtime. So essentially the policy gets defined at query time based on who the user is and what their role is. And this is in fact really what ABAC is all about. ABAC is not about where the attributes come from or how many places they could come from, it’s about how the policy gets enforced at runtime. So this is really dynamic-based access control, where RBAC is static-based access control. And again, Immuta can do this ABAC methodology. So if we revisit this real customer example, they had eight rules. I’m just showing one table that had eight rules on it, which we already spoke about. But then there were 12 total tables that had the same association and needed this same policy. So you had to write 96 total rules to enforce this correctly. And this isn’t very understandable; if you have to make a change, it kinda needs to be the person that originally built this stuff. So, using an ABAC approach, like Immuta’s, this can become a single policy. The first thing we talked about is we can scale this because we’re using the logical metadata rather than the physical table names. So we’re looking at any column tagged organization name and running this where clause on it. And this will propagate this policy everywhere it finds those 12 tables, essentially. And then we’re using the group attribute as a runtime variable, so we don’t have to write it eight separate times for the same table. And so that’s how we’re able to get this down to a single policy. And then also for future-proofing, if you get a new group, this policy still applies; if you get a new table, it will be discovered and the policy gets attached to it. So you don’t have to remember to make these changes. And then the last one is privacy enhancing technologies. So you’ve got these more stringent definitions of privacy preservation. How do you deal with that?
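A minimal sketch of that ABAC idea, one policy whose attribute is resolved per querying user at query time. The schema and binding style are invented for illustration and are not Immuta's implementation:

```python
# Sketch of ABAC: one policy written once, with the user's attribute bound
# at query time, instead of one static policy per role. Illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (org_name TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('acme', 10), ('globex', 20), ('initech', 30);

    -- runtime attributes: which orgs each user's role maps to
    CREATE TABLE user_attrs (user TEXT, org_name TEXT);
    INSERT INTO user_attrs VALUES ('floyd', 'acme'), ('floyd', 'globex'),
                                  ('steve', 'initech');
""")

# The single policy, written once; the user is a runtime variable (?).
POLICY_SQL = """
    SELECT org_name, amount FROM orders
    WHERE org_name IN (SELECT org_name FROM user_attrs WHERE user = ?)
"""

def query_as(user):
    """Enforce the same policy for any user, present or future."""
    return conn.execute(POLICY_SQL, (user,)).fetchall()

print(query_as("floyd"))  # sees only the acme and globex rows
print(query_as("steve"))  # sees only the initech row
```

Adding a new user or org only touches the attribute table; the policy itself never changes, which is the future-proofing point above.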
And of everything I’ve spoken about today, this is probably the least obvious one. And our customers realize this as they become more mature in building these policies. So how do we get to this sweet spot and meet the demands of legal and compliance and the data analysts and scientists at the same time? This is called the privacy utility trade-off, as I mentioned. So just to demonstrate with a silly example why it’s more than just direct identifiers, and you need to worry about indirect identifiers, I’m gonna tell a quick story about Judd and Leslie. Judd Apatow, Leslie Mann, you can see they’re having a good time in their taxi here. And the New York Taxi and Limousine Commission actually released a bunch of data on all their taxi rides. So you could see this is pretty granular data, but it also seems pretty darn harmless. It’s the medallion, the pickup and drop-off locations, the times, the total fare amount and the tip amount. But it actually isn’t that harmless, because a tabloid magazine took all these photos of celebrities in cabs, and cross-referenced the taxi medallion and the photo pickup time with the millions of records in this taxi data, and they were able to pinpoint the exact row for Judd and Leslie, for example. And in this case they tipped $2.10, but there are other examples where the celebrity tipped zero, which I didn’t call out here, but this makes great tabloid fodder, and it was also a great example of how indirect identifiers can lead to privacy intrusion. And so there are a lot of fancy techniques we can use to play in this gray area between privacy and utility that we’ve been talking about, that sweet spot. And I think a lot of the obvious ones are column restrictions, or we could reduce specificity or hash and encrypt, but there are more advanced techniques.
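The linkage attack in the taxi story boils down to a join on quasi-identifiers. The records below are made up, but the mechanics are the real ones:

```python
# Sketch of the linkage attack: one "harmless" sighting (medallion + pickup
# time, e.g. from a tabloid photo) is enough to pinpoint a single row among
# millions of trip records. All data here is invented for illustration.

trips = [
    # (medallion, pickup_time,        fare, tip)
    ("5X19", "2013-06-21 11:28",  7.5, 1.00),
    ("9Y88", "2013-06-21 11:28", 12.0, 2.10),  # the celebrities' ride
    ("9Y88", "2013-06-21 18:03",  9.0, 0.00),
]

sighting = {"medallion": "9Y88", "pickup_time": "2013-06-21 11:28"}

matches = [
    t for t in trips
    if t[0] == sighting["medallion"] and t[1] == sighting["pickup_time"]
]
# Two indirect identifiers together single out exactly one record,
# revealing the supposedly private fare and tip.
print(matches)
```

Neither the medallion nor the timestamp is a direct identifier on its own; it is the combination that re-identifies the row.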
These are the privacy enhancing technologies: K-anonymization can actually suppress the values that can lead to those linkage attacks we talked about with Judd and Leslie. We could do things like local differential privacy, or differential privacy, where we add noise to data, which provides us some guarantees of privacy. We could limit records with row restrictions. We could limit the types of queries you can run, like aggregate-only, and add noise to those aggregations. And this allows you to play in that utility space and get the analysts what they need, but also enforce the privacy controls that are required. And Immuta gets you all of this, where these legacy systems aren’t even thinking about things like K-anonymization and differential privacy. And this is just an example of correctly privatizing the taxi data: had New York done this, that tabloid would never have been able to pull off the stunt that they did, but it also would have provided plenty of utility from that taxi data. They could have hashed the taxi medallion; they could have generalized the pickup latitudes and longitudes and date times, so you couldn’t do the kind of attack I described. Maybe they could have used local differential privacy to randomize the tip amount slightly, just so there’s no guarantee that it’s the exact tip amount that you’re looking at. And so, taking a step back on all of this, think about the attacks that people can carry out, and by the way, attacks don’t have to be on purpose, they could be accidental breaches of regulations as well. You think about the attack event, the probability of an attack actually occurring, and, if someone actually does try an attack, the success of that attack; and that’s represented by these circles. So what we’ve been talking about is removing data risk, or shrinking our success circle.
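Those suggested fixes for the taxi data can be sketched together. The rounding grid, time truncation, and noise width below are illustrative parameter choices, not recommendations from the talk:

```python
# Sketch of privatizing a taxi record: hash the medallion, generalize the
# pickup location and time, and randomize the tip slightly. Parameters
# (rounding precision, noise scale) are illustrative choices.
import hashlib
import random

def privatize(trip, rng):
    medallion, lat, lon, pickup_time, tip = trip
    return (
        hashlib.sha256(medallion.encode()).hexdigest()[:12],  # unlinkable id
        round(lat, 2), round(lon, 2),           # ~1 km location generalization
        pickup_time[:2] + ":00",                # truncate time to the hour
        round(tip + rng.uniform(-0.5, 0.5), 2)  # small local noise on the tip
    )

rng = random.Random(42)
raw = ("9Y88", 40.74183, -73.99311, "11:28", 2.10)
safe = privatize(raw, rng)
print(safe)
```

The medallion-plus-exact-time join from the tabloid attack no longer works, yet aggregate analysis of fares, tips, and traffic patterns remains useful.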
And we can do this with those privacy enhancing technologies, and even simple policies: things like K-anonymization, local differential privacy, differential privacy, and masking. What we didn’t talk about is reducing context risk, or the likelihood of an attack even occurring. And we can reduce this through things like purpose limitations and agreements that users agree to or sign, giving you a legal audit trail. And this is a concept that exists in a tool like Immuta, that doesn’t exist in these legacy systems. So if you align this to what we’ve been talking about, both Sentry and Ranger don’t even have a concept of context controls at all, and they provide a limited amount of data controls; Sentry basically can just block tables and columns. So there’s just a little bit of gain in shrinking S here. Ranger can do slightly more, it can generalize, it can do row level policies, so S shrinks slightly more. But you’re still left with this large attack circle, where something like Immuta can reduce your likelihood of attack with the contextual controls, and we have a wide range of privacy enhancing technologies to shrink your success rate with the data controls. And so this significantly reduces risk, but the real value is that not only are you reducing risk, you’re providing more utility. ’Cause you’re letting people get at data that, with coarse-grained controls, you’d essentially just be completely blocking, reducing their utility. So, I like to think of a tool like Immuta as not taking data away from analysts. You’re actually getting them more access to data while reducing your risk, which is very, very powerful. So you might say, this is all great, but I put all this effort into Sentry and Ranger, and this seems like a big change. How am I gonna actually move to something like Immuta? Well, we made that easy for you. We built a migration utility that you can use to migrate from Ranger or Sentry to Immuta.
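The circle picture can be read as a back-of-envelope product of two probabilities: the chance an attack is attempted (context risk) times the chance it succeeds (data risk). The numbers below are invented for illustration:

```python
# Toy version of the risk picture: overall risk = P(attack attempted) *
# P(attack succeeds). Context controls shrink the first factor, data
# controls (masking, k-anon, noise) shrink the second. Numbers are made up.

def risk(p_attempt, p_success):
    return p_attempt * p_success

baseline = risk(p_attempt=0.50, p_success=0.80)  # coarse table/column grants
data_ctl = risk(p_attempt=0.50, p_success=0.20)  # + masking, k-anon, noise
both_ctl = risk(p_attempt=0.10, p_success=0.20)  # + purpose limits/agreements

print(baseline, data_ctl, both_ctl)
```

The takeaway is multiplicative: shrinking either factor helps, but shrinking both, which coarse legacy tools cannot do, compounds.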
The Sentry one is GA; the Ranger one we’re still working on, it’s in private preview right now. But the idea is to not only migrate your Sentry and Ranger policies one for one to Immuta, but also do this modernization I’ve been talking about. Get the scalability, go from whatever it was, 90-some policies down to one, instead of keeping the antiquated approach that you have with these tools. So I’m gonna stop here and give a quick demo of this. Let’s provide a quick demonstration of what I’ve been talking about: translating Sentry policies over to Immuta. So the first thing we’re gonna show is a grant statement on this fraud analyst role. In Sentry, we granted this role select on the database default. We’ve also granted this role the specific columns that they’re able to query on this table. So notice, we’re not saying what they’re not allowed to see, we’re saying what they’re allowed to see. And this implicitly grants them access to the table. Note that if I granted select on this table outside of these columns, that would overwrite this, so it does get quite confusing in Sentry to manage all this. So I’m gonna show you this user Floyd that’s in this fraud analyst role. I’m gonna run a show databases to show what he can see, and I’m in Impala right now: default and non DBFS, I have not gone over to Databricks yet. I can also show you that I can only see the credit card transactions table in non DBFS, ’cause again, that was the only thing I was granted via these column level controls. And if I try to do a select star on credit card transactions, it’s not gonna let me, because, remember, I’m only able to see these specific columns in here. So Floyd needs to be smart enough to know which columns he’s allowed to actually query. So I’m gonna explicitly list all those out here, and now I’m actually gonna be able to get results back. So, as I mentioned, we have not transferred any of this over to Immuta or Databricks yet.
So, just to prove this to you, I’m gonna try to query this non DBFS database: Floyd can’t see anything in there. If I try to query the credit card transactions table, this table doesn’t even exist. And if I go to Floyd’s Immuta console or catalog here, there just aren’t any tables available to him. If I jump over to Steve’s Immuta, Steve is the data owner. So these tables actually exist as far as he’s concerned, they’ve been registered with Immuta, we simply haven’t transferred the policies over yet. So for example, if I look at this credit card transactions table, there are no members of it except for me, and there are no policies in here yet. I spoke about these global policies during the slides, but there are no global policies yet; global policies are what we term policies built at that logical metadata layer. And there are none of these subscription policies either. So what I’m gonna do is come back here to my shell, and I’m going to run this command that will actually register the Sentry policies with Immuta. And this is done by reading the Sentry database and making API calls out to Immuta. So it’s doing all that work right now. And notice, since we’re just reading from the Sentry database, we do not need Immuta to reach into the Sentry deployment over the network; you can run this locally and just ship everything out to Immuta. So this is the important part. If you remember, we talked about migrating and modernizing. So before, or currently in Sentry, we have 22 different grant statements; I only showed you a few of them here. And we were able to boil this down to four subscription policies in Immuta and two data policies in Immuta. So significant gains in scalability and understandability here.
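What that "modernize" step conceptually does can be sketched as grant consolidation. The data model and the `@user_role` variable syntax below are invented; the real utility reads the Sentry database and calls Immuta's API:

```python
# Hypothetical sketch of the modernization step: read per-role grants that
# differ only in the role literal, and collapse them into a single
# attribute-driven policy. Data model and syntax are invented for
# illustration, not the actual migration utility.
import re
from collections import defaultdict

grants = [
    {"table": f"table_{t}",
     "filter": "org_name IN (SELECT org_name FROM org_lookup "
               f"WHERE role = 'R0{r}')"}
    for t in range(1, 13) for r in range(1, 9)
]  # 96 near-duplicate Sentry-style rules

def consolidate(grants):
    merged = defaultdict(lambda: {"tables": set(), "roles": set()})
    for g in grants:
        # Replace the role literal with a runtime variable to spot duplicates.
        template = re.sub(r"role = '[^']*'", "role = @user_role", g["filter"])
        role = re.search(r"role = '([^']*)'", g["filter"]).group(1)
        merged[template]["tables"].add(g["table"])
        merged[template]["roles"].add(role)
    return merged

policies = consolidate(grants)
print(len(policies))  # 96 static grants collapse into 1 dynamic policy
```

Grouping by the filter template is what turns "migrate one for one" into "migrate and modernize": the output policy covers all 12 tables and all 8 roles at once.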
And just to show you this, if we come back out here and look at this credit card transactions table, you can see we’ve got this policy here masking the credit card number, and the only people that are allowed to see the credit card number in the clear (so it’s kind of like the inverse of Sentry) are the admin or the data owner. Similarly, we have the subscription policies that got created. For example, you can see that that fraud analyst is granted access to the default database here. And this got applied globally across all the tables in Immuta. So, let’s show Floyd actually querying this now. First of all, he’s gonna be able to see all this stuff in Immuta. So these two tables are the credit card transactions in the default database and this one in non DBFS, as we remember. So if I show the tables in non DBFS, we’re gonna be able to see credit card transactions. And if I query credit card transactions, the interesting thing here is I can do a select star, and credit card number is nulled out for me. So that was the masking type that we selected. But we can get fancy here, so let’s use one of our privacy enhancing technologies. I’m gonna edit this policy, and instead of masking with null, I’m gonna replace it with something called format preserving masking. And this is going to convert the nulled-out mask to what looks like a real credit card number, but it isn’t the real credit card number. So notice these are all nulls right now. I’m gonna rerun this query, and we now see what look like real credit card numbers. Now I’m going to save this right here, because I’m gonna add this role as an exception to this policy, so we can see what the real credit card number looks like under the covers. So let’s come back here and edit this again. I’m gonna add this exception: possesses attribute, Sentry role, fraud analyst, because remember that’s the one that Floyd’s in.
So now this rule does not apply to him; it’s no longer gonna mask with format preserving masking. So if I rerun this query and we pay attention to this first cell again, we can see that we’ve got the real credit card number. The last several digits here are real, and these ones up here were fake. So it lets you see what look like credit card numbers, but really aren’t. Obviously there are a lot of other features here that I’m glossing over, but that was a quick demonstration of translating Sentry policies, getting that modernization, and then being able to manually update these policies to give you more granularity and open more tables and columns to users, because you can play in this gray area between hiding the column versus actually getting some utility out of it, while also adding a layer of privacy. So thanks again for your time. And I encourage you to please leave feedback on this session. Thanks.
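The "format preserving masking" from the demo can be sketched with a toy keyed-hash version. Real systems use format-preserving encryption ciphers (e.g. FF1/FF3); this is only an illustration of the idea that the masked value keeps the shape of a card number without being the real one:

```python
# Toy sketch of format-preserving masking: derive a fake but realistically
# shaped 16-digit card number from a keyed hash of the real one. This is an
# illustration, not Immuta's implementation and not a real FPE cipher.
import hashlib
import hmac

def mask_card(card_number, key=b"demo-key"):
    digest = hmac.new(key, card_number.encode(), hashlib.sha256).hexdigest()
    # Map the first 16 hex chars to decimal digits (deterministic per input).
    digits = "".join(str(int(ch, 16) % 10) for ch in digest[:16])
    # Re-insert the dashes so the masked value keeps the 4-4-4-4 format.
    return "-".join(digits[i:i + 4] for i in range(0, 16, 4))

masked = mask_card("4111-1111-1111-1111")
print(masked)  # same 4-4-4-4 shape, deterministic, but not the real number
```

Determinism matters for analytics (joins and group-bys on the masked column still work), which is why this sits in the gray area between nulling the column and exposing it.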
Steve Touw is the co-founder and CTO of Immuta. He has a long history of designing large-scale geo-temporal analytics across the U.S. intelligence community - including some of the very first Hadoop analytics and frameworks to manage complex multi-tenant data policy controls. Previously, Steve was the CTO of 42Six Solutions (acquired by Computer Sciences Corporation), where he led a large Big Data services engineering team. Steve holds a Bachelor of Science in Geography from the University of Maryland.