Case Study and Automation Strategies to Protect Sensitive Data

For data teams, migrating new workloads into Databricks, whether from Hadoop platforms, cloud compute layers, or on-premises databases, is a significant undertaking. A critical step in migrating workloads, especially sensitive data, is to provision access controls that enable compliance with internal rules or privacy regulations such as GDPR, CCPA, or HIPAA. This session will explore various Databricks access control scenarios (such as credential passthrough, table ACLs, and partner solutions) to automate security and privacy controls on sensitive data. For each scenario, automation strategies will cover managing user access, enforcing data policies, implementing privacy-enhancing technologies, and data movement. A case study will be presented to put these concepts into practice, including lessons learned at a Fortune 500 company undergoing a complex, on-premises migration to Databricks Delta Lake to unify their data analytics while complying with internal rules and privacy laws such as CCPA.


Video Transcript

– Hi, welcome everybody. Thanks for joining us to discuss automation strategies to protect sensitive data. We’ll also be going through a case study about the topics that we’re gonna talk about at the beginning of the deck.

I’m Steve Touw, I’ll talk a little bit about myself in the next slide. Joining me is Greg Galloway from Artis Consulting. I’ll let him introduce himself when he starts talking, but he’s kind enough to talk about a case study with a real live customer, where he went through these challenges that we’re gonna talk about today and how they solve them. (keyboard clicking) So I’m the CTO at Immuta.

CTO, Immuta

Our goal is to enable the legal and ethical use of data. We were just recently nominated as one of Fast Company’s Most Innovative Companies of 2020, which is pretty exciting. As you can see here, we’re also a very strong partner with Databricks and AWS.

The reason I bring this slide up is that, I’ve had conversations with hundreds of customers and what we’re trying to do in this deck today is boil down the common themes that we see the customers struggle with, and then bring that to reality when Greg talks about his case study. (keyboard clicks) And here’s a quick look at the agenda. So I’m gonna cover all these topics starting with why you should care.

Databricks’ Unified Data Analytics Approach

So if you think about Databricks, it is a unified data analytics approach. You could really think of it as these three pillars of data, AI and people. And so you of course need the data and then you of course need the tools to do the machine learning and analytics against that data and it allows you to have one place where all of that can occur, have your data brought together, your people on that data, building the algorithms they need to achieve whatever they want to achieve with their data.

Does Protecting Sensitive Data Slow you Down?

However, if you lay governance on top of this, a lot of people will kinda throw up their arms and then roll their eyes and say, Oh great, this is really gonna slow down everything I’m trying to do. With Databricks, for example, you really need consistent controls, no matter what language the user wants to use, like Python, SQL, R, Scala, you really wanna have a hundred percent of your data live there and a lot of people hold back bringing their data into Databricks, because they’re concerned about their privacy and sensitivity of that data, which is a challenge. And then of course you need to involve legal and compliance as part of the people that are involved with this entire framework, so they can help make decisions and understand what’s happening in terms of policy. So I’m actually here to argue that this shouldn’t slow you down and in fact, it can make you more efficient and get you more access to more data across all of your analytics and meet your goals. So how is that possible? (keyboard clicks) So the first thing we’re gonna cover from that angle is some different architectures that exist in Databricks to manage this sensitive data.

Classic Access Controls in a Simplified Snapshot

So before I dive into those, this is a quick overview of a simplified snapshot, if you will, of how you do access control. So there’s of course table level controls. So this would be, think of these as entitlements, who has access to the entire table and then within that, you get more granular, such as column controls, who can see the columns and I’m gonna talk a little bit about masking of columns, and what that means, then you can do things like make rows disappear, or that’s termed row level security, where some people can see some rows, but other people can’t. And then you could even go deeper than that and do cell controls, which is similar to column controls, except not all the cells in the column get masked, only certain cells based on potentially other values in that same row. (keyboard clicks) (clears throat) So here’s some different options for you in a Databricks world and in fact, this spans beyond Databricks. And I’ll talk about each of these, in the subsequent slides, where you’ve got control at compute, control at storage, and then probably something that people aren’t familiar with, which is, the policy is actually separated from the platform altogether. (mouse clicking) So the first one here is controlling at the compute layer and I think everyone’s very familiar with this.
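The four granularities described above can be sketched in a few lines of plain Python. This is only an illustration over an in-memory list of rows, not a Databricks or Immuta API; the table contents, column names, and group name are hypothetical.

```python
from typing import Callable

Row = dict[str, object]

# Hypothetical table contents, for illustration only.
CUSTOMERS: list[Row] = [
    {"name": "Ann", "region": "East", "ssn": "111-22-3333", "opted_out": False},
    {"name": "Bob", "region": "West", "ssn": "444-55-6666", "opted_out": True},
]


def table_control(rows: list[Row], user_groups: set[str], allowed_group: str) -> list[Row]:
    """Table level: the user gets the whole table or nothing."""
    if allowed_group not in user_groups:
        raise PermissionError(f"table requires membership in {allowed_group!r}")
    return [dict(r) for r in rows]


def column_control(rows: list[Row], column: str, mask: str = "REDACTED") -> list[Row]:
    """Column level: mask one column in every row."""
    return [{**r, column: mask} for r in rows]


def row_control(rows: list[Row], keep: Callable[[Row], bool]) -> list[Row]:
    """Row level: rows the user may not see simply disappear."""
    return [dict(r) for r in rows if keep(r)]


def cell_control(rows: list[Row], column: str,
                 should_mask: Callable[[Row], bool], mask: str = "REDACTED") -> list[Row]:
    """Cell level: mask the column only in rows where another value says so."""
    return [{**r, column: mask} if should_mask(r) else dict(r) for r in rows]
```

For example, `cell_control(CUSTOMERS, "ssn", lambda r: r["opted_out"])` masks Bob's value but leaves Ann's visible, which is the difference between a cell control and a column control.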

Control at the Compute Layer

This is pretty much how every database works, where you build the rules in the platform, on who can see what. And Databricks has something called table ACLs to manage this for you, where you can define what roles or users have access to what tables. This is all defined within the Databricks construct. And table ACLs are only gonna get you the table level access controls. This does not enable you to do column, row or cell level controls. So an example supported policy here at the bottom of the slide is only allow members of the data engineering group access to the customer table. So moving on to the next one, (keyboard clicks) we’ve got control at the storage layer. So this is fairly unique to the cloud, or I should say very unique to the cloud in the sense that instead of doing the controls natively in Databricks, Databricks can actually do something called credential passthrough and pass that down to the cloud provider at the storage layer and let that make decisions on who should see what buckets or folders or files. So you can keep using your regular old controls that you have in ADLS or S3, which define what roles have access to what files in storage, and that basically propagates through to the tables that you’re using in Databricks. So again, this is really just table level controls ’cause at the end of the day it’s files that back the tables that you’re restricting access to, and you can’t do things like column, row, and cell. So this example policy is very similar to the prior one I just described except instead of allowing the engineering group access to the customer table, we’re saying the customer S3 directory.

But Use Copies/Views To Do More Than

So what do folks end up doing? In those first two scenarios, as I mentioned, they don’t do column, row or cell, so kind of a naive approach to solving this problem is, let’s just create a bunch of views or copies of the data to manage these different access levels or the granularity that I need. So I’m calling them lenses in this slide, so different lenses into your data. And so we’ve got that first lens. So you had the original table, which everyone could see everything in that original table and then you could potentially build a copy that removes that one column in purple in that lens number one, you could have another lens which removes that same column, but also removes some rows and you could have a third copy or a third view that removes a different column, but adds different rows. And then you have to manage access to all these copies and views separately. This quickly, as you can see, can explode out of control. And so, what we actually call this is an anti-pattern.

Controls using Copies/Views is an Anti-Pattern

So an anti-pattern, meaning it sounds good on the surface, but it actually makes more headaches for you down the road. So one of the obvious issues here is that in a lot of systems, the user querying the view needs access to the backing table too, so that really defeats the purpose, so you really have to end up creating copies, which is the case with Databricks. And so if you’re creating all these copies, you obviously create this proliferation of copies all over the place. And in order to maintain all those copies, you need a series of ELT jobs to create all those transformations and create all those data copies and views. I call this ELT spaghetti. But adding even more complexity to that is if you have updates happening, historical updates, you know, like someone now updates a value in the table from three months ago, you need to account for all those historical updates and when you created your copy so that everything stays up to date. This could get really, really complex. And then of course, as I mentioned on the prior slide, if you’re creating all these new copies or views of the data, that means that you have to manage all the roles that have access to those copies and views. So this really leads to role explosion. So all of this is gonna cost you a lot of time and money, which is why it’s an anti-pattern. A very simplistic example down here at the bottom is you have a hundred tables and you need three different lenses into each of those tables.

It’s not realistic that it would always be three, but just bear with me for this example. That would mean that you would have to have 300 ELT jobs, which create 300 different copies and views, not counting your original table, and then you’re gonna need somewhere between three and 300 roles, depending on how homogeneous those three different lenses are across all those tables. This can quickly get crazy, and that’s with very few lenses, and I would argue a hundred tables is very few tables. (mouse clicking) So what do you do?
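The arithmetic in this example is simple enough to write down. The helper below is a hypothetical illustration (the function name and dictionary keys are made up here), assuming every table gets the same number of lenses.

```python
def copy_view_footprint(tables: int, lenses_per_table: int) -> dict[str, int]:
    """Count the artifacts needed when each lens is a separate copy or view."""
    copies = tables * lenses_per_table
    return {
        "elt_jobs": copies,             # one job maintains each copy or view
        "copies_or_views": copies,      # not counting the original tables
        "min_roles": lenses_per_table,  # the lenses mean the same thing on every table
        "max_roles": copies,            # every lens on every table is different
    }


print(copy_view_footprint(tables=100, lenses_per_table=3))
# {'elt_jobs': 300, 'copies_or_views': 300, 'min_roles': 3, 'max_roles': 300}
```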

Separate Policy from Platform

So this is where really the separate policy from platform comes into play. So just like the big data era required you to separate compute from storage to scale, the personal data era, my argument is, requires the separation of policy from platform. And so you can define your policy external of Databricks and storage in a consistent way to manage these controls and in fact, if you do this, you get beyond just table level controls and it can be enforced dynamically, so you’re not having to create these copies or views. One user querying the same table as me may see a completely different view or lens into that table, than I would based on the policy and this could happen down to the column row and cell level. So an example kind of crazier policy that you could support is by default everyone can see rows greater than 90 days old, but people that are insiders can see data newer than 90 days, but they can only see it for their region. And we’ll come back to this example in a little bit. (mouse clicking) Okay.
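To make the example policy concrete, here is a minimal sketch of it as a filter applied at query time. It is plain Python, not Immuta's policy language; the `User` class, the `event_date` and `region` field names, and the choice to treat a row that is exactly 90 days old as recent are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class User:
    name: str
    insider: bool = False
    regions: frozenset[str] = frozenset()


def visible_rows(user: User, rows: list[dict], today: date, window_days: int = 90) -> list[dict]:
    """Apply the policy per query instead of materializing a copy per audience."""
    cutoff = today - timedelta(days=window_days)
    visible = []
    for row in rows:
        if row["event_date"] < cutoff:
            # Older than the window: everyone can see it.
            visible.append(row)
        elif user.insider and row["region"] in user.regions:
            # Recent data: insiders only, and only for their own regions.
            visible.append(row)
    return visible
```

Two users running the same query against the same table get different results, and no extra copies or views exist anywhere.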

Why? The Same Reasons You Separate Identity Management

So I’ll elaborate a little bit, why would you wanna separate your policy from your platform? So it’s the same reason that I guess that most of the people watching this separate your identity management from your tools. ‘Cause you want consistent users across all those apps, you don’t want Steve to be called Steve in one app and Steven in another, you want consistent authentication across those apps and the way people authenticate to be consistent, you want complete visibility into who your users are and what their groups and attributes are. You want consistent audit on when people are logging into where. And I think probably the most important thing is, the reason why a lot of us use Okta is, it’s Okta’s day job to build identity management. They’re gonna give you all the necessary bells and whistles that you need to do what you need to do. Similarly, you would want to separate your policy management for very similar reasons. You want consistency in the policy enforcement across your database’s compute and storage. You want complete visibility into the policies that are being enforced. You need consistent audit across all your data interactions or your queries, but also the management of your policies, who’s changing what in your policy rules. And similarly, it’s their day job. So us at Immuta, we are a policy management platform that exists separate from the platform and this is why we’re able to do a lot of the very advanced anonymization techniques I’m about to talk about, but also the dynamic enforcement down to the cell, row and column level.

Protecting Sensitive Data With the Right Architecture

So if we lay this back on our slide that we started this with, when I said, Hey, don’t let governance slow you down, well, if you’re using the right tool for the right problem, it doesn’t have to slow you down. So you can have consistent controls across Python, SQL, R and Scala, which Immuta supports. And in fact, you can see I have a bunch of other icons in this triangle now, it doesn’t have to just be Databricks. You can have consistency no matter where your data lives. And then down here on the left, you were afraid to move your data into the platform ’cause you didn’t have the controls needed, well, now you can move it all there and enable your analysts to have more access through anonymization techniques, which I’ll talk about where you can actually gain utility from data that typically would have just been completely hidden, but now we can kind of fuzz it or half hide it so that you still gain some utility, but also are maintaining some level of privacy. And then also of course now your legal and compliance teams can see everything that’s going on, understand what policies are being enforced. It’s not a black box per compute that you’re using. So we actually are accelerating your data initiatives through governance. Okay, so now that we understand the architectures and some concepts around how to do the fine grain controls, I’m gonna talk a little bit more about anonymization.

Column Controls-It’s a Lot More Complex…

Okay, so when we talk about column controls, it really shouldn’t just be about, should you see this column or not? This can’t just be a binary decision, and I’m gonna make an argument why here. So a slight tangent.

So I know stuff about Judd and Leslie Apatow. So they’re having a great time in their taxi cab in New York City here, and the New York Taxi and Limousine Commission, this happened a while ago, I think this data has been around for like eight years now, released all their information about the taxi pickup and drop off times and locations, the amount of the ride, the tip amount. All this data seems fairly harmless, right?

Well, Judd and Leslie may not think it’s harmless. So if you look closely at this data, you’ll see there on the bottom left that we’ve got the taxi medallion and pickup time. And then over on the right, we also have the tip amount here. And if you look up at that image, you can see that we also have the taxi medallion. So what the tabloids did, I think it was Gawker, was map the taxi picture, based on the medallion and the timestamp for the picture, to the taxi data, and amongst millions and millions of records, they were able to find Judd and Leslie’s ride. And because of that, they know how much Judd and Leslie tipped the taxi driver. There are a bunch of other celebrities they did this to, who tipped the taxi drivers $0, who I’m not going to disparage here, but at least it gives you a sense of what happened with this kind of attack.

So this is an example of what’s called a linkage attack. So we took some information we know from the outside world, in this case, the medallion and photo time, and we were able to link that to the medallion and pickup time in the New York taxi data, to break privacy. So New York actually kinda thought ahead a little bit and said, Hey, let’s mask the taxi medallion. This might’ve happened after this attack, I can’t remember. But it wouldn’t have mattered either way, and I’m gonna tell you why. Even if you mask that direct identifier of the taxi medallion, you could still do this attack. (mouse clicks) So if you look at Judd and Leslie here, remember we have the medallion and pickup time attack. So if we go ahead and hide the medallion, now that attack doesn’t work anymore. But we can simply use the pickup time and location, if we knew that, to break privacy, ’cause then that’s unique enough to uniquely identify them amongst those millions of rows. So let’s go ahead and hide the pickup time. Well, what if I knew the pickup location and drop off location? Again, well, let’s hide the pickup location, so you can’t do that. All right, well, what if I know the drop off location and drop off time? Oh, crap. Okay, let’s hide the drop off location. And so I think you get the idea here is that as you keep hiding what are called quasi identifiers or direct identifiers, you just make your data useless, you’re gonna end up hiding everything. So how do we solve this problem?
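A linkage attack is essentially a join between a released dataset and something the attacker already knows. The sketch below uses entirely made-up records and hypothetical field names; it only illustrates why masking the medallion alone does not stop the attack when other columns are jointly unique.

```python
def linkage_attack(released: list[dict], outside: list[dict], keys: tuple[str, ...]) -> list[tuple]:
    """Join outside knowledge to released rows on `keys`; a unique match re-identifies."""
    index: dict[tuple, list[dict]] = {}
    for rec in released:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    hits = []
    for known in outside:
        candidates = index.get(tuple(known[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one candidate row: privacy is broken
            hits.append((known["who"], candidates[0]["tip"]))
    return hits


# Made-up released trips and one made-up observation from a photo.
released = [
    {"medallion": "M1", "pickup_time": "19:34", "pickup_loc": "A", "tip": 0.0},
    {"medallion": "M2", "pickup_time": "19:34", "pickup_loc": "B", "tip": 2.5},
    {"medallion": "M1", "pickup_time": "21:10", "pickup_loc": "C", "tip": 1.0},
]
outside = [{"who": "rider-x", "medallion": "M2", "pickup_time": "19:34", "pickup_loc": "B"}]

# Direct identifier plus time links the photo to a single ride.
print(linkage_attack(released, outside, ("medallion", "pickup_time")))

# Masking the medallion does not help: time plus location is still unique.
masked = [{**r, "medallion": "MASKED"} for r in released]
print(linkage_attack(masked, outside, ("pickup_time", "pickup_loc")))
```

Both calls return the same re-identified tip, which is the point of the passage above: hiding one column at a time just moves the attack to the next combination of quasi-identifiers.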

And this problem exists well beyond just my silly taxi example. So I’m gonna play a lawyer for a moment, and this is language from the CCPA, and there’s also a similar language in the GDPR.

Language from CCPA (and other similar language in GDPR)

So this first orange blob is basically discussing, if you anonymize or de-identify your data well enough, CCPA doesn’t apply. Cool, this is the get out of jail free card. I can analyze my data all I want ’cause I’ve anonymized it well enough. But as we all know, nothing in life is free and so PII is defined as information that identifies, and this is in the CCPA, that identifies, relates to, describes, or is capable of being associated with or could reasonably be linked. So they are talking about a linkage attack, like we just talked about in the taxi use case. So if you want to anonymize your data, you need to worry about quasi-identifiers and indirect identifiers.

So these little bubbles kind of walk through the kinds of identifiers you would wanna be concerned with and align them to the taxi data and talk about some techniques you could use to mitigate them. So we’ve got the direct identifiers which show the taxi medallion, we’ve got the indirect identifiers, which is the pickup drop-off times and locations, and that was what was used for the linkage attack. Remember we have sensitive data like the tip amount, which is what was released about the celebrities that was embarrassing, right?

And so let me go to the next slide which actually discusses some of these privacy enhancing technologies.

Privacy Enhancing Technology (PETs): The answer instead of binary yes/no

So there’s a bunch of different ways you could do this through obfuscation, generalization and randomization. And these aren’t techniques that Immuta made up, these are called Privacy Enhancing Technologies that have been around for a while, but actually implementing them and enforcing them in something like Databricks and Spark dynamically is highly complex and it’s not something you can just kind of on a whim ask your data engineering team to do. So these are things like differential privacy; randomized response, which can replace the tip amount with random tip amounts; K-anonymization, which can remove highly unique values; and rounding, which aligns with generalization, making the pickup and drop off locations and times less specific. So lots of tricks you can do to balance utility and privacy. So to show you what I mean by that balancing, let’s apply some of these rules to the taxi data. So of course, we’re gonna mask the direct identifier, which is the taxi medallion, but then we’ve got all those indirect identifiers, the drop off and pickup locations and times. So rather than just completely blocking them, we can generalize them to make them less specific. So we remove the minutes and seconds from the times, we remove some precision from the coordinates of the drop offs and pickups. To an analyst, this shouldn’t really matter, right? They could still do all the aggregate traffic analysis that they need to do from this taxi data, but now they cannot do that linkage attack that we showed earlier. And similarly, we could randomly replace the tip amounts, if it made sense, where sometimes we can replace the tip amount with a legitimate but fake tip amount, so the attacker doesn’t know for sure what the tip amount is in the data.
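As a rough sketch of what these techniques do to individual values (not how Immuta or Spark implements them), the helpers below generalize a timestamp and a coordinate, apply randomized response to a tip, and suppress rare values. The function names, the `p_truth` parameter, and the single-column suppression are assumptions; real k-anonymization works over combinations of quasi-identifiers, and differential privacy is not shown.

```python
import random
from collections import Counter
from datetime import datetime


def generalize_time(ts: datetime) -> datetime:
    """Generalization: drop minutes and seconds so the time is only known to the hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def generalize_coord(value: float, decimals: int = 2) -> float:
    """Generalization: keep fewer decimal places of a latitude or longitude."""
    return round(value, decimals)


def randomized_response(true_value, plausible, p_truth: float, rng: random.Random):
    """Randomization: report the true value with probability p_truth, else a plausible fake."""
    if rng.random() < p_truth:
        return true_value
    return rng.choice(plausible)


def suppress_rare(values: list, k: int) -> list:
    """Simplified stand-in for k-anonymization: null out values seen fewer than k times."""
    counts = Counter(values)
    return [v if counts[v] >= k else None for v in values]
```

For instance, `generalize_time(datetime(2013, 7, 8, 19, 34, 12))` yields 19:00 on the same day, so a photo timestamp no longer pins down a single ride, while hourly traffic analysis still works.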
But again, so basically, if we just naively blocked all these columns, all we’d be left with is the total trip length, or the trip cost, but with these anonymization techniques, you get full utility out of this data while also protecting privacy. (keyboard clicks) Okay, the last thing I’m gonna talk about before I turn it over to Greg is scalability. So, when you start thinking about all these lenses into your data and all these anonymization techniques you need to consider, this completely explodes the universe of copies you would have to create and manage, if we go back to our example of using copies, and also the roles that you would have to manage. So you could see a world where there’s five different geographies where all your data lives with different controls and different business units, with different rules and different regulatory controls you need to think about. This is beyond human comprehension and beyond a human being able to do this manually. So how do you manage it? So the first thing you can do is this technique called ABAC, which is how Immuta happens to work. In the ABAC world, which is Attribute Based Access Control, you define your users, like Steve is six foot two, brown hair, works for Immuta, however you wanna define your users, and then you build rules separately from that definition, and they make decisions on the fly.

ABAC: No Role “Explosion”

Oh, Steve is querying and he’s part of Immuta, he’s not supposed to see this table, I’m gonna block him. RBAC, on the other hand, conflates who the user is with a role and what that role should have access to. So now basically you have to create a role for every different access decision you need to make, which quickly explodes out of control.

So if you go back to this example where, we want everyone to see data older than 90 days, but only insiders can see data younger than 90 days, but they can only see that younger data for their region.

So I went ahead and built a policy in a tool called Apache Ranger, which is an RBAC based policy manager, and in this case, we have to build a rule for every possible combination of region and insider. So we could have an insider in the East region, so we need a where clause for that. We have an insider in the West region, we need a where clause for that. We might have some insiders, and I didn’t build this in the rule, that are both in East and North. And in that case, we’d have to build a separate rule for that. ‘Cause remember we’re conflating who the user is with the policy that they should get. So if we were to account for every single combination, this would be 19 different roles we would need, and this is only for four regions. So the problem is you have to predetermine all these roles up front and build the policy against this, and access is implied. So I don’t know what giving someone insider East really means, because it’s all conflated with the policy. So it’s very hard to comprehend where you should add your users role wise and what that actually means ’cause it’s all implied. Whereas with ABAC, you simply define your user. So Steve is an insider in the East, and this is actually a visualization of building the rule in Immuta, this is the actual tool, but we assign the attributes to the user and then when I query the data from region East, it’s gonna dynamically see that I’m coming from the East region and apply that policy on the fly. So I just have to build a rule once and it will also be future-proof.


So if I add a new region Central and I query the data, it’s actually gonna see Central and apply the rule appropriately. Whereas with RBAC, I would have to remember to create a new rule for that Central role. So there’s a lot of power in ABAC.
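The difference can be sketched in a few lines. This is plain Python, not Apache Ranger or Immuta syntax; the role naming scheme and attribute names are hypothetical, and the RBAC side only covers single-region roles, so multi-region insiders would need still more roles.

```python
def rbac_roles(regions: list[str]) -> dict[str, str]:
    """RBAC style: one pre-built role, with its own WHERE clause, per region."""
    return {f"insider_{r.lower()}": f"region = '{r}'" for r in regions}


def abac_allows(user: dict, row: dict) -> bool:
    """ABAC style: one rule, evaluated against the user's attributes at query time."""
    return bool(user.get("insider")) and row["region"] in user.get("regions", set())


# Roles were generated before the Central region existed.
roles = rbac_roles(["North", "South", "East", "West"])
print("insider_central" in roles)  # False: someone has to remember to add it

# The ABAC rule needs no change; only the user's attributes mention Central.
central_insider = {"name": "steve", "insider": True, "regions": {"Central"}}
print(abac_allows(central_insider, {"region": "Central"}))  # True
```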

Physical vs Logical

Okay, last thing is physical vs logical. So just like ABAC gave you scalability, this gives you scalability as well. So if you are building a rule against a physical table, like I could build the rule for credit card transactions, I would mask the customer last name and credit card number columns. So I’m calling out the actual physical table.

This could be really painful if you have thousands of tables and all of those columns are named differently, right?

But What if You Have Thousands Of Tables?!

This would be a nightmare if you had to build these policies one at a time.

Physical vs Logical

So if you build them at the logical layer instead, you can actually tag your data with person names and credit card numbers. So you’re abstracting the actual physical table and database. Some of these tables could be in Databricks, some of these tables could be in Oracle, wherever your data lives. We don’t care what the column names are, what the table names are, that logical layer abstracts all that and you could build policy against that logical layer. And then you could actually go out and discover all the different places where the sensitive data lives through our machine learning driven classification algorithms, to find where are person names, where are credit card numbers, and auto tag that for you, so you can build your policy this way in a scalable manner. (mouse clicks) Okay, so that’s it for me. I’m gonna pass it over to Greg to actually talk about how he ran into these problems in the real world and applied these techniques to solve them.
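A small sketch of the logical-layer idea: the policy names tags, and a catalog maps tags to whatever the physical columns happen to be called. The catalog contents, tag names, and table names below are hypothetical, and the tags are assumed to have been assigned already (by hand or by a classifier).

```python
# Hypothetical catalog: (source, table) -> {physical column: tags on that column}.
CATALOG: dict[tuple[str, str], dict[str, set[str]]] = {
    ("databricks", "cc_transactions"): {
        "cust_lname": {"person_name"},
        "cc_num": {"credit_card_number"},
        "amount": set(),
    },
    ("oracle", "CUSTOMERS"): {
        "LAST_NM": {"person_name"},
        "CARD_NO": {"credit_card_number"},
        "SIGNUP_DT": set(),
    },
}

# The policy is written once, against tags, never against column names.
MASKED_TAGS = {"person_name", "credit_card_number"}


def columns_to_mask(catalog: dict, masked_tags: set[str]) -> dict[tuple[str, str], list[str]]:
    """Resolve a tag-based policy to the physical columns it covers in each table."""
    return {
        table: sorted(col for col, tags in columns.items() if tags & masked_tags)
        for table, columns in catalog.items()
    }


print(columns_to_mask(CATALOG, MASKED_TAGS))
```

Adding a thousand more tables changes the catalog, not the policy.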

– Thanks Steve. Well, I hope this case study will help bring the message home and make it a little more concrete. I’m Greg Galloway with Artis Consulting and I’m an Azure Architect and Principal. We were able to do a case study of a proof-of-concept with a particularly large Fortune 500 company that I think applies here.

Case Study Background

So this company made a really large bet on Azure and on Delta Lake and particularly on Azure Databricks as really the only way to get data out of the data lake, then make sense of it. And so they came to us with some security requirements. They were both internal policies in addition to industry regulations, such as the CCPA and others. And so Artis Consulting was brought in to perform a proof-of-concept and test out how to accomplish these requirements, both in Databricks and in Immuta.

High-Level Architecture

So turning to the high-level architecture for this proof-of-concept, they had an Azure Data Lake Store that was central, and on the left, there was a data engineering focused workspace that had read-write access to the data lake. In the middle of the screen is a workspace for Databricks that was really focused more on end users, both BI and data science end users, and that workspace and the clusters within it were integrated with Immuta. And so there were a couple of different ways that we tested allowing users to get data out of Databricks in a secure manner, and Immuta was able to seamlessly secure any of those methods. The one on the bottom is through Databricks notebooks. A user can access tables in the Immuta database within Databricks. They could also use Spark drivers in tools like Power BI or alternately, they could go through the Immuta Postgres interface. Immuta pretends to be a Postgres database to make some of the data available, if that’s a preferred approach. We generally recommended users use notebooks or Spark drivers to connect.

Proof of Concept Results

So if we summarize the results of the proof-of-concept, there were a number of requirements or use cases, and those are listed on the rows, seven key ones. And we did tests to show essentially what could be accomplished with plain vanilla Databricks without Immuta, and then we repeated those particular tests with Immuta integrated in order to determine what’s the best security pattern? How do we meet these requirements? How do we make this successful in their organization? So you can see at a quick glance that we were able to do five or so, to some degree with plain vanilla Databricks, but with the Immuta involved, we were able to really well accomplish all of the requirements. So I’ll go through each of those. So first of all, the first one was redaction of older data. And this was easily accomplished with Databricks. Essentially, we were able to create a new copy of the table so that we’re wiping out the Delta table history and, the time travel and we were able to do that with Databricks by creating a second copy of the table and doing some renames.

The next requirement was around masking and this was of course a key requirement. For example, Social Security numbers and credit card numbers and such should not be visible except under very specific situations. Maybe the HR department needs to do a study for a couple of days, so they need time limited access to data unmasked, but everyone else should only be able to see data masked. In Databricks, you were able to do some approximation of this, maybe with the MD5 hash function and views, but it wasn’t a great solution. Immuta was by far the best solution here. It offered a number of different ways to format the mask and it also allowed you to do very, convenient things with global policies where you didn’t end up with a bunch of spaghetti views and spaghetti code like Steve had talked about.
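To illustrate the kind of masking being described, here is a sketch in plain Python rather than Databricks SQL views or Immuta policies. The function names, the `unmask_until` grant table, and the format of the partial mask are hypothetical. Note also that an unsalted hash of a low-entropy value such as a Social Security number can be brute-forced, which may be part of why a hash-plus-views approach was not considered a great solution.

```python
import hashlib
from datetime import date


def mask_md5(value: str) -> str:
    """Hash masking: deterministic, so joins still work, but the value is unreadable."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def mask_keep_last4(value: str) -> str:
    """Formatted masking: hide every digit except the last four characters."""
    head, tail = value[:-4], value[-4:]
    return "".join("X" if ch.isdigit() else ch for ch in head) + tail


def read_sensitive(value: str, user: str, unmask_until: dict[str, date], today: date) -> str:
    """Time-limited exception: unmasked only while the user's grant has not expired."""
    expiry = unmask_until.get(user)
    if expiry is not None and today <= expiry:
        return value
    return mask_md5(value)
```

With a grant such as `{"hr_analyst": date(2020, 6, 3)}`, the analyst sees raw values through June 3 and hashed values afterwards, with no view to create or drop.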

The next requirement was around deleting the detailed data, but keeping aggregate data, and that was easily accomplished with Databricks. The next requirement was around row level security, and this was a key requirement for them. They were very insistent that there was a large complex set of requirements. Each user should only see their region for example, but there were a number of other requirements. And like Steve had talked about, there is a naive approach, an anti-pattern, where you end up with a ton of views in Databricks and that really wasn’t gonna meet the needs of having this dynamic based upon user attributes and user roles. And so, with Immuta the solution was very easy. Global policies made it easy to apply a solution to all tables that had columns tagged in a certain way or had data of a certain type and the policies could be based upon user attributes and so that made it very straightforward to set up with Immuta. The next requirement was around auditability, being able to see who has access to what and also see who accessed what. And so with Databricks, there’s some integration with Azure diagnostic logs and with some metadata queries to see who has access to particular tables, but I would say definitely that Immuta is built to do this. It had a very user friendly UI that would give you deep insight into exactly what people are doing and exactly what people have access to. So that was a superior solution. The next requirement was around manageability. So, easily Immuta is built for this. It’s a single pane of glass that lets you manage policies instead of thousands of views that you have to maintain and test and hope you don’t make a mistake. And then the last key requirement was around performance. So, particularly related to row level security, we did some pretty intensive testing to make sure that applying policies wouldn’t degrade performance.
And the main recommendation we would have is to make sure that you use policies that are based upon user attributes. For their organization in particular, they’re using a feature called the external UserInfo endpoint that would let them populate those policies, and that seemed to be the best-performing way of the ones that we tried in Immuta, and overall they were pleased with the performance. I think one of the key reasons is that Immuta integrates so deeply with Spark that it’s able to push its policies down into Spark and have them be calculated in scalable compute. So overall the customer was happy; it was a very successful POC. Just on a personal note, working with Immuta was one of the best product support experiences I’ve ever had. They were willing to jump in and help and offer suggestions, and I don’t think that the proof of concept would have been successful without Immuta, as it brings a lot to the table.
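The attribute-based approach to row-level security recommended here can be sketched in a few lines of Python. Instead of maintaining one view per region, a single rule compares a column value with the attributes attached to the requesting user. This is a simplified illustration under assumed names (`visible_rows`, the `regions` attribute, the sample rows), not Immuta’s policy engine or its UserInfo endpoint.

```python
def visible_rows(rows, user_attributes, column="region", attribute="regions"):
    """Keep only the rows whose `column` value appears in the user's
    `attribute` list. A user without the attribute sees no rows."""
    allowed = set(user_attributes.get(attribute, []))
    return [row for row in rows if row.get(column) in allowed]


# Hypothetical data and users, for illustration only.
orders = [
    {"order_id": 1, "region": "EMEA"},
    {"order_id": 2, "region": "APAC"},
    {"order_id": 3, "region": "EMEA"},
]
emea_analyst = {"regions": ["EMEA"]}
global_analyst = {"regions": ["EMEA", "APAC"]}
new_hire = {}

emea_orders = visible_rows(orders, emea_analyst)
all_orders = visible_rows(orders, global_analyst)
no_orders = visible_rows(orders, new_hire)
```

Adding a region or a user changes only the attribute data, not the rule, which is the property that avoids the view-per-role anti-pattern.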

– Thanks, Greg, that was a great overview and I think you really brought home my points in a concrete way. So let’s summarize everything that we talked about today, so go on to the next one.

Quick Summary

So again, just going through the points we touched on: the first is choosing the right policy architecture. In the example Greg walked through, of course they needed a lot more than table-level controls, so they decided on this architecture where we separate the policy from the platform, in his case using Immuta, and successfully avoided that anti-pattern of creating hundreds or thousands of views and trying to manage them. And then of course the second big point is that column-level controls need to be a lot more than a binary "you have access or not." If you want to leverage your data effectively, especially in this world of regulatory controls around privacy, and it’s not gonna stop, you’re really gonna need to consider how you manage these privacy-enhancing technologies on your indirect or quasi-identifiers; it’s not just about your direct identifiers anymore. And then of course, the scalability points, which I think Greg really brought home: the use of ABAC to build policies in a scalable way, and then using our logical layer. What he mentioned are these global policies, which are able to reference that logical layer so you can build a rule once and have it propagate across all your tables. This is made possible by that sensitive data discovery feature I mentioned earlier, where Immuta can go in there and, through our classifiers, discover that sensitive data. You could of course tag it yourself as well, or pull that information from external business glossaries, which is also a capability that we offer. (mouse clicks) And then just a few closing thoughts here: if you wanna learn more, you can go visit the to learn more about the integration.
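The "build a rule once and have it propagate" idea can be illustrated with a short Python sketch: a policy is attached to a tag rather than to a table, so any column carrying that tag is transformed the same way wherever it appears. All names here (`GlobalPolicy`, `apply_policies`, the `pii.ssn` tag, the sample tables) are hypothetical and are not part of any Immuta or Databricks API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class GlobalPolicy:
    """A rule keyed by a tag instead of by a specific table or column."""
    tag: str
    transform: Callable[[Any], Any]


def apply_policies(row, column_tags, policies):
    """Return a copy of `row` with each tagged column passed through
    the policy registered for that tag. Untagged columns are unchanged."""
    by_tag = {policy.tag: policy for policy in policies}
    result = {}
    for column, value in row.items():
        for tag in column_tags.get(column, ()):
            policy = by_tag.get(tag)
            if policy is not None:
                value = policy.transform(value)
        result[column] = value
    return result


# One rule, written once (hypothetical tag name).
policies = [GlobalPolicy(tag="pii.ssn", transform=lambda _: "REDACTED")]

# Two different tables whose columns happen to carry the same tag.
employees_tags = {"ssn": ["pii.ssn"]}
claims_tags = {"claimant_ssn": ["pii.ssn"]}

employee = apply_policies(
    {"name": "A. Lee", "ssn": "123-45-6789"}, employees_tags, policies
)
claim = apply_policies(
    {"claim_id": 7, "claimant_ssn": "987-65-4321"}, claims_tags, policies
)
```

In this sketch the tags are supplied by hand; in the talk they come from automated discovery, manual tagging, or an external business glossary.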

Next steps

What I really wanna encourage people to do: we have a free trial of Immuta, so if you go to this website, you can actually enter your information and we will immediately spin up an instance for you. You can goof around with it for 14 days for free. We have instructions on how to configure it for your Databricks cluster. You’ll be off and running, doing everything that we spoke about today. (mouse clicks) And then please, don’t forget to leave feedback on this session, we really appreciate it.

About Steve Touw


Steve Touw is the co-founder and CTO of Immuta. He has a long history of designing large-scale geo-temporal analytics across the U.S. intelligence community - including some of the very first Hadoop analytics and frameworks to manage complex multi-tenant data policy controls. Previously, Steve was the CTO of 42Six Solutions (acquired by Computer Sciences Corporation), where he led a large Big Data services engineering team. Steve holds a Bachelor of Science in Geography from the University of Maryland.

About Greg Galloway

Artis Consulting

Greg Galloway is an Azure data analytics architect and principal with Artis Consulting in Dallas, Texas. He also has been a Microsoft Data Platform MVP for 12 years. Greg has spoken at national events such as PASS Summit and local Dallas events such as MSBIC and SQL Saturdays. Greg blogs at