Identify Sensitive Data and Mitigate Risk in Apache Spark and Databricks

May 28, 2021 10:30 AM (PT)

Data analysts and data scientists use Apache Spark and Databricks as a unified analytics platform, but they need to understand what data is available to use and what data has sensitive information or is restricted by policies and regulations. Leverage BigID’s Data Discovery-in-Depth to uncover sensitive data elements before data scientists and researchers build algorithms on top of the data. Scale discovery and label data for context, so you know all of the data in your Delta Lake and keep up with the speed of data growth. Knowing what exists in your data will help keep the necessary guardrails around your data.


Apply BigID’s discovery platform to:

  1. Know all of the data inside of Spark to select the best data for analysis
  2. Identify sensitive data with relevant policies for compliance and risk mitigation
  3. Add context to data to understand what data scientists are doing

In this session watch:
Dan Rice, Senior Solution Engineer, BigID



Speaker 1: Hello, and welcome to our session. Today, we’re going to speak about how BigID can integrate with and work on top of a Databricks environment in order to help identify sensitive data elements that might exist in the data that sits on Databricks. Right? Some of the key reasons why you might want to do this are just to make sure that there are no sensitive data elements that your data scientists shouldn’t have access to. And you just want to know, right? You just want to be able to scan your data and have an understanding of what exists inside of the specific data repositories that are connected to your Databricks environment. And you can do that with BigID. So BigID is a user interface and a technology stack that allows you to have a good understanding of what exists contextually inside of your data environments.
And Databricks is a data repository that BigID can connect to and certainly understand. BigID’s core capabilities start with data discovery. So, just being able to learn and understand what types of data elements exist across your data landscape: that’s the discovery-in-depth data insight that BigID provides, and it’s one of the four C foundations that BigID has within its framework. The discovery foundations that we have are Classification, which allows us to classify when we find sensitive data elements using machine learning as well as regular expressions, looking at metadata as well as the data itself. We have another technology called Correlation, which learns important identifiers across your data landscape. It teaches itself, using machine learning techniques, to understand where those IDs exist across your data landscape, as well as other learnings on top of what we see in close proximity to the specific identifiers that you teach us about.
We can also look at your data leveraging regular expressions and pattern matching to detect where things are. And we can also Cluster, using machine learning, to show you where like elements exist across your data landscape. On top of this core stack, the framework that we have within BigID, the four C’s, we can then action on top of the data itself. We have a framework that allows for apps to be built with a privacy, protection, and perspective mindset. And these apps are built internally by our own developers, by some of our field resources, by some of our partners, and by the customers themselves. We have SDKs and APIs that allow people to go ahead and build their own business requirements or business software on top of the four C foundation that BigID has inside of our data discovery in depth.
And the last thing is just the ability to unleash your data. Based off of the core findings, and based off of your ability to action on top of your data, we are providing global data access and virtualization to give you a complete understanding, from a privacy standpoint and a protection standpoint, of your data holistically. So you can really start to take action on top of everything that is found inside of the BigID discovery platform. In terms of the data discovery intelligence foundation, I already kind of got into it on my last slide, but it’s knowing your data, knowing exactly what exists in your environment, and understanding your crown jewels. So the left-hand side shows a representation of all the different data sources that BigID can connect to. If we’re connecting to something like an RDBMS, typically it’s a JDBC connection, and for connections to applications like Informix or Salesforce, a lot of times BigID just talks to the native APIs.
And if we’re connecting to things like SMB, Samba, or any kind of NFS-type shares, we’re also connecting using the native protocols that are available with those different data repositories. On top of that, once BigID has the connection information to these different data repositories, and today there are over 130 data sources that we can connect to, and as you see here, Databricks is one of them, we can then classify your data. We represent the classification of all your data on top of the catalog. We do the cluster analysis and the correlation, as I mentioned before, right? So we’re using several techniques to help understand what you have on top of those data repositories so that you can start taking action on it. You can adhere to different policies. You can start leveraging the applications that we’ve already pre-built, or building your own.
And then when you look at the foundation, this is the layer where we provide our data discovery in depth. It’s just important to note it can be deployed in the cloud, on-prem, or a hybrid combination. It can also work with data in motion, like your Kafkas or your Kinesis streams. We have the applications that sit on top of it, and we have applications already pre-built that you can certainly use today, such as the privacy apps. Maybe you just want a PI inventory, or a sensitive data element inventory. You can look at DSR fulfillment, so you can execute DSR requests on top of our foundation. You can look at access intelligence, so you can have an understanding of who has access to what data elements across your landscape. We have remediation in there for protection as well. We have the ability to label data, and we can incorporate other third-party vendors for that in addition to our own labeling.
And we have data perspective apps, like your retention management, and that links with your remediation. So if you have retention policies where you have to maintain data for seven years, and you have a remediation app that needs to purge data, those two go hand in glove, so you know, “Hey, I’m trying to remediate something that is also part of a retention policy,” right? And we also have data lineage types of applications as well, that help with the data landscape: where things came from, where they’re going, and additional things in terms of that. And then the joint value between the two solutions, BigID and Databricks, is just the fact that customers can now have a complete understanding of everything that is on top of their data landscape from Databricks, anything that lives maybe in Spark on top of Hadoop, any technology within RDBMSes, file shares, and just be able to say, “Hey, there are sensitive data elements inside of this repository.”
Now, who should have access to those? We can even start showing insights into what a group of users has access to, and just knowing your data, right? And once you know your data, you can start putting governance around it to make sure that you’re putting the right security controls on top of that sensitive data, so your data scientists can really start to use it appropriately. Right? And that’s really the glue between what BigID provides on top of the Databricks stack itself. So, let’s jump right into a demonstration here. I just want to kind of give you a lay of the land. I have a Databricks environment here. Inside of my Databricks environment, I have a cluster, this test cluster. Inside of my test cluster I’ve got some data here. Inside this default database you can see a list of all these different tables that exist currently. Some of these tables have sensitive data. Some of them do not. It’s not very large data, but it’s enough to show exactly the power of what BigID can provide.
So this is the Databricks environment. And now let’s go ahead and look at the BigID environment. When I first log into BigID, I’m prompted with a username and password. This username and password can easily be incorporated with LDAP or AD, for your user credentials to be stored and managed by a central management system that you folks have. It can be incorporated with SAML and even SSO. Right now, I only have a single data source here. I can certainly configure a bunch of them, and I’ll show you how to do that. And this provides me a map of all the things that we are showing inside this environment, again, just a single data source, and it shows your attributes found. These attributes would be classifiers or classifications that we enabled within BigID. So when we do start to expose, or find context inside, the data sources that we’re scanning, they’ll be flagged here, right?
And if I configure any policies like GDPR or CCPA, they’d show here. And if I configure correlation sets, they’d show here too. Now, correlation sets are a mechanism for you to teach BigID about identifiers that are important for your business and important for BigID to know. Think of an identifier like a customer ID, an account ID, some kind of GUID, or something that represents something like a product SKU. Once you teach us about this, we help show you where those identifiers exist across your landscape, and any relation that identifier has with other things that we see contextually around it, and you can see the objects of the findings. So I’m going to go ahead and jump straight into how to configure a data repository, focusing completely on Databricks. I already have one here, but I want to go ahead and create another one. So here are all the different data sources that we can connect to. There’s a lot of them. And you’ll notice Spark down here, but we also have one specifically for Databricks, and now I can select this.
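To make the correlation-set idea concrete, here is a minimal pure-Python sketch of the underlying concept: given a set of identifiers you've taught the system about, estimate how confidently columns in other tables match it by value overlap. The table names, sample values, and scoring are all illustrative assumptions, not BigID's actual algorithm.

```python
# Sketch of a correlation set: score how strongly each column in other
# tables matches a taught identifier (e.g. customer IDs) by value overlap.
# Hypothetical data and scoring -- not BigID's real implementation.

def correlation_confidence(known_ids, candidate_values):
    """Fraction of a candidate column's values that appear in the known ID set."""
    if not candidate_values:
        return 0.0
    known = set(known_ids)
    hits = sum(1 for v in candidate_values if v in known)
    return hits / len(candidate_values)

# Hypothetical tables, keyed by (table, column) -> sampled values.
tables = {
    ("transactions", "cust_id"): ["C001", "C002", "C003", "C999"],
    ("notes", "comment_ref"): ["foo", "C001", "bar", "baz"],
}

# The identifier set you taught the tool about.
customer_ids = ["C001", "C002", "C003", "C004"]

scores = {
    key: correlation_confidence(customer_ids, values)
    for key, values in tables.items()
}
```

A real system would sample values rather than compare full columns, and would fold in column-name and type signals, but the overlap score is the core intuition behind the confidence levels shown later in the demo.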
We’ll go ahead and give it a data source name, and I’ll say, “Yes, enabled.” And I’m going to give it the location for the Databricks environment. Let’s go ahead and give it information pertaining to the connection details and the username and password. Okay. So, from here, I’ll go ahead and move on to the next tab. And if I want, I can associate a specific business owner and an IT owner. This helps me understand who manages the specific data repository, from a business-user standpoint as well as from an IT standpoint. I can provide location information if I want to, and additional details. And then I can say if I want to do sample scans, if I want to do full scans, if I want to enable classifiers, enable enrichment. So classifiers will say, “Yes, I want to go ahead and classify data as I scan it.” And I’ll go ahead and save this at this point. And let’s go ahead now and test that connection and see if we can establish a connection to this Databricks environment.
Great. Once it tests the connection successfully, you can see a list of some of the tables that show up here, right? ID list, Parquet, different types, customer two, Oracle CSV. So if I look back at my Databricks environment and my clusters, yep, here’s the Oracle CSV. So we see that we are connected to the right environment, and the test connection worked great. We can go ahead and move on to the next step. So, within BigID, here are the four C’s that we were talking about before: the catalog, classification, cluster analysis, and correlation. When we start scanning your environments, we provide all the findings inside of the catalog so you can easily hone in on specifically what we find, right? So you can do quick filters for “contains PI,” which would be sensitive data, or “has duplicates.” When we look at unstructured repositories, as well as structured repositories, we can start showing you that there are duplicate objects across the data landscape.
Maybe you have duplicate tables that were part of an ETL job that came from Oracle and got copied up into Databricks, or something to that extent, and the “has duplicates” filter will help you identify that. I can hone in specifically and filter based on the data sources. I can also filter based off of attributes, right? And these attributes are basically the classifiers. You can see classifier.publicIP here. So, it looks like if I click on this, it’ll filter down to a specific data table inside of this Databricks environment that has the public IPv4 classifier identified in it. So, for management of the classifications, we have this gear here that can help us toggle on different classifications. Now, as I was mentioning before, there are a few different flavors for us within BigID to scan your data environments.
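One common way a discovery tool can power a “has duplicates” filter like the one described above is content hashing. A minimal sketch, with hypothetical object names standing in for an Oracle table copied into the lake (this is an illustration of the technique, not BigID's implementation):

```python
import hashlib
from collections import defaultdict

def find_duplicates(objects):
    """Group object names by a SHA-256 hash of their content; any group
    with more than one member is a set of exact duplicates."""
    by_hash = defaultdict(list)
    for name, content in objects.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        by_hash[digest].append(name)
    return [names for names in by_hash.values() if len(names) > 1]

# Hypothetical objects: an Oracle export that was copied into the lake.
objects = {
    "oracle/customers.csv": "id,email\n1,a@x.com\n",
    "delta/customers_copy.csv": "id,email\n1,a@x.com\n",
    "delta/events.csv": "id,ts\n1,2021-05-28\n",
}
dupes = find_duplicates(objects)
```

Near-duplicates (the “closely related” files shown later in cluster analysis) need fuzzier techniques such as shingling or similarity hashing; exact content hashing only catches byte-identical copies.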
We can leverage techniques such as machine learning, and we can leverage regular expressions. These top 15 or so, before this first line break, are all supervised machine learning classifiers for understanding different document types. We’ve already pre-taught our machine learning algorithms what content makes up a boarding pass, what makes up a criminal record check or a CV check, and the machine learns based on us teaching it. We can then apply that learning across the data landscape for you. What’s great is you can also train BigID on the document types and document elements that are important to your business, right? So just by having BigID look at a sample of, say, 25 to 50 documents of a type that’s important for you folks for BigID to learn, we can teach BigID this, and then we can start applying that learning across your data landscape so we can classify a document as being of this type, right?
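As a toy illustration of the supervised document-classification idea, here is a nearest-centroid bag-of-words classifier in pure Python. The two-document training sets, class labels, and scoring stand in for the 25 to 50 sample documents mentioned above; this is a sketch of the general technique, not the model BigID actually uses.

```python
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a document."""
    return Counter(text.lower().split())

def centroid(docs):
    """Average the word counts of a class's sample documents."""
    total = Counter()
    for d in docs:
        total += vectorize(d)
    return {w: c / len(docs) for w, c in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def classify(text, centroids):
    """Assign the label whose centroid is most similar to the document."""
    scores = {label: cosine(vectorize(text), c) for label, c in centroids.items()}
    return max(scores, key=scores.get)

# Tiny hypothetical training set standing in for the sample documents.
training = {
    "boarding_pass": ["flight gate seat boarding departure",
                      "gate seat flight boarding passenger"],
    "criminal_record_check": ["police record check conviction court",
                              "court conviction record police check"],
}
centroids = {label: centroid(docs) for label, docs in training.items()}
```

A production classifier would use far richer features and a proper learning algorithm, but the shape is the same: learn from a handful of labeled examples, then apply that learning across the landscape.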
And these are the ones that are pre-taught for you out of the box. We also have named entity recognition to help identify country, city, full name, and phone. This uses an unsupervised machine learning approach to detecting phones and names and countries, right? Which is super difficult to do. If you’re trying to do that with a regular expression, it’s just better to use a machine learning model for that kind of identification. We also have, on the last line right here, all the different regular expression pattern matching that you can have. Inside of our regular expression pattern matching, it’s not just a pattern match; we can also do things like checksum validation. We can also have a support term regular expression. So if we match this pattern of a credit card, and it’s within this min length and max length, and it passes checksum validation with the Luhn check, we can also have a support term, where in essence it says, “All right, I found this specific pattern. It looks like it matches that of a credit card, and it passed the Luhn check.”
We can also have it look for another supporting regular expression. So, for example, for a credit card number, I might have a CVV, or I might have an expiration date. So I can look for a support term with proximity: how many characters before and after this match would I expect to see the expiration date, as an example, right? So it helps with the false positives you’d get with just a traditional regular expression. So these are all the classifiers, and you can see the ones that I have enabled in this environment. So, now, from here, I can actually go ahead and execute a scan. So inside of the scan profile, I can say, “All right, I have my data source. I created it, I made the connection to Databricks. I can now create a scan profile for the Databricks environment.”
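The pattern-plus-Luhn-plus-support-term logic described above can be sketched in a few lines of Python. The patterns, proximity window, and support terms here are deliberately simplified assumptions for illustration; they are not BigID's actual classifier definitions.

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn checksum validation for a candidate card number."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

CARD = re.compile(r"\b\d{13,16}\b")              # naive min/max-length pattern
SUPPORT = re.compile(r"cvv|\bexp\b|expir", re.IGNORECASE)  # support terms

def find_card_numbers(text, proximity=40):
    """Return pattern matches that pass the Luhn check AND have a support
    term within `proximity` characters before or after the match."""
    hits = []
    for m in CARD.finditer(text):
        window = text[max(0, m.start() - proximity): m.end() + proximity]
        if luhn_valid(m.group()) and SUPPORT.search(window):
            hits.append(m.group())
    return hits
```

The second example below fails both checks: the 16-digit order reference does not pass the Luhn checksum, and there is no support term nearby, which is exactly the kind of false positive a bare regular expression would have flagged.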
And from here, I can specify the scan type, and there are different categories of scanning. I can do just a linear scan, a data source scan, or a metadata scan, which just looks at your higher-level table names and your file names. I can do a hyper scan, which does quick scanning using a combination of a percentage of the data plus metadata to help you identify things. So with an unstructured repository, say if you have petabyte-scale data, it can take a while to scan. Hyper scan reduces that down drastically to make it more palatable, so that it gives you a good understanding of what exists inside of it super quick. And then from there, you can make a determination of how much deeper a scan you want to do. You have a schedule, so we can execute the scan automatically for you on a regular basis.
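The hyper-scan idea, sampling a percentage of the data to get a quick read before committing to a deep scan, can be sketched like this. The function name, sampling rate, and classifier callback are illustrative assumptions, not BigID's API.

```python
import random

def hyper_scan(rows, classify, sample_pct=0.05, seed=42):
    """Quick read on a large object: classify only a sampled percentage of
    rows (metadata would be checked separately) and report whether any
    sampled row looks sensitive."""
    rng = random.Random(seed)           # fixed seed for repeatable sampling
    k = max(1, int(len(rows) * sample_pct))
    sample = rng.sample(rows, k)
    return any(classify(r) for r in sample)
```

The trade-off is the one the speaker describes: sampling can miss rare sensitive values, so a positive hyper-scan result is a cue to schedule a deeper full scan, not a replacement for one.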
And then I can specify correlation sets. I haven’t shown you correlation sets in the product yet, but a correlation set, like I was mentioning earlier during the PowerPoint slides, is where you specify within BigID the identifiers that are important to your business. Then we learn about those IDs and we understand where they exist across your data landscape. And then from here, I can just say, “Let’s go ahead.” I’m not going to select any correlation sets here because we don’t have any defined. I’m going to select this data source that we created, which was the Databricks data source. I’m going to go ahead and save this, and now I can go ahead and run. All right, and so, once I run, this is going to start queuing up, then it’s going to start running, and you’ll see it proceed. In the interest of time, this should only take a few minutes, but I’d rather just jump straight to the catalog. And now we can show you what we’ve uncovered in the data that sits inside of this Databricks environment.
So, for this Databricks data source, we have here all the different objects that we scanned within this environment. I can hone in specifically on what contains PI. Now, I can actually click on this specific table, this events table that you saw right here, and say, “Okay, it looks like this does have sensitive data elements,” as I filter things out. As I start clicking, you’ll notice I start building this query string on the right-hand side. Everything that I show you within the UI can certainly be surfaced via REST APIs, et cetera. I could add tags. I can actually pull tags in if this had tags created via another solution, like MIP or another metadata labeling solution that’s available. It gives you details like when it was last scanned and when it was created. I can look at the attributes here, the classifiers that have been identified in this specific repository.
I can now click on this, and I can get the attribute value, which will give me the sampling, assuming that I have the proper role-based access control within BigID, to show me, “Oh, this is why it was cataloged, or classified, as a US Social Security number,” because it did find this specific element inside this table, in this specific column. I can actually look at the column names, click on the column names, and get profiling information for them, which is always good for your data scientists to understand: well, what is the makeup of this specific table?
What kind of profiling is inside this table itself, right? All right, and you can see the classifiers that have been identified for this specific column, right? So, again, just a fantastic way for BigID to be used to help discover what is inside this data before you start having your data scientists build machine learning models on top of it, right? Knowing what’s there first, having profiling information, having quality information, knowing that maybe some users or data scientists should not have access to different elements that make up this specific data repository. Maybe a data element should be masked, and it’s not; it’s in clear text. So all of this information is pertinent to know before your data scientists actually start to use it. And it can also be used by your data scientists to understand the profiling information of what makes up the actual data itself.
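The column profiling mentioned above, the basic statistics a data scientist wants before modeling, can be sketched minimally like this (the exact statistics any given catalog exposes will differ; these are illustrative):

```python
def profile_column(values):
    """Basic profiling stats for one column: row count, null count,
    distinct count, and min/max over the non-null values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }
```

In a Spark environment the same information typically comes from `df.describe()` or aggregate queries pushed down to the cluster rather than pulling values client-side.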
From here, we have other things as well, like cluster analysis and correlation. I’m actually going to go to another environment to show you folks more insights into what BigID can also do, one that has just more information inside of our demo environment. So, if I go over to our cluster analysis, you can see inside of here that what we basically do is, when we scan your unstructured data repositories, we can start to build clusters of like information, contextual information, right? So, each one of these circles represents a cluster, and it’s all based off of distance: the further away these clusters are, the more unrelated they are; the closer they are, the more related they are. And then if I click inside of one of these clusters, it tells me the top keywords that make up the cluster. Think of something like a k-means clustering model.
I can get the list of attributes that have been identified in this specific cluster. I can see the objects per data source, and the number of duplicate files inside of this cluster as well. I can even click on the objects: I can see that there are 30 objects represented inside of this cluster. Some of them have open access, meaning everyone has full control, and you can even define what open access really means in your world. It doesn’t always have to be everyone with full control; it could just be that these group members have access to the specific data, right? And you can see where they exist, from Office 365 and OneDrive to NFS shares to SharePoint Online. I can take this and export them. And very similar to what we saw before with the catalog, you can get information this way. I can look at the attributes. I can click on this classifier, country, and get the attribute value. It’s going to fetch the values that made this attribute, or this classifier, get triggered. And you can see that it looks like it found Ireland here, right? I can preview the data as well.
So, this actually takes a connection back to the source. BigID doesn’t store the data; BigID establishes, or re-establishes, a connection to the source where this file exists, pulls it into a caching layer, and then represents what it found. And that’s why it’ll highlight it. And I can hover over it, and it’ll tell me this matched the country, this matched the street number, this matched an email address. So, we have related information this way. And now I can click on duplicates and show you all the duplicate files that have been found that are closely related to this file, even if they’re not exactly the same. So, that’s what cluster analysis does. And then on top of this, we also have correlation. And correlation is able to go even a step further on the structured side of the house. If I select an attribute like email, you can see on the right-hand side that it looks like email was found across 17 different tables.
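To make the cluster-analysis idea concrete, here is a toy greedy clustering of documents by keyword similarity. The document names, keyword lists, Jaccard measure, and threshold are all illustrative assumptions; a real system would use proper vector embeddings and an algorithm like k-means.

```python
def jaccard(a, b):
    """Similarity between two documents' keyword sets (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.3):
    """Greedy clustering: add each doc to the first cluster whose seed
    document it is similar enough to, else start a new cluster."""
    clusters = []
    for name, words in docs.items():
        for c in clusters:
            if jaccard(words, docs[c[0]]) >= threshold:
                c.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical documents with their extracted top keywords.
docs = {
    "invoice_1.txt": ["invoice", "total", "amount", "due"],
    "invoice_2.txt": ["invoice", "amount", "tax", "due"],
    "resume_1.txt": ["experience", "education", "skills"],
}
groups = cluster(docs)
```

The distance intuition from the demo carries over directly: documents with high keyword overlap land in the same circle, and unrelated ones (the résumé here) end up far away in their own cluster.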
So, if I expand out these tables, I can select them, and it starts to build you a map of how things are interrelated to one another. So based off this email attribute that we selected, there’s a 91% probability, or confidence level, that this email attribute matches this email_address in the transactions table, and a 97% probability that it matches this email in the security table. And it doesn’t always have to be a dedicated email column. See this one right here? This is a notes table, and it’s finding email inside the comments field of the notes table. Also, an additional interesting thing to note here: you’ll notice that it also pulled in this transaction table from the linkage database, based off this transaction number. So what it’s basically saying is that every single time an email address is identified, there’s a transaction number identified in close proximity, and it’s related to this transactions table. So we’ve just enriched some of the findings with something else that we found, that you might not have known about before, by showing you this graph of how things are interrelated.
So, it’s a fantastic way to understand where things like email addresses, customer IDs, and product SKUs exist across your data landscape, where you might not have known they existed before, as well as finding relationships that you did not know about, via that linkage to other elements such as that transaction number. So, those are basically the four C’s, the foundation of discovery in depth. And then on top of that, we have applications, right? Your applications cut across your data privacy, your data protection, and your data perspective, right? So, like I was mentioning before, you’ve got data rights fulfillment, that’s the DSR requests. You have things such as access intelligence, you have remediation, and you have data retention. So, BigID, to wrap up, in summation, is a fantastic mechanism for discovering all your data elements across your data landscapes, whether it’s on top of Databricks, NFS, or SMB, or whether it’s your applications like Salesforce or SAP. BigID helps with the understanding of everything that is across your data landscape. It then helps to showcase BigID’s findings with the catalog.
And then you can start building your applications, or using applications that we’ve pre-built, for your business needs, whether it’s DSR fulfillment, access intelligence, remediation, retention, et cetera. So I hope this session was informative for you. Databricks has been a great sponsor, and Databricks has just been a great technology to work with from our standpoint, especially on the machine learning front. And we hope to meet with you in the future to do a deeper dive on any one of these areas that are of interest to you. Thank you very much, and take care.

Dan Rice

Senior Solution Engineer at BigID with expertise in Relational Database Systems, DevOps, Engineering, Spark, and Big Data - Hadoop. Dan has a passion for solutions surrounding data governance, data se...