In this talk, we’ll compare different data privacy techniques for protecting personally identifiable information, and their effects on statistical usefulness, re-identification risk, data schema, format preservation, and read and write performance.
We’ll cover both offensive and defensive techniques. You’ll learn what k-anonymity and quasi-identifiers are. You’ll discover the world of suppression, perturbation, obfuscation, encryption, tokenization, and watermarking, with elementary code examples for the case when no third-party products can be used. We’ll see which approaches can be adopted to minimize the risk of data exfiltration.
Some of the abovementioned techniques are barely an inconvenience to implement, but difficult to support in the long run. We’ll show in which situations Databricks Delta can help make your datasets privacy-ready.
Speaker: Serge Smertin
Hello everyone. I’m going to give a talk on data privacy with Apache Spark: defensive and offensive approaches.
I am Serge Smertin, Resident Solutions Architect at Databricks. I’ve spent the past 13 years working on all stages of the data life cycle, and I’ve been working with Apache Spark for around six years now.
I’ve built data science platforms from scratch, tracked cybercriminals through massive-scale data forensics, and created anti-PII analysis measures for the payments industry.
My full-time job right now is to help bring customers to the next level in their AI journey.
What I expect from viewers of this talk:
This presentation is about tools for data privacy. An Artificial Intelligence journey has many components; in this talk, we’re going to focus on data privacy.
We’ll give some hints on using open-source intelligence, which means looking at all available information from all available sources, including those open on the internet. Please look at the link on the slide to see what kind of information is just lying around online.
One of the typical offensive techniques, which is also known as day-to-day data science, is performing linkage attacks. What you see on this slide is a botnet that appears to be operated by two different criminal groups, but is in fact the same organization in the center, running all the command-and-control servers. Threat intelligence analysts look at graphs like these every day.
Later, in the tokenization section, we’ll show an example of sequence attacks, and in data rollups, we’ll touch on homogeneity attacks.
What is personally identifiable information? Emails, names, addresses, ZIP codes, IPs, account numbers, driver’s licenses, you name it! It depends on the nature of your business. Some of them fall under stricter regulations, some don’t. Your mileage may vary. Talk to your GDPR or CCPA trusted advisor on the matter and make a plan of action.
But anyway, you need to be aware of all of the anonymization techniques, all of which you may be able to build yourself.
We have plenty of techniques, but they can all be divided into two categories: pseudonymization and anonymization.
Pseudonymization works on the record level and is mostly used for machine learning. According to GDPR, a pseudonym is still considered personal data: given the right access, processes, and tools, you can re-identify a pseudonym back to the original value. This is quite complicated to do in practice, by the way; you’d buy third-party vendor products if you want accessible user interfaces for re-identification.
On the left side of the screen, you see typical examples of pseudonyms: just gibberish letters.
On the other hand, we have anonymization, where the scope includes entire tables, databases, or even enterprise data catalogs. This is usually for less restricted business intelligence aggregate tables. Usually, there are more people in BI than there are data scientists, and they don’t need record-level access, so they work with anonymized data. Data scientists cannot work with fully anonymized data because it loses its value. Though in some companies, they have to. It depends. Frequently, a combination of more than one pseudonymization and anonymization technique is used.
Let’s dive deeper.
The first one is encryption, for when someone thinks that Amazon S3 or Azure ADLS encryption at rest is not enough.
Here’s the example with AWS Key Management Service encryption.
Here, you see a simple Terraform configuration that creates an instance profile, registers it with Databricks, creates the AWS KMS key, and grants the GenerateDataKey and Decrypt permissions to the role that the instance profile is tied to. It then creates a cluster configured to use that specific EC2 instance profile, so the cluster has access to the key that encrypts or decrypts the data, whose ID you pass as a Spark conf. It also installs the PyPI library for the AWS Encryption SDK. In practice, to encrypt or decrypt the data, you simply call the encrypt and decrypt functions from the previous slide. It’s effortless.
As you can see, the original data point is small; the encrypted data is 50–60 times larger than the fake name of this non-real person. Consider whether you need that overhead if you have data at scale. Data architects watching this can now make an informed decision 🙂
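To make the encrypt/decrypt round trip and the size overhead concrete, here is a minimal sketch using only the standard library. This is a toy stream cipher built from SHA-256, not the AWS Encryption SDK and not a production-grade scheme; it only illustrates the shape of the two functions the slides call and why ciphertext is larger than the original value (random nonce plus Base64 expansion).

```python
import base64
import hashlib
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Derive a pseudo-random keystream by hashing key + nonce + a block counter.
    blocks, counter = [], 0
    while sum(len(b) for b in blocks) < length:
        blocks.append(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return b"".join(blocks)[:length]


def encrypt(key: bytes, plaintext: str) -> str:
    nonce = os.urandom(16)  # fresh nonce per value, prepended to the ciphertext
    data = plaintext.encode("utf-8")
    ct = bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))
    return base64.b64encode(nonce + ct).decode("ascii")


def decrypt(key: bytes, token: str) -> str:
    raw = base64.b64decode(token)
    nonce, ct = raw[:16], raw[16:]
    pt = bytes(a ^ b for a, b in zip(ct, _keystream(key, nonce, len(ct))))
    return pt.decode("utf-8")


key = os.urandom(32)
token = encrypt(key, "John Doe")
```

In a real deployment you would replace the body of these functions with calls into the AWS Encryption SDK backed by the KMS key from the Terraform configuration; the call sites stay the same.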
The other technique is hashing: apply SHA-512 to everything. Yeah. Just hash all sensitive data that way.
To make cracking a bit harder, what you generally do is create a salted hash: you prepend or append some random data to the original value (or both), and then apply the hashing algorithm on top of it. That’s all.
My advice would also be to encode the hash with Base64 instead of hex encoding: you’ll save approximately 33% of the space, making it easier to store the data and do a GROUP BY. It’s also easier to pronounce a Base64 hash than a full-length hex SHA-256. You’ll see it copy-pasted a lot in practice.
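A quick sketch of both points: salting before hashing, and the size difference between hex and Base64 encodings of the same SHA-256 digest (the email address and salt length here are illustrative).

```python
import base64
import hashlib
import os


def salted_hash_b64(value: str, salt: bytes) -> str:
    # Prepend the salt to the value, hash, then Base64-encode the raw digest.
    digest = hashlib.sha256(salt + value.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


salt = os.urandom(16)
hex_form = hashlib.sha256(salt + b"john.doe@example.com").hexdigest()
b64_form = salted_hash_b64("john.doe@example.com", salt)
# hex is 64 characters, Base64 is 44: roughly a third shorter.
```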
Databricks has lovely functionality that replaces each “secret,” like a hash salt, with [REDACTED], essentially anonymizing it. It’s only meant to protect against incidental printing. Just in case 🙂
But anyway, don’t try to invent a new salting technique; it’s probably already cracked by hashcat. The background of this slide shows the algorithms that hashcat can crack.
A common case is freeform data: text that might contain an email or a personal name.
On the other hand, we can make offensive use of a names dataset, in what’s called a combinator attack. Essentially, you take all the first names and all the last names, join them together, and end up with 16 billion combinations, almost a terabyte, from which you can build a reverse lookup table: same data, different use. You are applying background knowledge.
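A miniature version of the combinator attack, with tiny stand-in name lists (a real attack would load full first-name and last-name datasets): precompute the hash of every combination, and any unsalted hash of a name found in a leaked dataset becomes a dictionary lookup.

```python
import hashlib
import itertools

# Tiny illustrative lists; real name datasets have tens of thousands of entries each.
first_names = ["alice", "bob", "carol"]
last_names = ["smith", "jones"]

# The reverse lookup table: hash -> original name, for every combination.
reverse_lookup = {
    hashlib.sha256(f"{f} {l}".encode("utf-8")).hexdigest(): f"{f} {l}"
    for f, l in itertools.product(first_names, last_names)
}

# An unsalted hash observed in a leaked dataset is reversed by a single lookup.
leaked = hashlib.sha256(b"bob smith").hexdigest()
recovered = reverse_lookup.get(leaked)
```

This is exactly why the salting step from the previous section matters: a per-dataset random salt makes a precomputed table like this useless.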
A type of data that is considered extremely sensitive is credit card numbers. Here we have four rows.
Tokenization has very high accuracy.
Usually, with a tokenization architecture, you put the highest level of access control on the anonymization layer, which hosts the token vault database. The token vault maps sensitive values to their numeric representations; all of the sensitive data lives there. That makes it more manageable and compliant. You’ll see why.
How to implement it: let’s look step by step at how to make tokenization work.
Well, look at the order: we make a randomly sorted window and get our row number within that window.
You need to store the token vault somewhere, which brings Databricks Delta into play: it is one of the options for doing that.
It’s easier to handle GDPR or CCPA with vault-based systems: you have a single place to delete sensitive data. Without a vault, you have to perform data erasure operations on probably thousands of tables.
One of the typical usages I’m seeing is creating a randomly generated dataset for external data scientists. The hardest part is making the data look real and respect real-world skews and distributions.
Another thing is perturbation, or noise.
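The idea behind perturbation, sketched minimally (the noise scale and seed are illustrative): add zero-mean noise so individual values are masked while aggregates such as the mean stay approximately correct.

```python
import random


def perturb(values, scale=2.0, seed=7):
    # Zero-mean Gaussian noise distorts each individual value, but averages
    # over many rows converge back to the true statistic.
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]


ages = [23, 35, 41, 52, 29, 47] * 50
noisy = perturb(ages)
true_mean = sum(ages) / len(ages)
noisy_mean = sum(noisy) / len(noisy)
```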
The easiest thing to do with anonymization is suppression (not displaying values), by making a data rollup.
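Suppression connects directly to the k-anonymity mentioned in the abstract: drop any row whose quasi-identifier combination occurs fewer than k times, so every surviving record is indistinguishable from at least k−1 others. A minimal sketch (column names and k are illustrative):

```python
from collections import Counter


def suppress_below_k(rows, quasi_identifiers, k=5):
    # Count each quasi-identifier combination, then keep only rows whose
    # combination appears at least k times: the result is k-anonymous
    # with respect to those columns.
    key = lambda row: tuple(row[c] for c in quasi_identifiers)
    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] >= k]


rows = (
    [{"zip": "10001", "age_band": "30-40"}] * 5
    + [{"zip": "99999", "age_band": "80-90"}]  # unique combination: suppressed
)
safe = suppress_below_k(rows, ["zip", "age_band"], k=5)
```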
The other thing is generalization, where, let’s say, there are different genres of music, for example.
Take approximate percentiles of the data, say three cut points, and split employees into new, seasoned, senior, and early-career buckets.
IP addresses need special truncation rules.
You can apply rounding to data.
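The last few generalization steps can be sketched together; the bucket boundaries and rounding step here are illustrative, not the ones from the slides.

```python
def truncate_ip(ip: str) -> str:
    # Zero out the last octet so the address identifies a subnet, not a host.
    return ".".join(ip.split(".")[:3] + ["0"])


def tenure_bucket(years: float) -> str:
    # Generalize exact tenure into coarse buckets (illustrative boundaries,
    # in practice derived from approximate percentiles of the data).
    if years < 2:
        return "early"
    if years < 6:
        return "seasoned"
    return "senior"


def round_to(value: float, step: float = 5.0) -> float:
    # Round to the nearest step to blur exact amounts.
    return round(value / step) * step
```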
Only in Databricks can you protect data at the column, row, and table level without requiring any external product, though you’d need to use High Concurrency clusters for that.
Watermarking is a fun case.
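One common watermarking pattern, sketched here under assumptions (the field names and the canary construction are hypothetical, not from the talk): insert a synthetic record unique to each recipient, so that if the dataset leaks, the canary row identifies whose copy leaked.

```python
import hashlib


def add_canary(rows, recipient: str):
    # Derive a synthetic "canary" record from the recipient's name and append
    # it to their copy of the dataset; field names are illustrative.
    tag = hashlib.sha256(recipient.encode("utf-8")).hexdigest()[:8]
    canary = {"name": f"user-{tag}", "email": f"user-{tag}@example.com"}
    return rows + [canary]


def identify_leak(leaked_rows, recipients):
    # Match a canary found in leaked data back to the recipient it was made for.
    names = {row["name"] for row in leaked_rows}
    for recipient in recipients:
        tag = hashlib.sha256(recipient.encode("utf-8")).hexdigest()[:8]
        if f"user-{tag}" in names:
            return recipient
    return None
```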
The other things relate to external controls, where you need to track what different data scientists are doing with the data.
Auditing requires significant infrastructure investment.
Another option is Remote Desktop.
If you want to prevent people from taking screenshots, you can require them to work from a physical desktop through a remote Windows desktop session.
Thank you all for a very productive session. This was Serge Smertin of Databricks.
Serge Smertin is a Resident Solutions Architect at Databricks. In his over 14-year career, he’s been dealing with data solutions, cybersecurity, and heterogeneous system integration. His track record includes taking novel ideas from the whiteboard to operating them in production for years, such as large-scale malware forensic analysis for cyber-threat intelligence, and a real-time data science platform as the basis for anomaly detection and decision support systems at an industry-leading payments service provider. At Databricks, Serge’s full-time job is to bring its strategic customers to the next level in their Data and AI journey. On rare occasions, when spare time is left, he leads the Databricks integration with HashiCorp Terraform, the de facto standard for multi-cloud Infrastructure-as-Code, to accelerate Databricks adoption across more customers. To share knowledge, from time to time, Serge writes blogs and speaks at conferences internationally.