Data Privacy with Apache Spark: Defensive and Offensive Approaches

In this talk, we’ll compare different techniques for data privacy and the protection of personally identifiable information, and their effects on statistical usefulness, re-identification risk, data schema, format preservation, and read and write performance.

We’ll cover different offense and defense techniques. You’ll learn what k-anonymity and quasi-identifiers are. Think of discovering the world of suppression, perturbation, obfuscation, encryption, tokenization, and watermarking with elementary code examples, in case third-party products cannot be used. We’ll see what approaches might be adopted to minimize the risks of data exfiltration.

Some of the abovementioned techniques are barely an inconvenience to implement, but difficult to support in the long run. We’ll show on which occasions Databricks Delta can help make your datasets privacy-ready.

Speaker: Serge Smertin


Hello everyone. I’m going to give a talk on data privacy with Apache Spark, covering defensive and offensive approaches.


I am Serge Smertin, Resident Solutions Architect at Databricks. For the past 13 years, I’ve been working on all stages of the data life cycle, and I’ve been working with Apache Spark for around six years now.

I have built data science platforms from scratch, tracked cybercriminals through massive-scale data forensics, and created anti-PII analysis measures for the payments industry.

My full-time job right now is to help bring Databricks customers to the next level in their AI journey.


What I expect from viewers of this talk:

  • Most likely, you have a hands-on background in data engineering and Apache Spark, information security, and a bit of cloud infrastructure.
  • Virtually a bit of everything.
  • You want to (or your bosses have already asked you to) limit the data to maintain privacy and comply with regulations like GDPR and CCPA. If you don’t know about them, I’ll briefly touch on those.
  • You are genuinely curious about how to do that.


This presentation is about tools for data privacy. When it comes to the Artificial Intelligence journey, there are many components; in this talk, we’re going to focus on data privacy.


We will give some hints on using open-source intelligence, which means looking at all available information from all available sources, including those that are open on the internet. Please look at the link you see on the slide to learn what kind of information you can find just lying around online.


One of the typical offensive techniques, also known as day-to-day data science, is performing linkage attacks. What you see on this slide is a botnet that seems to be operated by two different criminal groups but, in fact, it’s the same organization in the center, behind all of the command-and-control centers. Threat intelligence analysts look at graphs like these every day.


Later, during the tokenization section, we’ll show an example of sequence attacks, and in data rollups, we’ll touch on homogeneity attacks.


What is personally identifiable information? Emails, names, addresses, ZIP codes, IPs, account numbers, driver’s licenses, you name it! It depends on the nature of your business. Some businesses have stricter regulations, some don’t. Your mileage may vary. Talk to your trusted GDPR or CCPA advisor on the matter and make a plan of action.


But anyway, you need to be aware of all of the anonymization techniques, all of which you may be able to build yourself.


So we have plenty of techniques, but they can all be divided into two categories: pseudonymization and anonymization.


Pseudonymization works at the record level and is mostly used for machine learning. According to GDPR, a pseudonym is still considered to be personal data. Given the right access, process, and tools, you can map a pseudonym back to the original value. This is quite complicated to build in practice, by the way: you should buy third-party vendor products if you want accessible user interfaces for re-identification.


So on the left side of the screen, you see typical examples of pseudonyms: just gibberish letters.


On the other hand, we have anonymization, where the scope includes entire tables, databases, or even enterprise data catalogs. This is usually for less restricted business intelligence aggregate tables. Usually, there are more people in BI than data scientists, and they don’t need record-level access, so they work with anonymized data. Data scientists cannot work with fully anonymized data because it loses its value, though in some companies they have to. It depends. Frequently, a combination of several pseudonymization and anonymization techniques is used.


Let’s dive deeper.


The first one is encryption, for when someone thinks that Amazon S3 or Azure ADLS encryption at rest is not enough.

  • The accuracy of data is high.
  • It’s of medium difficulty to implement. The schema is mostly the same.
  • Performance goes down both when you write and when you read the data, because the encrypted values are essentially more data.
  • It adds to cloud cost.


Here’s the example with AWS Key Management Service encryption.

  • The AWS KMS data keys decorator reads the key from the config supplied by the Terraform example on the next slide.
  • It creates a Pandas UDF.
  • The wrapper function initializes the AWS Encryption SDK client.
  • It makes a data encryption key from the Customer Master Key.
  • The Pandas UDF is executed in a Python subprocess with efficient Arrow serialization.
  • By default, the Python subprocess is reused, but you can disable that by setting spark.python.worker.reuse to false, at the performance cost of forking, of course. This way, you can guarantee that data key decryption happens in a dedicated process for a given Spark task. This is to make security officers happier.
  • Later, we call the .apply() method on the Pandas Series to change each row’s value for that specific encrypted column.


Here, you see a simple Terraform configuration that initializes the instance profile, registers it with Databricks, creates the AWS key, and grants the GenerateDataKey and Decrypt permissions to the role that the instance profile is tied to. It creates a cluster that uses that specific EC2 instance profile, which has access to the key that encrypts or decrypts the data and which you pass as Spark conf. It also installs the PyPI library for the AWS Encryption SDK. After that, you can use the previous function easily: in practice, to encrypt or decrypt the data, you call the encrypt and decrypt functions from the last slide.


As you can see, the original data point is small. The encrypted data is 50-60 times larger than the fake name of the non-real person here. Consider whether you need it if you have data at scale. Data architects will probably be watching this to try to make an informed decision 🙂


The other technique is hashing. Apply SHA512 to everything. Yeah. Just hash all sensitive data that way.

  • Accuracy is very high.
  • It’s elementary to implement.
  • The schema stays the same.
  • The format is different, and the values are bigger.
  • It hurts GROUP BYs.
  • For example, say you hash the email and want to know the number of unique people per geography in our imaginary e-commerce shop.
  • The more bytes the hashed email address takes in memory, the slower the GROUP BY will be.
  • It carries risks of malicious re-identification: there are many attacks, such as dictionary or combinator attacks. There’s also a brilliant tool called Hashcat. It’s very advanced because it uses GPUs and has a broad community of users.


To make reversing the hash a bit harder, what you generally do is create a salted hash: you add some random data to the original value, usually at the beginning or at the end, or both, and then apply the hashing algorithm on top of it. That’s all.


My advice could also be to encode the hash with base64 instead of hex encoding – you’ll save approximately 33% of the space, making it way easier to store the data and do a GROUP BY. It’s also easier to pronounce this BASE64 hash than full-length SHA256. You’ll see it copy-pasted a lot in practice.
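
The salting and encoding advice can be sketched with the Python standard library. This is an illustration and not the code from the slides: the salt value, the `salted_hash` helper, and the choice to put the salt on both sides are assumptions.

```python
import base64
import hashlib


def salted_hash(value: str, salt: bytes) -> bytes:
    """SHA-256 over salt + value + salt (salt on both sides, for illustration)."""
    return hashlib.sha256(salt + value.encode("utf-8") + salt).digest()


# Hypothetical salt; in practice it would come from a secret store, not source code.
salt = b"hypothetical-salt"
digest = salted_hash("jane.doe@example.com", salt)

hex_form = digest.hex()                       # 2 characters per byte
b64_form = base64.b64encode(digest).decode()  # about 4 characters per 3 bytes

print(len(hex_form), len(b64_form))
```

For a 32-byte SHA-256 digest, the hex form is 64 characters and the base64 form is 44 characters including padding, which is where the saving of roughly a third comes from.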


Databricks has a lovely feature that replaces each “secret,” like a hash salt, with REDACTED when it is printed. It’s only meant to protect against incidental printing. Just in case 🙂


But anyway, don’t try to invent a new salting technique. It’s probably already cracked by Hashcat. The background of the slide shows the algorithms that Hashcat can figure out.


Freeform data is common: it is text that might contain an email or a personal name.

  • It’s present in every company.
  • It’s tough to clean up because it entirely depends on your specific dataset.
  • A generic solution is hardly possible.
  • The recommendation is an ensemble of different techniques that remove the sensitive pieces from the freeform text.
  • One of the techniques could be regex rules that filter out emails, IPs, etc.
  • Another is named entity recognition from ML, which doesn’t work well on short strings of two or three words.
  • What can also work is pip install names-dataset, which gets you 160k different names that you can use as a filter, and here’s an example of how to do that in a Pandas UDF. Enhance these name definitions with more region-specific and business-specific data, and you will have a lovely way to strip private information from freeform data.
  • The general rule: if you cannot efficiently clean the freeform data, just don’t put it in your big data platform.
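
A minimal sketch of the regex and name-filter part of such an ensemble, in plain Python rather than a Pandas UDF. The patterns are deliberately simple, and `KNOWN_NAMES` is a tiny hypothetical stand-in for a real names list; none of this is the code from the slides.

```python
import re

# Deliberately simple patterns; real-world emails and IPs need more careful rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
WORD_RE = re.compile(r"\b\w+\b")

# Hypothetical stand-in for a large list of first and last names.
KNOWN_NAMES = {"serge", "jane"}


def scrub(text: str) -> str:
    """Replace emails, IPv4 addresses, and known names with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)  # emails first, so names inside them go too
    text = IPV4_RE.sub("<IP>", text)
    return WORD_RE.sub(
        lambda m: "<NAME>" if m.group().lower() in KNOWN_NAMES else m.group(),
        text,
    )


print(scrub("Contact Jane, at jane.doe@example.com from 10.1.2.3"))
```

The order of the rules matters: emails are replaced before names so that a name embedded in an address does not leave half of the address behind.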


On the other hand, names-dataset also has an offensive use, called a combinator attack. Essentially, you take all of the first names and last names and join them together. In the end, you get 16 billion combinations, almost a terabyte, from which you can build a reverse lookup table: same data, different use. You are applying some background knowledge.

  • The key-space is small: we have approximately six billion different people in the world. Well, of course, we cannot guess the names of everyone.
  • We can get the trigrams of the most common names (the trigrams for SERGE are SER, ERG, RGE) and then randomly combine them in the proper sequence.
  • Probably put some Markov chains on top of that, so you get fancier results and a smaller number of records.
  • This is a common technique to fight against typos, because you are basically combining different trigrams.
  • Still, this is research; if you want to invest in it, do that.
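
To make the combinator attack concrete, here is a toy sketch with hypothetical three-by-two name lists and unsalted SHA-256. The real attack described above would run over billions of combinations; the helper names are assumptions.

```python
import hashlib
from itertools import product


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical tiny lists; a real attack would use a full names dataset.
FIRST_NAMES = ["serge", "jane", "john"]
LAST_NAMES = ["smith", "doe"]

# Reverse lookup table: hash -> original "first last" value.
reverse_lookup = {
    sha256_hex(f"{first} {last}"): f"{first} {last}"
    for first, last in product(FIRST_NAMES, LAST_NAMES)
}

leaked_hash = sha256_hex("jane doe")  # pretend this came from a "protected" table
recovered = reverse_lookup.get(leaked_hash)


def trigrams(word: str) -> list:
    """All consecutive three-character slices of a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


print(recovered, trigrams("SERGE"))
```

An unsalted hash of a value from a small key-space is only a lookup away from the original, which is why the salting advice matters.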


One type of data that is considered extremely sensitive is credit card numbers. So here we have four rows.

  • The first row is the PAN, the Primary Account Number.
  • The one you see is a fake credit card that can be used to test VISA payments.
  • This is what the PCI DSS scope protects.
  • The first six digits are called the Bank Identification Number (BIN), and there are approximately 6k different ranges of those available in open sources. Sometimes it’s the first six digits, sometimes more.
  • The BIN is not sensitive by itself.
  • The last digit is used for number integrity validation.
  • The last four digits are what you commonly see in all user interfaces about your cards, and that’s the only thing your bank might ask you on a phone call to identify the card. Otherwise, it’s a scam 🙂
  • It’s not recommended, and sometimes not allowed, to store both the first six digits and the last four digits of a card on the same row. Guess why.
  • Rules are very convoluted and sometimes contradictory.
  • Don’t use hashing. Use tokenization instead.
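
The parts of a card number can be illustrated with a short sketch. The integrity check on the last digit is the Luhn algorithm; the number used here is a widely published VISA test number and is an assumption, not necessarily the one on the slide.

```python
def luhn_is_valid(pan: str) -> bool:
    """Check the last digit of a card number with the Luhn algorithm."""
    digits = [int(ch) for ch in pan if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def split_pan(pan: str) -> dict:
    """Split a card number into the parts discussed above."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    return {"bin": digits[:6], "last4": digits[-4:], "check_digit": digits[-1]}


TEST_PAN = "4111 1111 1111 1111"  # assumed test number, not a real card
print(luhn_is_valid(TEST_PAN), split_pan(TEST_PAN))
```

With the BIN and the last four digits known, only the middle digits of a 16-digit number remain unknown, and the check digit constrains them further, so the remaining search space is small.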


Tokenization has very high accuracy.

  • It isn’t easy to implement. The schema is almost the same; the format is different.
  • Whatever the value is, it’s usually stored as a long number.
  • It’s slow to write, but it’s way faster to read because you have fewer bytes to shuffle.


Usually, with a tokenization architecture, you have the highest access control level for the anonymization layer, which has the token vault database. That token vault holds the types of data, their values, and their numeric representations. All of the sensitive data should live there. It’s just easier and more compliant. You’ll see why.


How to implement: let’s look step by step at how tokenization works.

  • We get the columns and turn them into an array of structs.
  • Once you have the array of structs, you explode it and select star to push the struct fields up.
  • Once you have that, you select the key and value and then assign a token,
  • in this case a row number, ordered by ID.
  • So the name Reynolds gets token 1.
  • Reynolds’ email gets token 2.


Well, look at the order.

  • The tokens for the email and name of the same person are sequential, so you can even guess the name when you have the email, or vice versa.


The fix is to make a randomly sorted window and take the row number within that window.

  • The values are still available on the same record.
  • These records are not vulnerable to sequence attacks, unlike those on the previous slide.
  • That’s just the example.
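
The difference between the two orderings can be sketched in plain Python instead of Spark window functions. The records, the `explode` helper, and the use of `random.SystemRandom` are assumptions made for illustration.

```python
import random

# Hypothetical records; only the name Reynolds comes from the talk.
records = [
    {"id": 1, "name": "Reynolds", "email": "reynolds@example.com"},
    {"id": 2, "name": "Jane", "email": "jane@example.com"},
]


def explode(rows, columns):
    """One (record id, key, value) triple per sensitive cell, ordered by id."""
    return [(row["id"], col, row[col]) for row in rows for col in columns]


def assign_tokens(cells, shuffle):
    """Number the cells 1..n, optionally in random order."""
    cells = list(cells)
    if shuffle:
        random.SystemRandom().shuffle(cells)  # unpredictable ordering
    return {(key, value): token for token, (_id, key, value) in enumerate(cells, start=1)}


cells = explode(records, ["name", "email"])
sequential = assign_tokens(cells, shuffle=False)  # vulnerable to sequence attacks
randomized = assign_tokens(cells, shuffle=True)   # neighbouring tokens are unrelated

print(sequential)
print(randomized)
```

In the sequential variant, token 1 and token 2 always belong to the same person, which is exactly the linkage a sequence attack exploits.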


You need to store the token vault somewhere, which brings Databricks Delta into play: it is one of the options for how to do that.

  • You have to deal with Delta’s optimistic concurrency model.
  • In the code example on the right, we have the same functions from the previous slides, plus a retry loop that tries to save to the table again if you get a concurrent modification exception.
  • Monotonically increasing IDs won’t work as you expect.
  • Type 4 UUIDs (which Spark has built in) are hard to merge into a Delta vault.
  • Use DELTA.


It’s easier to comply with GDPR or CCPA with vault-based systems. You have a single place to delete sensitive data. Without a vault, you have to perform a data erasure operation on probably thousands of tables.


Anonymization techniques


One of the typical usages I’m seeing is creating a randomly generated dataset for external data scientists. The hardest part is to make the data look real and respect real-world skews and distributions.

Another thing is perturbation, or noise.

  • It’s when you inject noise into your sample data to make it a bit more anonymous.
  • Usually, that technique is not used on its own.


The easiest thing to do with the Anonymization is to suppress (not display) by making a data rollup.

  • We suppress a group whenever it falls below a certain threshold, a minimum number of records.
  • You know the country’s biggest cities, and you know the smallest cities in the country: it’s a piece of background knowledge.
  • Imagine you’re working for an imaginary retail company, and you’re analyzing some key performance indicators per region.
  • If you see the smallest city popping up with just a few records, you can find out information about people living in it.
  • Just look them up on Facebook, LinkedIn, or other places. People usually tell where they work publicly.
  • And with that, you can more or less reliably guess how much money they’re making. And then the list goes on.
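
A minimal sketch of a rollup with suppression: groups with fewer than `k` records are folded into a single bucket. The threshold, the city names, and the `(other)` label are assumptions.

```python
from collections import Counter


def rollup_with_suppression(cities, k):
    """Count records per city, folding groups smaller than k into '(other)'."""
    result = {}
    suppressed = 0
    for city, count in Counter(cities).items():
        if count >= k:
            result[city] = count
        else:
            suppressed += count
    if suppressed:
        result["(other)"] = suppressed
    return result


# Hypothetical data: one big city and two very small ones.
orders = ["Big City"] * 5 + ["Small Town"] * 2 + ["Tiny Village"] * 1
print(rollup_with_suppression(orders, k=3))
```

Choosing `k` is a policy decision: every group that stays visible contains at least `k` records, which is the idea behind k-anonymity.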


The other thing is generalization. Let’s say there are different genres of music, for example.

  • There is technical death metal, and there’s mathematical grindcore,
  • which really can define a person, because a person is probably very specific about their music tastes.
  • You want to analyze it further, so you generalize it one level up and say that this person likes extreme metal.
  • That’s still not private enough.
  • So you say the person likes rock music.
  • If this is still identifiable, we can suppress the music taste column altogether.
  • So that’s the typical example of generalization.
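
The genre example can be written as a small generalization hierarchy. The `PARENT` mapping and the use of `*` for a fully suppressed value are assumptions made for illustration.

```python
# Each value points to its more general parent; "*" means fully suppressed.
PARENT = {
    "technical death metal": "extreme metal",
    "mathematical grindcore": "extreme metal",
    "extreme metal": "rock",
    "rock": "*",
}


def generalize(value: str, levels: int) -> str:
    """Move a value up the hierarchy by the given number of levels."""
    for _ in range(levels):
        value = PARENT.get(value, "*")  # unknown values are suppressed
    return value


for level in range(4):
    print(level, generalize("technical death metal", level))
```

Each level trades detail for privacy, and the last step is equivalent to dropping the column.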


Take approximate percentiles of the data, let’s say three, and use them to split employees into new, seasoned, senior, and early employees.

  • You’ve done a very simple, rapid clustering with Spark, and you don’t need a classifier algorithm for that.
  • You can do it with pure SQL.


IP addresses need special truncation rules.

  • Take the last byte, replace it with zero, and you’ve got a /24 CIDR block.
  • Essentially, you’re generalizing the data to city level or so.
  • Generally, the safest option is to truncate the IP address to the /24 CIDR block. Or /16, but that’s when it won’t be useful anymore.


You can apply rounding to data.

  • There are some rules for rounding that your organization can adopt.
  • Round numbers to the nearest multiple of a fixed base, such as 5.
  • Any numbers below half of that base are rounded down or suppressed.
  • Halves are always rounded upwards.


Only on Databricks can you protect data at the column, row, and table level without requiring any external product, though you’d need to use High Concurrency clusters for that.


Watermarking is a fun case.

  • Accuracy is high on a snapshot.
  • It works only for a snapshot and makes no sense globally.
  • It requires you to write a Catalyst rule.
  • The schema is almost the same.
  • A snapshot is made for every query, and the tokens for each snapshot are saved to storage.
  • It gives you an exciting opportunity to track whether your data has leaked outside or not,
  • because you can tie a token to the person who received a snapshot.
  • You have a mapping for that.
  • As you see, this isn’t easy to operationalize, and it should be applied only to the most sensitive data,
  • and when you don’t trust your data scientists.


The other techniques are external controls, where you need data scientists to track other data scientists.


Auditing requires significant infrastructure investment.

  • Things to focus on are usage patterns and ordinary filters.
  • Alert on outliers.
  • You can use Databricks audit logging and Amazon CloudTrail delivered to your S3 buckets,
  • or the Azure audit logging service,
  • and then correlate the data yourself.
  • Good luck with that. It isn’t easy.
  • And if you need it, you’ll probably do it.


Another option is the Remote Desktop

  • to prevent any copy-paste of data.
  • And make the life of your data scientists more difficult.
  • The only access you give is through a remote Windows desktop.


If you want to prevent people from taking screenshots, you should require a physical desktop with a remote Windows desktop on it.

  • With a motion sensor camera.
  • That detects attempts to take photos.
  • one of the banks is doing that
  • If you don’t trust your data scientists at all, you can, too.


Thank you all for a very productive session. This was Serge Smertin of Databricks.

About Serge Smertin


Serge Smertin is a Resident Solutions Architect at Databricks. In his career of over 14 years, he’s been dealing with data solutions, cybersecurity, and heterogeneous system integration. His track record includes taking novel ideas from the whiteboard to production and operating them for years, like large-scale malware forensic analysis for cyber-threat intelligence, or a real-time data science platform as the basis for anomaly detection and decision support systems for an industry-leading payments service provider. At Databricks, Serge’s full-time job is to bring its strategic customers to the next level in their Data and AI journey. On rare occasions, when spare time is left, he leads the Databricks integration with HashiCorp Terraform, the de facto standard for multi-cloud Infrastructure-as-Code, to accelerate Databricks adoption across more customers. To share knowledge, Serge writes blogs and speaks at conferences internationally from time to time.