Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large Datasets with Apache Spark


In today’s data-driven economy, companies collect more and more user data as valuable assets. At the same time, users have rightfully raised concerns about how their data privacy is protected. In response, data privacy laws have been introduced to protect users’ privacy, among which the General Data Protection Regulation (GDPR) by the European Union (EU) and the California Consumer Privacy Act (CCPA) are two representative laws regulating business conduct in their respective regions. A common requirement is to access or delete, in a timely manner, all records in all ever-collected data given a specific user’s search key(s). The size of the collected data and the volume of search requests make enforcing GDPR and CCPA highly inefficient, if not infeasible given the resources required.

In this talk, we demonstrate our work on enforcing GDPR and CCPA in Adobe’s Experience Platform (AEP) by efficiently solving the search problem above. Specifically, we build Bloom Filters while saving data in the Data Lake with minimal resource and maintenance overhead, which reduces the search time for a single search request by nearly 10X. Furthermore, we build orchestrated microservices for splitting extra-large search jobs into multiple smaller jobs and scheduling them, balancing resource consumption against processing time. Finally, we walk through a few lessons learnt from handling datasets with a large number of files and partitions with Spark.


Video Transcript

– Hello, everyone, I’m Miao Wang, engineering manager at Adobe.

Today, with my colleague Jun Ma, we will present our work on enforcing GDPR and CCPA in very large datasets with Apache Spark. Here’s our talk agenda. I will cover the background, the platform data management architecture and the challenges we are facing in handling GDPR and CCPA in our Data Lake. Then Jun will present our solution with Bloom Filter, including design concerns, performance evaluation, trade-offs and future work.

First, the background. In recent years, data privacy has become more and more important for both end users and data service providers.

Therefore, a few regulations have been enforced in the field, for example, the General Data Protection Regulation (GDPR). It is a regulation in the European Union and the European Economic Area, effective from May 2018. Similarly, California enforces the California Consumer Privacy Act (CCPA), effective from January 1st, 2020. All these regulations guarantee that users can access and erase their data stored by data service providers. Violation of such regulations not only hurts the reputation of the business, but also implies a big penalty to the business. From the back-end engineering perspective, the core component of handling such regulations is the same, that is, locating a user’s records given the user’s identity information.

Adobe Experience Platform – Data Flow

So before introducing the details of how we handle GDPR and CCPA in our data platform, I would like to walk through the data flow of Adobe Experience Platform. As a data platform, we provide a variety of ways of bringing customer data into the Data Lake, like REST APIs, connectors or streaming services. While data is being ingested into the Data Lake, we launch compute jobs to do data validation and data conversion, and to write clean data into the Data Lake. We also provide data access services, including REST APIs and SDKs, to consumers like data engineers, data scientists, customers and third parties. Within the Data Lake, we run a data management service which handles GDPR and CCPA requests. Now let’s take a closer look at the architecture of the data management service.

Our Data Management Architecture

So in general, there are two types of GDPR and CCPA requests. The first type is called Access, which means a user wants to access his or her data stored in our Data Lake. The second type is called Delete, which means the user wants to delete his or her data stored in our Data Lake. The user submits a request to the Adobe privacy service, which does authorization and filtering on the request and submits the request to the data management service. The data management service will extract the user identity information and submit jobs to the scheduling service, which launches and monitors compute jobs. The compute jobs do the heavy lifting of scanning the data and matching the identity columns against the keys. If it is a Delete request, we will delete the matched records in the Data Lake. If it is an Access request, we will write the matched records to our egress storage, for example, Azure storage, and then notify the user to download the records through our customer notification pipeline.

There are three challenges in handling CCPA and GDPR in our data platform, that is, data size, cost and scalability. We are seeing tens of thousands of requests per customer per hour in our production. In the next two slides, I will present the data size and the cost of handling these requests.

Challenge – Data Size

This table shows how we classify our Data Lake accounts according to the data size and the number of files in each account. For example, a small storage account has less than one terabyte of data, with fewer than a million files and folders, while an extra-large account has 50 to 400 terabytes of data with 50 to 100 million files and folders. A customer may have multiple Data Lake accounts for different business purposes.

Challenge – Compute Cost

This figure shows the compute cost of one scan for different data sizes, based on our measurement. Scanning one terabyte of data costs $37, and scanning 10 terabytes of data costs $278. Based on a model projection, a full scan of one petabyte of data will cost more than $15,000. There are more details documented in a blog we have published, linked below, which you can refer to.

Problem – Finding a Needle in a Haystack

So let’s see what the problem is. The problem of locating a customer record based on user identity is like finding a needle in a haystack: there are too many unnecessary scans. Take the Audience Manager dataset as an example. This dataset contains 700 terabytes of data, which includes 75 billion users, so on average, the user record size is around 10 kilobytes. Therefore, finding a user with ID information is equivalent to finding 10 kilobytes of data in 700 terabytes of data.
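As a quick sanity check on the scale involved, the average record size implied by the two figures quoted above can be worked out directly. This is only back-of-the-envelope arithmetic; decimal terabytes are an assumption.

```python
# Back-of-the-envelope check using the figures quoted in the talk.
DATASET_BYTES = 700 * 10**12   # 700 terabytes (decimal units assumed)
USER_COUNT = 75 * 10**9        # 75 billion users

# Average amount of data per user, in bytes.
avg_record_bytes = DATASET_BYTES / USER_COUNT

# Share of the dataset that is relevant to a single user's request.
fraction_relevant = avg_record_bytes / DATASET_BYTES

print(f"average bytes per user: {avg_record_bytes:,.0f}")
print(f"fraction of the dataset relevant to one user: {fraction_relevant:.2e}")
```

Almost everything a full scan reads is irrelevant to any single request, which is the motivation for skipping data.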

Data Skipping

So one intuition is to apply some data skipping technique to avoid such unnecessary scans. In this figure, we show how we store data in our Data Lake. The data is stored in our Data Lake based on some partition strategy. In this example, it is stored by year, month, day and batch ID. Batch ID is the internal transaction ID. Since the user identity information spans multiple columns of data, partitioning and min-max statistics do not work well for data skipping on each column. An ID column also has a very large number of distinct values, which means that the column has a high cardinality, and that makes encoding schemes like dictionary encoding not work well for data skipping either.

So our solution is to use Bloom Filter, which is a probabilistic data structure to test whether an element is a member of a set.

Solution – Bloom Filter

Bloom Filter has a nice property: it has zero false negatives. Therefore, if a Bloom Filter says that an ID is not contained in a file, it is safe to skip the whole file. There are two key parameters that determine the Bloom Filter size and accuracy. The first one is the number of distinct values, or NDV for short, and the second one is the false positive probability, or FPP for short. The algorithm itself is not super complex to implement, and there are multiple implementations in the open source world. However, integrating Bloom Filter into production and applying it properly is the challenging task.
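To make the "zero false negatives" property concrete, here is a minimal Bloom filter sketch in Python. It is illustrative only and is not the implementation used in the talk; the class name, the SHA-256 based double hashing and the textbook sizing formulas are all assumptions chosen for brevity.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter; illustrative only, not the implementation from the talk."""

    def __init__(self, ndv: int, fpp: float):
        # Textbook sizing: number of bits and hash functions for ndv items at target fpp.
        self.num_bits = max(1, math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / ndv * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value: str):
        # Double hashing: derive all bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


bf = BloomFilter(ndv=1000, fpp=0.01)
ids_in_file = [f"user-{i}" for i in range(1000)]
for user_id in ids_in_file:
    bf.add(user_id)

# No false negatives: every inserted ID is reported as possibly present.
all_found = all(bf.might_contain(user_id) for user_id in ids_in_file)

# False positives are possible for IDs that were never inserted.
absent = [f"other-{i}" for i in range(10_000)]
false_positive_rate = sum(bf.might_contain(x) for x in absent) / len(absent)
print(f"all inserted IDs found: {all_found}")
print(f"observed false positive rate: {false_positive_rate:.4f}")
```

A negative answer is always trustworthy, so a file can be skipped on it; a positive answer only means the file still has to be scanned.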

Key Design Concerns

It is because we have to solve three categories of problems. First, when and how do we build a Bloom Filter? Second, how can we maintain the Bloom Filter while updating the Data Lake? Third, how do we apply a Bloom Filter with Apache Spark? I will hand the rest of the presentation to Jun and let her present our solution using Bloom Filter.

Design Concerns

– Thanks Miao. Now I will go over the design concerns that we had, one by one. The first question is when we should build the Bloom Filters. The decision was to build the Bloom Filter at ingestion time, when we get the data from the producer. Once the data lands in the Data Lake, there will be a Spark job kicked off to load the data, do some validation and conversion, and write the result back to the analysis directory in the Data Lake. Note that the output of the Spark job contains two parts: one is the data files and two is the Bloom Filters. Originally, we thought about building Bloom Filters after ingestion, but we decided to go with the current approach for several reasons. Firstly, it adds less overhead overall. With this approach, we only need to scan the data once, and we generate the Bloom Filters and data files at the same time. There is no need for an extra scan. Secondly, the overall operational cost is low. We have an existing Spark job into which we can plug the Bloom Filter related logic, but if we wanted to build Bloom Filters after ingestion time, we would have to have a separate Spark job specifically for the Bloom Filters, as well as the logic to decide when to schedule that job. Another advantage is that this approach will not introduce any delay. As soon as the data file is available, we know for sure the Bloom Filters will also be available. We don’t have to worry about the case where some of the files have Bloom Filters and some others don’t. There’s one disadvantage of this approach: it adds more overhead and failure points to the ingestion process. We also measured the overhead through a test, which will be shared a bit later.

The second question is at which level we should build the Bloom Filter. As Miao already mentioned, our data is stored as Parquet in the Data Lake, and it is partitioned by some partitioning strategy. The Bloom Filter is built at the file level, because, as we all know, Parquet is immutable. Once the file is generated, nobody will be able to update the Parquet file. So if we build the Bloom Filter at the file level, then we don’t have to worry about maintaining the Bloom Filters, and we don’t need to delete any elements from a Bloom Filter after it has been built.

There’s just one thing: if a data file needs to be deleted at some point, we have to make sure the Bloom Filter files are deleted at the same time.

The Bloom Filters are stored in a separate metadata folder, and they’re partitioned in the same way as the data files. We’re building one Bloom Filter per data file, per column. So the total number of Bloom Filter files will be equal to the total number of data files multiplied by the number of columns with Bloom Filter enabled. One problem that might be introduced with this approach is that the Bloom Filter size will be much smaller than the original data file. So, if we are building multiple Bloom Filters for the same data file, then we are creating a lot of small files in the system. To mitigate this problem, we are considering combining the different Bloom Filters for the same file into one, so that we can reduce the number of small files.

One major use case we have for GDPR is to search for an identity in a map type column. This is a kind of complicated structure with a map as the top-level column, and the value of the map is an array. Within the array, each element is a struct, and the identity is stored as a string in the struct.

Hence, if we want Bloom Filter to help with all the GDPR use cases and improve performance, we have to make sure a Bloom Filter can be built on this complicated column. The way we solve this problem is to consider the key and the value as two separate columns, so that we can pull all the IDs out of the value arrays and insert all of them into one Bloom Filter, generating one Bloom Filter for the entire map irrespective of which key the ID belongs to.
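The flattening step described here can be sketched in a few lines of Python. The row shape, the map keys (`email`, `ecid`) and the struct field name `id` are hypothetical, and a plain set stands in for the per-file Bloom filter so the sketch stays short.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical shape of one row's identity column:
# map<key, array<struct<id: string>>>
IdentityMap = Dict[str, List[Dict[str, str]]]


def ids_for_bloom_filter(rows: Iterable[IdentityMap]) -> Set[str]:
    """Collect every ID in the value arrays, ignoring which map key it sits under."""
    collected: Set[str] = set()
    for identity_map in rows:
        for structs in identity_map.values():
            for struct in structs:
                collected.add(struct["id"])
    return collected


# Two rows from the same (hypothetical) data file.
rows = [
    {"email": [{"id": "a@example.com"}], "ecid": [{"id": "123"}, {"id": "456"}]},
    {"ecid": [{"id": "123"}]},
]

# Each of these values would be inserted into the single filter for the file.
file_ids = ids_for_bloom_filter(rows)
print(sorted(file_ids))
```

Because the key is ignored, a lookup only has to ask whether an ID might be in the file at all, which is all that file skipping needs.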

Another important part is to decide what size to allocate to the Bloom Filter. The size of the Bloom Filter is determined by two factors: the FPP, the false positive probability, and the NDV, the number of distinct values. Once those two variables are decided, you can use the formula to calculate the optimal size of the Bloom Filter.

In order to decide the number of distinct values, we have to look at the data pattern in our production. We’re trying to make sure all the files generated in production are around one gigabyte. So once the file size is decided, we can use the number of rows in each file to estimate the number of distinct values. We’re using 2.1 million as the default value. Now FPP is the only variable left to determine the size of the Bloom Filter, and here’s a table showing the Bloom Filter size for the different values of FPP we select. For the tests we’re running for estimation and performance evaluation, we’re using 0.1 as the default value.
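The talk does not spell out the formula, so the sketch below assumes the textbook Bloom filter sizing, with bits m = -n ln(p) / (ln 2)^2 and hash count k = (m / n) ln 2, where n is the NDV and p is the FPP. The function name is hypothetical, and the printed sizes are not claimed to match the table on the slide.

```python
import math


def bloom_filter_size(ndv: int, fpp: float):
    """Return (bits, hash_count) from the textbook Bloom filter sizing formulas."""
    bits = math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / ndv * math.log(2)))
    return bits, hash_count


NDV = 2_100_000  # default number of distinct values quoted in the talk

# Compare a few candidate FPP values, including the 0.1 default quoted in the talk.
for fpp in (0.1, 0.05, 0.01):
    bits, hash_count = bloom_filter_size(NDV, fpp)
    size_mib = bits / 8 / 1024 / 1024
    print(f"fpp={fpp}: {size_mib:.2f} MiB, {hash_count} hash functions")
```

Under these formulas the size grows with the logarithm of 1/FPP, so tightening the FPP costs relatively little storage compared with growing the NDV.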

Now let’s look at how the Bloom Filter is integrated with Spark. The Bloom Filter logic is implemented in Apache Iceberg. It’s an open source, lightweight table format to manage the table metadata. It’s also integrated with Spark. On the write path, the user has to pre-define the ID columns on which they want to enable Bloom Filter, and the Bloom Filter configuration is stored in the table schema. There is an API provided to configure the Bloom Filter setup for each column. The user should provide the column name, FPP and NDV in order to set up the column. Once the table schema is decided, people can use the DataFrame API to write the data into the table and specify the table format as Iceberg, so that the Bloom Filters and data files can be generated at the same time.

On the read path, again, we need to make use of the DataFrame API and load the table in the Iceberg format. In order to tell Iceberg to make use of Bloom Filter to do file skipping, we have to pass the Bloom Filter query to the Iceberg reader using a Spark option. In the Iceberg option, we should provide the column we want to search on and the value we are interested in, so that Iceberg, in the background, can automatically skip most of the files that don’t contain this value.

We evaluated performance on a 1.5 terabyte dataset. This is one month of customer events data, and we have obfuscated the entire dataset to make sure it’s anonymous. In total, it has more than 1,700 files, and on average, each file is about 833 megabytes. We are building a Bloom Filter on one ID column.
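The file skipping that happens behind the reader can be sketched in plain Python. This does not use Iceberg or Spark APIs; `TinyBloom`, `build_filters`, `files_to_scan`, and the file names and IDs are all hypothetical and exist only to show the idea of one filter per data file that is consulted before scanning.

```python
import hashlib
from typing import Dict, Iterable, List


class TinyBloom:
    """Very small Bloom filter used only to make this sketch self-contained."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: str) -> List[int]:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))


def build_filters(files: Dict[str, Iterable[str]]) -> Dict[str, TinyBloom]:
    """Write path: one filter per data file, built from that file's ID column."""
    filters = {}
    for path, ids in files.items():
        bloom = TinyBloom()
        for value in ids:
            bloom.add(value)
        filters[path] = bloom
    return filters


def files_to_scan(filters: Dict[str, TinyBloom], search_id: str) -> List[str]:
    """Read path: keep only files whose filter says the ID might be present."""
    return [path for path, bloom in filters.items() if bloom.might_contain(search_id)]


# Hypothetical file names and IDs.
data_files = {
    "part-0001.parquet": ["u1", "u2", "u3"],
    "part-0002.parquet": ["u4", "u5"],
    "part-0003.parquet": ["u2", "u6"],
}
filters = build_filters(data_files)
candidates = files_to_scan(filters, "u2")
print(candidates)
```

Only the candidate files are handed to the actual scan; any file that really contains the ID is always among them, since a Bloom filter never reports a false negative.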

Performance – Ingestion Overhead

The overhead of building the Bloom Filter at ingestion time is about 1.1% compared with the ingestion cost before, without Bloom Filter. Similarly, the storage also increased by about 1% in order to store the Bloom Filter related metadata. Note that here we are estimating with Bloom Filter enabled on one column. If, for a dataset, there are multiple columns on which we want to build Bloom Filters, then the overhead will increase accordingly.

Id Distribution

Before we look at the performance on the read side, we want to spend some time understanding the ID distribution. It is quite important because we want to make sure the IDs are sparsely distributed in the dataset, so that there is enough room for Bloom Filter to optimize. Imagine if there are a lot of IDs that appear in almost every single file; then Bloom Filter cannot do anything in this situation, because there won’t be any file that can be skipped. It turns out that the data we have in production does have very sparsely distributed IDs. For the dataset we’re using for the test, 99% of the IDs appear in fewer than 31 files, compared with the total number of files, which is more than 1,700. This is less than 2%, so we do have enough room for Bloom Filter to do file skipping. We also confirmed this pattern with a much larger dataset. This one is a six-month customer events dataset, and the ID distribution is similar: 99% of the IDs appear in less than nine files for the large dataset.

So we picked a couple of IDs that appear in different numbers of files and ran the GDPR query on them. We also ran the Spark job without Bloom Filter to get a baseline cost.
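A distribution check of this kind can be sketched as follows. The talk does not say how the analysis was actually done, so the function names, the nearest-rank percentile and the toy data are all assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def files_per_id(files: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Count, for each ID, how many distinct files it appears in."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for path, ids in files.items():
        for value in ids:
            seen[value].add(path)
    return {value: len(paths) for value, paths in seen.items()}


def percentile(counts: List[int], pct: float) -> int:
    """Nearest-rank percentile: smallest count that covers pct percent of the IDs."""
    ordered = sorted(counts)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical toy data: "hot" appears in every file, the other IDs in one file each.
files = {f"file-{i}": ["hot", f"id-{i}"] for i in range(10)}
counts = files_per_id(files)
p90 = percentile(list(counts.values()), 90)
print(f"90% of IDs appear in at most {p90} of {len(files)} files")
```

The lower this percentile is relative to the total file count, the more files a per-file filter can rule out for a typical request.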

Performance – Scan Dataset

It turns out Bloom Filter is very helpful in terms of cost saving. For example, for the ID that appears in 31 files, we can save more than 90% of the cost, and this case already covers 99% of the IDs.

Value Bloom Filter Brings to GDPR & CCPA Compliance

So to conclude, Bloom Filter is very helpful in handling the GDPR and CCPA use cases. It allows us to process the datasets much faster, since a lot of files can be skipped. It enables us to support much larger datasets than before, and the cost is greatly reduced.

Ongoing Work

Now we’re working on productizing the Bloom Filter work, and there are a couple of things still going on. The first one is that we’re trying to combine the Bloom Filters for the same data file into one, to mitigate the small file problem. Also, currently, the process of loading Bloom Filters runs in the driver; we want to move it to the executors so that the process can be parallelized. Another thing is that we’re trying to extend the Bloom Filter use case from GDPR to general queries, because any column that is sparse and is frequently used in queries can have Bloom Filter enabled to speed up the queries.

Here are the contact details for Miao and myself. Feel free to reach out to us if you have any questions or thoughts.

About Miao Wang

Adobe, Inc.

Miao is an Engineering Manager at Adobe, where he works with a great team on platform engineering with Spark and other open source technologies. He used to be an active Spark contributor before changing to his manager role. His interests span high speed networks, data center infrastructure, data processing and machine learning. Prior to joining Adobe, he worked at A10 Networks, IBM and Alibaba in various engineering roles. Miao holds a Ph.D. in Computer Science from the University of Nebraska-Lincoln with a focus on Peer-to-Peer (P2P) Streaming.

About Jun Ma

Adobe, Inc.

Jun Ma is an engineer on the Data Ingestion team of Adobe Experience Platform. She has been working on building a distributed system that allows customers to ingest data in large volume and facilitates data management. Her focus is to ensure the efficiency, scalability and resiliency of the system. Prior to joining Adobe, she obtained a master’s degree in Information Technology from Carnegie Mellon University.