Using Apache Spark and Differential Privacy for Protecting the Privacy of the 2020 Census Respondents

Download Slides

The goal of the 2020 Census is to count every person in the US, once, and in the correct place. The data created by the census will be used to apportion the US House of Representatives, to draw legislative districts, and distribute more than $675 billion in federal funds. One of the data challenges of the 2020 Census is to making high-quality data available for these purposes while protecting respondent confidentiality. We are doing this with differential privacy, a mathematical approach that allows us to balance the requirements for data accuracy and privacy protection. We use a custom-written application that uses Spark to perform roughly 2 million optimizations involving mixed integer linear programs, running on a cluster that typically has 4800 CPU cores and 74TB of RAM. In this talk, we will present the design of our Spark-based differential privacy application, and discuss the application monitoring systems that we built in Amazon’s GovCloud to monitor multiple clusters and thousands of application runs that were used to develop the Disclosure Avoidance System for the 2020 Census.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hi, I’m Simpson Garfinkel. I’m the senior computer scientist for confidentiality and data access at the U.S. Census Bureau. I’m going to talk today about our work using spark in differential privacy to protect the privacy of the 2020 census respondents.

Using Spark and Differential Privacy to Protect the Privacy of the 2020

This is a pre official presentation, but the views that are in this presentation are mine, those of the author and those not have the U.S. Census Bureau. And this abstract this here so that Google will be able to find and index this talk.

This is the work of a large team at the U.S. Census Bureau. The team is headed by John Abowd the chief scientist, Dan Kifer, the scientific lead came up with a top down algorithm. And we have a team with many people on it.

So the outline is I’m gonna first talk about the motivation behind our use of Spark and differential privacy, and I’m going to talk about how we’re using differential privacy in the 2020 census. I’m gonna talk about the top down algorithm. And then I’m gonna talk about the work that we’ve done monitoring Spark. And this is largely a homegrown system. And I’ll say that if anybody in the audience can think of a better way of doing this, please contact me because we are actively looking for additional ideas.

So, the census is called for in the constitution. Article one, section two calls for Congress to make an actual enumeration of the people of the United States every 10 years, and that information is used to a portion that House of Representatives. It’s also used to distribute $675 billion in 2015, and presumably more today in federal funds.

Uses of the Decennial Census Data

And it’s also used by the Department of Justice for enforcing the Voting Rights Act of 1965.

And so it’s very, very important. Also that the Decennial census is used to calibrate the American Community Survey and many other projects that we have at the Census Bureau.

Privacy and the Decennial Census

It is the reference right. So for the Decennial census, and for all data products that the U.S. Census Bureau puts out, we also have to follow strict confidentiality requirements as specified in law in title 13, section nine of the U.S. code, which is title 13, is the census act. In section nine says that we cannot make any publication that reveals the data provided by a person or an establishment. So one way of thinking about this is that we cant make any publication so that you can draw a line from the person or the establishment that provided the data to to the date of publication, and we are Census Bureau employees are sworn for life to protect that respondent data.

And the data can’t be used for non statistical purposes, which means it cannot be, we cannot do special reveal, confidential data for law enforcement or immigration or anything like that.

Data that we collected in 2010:

Now, for the 2010 census, we collected data on 308,745,538 people in the United States. And for each one we collected where they lived. The household they were in. Their sex, their age, their race, their ethnicity, and their relationship to that householder. So you could be the householder. You could be the householder’s spouse, you could be their child, their parent, you could be an unrelated person living with the householder. It comes to about 44 bits of data per person. And so the entire 2010 consensus, the raw confidential data is 1.7 gigabytes, which doesn’t really seem like that much data.

2010 Census: Summary of Publications

In fact, if you think about it in terms of integers, we basically collected six integers per person. But we publish that data in many different ways. We collected it, we summarize it, we tabulate it. We put out publications of say the number of people living on a block. The number of people of each race living on a block, we publish age pyramids that the census tract level. In total, we published around 2.7 billion integers counts in the redistricting file, which we call the PL94-171 file. After Public Law 94-171 that specifies how redistricting data is provided to the States. The balance of that we put in what’s called Summary File One. So in total, we published around 5.5 billion integers. Now from high school math, you may remember that if you have more equations than you do unknowns, you can actually solve for the unknowns. And you can literally build a system with 5.5 billion equations and solve for the roughly 1.8 billion unknowns. And we’ve done that that’s called database reconstruction.

We performed a database reconstruct and re-identification attack for all

So we’ve performed that database reconstruction attack on our published data, the data that we published in 2010. Using only public micro data, we were able to reconstruct all 308,745,538 micro data records.

We did that. we performed a database reconstruction and re-identification attack on all of the published data from the 2010 census, and we re-constructed 308,745,538 micro data records. And then those records have age, address, sex and ethnicity, but they don’t have name. So we then bought a commercial database in 2010 that had names, addresses, ages and sex. And if you link the two databases together, you now have linkage between name and ethnicity and race. And that’s considered by most Americans to be very confidential data. We also have linkage to ages of children, which also most Americans considered to be quite confidential. It’s actually hard to buy information on children in the United States. Our linkage rate was about 45%. And we know that confidential data so we actually then peaked and we saw how good we got it, and the re-identification rate was around 38% of the linked data, which means that for 17% of the US population, we could reconstruct the confidential data. So you shouldn’t worry that your confidential data may be revealed by the Census Bureau. We didn’t publish names in 2010. We’re never going to publish those names. And we have prepared this public service announcement, which you can view later explaining how we’re going to keep your data safe. But another thing we’ve done for 2020 is that we are modifying the privacy protection mechanism that we’re using.

Statistical agencies aggregate data from many households together into a single publication.

Now to explain… To understand how we’re doing it, you need to drill down on how we actually publish data. What statistical agencies do is we take data from people like here we have seven people on a block and we tabulate that data different ways, we might publish the total number of people on the block, their median age, their mean age, maybe the number of females on the block, the number of males, number of people in each racial category. You can take those counts, and those means, and you can create a set of equations.

We now know how to take many aggregate publications and “solve’ for the original microdata.

And in this case, it comes to about 164 separate equations. So I did that on my MacBook Pro, my MacBook Pro can solve those equations in about 0.2 seconds, and come up with the confidential data based solely on this published statistics.

The basic idea of differential privacy:

So the basic idea of differential privacy is to take the data that we would publish and add noise to it. So in this case, the counter that block might be nine people, but we might publish eight people and the median of age on that block might be 35. And we might publish 45. So, I think noise protects the confidentiality of the respondents.

The Census Bureau is using differential privacy for the 2020 Census.

Now what differential privacy does, is it allows us to control the amount of noise that we add, so that the statistics that we care about are modified the least. And the statistics that we don’t care about are modified the most. And there’s a minute physics video that we work with the Creative Minute Physics to do. And it’s it’s more than a minute, but it’s totally worth watching. We’re not gonna watch it now. Differential privacy allows us to control that privacy loss versus accuracy trade off, It allows us to put the accuracy where we need it. And we say that it’s future proof because no matter what breakthroughs there are in the future, the respondent data protected with differential privacy remains protected.

How much noise do we add?

Now the question is how much noise should we add? So that’s really a policy decision. And you can see on this graph, that if we add a little bit of noise, we get highly accurate data, that’s off on the right, we get highly accurate data, but there’s a high privacy loss. And if we don’t put a lot of noise in over on the left, that’s less privacy loss, but the data is less accurate. So that’s also more noise. So the way the top down algorithm works is that we basically compute the statistics at each layer of the census. We compute statistics at the national layer, at the state layer, at the county for every track and for every block. We then add noise to each of those sets of statistics. And we end up with a set of noisy statistics for each different level. And then we use a commercial optimizer to make sure that all of those noisy statistics are consistent with each other.

There was no off-the-shelf system for applying differential privacy to a national census

Now there’s no off the shelf system for applying differential privacy to a national census. So we had to create one, we had to create a system that would produce high quality statistics that densely populated geographies, like in cities and for the nation, and produce consistent tables. So we created this new algorithm that as I said, it processes statistics top down. It starts at the national level. It then computes the statistics at the state level, at the county level, at tract, block group and finally at the block level. And this system fits into the Decennial production system. So the way that disclosure avoidance system works is that it takes what we call the census edited file, it then performs the differential privacy mechanism. We call that disclosure avoidance because we want to avoid an improper disclosure, and then it produces the micro data detail file from which the tabulations are made.

The Disclosure Avoidance System allows the Census Burequ to enforce

Here’s a more enlarged diagram of what we’re doing, we actually take what’s called the Decennial response file that people census returns. It’s also reports from enumerators who visit house to house. That goes into something called the census unedited file. It is then edited for correctness. For example, you might say, if you’re filling out the census form for your household, you might say that your mother is 30 years old, and your daughter is 95 years old. And so maybe they’re both living with you and you swap them, so we might reverse those two ages. That’s an example. We then take that census edited file, and we apply the global confidentiality protection process, we do the differential privacy, we make the micro data detailed file, and that micro data detail file is then what’s used for the official tabulations.

Our DP mechanism protects histograms

So we actually don’t work with micro data in the census system, we actually work with histograms of people who live on that block. And so to understand what that means is this block that our sample block which had seven people on it, you can think of it as a histogram, which is largely sparse. It’s a histogram with five dimensions. One dimension is age, one dimension the sex, one dimension is race. One dimension is ethnicity. One dimension is relationship to the householder. So it has about 500,000 cells, and most of those cells are zero, but seven of those cells have ones in them.

Our first system applied DP to every block. This was the “block-by-block” system

We take that histogram, and we apply differential privacy to that histogram. And then we have a noisy histogram and from that we reconstruct the micro data that is the most consistent with that noisy histogram. And so we end up with different kinds of people. But there’s a relationship and if epsilon is high, we add a very small amount of noise. And if epsilon is low, we add a lot of noise. And if we add a very small amount of noise, then the histogram looks a lot like the input histogram.

There are roughly 8 million

So our original mechanism, which I call the block by block mechanism, started with 8 million habitable blocks, it added noise to each histogram, and it created 8 million protected blocks.

Each block histogram has:

Now, apparently, this version of the histogram has about 217,000 cells. But if you look at that, we start off with 5.3 petabytes of zeros, 1.7 gigabytes of data, but it’s farce. We run that through the disclosure avoidance system and then we we coalesced that into a new set of micro data. So we go from 1.7 gigabytes to 5.3 petabytes back to 1.7 gigabytes. That’s beginning to sound like something you would want to do with Spark, right?

Running the block-by-block

So here’s actually a map of how we do that with Spark. We have 16 gigabytes, comma separated values file, we create an RDD, with eight million scikit, sparse histograms, we use a map operation to apply the noise. We then use the Gurobi linear program… Mixed integer linear program to solve all those histograms to be consistent. We then map it to create micro data and we save it as a text file. So we very clearly distinguish the hardware the noise gets, the adding of noise, the micro data, the protection process of differential privacy from the post processing, which is done with Gurobi and most of the work is gonna be in the post processing.

In 2018 we invented the

So the problem with that mechanism is that every block ends up with a lot of noise. And when you add those blocks up to get counties and states, the noise gets added up to. So in 2018, we invented the top down algorithm, which actually first works on a national histogram. And then it works on the 51 state histograms. Because we treat the District of Columbia as a state for this process. We then work on the county histograms, on the census tract histograms, and eventually on the block level histograms.

TDA part 1:

So each set of records is turned into a histogram, we add the noise, we get the noisy histograms.

We then take those noisy histograms at both the national level and the state level. We make them consistent with Gurobi. We then take the state level histograms and we make them consistent with the county level histograms again using Gurobi. And you can see that the histogram storage gets bigger and bigger and bigger as we go down. The first histogram only takes 869 kilobytes, the states take 44 megabytes.

IDA part 2: post-processing

And that when we’re at the block level,

TDA part 2:

it’s five terabytes. And we actually need to start two histograms.

In 2018 we invented the

So it’s a lot of data, right.

And in terms of counts, we still, when we’re done with this, we only wanna have, you know, each of those histograms has 217 integers in it. So it’s a total of five terabytes. So those slides, I think are out of order. So all this is going on, and you might get the idea that our RDDs are not very well balanced at all, because we have one RDD for the national level, and it’s like a megabyte, and we have one RDD at the block level, and it’s like, you know, five terabytes. We have another design, but we haven’t moved to it yet because this design works and you don’t have a lot of time left.

Key challenges in monitoring spark

So the top down algorithm is not balanced. Most of our computation is done in Gurobi. So we call Gurobi as a shared library in that mapper, so that’s even invisible to Spark. It’s, like this big Python process going on, we typically have to perform around 800,000 separate optimizations for a typical run. We’re looking at other algorithms where we’ll actually need to do around 3 million optimizations per run. Gurobi is a multi threaded shared library. So we can tell Gurobi, you know, Python thinks it only has one thread and Spark is giving four threads per each executor. But you can go use 90 threads ’cause we’re running on a machine with 96 cores, and it’s very unlikely that all those cores are gonna use at any given time. We’re running in the AWS GovCloud. And it turned out that the AWS monitoring tools like Cloud Watch, were just too coarse. They weren’t recording data fast enough. They were throwing the data away after… When we wanted to look at it, and the data that they were collecting wasn’t really synced up with, with our runs. So we ended up developing our own monitoring system specifically for this application. Now, we’re only gonna run the top down algorithm once in 2021. Because we’re still collecting the data now in 2020. But to develop and tune this algorithm, we have to run thousands of trials, thousands of tests. So we also need to have a system for keeping track of all those runs. And so we could see how our algorithm got better. So to do that, we built a separate monitoring framework.

We created our own monitoring

We have an agent that runs on every node on the master notes and on the worker nodes. They send data separately to a REST server. That REST server stores that’s an Apache server using the REST protocol that stores data in a MySQL database for using the Amazon RDS service. And we store the results. And we have a web based display.

Cluster List

So here’s an example of our cluster list. We typically have anywhere between three and 10 separate EMR clusters running at a time. The clusters are we bring them up and we bring them down all the time. So the Amazon cluster list identify… The cluster identifiers difficult to work with. So the clusters when they start up, they they pick randomly chosen names for themselves. So this one, the first cluster picked the name at ABDAT, ABDAT and the second one picked the name BRIOS.

Each one has potentially a different version of EMR running and then we have our own release system. You can see that the ABDAT has eight workers right now with a total of six gig of RAM. Excuse me. Six, yes, six gigabytes of RAM. And the second one has six workers, but it only has 4.6 gigabytes of RAM. The first one has 768 virtual CPUs, excuse me, not gigabytes, terabytes, that’s running on machines that have 768 gigabytes of RAM. And then, so sorry about that. Terabytes of RAM, not gigabytes of RAM. Anyway, we we track the CPU load on each machine with these little graphs. And if you click the show button, it would actually show you each of the workers so you can see how balanced it is.

Each cluster can be expanded

And in here, I do that with the BRIOS cluster. So you can actually see graphically the CPU load on each machine and the memory load. We could create and kill the clusters for each separate run. But for various reasons, we decided not to do that. The clusters just take too long to start up. And, you know, typically we’ll have like 100 of these machines running at a time and there’s some capacity issues that we’ve run across in the past.

Each DAS run is a “mission”

So each run of the DAS we call a mission and the mission names are randomly chosen with an adjective and a noun. So here we have litigious welcome as the first mission. And we have a weightless unique is the second mission. Each of these missions for like a test run, we try run to small runs, we actually have we have small data sets, or we might be trying to just work on one aspect of the algorithm. You can see there the fourth mission intravenous spread, it crashed, so it has an execute of one and it only ran for 115 seconds. But the first one, litigious welcome. It ran for 72,000 seconds. And we track the Gurobi version, we track other information, we actually track about 100 variables for each of these missions.

Log display shows activity in all

For just the mission itself, the display itself, each mission.

Mission Report

There it can go do logging, I’ll go back to that previous slide.

Log display shows activity in all

There’s a log statement. And so the missions can put logs locally, but they can also send log messages to the centralized console. So you can go to this website and see interleave the log files of all the missions, or you can see just the log for the particular mission.

Mission Report

And you can see that while they’re executing, and you can see that when they’re finished.

You can also click and see the entire YARN log from that mission, or you can see the YARN log for a particular node.

System Load

This graph here shows for a different mission, the execution of the top down algorithm. So all the way off on the left, that’s when the noisy measurements are being computed. And then the first blue green bump that’s when the nation to nation state, this nation to nation solve is being done that can only be done on one machine, but it’s using 96 cores, and so the load goes up to 100. Then we do the nation to state solve, that’s the second green bump there. And that bump is again, it’s using all 96 cores, but it’s not using any of the other nodes. Now, you might say, “Well, why don’t you just shut down the other nodes?” Because looks like you spent four hours? Well, we could, but those other nodes are holding the noisy measurements right now. And so if we shut them down, we’d have to put them off in HDFS or something, then we’d have to load it back.

And it’s petabytes of data. So we keep it in RAM.

Anyway, my guess here it’s like high terabytes of data, but we keep it in RAM. Then we do each different solve. And then you can see that big noisy part in the middle. That’s when the I think that’s the tract block group solve happening. And then the next big bump that’s the block group to block solve happening And then there’s a whole bunch of statistics being computed at the end. And it’s not using a lot of stuff on the notes, but the the red, the red, plus the red line, that’s the master, you can see there’s a lot of stuff that the master is orchestrating. And it keeps hitting all the other clusters, all the other workers, and then finally, everything gets saved out.

Free Memory

So, you know, this is a run that is thousands of dollars in computer time, it’s running on machines that are $7 an hour each. And I think we have, you could count some large number of nodes. We track the free memory now, we used to run out of memory and our machines kept crashing, but now we know why. And we stopped that from happening. But we still wanna be able to track the free memory because if it gets down to zero, that would be really bad. And you can see the master node only has about 200 gig of free memory and the worker nodes they’re okay.

In Summary

We also track the HDFS usage, we track input and output, we track actually all the parameters, I’m just not gonna show you all the scraps. So in summary, in 2020. So let me just say the reason we track those is that we’re trying to make the algorithm more efficient. At the same time, we’re trying to make the algorithm better. So the word efficiency is double loaded. The statisticians, the computer scientists, they like to think of the efficiency of the statistical efficiency, how little noise gets into the statistics that we care about, and keeping that noise in the statistics that we don’t care about. And the engineering team wants the machines to be used as efficiently as possible. We would like to see the load on every machine at a constant 100. They have 96 cores, it’s okay, if they go up to 600 loads. That just means it’s gonna take 10 times longer, but at least all of them machines are running with a load of 600, we could increase the number of clusters and run that faster, but it would… The number of workers but it would then take longer. Sorry, it would then cost more. So in summary, the 2020 census will use differential privacy as its privacy protection mechanism. Differential privacy allows us to add carefully controlled noise to the statistical queries. Most of the work is in the post processing to make the micro data for legacy users you might remember from that graph, the noise was added all at the beginning and then there was all the other parts of the top down algorithm. And we use spark to distribute the load to a commercial optimizer, we use Gurobi. And we solved in that one that I showed you around 800,000 mixed integer linear programs, the new version of the top 10 algorithm will probably solve around 2.4 million mixed integer linear programs. Maybe more.

So with that, I want to thank you all.

For more information…

If you’re interested in this more, you can get the ACM communications article that we published on understanding database reconstruction. And that has that example was seven people worked out, there was this lovely article in the “Wall Street Journal.” And there is an article in “Science.”

The Most Technical 2020

And there’s been a lot of other coverage about our move to differential privacy. If you want the actual details, you can actually get the source code, which we’ve put on GitHub. This is the first time ever that the Census Bureau has published source code for its disclosure mechanism. And we’ve also published our detailed design specification for the system. And there’s a scientific paper describing the mechanism as well. If you want more background, it’s here. And with that, I guess I’m free to take questions.

Watch more Spark + AI sessions here
Try Databricks for free
« back
About Simson Garfinkel

U.S. Census Bureau

Simson Garfinkel is the Senior Computer Scientist for Confidentiality and Data Access at the US Census Bureau. He holds seven US patents and has published more than 50 research articles in computer security and digital forensics. He is a fellow of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE), and a member of the National Association of Science Writers. As a journalist, he has written about science, technology, and technology policy in the popular press since 1983, and has won several national journalism awards.