
They deliberately put errors in the Census


We need a healthier way to think about privacy

By Matthew Yglesias

August 15, 2021

Every 10 years the Census assembles tons of information about each American household and then for privacy reasons keeps it secret for 72 years. So I’ve looked up the block-level census pages for all four of my grandparents in the 1940 Census, but I don’t know exactly where any of them lived in 1950 because that’s still under lock and key for a couple more years.


But the point of the Census is not just to provide historical information for future genealogy researchers.


We use the Census in something close to real-time to draw our legislative district maps (not just for Congress but also lower-profile stuff like city councils). And academic researchers use Census data to study all kinds of modern problems. The way they do this is not by using the individual Census files but using aggregate Census data that describe the population characteristics of a given geographic span. So while the underlying Census says the name, age, race, etc. of each person living in each house, the aggregate data says how many people live in the tract and what the age distribution, racial distribution, etc. of those people is.


Except now if you look at the new Census’ finest-grained areas, it’s going to give you information that’s clearly wrong. As Kriston Capps explains in a great piece on this, the Census now says that Liberty Island (where the Statue of Liberty is) is home to 48 people even though everyone knows that zero people live there. That’s because the Census Bureau is now deliberately adding a large amount of inaccuracy to fine-grained Census information. The idea is that as you zoom out to larger geographical units, the inaccuracies will balance out and you can get the correct aggregates.
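

To make that intuition concrete, here is a minimal sketch (my own toy example in Python, with made-up block counts and noise levels, not the Bureau’s actual algorithm) of how zero-mean noise injected into individual blocks mostly washes out once you sum up a whole county:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true populations for 1,000 census blocks in one county.
true_blocks = rng.integers(0, 200, size=1_000)

# Differential-privacy-style perturbation: independent, zero-mean Laplace
# noise added to every block count (a stand-in for the real algorithm).
noisy_blocks = true_blocks + rng.laplace(scale=10, size=true_blocks.size)

# Block-level error is large relative to a typical block of ~100 people...
block_error = np.mean(np.abs(noisy_blocks - true_blocks))
print(f"average per-block error: {block_error:.1f} people")

# ...but the positive and negative errors mostly cancel when you sum the
# blocks, so the county total stays close to the truth.
county_error = abs(noisy_blocks.sum() - true_blocks.sum())
print(f"county-level error: {county_error:.1f} people out of "
      f"{int(true_blocks.sum()):,} residents")
```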


I think the sober-minded journalistic thing to say about this is it’s an effort to strike a balance between accuracy and privacy. And then you’re supposed to say that striking balances of this sort is controversial, and different stakeholders have different views on whether the Census got it right. But I actually think it should be an occasion for a larger rethink of what the value of “privacy” really is in these contexts. It seems to me that as Americans, we’ve landed in a situation where, for better or worse, we have virtually no privacy thanks to commercial data collection. But then we force the government to hobble itself in the name of privacy, when all that really does is impede efficient state functioning and deny researchers access to the kinds of useful information that are abundantly available to advertisers.


You have very little privacy from commercial actors

I was just checking up on my Google ad preferences page, and they know I’m a man between 35 and 44 who uses an iPhone and likes to watch action movies. It doesn’t say explicitly there that they know where I live, but Google obviously does know that because I’ve marked it as home on my Google Maps.


In an earlier, cruder era of marketing, I used to get a lot of spam phone calls in Spanish because I guess telemarketers would just identify people based on their surname. I can’t recall the last time I was mistakenly marketed to in Spanish, since it’s obvious to anyone tracking my internet usage that I use English.


And of course, commercial vendors are able to obtain much more detailed information than any Census questionnaire. Credit card companies track what their users buy, then package that information and sell it off. The transaction data is anonymized, of course, but as the privacy people keep pointing out, it’s not especially hard to recreate detailed individual data from what’s publicly available:


Tokenization “effectively created a loophole,” says Yves-Alexandre de Montjoye, who heads the computational privacy group at Imperial College London, and who has advised the European Commission on privacy issues. By removing names and other details, companies can argue “that it’s not personal data; it’s ‘anonymized,’ ” he says.


But it isn’t so anonymous. In 2015, de Montjoye and colleagues at MIT took a data set containing three months’ worth of credit card transactions by 1.1 million unnamed people, and found that, 90% of the time, they could identify an individual if they knew the rough details (the day and the shop) of four of that person’s purchases. In other words, a combination of a few receipts, tweets, and Instagram photos of you dining out is enough to reveal your other purchases.
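

To see why a handful of receipts is enough, here is a toy sketch of the matching step (my own illustration with a fabricated transaction log, not de Montjoye’s actual method): you filter the “anonymized” customer IDs down to the ones whose histories contain every purchase you already know about.

```python
from collections import defaultdict

# A hypothetical "anonymized" transaction log: (customer_id, day, shop).
# Names are stripped, but each customer keeps a consistent ID.
transactions = [
    (101, "2015-03-02", "coffee shop"),
    (101, "2015-03-02", "bookstore"),
    (101, "2015-03-14", "restaurant"),
    (101, "2015-03-21", "pharmacy"),
    (202, "2015-03-02", "coffee shop"),
    (202, "2015-03-15", "restaurant"),
    (202, "2015-03-21", "grocery"),
    # ...imagine a million more customers here...
]

def candidates(log, known_purchases):
    """Return the customer IDs consistent with a few publicly known
    (day, shop) purchases, e.g. gleaned from tweets or Instagram posts."""
    by_customer = defaultdict(set)
    for cust, day, shop in log:
        by_customer[cust].add((day, shop))
    return [cust for cust, purchases in by_customer.items()
            if known_purchases <= purchases]

# Four outside observations are enough to pin the target down to ID 101.
known = {("2015-03-02", "coffee shop"), ("2015-03-02", "bookstore"),
         ("2015-03-14", "restaurant"), ("2015-03-21", "pharmacy")}
print(candidates(transactions, known))  # -> [101]
```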


Now I think a separate question is “are third-party actors actually interested in doing this,” and the answer seems to be mostly no. But of course if you shop on Amazon, which I certainly do, then Amazon knows all about what you’re buying. And Amazon has to know where you live to deliver stuff to you. Your cell phone can be used to track your location at basically all times. The websites you browse include hooks for various companies to monitor you across the web. A few years ago the European Union imposed some new privacy rules that in some sense were meant to discourage this, but in practice they just mean you need to click to “accept cookies” before you can see websites.


The people who write about these issues tend to be people who care about them, which generates unrepresentativeness in much the way that only cranks actually show up to neighborhood meetings to complain. So you end up with a lot of rhetorical constructions about life in a “digital panopticon” or “surveillance capitalism,” but normal people give almost no sign of caring.


People don’t seem to value privacy highly

There was a boomlet a few years ago of people saying that “data is the new oil” which in turn produced a boomlet of refutations. One of the best was by Antonio García Martínez in Wired, who — among other things — pointed out that your data just isn’t that valuable.


Facebook is a super-lucrative business, for example, mostly because it has a ton of users. The actual revenue per user is a relatively paltry $32 per year. Why does Facebook rely on a surveillance-and-advertising business model rather than charging $5 a month (that’s $60 a year, nearly double what the ads bring in) and having rock-solid privacy? Well, it’s two things.


On the one hand, Facebook users don’t value Facebook that much and wouldn’t want to pay for it.


On the other hand, internet users don’t value privacy that much and don’t see being stalked around the web as a serious cost.


And you just see this over and over and over again. I didn’t have to label my home address in Google. For that matter, I don’t have to use Gmail — there are perfectly good paid email services out there that are more privacy-protective, and I even use one for my SlowBoring.com address. But to actually scrap my Gmail address seems like more trouble than it’s worth. When I go to Safeway, Giant, Walgreens, and CVS, I use my loyalty cards to pick up small discounts.


Now of course the privacy people will chime in here to note that one reason I value my own privacy so little is that I have so little privacy. Given the all-seeing, all-knowing eye of surveillance capitalism, what would the marginal value of not using the Walgreens loyalty card possibly be? It’s true that it would be cheap to give that specific thing up. But to try to live a genuinely private life would mean eschewing not one store loyalty card but a half dozen. I’d have to get a bunch of extra paid subscriptions (VPN, paid email, etc.), and the costs really would add up. I’d have to give up seeing photos of my friends’ kids on Instagram!


Perhaps if someone wrote a really good law that did a really good job of comprehensively protecting privacy, I would find it worth bearing some monetary loss. But it’s also possible that the reason everyone appears to put a near-zero value on privacy is that marginal privacy is, in fact, basically worthless.


I think that’s an interesting debate and I eagerly await someone’s proposal for that hypothetical really good law. But my problem is that privacy advocates don’t seem to actually apply that insight about the low marginal value of privacy to these issues around the public sector.


What is the Census protecting us against?

Turning back to Capps’ article on the Census, the concern here is that traditional public-use microdata could be subjected to processor-intensive database analysis to figure out that I am a white man who is 40 and lives at my address.


In November 2016, the Census Bureau mounted a database reconstruction attack against itself. John Abowd, the bureau’s chief scientist, assembled a crack team to use summary tables to reconstruct the 2010 census records for every American: sex, age, race, ethnicity and block-level location. Two years later, the team had finished the project, assembling a virtually complete and highly accurate match for the nearly 8 billion figures in the 2010 census tables. Using the same approach and off-the-shelf software, the New York Times was able to replicate the process for Manhattan. Abowd described the prospect of database reconstruction as “the death knell for public-use detailed tabulations and microdata sets as they have been traditionally prepared.”
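

The mechanics are less exotic than they sound. Here is a deliberately tiny sketch of the idea (my own toy example, not the Bureau’s or the Times’ actual procedure): treat the published summary statistics as constraints and search for the sets of individual records that satisfy all of them at once.

```python
from itertools import product

# Published summary statistics for one hypothetical block of three
# residents (real tables are far richer: age x sex x race x block, etc.).
TOTAL_PEOPLE = 3
MEDIAN_AGE = 30
COUNT_OVER_18 = 2
COUNT_FEMALE = 1

ages = range(0, 100, 5)   # 5-year age bins keep the toy search small
sexes = ("M", "F")

# Brute-force every possible set of three (age, sex) records and keep
# only the combinations consistent with all of the published statistics.
solutions = []
for people in product(product(ages, sexes), repeat=TOTAL_PEOPLE):
    sorted_ages = sorted(age for age, _ in people)
    if sorted_ages[1] != MEDIAN_AGE:
        continue
    if sum(age >= 18 for age in sorted_ages) != COUNT_OVER_18:
        continue
    if sum(sex == "F" for _, sex in people) != COUNT_FEMALE:
        continue
    solutions.append(people)

print(f"{len(solutions)} candidate record sets fit the published tables")
# Add more cross-tabulations (race, ethnicity, household relationships,
# neighboring blocks) and the feasible set shrinks toward a single answer.
```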


What Abowd doesn’t really explain is who would do this or why or what significance it would have for my life. It’s not that it’s inconceivable to me that someone would do this database reconstruction. I’m sure they would, and then toss it up into the general maw of data brokerage that’s happening on the internet, allowing various algorithmic targeters to be a little more precise in their estimates of various people’s ethnicities.


But just as in the existing commercial use cases, it neither seems like you could make a ton of money doing this nor that people would particularly mind. Is Walgreens using its security camera footage paired with loyalty card swipes to put your face through image recognition software and determine your ethnicity? This thought has honestly never troubled me as I enter a store. I figure the answer is no, because it doesn’t seem like there’d be any value to doing this. But in terms of things that you “could do” if you were really desperate to learn about people’s ethnicity, it’s clearly doable. After all, your skin color is not a secret — it’s right there on your face.


Does Walgreens know I’m a man? Well, they know I buy men’s razors, which is probably more important for their purposes than the finer points of gender identity.


Now if you could take modern database reconstruction technology back in time to 1961 and use it on the 1960 Census, that would be valuable. You’d be obtaining information that was not broadly feasible to collect at that time. On the other hand, in 1961 every town in America had a White Pages book doxxing all of its residents, and they would drop that doxxing book at every house in town. It was wild. You could just sit in your living room and look up the home address and phone number of everyone in Manhattan! Bored kids would make crank calls to strangers.


But today we are living in a world where the marginal cost of one more actor being able to make strong inferences about the age, sex, and ethnicity of the person living at your address is extremely low. By contrast, the costs of Census inaccuracy are very real.


Bad Census data creates real problems

In some kind of high-level statistical sense, saying there are 48 people on an empty island known for its statue is just a funny quirk of an outlier Census tract.


But a team of statisticians and political scientists from Harvard notes that there are some flaws with the idea that the errors go away when you move up to higher-level geographies. One is that aggregation and error-reduction are only designed to work for what are called “on-spine” geographies. Census blocks aggregate up into block groups, block groups into tracts, and tracts into counties. If it works properly, you have less inaccuracy in the block groups, even less in the tracts, and by the time you zoom out to the counties it’s very accurate. But there are “off-spine” geographic categories, like Census Designated Places (a small town, basically) and, crucially, voting precincts, where the noise cancellation doesn’t work.
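

Here is a toy illustration of why the guarantee stops at the spine (again my own simplification, not the actual Disclosure Avoidance System): if the noisy block counts are post-processed so that each tract total comes out exactly right, then tract and county numbers are accurate by construction, but a voting precinct stitched together from pieces of several tracts still inherits the block-level noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical county: 4 tracts x 25 blocks each.
true_pop = rng.integers(0, 200, size=(4, 25)).astype(float)
noisy_pop = true_pop + rng.laplace(scale=10, size=true_pop.shape)

# Crude stand-in for post-processing: rescale blocks within each tract
# so that every (on-spine) tract total matches the truth exactly.
adjusted = noisy_pop * (true_pop.sum(axis=1, keepdims=True) /
                        noisy_pop.sum(axis=1, keepdims=True))

# On-spine geography: tract totals are exact by construction.
tract_error = np.abs(adjusted.sum(axis=1) - true_pop.sum(axis=1)).max()
print(f"worst tract-level error: {tract_error:.6f} people")  # ~0

# Off-spine geography: a precinct made of the first 5 blocks of each
# tract still carries real error, because nothing ever constrained it.
precinct_true = true_pop[:, :5].sum()
precinct_adj = adjusted[:, :5].sum()
print(f"precinct error: {abs(precinct_adj - precinct_true):.1f} people")
```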


Even worse!


Our analysis finds that not only is there more noise for VTDs, but also that there remains a particular form of previously undiscussed bias — perhaps an unintentional side-effect of the DAS post-processing procedure needing to satisfy accuracy constraints in on-spine geographies. We find that the DAS data systematically undercounts racially and politically diverse VTDs in comparison to more homogeneous VTDs. How these discrepancies add up into legislative districts clearly depends on the spatial adjacency of diverse and homogenous VTDs. But in some cases the bias does not cancel out. In Pennsylvania, the average Congressional district changes by only 400 or so people. But the majority-Black 3rd Congressional District gains around 2,000 people under the DAS-protected data, while the more diverse 2nd Congressional District, a Black-Hispanic coalition district, loses around 2,000 people.


Capps’ story discusses two specific instances of this: small towns losing their population (and potentially, therefore, resources) and Native American tribal areas losing their residents (and potentially representation in the state legislature).


Using demonstration data, for example, the Utah state legislature reported a loss of nearly 15,000 residents, according to the brief. Two small towns lost half their populations. And Census Bureau emails revealed an internal rift over the security measures.


And:


The National Congress of American Indians (NCAI), which represents American Indian and Alaska Native (AI/AN) tribal nations, has outlined its concerns about negative impacts on Native populations over the last two years. Most recently, the group took issue with internal agency communications revealed by the Alabama suit. Emails discussed a proposal to make counts in tribal areas “essentially invariant” by ensuring them a higher privacy-loss budget — meaning less noise for small blocks in tribal areas. One bureau official objected to this suggestion in an October 2020 email: “We cannot promise to do something that blatantly gives one racial group an advantage at the expense of all others.”


Small towns, by definition, do not have a lot of residents. The American Indian population is very small. Fudging the numbers around to overcount the size of some small towns and undercount the size of others, or to erroneously understate the number of Native Americans living on tribal lands while misattributing them to Liberty Island or whatever, could lead to meaningful misallocation of resources.


Now one hopes that in practice, the United States Postal Service will not shut down a post office based on a Census differential privacy algorithm that pretends a town has lost half its population. But that’s because one can sanity check this kind of thing against other data (has the volume of mail collapsed?), which is another way of underscoring the point that the actual privacy gain here is small. But there are situations in which the government has to use the official Census report, so we might end up disenfranchising Native voters, which is a very real cost.


Dumb, ineffective government is bad

My big concern about this is that because privacy advocates are a noisy minority but the general public places little value on privacy, we end up with a serious mismatch:


The private sector has very little respect for privacy, users hand over data quite willingly, and any serious legislative effort to curtail private data collection would be easily swatted down.


The public sector does not lobby on its own behalf, and consumers don’t pay explicit prices, so advocates are able to get the government to treat data privacy with a great deal of regard in its own conduct.


This is fine if you’re an ideological libertarian who cares mostly about making the state ineffective. It’s also okay if you’re just an all-around business type who enjoys rhetorically japing about how slow and clumsy the state is compared to big business. But it doesn’t actually generate much privacy. And it does generate a lot of state inefficacy in situations where a more knowing state would be useful.


Right now, for example, a lot of police manpower is used on dealing with people driving unsafely. Some jurisdictions have moved toward automated enforcement of speeding rules via cameras, which is good both because it catches more violators and because the camera can’t discriminate based on race. One could imagine a world in which we put the tech in cars to stop drunk people from operating them. It’s extremely dangerous for the police to shoot at moving cars, but it’s something they sometimes feel they have to do. In principle, we should be able to equip cops with the ability to brick cars remotely in a crisis rather than shooting at them — this would be much safer.


This anecdote about how the CDC wouldn’t allow researchers to perform Covid tests early in the pandemic on already-collected flu samples plays as a kind of “haha the CDC is dumb” story, but in this case, at least, the CDC wasn’t being dumb; they were just following real privacy rules. It’s just that the rules are dumb.


Back in the spring of 2020 when the country adopted a widespread pause on non-essential activity to try to get a grip on Covid, there was a thought that the way we would reopen would be with intensive contact-tracing methods. This worked pretty well in South Korea, where they took advantage of the fact that everybody carries a digital location tracking device with them everywhere they go all the time. But in America, we were stuck with half-assed opt-in tracing apps. In the red states, this ultimately resolved with the state governments basically doing nothing to control Covid. But in blue states, they spent months using fairly heavy-handed anti-Covid policies, and yet somehow would not use the location data for contact tracing.


Yet despite this privacy regime, the other day I got a push notification from Domino’s Pizza about local specials as I drove past the Ingram, Texas outlet.


Goals-based privacy

I would say that the right way to think about this is by first asking what the objective is here. The information in question would allow people to identify the age, sex, and ethnicity of the people who reside in any particular place. Then we should ask what kind of laws it would take, in general, to prevent someone from doing that. Then we could decide whether or not it makes sense to pass laws and regulations that prevent it in general. And if we do, then of course the Census should not create a loophole that undermines the otherwise tight privacy rules.


In this case, it seems pretty clear that nobody is actually interested in fighting for the goal. The issue instead is something more like “it would be embarrassing for the Census for someone to do a full database reconstruction because that would undermine the 72-year rule.”


I can sympathize with that, but it seems like you should try to address it in a specific way — it’s basically a public relations problem that ought to be handled on that level, complete with explanations of the (non-trivial) error rate inherent to these database reconstruction methods and the value to society of accurate Census data. Kneecapping the Census itself seems bad. And kneecapping the government in general so that everyone but the state can exploit modern technology, mostly to micro-target us for shoe ads rather than to save lives, seems really bad.

