Privacy challenges

Analysis: It’s surprisingly easy to identify individuals from credit-card metadata.

Larry Hardesty | MIT News Office

January 29, 2015

Yves-Alexandre de Montjoye

Photo: Bryce Vickmark

Image: Yves-Alexandre de Montjoye

In this week’s issue of the journal Science, MIT researchers report that just four fairly vague pieces of information — the dates and locations of four purchases — are enough to identify 90 percent of the people in a data set recording three months of credit-card transactions by 1.1 million users.

When the researchers also considered coarse-grained information about the prices of purchases, just three data points were enough to identify an even larger percentage of people in the data set. That means that someone with copies of just three of your recent receipts — or one receipt, one Instagram photo of you having coffee with friends, and one tweet about the phone you just bought — would have a 94 percent chance of extracting your credit card records from those of a million other people. This is true, the researchers say, even in cases where no one in the data set is identified by name, address, credit card number, or anything else that we typically think of as personal information.

The paper comes roughly two years after an earlier analysis of mobile-phone records that yielded very similar results.

“If we show it with a couple of data sets, then it’s more likely to be true in general,” says Yves-Alexandre de Montjoye, an MIT graduate student in media arts and sciences who is first author on both papers. “Honestly, I could imagine reasons why credit-card metadata would differ or would be equivalent to mobility data.”

De Montjoye is joined on the new paper by his advisor, Alex “Sandy” Pentland, the Toshiba Professor of Media Arts and Sciences; Vivek Singh, a former postdoc in Pentland’s group who is now an assistant professor at Rutgers University; and Laura Radaelli, a postdoc at Tel Aviv University.

The data set the researchers analyzed included the names and locations of the shops at which purchases took place, the days on which they took place, and the purchase amounts. Purchases made with the same credit card were all tagged with the same random identification number.

For each identification number (that is, each customer in the data set), the researchers selected purchases at random, then determined how many other customers’ purchase histories contained the same data points. In separate analyses, the researchers varied the number of data points per customer from two to five. Without price information, two data points were still sufficient to identify more than 40 percent of the people in the data set. At the other extreme, five points with price information were enough to identify almost everyone.
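
The matching procedure described above can be illustrated with a minimal Python sketch. It is not the researchers' code: the function name `estimate_unicity`, the representation of a purchase as a `(shop, day)` tuple, and the toy data are all assumptions made for illustration. For each customer it draws `k` purchases at random and checks whether any other customer's history contains the same points.

```python
import random
from collections import defaultdict


def estimate_unicity(histories, k, seed=0):
    """Return the fraction of customers singled out by k random purchases.

    histories maps a customer id to a set of (shop, day) tuples.
    Customers with fewer than k purchases are skipped.
    """
    rng = random.Random(seed)

    # Inverted index: data point -> customers whose history contains it.
    index = defaultdict(set)
    for customer, points in histories.items():
        for point in points:
            index[point].add(customer)

    eligible = 0
    unique = 0
    for customer, points in histories.items():
        if len(points) < k:
            continue
        eligible += 1
        sample = rng.sample(sorted(points), k)
        # Customers whose histories contain every sampled point.
        matches = set.intersection(*(index[p] for p in sample))
        if matches == {customer}:
            unique += 1
    return unique / eligible if eligible else 0.0


# Toy data, invented for illustration only.
toy = {
    "a": {("bakery", "2015-01-05"), ("cafe", "2015-01-06")},
    "b": {("bakery", "2015-01-05"), ("gym", "2015-01-06")},
    "c": {("bakery", "2015-01-05"), ("cafe", "2015-01-06")},
}
# "a" and "c" have identical histories, so only "b" can be singled out.
toy_unicity = estimate_unicity(toy, k=2)
```

On real data the result depends on which purchases are drawn, so an estimate of this kind would normally be averaged over many random draws.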

The researchers characterized price very coarsely, treating all prices that fell within a few fixed ranges as functionally equivalent. So, for instance, a purchase of $20 at some store on some day in one person’s history would count as a match with a purchase of $40 by someone else at the same store on the same day, since both purchases fell within the range $16 to $49. This was an attempt to represent the uncertainty of someone estimating purchase amounts from secondary information, such as an Instagram photo of the food on someone’s plate. The limits of each range were based on a fixed percentage of its median value: The range $16 to $49, for instance, is its median value ($32.50) plus or minus 50 percent, rounded to the nearest dollar.
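
The arithmetic behind the $16 to $49 range can be checked with a short Python sketch. The helper names `price_range` and `same_price_bin` are hypothetical, introduced only to illustrate the rule described above; how the researchers chose the full set of ranges is not specified here.

```python
def price_range(median_value, tolerance=0.5):
    """Return (low, high): the median value plus or minus the tolerance,
    rounded to the nearest dollar."""
    low = median_value * (1 - tolerance)
    high = median_value * (1 + tolerance)
    return round(low), round(high)


def same_price_bin(price_a, price_b, low, high):
    """Two prices match if both fall inside the same range."""
    return low <= price_a <= high and low <= price_b <= high


low, high = price_range(32.50)              # 16.25 and 48.75 round to 16 and 49
match = same_price_bin(20, 40, low, high)   # both prices fall within $16 to $49
```

Under this rule a $20 purchase and a $40 purchase are treated as the same data point, while a $20 purchase and a $60 purchase are not.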

Preserving anonymity in large data sets is a pressing concern because public and private entities alike see aggregated digital data as a source of novel insights. Retailers studying anonymized credit-card histories could certainly learn something about the tastes of their customers, but economists might also learn something about the relationship of, say, inflation or consumer spending to other economic factors.

So the MIT researchers also examined the effects of coarsening the data, that is, intentionally making it less precise in the hope of preserving privacy while still enabling useful analysis. Coarsening makes identifying individuals more difficult, but not at a very encouraging rate. Even if the data set characterized each purchase as having taken place sometime in the span of a week at one of 150 stores in the same general area, four purchases (with 50 percent uncertainty about price) would still be enough to identify more than 70 percent of users.
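
As a rough sketch of what coarsening means in practice, the Python snippet below replaces each shop with a broader area and each day with a week-long bucket. The function `coarsen`, the `shop_to_area` mapping, and the toy values are assumptions for illustration, not the researchers' implementation.

```python
def coarsen(point, shop_to_area, days_per_bucket=7):
    """Reduce the resolution of one (shop, day_index) data point.

    The shop is replaced by the area it belongs to, and the day index
    (days since the start of the data set) by a bucket of several days.
    """
    shop, day_index = point
    return (shop_to_area[shop], day_index // days_per_bucket)


# Hypothetical grouping of shops into general areas.
shop_to_area = {"bakery": "north", "cafe": "north", "gym": "south"}

# Two distinct purchases become indistinguishable after coarsening.
first = coarsen(("bakery", 3), shop_to_area)
second = coarsen(("cafe", 5), shop_to_area)
```

Because different purchases collapse into the same coarse point, more customers share each point and fewer can be singled out. The figures quoted above suggest that this protection grows only slowly as resolution drops.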

Nonetheless, de Montjoye and Pentland remain adamant that socially beneficial uses of big data should be pursued. “Sandy and I do really believe that this data has great potential and should be used,” de Montjoye says. “We, however, need to be aware and account for the risks of re-identification.”

In separate work, de Montjoye, Pentland, and other members of Pentland’s group have begun developing a system that would enable people to store the data generated by their mobile devices on secure servers of their own choosing. Researchers looking for useful patterns in aggregate data would send queries through the system, which would return only the pertinent data — such as, for instance, the average amount spent on gasoline during different time periods.
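
The idea of answering a query with an aggregate rather than raw records can be sketched in a few lines of Python. This is an illustration only and does not reflect the design of the group's actual system: the function `average_spend` and the record fields `category`, `day`, and `amount` are hypothetical.

```python
from statistics import mean


def average_spend(records, category, start_day, end_day):
    """Answer an aggregate query without returning individual records.

    records is a list of dicts with hypothetical keys "category",
    "day" (an integer day index), and "amount". Returns the average
    amount, or None if no record matches.
    """
    amounts = [
        r["amount"]
        for r in records
        if r["category"] == category and start_day <= r["day"] <= end_day
    ]
    return mean(amounts) if amounts else None


# Toy personal data store, invented for illustration.
records = [
    {"category": "gasoline", "day": 2, "amount": 30.0},
    {"category": "gasoline", "day": 9, "amount": 50.0},
    {"category": "groceries", "day": 3, "amount": 80.0},
]
answer = average_spend(records, "gasoline", start_day=0, end_day=13)
```

The caller receives a single number rather than the underlying purchases. A real system would also need to limit how many such queries can be combined, since repeated narrow queries can leak individual records.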

Press Mentions

New Scientist

A new study by MIT scientists has found that metadata provides enough information to identify consumers in anonymous data sets. Aviva Rutkin writes for New Scientist that “for 90 per cent of people, just four pieces of information about where they had gone on what day was enough to pick out which card record was theirs.”

Full story via New Scientist →

In this video, Robert Lee Hotz of The Wall Street Journal discusses how MIT researchers have found that individuals in an anonymous data set can be identified using just a few pieces of information about their shopping habits. “We're really being shadowed by our credit cards,” Lee Hotz explains.

PBS NewsHour

Rebecca Jacobson writes for the PBS NewsHour about how MIT researchers have found that individuals in anonymous data sets can be identified using just a few pieces of outside information. The researchers found that there is a “94 percent chance of tracking all of your purchases with three pieces of extra information.”

Full story via PBS NewsHour →

The Wall Street Journal

A new MIT study examining anonymous credit card data shows that individuals can be identified using just a few pieces of information, writes Wall Street Journal reporter Robert Lee Hotz. “This touches on the fundamental limit of anonymizing data,” explains Yves-Alexandre de Montjoye.

Full story via The Wall Street Journal →

New York Times

MIT researchers have found that anonymous individuals in a data set can be identified using a few pieces of information, reports Natasha Singer for The New York Times. “We ought to rethink and reformulate the way we think about data protection,” explains Yves-Alexandre de Montjoye.

Full story via New York Times →

Nature

MIT researchers were able to accurately identify individuals in an anonymous data set by looking at the date and location of four credit card transactions, reports Boer Deng for Nature. “Even when researchers only had estimates of time and location of a purchase to within a few days or neighbo[u]rhood blocks, they could still identify cardholders,” explains Deng.

Full story via Nature →

Associated Press

Seth Borenstein and Jack Gillum write for the Associated Press about how MIT researchers have found individuals can be identified by examining a few purchases from anonymous credit card data. "We are showing that the privacy we are told that we have isn't real," explains Pentland.

Full story via Associated Press →

Scientific American

In a piece for Scientific American, Larry Greenemeier writes about new MIT research showing how easy it is to identify individuals in anonymous data sets. “We have to think harder and reform how we approach data protection and go beyond anonymity, which is very difficult to achieve given the trail of information we all leave digitally,” says Yves-Alexandre de Montjoye.

Full story via Scientific American →
