Social Security Number Vulnerability Findings Relied on Supercomputing
Information available on the Internet can in certain cases be used to predict individual Social Security numbers (SSNs), posing a risk of identity theft that policy-makers and individuals should address. This finding, an unexpected consequence of the availability of public information in modern economies, was published in the Proceedings of the National Academy of Sciences (PNAS) and highlighted in the national media. It relied on computational resources of the TeraGrid, a National Science Foundation (NSF) cyberinfrastructure program. It would have been difficult, if not impossible, to obtain these findings without these publicly funded high-performance computing (HPC) resources, says one of the lead researchers, Alessandro Acquisti, a professor at Carnegie Mellon University.
About a year ago, at an important phase in the project, Acquisti, his colleague Ralph Gross, a post-doctoral researcher, and several graduate students who worked with them began using a large-scale parallel computing system at the Pittsburgh Supercomputing Center (PSC). “At that stage,” said Acquisti, “we had a rough idea of the results, but to go forward we had to try many different variations of the algorithms. It would have been incredibly difficult to do this, or taken much, much longer without access to this system.”
After first working with desktop computers, the researchers turned last year to a PSC system called Pople (named for Nobel laureate chemist John Pople of Carnegie Mellon). A Silicon Graphics Altix 4700, installed in March 2008, Pople has 768 cores (processors) and 1.5 terabytes of shared memory (all of the memory is accessible from each core). The SSN runs used up to 400 of Pople’s cores and 800 gigabytes of memory, a requirement large enough that Pople’s shared memory was very helpful to the project.
TeraGrid staff at PSC installed Octave, an open-source programming language largely compatible with MATLAB, and wrote a script to submit a large number of parallel Octave jobs simultaneously on Pople. This facilitated the Acquisti team’s interactive process, which involved doing many runs representing different states and computational strategies, checking and analyzing results and re-thinking before running more variations. PSC’s consulting, said Acquisti, was “extremely helpful.”
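The idea of launching one Octave job per variation can be illustrated with a minimal Python sketch. It is not the script PSC staff wrote: the Octave function name run_variant, the state codes, and the strategy labels are hypothetical placeholders, and the sketch only builds the command lines rather than submitting them, since the article does not describe how jobs were queued on Pople.

```python
from itertools import product


def build_octave_commands(states, strategies, function_name="run_variant"):
    """Return one Octave command line per (state, strategy) combination.

    function_name is a hypothetical Octave function; it stands in for
    whatever analysis routine each job would actually run.
    """
    commands = []
    for state, strategy in product(states, strategies):
        # Each job evaluates a single call and then exits.
        call = f"{function_name}('{state}', '{strategy}')"
        commands.append(["octave", "--quiet", "--eval", call])
    return commands


# Illustrative inputs only; the real runs covered many more variations.
commands = build_octave_commands(["DE", "PA"], ["strategy_a", "strategy_b"])
for cmd in commands:
    print(" ".join(cmd))
```

On a shared system, each of these command lines would typically be handed to a batch scheduler so that the jobs run side by side on separate cores.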
One fairly unassuming graphical figure in the PNAS paper, notes Acquisti, represents results of “more than 700,000 regressions over very large sets of data,” which to computational scientists gives a sense of the immense computational scope of the problem.
“This project,” said Sergiu Sanielevici, PSC director of scientific applications and user support, who also leads user support and services for the TeraGrid, “exemplifies how powerful systems like Pople can open doors to data-mining and data-centric research in fields not traditionally associated with HPC, such as the social sciences, and make it possible to get answers that would otherwise be impractical or impossible.” PSC supported this project through the NSF TeraGrid program, which allocates large-scale computing resources free to researchers at U.S. universities on a peer-review proposal basis.
Carnegie Mellon graduate students Jimin Lee, Ihn Aee Choi, Dhruv Deepan Mohindra, and Ioanis Alexander Biternas Wischnienski collaborated in this research with Acquisti and Gross and did much of the hands-on computational work.
Acquisti and Gross tested their prediction method using records from the Death Master File of people who died between 1973 and 2003. They could identify in a single attempt the first five digits for 44 percent of deceased individuals who were born after 1988 and for 7 percent of those born between 1973 and 1988. They were able to identify all nine digits for 8.5 percent of those individuals born after 1988 in fewer than 1,000 attempts. Their accuracy was considerably higher for smaller states and recent years of birth: for instance, they needed 10 or fewer attempts to predict all nine digits for one out of 20 SSNs issued in Delaware in 1996. Sensitive details of the prediction strategy were omitted from the article.
“If you can successfully identify all nine digits of an SSN in fewer than 10, 100 or even 1,000 attempts, that Social Security number is no more secure than a three-digit PIN,” the authors noted.
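The comparison to a PIN is a matter of counting guesses: a three-digit PIN has 1,000 possible values, so anything guessable within 1,000 attempts offers no more protection than that. The short Python sketch below works through the arithmetic for the attempt counts named in the quote. The nine-digit search space of one billion is nominal, since not every nine-digit combination is a valid SSN, and the function name is illustrative.

```python
def equivalent_pin_digits(attempts: int) -> int:
    """Smallest number of decimal digits d such that 10**d >= attempts."""
    digits = 0
    while 10 ** digits < attempts:
        digits += 1
    return digits


SSN_DIGITS = 9
NOMINAL_SSN_SPACE = 10 ** SSN_DIGITS  # upper bound on nine-digit values

for attempts in (10, 100, 1000):
    pin_digits = equivalent_pin_digits(attempts)
    # How many times smaller the search is than the nominal nine-digit space.
    reduction = NOMINAL_SSN_SPACE // attempts
    print(f"{attempts} attempts: like a {pin_digits}-digit PIN, "
          f"search space reduced {reduction:,}-fold")
```

For 1,000 attempts the equivalent PIN length is three digits, which is the case the authors single out.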
Further information about the research: http://www.ssnstudy.org