Study Shows DNA Database Not so Anonymous on Internet
On the Internet, nobody knows you're a dog -- but it is getting increasingly easy for someone to figure it out.
As more and more of our personal data -- and that of the people we know and are related to -- gets posted online, the anonymity promised by the remove of a computer screen gets more and more elusive, according to a new study out Thursday in the U.S. journal "Science."
That's what a team of scientists uncovered when they started playing Sherlock with a batch of genetic data posted online for researchers to use.
The data was anonymous: the participants' names were not published.
But using the information that was provided, including age and where they live, along with freely available Internet resources, the researchers were able to identify nearly 50 of the individuals in the genomic database.
"This is an important result that points out the potential for breaches of privacy in genomics studies," said Whitehead Fellow Yaniv Erlich, who led the research team.
Erlich's team started by analyzing certain genetic markers, called "short tandem repeats," on Y chromosomes (Y-STRs) that tend to be passed down from father to son.
Because surnames in U.S. culture are also passed from father to son, there is a strong link between these repeats and family names.
That information is used by genealogy web sites to help people find common ancestors and other family information. Men upload information about the Y-STRs they have to find others with similar ones -- leaving a publicly searchable database of Y-chromosome data linked to family trees.
By comparing the Y-STRs of the study participants with those in the genealogy databases, the researchers found the last names of a number of the participants. They estimate it would be possible to identify last names for about 12 percent of Caucasian males this way.
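The surname-matching step can be pictured with a small Python sketch. It is only an illustration of the general idea, not the researchers' actual method: the records, repeat counts, mismatch threshold, and the function names `marker_distance` and `guess_surname` are all made up for this example.

```python
from collections import Counter

# Hypothetical genealogy records: (surname, {Y-STR marker: repeat count}).
# The repeat counts are invented for illustration.
GENEALOGY_DB = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}),
]


def marker_distance(a, b):
    """Count shared markers whose repeat counts differ.

    Returns None when the two profiles have no markers in common.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(1 for marker in shared if a[marker] != b[marker])


def guess_surname(profile, db, max_mismatches=1):
    """Return the surname most often seen among close matches, or None."""
    votes = Counter()
    for surname, record in db:
        distance = marker_distance(profile, record)
        if distance is not None and distance <= max_mismatches:
            votes[surname] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    participant = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
    print(guess_surname(participant, GENEALOGY_DB))
```

A small mismatch allowance is used here because distant paternal relatives may differ at a marker or two; the threshold of one is an arbitrary choice for the sketch.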
Cross-referencing these with Internet record search engines, obituaries, genealogical websites, and public demographic data, the team was able to fully identify nearly 50 participants in the genomic study, including some female relatives.
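To see why a surname combined with age and place of residence narrows the field so quickly, consider the following Python sketch. The article does not spell out the team's exact procedure, so everything here is assumed for illustration: the `PublicRecord` fields, the `narrow_candidates` function, the one-year tolerance, and the sample people are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    """A hypothetical entry from a public records search."""
    name: str
    surname: str
    birth_year: int
    region: str


def narrow_candidates(records, surname, age, region, study_year, tolerance=1):
    """Keep records matching the surname, region, and approximate birth year."""
    approx_birth_year = study_year - age
    return [
        r for r in records
        if r.surname == surname
        and r.region == region
        and abs(r.birth_year - approx_birth_year) <= tolerance
    ]


if __name__ == "__main__":
    # Invented sample data, not real people.
    records = [
        PublicRecord("Alan Smith", "Smith", 1950, "Utah"),
        PublicRecord("Brian Smith", "Smith", 1975, "Utah"),
        PublicRecord("Carl Smith", "Smith", 1950, "Ohio"),
        PublicRecord("David Jones", "Jones", 1950, "Utah"),
    ]
    matches = narrow_candidates(records, "Smith", age=60, region="Utah", study_year=2010)
    for r in matches:
        print(r.name)
```

Each extra attribute removes most of the remaining candidates, which is why dropping a single field such as age from a database can make this kind of search noticeably harder.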
"We show that if, for example, your Uncle Dave submitted his DNA to a genetic genealogy database, you could be identified," said Melissa Gymrek, a member of the Erlich lab and first author of the Science paper.
"In fact, even your fourth cousin Patrick, whom you've never met, could identify you if his DNA is in the database, as long as he is paternally related to you."
When Erlich's team informed the U.S. National Institutes of Health (NIH) of what they had been able to do, the agency removed the participants' ages from the database, to make it more difficult to identify them.
Erlich said that he and his research team had nothing nefarious in mind when they did the research -- but someone else could have more sinister motives.
"Our aim is to better illuminate the current status of identifiability of genetic data," he said.
"More knowledge empowers participants to weigh the risks and benefits and make more informed decisions when considering whether to share their own data."
But he emphasized that genomic databases provide crucial information for researchers and scientific progress.
He said he hoped research like his would prompt better security measures to protect research subjects in the future.
Privacy of genetic information -- which can reveal predispositions to certain illnesses -- is a major concern among the scientific community and the larger public in the U.S., who fear such information could be misused by insurance companies or employers.