Downloadable, fully searchable you

Genome sequencing has come a long way since the Human Genome Project, but has privacy kept pace?

The human genome was first sequenced in 2003 for $2.7bn. Since then, our understanding of the genome has moved forward in leaps and bounds. The 1000 Genomes Project, which sequenced and made publicly available the genomes of over a thousand ethnically diverse individuals, has allowed us to map variation across human populations and explore the genomic determinants of health and disease. “We are entering a new age of discovery that will transform human health,” boasts the National Human Genome Research Institute.

As the cost of sequencing the genome (or the exome, the coding parts of a genome) drops dramatically, the procedure has become more popular with both medical researchers and direct-to-consumer companies. But this isn’t without issues: the interpretation, privacy and security of a genome all need to be discussed, both by scientists and by the public using these technologies.

Your genome is the underlying script that determines many of your traits: your hair, your eyes, whether you can roll your tongue or wiggle your ears, and your likelihood of developing a range of diseases. While geneticists are quick to remind people that their genome is not a set of immutable laws, that is at odds with the way most people think about it. It’s compelling to view your genome as the base code for who you are, but in reality it’s more like sheet music for an accomplished improvisational jazz pianist: sure, there are some things that show up in the piece, but there’s also a whole lot that doesn’t.

The cost of sequencing a human genome has plummeted in the last 15 years. National Human Genome Research Institute (public domain)

Many “genomics” services don’t actually investigate your whole genome. They might sequence the whole genome but then look only at genes thought to be relevant, or they might look only for known variations in your DNA. Melbourne Genomics only considers clinically relevant genes for its patients, and 23andMe, the direct-to-consumer genetic testing service, investigates a particular set of variations rather than sequencing the entire genome.

Consent in genome sequencing is also a big issue. We know so little about genomes that whether someone can give informed consent to possible future discoveries is a tricky question. Consent documents range from closely written texts reminiscent of the iTunes Terms of Service to the clear and simple webpage from DNA.LAND, with others falling somewhere in between. But however comprehensive the terms of consent for genome sequencing are, they are vital to the genome donor’s privacy and protection.

For genomic databases to best advance science and medicine, they need to be accessible to a range of researchers. This can be in specific databases like the Cancer Genome Atlas, which aims to help develop genomic markers for all cancer types, or in more general databases like the Personal Genomes Project. Open access databases, from which you can just download a genome, help this research by making it easy to access and analyse large amounts of genetic data without needing to collect it yourself.

While your genome might seem like meaningless code to the untrained eye, the risk of open access databases is that this code can be cracked. In 2008, David Craig showed that he could find out whether a person had contributed to a genome-wide association study from their DNA sample, even if only summary statistics were reported. The National Institutes of Health responded quickly, removing these data from public databases and requiring researchers to apply for access.
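To get a feel for why summary statistics alone can give someone away, here is a toy simulation in Python. It is not the published method, only a rough sketch of the general idea: if your DNA went into a pooled study, the pool's variant frequencies are nudged very slightly towards your own, and across thousands of sites those nudges add up. Everything here is an illustrative assumption, including the 5,000 simulated sites, the pool of 50 people, and the function names.

```python
import random

def simulate_person(freqs, rng):
    # Each person carries two copies per site; the value is the fraction
    # (0, 0.5 or 1) of those copies that are the tracked variant.
    return [((rng.random() < p) + (rng.random() < p)) / 2 for p in freqs]

def membership_score(person, pool_freqs, reference_freqs):
    # A positive total means the person sits closer to the study pool than
    # to the reference population, hinting that they contributed to the pool.
    return sum(abs(y - ref) - abs(y - pool)
               for y, pool, ref in zip(person, pool_freqs, reference_freqs))

rng = random.Random(0)  # fixed seed so the simulation is repeatable
num_sites, pool_size = 5000, 50

# Hypothetical population-wide variant frequencies (the "reference").
reference_freqs = [rng.uniform(0.1, 0.9) for _ in range(num_sites)]

# The study pool; only its per-site averages are "published".
pool = [simulate_person(reference_freqs, rng) for _ in range(pool_size)]
pool_freqs = [sum(site) / pool_size for site in zip(*pool)]

participant = pool[0]                                # someone in the study
outsider = simulate_person(reference_freqs, rng)     # someone who was not

participant_score = membership_score(participant, pool_freqs, reference_freqs)
outsider_score = membership_score(outsider, pool_freqs, reference_freqs)
print(f"participant: {participant_score:.1f}, outsider: {outsider_score:.1f}")
```

In this simulated setting the participant's score should come out clearly higher than the outsider's, even though no individual genome was ever published. Real studies are messier, but the sketch shows why averaging people together is not the same as hiding them.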

A printed version of the human genome, housed in the Wellcome Collection in London, spans over 100 volumes, each 1000 pages long. Richard Spalding/Flickr (CC BY-NC-ND 2.0)

In 2013, a team led by Assistant Professor Yaniv Erlich, a computational geneticist at the New York Genome Center, went one step further. They showed that you can take a publicly accessible genome and get pretty close to who it came from using information freely available on the internet. This has been done for both public genomes, such as those in the 1000 Genomes Project, and “anonymous” sperm donors. For genetic data to be most useful to researchers, information on family history, age and medical background needs to be included, necessarily making the genome easier to identify. Donors to the 1000 Genomes Project had been told that it would be difficult for anyone to determine which information came from a particular person when looking at the database, making Erlich’s finding quite controversial.

The problems with this are wide-ranging: a health or life insurance company could raise premiums based on your genome, an employer might choose not to keep you on if you have a high risk of disease, or your DNA could be synthesised and placed at the scene of a crime. While the United States has the Genetic Information Non-Discrimination Act (GINA), most other countries don’t have similar legislation, and GINA protects only against discrimination in health insurance and employment. In Australia, genetic discrimination is covered by the Disability Discrimination Act.

While it could be argued that nobody really wants to work for an employer who would discriminate on the basis of genetics, the larger issue is one of privacy. Using this technology, a stranger can potentially discover who you are, what you look like, or your risk of developing Alzheimer’s from genomic data that you either donated to science or left in a public place. A technique that uses genomics to aid police sketches has already been patented. Suddenly, only your thoughts can be definitively private, and even then, genomics is being used to determine your likely political leanings, among other things.

While various forms of computational encryption have been suggested for genomic data, the main problem is the sheer size of the genome – any method would require a lot of computing power to correctly encrypt and decrypt 6 billion bases. The favoured method for Erlich – who has a background in bank cybersecurity – is homomorphic encryption. This allows researchers to analyse genetic data without decrypting it, using tools developed to work with the encryption. A proof of concept study for this form of encryption was published in 2013, but the technique is still experimental and will need additional work before it can be used commercially.
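The idea of computing on data that stays encrypted can be shown with a small Python sketch. It uses the Paillier scheme, which supports only addition on encrypted values, so it is a simpler relative of the fully homomorphic methods proposed for genomes and is not the approach from the 2013 study. The primes, the variant counts and the function names are all assumptions chosen for illustration, and keys this small offer no real security.

```python
import math
import random

def generate_keys(p, q):
    """Build a toy Paillier key pair from two primes (far too small to be secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; needs Python 3.8+
    return n, (lam, mu)

def encrypt(n, message, rng):
    n_sq = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:  # the random blinding value must be coprime to n
        r = rng.randrange(1, n)
    # With the generator fixed at g = n + 1, g**message mod n**2 is 1 + message * n.
    return ((1 + message * n) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n) * mu % n

def add_encrypted(n, c1, c2):
    # Multiplying two ciphertexts adds the numbers hidden inside them.
    return (c1 * c2) % (n * n)

# random is fine for a demo; real code would use a cryptographic source.
rng = random.Random(42)
public_n, private_key = generate_keys(1009, 1013)

# Hypothetical data: copies of one variant (0, 1 or 2) carried by five donors.
variant_copies = [0, 1, 2, 1, 0]
encrypted_counts = [encrypt(public_n, c, rng) for c in variant_copies]

# A researcher totals the counts without ever seeing an individual value.
encrypted_total = encrypted_counts[0]
for ciphertext in encrypted_counts[1:]:
    encrypted_total = add_encrypted(public_n, encrypted_total, ciphertext)

total = decrypt(public_n, private_key, encrypted_total)
print(total)  # prints 4
```

Only the holder of the private key can read the final total, while whoever does the adding sees nothing but ciphertext. Scaling this kind of arithmetic up to billions of bases is where the computing cost mentioned above comes in.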

Most studies get around the problems of de-anonymisation with restricted access. To access genomic data you must be a researcher and agree to protect the confidentiality of data. Usually, a new access request for each database must be submitted; the European Bioinformatics Institute has data-access committees that review all applications to use their genomic databases. But this researcher-only approach isn’t favoured by everyone.

“Who decides who is a researcher?” said George Church, a professor of genetics at Harvard Medical School and founder of the Personal Genomes Project. “PhD students? College students? High school interns in NIH funded labs?

“If you promise to the people donating genomes that you will police this definition strictly and then it leaks into one of the unapproved categories, or perhaps even more broadly, like Wikileaks, then who is responsible?”

Church founded the Personal Genomes Project in 2005 on the idea that research is most effective with total open access. The project recognises that data escape and de-anonymisation of genomes are likely, and tries to communicate this fully to participants. By recognising and accepting re-identification, the project can include physical traits and healthcare data about subjects that other databases are forced to avoid.

George Church, pictured here in 2009, founded the Personal Genomes Project to give researchers ready access to genome data. PopTech/Flickr (CC BY-SA 2.0)

Another solution to the issue of donor anonymity is to only publish aggregated data, rather than individual genomes. This is the default approach used by the genome database DNA.LAND, although users can then decide what further details they wish to share.

“I think that we need to give options to participants and let them choose the best method that works for them,” said Erlich, DNA.LAND's founder.

Erlich himself doesn’t worry about being identified through his genome: “I put my whole genome with my name online,” he said. But control of data is paramount, as it will increase the trust between the public and researchers. Indeed, with this approach DNA.LAND has already been successful at obtaining raw data: when the project launched in October last year, 1,250 genomes were added on its first day, and the database recently passed the 10,000 genomes mark.

The terms of consent on DNA.LAND run to slightly more than a page of plain language. Clear communication with participants is incredibly important to Erlich and his team.

“We do tell them that we cannot completely de-identify genomes,” he said. “But we take some safeguards about sharing genetic data, such as sharing only summary statistics by default or letting users to decide if they want to share individual level results.”

The focus for both the Personal Genomes Project and DNA.LAND is to give data back to participants while benefiting the research community. These projects provide participants with real-time access to ongoing research about their own health and genomics.

Genomics databases put billions of genetic data from thousands of people at our fingertips. Jesper Dyhre Nielsen/Flickr (CC BY-NC-SA 2.0)

One major shortcoming of these projects is that genetic counselling isn’t provided, so individuals curious or worried about a particular result have to seek further help by themselves. The web-based tool Promethease will interpret your genomic data for just $5 and split the results into “bad news” and “good news”, but even that would benefit from the context and knowledge a physician or genetic counsellor can offer.

Genomic databases have much to offer us. By taking an unbiased snapshot of the genome, they enable us to pick up genes associated with disease that might not have been predicted. A lot of data are needed to conduct statistically powerful investigations into the causes of diseases that have some environmental and some genetic input, such as heart disease, diabetes, and cancers. By locking this information up or removing it entirely, we can delay or even prevent the discovery of causes that have a genetic basis.

Even so, while anyone who has come up against a paywall will see the value of open access, the inherently personal nature of a genome suggests that genomes are not the best place to start. We can gain some incredibly useful knowledge from studying genome sequences, but the risk and repercussions of revealing someone’s genetic code may not necessarily be worth that benefit. To guard against those risks, respect, trust and engagement of volunteers should all be closely interwoven in genomic studies.

Different institutes, companies and individuals take different sides on open access. As time goes on and we become more comfortable with our genetics, a firmer set of policies and laws will likely be developed. Whether this results in an open-access world or strict laws against genetic discrimination remains to be seen. Hopefully, it will be a combination of the two that keeps the interests of volunteers and scientific endeavour at its core.

Edited by Andrew Katsis and Ellie Michaelides