I gave you my sample, now where are my data?

The more data the better, right? As medical researchers and practitioners seize on the benefits of precision medicine, we risk drowning in our own data.

All those scientific data need to go somewhere.   German Climate Computing Center/Flickr  (CC BY-NC-ND 2.0)


You and I just took a codeine tablet. You felt nauseous, but I didn’t. Why? Our biochemical pathways that metabolise codeine may differ and produce different metabolites, some of which cause side effects. These pathways are determined by our genes.

Cue precision medicine. Instead of one-size-fits-all, precision medicine identifies what drug is best for you – in fact, it’s all about you. It takes into account your genetics, diet, lifestyle, illnesses and environment. You’ll receive an accurate diagnosis, experience fewer side effects and enjoy a more rapid recovery. As for medical researchers, it offers access to vast pools of health data, virtually on tap.

In simple terms, precision medicine looks for molecular markers (such as metabolites) in saliva, blood, plasma or urine using mass spectrometers, nuclear magnetic resonance and high-performance chromatography. Accurate results are delivered in real time.

Here’s my sample — how much data will you get?

Implementing precision medicine on a large, collaborative scale presents some major challenges, but some companies are already in the market. In the US, ACTx offers genomic profiling services ‒ just send them your saliva sample in the mail. The Institute for Precision Medicine in New York gathers genomic data from cancer patients to improve diagnosis and treatment.

In Australia, Perth’s Phenome Centre will focus specifically on infants and children when it opens later this year. Using metabolomics, clinicians can decide whether to prescribe antibiotics for a suspected infection in a matter of hours.

A simple saliva test can produce millions of genetic data about you.     Pelle Sten/Flickr   (CC BY 2.0)

Robert Trengove, director of the Separation Sciences and Metabolomics Laboratory at Murdoch University, is working to establish the Phenome Centre. He estimates the first year of operation will generate 50-100 terabytes (TB) of patient data, escalating to two petabytes (PB) per annum within 4-5 years. “After five years I expect we will build towards 3PB, but will also begin to include all metadata,” he said. The group will also add ‘omics’ information generated by related specialisations in genetics, genetic transcription, and protein, lipid and metabolite production.

This year the biomarker market is expected to approach US$47 billion. The volume of generated data in the USA alone could soon reach yottabyte territory. Storage demand at the European Genome-phenome Archive, used by many cancer research organisations to deposit and access data, has increased from 50 to 1700TB since 2011. 

With precision medicine’s foot firmly in the camp of big data, let’s look at some figures and challenges. Typically, a human genome has 3 billion base pairs, which need about 700MB of storage space. Only 0.1% varies between individuals, so storing just the roughly 3 million base pairs that differ is a more efficient option at 125MB. However, there is a risk in discarding raw data; they may be needed for context or future reference. There are also quality control and re-test requirements to consider.
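The storage figures above can be sanity-checked with some back-of-envelope arithmetic. The sketch below assumes a simple encoding of 2 bits per base (enough to distinguish A, C, G and T) and decimal megabytes; both assumptions, and every name in the code, are illustrative rather than taken from any particular genome file format. Under those assumptions a whole genome comes to 750MB (about 715 in binary megabytes), close to the quoted 700MB. The 125MB figure for 3 million variable base pairs works out to roughly 40 bytes each, which suggests that each stored variant carries more than the base itself, such as its position in the genome.

```python
# Back-of-envelope genome storage arithmetic.
# Assumptions (illustrative only): 2 bits per base, 1 MB = 10**6 bytes.

GENOME_BASE_PAIRS = 3_000_000_000  # typical human genome, as quoted in the text
VARIANT_FRACTION = 0.001           # 0.1% varies between individuals
BITS_PER_BASE = 2                  # A, C, G, T fit in 2 bits (assumed encoding)
BYTES_PER_MB = 10**6               # decimal megabytes (assumed)


def raw_genome_mb(base_pairs: int = GENOME_BASE_PAIRS) -> float:
    """Size in MB of a genome packed at BITS_PER_BASE bits per base."""
    return base_pairs * BITS_PER_BASE / 8 / BYTES_PER_MB


def variant_count(base_pairs: int = GENOME_BASE_PAIRS) -> int:
    """Number of base pairs expected to differ between individuals."""
    return round(base_pairs * VARIANT_FRACTION)


def implied_bytes_per_variant(total_mb: float = 125.0) -> float:
    """Bytes available per variant if all variants fit in total_mb."""
    return total_mb * BYTES_PER_MB / variant_count()


if __name__ == "__main__":
    print(f"Whole genome at 2 bits/base: {raw_genome_mb():.0f} MB")
    print(f"Variable base pairs: {variant_count():,}")
    print(f"Implied bytes per variant at 125 MB: {implied_bytes_per_variant():.1f}")
```

The gap between 2 bits for a bare base and tens of bytes per stored variant is one way to see why "just keep the differences" still needs careful format design.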

Keeping data up to standard

Biomarker detection and data management together form precision medicine. If either one is weak, precision medicine is a powerless tool. Data management in precision medicine must adapt to growing volume, different methodologies and data formats, and incessant technological evolution. Management systems have not been standardised across countries, or even institutions.

The High Performance Storage System at the US National Energy Research Scientific Computing Center can manage up to 37PB of scientific data.     Roy Kaltschmidt/Flickr   (CC BY-NC-ND 2.0)

Data originate from diverse precision medicine sources: research, commercial diagnostic laboratories, university-led research facilities, and electronic health records from medical practices and hospitals. Can disparate sources co-develop, implement and comply with common data standards? Who will manage and monitor them?

Standardisation must start in the clinic or research laboratory. If different tests are used for the same outcome then different margins of error may exist when all the data are combined, rendering their value questionable.

Doctors and researchers must respect conditions of patient consent when accessing and using information from pooled data. Are conditions in patient consent forms uniform? At the moment, they are not.

Andrew Currie, a senior lecturer in immunology at Murdoch University, is also involved in establishing Perth’s Phenome Centre. “The initial studies don’t have specific consent on permission to access the individual data per se because the research is population based and not personalised,” he said.

“However, as we move to more personalised use of the data (e.g. longitudinal measurement) with a better understanding of what specific patterns mean, I think we will have to consider such access”. 

Dr Currie stated that if multiple sites were used they would standardise consent forms as much as possible while still complying with ethical obligations.

Big data and big decisions

Precision medicine data present thousands of possible variations and outcomes, so decision-making becomes difficult and subjective without computers. Big data means more than “lots of data” – it means chewing through masses of complex information quickly and making real-time decisions.

Assoc Prof Trengove recognises that the Phenome Centre “will definitely require a supercomputer to do this. In addition, visualisation will become even more complex as we try not only to correlate metabolites and metabolite change with disease, but also all the other ‘omics’”.

Epidemiologists rely on supercomputers and data analysts to process millions of raw data points and correlate the results with individual status and disease incidence. Using multivariate statistical analyses, such as advanced pattern recognition techniques and principal component analysis, they can identify low-frequency but statistically significant findings, like rare conditions linked to particular minor genes or ethnic groups. For example, a side effect occurring in 1 in 10,000 patients is unlikely to be detected in a trial with 2000 participants, but can be pinpointed through big data.
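The 1-in-10,000 example can be made concrete with a short calculation. The sketch below assumes each patient independently has the same chance of the side effect, which is a simplification, and the one-million-record dataset is a hypothetical size chosen for illustration. It computes only the chance of observing at least one case, which is a weaker bar than establishing a statistically significant link.

```python
# Chance of seeing a rare side effect at least once in a group of patients.
# Assumption (simplified model): patients are independent and share one rate.

def prob_at_least_one(rate: float, n_patients: int) -> float:
    """P(at least one case) = 1 - P(no cases) = 1 - (1 - rate)^n."""
    return 1.0 - (1.0 - rate) ** n_patients


def expected_cases(rate: float, n_patients: int) -> float:
    """Average number of cases expected among n_patients."""
    return rate * n_patients


if __name__ == "__main__":
    rate = 1 / 10_000  # side effect frequency quoted in the text

    # 2000 is the trial size from the text; 1,000,000 is a hypothetical
    # pooled dataset used only for comparison.
    for n in (2_000, 1_000_000):
        print(
            f"n={n:>9,}: expected cases = {expected_cases(rate, n):.1f}, "
            f"P(at least one) = {prob_at_least_one(rate, n):.3f}"
        )
```

Under these assumptions the 2000-person trial expects 0.2 cases and has a little under a one-in-five chance of seeing even one, while the larger pool expects about 100.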

Decision support tools like IBM Watson can analyse and visually display connections in large numbers of data.     Jon Simon, Feature Photo Service for IBM/Flickr   (CC BY-NC-ND 2.0)

Decision support tools, such as artificial intelligence (AI), can then recommend a course of action to clinicians and researchers, like those at the Memorial Sloan Kettering Cancer Center in the US who use the IBM Watson system for interpreting oncology data.

Dr Currie foresees “a major analysis bottleneck” as the size and complexity of studies increase and more ‘omics’ datasets are added to clinical information. Pattern identification and testable hypotheses generated by AI will be reviewed by human analysts to provide context and emphasis. “Without this eventual capacity, we run the risk of drowning in data and information, at the expense of knowledge,” he said.

Although AI seems inevitable, nagging ethical and philosophical issues remain. How much autonomy and trust should we give AI? These and other concerns are expressed in an open letter by the Future of Life Institute. The letter is signed by experts from universities, industry (including Google, Facebook, Skype, Apple, DeepMind and IBM Watson), and individuals like Stephen Hawking and Elon Musk.

Are my data under lock and key?

Like financial and retail data, health data appeal to hackers. Vulnerability exists because lots of data are stored and accessed through multiple sites, using cloud and mobile technology. If hackers fail at one site, they will move on to another. On the black market, health data are worth more than stolen credit card details. Because their theft is harder to detect, health data can be on-sold several times before the breach is reported.

Security breaches are on the rise and tend to be harder to detect. Health professionals, researchers, hospitals and other institutions are slow to respond because their focus is elsewhere. With a lack of training and the vast volume of data involved, it’s actually hard to see what has been accessed.

The Pawsey Centre houses multiple powerful supercomputers that will work on precision medicine problems.      iVEC/Wikimedia Commons   (CC BY-SA 1.0)

Precision medicine data security is probably overlooked by institutions because money is more often spent on instruments instead. IT systems are infrequently updated and virus scanners are used instead of data activity trackers. One solution is to hire and encourage IT security professionals to collaborate and share intelligence about breaches, technology and threats with their industry peers.

Assoc Prof Trengove said that Perth’s Phenome Centre “will have a small team who will work with the PAWSEY Supercomputer facility in Perth”. This is similar to the approach taken by other research institutions around the world.

Another option is using genomic data companies like DNAnexus that offer laboratories complete off-site data analysis, interpretation and storage services for genetic studies. They comply with security and quality standards such as HIPAA, ISO 27001, CLIA and GLP, which are not usually the focus of research or medical centres.

* * *

Precision medicine is probably the future of diagnostic medicine ‒ actually, it’s already here.  However, for precision medicine to work on a large scale and to fully realise its potential, we must address data management. Specialisation may be the best approach; let data systems be managed by in-house IT security professionals and third party storage sites, while researchers and clinicians do their science. Everybody must also collaborate; there is no room for exclusivity. The goal of precision medicine is to benefit patients and scientific research.

How do we standardise, share and access precision medicine data? Who owns the data? How should data be used to develop new treatments, and by whom? Some of these questions were raised when genomic sequencing started in the 1990s, but even today, when analytical techniques have become slicker and cheaper, these key questions remain largely unanswered.

Let’s not let this exciting revolution in diagnostic methods eclipse the challenge of managing the not-so-sexy, but equally important, precision medicine data.

Edited by Andrew Katsis and Ellie Michaelides