By David Patterson
Professor of Computer Science
University of California, Berkeley
This ancient assassin, first identified by a pharaoh's physician, has been killing people for more than 4,600 years. As scientists found therapies for other lethal diseases--such as measles, influenza, and heart disease--cancer moved up this deadly list and will soon be #1; 40% of Americans will face cancer during their lifetimes, with half dying from it. Most of us ignore cancer until someone close is diagnosed, but instead society could zero in on this killer by recording massive data to discover better treatments before a loved one is in its crosshairs.
Cancer is unlimited cell growth caused by problems in DNA. Some people are born with error-prone DNA, and others acquire it later. When a cell divides, sometimes it miscopies a small amount of its DNA, and these errors can overwhelm a cell's defenses to cause cancer. Thus, you can get it without exposure to carcinogens. Cigarettes, radiation, asbestos, and so on simply increase the copy error rate. Speaking figuratively, every time a cell reproduces, we roll the dice on cancer, with such mutagens loading the dice to raise cancer's chances.
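The dice metaphor can be made concrete with a toy calculation (my own illustration; the probabilities and division count are made up for the sketch, not taken from the biology): each division carries a tiny chance of a dangerous miscopy, and a mutagen multiplies that per-division chance, which compounds dramatically over many divisions.

```python
# Toy model of "rolling the dice on cancer" -- illustrative numbers only.
BASE_ERROR_RATE = 1e-6   # assumed chance of a dangerous miscopy per division
MUTAGEN_FACTOR = 20      # assumed multiplier from carcinogen exposure
DIVISIONS = 1_000_000    # assumed number of divisions simulated

def chance_of_error(per_division_rate: float, divisions: int) -> float:
    """Probability of at least one dangerous miscopy across many divisions."""
    return 1 - (1 - per_division_rate) ** divisions

baseline = chance_of_error(BASE_ERROR_RATE, DIVISIONS)
exposed = chance_of_error(BASE_ERROR_RATE * MUTAGEN_FACTOR, DIVISIONS)
print(f"baseline: {baseline:.0%}, with mutagen: {exposed:.0%}")
```

Even a modest multiplier on the per-division rate pushes the cumulative probability toward certainty, which is the sense in which mutagens "load the dice."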
Most cancer studies today use partial genomic information and have fewer than 1,000 patients. One wonders whether their conclusions would still hold if they used complete genomes and increased the number of patients by factors of 10-100.
Given cancer's gravity and nature, shouldn't scientists be able to decode full genomes inexpensively to fight this dreaded disease in a better-informed way? Now they can! The plot below shows the dropping cost of sequencing a genome since 2001.
Moore's Law, which drives the information technology revolution, delivered a 100-fold improvement over those 15 years, yet the wet-lab cost of identifying a genome dropped 100,000-fold over the same period, to $1,000 per genome--considered by many the tipping point of affordability.
This graph should be a call to arms for computer scientists, as the war on cancer could require Big Data. If the 1.7 million Americans who will get cancer in 2016 were to have their healthy and tumor cells sequenced, it would yield one exabyte (10^18 bytes) of raw data. The UC Berkeley AMPLab--collaborating with Microsoft Research and UC Santa Cruz--joined the battle in 2011, announcing the effort with a New York Times essay. We have been championing cloud computing and open-source software development ever since, which is natural inside computer science but counterintuitive elsewhere.
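The exabyte figure follows from simple arithmetic. One plausible back-of-envelope version (the ~300 GB per-sample figure is my assumption for raw whole-genome read data, not a number from the essay):

```python
# Back-of-envelope estimate of the raw-data volume for sequencing
# every 2016 U.S. cancer patient twice (healthy cells + tumor cells).
PATIENTS = 1_700_000       # Americans projected to get cancer in 2016
SAMPLES_PER_PATIENT = 2    # one healthy sample, one tumor sample
BYTES_PER_SAMPLE = 300e9   # ~300 GB of raw reads per genome (assumed)

total_bytes = PATIENTS * SAMPLES_PER_PATIENT * BYTES_PER_SAMPLE
print(f"{total_bytes:.2e} bytes")  # 1.02e+18 bytes, about one exabyte
```

Under these assumptions the total lands almost exactly on 10^18 bytes, the exabyte cited above.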
The good news is that our collaboration developed software that has already helped save a life. A teenager went to medical specialists repeatedly and was eventually hospitalized for five weeks without a successful diagnosis. He was placed in a medically induced coma after developing brain seizures. In desperation, the doctors sent a spinal fluid sample to the University of California, San Francisco (UCSF) for genetic sequencing and analysis. Our program first filtered out the human portion of the DNA data, which was 99.98% of the original 3 million pieces of data, and then identified the pathogen from the remaining sequences. In just two days total, UCSF identified a rare infectious bacterium. Once the boy was treated with antibiotics, he awoke and was discharged. Although our software is only one part of this process, previously doctors had to guess the causative agent before testing for a contagious disease. Other hospitals and the Centers for Disease Control now use this procedure.
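The core idea--subtract the host, keep the pathogen--can be sketched in a few lines. This is a deliberately toy version (a set of made-up read strings standing in for a real indexed human reference genome), not the actual UCSF pipeline:

```python
# Toy sketch of host-read subtraction: discard any sequencing read that
# matches a reference set of human sequences, keep the rest for
# pathogen identification. Stand-in data, not a real genome index.
HUMAN_REFERENCE = {"ACGTACGT", "TTGACCAA", "GGCATGCA"}

def filter_host_reads(reads: list[str]) -> list[str]:
    """Return only the reads that do not match the human reference."""
    return [r for r in reads if r not in HUMAN_REFERENCE]

reads = ["ACGTACGT", "TTGACCAA", "CCGGAATT", "GGCATGCA"]
pathogen_candidates = filter_host_reads(reads)
print(pathogen_candidates)  # ['CCGGAATT'] -- only the non-human read survives
```

In the real setting the "reference" is an indexed genome and matching is approximate alignment rather than exact string lookup, but the filter-then-identify structure is the same: with 99.98% of reads removed as human, the tiny pathogen remainder becomes tractable to classify.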
The bad news is that genetic repositories are still a factor of 10-100 short of having enough cancer patients to draw statistically significant results. The reason to include so many patients is that cancer tumors are notoriously varied; most are unique, so it takes numerous samples to make real progress. Here are obstacles to collecting that valuable exabyte, despite the storage itself being affordable:
- Who would pay? Like the chicken-and-egg problem, we don't yet have conclusive data showing how genetic information leads to effective therapies for most cancers. Thus, despite lower costs, insurance companies won't pay for sequencing. Although many believe it would yield bountiful insights, we can't prove it.
- If funding were found, would the hospitals share data? Researchers write grants to pay for the sequencing and consider the data private at least until they publish; one editorial even labels outsiders who wish to study such data "research parasites." Hospitals may also consider genetic data a proprietary advantage, as it might attract patients and researchers.
- If hospitals were willing, would they be allowed to share data? While a cancer repository will likely lead to breakthroughs, medical ethicists worry more about patient privacy. Consequently, cancer studies regularly restrict data access to the official investigators of the research grant.
As Francis Collins, Director of the National Institutes of Health, said at the Davos meeting about accelerating progress on cancer: "We need that big data to be accessible. It's not enough to say that we are in a big data era for cancer. We also need to be in a big data access era."
To overcome such issues, the Global Alliance for Genomics and Health was founded in 2013 "to enable the responsible, voluntary, and secure sharing of genomic and clinical data." While 375 organizations have joined, and its working groups are active, progress has been slow. Perhaps the main impact thus far is that the community now largely believes that such data will eventually be shared.
To prepare for that revolutionary leap, we need to draft software experts immediately: people who can leverage advances in cloud computing and machine learning while protecting patient privacy, and who can start building the open-source tools that will enable scientists to make major inroads on cancer.
Recruitment should be easy, as there's no more inspiring endeavor than helping save the lives of your friends, your family, or yourself.