Last week, the large genome sciences consortium ENCODE (ENCyclopedia of DNA Elements) made a big splash by presenting its long-awaited results in a publishing extravaganza. This was a fantastic opportunity for scientists and science journalists to explain to the public some of the exciting and important research findings in genome biology that are changing how we think about health, disease, and our evolutionary past. But we blew it, in a big way.
If you read anything that emerged from the ENCODE media blitz, you were probably told some version of the "junk DNA is debunked" story. It goes like this: When scientists realized that classical, protein-encoding genes make up less than 2% of the human genome, they simply assumed, in a fit of hubris, that the rest of our DNA was useless junk. (You might have also heard this from your high school or college teacher. Your teacher was wrong.) Along came the ENCODE consortium, which found that, far from being useless, junk DNA is packed with functionality. And so everything scientists thought they knew about the genome was wrong, wrong, wrong.
The Washington Post headline read, "'Junk DNA' concept debunked by new analysis of human genome." The New York Times wrote that "The human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as 'junk' but that turn out to play critical roles in controlling how cells, organs and other tissues behave." Influenced by misleading press releases and statements by scientists, story after story suggested that debunking junk DNA was the main result of the ENCODE studies. These stories failed us all in three major ways: they distorted the science done before ENCODE, they obscured the real significance of the ENCODE project, and most crucially, they misled the public on how science really works.
What you should really know about the concept of junk DNA is that, first, it was not based on what scientists didn't know, but rather on what they did know about the genome; and second, that concept has held up quite well, even in light of the ENCODE results. Among the reasons that scientists in the 1970s and '80s began to believe that much of the genome is non-functional was the observation that very similar species could have very different genome sizes. There is no reason to believe that similar species require dramatically different amounts of functional DNA, and thus something other than functional requirements must explain differences in genome size. Scientists also discovered that our genomes contain parasitic, virus-like elements called "transposons" that have the ability to copy themselves within our cells. This DNA ecosystem makes our genomes more like a jungle than a precision machine. At the latest count, transposon-derived DNA makes up at least half of our genome. The transposon-derived sequences in our genomes do not have to be explained by invoking some useful function for them. There is no mystery here: this DNA is there because it can replicate.
The primary scientific task of the ENCODE group was to scope out the biochemical landscape of the genome, and put the resulting data out as a resource. The Human Genome Project gave us the text of our genome, but this text is essentially impossible to read without an interpretive guide of key biochemical landmarks. ENCODE, in what was a genuine technological tour de force, measured dozens of different kinds of biochemical landmarks, which can be suggestive of important functions, but do not by themselves demonstrate that a region of the genome is doing something useful for us. This distinction was obscured by the press releases put out by ENCODE, and largely lost on most of the reporters who covered the story. Missing from press releases and news reports was a description of what non-functional DNA looks like: it carries many of the same biochemical landmarks as functional DNA. The widely reported claim of debunked junk DNA is simply wrong.
The media reports on ENCODE used the word 'breakthrough,' but it is too early to fully measure the success of ENCODE, despite the high quality of the data. Ten years out, the reference human genome sequence is a must-have tool for nearly all biomedical researchers. Will the ENCODE results become equally indispensable to our efforts to understand the connection between our genomes and our health? ENCODE's results are very big, but not comprehensive: they don't include every type of cell or class of regulatory protein that we're interested in. As our genome technology improves (which it is doing at a rate that might put the iPhone to shame), we may decide that we need to re-do much of the work done by ENCODE. And many are worried that our funding agencies have become addicted to Big Science, prioritizing massive data generation efforts over the more idea-driven work of smaller, individual labs.
The most damaging aspect of our massive failure to get the ENCODE story right was that readers were served up a terrible distortion of the scientific process. A rule of thumb you should apply whenever reading about supposed breakthroughs is this: past scientists weren't as dumb or credulous as they're made out to be. Scientists tend to be a cautious and skeptical lot, not given to cooking up new theories based on a blithe and arrogant dismissal of what they don't understand. They work hard to base their ideas on the best data available at the time, and then they work hard to come up with even better data.