The first publications from the ENCODE project (Encyclopedia of DNA Elements) made a big splash at Evolution News in 2013, and around the world, because they undermined the “junk DNA” myth and simultaneously fulfilled an ID prediction: that non-coding parts of the genome would prove functional. Junk-DNA proponents like Dan Graur were upset at the time, admitting, as Jonathan Wells reported, “If ENCODE is right, evolution is wrong.”
Well, ModENCODE (ENCODE for model organisms) found “unprecedented complexity” in the fruit fly genome in 2014, then “ENCODE 2” followed up with more discoveries of function. Now, ENCODE 3 has just finished submitting its reports, with record numbers of DNA annotations listed, and ENCODE 4 is gearing up. Nothing like a little overkill to drive the point home: “… then evolution is wrong.” Look at how much constructive science is being done with the assumption that DNA elements are there for a purpose.
History and Purposes
Before introducing the latest results, Nature provides an overview, “Perspectives on ENCODE,” that recounts the history and purposes of the project:
The ENCODE Project was launched in 2003, as the first nearly complete human genome sequence was reported. At that time, our understanding of the human genome was limited. For example, although 5% of the genome was known to be under purifying selection in placental mammals, our knowledge of specific elements, particularly with regards to non-protein coding genes and regulatory regions, was restricted to a few well-studied loci.
ENCODE commenced as an ambitious effort to comprehensively annotate the elements in the human genome, such as genes, control elements, and transcript isoforms, and was later expanded to annotate the genomes of several model organisms. Mapping assays identified biochemical activities and thus candidate regulatory elements. [Emphasis added.]
Annotations are like labels or comments on things. For instance, if you have a stereo system with a lot of cables, you might affix tags on them to indicate where the TV plugs in, or where each speaker wire goes. In computer programming, wise programmers add comments in English to explain what a section of code does. Comments do not affect the function of the code, but help the next programmer follow the logic.
DNA is like a program, only it did not come with English comments! That is why ENCODE is important; the project is building a searchable database for researchers to find out what a string like ACCCTGTAAAGTG is doing. Is it a gene? Is it a control region? Medical researchers will want to see if a SNP (single nucleotide polymorphism) in that string correlates with a disease.
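To make the annotation idea concrete, here is a minimal sketch in Python of what "looking up a string" in an annotated sequence might involve. Everything in it is hypothetical and invented for illustration: the toy sequence, the two labels, and the names `Annotation` and `find_annotations` are assumptions, not ENCODE data and not the ENCODE browser's actual interface. The point is only that the labels sit alongside the sequence, like comments alongside code, without changing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # 0-based position where the labeled region begins (inclusive)
    end: int    # 0-based position where the labeled region ends (exclusive)
    label: str  # human-readable description of the region


# Hypothetical toy data, made up for this example only.
SEQUENCE = "TTGACAACCCTGTAAAGTGCCATGG"
ANNOTATIONS = [
    Annotation(0, 6, "candidate regulatory element"),
    Annotation(6, 19, "transcribed region"),
]


def find_annotations(sequence, annotations, query):
    """Return the labels of all annotations overlapping the first match of query."""
    pos = sequence.find(query)
    if pos == -1:
        return []  # the query string does not occur in the sequence
    query_end = pos + len(query)
    # Two half-open intervals overlap when each starts before the other ends.
    return [a.label for a in annotations if a.start < query_end and pos < a.end]


if __name__ == "__main__":
    print(find_annotations(SEQUENCE, ANNOTATIONS, "ACCCTGTAAAGTG"))
```

A real annotation database covers billions of positions and many overlapping kinds of labels, so it would use indexed interval structures rather than a linear scan, but the question being asked of it is the same.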
A Treasure Hunt
The project is sending scientists all over the world on a treasure hunt for functions in the “junk” pile assumed to exist from evolutionary history. The first results in 2013 indicated that over 80 percent of the genome was being transcribed. That was a strong clue that most non-coding regions were functional, even if the functions were unknown, because a cell would be unlikely to spend the energy transcribing nonsense. Indeed, many of those transcriptions turned out to be important regulatory regions.
The project began slowly but is accelerating.
Phase I (2003–2007) interrogated a specified 1% of the human genome in order to evaluate emerging technologies….
Phase II (2007–2012) introduced sequencing-based technologies (for example, chromatin immunoprecipitation with sequencing (ChIP–seq) and RNA sequencing (RNA-seq)) that interrogated the whole human genome and transcriptome. General assays such as transcript, open-chromatin and histone modification mapping were used on a wide variety of cell lines, while more specific assays, such as mapping transcription factor binding regions, were performed extensively on a smaller number of cell lines to provide detailed annotations on, and to investigate the relationships of, many regulatory proteins across the genome.
ENCODE Phase III
The findings of ENCODE are accelerating along with sequencing technologies. The latest Phase III reports were published July 29 in Nature. Some of the labs involved are describing what they found.
Cold Spring Harbor Laboratory includes a must-watch video:
It begins with a striking moment where Ewan Birney, senior ENCODE researcher at CSHL, opens a huge tome containing a complete printout of the human genome. He calls it “a big achievement in 2000, but it’s just a set of boring letters” that ENCODE is bringing to life. “These letters actually do something,” he says, “they mean something.” The goal is to find out what they mean — to learn their functions. Magdalena Skipper, the next speaker, says that the ENCODE Consortium “considers functional elements in very broad terms” — beyond genes to the regulatory elements, switches and even to parts where “we have no idea what they are doing,” Birney adds. This fits ID proponent Paul Nelson’s motto, “If something works, it’s not happening by accident.”
Not an Accident
So far, ENCODE has produced a “staggering” amount of raw data, hundreds of terabytes, in detailed form:
In Phase 3, researchers took advantage of the latest genetic technologies to glean data from biological specimens and deeply investigate the regulatory regions outside of genes, where most of the genome’s person-to-person variation lies. Their data identifies some 900,000 candidate regulatory elements from the human genome and more than 300,000 from the mouse, which can be explored through ENCODE’s new online browser.
Within the hundreds of cell types studied, ENCODE is helping scientists understand “why your liver cell is different from your kidney cell,” Birney says; the secrets will be found in the switches that turn genes on and off. “It’s really a first view of that complexity that generates a human being.”
Skipper says it was “striking” to find that they were able to assign a “biochemical function” to 80 percent of the genome: striking, because “not such a long time ago, we still considered that a vast proportion of the human genome was simply junk.” Birney comments, “It’s very hard to get over the density of information” in the genome. They found places that are “much more complex” than expected, and loci thought to be completely silent are actually “teeming with life, teeming with things going on; we still really don’t understand that.” Another surprise is that portions corresponding with disease are being found in non-coding parts of the genome.
“This encyclopedia is a living resource. It has a beginning but really no end. It will continue to be improved, and grown, as time goes on.”
MIT’s news release is titled, “Bringing RNA into genomics.” It describes the new technologies MIT used to identify candidate RNA transcripts and then determine their functions. And “function” is the key word in their work:
These RNA sequences do not get translated into proteins, but act in a variety of ways to control how much protein is made from protein-coding genes. The research team, which includes scientists from MIT and several other institutions, made use of RNA-binding proteins to help them locate and assign possible functions to tens of thousands of sequences of the genome.
The National Institutes of Health (NIH) describes the search for genetic “switches” that turn genes on and off in different cell types and in various stages of development. This is being determined for the mouse genome as well as the human genome.
“A key challenge in ENCODE is that different genes and functional regions are active in different cell types,” said Elise Feingold, Ph.D., scientific advisor for strategic implementation in the Division of Genome Sciences at NHGRI and a lead on ENCODE for the institute. “This means that we need to test a large and diverse number of biological samples to work towards a catalog of candidate functional elements in the genome.”
Significant progress has been made in characterizing protein-coding genes, which comprise less than 2% of the human genome. Researchers know much less about the remaining 98% of the genome, including how much and which parts of it perform other functions. ENCODE is helping to fill in this significant knowledge gap.
The human body is composed of trillions of cells, with thousands of types of cells. While all these cells share a common set of DNA instructions, the diverse cell types (e.g., heart, lung and brain) carry out distinct functions by using the information encoded in DNA differently. The DNA regions that act as switches to turn genes on or off, or tune the exact levels of gene activity, help drive the formation of distinct cell types in the body and govern their functioning in health and disease.
Nature’s own “News and Views” story, “Expanded ENCODE delivers invaluable genomic encyclopedia,” boasts that Phase III has generated “the most comprehensive catalogue yet” of the functional elements that regulate our genes.
In the current third phase of the project, the consortium moved from cell lines to cells taken directly from human and mouse tissues, providing a more biologically relevant encyclopedia. They also introduced assays to investigate the broader aspects of functional elements — for example, to characterize the elements embedded in RNAs or to analyse chromatin looping, which brings separate CREs into close proximity to enable gene regulation.
There are eight technical papers in the special issue of Nature, all open access. Start with “Expanded encyclopaedias of DNA elements in the human and mouse genomes” for the details, scan through the other papers, and marvel at the exciting discoveries from Phase III. Or you can read “ENCODE explained” for an overview with illustrations. It’s all part of the ENCODE Collection going back to 2012.
ENCODE Phase IV
What is next for ENCODE? The “Perspectives on ENCODE” article cited earlier explains why much more research is still needed to understand the vast volume of functional information in our DNA:
It is now apparent that elements that govern transcription, chromatin organization, splicing, and other key aspects of genome control and function are densely encoded in the human genome; however, despite the discovery of many new elements, the annotation of elements that are highly selective for particular cell types or states is lagging behind….
Thus, as part of ENCODE 4, considerable effort is being devoted to expanding the cell types and tissues analysed … as well as mapping the binding regions for many more transcription factors and RNA-binding proteins.
Further research may even find new functions for repetitive sequences, or for “silent” sequences that only get switched on under unique circumstances, or in particular cell types, or during certain stages of development. In short: the search is on for function in the junk!