
Test of Functional Sequence Space Shows Less Tolerance for Mutations than Expected


When you change a tire, you expect the holes on the wheel to line up with the lug bolts on the axle. But manufacturing defects can occur. The bolts might be slightly offset, or slightly longer or shorter than spec. A bolt or two might be missing. How many defects, and how large, will still permit you to continue down the road without being stranded?

Protein interactions are a more sophisticated instance of much the same problem. Proteins must interact with other proteins, often fitting like a lock and key. As with the tire example, there’s a bit of wiggle room that proteins can tolerate and still function. A new study shows that this wiggle room, in practice, is less than theoretically conceivable.

Proteins are made up of chains of amino acids, sequenced in such a way that the chain folds into a three-dimensional shape with "functional information" (the ability to do work or provide structure for the cell). One or more parts of that shape have to "fit" with other proteins. There’s a certain amount of error that can allow substitutions of amino acids (mutations) to be tolerated; that’s called "neutral evolution" or "neutral variation." It might be compared to typos in a paragraph that don’t prevent the reader from understanding the message. The reader might gloss over the typo "tfe" in a sentence like "Functional space is a subset of tfe sequence space," and still understand what was intended. Other substitutions, though, may render the meaning unintelligible, like gubset for subset, or spade for space.

Like our simple sentence says, the functional space of a chain of amino acids is a small subset of the possible sequences, and a very small one at that. Take a chain of N amino acid residues. Since there are 20 amino acids in proteins, there are 20^N possible sequences. For a relatively small protein of 100 amino acids, that’s 20^100 sequences, or roughly 1.3 × 10^130 possibilities. Only a tiny fraction of those are functional, if they fold at all.
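
The size of that number is easy to check directly. Here is a minimal Python sketch; the variable names are illustrative and do not come from the paper.

```python
from math import log10

AMINO_ACIDS = 20      # standard amino acids used in proteins
chain_length = 100    # residues in the example protein

# Python integers are arbitrary precision, so this is the exact value of 20^100.
sequence_space = AMINO_ACIDS ** chain_length

digits = len(str(sequence_space))               # number of decimal digits
exponent = chain_length * log10(AMINO_ACIDS)    # log10 of 20^100

print(f"20^{chain_length} has {digits} decimal digits (about 10^{exponent:.1f})")
```

The result has 131 decimal digits, which is to say it is on the order of 10^130.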

Homologous proteins from different organisms often show substantial variation yet are still functional. In proteins, contact sites that interact with partners are typically less forgiving than other sites. The protein might still interact if a hydrophobic amino acid residue at the active site is replaced by another hydrophobic residue; the function can break, however, if a very different amino acid is substituted.

"Degeneracy" is a measure of the tolerance for substitution in a sequence. Consider Morse code as an example of a non-degenerate code: each letter has one "and only one" sequence of dots and dashes for each letter of the alphabet. The genetic code, however, is degenerate: some amino acids can be coded by multiple triplet codons in the DNA. For example, the codons GGA, GGC, GGG, and GGU all code for glycine. There are functional reasons for this apparent laxness in the code, but that’s another story.

Now to the crux of the issue: How much tolerance for error is there in proteins? How degenerate is the protein code? As seen from our 20^100 example, it would be practically impossible to search sequence space for functional chains of amino acids. The only way to broach the question is by breaking it down into a tractable question testable by experiment. Take one specific protein interaction: How much substitution can it tolerate and remain functional?

Two researchers at MIT, Podgornaia and Laub, decided to tackle that question. Their work is published in Science. A summary by Guy Riddihough, "Exploring the limits of protein sequence space," in the same issue of Science, gives the main lesson learned:

Exploring the variability of individual functional proteins is complicated by the vast number of combinations of possible amino acid sequences. Podgornaia and Laub take on this challenge by analyzing four amino acids critical for the interaction between two signaling proteins in Escherichia coli. They build all the possible 160,000 variants of one of the two proteins and find that over 1650 are functional. Even though there can be very high variability in the composition of the interface between the two proteins, there are nonetheless strong context-dependent constraints for some amino acids, which suggests why many functional variants are not seen in nature. (Emphasis added.)

As stated, the researchers performed a very scaled-down experiment on two interacting proteins named PhoP and PhoQ. We don’t need to concern ourselves with what the two proteins do. We just need to ask how much substitution they can tolerate and still interact. Now that we know what "degeneracy" means, we need to add another term to our working vocabulary: epistasis. This refers to something like "unintended consequences" of making changes: the effect of one substitution depends on which residues sit at the other positions. Sure, you might substitute another hydrophobic residue for the one in the wild type, but will that change the spacing of the other nearby amino acids? Will it affect the folding of the protein? Will it break the function? A permissible substitution may not be practical in context.

The researchers basically found that these two properties, degeneracy and epistasis, work against each other. Their paper’s title is, "Pervasive degeneracy and epistasis in a protein-protein interface." Degeneracy allows for a fairly large amount of variation (if you can call 1% "large"). Sure enough, when they considered all the possible substitutions for "four key residues" in PhoQ (160,000 of them), they found 1659 that still worked. However, epistasis cut that number way down.
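
The figure of 160,000 is simply 20^4, the number of ways to fill four positions from 20 amino acids. Here is a minimal Python sketch that enumerates them and computes the functional fraction. The one-letter codes are the standard ones; the count of 1659 is the figure reported in the paper, and the variable names are illustrative.

```python
from itertools import product

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Every possible combination of residues at four positions.
variants = ["".join(combo) for combo in product(AMINO_ACIDS, repeat=4)]

functional_count = 1659                       # reported number of functional variants
fraction = functional_count / len(variants)   # share of variants that still work

print(len(variants))        # 160000
print(f"{fraction:.2%}")    # 1.04%
```

This only counts the combinations. It says nothing about which ones are functional, which is what the experiment had to determine.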

Our results reveal extensive degeneracy in the PhoQ-PhoP interface and epistasis, with the effect of individual substitutions often highly dependent on context. Together, epistasis and the genetic code create a pattern of connectivity of functional variants in sequence space that likely constrains PhoQ evolution. Consequently, the diversity of PhoQ orthologs is substantially lower than that of functional PhoQ variants.

In theory, therefore, you can find a lot of varieties (orthologs) that are functional. In practice, epistasis constrains that number substantially. That’s why only a small fraction of their 1659 functional varieties were actually detected in living E. coli.

One reason for the constraint is the mutational path to some of the functional orthologs. We need to recall that Darwinian evolution is blind to future goals; every substitution has to produce a functional advantage right now. The researchers found that getting to some of their theoretical functional orthologs by random mutation would require traversing through two or more non-functional intermediates. Those non-functional intermediates would likely be eliminated by selection before the next step or steps could be reached. The only way Darwinian evolution could "select" those variants would be for the two mutations to appear simultaneously. This approaches "the edge of evolution" that Michael Behe discusses at length in his book of that name: the probability for simultaneous random mutations rapidly drops off the edge of a cliff.

How many functional variants remain out of 1659 when epistasis is considered? To answer that question, we need to add another term to our working vocabulary: "Hamming distance." In information theory, the Hamming distance is the number of substitutions required to transform one string into another of equal length. The Hamming distance to transform 1011 to 1001 is 1. The Hamming distance to transform "cartop" into "carpet" is 3. You might be able to transform one string into another via a series of meaningful words, like gate into name via the series, gate, game, name. Darwinian evolution, however, cannot tolerate non-functional intermediates. The "path length" of functional intermediates cannot exceed the Hamming distance.
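
Both ideas, Hamming distance and path length through functional intermediates, can be sketched in a few lines of Python. This is a toy illustration using the word examples above, not the researchers' analysis; the function names and the set of "functional" words are assumptions made for the example. The path search is an ordinary breadth-first search in which each step changes exactly one character and must land on a member of the functional set.

```python
from collections import deque

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def shortest_path_length(start, goal, functional):
    """Fewest single-substitution steps from start to goal, stepping only
    onto members of `functional`. Returns None if no such path exists."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        word, steps = queue.popleft()
        if word == goal:
            return steps
        for other in functional:
            if other not in seen and hamming(word, other) == 1:
                seen.add(other)
                queue.append((other, steps + 1))
    return None

print(hamming("1011", "1001"))        # 1
print(hamming("cartop", "carpet"))    # 3

words = {"gate", "game", "name"}      # toy set of "functional" words
print(hamming("gate", "name"))                                   # 2
print(shortest_path_length("gate", "name", words))               # 2
print(shortest_path_length("gate", "name", {"gate", "name"}))    # None
```

With "game" available as a stepping stone, the path length equals the Hamming distance. Remove it, and there is no path of single substitutions at all.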

Now that we have reviewed these concepts, what do the MIT researchers tell us about the actual subset of functional orthologs of PhoQ out of the theoretical possibilities, after considering epistasis, Hamming distance and path length?

Shortest path lengths now exceeded Hamming distances for >97% of all connected variant pairs (fig. S8B). Together, the genetic code and epistasis severely constrain mutational paths in sequence space for PhoQ….

In general, the natural diversity in PhoQ orthologs (Fig. 4H and fig. S9C), even those with divergent PhoP partners, is much more limited than the diversity in our selected, functional variants [i.e., the 1659 orthologs]….

Collectively, our results suggest greater functional degeneracy for PhoQ than would be expected by site-saturation mutagenesis. However, the interconnectivity of functional variants, which results from epistasis and the structure of the genetic code, has likely limited nature’s exploration of sequence space, as reflected in the limited diversity of PhoQ orthologs (Fig. 4H).

Theoretically, then, one could imagine a fairly large number of variations in the four key residues of PhoQ that do not break functionality. Practically speaking, however, you can’t get to most of them naturally. For one thing, nature doesn’t even explore large areas of theoretical sequence space, because functional variants are interconnected. For another, epistasis, with its unintended consequences, interferes. Finally, natural selection can’t even get to most of the possible functional forms, because "shortest path lengths now exceeded Hamming distances for >97% of all connected variant pairs." But then, out of the 57 or so (3%) they considered practical, they found that only 13 could actually compete with the wild type.

Knowing the results, let’s look again at what they set out to test:

Protein-protein interactions drive the operation and function of cells. These interactions involve a molecular interface formed by a subset of amino acids from each protein. Interfacial residues often vary between orthologs, indicating some mutational tolerance or degeneracy, but such natural variability may not capture the full plasticity of interfaces. Thus, it remains unclear how many combinations of interface residues will support a given interaction and how these combinations are distributed and connected in sequence space (3) (fig. S1A). It is also unknown whether all functional variants can be reached through a series of mutations that retain function, or whether evolution is fundamentally constrained, limiting the natural diversity in orthologous proteins.

The test results are clear. We’ve dropped from 20^N variants in sequence space to 160,000 possible sequences for just four key residues, to 1659 functional variants of those, to 57 that are accessible to unguided natural processes, to 13 that could actually compete in the wild. In summary, functional space is a tiny, tiny fraction of sequence space. And "practical" functional space is another tiny fraction of theoretical functional space. We might put it this way: nature abhors the edge of evolution.

Seeing that the "wiggle room" for our spare tire on the axle is so small, we might return to the old, tried-and-true explanation that the car was designed after all.

Image by Booyabazooka (Based on Image:Rubiks cube.jpg) [GFDL or CC-BY-SA-3.0], via Wikimedia Commons.