Course notes for the December 6th 2009 lecture of the Genome Evolution Course

 

Itai Yanai

Department of Biology

Technion – Israel Institute of Technology

yanai@technion.ac.il

 

Evolution by Gene Duplication

 

How does a new function emerge in the genome? Two decades before the molecular structure of the gene was determined, the great biologist Haldane first proposed the mechanism of new gene function occurring after gene duplication: “A redundant duplicate of a gene may acquire divergent mutations and eventually emerge as a new gene.” Two decades after the discovery of the structure of DNA, the principle of duplication was listed by Kimura and Ohta as one of the five laws of molecular evolution. These five are:

1.      For each protein, the rate of evolution in terms of amino acid substitutions is approximately constant per year per site for various lines, as long as the function and tertiary structure of the molecule remain essentially unaltered.

2.      Functionally less important molecules or parts of molecules evolve (in terms of mutant substitutions) faster than more important ones.

3.      Those mutant substitutions that are less disruptive to the existing structure and function of the molecule (conservative substitutions) occur more frequently in evolution than more disruptive ones.

4.      Gene duplication must always precede the emergence of a gene having a new function.

5.      Selective elimination of definitely deleterious mutants and random fixation of selectively neutral or very slightly deleterious mutants occur far more frequently in evolution than positive Darwinian selection of definitely advantageous mutants.

The first is the molecular clock. The second and third are the functional aspects of the neutral theory. The fifth is the main tenet of the neutral theory, stating that most mutations that reach fixation are not adaptive. Kimura and Ohta originated all of the laws with the exception of the fourth, which is the product of Susumu Ohno. As we described in the previous lecture, Ohno proposed a major role for gene duplication in the evolution of new functions. In this lecture we will study this evolution at a level that was hidden from Ohno but which is now coming to light.

 

Preponderance of paralogs (gene duplicates) in genomes

 

We first ask whether gene duplicates are common. To do so we introduce the concept of a homologous gene family. Using sequence analysis methods such as BLAST, we identify the copies of a gene across and within genomes. Representing the copies as a profile, we can observe in which organisms the gene is found and in how many copies. Repeating this process for the many available microbial genomes, we find that paralogs are extremely frequent. In fact, out of 3,307 homologous gene families (examined in an earlier version of the COG database), 2,152 have duplicates in at least one of the genomes. Since paralogs are ubiquitous, we next ask by what principles they are organized.

 

For a given organism, say Bacillus subtilis, we ask how many genes appear in just a single copy, in two copies, in three, and so on. In this organism 47% of the genes are members of gene families of size two or greater. There are many families of each size up to around 20 genes, above which there are a few very large families. When we plot the number of genes of each family size, we find a linear relationship on log-log scales. Further, this relationship holds for all examined microbial genomes and was once referred to as the first law of genomics: that the family size distribution has a regular shape.
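A straight line on log-log axes corresponds to a power law, count ≈ c · size^slope. The following is a minimal sketch of how such a slope could be estimated by ordinary least squares on the logged values. The function name loglog_slope and the toy counts are assumptions for illustration only; the counts are generated from an exact power law and are not real B. subtilis data.

```python
import math

def loglog_slope(size_to_count):
    """Least-squares fit of log10(count) against log10(family size).

    Returns (slope, intercept). Sizes with a zero count are skipped
    because their logarithm is undefined.
    """
    points = [(math.log10(size), math.log10(count))
              for size, count in size_to_count.items() if count > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic illustration only (NOT measured data): counts that follow
# count = 1000 * size^-2 exactly, for family sizes 1 through 20.
toy_counts = {size: 1000.0 * size ** -2 for size in range(1, 21)}

slope, intercept = loglog_slope(toy_counts)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
```

With real family-size counts the points will scatter around the line, and the few very large families in the tail are usually the noisiest part of the fit.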

 

A further observation concerns larger genomes. With an increase in gene number we observe a larger fraction of gene duplicates. Further, in larger genomes the average size of gene families also increases. Together, these observations suggest that one way in which genomes grow is by gene duplication, which increases the sizes of existing gene families. However, which gene families a genome duplicates is not always the same. Comparing the sizes of specific gene families across genomes, we find a lot of variation. For example, there are 81 families of size three in B. subtilis which are found in only one copy in E. coli. Meanwhile, only 66 families are found in three copies in both organisms.

 

While genome growth can involve gene duplication events, genome reduction involves a decrease in gene family sizes. For example, M. leprae is closely related to M. tuberculosis. However, most of its gene families have been reduced in the short time period in which the leprosy bacillus has adopted its mainly intracellular lifestyle.

 

 

The fate of duplicate genes

 

What do we want to know about gene duplicates? Andreas Wagner nicely summarized our questions:

         At what rate do gene duplications occur?

         Once a gene is duplicated, what are the chances that the duplication becomes fixed in a population?

         How long does it take until such fixation?

         Do many duplicates evolve new functions?

         How long does it take until one of the duplicates suffers degenerative mutations and becomes silenced?

         Do the vast majority of gene duplicates become silenced?

 

One productive approach involves collecting sets of duplicates and computing their rates of synonymous and non-synonymous changes. This is a useful metric: if a pair of duplicates shows similar rates of synonymous and non-synonymous changes, it may be said to be evolving neutrally. If there are more synonymous than non-synonymous changes, this is a sign of purifying selection (since selection is purging those changes that alter the protein sequence). Rarer is positive selection, in which there are more non-synonymous than synonymous changes, suggesting that the changes contribute to the organism’s fitness.
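The three regimes described above can be sketched as a simple comparison of the two rates. This is only an illustration: the function name classify_selection and the tolerance of 0.2 around a ratio of 1 are assumptions made here, and a real analysis would replace the fixed tolerance with a statistical test of whether the ratio differs significantly from 1.

```python
def classify_selection(nonsyn_rate, syn_rate, tolerance=0.2):
    """Label a duplicate pair by its non-synonymous / synonymous ratio.

    tolerance is a hypothetical band around 1 within which the pair is
    called neutral; it stands in for a proper significance test.
    """
    if syn_rate <= 0:
        raise ValueError("synonymous rate must be positive")
    ratio = nonsyn_rate / syn_rate
    if ratio > 1 + tolerance:
        return "positive selection"
    if ratio < 1 - tolerance:
        return "purifying selection"
    return "neutral"

# Made-up example values, for illustration only.
print(classify_selection(0.02, 0.20))  # far fewer replacement than silent changes
print(classify_selection(0.10, 0.10))  # equal rates
print(classify_selection(0.30, 0.10))  # excess of replacement changes
```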

 

Examining the rates of synonymous and non-synonymous changes for all duplicates together provides a global view of the selective constraints on gene duplicates (the following discussion is based upon Lynch and Conery’s landmark 2000 Science paper). On a plane of synonymous (or silent, S) substitutions and non-synonymous (or replacement, R) substitutions, each point represents a single pair of gene duplicates. The open points denote pairs for which the ratio R/S is not significantly different from 1, and which therefore appear to be evolving neutrally. The scatter of points shows that many gene duplicates experience a phase of relaxed selection or even accelerated evolution at replacement sites.

 

The evidence for this early phase of relaxed selection is the scatter around the neutral expectation among the youngest duplicates (S < 0.05), a trend that is observed in six diverse genomes. Modeling the behavior of the R/S ratio suggests that its progressive decline reflects a gradual increase in the magnitude of selective constraint. Early in their evolutionary history, duplicate genes tend to be under moderate selective constraint, with the rate of amino acid substitution averaging about 43% of the neutral expectation. The efficiency of purifying selection subsequently increases about 10-fold, to the point at which only about 4% of amino acid–changing mutations are able to rise to fixation.

 

How often do gene duplicates survive? The number of synonymous substitutions should increase approximately linearly with time (according to the molecular clock). For each duplicate pair we can therefore estimate how long ago the duplication occurred, and from these estimates infer the relative age distribution of gene duplicates within a genome. If duplications occur at an approximately constant rate and if duplicates survive indefinitely, then each bin (representing a unit of time) will contain the same number of duplicates. However, across six genomes the youngest duplicates are by far the most numerous. This suggests that most duplicates do not survive for long.

 

For a better estimate, we zoom in on this distribution, which we will call a ‘survivorship curve’. We find an exponential decrease in the number of duplicates with time for those duplicates where at most a quarter of the synonymous sites have changed (S < 0.25). The rate of loss of gene duplicates can be estimated using the survivorship function N_S = N_0 e^{-dS}, where N_S is the number of duplicates surviving to divergence S, N_0 is the number at S = 0, and d is a loss parameter estimated by fitting to the data. For example, for C. elegans d was estimated as 7. From this we can compute the half-life of a duplicate, that is, the value of S at which half of the duplicates have been lost, as S = ln(2)/d = 0.099 for C. elegans. This translates to 3.2 million years using a mutation rate of 15.6 substitutions per silent site per billion years (S accumulates at twice this rate because both copies diverge). Further, we can calculate that after 13.7 million years 95% of the new duplications will be lost.
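The arithmetic behind the C. elegans figures can be checked with a short sketch. It assumes the exponential survivorship function N_S = N_0 e^{-dS} and assumes that the divergence S between two copies accumulates at twice the per-lineage silent rate; the function and constant names are chosen here for illustration.

```python
import math

# 15.6 substitutions per silent site per billion years, expressed per year.
SILENT_RATE_PER_YEAR = 15.6e-9

def s_at_fraction_lost(d, fraction_lost):
    """Divergence S at which the given fraction of duplicates is lost,
    assuming N_S = N_0 * exp(-d * S)."""
    return -math.log(1.0 - fraction_lost) / d

def s_to_years(s, rate=SILENT_RATE_PER_YEAR):
    """Convert divergence S to years. Assumes both copies accumulate
    silent changes, so S grows at 2 * rate per year."""
    return s / (2.0 * rate)

d_celegans = 7.0  # loss parameter quoted in the text for C. elegans

s_half = s_at_fraction_lost(d_celegans, 0.50)
s_95 = s_at_fraction_lost(d_celegans, 0.95)

print("half-life S:", round(s_half, 3))
print("half-life (million years):", round(s_to_years(s_half) / 1e6, 1))
print("95% lost (million years):", round(s_to_years(s_95) / 1e6, 1))
```

Under these assumptions the sketch gives a half-life of S ≈ 0.099, about 3.2 million years, and about 13.7 million years for 95% loss, in line with the values quoted above.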

 

To calculate the rate of gene duplication, we return to the survivorship plot and note that for Drosophila 10 pairs of duplicates have S < 0.01. How much time does S = 0.01 represent? Using again our mutation rate of 15.6 substitutions per silent site per billion years (with both copies accumulating changes), we find that this corresponds to about a third of a million years. Thus about 31 duplications are fixed per million years of Drosophila evolution. Normalizing by the total number of genes in the genome (13,601), we conclude that there are 0.0023 duplications per gene per million years in Drosophila.

For C. elegans there will be 0.0208 duplications per gene per million years. Thus overall, 50% of all genes are expected to duplicate and increase to high frequency at least once within a window of 35 to 350 million years, corresponding to a major role for gene duplication in the evolution of the genome.
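The Drosophila estimate can be reproduced with the short sketch below. Two assumptions are made here and are not stated in the notes: that S accumulates at twice the per-lineage silent rate, and that the time for 50% of genes to duplicate at least once follows from treating duplication as a constant-rate (Poisson) process, giving t = ln(2)/rate. The function names are illustrative.

```python
import math

# 15.6 substitutions per silent site per billion years, expressed per year.
SILENT_RATE_PER_YEAR = 15.6e-9

def duplication_rate_per_gene_per_my(n_young_pairs, s_cutoff, n_genes,
                                     rate=SILENT_RATE_PER_YEAR):
    """Duplications per gene per million years, estimated from the number
    of duplicate pairs younger than s_cutoff."""
    window_my = s_cutoff / (2.0 * rate) / 1e6   # time spanned by S < s_cutoff
    return n_young_pairs / window_my / n_genes

def my_until_half_duplicated(rate_per_gene_per_my):
    """Million years until half of all genes have duplicated at least once,
    assuming a constant-rate (Poisson) process: 1 - exp(-rate * t) = 0.5."""
    return math.log(2.0) / rate_per_gene_per_my

drosophila = duplication_rate_per_gene_per_my(10, 0.01, 13601)
print("Drosophila rate:", round(drosophila, 4))

# Rounded rates of about 0.02 and 0.002 per gene per million years
for r in (0.02, 0.002):
    print(r, "->", round(my_until_half_duplicated(r)), "million years")
```

Under these assumptions, rounded rates of about 0.02 and 0.002 duplications per gene per million years give roughly 35 and 350 million years, which is close to the window quoted above.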

 

An example of adaptive evolution

 

We now examine one specific gene duplication event and the functional diversification that followed as published by J. Zhang et al., Nat. Genet. (2002) 30:411–415. The Colobine douc langur is a leaf-eating primate in which leaves are fermented by symbiotic bacteria in the foregut. Colobines recover nutrients by breaking and digesting the bacteria with various enzymes, including pancreatic ribonuclease, RNASE1. RNASE1 has a close paralog in douc langur named RNASE1B – interestingly other primates only have one copy of this gene.

 

Examining the protein sequences of the primate pancreatic ribonucleases, we find that RNASE1B has a large number of changes relative to the other sequences. In other words, the molecular clock does not hold for nonsynonymous substitutions in RNASE1B. The substitutions are mostly in the mature peptide. Seven of the nine amino-acid substitutions in the mature peptide of RNASE1B involve charge changes. All of these charge changes make the protein more negative, lowering its net charge from 8.8 to 0.8.

 

Presumably due to foregut fermentation and related changes in digestive physiology, the pH in the small intestine of colobine monkeys has shifted to a range of 6–7. Notably, the optimal pH for douc langur RNASE1B was found to be 6.3, while for RNASE1 it is ~7.5. Further, RNASE1B does not exhibit the double-stranded RNA degradation capacity found in the RNASE1 of douc langur and other species. Thus it appears that adaptive evolution occurred in the RNASE1B gene of the douc langur, with changes to the coding sequence that adapted the protein to a new function.

 

Preservation of duplicates by subfunctionalization

 

In Ohno’s model, after a gene duplication is fixed in the population, one gene copy is redundant and free to accumulate substitutions at random. By chance, some of these substitutions may suit the protein encoded by the redundant copy to a new function, which it can subsequently assume. One main premise of Ohno’s model is that a gene can have only a single function, and thus a gene duplication can free genes to evolve new functions. However, the one-gene one-function model has since been shattered, and genetic pleiotropy is the rule. For example, Eric Davidson and colleagues have beautifully dissected the cis-regulatory domains of genes and revealed the myriad functions encoded in promoters, specifying expression at different times, in different places, and under different conditions. Thus it appears that genes may have many functions. How does this affect our model of gene function evolution post-duplication?

 

Restating the problem in terms of multiple independent regulatory sites, the options of nonfunctionalization and neofunctionalization play out as follows. After gene duplication, mutations accumulate in the regulatory sites of one of the duplicates. These are tolerated since the other duplicate also has these regulatory sites and can back up the functions. However, a null mutation in the coding region will result in nonfunctionalization. Alternatively, neofunctionalization may occur when the coding sequence is altered, as in RNASE1B discussed above, or when a new regulatory region emerges.

 

However there is a third possibility: the duplicates will experience losses of their different subfunctions by degenerative (null) mutations. Combined action of both gene copies is necessary to fulfill requirements of the ancestral gene. Thus the gene duplication led to a partitioning of the ancestral functions, a process called subfunctionalization. Notably this process does not require the agency of selection but rather occurs only by neutral processes.

 

What is the probability that subfunctionalization occurs? Let there be z independently mutable subfunctions (all essential). The subfunctions all mutate at an identical rate, u_r, to alleles lacking the relevant subfunction. Let u_c be the rate at which null mutations arise in the coding region. The null mutation rate for the locus is then u_c + z u_r per gene copy. Using this formalism we can calculate the probability of each of the possible scenarios leading to subfunctionalization: 1. the first two null regulatory mutations occur on different gene copies; 2. the first two mutations occur on the same gene copy but the third is on the second copy; and 3. the first three occur on the same copy, followed by the fourth mutation on the second copy. Each scenario leads to subfunctionalization, and it is evident that the greater the number of regulatory regions, the greater the probability of subfunctionalization and consequently of the preservation of the gene duplicate.

                                                        

For example, if there are 5 subfunctions and the mutation rate per subfunction is 10% of the coding-region null rate (u_c = 10 u_r), then the probability of subfunctionalization is about 0.1. If the mutation rate per subfunction is 30% of the null rate, the probability of subfunctionalization is about 0.3. Force et al. proposed this model of subfunctionalization and suggested as one example the tissue-specific pattern of expression of the engrailed genes in zebrafish. Of course the model is overly simplified, as regulatory regions may include overlapping and embedded regulatory elements.
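The scenarios listed above can be summed in a few lines of code. The sketch below is a simplified sequential-mutation model written for these notes, not necessarily the exact expression used by Force et al. It assumes that null mutations fix one at a time, that a first coding null ends in nonfunctionalization, and that once one copy has lost a subfunction the other copy can no longer tolerate a coding null or the loss of that same subfunction. The function name is illustrative.

```python
def prob_subfunctionalization(z, u_r, u_c):
    """Probability that a duplicate pair is preserved by complementary
    loss of subfunctions, under the simplified model described above.

    z   : number of independent, essential subfunctions
    u_r : null mutation rate per regulatory subfunction
    u_c : null mutation rate of the coding region
    """
    # The first fixed null must hit a regulatory region, not the coding region.
    p_path = (z * u_r) / (u_c + z * u_r)
    total = 0.0
    # 'lost' = subfunctions already lost by the degenerating copy.
    for lost in range(1, z):
        remaining = z - lost
        # Permissible next events: coding null in the degenerating copy (u_c),
        # another regulatory loss in that copy, or a complementary regulatory
        # loss in the other copy (each remaining * u_r).
        denom = u_c + 2 * remaining * u_r
        p_step = remaining * u_r / denom
        total += p_path * p_step   # other copy loses a complementary subfunction
        p_path *= p_step           # same copy loses yet another subfunction
    return total

# Rates are relative to u_c = 1; z = 5 as in the example in the text.
print(round(prob_subfunctionalization(5, 0.1, 1.0), 2))
print(round(prob_subfunctionalization(5, 0.3, 1.0), 2))
```

Under these assumptions the two cases come out at roughly 0.09 and 0.30, close to the figures quoted in the text, and the probability rises as z increases.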

 

The consequences of gene duplications

 

The consequences of gene duplication may also include speciation. Upon duplication, either the ancestral or the descendant copy can be silenced. In geographically isolated populations, different copies might be silenced, thus passively giving rise to a small-scale chromosomal rearrangement. Such rearrangements will lead to nonfunctional hybrids and thus to reproductively isolated species.