Course notes for the December 6th 2009
lecture of the Genome Evolution Course
Itai Yanai
Department of Biology
Technion – Israel Institute of Technology
yanai@technion.ac.il
Evolution by Gene Duplication
How does a new function emerge in the genome? Two decades before
the molecular structure of the gene was determined, the great biologist Haldane
first proposed the mechanism of new gene function occurring after gene
duplication: “A redundant duplicate of a gene may acquire divergent mutations
and eventually emerge as a new gene.” Two decades after the discovery of the
structure of DNA, the principle of duplication was listed by Kimura and Ohta as one of the five laws of molecular evolution. These
five are:
1. For each protein, the rate of evolution in terms of amino acid substitutions is
approximately constant per year per site for various lines, as long as the
function and tertiary structure of the molecule remain essentially unaltered.
2. Functionally less important molecules or parts of molecules evolve (in terms of mutant
substitutions) faster than more important ones.
3. Those mutant substitutions that are less disruptive to the existing structure and function
of the molecule (conservative substitutions) occur more frequently in evolution
than more disruptive ones.
4. Gene duplication must always precede the emergence of a gene having a new function.
5. Selective elimination of definitely deleterious mutants and random fixation of selectively
neutral or very slightly deleterious mutants occur far more frequently in
evolution than positive Darwinian selection of definitely advantageous mutants.
The first is the molecular clock. The second and third are the
functional aspects of the neutral theory. The fifth is the main tenet of the
neutral mutation theory, stating that most mutations are not adaptive. Kimura
and Ohta originated all of the laws with the
exception of the fourth, which is the product of Susumu Ohno.
As we described in the previous lecture, Ohno
proposed a major role for gene duplication in the evolution of new functions.
In this lecture we will study this evolution at a level that was hidden from Ohno but which is now coming to light.
Preponderance of paralogs (gene duplicates) in genomes
We will first ask whether gene duplicates are commonly observed.
For this we introduce the concept of a homologous gene family. Using sequence
analysis methods such as BLAST, we identify the copies of a gene across and
within genomes. Representing the copies as a profile, we can observe in which
organisms the gene is found and in how many copies. Repeating this process
for the many available microbial genomes, we find that paralogs
are extremely frequent. In fact, out of 3,307 homologous gene families (examined in an earlier version of the COG
database), 2,152 have duplicates in at least one of the
genomes. Since paralogs are ubiquitous, we next ask by
what principles they are organized.
For a given organism, say Bacillus subtilis,
we ask how many genes appear in just a single copy, two copies, three copies, and so on. In
this organism 47% of the genes are members of gene families of size two or
greater. There are many families of each size up to around 20 genes, above which
there are a few very large families. When we plot the number of genes of each
family size, we find a linear relationship on a log-log scale. Further, this
relationship holds for all examined microbial genomes and was once referred to as
the first law of genomics: that the family size distribution has a regular
shape.
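The family-size tally described above can be sketched in a few lines of Python. This is only an illustration: the function names, the gene-to-family mapping, and the toy gene and family identifiers are hypothetical and not taken from the COG database, and the slope is a plain least-squares fit in log-log space rather than whatever fitting procedure was used in the original analysis.

```python
from collections import Counter
import math

def genes_per_family_size(gene_to_family):
    """Map family size -> number of genes belonging to families of that size."""
    family_sizes = Counter(gene_to_family.values())      # family -> member count
    genes_by_size = Counter()
    for size in family_sizes.values():
        genes_by_size[size] += size                      # each family adds `size` genes
    return genes_by_size

def fraction_in_multigene_families(gene_to_family):
    """Fraction of genes that belong to a family of size two or greater."""
    family_sizes = Counter(gene_to_family.values())
    in_multi = sum(1 for fam in gene_to_family.values() if family_sizes[fam] >= 2)
    return in_multi / len(gene_to_family)

def loglog_slope(size_to_count):
    """Least-squares slope of log(count) against log(size)."""
    xs = [math.log(size) for size in size_to_count]
    ys = [math.log(count) for count in size_to_count.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical toy genome: six genes assigned to three families.
toy_genome = {"g1": "famA", "g2": "famA", "g3": "famB",
              "g4": "famC", "g5": "famC", "g6": "famC"}
toy_distribution = genes_per_family_size(toy_genome)
toy_fraction = fraction_in_multigene_families(toy_genome)
```

A straight line on the log-log plot corresponds to a roughly constant value of loglog_slope across the range of family sizes; on real genomes the largest families are sparse, so the fit is usually dominated by the small sizes.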
A further observation concerns larger genomes. With an increase in
gene number we observe a larger fraction of gene duplicates. Further, in larger
genomes the average size of gene families also increases. Together, these
observations suggest that one way in which genomes grow is by gene duplication,
which increases the sizes of existing gene families. However, which gene
families expand differs from genome to genome. Comparing the sizes of
specific gene families across genomes, we find a great deal of variation. For example,
there are 81 families of size three in B. subtilis
that are found in only one copy in E. coli. Meanwhile, only 66 families
are found in three copies in both organisms.
While genome growth can involve gene duplication events, genome
reduction involves a decrease in gene family sizes. M. leprae
is closely related to M. tuberculosis, yet most of its gene families
have been simplified in the short time since the leprosy bacillus
adopted its mainly intracellular lifestyle.
The fate of duplicate genes
What do we want to know about gene duplicates? Andreas Wagner
nicely summarized our questions:
• At what rate do gene duplications occur?
• Once a gene is duplicated, what are the chances that the duplication becomes fixed in a
population?
• How long does it take until such fixation?
• Do many duplicates evolve new functions?
• How long does it take until one of the duplicates suffers degenerative mutations and becomes
silenced?
• Do the vast majority of gene duplicates become silenced?
One productive approach involves collecting sets of duplicates and
computing their rates of synonymous and non-synonymous changes. This is a
useful metric because duplicates with similar rates of synonymous and
non-synonymous changes may be said to be evolving neutrally. If there are
more synonymous than non-synonymous changes, this is a sign of purifying
selection (since selection is purging those changes that alter the protein
sequence). Rarer is positive selection, where non-synonymous changes
outnumber synonymous ones, suggesting that the changes contribute to the
organism’s fitness.
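The three regimes just described can be written down as a simple decision rule. The sketch below is illustrative only: the function name and the tolerance band around a ratio of 1 are assumptions made here, whereas a real analysis would ask whether the ratio differs significantly from 1 using a statistical test.

```python
def selection_regime(nonsynonymous_rate, synonymous_rate, tolerance=0.1):
    """Classify a duplicate pair by the ratio of non-synonymous to synonymous change.

    `tolerance` is a hypothetical band around 1 standing in for a proper
    significance test of whether the ratio differs from the neutral expectation.
    """
    if synonymous_rate <= 0:
        raise ValueError("synonymous rate must be positive to form the ratio")
    ratio = nonsynonymous_rate / synonymous_rate
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "purifying" if ratio < 1.0 else "positive"

# Illustrative values, not measurements from any genome.
regime_constrained = selection_regime(0.02, 0.20)   # far fewer replacement changes
regime_neutral = selection_regime(0.10, 0.10)       # equal rates
regime_adaptive = selection_regime(0.30, 0.10)      # excess of replacement changes
```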
Examining the rates of synonymous and non-synonymous changes for
all duplicates together provides a global view of the selective constraints on
gene duplicates [The following discussion is based upon Lynch and Conery’s landmark 2000 Science paper]. On a plane of
synonymous (or silent, S) substitutions and non-synonymous (or replacement, R)
substitutions, each point represents a single pair of gene duplicates. The open
points denote genes for which the ratio of R/S is not significantly different
from 1, and appear to be evolving neutrally. The scatter of points shows that
many gene duplicates experience a phase of relaxed selection or even
accelerated evolution at replacement sites.
This relaxation is evidenced by the scatter around the neutral expectation for the
younger duplicates (S < 0.05), a trend observed in six diverse genomes.
Modeling the behavior of the R/S ratio suggests that its progressive
decline reflects a gradual increase in the magnitude of selective
constraint. Early in their evolutionary history, duplicate genes tend to be
under moderate selective constraint, with the rate of amino acid substitution
averaging about 43% of the neutral expectation. The efficiency of purifying
selection subsequently increases roughly 10-fold, to the point at which only about 4%
of amino acid–changing mutations are able to rise to fixation.
How often do gene duplicates survive? Synonymous substitutions
should accumulate approximately linearly with time (according to the molecular
clock), so for each duplicate pair we can estimate how long ago the duplication
occurred and thereby infer the relative age distribution of gene duplicates
within a genome. If duplications occur at an approximately constant rate and if
duplicates survive indefinitely, then each bin (representing a unit of time)
will hold the same number of duplicates. However, in all six genomes the
youngest duplicates are the most abundant of all
duplicates. This suggests that most duplicates do not
survive over time.
For a better estimate, we zoom in on this distribution, which we
will call a “survivorship curve”. We find an exponential decrease in the number
of duplicates with time for those duplicates where at most a quarter of the
synonymous sites have changed (S < 0.25). The rate of loss of gene duplicates
can be estimated with the survivorship function N_S = N_0 e^{-dS}, where N_0 is the
initial number of duplicates, N_S is the number remaining at divergence S, and d is a
loss parameter estimated by fitting to the data. For example, for C. elegans d was
estimated as 7. The half-life of duplicates, that is, the value of S at which half
have been lost, is then S = ln 2 / d = 0.099, which translates to 3.2 million years
using a rate of 15.6 substitutions per silent site per billion years (the two copies
diverge from each other at twice this rate). Further, we can calculate that
after 13.7 million years 95% of the new duplications will be lost.
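The C. elegans figures can be reproduced with a short calculation. The sketch below reads S = 0.099 as the half-life of duplicates (the divergence at which half are lost) and assumes that both copies accumulate silent changes, so that S = 2 × rate × time. That factor of two is an assumption made here because it recovers the 3.2 and 13.7 million year figures; the function and variable names are illustrative.

```python
import math

# 15.6 substitutions per silent site per billion years, as quoted in the notes.
SILENT_RATE_PER_SITE_PER_YEAR = 15.6e-9

def divergence_at_fraction_lost(d, fraction_lost):
    """Solve N_S / N_0 = exp(-d * S) for the S at which `fraction_lost` is gone."""
    return -math.log(1.0 - fraction_lost) / d

def divergence_to_million_years(s, rate=SILENT_RATE_PER_SITE_PER_YEAR):
    """Convert silent-site divergence S between two copies into elapsed time.

    Assumption: both copies accumulate changes, so S = 2 * rate * t.
    """
    return s / (2.0 * rate) / 1e6

d_elegans = 7  # loss parameter fitted for C. elegans, from the notes

half_life_s = divergence_at_fraction_lost(d_elegans, 0.5)      # about 0.099
half_life_my = divergence_to_million_years(half_life_s)        # about 3.2
loss95_s = divergence_at_fraction_lost(d_elegans, 0.95)        # about 0.43
loss95_my = divergence_to_million_years(loss95_s)              # about 13.7
```

Because the loss is exponential, the time to lose 95% of duplicates is ln 20 / ln 2, or about 4.3, half-lives regardless of the value of d.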
To calculate the rate of gene duplication, we return to the
survivorship plot and note that for Drosophila 10 pairs of duplicates have
S < 0.01. How much time does S = 0.01 represent? Using again our rate of 15.6 per
silent site per billion years, we find that this
corresponds to about a third of a million years. Thus about 31
duplications are fixed per million years of Drosophila evolution. Normalizing
this by the total number of genes in the genome (13,601), we conclude that there
are 0.0023 duplications per gene per million years in Drosophila.
For C. elegans there are 0.0208 duplications per gene per
million years. Overall, 50% of all genes are expected to duplicate and
increase to high frequency at least once within a window of 35 to 350 million
years, pointing to a major role for gene duplication in the evolution of
the genome.
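The arithmetic behind the Drosophila estimate is sketched below, again assuming that divergence between two copies accumulates at twice the per-lineage silent rate. The final step, which turns a per-gene rate into the time by which half of all genes have duplicated, assumes duplications arrive as a constant-rate (exponential waiting time) process; that assumption and all names are made here for illustration, and the results land near, not exactly on, the 35 and 350 million year bounds quoted above.

```python
import math

SILENT_RATE_PER_SITE_PER_YEAR = 15.6e-9   # from the notes
DROSOPHILA_GENE_COUNT = 13601             # from the notes
YOUNG_PAIRS = 10                          # Drosophila pairs with S < 0.01, from the notes

def divergence_to_million_years(s, rate=SILENT_RATE_PER_SITE_PER_YEAR):
    """Assumption: S = 2 * rate * t, since both copies accumulate changes."""
    return s / (2.0 * rate) / 1e6

def million_years_until_half_duplicated(rate_per_gene_per_my):
    """Time by which half of all genes have duplicated at least once,
    assuming a constant per-gene rate (exponential waiting time)."""
    return math.log(2) / rate_per_gene_per_my

window_my = divergence_to_million_years(0.01)                   # about 0.32
duplications_per_my = YOUNG_PAIRS / window_my                   # about 31
rate_per_gene_per_my = duplications_per_my / DROSOPHILA_GENE_COUNT   # about 0.0023

slow_half_time = million_years_until_half_duplicated(rate_per_gene_per_my)  # about 300
fast_half_time = million_years_until_half_duplicated(0.0208)                # about 33
```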
An example of adaptive evolution
We now examine one specific gene duplication event and the
functional diversification that followed, as published by J. Zhang et al.,
Nat. Genet. (2002) 30:411–415. The douc langur, a colobine monkey,
is a leaf-eating primate in which leaves are fermented by symbiotic bacteria in
the foregut. Colobines recover nutrients by breaking down
and digesting the bacteria with various enzymes, including pancreatic ribonuclease, RNASE1. RNASE1 has a close paralog in the douc langur named RNASE1B; interestingly, other primates
have only one copy of this gene.
Examining the protein sequences of the primate pancreatic ribonucleases, we find that RNASE1B has a large number
of changes relative to the other sequences. In other words, the molecular clock
does not hold for nonsynonymous substitutions in
RNASE1B. The substitutions are mostly in the mature peptide. Seven of the nine
amino-acid substitutions in the mature peptide of RNASE1B involve charge changes. All of these charge
changes make the protein more negative, lowering its net charge from 8.8 to 0.8.
Presumably due to foregut fermentation and related changes in
digestive physiology, the pH in the small intestine of colobine
monkeys shifts to a range of 6–7. Notably, the optimal pH for douc langur RNASE1B was found to
be 6.3, while for RNASE1 it is ~7.5. Further, RNASE1B does not exhibit the
double-stranded RNA degradation capacity found in the RNASE1 of the douc
langur and other species. Thus it appears that adaptive
evolution occurred in the RNASE1B gene of the douc langur, changing the coding sequence
in ways that adapted the protein to a new function.
Preservation of duplicates by subfunctionalization
In Ohno’s model, after a
gene duplication is fixed in the population, one gene copy is redundant
and free to accumulate substitutions at random. By chance, some of these
substitutions may suit the protein encoded by the redundant copy to a
new function, which it can subsequently assume. One main premise of Ohno’s model is that a gene has only a single
function, so that duplication is what frees a gene
to evolve a new one. However, the one-gene one-function model has been shattered
to pieces, and genetic pleiotropy is the rule. For
example, Eric Davidson and colleagues have beautifully dissected the cis-regulatory domains of genes and revealed the
myriad functions encoded in promoters, specifying expression at different times,
places, and conditions. Thus it appears that genes may have many functions. How
does this affect our model of gene function evolution post-duplication?
Restating the problem in terms of multiple independent regulatory
sites, the options of nonfunctionalization and neofunctionalization play out as follows. After gene
duplication, mutations accumulate in the regulatory sites of one of the duplicates;
these are tolerated since the other duplicate retains these regulatory sites and can
back up the functions. However, a null mutation in the coding region will result in nonfunctionalization. Alternatively, neofunctionalization
may occur when the coding sequence is altered, as in RNASE1B discussed above, or
when a new regulatory region emerges.
However there is a third possibility: the duplicates will
experience losses of their different subfunctions by degenerative
(null) mutations. Combined action of both gene copies is necessary to fulfill
requirements of the ancestral gene. Thus the gene duplication led to a partitioning
of the ancestral functions, a process called subfunctionalization. Notably this
process does not require the agency of selection but rather occurs only by
neutral processes.
What is the probability that subfunctionalization occurs? Let there
be z independently mutable subfunctions (all
essential). The subfunctions all mutate at identical
rates, u_r, to alleles lacking the relevant subfunction. Let also u_c
be the rate at which null mutations arise in the coding region. The null
mutation rate for the locus is then u_c + z u_r per gene copy. Using this
formalism we can calculate the probabilities of each of the possible scenarios
leading to subfunctionalization: 1. the first two null regulatory mutations
occur on different gene copies; 2. the first two mutations occur on the same
copy but the third is on the second copy; and 3. the first three
occur on the same copy, followed by the fourth mutation on the second copy. Each
scenario leads to subfunctionalization, and it is evident that the greater the
number of regulatory regions, the greater the probability of
subfunctionalization and consequently of preservation of the gene duplicate.
For example, if
there are 5 subfunctions and the mutation rate per subfunction is 10% of the coding-region null rate (u_c = 10 u_r), then the
probability of subfunctionalization is about 10%. If the mutation
rate per subfunction is 30% of the null rate,
the probability of subfunctionalization is about 30%. Force et al. proposed this
model of subfunctionalization and suggested as one example the tissue-specific
pattern of expression of the engrailed genes in zebrafish.
Of course, the model is overly simplified, as regulatory regions may include
overlapping and embedded regulatory elements.
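The scenarios listed above can be summed numerically. The sketch below is a reconstruction from the verbal description in these notes, not a transcription of the formula published by Force et al. It assumes that null mutations fix one at a time, that any mutation removing the last working copy of a subfunction (or of the coding region, once the other copy is damaged) is lethal and therefore never fixes, and that further hits to an already lost regulatory element have no effect. The function name is hypothetical.

```python
def prob_subfunctionalization(z, u_r, u_c):
    """Probability that a duplicate pair ends up partitioning z subfunctions.

    z   : number of independently mutable, essential subfunctions
    u_r : null mutation rate per regulatory subfunction
    u_c : null mutation rate of the coding region
    """
    # The first fixed null mutation must hit a regulatory element (on either
    # copy) rather than a coding region, which would silence that copy outright.
    p_first_regulatory = z * u_r / (u_c + z * u_r)

    p_partition = 0.0
    p_reach_state = 1.0   # chance that one copy has lost i subfunctions, other intact
    for i in range(1, z):
        remaining = z - i  # subfunctions still present on the damaged copy
        # Permitted next events: coding null on the damaged copy (u_c), another
        # regulatory loss on the damaged copy, or a complementary regulatory
        # loss on the intact copy (each at rate remaining * u_r).
        step = remaining * u_r / (u_c + 2 * remaining * u_r)
        p_partition += p_reach_state * step   # complementary loss: partitioned
        p_reach_state *= step                 # same copy hit again: keep going
    return p_first_regulatory * p_partition

p_low = prob_subfunctionalization(z=5, u_r=0.1, u_c=1.0)    # roughly 0.09
p_high = prob_subfunctionalization(z=5, u_r=0.3, u_c=1.0)   # roughly 0.30
```

Under these assumptions the two worked cases come out close to the 10% and 30% quoted above, and increasing z with the rates held fixed raises the probability, in line with the claim that more regulatory regions favor preservation of the duplicate.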
The consequences of gene duplications
The consequences of gene duplication may also include speciation.
Upon duplication, either the ancestral or the descendant copy can be silenced. In
geographically isolated populations different copies might be silenced thus
passively giving rise to a small-scale chromosomal rearrangement. These
rearrangements will lead to nonfunctional hybrids and thus to reproductively
isolated species.