Course notes for the November 15th, 2009 lecture of the Genome Evolution Course
Itai Yanai
Department of Biology
Technion – Israel Institute of Technology
yanai@technion.ac.il
The Neutral Mutation Theory of Molecular Evolution
In the previous two sections, we have learned fundamental concepts in population genetics and molecular evolution. In this chapter, we will discover a crucial link between the two fields. To borrow Dobzhansky’s dictum, we will find that nothing in evolutionary genomics makes sense except in the light of a certain theory, the neutral mutation theory.
We begin with the mouse genome and its similarity to our own. The mouse genome has 19 autosomal chromosomes as well as the X and Y sex chromosomes. The genome was completely sequenced by 2002, revealing a 2.5 Gb genome, about 14% smaller than ours. Over 90% of the genome was found to lie in syntenic regions (contiguous stretches of DNA that remain intact across organisms). The mouse genome could be described as a shuffled human genome made up of 342 long (>300 kb) segments. The only mouse chromosome composed of DNA homologous to a single human chromosome is the X chromosome. This is not so surprising since it is known to undergo very little recombination. Human chromosome 20 is entirely represented as a portion of mouse chromosome 2. Most of the other human chromosomes are scattered in segments. Human chromosome 4, for example, is split into seven large chunks spread over mouse chromosomes 3, 5, 6, and 8. In many instances it is possible to infer the order of the chromosomal rearrangements that have taken place based upon the most parsimonious scenario (the minimal number of steps).
The most recent common ancestor of human and mouse lived about 75 million years ago, although estimates vary quite a bit for this divergence. It is of particular interest that there are extremely few genes (<1%) specific to either human or mouse. Basically, the same set of genes can build both organisms. The sequences of these genes themselves however show considerable divergence. Overall the human and mouse genomes differ at nearly one out of every two nucleotides.
This level of divergence was touted as a particularly important aspect of comparing the two sequences: “The divergence is high enough so that one can recognize many functionally important elements by their greater degree of conservation.” (Mouse Genome Sequencing Consortium, Nature (2002)). Such a statement reflects the degree to which one very basic notion about molecular evolution has penetrated the way we look at sequences. When aligning homologous sequences, it is a common assumption that conserved positions reflect functionally important sites while variable regions are less important.
As intuitive as this notion appears to us today, its formal introduction to evolution set off one of the biggest controversies in the history of the field. By the 1950’s the synthetic theory of evolution – holding that evolution is the change in gene frequencies due to natural selection – had fully matured into the dominant mode of thought in evolution. The fate of neutral mutations had been discussed theoretically but widespread belief rejected the possibility of their fixation in populations. As G.G. Simpson put it in 1950:
The consensus is that completely neutral genes or alleles must be very rare if they exist at all. To an evolutionary biologist, it therefore seems highly improbable that proteins, supposedly fully determined by genes, should have nonfunctional parts, that dormant genes should exist over periods of generations, or that molecules should change in a regular but nonadaptive way . . . [natural selection] is the composer of the genetic message, and DNA, RNA, enzymes, and other molecules in the system are successively its messengers.
Under this view, differences between homologous sequences correspond to changes brought about by natural selection, and thus these are the ones to be examined carefully. According to the synthetic theory there are two types of mutations: deleterious and advantageous. Most mutations are of the former type. Deleterious mutations reduce the fitness of the organism and are consequently removed from the population by natural selection. Such selection is called purifying selection. Advantageous mutations increase the fitness of the organism. These rise in frequency thanks to natural selection until they are fixed in the population. This type of selection can be called adaptive, positive, or Darwinian selection. It is important to remember here that the fitness upon which selection acts is judged according to survival rates and fecundity only.
The number of substitutions observed in genetic sequences
A fitting beginning to the story may be the notion of the “cost of natural selection”. As brutal as it sounds, in order for natural selection to do its job a fraction of the population needs to die each generation before breeding. Consider a haploid population with two alleles, A and A’ (the following is derived as in Ridley’s Evolution (1996)). Let the frequencies of A and A’ be p and q (= 1-p) and their selective weights be 1 and 1-s. Each generation, of the fraction q carrying allele A’, a fraction s will die and (1-s) will survive. This means that altogether a fraction sq of the population will die each generation.
In 1957, the great population geneticist J.B.S. Haldane sought to calculate how many individuals will die altogether in the number of generations it takes a new mutation to rise to fixation within the population. Interestingly, he found that this number is largely independent of the selective weight. A weaker selective weight kills fewer individuals per generation but requires proportionally more generations for fixation, and the two effects end up canceling each other out. Haldane determined that the cumulative fraction of the population that must die is related only to the initial frequency of the mutation: D = -2ln(p). A mutation in a typical population might have an initial frequency of 10^-6, yielding a cost of 27.6, meaning that selective deaths totaling 27.6 times the population size (2,760%) will occur over the length of time it takes the mutation to rise to fixation. Haldane took D = 30 as a rough estimate. Since populations cannot support extremely strong selection because of the chance of extinction, Haldane reasoned that a population can sustain the selective death of about 10% of its members each generation. Consequently, a new allele may be substituted in a population roughly every 300 generations.
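The arithmetic behind Haldane's estimate can be checked with a minimal sketch. The function names `haldane_cost` and `generations_per_substitution` are hypothetical, and the code simply evaluates the formula D = -2ln(p) quoted above rather than re-deriving it.

```python
import math

def haldane_cost(p0):
    """Cumulative selective deaths, in units of population size: D = -2 ln(p0)."""
    return -2.0 * math.log(p0)

def generations_per_substitution(cost, tolerable_fraction):
    """Generations needed to pay `cost` when only `tolerable_fraction`
    of the population can be lost to selection in each generation."""
    return cost / tolerable_fraction

cost = haldane_cost(1e-6)
print(round(cost, 1))                                  # 27.6
print(round(generations_per_substitution(30, 0.10)))   # 300
```

Multiplying the 300 generations by an assumed generation time (20 years is the figure used later in these notes) gives the 6,000 years per substitution attributed to Haldane.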
Haldane’s estimate of the speed of evolution gave an indication of the number of substitutions expected among homologous sequences. In the 1960’s it finally became possible to test Haldane’s prediction on real genetic sequences. In 1968, the Japanese geneticist Motoo Kimura published his analysis of changes in hemoglobin, cytochrome c, and triosephosphate dehydrogenase in light of Haldane’s hypothesis. Applying the molecular clock, Kimura calculated that, on average, one change occurs every 28 × 10^6 years in a protein of length 100 amino acids.
While this rate seems extremely slow, Kimura showed that when extrapolated to the entire genome it translates into a tremendous amount of change. Based upon a crude assessment of the number of genes in the genome (13 million!), Kimura arrived at a rate of one substitution every 1.8 years, in sharp contradiction to the 6,000 years (300 generations × 20 years per generation) estimated by Haldane. Using modern figures, Kimura’s argument is actually stronger (despite the ridiculous gene number guess). The human and mouse genomes diverged 75 million years ago and have thus been evolving separately for a total of 150 million years. During this time the sequences came to differ at an observed fraction p = 0.33 of their bases. Accounting for multiple substitutions at the same site (about 0.43 substitutions per site, nearly one for every two bases) across a genome of 3.2 billion bases, we estimate 1.39 billion substitutions. Thus substitutions proceeded at about one every 0.11 years, a super speed for substitutions.
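The modern-figures calculation can be reproduced with a short sketch. The notes do not name the correction used for multiple substitutions; the code below assumes the Jukes-Cantor correction, which happens to reproduce the quoted 1.39 billion substitutions. The variable names are illustrative only.

```python
import math

def jukes_cantor(p):
    """Substitutions per site implied by an observed fraction p of differing sites
    (Jukes-Cantor correction; an assumption, not stated in the notes)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

p_observed = 0.33          # observed fraction of differing bases
genome_size = 3.2e9        # bases
separate_years = 150e6     # 75 million years along each of the two lineages

per_site = jukes_cantor(p_observed)
substitutions = per_site * genome_size
years_per_substitution = separate_years / substitutions

print(round(per_site, 2))                 # 0.43
print(round(substitutions / 1e9, 2))      # 1.39
print(round(years_per_substitution, 2))   # 0.11
```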
Kimura argued that the large number of substitutions suggested that something other than selection was responsible. No population would be able to sustain the cost of selection needed to bring about so many substitutions. He proposed that most of the substitutions are selectively neutral, i.e. the fitness of the organism is unaffected by them. A neutral mutation can rise to fixation simply by random drift without inflicting any cost. This proposition was extremely controversial in 1968 and set off a debate between the so-called selectionists and neutralists. In the three following sections we review the body of evidence supporting this theory.
The steady rate of evolution
The rate of evolution is the number of substitutions that occur in each generation. Another way of conceiving this rate is by estimating the fraction of new mutations that occur each generation that will be lucky enough to fix in the population. In other words, the rate of evolution is the product of the number of mutations appearing each generation and the probability that each one has of achieving fixation. As we have already seen in section two, the number of mutations appearing each generation is simply the product of the number of gene copies in the population (2N) and the mutation rate m, that is, 2Nm. Thus the rate of evolution is equal to 2NmP, where P is the probability of fixation.
A remarkable result is discovered when one assumes that all the mutations are selectively neutral. As we have already seen, P = 1/(2N) for neutral mutations in diploid organisms. Substituting this P, we find that the rate of divergence (evolution) is 2Nm × 1/(2N) = m. That is, the rate of divergence is simply equal to the mutation rate. This simple and beautiful equation stands as one of the most famous results of molecular evolution. Note in particular that the rate of evolution is independent of the population size, N.
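The fixation probability P = 1/(2N) used in this argument can be illustrated with a small simulation. This is a minimal sketch assuming a Wright-Fisher model (each of the 2N gene copies in the next generation is drawn at random from the current gene pool); the function name, population size, and number of trials are arbitrary choices for illustration.

```python
import random

def neutral_fixation_fraction(n_diploid, trials, seed=0):
    """Estimate how often a single new neutral mutation reaches fixation
    in a population of n_diploid individuals (2N gene copies)."""
    rng = random.Random(seed)
    copies = 2 * n_diploid
    fixed = 0
    for _ in range(trials):
        count = 1  # the mutation starts as one copy out of 2N
        while 0 < count < copies:
            freq = count / copies
            # Each copy of the next generation is the mutant with probability freq.
            count = sum(rng.random() < freq for _ in range(copies))
        if count == copies:
            fixed += 1
    return fixed / trials

estimate = neutral_fixation_fraction(n_diploid=10, trials=20000)
print(estimate, "compared with 1/(2N) =", 1 / 20)
```

With these settings the estimate should land close to 0.05; the vast majority of new neutral mutations are lost within a few generations, and only about one in 2N drifts all the way to fixation.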
It is clear, however, that not all mutations are neutral. The neutral mutation theory predicts that in the spectrum of mutations, a fraction f are neutral while 1-f are deleterious. Advantageous mutations, which undoubtedly occur, are so rare that they can effectively be ignored. Thus, the actual rate of divergence is fm, the product of the fraction of the mutations that are selectively neutral and the mutation rate.
The rate of divergence predicted by the neutral mutation theory has a particularly salient implication for the molecular clock. Because the rate of divergence depends only on two parameters, f and m, which are considered conservative, the rate is expected to be regular across lineages.
What happens to the rate of divergence when the mutations are not neutral? The probability of fixation of an advantageous mutation with selective weight s is estimated as 2s. Thus the rate of divergence for advantageous mutations is 2Nm × 2s = 4Nsm. Here, the rate of divergence depends upon the population size, the selective advantage, and the mutation rate. It is unlikely that such a product would be held constant across lineages, and so the molecular clock would not be accounted for. In short, we see that a situation where there are many more neutral mutations than advantageous mutations can explain the molecular clock, while a model with only advantageous mutations reaching fixation cannot.
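The contrast between the two models can be made concrete with a few lines of code. This is a sketch of the formulas above only; the function names and the parameter values (a mutation rate of 10^-8 and a selective weight of 0.01) are hypothetical.

```python
def neutral_rate(n_diploid, m, f=1.0):
    """Rate of divergence when a fraction f of mutations is neutral:
    (2N * m * f) new neutral mutations, each fixing with probability 1/(2N)."""
    return (2 * n_diploid * m * f) * (1 / (2 * n_diploid))

def advantageous_rate(n_diploid, m, s):
    """Rate of divergence when every mutation is advantageous with fixation
    probability 2s: 2N * m * 2s = 4Nsm."""
    return (2 * n_diploid * m) * (2 * s)

for n in (1_000, 100_000):
    print(n, neutral_rate(n, 1e-8), advantageous_rate(n, 1e-8, 0.01))
```

The neutral rate stays at the mutation rate for both population sizes, whereas the advantageous rate grows a hundredfold when N grows a hundredfold, which is why only the neutral model predicts a steady clock.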
Neutral mutations and functional constraint
Perhaps the most convincing evidence for the neutral mutation theory is the inverse relationship between the importance of a protein, or of a site within a protein, and its rate of evolution. Within a protein, not all of the amino acid sites are equally important to its function. Likewise, from the point of view of a protein chemist, many mutations, such as glycine to alanine, are minor while others can be more drastic, for example valine to arginine. If most changes were due to selective forces one would expect more drastic differences than trivial ones. However, the opposite is observed: minor changes are frequent while drastic changes are not. Furthermore, it is observed that pseudogenes evolve at a faster rate than functional genes.
An inverse correlation between divergence rate and functional importance is also strikingly observed across a gene’s landscape. An analysis of 3,165 orthologous human-mouse pairs by the Mouse Genome Sequencing Consortium showed that exons are on average ~85% identical between the two genomes. The conservation drops drastically to ~68% for introns. Splice sites and the start of transcription are genetic loci that are particularly well conserved, ~90%. These statistics are extremely compelling in favor of the neutral mutation theory.
A common method for comparing the relative contributions of neutral change and purifying selection takes advantage of the genetic code. The genetic code is said to be degenerate because it uses 61 codons to code for 20 amino acids; all but two amino acids (tryptophan and methionine) have multiple codons which are able to code for them. Considering that a codon is composed of three bases, a single mutation can change each codon into any of nine other codons. The genetic code is such that of the 549 (61 × 9) possible changes, 134 (~25%) are synonymous, i.e. encoding the same amino acid.
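The counts quoted here can be verified by enumerating the standard genetic code. The sketch below builds the code table from the usual compact TCAG-ordered string; the helper name `single_base_neighbors` is hypothetical, and changes to a stop codon are counted as non-synonymous.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons in TCAG order mapped to one-letter amino acids ("*" = stop).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def single_base_neighbors(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]

sense_codons = [c for c, aa in CODE.items() if aa != "*"]
total = synonymous = 0
for codon in sense_codons:
    for other in single_base_neighbors(codon):
        total += 1
        if CODE[other] == CODE[codon]:
            synonymous += 1

print(len(sense_codons), total, synonymous)   # 61 549 134
print(round(100 * synonymous / total, 1))     # 24.4
```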
If, as the pan-selectionist programme suggests, all substitutions are adaptive, one would assume that most changes would occur in the first and second codon positions. If DNA divergence includes neutral mutations, then the third position should change more rapidly because synonymous mutations are more likely to be neutral. (Synonymous mutations are often not completely neutral because of unequal abundances of tRNAs and tRNA synthetases, but that is a story for another day.) When comparing the DNA coding sequences of different organisms, for example human and mouse, we find an overwhelming preponderance of synonymous changes.
We will now introduce a method for computing the number of synonymous substitutions per synonymous site and the number of non-synonymous (altering) substitutions per non-synonymous site, commonly called KS and KA, respectively. The method amounts to first annotating the degeneracy of each site and then sorting the divergences as either synonymous or non-synonymous. This method is known in the literature as the Pamilo-Bianchi-Li method. We classify each site into one of three degeneracy types: 0-fold, 2-fold, and 4-fold. A 4-fold degenerate site is a base position where all possible changes still encode the same amino acid; for example, the third position of the valine codons is 4-fold degenerate. 4-fold degenerate sites are found at the 3rd position of 32 of the 61 sense codons. A 2-fold degenerate site is one where, of the three possible DNA changes, one is synonymous and two are altering; for example, the first position of some arginine codons (such as CGA and AGA) is 2-fold degenerate. 2-fold degenerate sites are found in 25 of the 3rd positions and 8 of the 1st positions. 0-fold degenerate sites are those where any change is non-synonymous; for example, the 2nd positions of all codons are 0-fold degenerate. 0-fold degenerate sites are also found at 53 of the 1st positions. The only exception to this classification is the third position of the three isoleucine codons, where two of the three possible changes are synonymous. On the basis of other genetic codes where isoleucine only occupies two codons, we say that this site is 2-fold degenerate to fit in with the scheme.
Given a DNA coding sequence we can easily annotate each site as 0-, 2-, or 4-fold degenerate. We now wish to compare it with a second sequence, which is itself annotated. Due to codon changes, a site may be 2-fold degenerate in one sequence while the same site is 4-fold degenerate in the other. Thus, to estimate the number of sites of each degeneracy class, we take the average of the two counts.
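The annotation and averaging steps described above might be sketched as follows. The function names and the two toy sequences are hypothetical; the code assumes the standard genetic code, in-frame sequences of equal length without stop codons, treats changes to a stop codon as non-synonymous, and forces the third position of isoleucine into the 2-fold class as the text prescribes.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def degeneracy(codon, pos):
    """Return 0, 2 or 4 for position pos (0-based) of a sense codon."""
    n_syn = sum(
        CODE[codon[:pos] + b + codon[pos + 1:]] == CODE[codon]
        for b in BASES if b != codon[pos]
    )
    if n_syn == 3:
        return 4
    if n_syn >= 1:   # includes isoleucine's third position (2 synonymous changes)
        return 2
    return 0

def annotate(seq):
    """Degeneracy class of every site of an in-frame coding sequence."""
    return [degeneracy(seq[i:i + 3], p) for i in range(0, len(seq), 3) for p in range(3)]

def average_site_counts(seq1, seq2):
    """Average number of 0-, 2- and 4-fold sites over the two sequences."""
    a1, a2 = annotate(seq1), annotate(seq2)
    return {k: (a1.count(k) + a2.count(k)) / 2 for k in (0, 2, 4)}

seq_a = "GTTCGAATTTGG"   # Val Arg Ile Trp
seq_b = "GTCCGTATTTGG"   # same amino acids, two synonymous differences
print(annotate(seq_a))                     # [0, 0, 4, 2, 0, 4, 0, 0, 2, 0, 0, 0]
print(average_site_counts(seq_a, seq_b))   # {0: 8.5, 2: 1.5, 4: 2.0}
```

The first position of the arginine codon is 2-fold degenerate in CGA but 0-fold in CGT, which is exactly the kind of disagreement that the averaging is meant to smooth over.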
If it were not for a key simplification, our job would now be to check each diverging 2-fold site and determine whether the change is synonymous or not. However, the genetic code is composed in such a way that transitions at 2-fold degenerate sites are synonymous while transversions are non-synonymous. There are two exceptions to this rule: 1) the first codon position of arginine is 2-fold degenerate, but it is a transversion that relates the synonymous codons, and 2) the last position of isoleucine (a “3-fold” degenerate site). Thus, KS can be estimated as the sum of transversions at 4-fold sites and transitions at 2-fold and 4-fold sites. Similarly, KA can be estimated as the sum of transversions at 0-fold and 2-fold sites and transitions at 0-fold sites.
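The transition rule and its two exceptions can be checked directly against the code table. The following sketch assumes the standard genetic code, counts changes to stop codons as non-synonymous, and includes the isoleucine third position among the 2-fold sites; all names are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)

exceptions = set()
for codon, aa in CODE.items():
    if aa == "*":
        continue
    for pos in range(3):
        changes = [
            (b, CODE[codon[:pos] + b + codon[pos + 1:]] == aa)
            for b in BASES if b != codon[pos]
        ]
        n_syn = sum(syn for _, syn in changes)
        if n_syn not in (1, 2):      # keep 2-fold sites only (Ile counted as 2-fold)
            continue
        for b, syn in changes:
            # The rule says: synonymous if and only if the change is a transition.
            if syn != is_transition(codon[pos], b):
                exceptions.add((aa, pos + 1))

print(sorted(exceptions))   # [('I', 3), ('R', 1)]
```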
The Kimura distance studied last time is used to correct the number of transitions and transversions separately:
$$A = \frac{1}{2}\ln\frac{1}{1 - 2P - Q} - \frac{1}{4}\ln\frac{1}{1 - 2Q}$$

$$B = \frac{1}{2}\ln\frac{1}{1 - 2Q}$$

where P and Q are the fractions of sites that differ by a transition and by a transversion, respectively. A and B are calculated separately for each of the three types of sites: 0-fold, 2-fold, and 4-fold.
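As a small illustration, the corrections can be computed as below. This sketch takes B in its positive form, (1/2) ln(1/(1 - 2Q)), so that A + B equals the usual Kimura two-parameter distance; the function name and the example values of P and Q are hypothetical.

```python
import math

def kimura_ab(P, Q):
    """Corrected numbers of transitions (A) and transversions (B) per site,
    given the observed fractions of transitional (P) and transversional (Q) differences."""
    A = 0.5 * math.log(1 / (1 - 2 * P - Q)) - 0.25 * math.log(1 / (1 - 2 * Q))
    B = 0.5 * math.log(1 / (1 - 2 * Q))
    return A, B

A, B = kimura_ab(P=0.10, Q=0.05)
print(round(A, 4), round(B, 4))   # 0.1175 0.0527
```

Both corrected values exceed the observed fractions (0.10 and 0.05), as expected when some sites have been hit more than once.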
Now we are ready to calculate KS and KA. Let L0, L2, and L4 denote the numbers of 0-, 2-, and 4-fold degenerate sites, and let Ai and Bi denote the values of A and B calculated for the i-fold sites. The total number of synonymous sites is estimated as the sum of the 4-fold sites and one third of the 2-fold sites (since of the three possible mutations at a 2-fold site, one is synonymous and two are not). At these sites, the number of synonymous changes is the sum of the transitions at 2-fold and 4-fold sites and the transversions at 4-fold sites:

$$K_S = \frac{L_2 A_2 + L_4 (A_4 + B_4)}{\tfrac{1}{3} L_2 + L_4}$$

Similarly, KA is defined as:

$$K_A = \frac{L_0 (A_0 + B_0) + L_2 B_2}{L_0 + \tfrac{2}{3} L_2}$$

However, because transitional substitutions tend to occur more often than transversional substitutions and because most transitional changes at 2-fold sites are synonymous, KS is overestimated and KA is underestimated.
To correct for this it has been proposed to take the weighted average of the transitional changes:

$$K_S = \frac{L_2 A_2 + L_4 A_4}{L_2 + L_4} + B_4$$

and likewise:

$$K_A = A_0 + \frac{L_0 B_0 + L_2 B_2}{L_0 + L_2}$$

These are the standard formulations.
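A minimal sketch of both formulations follows, assuming the notation L0, L2, L4 for the numbers of 0-, 2-, and 4-fold sites and A, B for the corrected transitions and transversions per site in each class. The unweighted version divides synonymous changes by L2/3 + L4 synonymous sites; the weighted version averages transitions over the 2-fold and 4-fold sites and adds the 4-fold transversions. The function names and all input numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def ks_ka_unweighted(L, A, B):
    """KS and KA with 2-fold sites counted as 1/3 synonymous, 2/3 non-synonymous."""
    ks = (L[2] * A[2] + L[4] * (A[4] + B[4])) / (L[2] / 3 + L[4])
    ka = (L[0] * (A[0] + B[0]) + L[2] * B[2]) / (L[0] + 2 * L[2] / 3)
    return ks, ka

def ks_ka_weighted(L, A, B):
    """KS and KA using weighted averages of the transitional and transversional changes."""
    ks = (L[2] * A[2] + L[4] * A[4]) / (L[2] + L[4]) + B[4]
    ka = A[0] + (L[0] * B[0] + L[2] * B[2]) / (L[0] + L[2])
    return ks, ka

# Hypothetical inputs, keyed by degeneracy class.
L = {0: 600, 2: 200, 4: 100}       # numbers of sites
A = {0: 0.02, 2: 0.20, 4: 0.25}    # corrected transitions per site
B = {0: 0.01, 2: 0.02, 4: 0.15}    # corrected transversions per site

print([round(x, 4) for x in ks_ka_unweighted(L, A, B)])   # [0.48, 0.03]
print([round(x, 4) for x in ks_ka_weighted(L, A, B)])     # [0.3667, 0.0325]
```

With these made-up numbers the weighted version lowers KS and raises KA relative to the unweighted one, in line with the direction of the bias described in the text.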
What happens when we compare KS and KA? In overwhelming support of the neutral theory, for a pair of functional orthologous genes KS tends to be unequivocally larger than KA. While the selectionists are pressed for an explanation, the neutralist explanation is simple: the nonsynonymous substitution rate is lower because purifying selection keeps those sites more conserved.
The ratio between KS and KA is a good gauge of the level of selection acting upon a gene in evolution. If KS and KA are about the same, the gene is accumulating synonymous and nonsynonymous mutations at about the same rate. This strongly suggests that the nonsynonymous mutations are as neutral as the synonymous ones. If KS < KA, positive selection is in effect, bringing about more altering changes than synonymous changes. However, as is typically the case for functional genes, KS >> KA, demonstrating that purifying selection is removing many of the amino-acid changing mutations.
It is important to stress that neutrality says nothing about functional significance. That is, the mere existence of different functional forms is not evidence for the operation of natural selection. Selection can only be assessed through investigations of survival rates and fecundity. Thus, a mutation which gives you a blue middle finger on your right hand can be perfectly neutral despite producing a clear phenotypic difference. If the survival rate and fecundity of individuals with the blue finger are the same as for the others, the mutation is said to be neutral.
Polymorphisms and the neutral theory
In the 1960’s geneticists got their first chance at systematically estimating the genetic variation in populations. By gel electrophoresis, samples from different individuals could be typed according to their molecular size. Richard Lewontin, the first researcher to use this technique in this field, discovered that the amount of variation in the population was astoundingly high. Our modern SNP analysis, discussed in the second section, rediscovered the same result. Further research has shown that the fraction of polymorphic loci (P) is typically very high across species. One way to estimate the degree of polymorphism is by calculating the heterozygosity: the probability that a randomly chosen individual is heterozygous at a given locus. The early studies found that in humans heterozygosity is ~7%; that is, choosing an individual at random and examining the two copies at a certain locus, you have a 7% chance of finding that individual to be heterozygous at this site.
Polymorphisms were thought to be due to balancing selection. It was believed that a heterozygote advantage, as was shown for sickle-cell anemia in populations with a high occurrence of malaria, maintained multiple alleles in the population.
In stark contrast, the neutral mutation theory proposes that polymorphism is a phase of molecular evolution. As Kimura and Ohta put it in 1971:
In our view, protein polymorphism and molecular evolution are not two separate phenomena, but merely two aspects of a single phenomenon caused by random frequency drift of neutral mutants in finite populations.
Thus, given the large number of substitutions that occur, at any moment many mutations are in transit, drifting in frequency through the generations, and these are what we observe as polymorphisms. Effectively, the neutral mutation theory unifies population genetics and molecular evolution.
We will now derive the amount of heterozygosity expected in a population of size N with a neutral mutation rate m. We first define homozygosity as simply one minus the heterozygosity (H). Homozygosity is thus the probability of drawing a random homozygous individual.
Given that the homozygosity in a given population is G, what will be G’, the homozygosity in the next generation? Although according to the Hardy-Weinberg principle you would expect G’ = G, consider that the population is of finite size (Hardy-Weinberg assumes an infinite population) and thus random events can have an effect. Imagine that you are drawing, with replacement, the two alleles that will form an individual from the gene pool of 2N alleles. One way to get a homozygote is to simply choose the same exact allele twice; this has a probability of 1/(2N). The other way is to choose two different alleles (with probability 1 - 1/(2N)), whose probability of being of the same kind is G. Thus,
$$G' = \frac{1}{2N} + \left(1 - \frac{1}{2N}\right) G$$

Solving for H now (H = 1 - G) we have,

$$H' = \left(1 - \frac{1}{2N}\right) H$$

And the change each generation is:

$$\Delta H = H' - H = -\frac{1}{2N} H$$
Thus, each generation the level of heterozygosity decreases by a fraction 1/(2N) of its current value. This is the effect of random drift, which acts to remove variation (heterozygosity) from the population.
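Iterating this recursion shows how slowly drift erodes variation in a large population. The sketch below assumes no mutation, as in the derivation so far; the function name and the starting values are hypothetical.

```python
import math

def heterozygosity_after(h0, n_diploid, generations):
    """Heterozygosity after repeated application of H' = (1 - 1/(2N)) H."""
    return h0 * (1 - 1 / (2 * n_diploid)) ** generations

n_diploid = 1000
half_life = round(2 * n_diploid * math.log(2))   # about 1.39 N generations
print(half_life)                                                      # 1386
print(round(heterozygosity_after(0.5, n_diploid, half_life), 3))      # 0.25
```

For a population of 1,000 diploid individuals it takes roughly 1,400 generations of drift alone to halve the heterozygosity.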
The heterozygosity is saved from extinction by the engine of evolution: mutations. An allele has a probability m of mutating each generation, and 1 - m of not mutating. Thus in our calculation of the homozygosity in the next generation, G’, we also need to stipulate that neither copy mutated:

$$G' = (1 - m)^2 \left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right) G\right]$$

Since m is small, (1 - m)^2 can be approximated by (1 - 2m):

$$G' \approx (1 - 2m) \left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right) G\right]$$

Ignoring terms with m/N:

$$G' \approx \frac{1}{2N} + \left(1 - \frac{1}{2N}\right) G - 2mG$$
Solving again for H’ we have:
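$$H' = \left(1 - \frac{1}{2N}\right) H + 2m(1 - H)$$

And the change of H each generation can be shown to be:

$$\Delta H = H' - H = -\frac{1}{2N} H + 2m(1 - H)$$

The first term we recognize as the effect of random drift. The second term is the effect of mutation: the amount of heterozygosity introduced into the population each generation is 2m(1-H). Solving for H when the change in H is zero (assuming a steady state) we find:

$$H = \frac{4Nm}{1 + 4Nm}$$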
Thus the amount of heterozygosity depends only on the size of the population and the mutation rate. This remarkably simple relationship holds in a large number of instances when the mutations are indeed neutral.
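The steady state can be checked numerically by iterating the per-generation change, drift removing H/(2N) and mutation adding 2m(1-H), until H stops changing, and comparing the result with H = 4Nm/(1 + 4Nm). This is a minimal sketch; the function names and the values of N and m are hypothetical.

```python
def next_heterozygosity(h, n_diploid, m):
    """One generation of drift plus neutral mutation:
    H' = (1 - 1/(2N)) H + 2m (1 - H)."""
    return (1 - 1 / (2 * n_diploid)) * h + 2 * m * (1 - h)

def equilibrium_heterozygosity(n_diploid, m):
    """Steady-state heterozygosity, H = 4Nm / (1 + 4Nm)."""
    return 4 * n_diploid * m / (1 + 4 * n_diploid * m)

n_diploid, m = 1000, 1e-4
h = 0.0                      # start from a population with no variation
for _ in range(50_000):
    h = next_heterozygosity(h, n_diploid, m)

print(round(h, 4), round(equilibrium_heterozygosity(n_diploid, m), 4))   # 0.2857 0.2857
```

Starting from a fully heterozygous population instead (h = 1.0) leads to the same value, since the equilibrium depends only on the product Nm.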