Course notes for the November
1st 2009 lecture of the Genome Evolution Course
Itai Yanai
Department of Biology
Technion – Israel
Institute of Technology
yanai@technion.ac.il
Clockwork Evolution at
the Molecular Level
Phenotypic and molecular
rates of evolution
Phenotypic evolution seems to be fundamentally
different from molecular evolution. Darwin addressed phenotypic evolution in Chapter
10 of his “Origin of Species”, noting that different organisms evolve at
different rates:
Species
of different genera and classes have not changed at the same rate, or in the
same degree…. The Silurian Lingula differs but little from the living species
of this genus; whereas most of the other Silurian Molluscs and all the
Crustaceans have changed greatly. The productions of the land seem to change at
a quicker rate than those of the sea, of which a striking instance has lately
been observed in Switzerland. There is some reason to believe that organisms,
considered high in the scale of nature, change more quickly than those that are
low: though there are exceptions to this rule.
As another
example, consider trilobites – hard-shelled sea-dwelling creatures that existed
300 million years ago – and the evolution of the number of ribs they contain. The
fossil record is particularly rich in trilobites and thus allows us to trace
the number of ribs for different trilobite lineages. Different patterns are
apparent. In the Nobiliasaphus lineage we detect a clear linear trend of an
increase in rib number with time. For Cnemidopyge the number of ribs increases
quickly to eight ribs on average and is then steadily maintained. In other lineages
we find zigzag and circuitous patterns.
In this
section, we investigate the pattern of evolutionary change as recorded in the genome.
We will make the remarkable discovery that the rate of molecular change is of a
completely different nature from that of phenotypic change.
Evolution
is the change in allele frequencies. This was the definition we found valuable
at the level of populations, examined in the previous section. Indeed, changes
in allele frequencies are the atoms of evolutionary change. In this section, we
turn our attention to their effects at a one-step higher level of abstraction –
the evolving genomic sequence. This involves collapsing a population
into one representative sequence.
The
representative sequence captures evolutionary changes in terms of substitutions.
Each nucleotide in the sequence corresponds to the wild-type version. When the
population genetic processes described earlier bring about the fixation of a
mutant allele (a new base at some position), the corresponding base in the
representative sequence is substituted. Effectively, the long process of an
allele’s rise to fixation is captured as one step by the representative sequence.
Family
relationships among genes
In studying
genome evolution, we must now acquaint ourselves with one of the major
concepts: homology.
Sir Richard
Owen was the first Superintendent of the natural history departments of the British Museum and an expert in
comparative anatomy. Based upon his examination of the skeletons of the various
classes of vertebrates – fish, reptiles, birds, and mammals – he came to the
conclusion that vertebrates have a common structural plan which he called an
“archetype”. Owen introduced the term homology to mean “the same organ
in different animals under every variety of form and function” (1843). Owen
meant, for example, that human fingers are homologous to bird fingers. Owen had
his own particular explanation for the existence of common archetypes which he
saw as evidence that the creator had a common design for all vertebrates.
Following
Darwin’s theory of evolution, homology took on a new significance. A sequence
from one organism often shows a high fraction of
identity with a sequence from another organism. If the similarity is too strong
to be reasonably
explained by chance, common ancestry may be inferred, and the two sequences are
said to be homologous. Thus, in the light of evolution, homology is
reinterpreted as similarity due to common ancestry.
As in
family relations, it is particularly useful to further specify the types of
homologies that can occur. In the example, the two sequences trace
their ancestry to a single sequence that was split by the speciation event leading to the two
contemporary species. Homologous sequences that arise from speciation events are
referred to as orthologs. Orthologs can be seen as “evolutionary
counterparts” (as Eugene Koonin refers to them). They are “the same” gene in
different organisms.
The second
major type of homology is a result of gene duplication events, frequent in
genome evolution. Two homologous sequences can descend from a single sequence that
underwent a duplication event. Such homologous sequences are referred to as paralogs.
They can be seen as continuing their existence in parallel to one another
within the same genome. Thus duplication events yield paralogous relationships.
Even with the
concepts of paralogy and orthology, however, describing the relationships among
genes can easily become complicated. Consider the history of eight genes from
yeast (1 gene), human (4 genes) and worm (3 genes). Note that in the last
common ancestor of these three organisms, these eight were represented by only one
gene, and the yeast genome still contains only one. Along the animal
lineage – before the speciation of the worm and human ancestors – the
primordial gene underwent a duplication, forming two paralogs, A and B. Some
time later the speciation occurred and each lineage received its own copies of A
and B – two sets of orthologs. The two paralogs, A and B, had different fates.
B was not duplicated after the speciation event, and thus worm and human
each have a single copy of B. Following speciation, A underwent one duplication event in worm
(2 A genes) and two duplication events in human (3 A genes). What are the
relations among these genes?
The yeast
gene is considered orthologous to all worm and human genes, which are in turn
all co-orthologous to the yeast gene. So, which gene is the ortholog of the yeast gene in
human? Due to the duplications, there is not a single one: HA1, HA2, HA3, and HB
are all co-orthologous to the yeast gene. And which gene is the ortholog of the HB
gene in worm? This is simple: since there were no duplications of the B gene in
either lineage after the worm-human speciation, the ortholog is simply WB. Similarly, the
co-orthologs of WA1 and WA2 are HA1, HA2, and HA3.
A recent
useful addition to the terminology of evolutionary relationships is the
distinction between in-paralogs and out-paralogs. Genes HA1, HA2, and HA3 are
considered in-paralogs with respect to the worm genome, since the duplications
that brought them about occurred after the speciation event. By contrast, the
HA genes and HB are out-paralogs with respect to the worm genome since they
stem from a duplication event that occurred before – or outside – the
speciation event. The same set of genes is considered in-paralogs, though, with
respect to the yeast genome, whose divergence predates the duplication of A and
B. Thus the distinction between in- and out-paralogy is only meaningful in the
context of a second, reference genome.
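The relationships in this example can be checked mechanically. The Python sketch below encodes the gene history described above as a nested tuple and classifies a pair of genes by the event at their last common ancestor: a speciation makes them orthologs, a duplication makes them paralogs. The tree encoding, the event labels and the function names are illustrative assumptions, and the order of the two human A duplications is chosen arbitrarily because the text does not specify it.

```python
# Each internal node is (event_label, child, child); leaves are gene names.
# Assumption: HA1/HA2 split last; the notes do not give the order.
TREE = (
    "spec:yeast/animal",
    "Y",
    ("dup:A/B",
     ("spec:worm/human",
      ("dup:WA", "WA1", "WA2"),
      ("dup:HA", ("dup:HA", "HA1", "HA2"), "HA3")),
     ("spec:worm/human", "WB", "HB")),
)


def leaves(node):
    """Return the set of gene names found below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node[1:]))


def path_events(node, g1, g2):
    """Event labels from node down to the last common ancestor of two distinct genes."""
    for child in node[1:]:
        if {g1, g2} <= leaves(child):
            return [node[0]] + path_events(child, g1, g2)
    return [node[0]]


def relation(g1, g2):
    """Orthologs if the last common ancestor is a speciation, paralogs otherwise."""
    lca = path_events(TREE, g1, g2)[-1]
    return "orthologs" if lca.startswith("spec") else "paralogs"


def paralog_kind(g1, g2, reference):
    """In-paralogs if the duplication happened after the reference speciation."""
    path = path_events(TREE, g1, g2)
    return "in-paralogs" if reference in path[:-1] else "out-paralogs"


print(relation("HB", "WB"))                               # orthologs
print(relation("HA1", "HB"))                              # paralogs
print(paralog_kind("HA1", "HB", "spec:worm/human"))       # out-paralogs
print(paralog_kind("HA1", "HB", "spec:yeast/animal"))     # in-paralogs
```

The same pair (HA1, HB) changes from out-paralogs to in-paralogs when the reference speciation changes, which is the point made in the paragraph above.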
Measuring
Divergences between species
Hemoglobin
proteins play an important physiological role in transferring oxygen from
the lungs throughout the body. Humans have multiple genes encoding different
hemoglobin proteins, the α, β, and γ hemoglobins
among them. The major adult hemoglobin is composed of 2 α
proteins and 2 β proteins. The major fetal
hemoglobin is composed of 2 α proteins and 2 γ
proteins.
When
Zuckerkandl and Pauling compared the protein sequences of α, β, and
γ hemoglobins
in humans with those of cow and horse, they found an interesting pattern. While
the α’s of various organisms showed different levels of
similarity to one another, all of the α’s were at roughly the same level of
similarity to all of the non-α’s, regardless of the organisms
examined. This suggested that the precursor of α and β duplicated
before the divergence of the three organisms. The fact that, since the α/β divergence,
all of the organisms accumulated a similar number of differences suggested that
“there may thus exist a molecular clock”. In other words, the number of
substitutions accumulated by genes may be proportional to the amount of time
elapsed.
Each
protein family keeps time at a different rate. Histone genes, for example, are
highly conserved throughout eukaryotes, with very few changes. Cytochrome c and
hemoglobins show a moderate number of differences, while fibrinopeptides exhibit a fast
rate of change. Different locations within a gene also keep time at different rates.
For example, coding regions corresponding to the outside of
the protein have a higher rate of change than regions corresponding to the inside of the
protein.
One way to
test for a molecular clock is to limit the number of organisms – thus testing
for a local clock. Among the organisms mouse, rat, hamster and human, mouse
and rat are thought to be sister taxa that diverged from the hamster lineage, which in turn
diverged from the ancestor of humans. Based upon such a set of relationships,
it is expected that a molecular clock – if present – would show that the distance
from mouse to hamster is the same as that from rat to hamster. Similarly, the
distances from mouse, rat and hamster to human should all be
equal. Indeed, a study found that on average 30.3 out of 100 synonymous sites
(sites at which mutations do not change the encoded amino acid – more on this in the next
section) differ between mouse and hamster and 31.3 between rat and
hamster. Likewise, the mouse, rat, and hamster differences with respect to human
are 53.4, 51.6, and 52.3 – all very similar. Thus we can say that a molecular
clock ticks regularly in mouse, rat, and hamster.
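To make "all very similar" concrete, the following Python sketch computes how far apart the quoted distances are relative to their mean. The function name and the 5% tolerance are arbitrary assumptions for illustration; this is a quick sanity check on the numbers above, not a statistical test.

```python
def relative_spread(distances):
    """(max - min) / mean of a list of distances."""
    mean = sum(distances) / len(distances)
    return (max(distances) - min(distances)) / mean


# Differences per 100 synonymous sites, as quoted in the text.
to_hamster = [30.3, 31.3]        # mouse-hamster, rat-hamster
to_human = [53.4, 51.6, 52.3]    # mouse-, rat-, hamster-human

TOLERANCE = 0.05  # arbitrary threshold chosen for this illustration

for label, values in [("to hamster", to_hamster), ("to human", to_human)]:
    spread = relative_spread(values)
    verdict = "clock-like" if spread < TOLERANCE else "not clock-like"
    print(f"{label}: spread = {spread:.1%} ({verdict})")
```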
There are
two important reasons to be interested in molecular clocks. First and foremost
we ask: What does the molecular clock signify? In other words: Why do changes to
the genome proceed in a clocklike manner? Remember, this is radically different
from the fits and starts of evolution at the phenotypic level, as in the ribs
of the trilobites. A well-known model, the neutral mutation
theory, can explain this phenomenon; we discuss it in great detail in the
next section, so for the time being we postpone this issue. The second reason
for interest is the clock’s usefulness in establishing a time scale for evolution. The
molecular clock essentially maps out the evolutionary past.
Each
molecular clock has its own rate – substitutions per unit of time. Thus, upon choosing
a clock, for example the α hemoglobins across species, we
calibrate it by determining this rate. By comparing a pair of sequences, the number
of substitutions – call it K – can be estimated (below we learn to correct for
multiple substitutions at the same site). The K substitutions occurred since
the time of speciation – T years ago – between the two lineages. Since both
lineages continue to evolve after speciation, we say that the sequences have
evolved separately for a total time of 2T. The rate of substitution – r – is then K/2T. K
is not difficult to derive, but how can T be established? Molecular evolution is
at a loss here and calls upon the fossil record. Time T is thus an assumption
that must be made in order to use the clock.
With the
evolutionary rate of the clock in hand, we can proceed to date other
divergences. Given two sequences of the same gene family, their K divided by
twice the rate is the time of speciation – the number of years since their
divergence. It is important to consider that the clock is not a metronome but
stochastic. Deviations are expected within statistical bounds.
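The calibration and dating steps described above amount to two one-line formulas, r = K/2T and T = K/2r. The Python sketch below spells them out. The function names and all numeric values are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from any study.

```python
def calibrate_rate(k, t_years):
    """Substitution rate r = K / (2T), given K and a fossil-based divergence time T."""
    return k / (2.0 * t_years)


def divergence_time(k, rate):
    """Divergence time T = K / (2r), given K and a calibrated rate r."""
    return k / (2.0 * rate)


# Hypothetical calibration pair: K = 0.2 substitutions per site,
# with the fossil record suggesting a split 100 million years ago.
rate = calibrate_rate(0.2, 100e6)

# Hypothetical second pair from the same gene family with K = 0.05.
t = divergence_time(0.05, rate)

print(f"rate = {rate:.2e} substitutions per site per year")
print(f"estimated divergence = {t / 1e6:.0f} million years ago")
```

Because the clock is stochastic, an estimate like this is better read as the center of a range than as a precise date.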
The
molecular clock is so commonplace that it is often assumed as a null
hypothesis. When a speed-up or slow-down occurs in the clock, it is evidence
of a mechanism impinging upon the clock. For example, a comparison of a
combined sequence of α and β hemoglobins,
cytochrome c, and fibrinopeptide A among mammalian groups revealed strong
congruence between molecular clock estimates and dates from the fossil record.
Interestingly, the molecular clock predicts a much more recent ape-human
divergence than the fossil estimates, suggesting (if the fossil record is to be
trusted) that a clock slowdown has occurred along these lineages. The opposite
was observed for the horse-donkey lineages, suggesting a speed-up
of the clock. Or perhaps this points to an interesting discovery about human
evolution? (Continue reading.)
A molecular
clock may thus hold for a subset of species and not for others. For example,
insulin A and B show only moderate differences among human, horse, rabbit,
whale, and bovine. However, the sequence in guinea pig contains more
than five times as many differences as are found among the other
sequences. It is clear that the insulin clock does not hold along the lineage
leading to the guinea pig.
Tajima’s
relative rate test can be used to determine statistically whether a molecular clock
can be ruled out between two lineages – that is, whether there is too much variation in the
number of substitutions that each has accumulated. The method requires
a third (outgroup) sequence in addition to the two whose molecular clock we
wish to test. Based upon the third sequence – which is assumed to have diverged
earlier in time from the two in question – we assign each difference in sequence
to one of the two lineages, inferring the ancestral nucleotide from the third. The
problem can be stated as: are the numbers of substitutions that each has
accumulated different enough to rule out a molecular clock? The answer is
determined using a $\chi^2$ test of the form $\chi^2 = (m_1 - m_2)^2 / (m_1 + m_2)$,
where $m_1$ and $m_2$ are the numbers of substitutions (K) that
each lineage has accumulated. If the P-value of the $\chi^2$ statistic
with one degree
of freedom is below 0.05, we can rule out the molecular clock.
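A minimal Python sketch of the test as described above follows. The function names and the toy sequences are assumptions made for illustration. Sites where all three sequences differ are skipped here because they cannot be assigned to either lineage. The P-value uses the identity that, for one degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x/2)), which avoids any dependency outside the standard library.

```python
import math


def count_lineage_changes(seq1, seq2, outgroup):
    """Count differences assigned to lineage 1 (m1) and lineage 2 (m2)."""
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a == b:
            continue          # no difference between the two lineages
        if b == o:
            m1 += 1           # outgroup agrees with seq2: change in lineage 1
        elif a == o:
            m2 += 1           # outgroup agrees with seq1: change in lineage 2
        # otherwise all three differ; the site is uninformative
    return m1, m2


def relative_rate_test(m1, m2):
    """Return (chi2, p_value) for chi2 = (m1 - m2)^2 / (m1 + m2), 1 d.f."""
    if m1 + m2 == 0:
        return 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy alignment (hypothetical): lineage 1 carries 8 changes, lineage 2 carries 1.
outgroup = "AAAAAAAAAA"
seq1 = "CCCCCCCCAA"
seq2 = "AAAAAAAAAG"

m1, m2 = count_lineage_changes(seq1, seq2, outgroup)
chi2, p = relative_rate_test(m1, m2)
print(m1, m2, round(chi2, 3), round(p, 4))
if p < 0.05:
    print("molecular clock rejected for this pair")
```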
Deviations
from the molecular clock may be due to multiple processes. Implicit in our
conception of time is that generation times are equivalent among the
compared organisms. However, a population with a short generation time is
expected to evolve faster, since more generations elapse per unit of time. Indeed, the
rate of substitution is higher in monkeys than in humans and is even higher in
rodents. These observations are consistent with the generation time effect
hypothesis; however, much controversy surrounds this issue and it does not appear
to have a wide consensus in the literature.
The
mutation rate itself is not the same across organisms and this seems to be
responsible for some of the deviations of the molecular clock. If a lineage has
a higher mutation rate it is likely to accumulate more substitutions and evolve
faster than a sister lineage. More regarding what may account for molecular
clock deviations will be given in the next section.
Calculating
the number of substitutions
We turn our
attention to the dynamics of evolutionary change of the representative sequence
(henceforth sequence). Imagine the following simple model for sequence
evolution: 1) beginning with a sequence of 10,000 bp, 2) choose one basepair at
random, 3) substitute the basepair, and 4) repeat for 5,000 substitutions.
Besides simple substitutions of one of the original bases with a new base, two
notable complications arise. Multiple hits to the same base will cause us to
underestimate the number of substitutions that occurred, K. Multiple hits may also amount
to back substitutions – reversions to the original base – which likewise lead us to
underestimate K. Thus, due to repeated mutations at the same sites, the sequence does
not change linearly with the number of accumulated substitutions. For example,
after 5,000 substitutions to one sequence, the sequence is still roughly 65% identical
to the original sequence.
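The simple model above is easy to simulate. The Python sketch below follows the four steps literally, under the added assumption that each substitution replaces the current base with one of the other three bases chosen uniformly at random. The function name, the random seed and the uniform choice are assumptions. Under them the identity should come out in the low-to-mid 60s percent, in line with the figure quoted above and well above the 50% that would be expected if every substitution hit a fresh site.

```python
import random

BASES = "ACGT"


def simulate_identity(length=10_000, n_substitutions=5_000, seed=0):
    """Fraction of sites still identical to the original after random substitutions."""
    rng = random.Random(seed)
    original = [rng.choice(BASES) for _ in range(length)]
    evolved = list(original)
    for _ in range(n_substitutions):
        site = rng.randrange(length)                 # step 2: pick a site at random
        alternatives = [b for b in BASES if b != evolved[site]]
        evolved[site] = rng.choice(alternatives)     # step 3: substitute the base
    same = sum(o == e for o, e in zip(original, evolved))
    return same / length


print(f"identity after 5,000 substitutions: {simulate_identity():.1%}")
```

The gap between the observed difference (about 35 to 37% of sites) and the true 0.5 substitutions per site is exactly what a multiple-hit correction has to recover.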
How can the
number of substitutions that occurred between two sequences be inferred from their
sequence identity? Correcting the fraction of observed differences, D, as
$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$ yields an estimate K, in substitutions per site, that grows linearly with the actual number of substitutions.
This is the Jukes-Cantor correction.
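As a rough illustration, the Jukes-Cantor correction in its standard form, K = -(3/4) ln(1 - (4/3) D), can be written as a small Python function. The function name is an assumption. D is the fraction of sites that differ and K is the estimated number of substitutions per site.

```python
import math


def jukes_cantor(d):
    """Estimated substitutions per site, K, from the observed fraction of differences, D."""
    if not 0.0 <= d < 0.75:
        # At D = 0.75 the sequences are as different as random ones; K is undefined.
        raise ValueError("D must lie in the interval [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


for d in (0.01, 0.10, 0.365, 0.60):
    print(f"D = {d:.3f}  ->  K = {jukes_cantor(d):.3f}")
```

For small D the correction barely changes the value, while an observed difference of about 36.5% corresponds to roughly 0.5 substitutions per site, consistent with the 5,000 substitutions on 10,000 bp example discussed earlier.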
Do all
substitutions occur without bias as to which base replaces which? Our model has
assumed so, but do the data show this? We find that transitions
(A↔G, T↔C) occur disproportionately more frequently than
transversions (all remaining changes). To reflect this, a two-parameter model
has been proposed: one parameter, α, for
transitions and one parameter, β, for transversions. Now, given two
sequences with D differences, we ask what fraction of sites differ by transitions (P) and what
fraction by transversions (Q), and use these as inputs to a correction that takes the
different rates into consideration.
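The notes do not spell out the two-parameter correction at this point. The sketch below assumes it is Kimura's two-parameter formula in its standard form, K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The function names and the example sequences are illustrative. P and Q are the fractions of aligned sites that differ by a transition and by a transversion, respectively.

```python
import math

# Transitions are purine<->purine (A, G) or pyrimidine<->pyrimidine (C, T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): fractions of sites differing by transition and by transversion."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    p = q = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            p += 1
        else:
            q += 1
    return p / len(seq1), q / len(seq1)


def kimura_two_parameter(p, q):
    """Estimated substitutions per site under the assumed two-parameter model."""
    x, y = 1.0 - 2.0 * p - q, 1.0 - 2.0 * q
    if x <= 0.0 or y <= 0.0:
        raise ValueError("sequences are too divergent for this correction")
    return -0.5 * math.log(x) - 0.25 * math.log(y)


p, q = transition_transversion_fractions("AAAACCCCGG", "GAAACCCTGC")
print(p, q, round(kimura_two_parameter(p, q), 4))
```

When transitions and transversions are equally likely per base change (P = D/3, Q = 2D/3), this expression reduces to the one-parameter Jukes-Cantor correction.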
In general,
one can envisage a total of 12 parameters corresponding to each of the
mutations from one base to another. We find, however, that for closely related sequences
all models give roughly the same K, and that in most cases the one-parameter
Jukes-Cantor model is satisfactory.
How many
millions of years separate us from the apes?
Based upon
heroic searches of fossils with human characteristics, paleontologists in the
1960s held that early humans first appeared 30 million years ago. However,
molecular techniques emerging at that time radically changed this scenario.
Sarich and Wilson published a paper in 1967 in which they used an immunological
approach to compare proteins across species. Their analysis yielded an estimate
of 5 million years since the divergence from the chimpanzee, based upon a
calibration of 30 million years for the split of apes from the Old World
monkeys. The leading paleontologist of the time, the great Louis Leakey, was
adamantly opposed to such a recent human origin. However, overwhelming evidence
now supports this estimate and has demonstrated the power of molecular
clocks.
One
piece of supporting evidence came from the sequencing of the chimpanzee genome in
2005 by a large consortium. Overall, the human and chimpanzee genomes differ by
1.23% at the DNA level. Most of these changes, however, do not lead to coding
differences. In fact, for 29% of the orthologs the human and chimpanzee
proteins are completely identical, and the typical difference is just one
amino acid per lineage.
Application
of the molecular clock established a timescale for vertebrate evolution. The
calibration was performed based upon sound fossil evidence that mammals and
birds diverged 310 million years ago. For each divergence, the study used not one but
tens of gene families to establish a distribution of molecular clock estimates.
The final estimate was taken as the peak of the observed, approximately normal distribution.
The result is an evolutionary tree in which the chimp diverged
from human 5.5 million years ago – our closest living relative, along with the
bonobo. The chimp is followed by the gorilla, orangutan, and gibbon.