Course notes for the November 1st 2009 lecture of the Genome Evolution Course

 

Itai Yanai

Department of Biology

Technion – Israel Institute of Technology

yanai@technion.ac.il

 

Clockwork Evolution at the Molecular Level

 

Phenotypic and molecular rates of evolution

 

Phenotypic evolution seems to be fundamentally different from molecular evolution. Darwin addressed phenotypic evolution in Chapter 10 of his “Origin of Species”, noting that different organisms evolve at different rates:

 

Species of different genera and classes have not changed at the same rate, or in the same degree…. The Silurian Lingula differs but little from the living species of this genus; whereas most of the other Silurian Molluscs and all the Crustaceans have changed greatly. The productions of the land seem to change at a quicker rate than those of the sea, of which a striking instance has lately been observed in Switzerland. There is some reason to believe that organisms, considered high in the scale of nature, change more quickly than those that are low: though there are exceptions to this rule.

 

As another example, consider trilobites – hard-shelled sea-dwelling creatures that existed 300 million years ago – and the evolution of the number of ribs they contain. The fossil record is particularly rich in trilobites and thus allows us to trace the number of ribs for different trilobite lineages. Different patterns are apparent. In the Nobiliasaphus lineage we detect a clear linear trend of an increase in rib number with time. For Cnemidopyge the number of ribs increases quickly to eight ribs on average and is then steadily maintained. In other lineages we find zigzag and circuitous patterns.

 

In this section, we investigate the pattern of evolutionary change as recorded in the genome. We will make the remarkable discovery that the rate of molecular change is of a completely different nature than the rate of phenotypic change.

 

Evolution is the change in allele frequencies. This was the definition we found valuable at the level of populations, examined in the previous section. Indeed changes in allele frequencies are the atoms of evolutionary change. In this section, we turn our attention to their effects at a one-step higher level of abstraction – the evolving genomic sequence. This involves an encapsulation of populations onto one representative sequence.

 

The representative sequence captures evolutionary changes in terms of substitutions. Each nucleotide in the sequence corresponds to the wild-type version. When the population genetic processes described earlier bring about the fixation of a mutant allele (a base in the sequence), the representative sequence is substituted at that position. Effectively, the long process of an allele’s rise to fixation is captured as one step by the representative sequence.

 

Family relationships among genes

 

In studying genome evolution, we must now acquaint ourselves with one of the major concepts: homology.

 

Sir Richard Owen was the first Superintendent of the British Museum and an expert in comparative anatomy. Based upon his examination of the skeletons of the various classes of vertebrates – fish, reptiles, birds, and mammals – he came to the conclusion that vertebrates have a common structural plan which he called an “archetype”. Owen introduced the term homology to mean “the same organ in different animals under every variety of form and function” (1843). Owen meant, for example, that human fingers are homologous to bird fingers. Owen had his own particular explanation for the existence of common archetypes, which he saw as evidence that the creator had a common design for all vertebrates.

 

Following Darwin’s theory of evolution, homology took on a new significance. Often it is observed that a sequence from one organism shows a high fraction of identity with a sequence from another organism. If the similarity is too strong to be reasonably explained by chance, common ancestry may be inferred. The two sequences are then said to be homologous. Thus, in the light of evolution, homology is reinterpreted as similarity due to common ancestry.

 

As in family relations, it is particularly useful to further specify the types of homologies that can take place. In the example, the two sequences can trace their ancestry from one sequence based upon a speciation event that led to the two contemporary species. Homologous sequences due to speciation events are referred to as orthologs. Orthologs can be seen as “evolutionary counterparts” (as Eugene Koonin refers to them). They are “the same” gene in different organisms.

 

The second major type of homology is a result of gene duplication events, frequent in genome evolution. Two homologous sequences can descend from a single sequence that underwent a duplication event. Such homologous sequences are referred to as paralogs. They can be seen as continuing their existence in parallel to one another within the same genome. Thus duplication events yield paralogous relationships.

 

Even with the concepts of paralogy and orthology, however, describing the relationships among genes can easily become complicated. Consider the history of eight genes from yeast (one gene), human (4 genes) and worm (3 genes). Note that in the last common ancestor of these three organisms, these 8 were represented as only one gene – the yeast genome still contains only one as well. Along the animal lineage – before the speciation of the worm and human ancestors – the primordial gene underwent gene duplication, forming two paralogs A and B. Some time later the speciation occurred and each lineage received its own copies of A and B – two sets of orthologs. The two paralogs, A and B, had different fates. B has not been duplicated since the speciation event and thus both worm and human have a single copy of B. Following speciation, A, however, underwent one duplication event in worm (2 A genes) and two duplication events in human (3 A genes). What are the relations among these genes?

 

The yeast gene is considered orthologous to all worm and human genes, which are in turn all co-orthologous to the yeast gene. So, which gene is the ortholog of the yeast gene in humans? Due to the duplications, there is not one gene – HA1, HA2, HA3, and HB are all co-orthologous to the yeast gene. And which gene is the ortholog of the HB gene in worm? This is simple: since there were no duplications of the B gene in worm since the speciation from human, the ortholog is simply WB. Similarly, the co-orthologs of WA1 and WA2 are HA1, HA2, and HA3.
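The reasoning in this example can be sketched in code. The minimal Python sketch below encodes the gene history described above as a tree and calls two genes orthologs when their last common ancestor is a speciation node, and paralogs when it is a duplication node. The node names (`root`, `dupAB`, `specA` and so on) are made up for illustration, and the order of the two human A duplications (HA2 and HA3 splitting last) is an assumption, since the text does not specify it.

```python
# Gene tree for the yeast/worm/human example: child -> parent.
# Internal node names are hypothetical labels, not from the notes.
PARENT = {
    "Y": "root", "dupAB": "root",
    "specA": "dupAB", "specB": "dupAB",
    "dupWA": "specA", "dupHA": "specA",
    "WA1": "dupWA", "WA2": "dupWA",
    "HA1": "dupHA", "dupHA23": "dupHA",   # assumed order of human duplications
    "HA2": "dupHA23", "HA3": "dupHA23",
    "WB": "specB", "HB": "specB",
}

# Type of evolutionary event at each internal node.
EVENT = {
    "root": "speciation",      # yeast vs. animals
    "dupAB": "duplication",    # A/B duplication before worm-human split
    "specA": "speciation",     # worm-human split, A copy
    "specB": "speciation",     # worm-human split, B copy
    "dupWA": "duplication",
    "dupHA": "duplication",
    "dupHA23": "duplication",
}

GENES = ["Y", "WA1", "WA2", "WB", "HA1", "HA2", "HA3", "HB"]


def ancestors(node):
    """Return the internal nodes from a gene up to the root, nearest first."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def relation(gene1, gene2):
    """Classify two distinct genes by the event at their last common ancestor."""
    above2 = set(ancestors(gene2))
    lca = next(a for a in ancestors(gene1) if a in above2)
    return "orthologs" if EVENT[lca] == "speciation" else "paralogs"


def co_orthologs(gene, species_prefix):
    """All genes of one species (by name prefix) orthologous to `gene`."""
    return [g for g in GENES
            if g != gene and g.startswith(species_prefix)
            and relation(gene, g) == "orthologs"]


if __name__ == "__main__":
    print(co_orthologs("Y", "H"))    # human co-orthologs of the yeast gene
    print(co_orthologs("HB", "W"))   # worm ortholog of HB
    print(co_orthologs("WA1", "H"))  # human co-orthologs of WA1
```

Under these assumptions, the three queries reproduce the answers given in the text: four human co-orthologs for the yeast gene, WB alone for HB, and the three human A genes for WA1.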

 

A recent useful addition to the terminology of evolutionary relationships is the distinction between in-paralogs and out-paralogs. Genes HA1, HA2, and HA3 are considered in-paralogs with respect to the worm genome, since the duplications that brought them about occurred after the speciation event. By contrast, the family of HA* and HB are out-paralogs with respect to the worm genome since they stem from a duplication event that occurred before – or outside – the speciation event. The same set of genes is considered in-paralogs, though, with respect to the yeast genome, whose divergence predates the duplication of A and B. Thus the specification of in- and out-paralogy is only relevant in the context of another – more distant – genome.

 

Measuring divergences between species

 

Hemoglobin proteins perform an important physiological function in transferring oxygen from the lungs throughout the body. Humans have multiple genes encoding different hemoglobin proteins – α, β, and γ hemoglobins among them. The major adult hemoglobin is composed of 2 α proteins and 2 β proteins. The major fetal hemoglobin is composed of 2 α proteins and 2 γ proteins.

 

When Zuckerkandl and Pauling compared the protein sequences of α, β, and γ hemoglobins in humans with those of cow and horse they found an interesting pattern. While the α’s of various organisms showed different levels of similarity to one another, all of the α’s were of roughly the same level of similarity to all of the non-α’s regardless of the organisms examined. This suggested that the precursor of α and β duplicated before the divergence of the three organisms. The fact that all of the organisms have accumulated a similar amount of differences since their divergence suggested that “there may thus exist a molecular clock”. In other words, the number of substitutions accumulated by genes may be proportional to the amount of time passed.

 

Each protein family keeps time at a different rate. Histone genes for example are very conserved throughout eukaryotes with very few changes. Cytochrome C and hemoglobins have a fair amount of differences while fibrinopeptides exhibit a fast rate of change. Different locations in the gene also keep different time rates. For example, it is observed that coding regions corresponding to the outside of the protein have a higher rate of change than regions in the inside of the protein.

 

One way to test for a molecular clock is to limit the number of organisms – thus examining for a local clock. Among the organisms mouse, rat, hamster and human – mouse and rat are thought to be sister taxa diverging from the hamster, which in turn diverged from the ancestor of humans. Based upon such a set of relationships, it is expected that a molecular clock – if present – would show that the distance from mouse to hamster is the same as that from rat to hamster. Similarly, the distances from mouse, rat and hamster to human, respectively, should all be equal. Indeed, a study found that on average 30.3 out of 100 synonymous sites (sites where mutations do not change the encoded amino acid – more on this in the next section) are different between mouse and hamster and 31.3 between rat and hamster. Also, the mouse, rat, and hamster differences with respect to humans are 53.4, 51.6, and 52.3; all very similar. Thus we can say that a molecular clock ticks regularly in mouse, rat and hamster.

 

There are two important reasons to be interested in molecular clocks. First and foremost we ask: What does the molecular clock signify? In other words: Why do changes to the genome proceed in a clocklike manner? Remember, this is radically different from the fits and starts of evolution at the phenotypic level, as in the ribs of the trilobites. A well-known model, the neutral mutation theory, can explain this phenomenon; we discuss it in great detail in the next section, so for the time being we postpone this issue. The second reason for interest is the clock’s usefulness in establishing a time scale for evolution. The molecular clock essentially maps out the evolutionary past.

 

Each molecular clock has its own rate – substitutions per unit of time. Thus upon choosing a clock, for example the α hemoglobins across species, we calibrate it by determining this rate. By comparing a pair of sequences, the number of substitutions – call it K – can be estimated (below we learn to correct for multiple substitutions at the same site). The K substitutions occurred since the time of speciation – T years ago – between the two lineages. Since both lineages continue to evolve after speciation, we say that the sequences have evolved separately for a total time of 2T. The rate of substitution – r – is then K/(2T). K is not difficult to derive, but how can T be established? Molecular evolution is at a loss here and calls upon the fossil record. Time T is thus an assumption that must be made in order to use the clock.

 

With the evolutionary rate of the clock in hand, we can proceed to date other divergences. Given two sequences of the same gene family, their K divided by twice the rate is the time of speciation – the number of years since their divergence. It is important to consider that the clock is not a metronome but stochastic. Deviations are expected within statistical bounds.
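The calibration and dating steps amount to two lines of arithmetic, sketched below in Python. The function names and the numbers in the usage example are invented purely for illustration; they are not measurements from any study.

```python
def substitution_rate(k, t_years):
    """Calibrate the clock: r = K / (2T).

    k        -- substitutions per site between the two calibration sequences
    t_years  -- divergence time T taken from the fossil record (an assumption)
    """
    return k / (2 * t_years)


def divergence_time(k, rate):
    """Date another divergence with a calibrated clock: T = K / (2r)."""
    return k / (2 * rate)


if __name__ == "__main__":
    # Hypothetical calibration pair: K = 0.2 substitutions per site,
    # fossil-based divergence time of 100 million years.
    r = substitution_rate(0.2, 100e6)
    # Hypothetical second pair from the same gene family with K = 0.05.
    t = divergence_time(0.05, r)
    print(f"rate = {r:.2e} substitutions per site per year")
    print(f"estimated divergence = {t / 1e6:.1f} million years ago")
```

Because the clock is stochastic, a single estimate of this kind should be read with statistical bounds in mind rather than as an exact date.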

 

The molecular clock is so commonplace that it is often assumed as a null hypothesis. When a speed-up or slow-down occurs in the clock, it is evidence of a mechanism impinging upon the clock. For example, a comparison of a combined sequence of α and β hemoglobins, cytochrome c, and fibrinopeptide A among mammalian groups revealed strong congruence between molecular clock estimates and dates from the fossil record. Interestingly, the molecular clock predicts a much more recent ape-human divergence than the fossil estimates, suggesting (if the fossil record is to be trusted) that a clock slowdown has occurred along these lineages. The opposite was observed for the horse-donkey lineages, suggesting a speeding up of the molecular clock. Or perhaps this points to an interesting discovery about human evolution? (Continue reading...)

 

A molecular clock may thus hold for a subset of species and not for others. For example, insulin A and B show only moderate differences among human, horse, rabbit, whale and bovine. However, the sequence in guinea pig contains more than five times the number of differences found among the other sequences. It is clear that the insulin clock does not hold along the lineage leading to the guinea pig.

 

Tajima’s relative rate test can be used to statistically determine whether a molecular clock can be ruled out between two lineages – that is, whether there is too much variation in the number of substitutions that each has accumulated. The method requires a third (outgroup) sequence in addition to the two whose molecular clock we wish to test. Based upon the third sequence – which is assumed to have diverged earlier in time from the two in question – we assign each difference in sequence to one of the two lineages, inferring the ancestral nucleotide from the third. The problem can be stated as: are the numbers of substitutions that each lineage has accumulated different enough to rule out a molecular clock? The answer is determined using a $\chi^2$ test of the form $\chi^2 = (m_1 - m_2)^2 / (m_1 + m_2)$, where $m_1$ and $m_2$ are the numbers of substitutions (K) that each lineage has accumulated. If the P-value of the $\chi^2$ with one degree of freedom is below 0.05, we can rule out the molecular clock.
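The test described above can be sketched in a few lines of Python. This is a minimal illustration, assuming three pre-aligned, equal-length sequences without gaps; the function name is hypothetical. A difference is assigned to lineage 1 when sequence 2 still matches the outgroup, and to lineage 2 when sequence 1 still matches it. Sites where all three differ are skipped because they cannot be assigned. The P-value for one degree of freedom is computed with the standard-library identity P = erfc(sqrt(chi2 / 2)).

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup):
    """Return (m1, m2, chi2, p_value) for a relative rate test.

    m1 counts sites where only seq1 differs from the outgroup,
    m2 counts sites where only seq2 differs from the outgroup.
    """
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")

    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a == b:
            continue          # no difference between the two lineages here
        if b == o:
            m1 += 1           # change assigned to lineage 1
        elif a == o:
            m2 += 1           # change assigned to lineage 2
        # otherwise all three differ: the site is uninformative and skipped

    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0

    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p_value


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    outgroup = "A" * 40
    seq1 = "C" * 20 + "A" * 20            # 20 changes on lineage 1
    seq2 = "A" * 35 + "G" * 5             # 5 changes on lineage 2
    m1, m2, chi2, p = tajima_relative_rate(seq1, seq2, outgroup)
    print(m1, m2, chi2, p)
    print("clock rejected" if p < 0.05 else "clock not rejected")
```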

 

Deviations from the molecular clock are possibly due to multiple processes. Implicit in our conception of time is that the generation times are equivalent among the compared organisms. However, a population with a short generation time is expected to evolve faster, since more generations elapse in the same amount of time. Indeed the rate of substitution is higher in monkeys than in humans and is even higher in rodents. These observations are consistent with the generation time effect hypothesis; however, much controversy surrounds this issue and there is no wide consensus in the literature.

 

The mutation rate itself is not the same across organisms and this seems to be responsible for some of the deviations of the molecular clock. If a lineage has a higher mutation rate it is likely to accumulate more substitutions and evolve faster than a sister lineage. More regarding what may account for molecular clock deviations will be given in the next section.

 

Calculating the number of substitutions

 

We turn our attention to the dynamics of evolutionary change of the representative sequence (henceforth sequence). Imagine the following simple model for sequence evolution: 1) begin with a sequence of 10,000 bp, 2) choose one base pair at random, 3) substitute the base pair, and 4) repeat for 5,000 substitutions. Besides simple substitutions of one of the original bases with a new base, two notable complications arise. Multiple hits to the same base will cause us to underestimate the number of substitutions that occurred, K. Multiple hits may also amount to back substitutions – reversions to the original base – which likewise lead us to underestimate K. Thus, due to repeated mutations at the same sites, the sequence does not change linearly with the number of accumulated substitutions. For example, after 5,000 substitutions to one sequence, the sequence is still about 65% identical to the original sequence.
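The four-step model above is easy to simulate. The Python sketch below is one possible reading of it, assuming that each substitution replaces the chosen base with one of the three other bases picked uniformly at random; the function name and the seed are arbitrary choices for illustration.

```python
import random


def simulate_identity(length=10_000, n_substitutions=5_000, seed=0):
    """Apply random substitutions to a random sequence and return the
    fraction of sites still identical to the original."""
    rng = random.Random(seed)
    bases = "ACGT"

    # Step 1: start from a random sequence of the requested length.
    original = [rng.choice(bases) for _ in range(length)]
    current = list(original)

    for _ in range(n_substitutions):
        # Step 2: choose one position at random (sites can be hit repeatedly).
        pos = rng.randrange(length)
        # Step 3: substitute with a different base (assumed uniform choice).
        current[pos] = rng.choice([b for b in bases if b != current[pos]])

    identical = sum(o == c for o, c in zip(original, current))
    return identical / length


if __name__ == "__main__":
    identity = simulate_identity()
    print(f"identity after 5,000 substitutions: {identity:.1%}")
```

Although 5,000 substitutions are applied to 10,000 sites, noticeably fewer than half of the sites end up different, because some substitutions land on already-changed sites and some revert a site to its original base. The exact identity varies slightly from run to run with the seed.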

 

How can the number of substitutions that occurred between two sequences be inferred from their sequence identity? Correcting the fraction of observed differences, D, as $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$ yields an estimate K that grows linearly with the actual number of substitutions. This is the Jukes-Cantor correction.
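As a small worked example, the standard Jukes-Cantor formula, K = -(3/4) ln(1 - (4/3) D), can be written as a short Python function. The function name is illustrative, and the input value in the usage example is simply the observed difference implied by 65% identity.

```python
import math


def jukes_cantor_distance(d):
    """Estimate substitutions per site, K, from the observed fraction of
    differing sites, D, using the Jukes-Cantor correction."""
    if not 0 <= d < 0.75:
        # At D >= 0.75 the sequences look random and K is undefined.
        raise ValueError("D must satisfy 0 <= D < 0.75")
    return -0.75 * math.log(1 - 4 * d / 3)


if __name__ == "__main__":
    # 65% identity corresponds to D = 0.35 observed differences per site.
    for d in (0.01, 0.10, 0.35):
        print(f"D = {d:.2f} -> K = {jukes_cantor_distance(d):.3f}")
```

For small D the correction barely changes the value, while for larger D the corrected K is visibly greater than D, reflecting the hidden multiple hits and back substitutions.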

 

Do all substitutions occur without bias as to which base replaces which? Our model has assumed so, but do the data show this? We find that transition events (A<->G, T<->C) occur disproportionately more frequently than transversions (all remaining changes). To reflect this, a two-parameter model has been proposed: one parameter α for transitions and one parameter β for transversions. Now, given two sequences with D differences, we ask what fraction of sites differ by transitions (P) and what fraction by transversions (Q), and use these as inputs to a correction that takes the different rates into consideration.
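A two-parameter correction of this kind can be sketched as follows. The notes do not give the formula, so this example assumes the commonly used Kimura two-parameter form, K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q); the function name and the toy sequences are made up for illustration, and sites containing anything other than A, C, G or T are simply skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def two_parameter_distance(seq1, seq2):
    """Return (P, Q, K): transition fraction, transversion fraction and the
    corrected number of substitutions per site (Kimura two-parameter form,
    assumed here)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1              # A<->G or C<->T
        else:
            transversions += 1            # purine <-> pyrimidine

    if sites == 0:
        raise ValueError("no comparable sites")

    p = transitions / sites
    q = transversions / sites
    k = -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
    return p, q, k


if __name__ == "__main__":
    # Toy pair: 10 transitions and 5 transversions over 100 sites.
    s1 = "A" * 100
    s2 = "G" * 10 + "C" * 5 + "A" * 85
    p, q, k = two_parameter_distance(s1, s2)
    print(f"P = {p:.2f}, Q = {q:.2f}, K = {k:.4f}")
```

With few differences, as in this toy pair, the result stays close to what a one-parameter correction would give for the same total D = P + Q.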

 

In general, one can envisage a total of 12 parameters, one for each of the possible mutations from one base to another. We find, however, that for closely related sequences all models give roughly the same K, and that in most cases the one-parameter Jukes-Cantor model is satisfactory.

 

How many millions of years separate us from the apes?

 

Based upon heroic searches for fossils with human characteristics, paleontologists in the 1960s held that early humans first appeared 30 million years ago. However, molecular techniques emerging at that time radically shifted this scenario. Sarich and Wilson published a paper in 1967 in which they used an immunological approach to compare proteins across species. Their analysis yielded an estimate of 5 million years since the divergence from chimpanzee, based upon a calibration of 30 million years for the split of apes from the Old World monkeys. The leading paleontologist of the time, the great Louis Leakey, was adamantly opposed to such a recent human origin. However, overwhelming evidence now supports this estimate and has demonstrated the power of molecular clocks.

 

One piece of supporting evidence is the sequencing of the chimpanzee genome in 2005 by a large consortium. Overall, the human and chimpanzee genomes differ by 1.23% at the DNA level. Most of these changes, however, do not lead to coding differences. In fact, for 29% of the orthologs the human and chimpanzee proteins are completely identical, and the typical difference is just one amino acid per lineage.

 

Application of the molecular clock has established a timescale for vertebrate evolution. The calibration was based upon sound fossil evidence that mammals and birds diverged 310 million years ago. For each divergence, the analysis used not one but tens of gene families to establish a distribution of molecular clock estimates. The final estimate was taken as the peak of the observed, approximately normal distribution. The result is an evolutionary tree where we find that the chimp diverged from human 5.5 million years ago – our closest living relative along with the bonobo. The chimp is followed by the gorilla, orangutan, and gibbon.