Course notes for the November 29th 2009 lecture of the Genome Evolution Course

 

Itai Yanai

Department of Biology

Technion – Israel Institute of Technology

yanai@technion.ac.il

 

Evolution by Genome Duplication

 

Looking at the living world we are struck by its diversity. What is the source of biological novelty? How does anything new come about in the genome? What is the birth of creativity, novelty, and diversity? In the model of genome evolution we have described in previous lectures, this novelty arises in two steps: 1. Variation arises in the form of mutations, and 2. These are then sorted by selection and drift (and finally 3. We loop back to Step 1). A useful metaphor is that mutations are the engine of evolution while selection and drift act as the steering wheel.

 

One may see a problem with how selection can lead to radical biological novelty. If selection maintains a particular gene’s function, how can that gene then evolve a new function? In 1970, Susumu Ohno elegantly addressed this problem in his landmark book “Evolution by Gene Duplication”, in which he claimed that “natural selection merely modified, while redundancy created”. Ohno’s logic was that one way to escape the conservative nature of natural selection is to duplicate a gene. Upon duplication, selection may force one copy to maintain the original function while the other copy is free to adopt a new function.

 

Duplications need not occur just at the gene level. The length of duplicated DNA may vary considerably. Duplications may cover stretches of a few base pairs, amounting to several codons or to regulatory elements in a promoter; they may be at the domain level (lecture 8), at the gene level (lecture 7), or span several genes, a large fraction of a chromosome, an entire chromosome, or even the entire genome. These observations refine our view of the standard model by noting that the genome can generate many kinds of changes. Thus, genomic variations come in many flavors, and in this lecture we will explore how they might influence the evolution of diversity.

 

The Drosophila melanogaster genome was published in 2000. It contains 120 megabases of euchromatic sequence. Euchromatin literally means “true chromatin” and contains the non-repetitive sequences. Despite being a more complex organism than the nematode C. elegans, Drosophila contains only ~14,000 genes compared with the ~20,000 of C. elegans. Thus, organismal complexity is not simply reflected by a rise in gene number.

 

One fascinating example of gene duplication is the HOX gene family. Mutations in the HOX genes lead to dramatic changes, such as an extra set of wings when the HOX gene Ultrabithorax (Ubx) is mutated, or a transformation of antennae into a pair of legs when the HOX gene Antennapedia (Antp) is mutated. These genes are called HOX genes because each has a 60-amino-acid homeobox domain which binds DNA and thus allows the genes to act as transcriptional regulators. There are 3 additional striking features of the HOX genes. First, they cluster together in the genome and their order along the chromosome reflects their spatial expression pattern. Second, all animals (metazoans) have HOX genes, suggesting that even very different animals such as a worm and a tiger have a set of important genes in common. And finally, many vertebrates have 4 copies of the HOX cluster of genes (some fish have even more), prompting the question of how duplication has led to diversification in the vertebrate lineage.

 

What is an animal? An additional interesting story regarding the HOX genes can actually help us answer this question. The textbook definition of an animal is “an organism that feeds, moves, and responds to stimuli”; however, that does not seem to capture the real essence. Haeckel first observed that at the ‘tail-bud’ stage all chordates look very similar. It was later proposed to call this the phylotypic stage, because it is universal to the chordate phylum and seems to be a constraint through which all phylum members must pass. In 1993 Jonathan Slack and colleagues took this one important step forward. They proposed that an animal is an organism that displays a particular spatial pattern of gene expression, called the zootype. The zootype defines the spatial order of expression of a set of genes comprising the HOX genes and several others. Thus they proposed that an animal can be accurately defined as any organism that expresses a certain set of genes (HOX) in a particular way during its embryonic development. Interestingly, while non-animals have homeodomain-containing genes, only animals have clusters of HOX genes.


In 1970 Susumu Ohno published what came to be known as the 2R hypothesis: “It is our contention that the ancestors of reptiles, birds, and mammals have experienced at least one tetraploid evolution either at the stage of fish or at the stage of amphibians”. One main piece of evidence was the bimodal distribution of chromosome numbers in fish; many fish have either ~50 or ~100 chromosomes. Ohno concluded from this evidence in 1973 that “A mammalian ancestor might have gone through at least one round of tetraploid evolution at the stage of fish”. Had Ohno had access to the genomic sequences of fish he would have had much additional evidence. For example, there are at least 7 Hox clusters in the zebrafish (Amores et al 1998 Science). How could a genome duplication be tolerated? Because the whole genome is duplicated at once, the dosage relationships among functionally related genes are preserved. Further, each structural gene is duplicated together with its own regulator.

 

In this lecture, we describe many instances of genome duplications, in plants, fungi, ciliates and vertebrates. The most famous genome duplications are the two postulated to have occurred along our own lineage. If two genome duplications occurred along the evolution of vertebrates leading to humans, we expect four vertebrate paralogs for each gene found in a non-vertebrate such as Drosophila. This “one-to-four” rule has many examples, such as the HOX clusters and EGF receptors, but it also has many exceptions. How can these exceptions be explained? As we shall see, genome duplications tend to be followed by extensive gene deletions.

 

In general, the two methods for detecting genome duplication events are spatial and temporal. In the spatial method, a map-based approach uses knowledge of the order of the genes along a completely sequenced genome. In the temporal method, a so-called tree-based approach uses the molecular clock to estimate the age of each duplication. Ken Wolfe from Ireland pioneered the map-based approach to provide molecular evidence for the first detected ancient genome duplication event.

 

Ken Wolfe examined the yeast S. cerevisiae genome, searching for duplicated regions. For each gene he identified the other genes in the genome similar to it. He then identified clusters of such duplicates. For example, on chromosome X he found three blocks that also occur on chromosome XI. Within each duplicated block many genes are present in only one copy of the block, with no duplicate in the other. This is to be expected, since these duplications are old and many rearrangements, and in particular gene losses, have occurred since. Using this method, 55 duplicated regions were identified, containing 376 pairs of homologous genes and together covering over 50% of the genome.
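
The map-based search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Wolfe's actual procedure: the function name, the input format (gene order as integer positions), and the thresholds max_gap and min_pairs are all assumptions made for the example. It chains paralog pairs between two chromosomes whenever consecutive pairs lie close together on both chromosomes, and reports chains with enough pairs as candidate duplicated blocks.

```python
from collections import defaultdict

def find_duplicated_blocks(gene_positions, paralog_pairs, max_gap=10, min_pairs=3):
    """Greedy sketch of map-based block detection (thresholds are hypothetical).

    gene_positions: {gene: (chromosome, index of the gene along the chromosome)}
    paralog_pairs:  iterable of (gene_a, gene_b) similar-gene pairs
    Returns a list of ((chrom_1, chrom_2), [(gene_1, gene_2), ...]) blocks.
    """
    hits = defaultdict(list)
    for a, b in paralog_pairs:
        chr_a, pos_a = gene_positions[a]
        chr_b, pos_b = gene_positions[b]
        if chr_a == chr_b:
            continue  # this sketch only looks for blocks between different chromosomes
        if chr_a > chr_b:  # store every pair in a consistent chromosome order
            a, b, chr_a, chr_b, pos_a, pos_b = b, a, chr_b, chr_a, pos_b, pos_a
        hits[(chr_a, chr_b)].append((pos_a, pos_b, a, b))

    blocks = []
    for chrom_pair, chrom_hits in hits.items():
        chrom_hits.sort()  # walk along the first chromosome
        current = [chrom_hits[0]]
        for hit in chrom_hits[1:]:
            prev = current[-1]
            close = (hit[0] - prev[0] <= max_gap
                     and abs(hit[1] - prev[1]) <= max_gap)
            if close:
                current.append(hit)
            else:
                if len(current) >= min_pairs:
                    blocks.append((chrom_pair, [(h[2], h[3]) for h in current]))
                current = [hit]
        if len(current) >= min_pairs:
            blocks.append((chrom_pair, [(h[2], h[3]) for h in current]))
    return blocks
```

Because the chaining is greedy along one chromosome, two blocks that interleave on that chromosome would be split; a real analysis would also need to handle gene orientation and statistical significance.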

 

The main question is whether these duplicates arose by successive duplications spread over time or by a single simultaneous duplication of the entire genome. Two lines of evidence suggest a genome duplication. First, the 55 duplicated regions form pairs, not triplicates. If the duplications had occurred successively over time we would expect duplications of duplications, leading to triplicates. Wolfe estimated that a model of successive duplications would lead to 7 of the 55 duplicated regions being in triplicate. Second, it was observed that 50 of the 55 duplicated regions conserve their orientation with respect to the centromere. This suggests that most of the changes to the genome occurred by reciprocal translocation, in which nonhomologous chromosomes exchange segments, thus conserving the orientation with respect to the centromere. In other words, the observation that orientation with respect to the centromere is conserved further suggests that the entire genome was duplicated. If smaller duplications had occurred successively we would expect no particular conservation of the centromere orientation.
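
To get a feel for how unlikely 50 conserved orientations out of 55 would be by chance, here is a minimal sketch. The null model used (each region keeps its orientation with probability 1/2, independently of the others) is an assumption made purely for illustration and is not taken from Wolfe's analysis; the function name is likewise hypothetical.

```python
from math import comb

def binomial_upper_tail(n, k, p=0.5):
    """Probability of observing k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of at least 50 of 55 regions conserving orientation
# if orientation were a coin flip (assumed null model).
p_value = binomial_upper_tail(55, 50)
print(f"P(X >= 50 | n = 55, p = 0.5) = {p_value:.2e}")
```

Under this assumed null the probability is on the order of one in ten billion, which is why the conserved orientation is hard to explain by many independent small duplications.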

 

Based upon this evidence, Wolfe concluded that the genome duplication resulted from a fusion of two yeast cells, each containing ~5000 genes. Initially the fused cell was tetraploid, but later, as sequence identity between the two genome copies decayed, the new species came to behave as a diploid. Most of the duplicates (85%) were deleted, leading to the current species with 5,800 genes, many of which are ancestral duplicates.
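
The gene numbers in this scenario can be checked with simple arithmetic. The sketch below assumes one particular reading of the figures: the fusion creates ~5000 duplicate pairs, and "85% deleted" means that 85% of those pairs lose one copy and return to a single gene. The function name is hypothetical.

```python
def genes_after_duplication(ancestral_genes, fraction_of_pairs_lost):
    """Expected gene count after a whole-genome duplication followed by loss.

    Every ancestral gene starts as a pair. A pair that loses one copy counts
    as one gene again; a retained pair counts as two genes.
    """
    retained_pairs = round(ancestral_genes * (1 - fraction_of_pairs_lost))
    return ancestral_genes + retained_pairs

# ~5000 genes per fusing cell, 85% of duplicates subsequently deleted
print(genes_after_duplication(5000, 0.85))
```

Under this reading the count comes to 5,750, close to the ~5,800 genes of present-day S. cerevisiae.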

 

The case for a genome duplication was made air-tight with the sequencing of another fungal genome. K. waltii and S. cerevisiae last shared an ancestor 150 million years ago, before the duplication that occurred along the S. cerevisiae lineage. Strikingly, each pair of regions identified by Wolfe in S. cerevisiae corresponds to a single region in K. waltii, whose gene content is the union of the two S. cerevisiae segments, each of which has lost a different subset of genes. Such duplicate blocks tile 85% of each K. waltii chromosome in the pattern expected for a genome duplication event. These blocks contain 75% of K. waltii genes and 81% of S. cerevisiae genes. The rest of the divergence is due to the inevitable changes in genome content that accumulated in the two lineages over their independent evolution (150 million years along each lineage, 300 million years in total).

 

The same approach also suggests a genome duplication in the Arabidopsis genome. Again, much of the genome is in paired segments and not triplicates. Overall, there is evidence that most flowering plants have a polyploid ancestry.

 

The second general method for identifying genome duplications relies on the notion that genes duplicated simultaneously should show the same history. For example, the 2R hypothesis in vertebrate evolution postulates two genome duplications, which should lead to a specific relationship among the 4 modern-day copies of each ancestral gene. In the cases where all four copies survived, one would expect a tree topology comprising two pairs, reflecting the two rounds of genome duplication. The fact that the human genome does not have four times the number of genes of early diverging metazoans (25,000 instead of 80,000) suggests that the ‘one-to-four’ rule does not generally hold up. Further, there are several reports that phylogenetic trees for four-membered human gene families do not show the excess of ((AB),(CD)) topologies expected under a 2R model.
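
The topology test can be illustrated with a small sketch. Trees for a four-membered family are written here as nested tuples, a representation chosen for this example only, and the function names are hypothetical. A rooted tree of the form ((A,B),(C,D)) is the shape expected when two successive genome duplications each double every lineage; the alternative shape is the "caterpillar" (((A,B),C),D).

```python
def is_two_pairs_topology(tree):
    """True if a rooted four-leaf tree has the ((A,B),(C,D)) shape.

    tree: nested 2-tuples with strings as leaves, e.g. (("A", "B"), ("C", "D")).
    """
    if not (isinstance(tree, tuple) and len(tree) == 2):
        return False
    return all(
        isinstance(child, tuple)
        and len(child) == 2
        and all(isinstance(leaf, str) for leaf in child)
        for child in tree
    )

def fraction_two_pairs(trees):
    """Fraction of gene-family trees showing the two-pairs shape."""
    return sum(is_two_pairs_topology(t) for t in trees) / len(trees)

# Toy gene-family trees, invented for illustration only
families = [
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
    ("A", (("B", "C"), "D")),
]
print(fraction_two_pairs(families))
```

As a baseline, of the 15 possible rooted topologies for four labelled genes, 3 have the two-pairs shape, so a clear excess over 20% would be needed to support 2R from such counts.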

 

A pair of publications demonstrates the tree-based approach in detail. McLysaght et al defined paralogons as regions containing paralog pairs separated by at most 30 intervening genes. For example, a paralogon on human chromosome 17 contains nine genes, spread across a region of several megabases, which have paralogs clustered together in a region on chromosome 3. The authors identified paralogons for different thresholds on the required number of paralogs. For a minimal paralogon size of 2, 1,642 paralogons were identified, covering 91% of the genome. Using a shuffling simulation, the authors argue that large-scale duplications must be invoked to explain the paralogons. Any paralogon with sm ≥ 6 was very likely to have been formed by a single duplication. Most interestingly, the duplications are then dated. The fly genome is used as a reference, and the age of each human duplication is expressed, using ultrametric trees, as a fraction of D, the distance back to the divergence from the fly lineage. An excess of duplications in the 0.4-0.7 D range was discovered. In other words, most of the duplications forming duplicated regions (paralogons) fall in the age class 0.4-0.7 D. Another study by Gu et al. examined 749 gene families and came to a similar conclusion using the tree-based approach, detecting a large number of duplications 550 million years ago. Thus a burst of gene duplication occurred during early chordate evolution.
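
The relative dating step can be sketched as follows. In an ultrametric tree every node has a height (its distance to the present-day tips), so the age of a duplication relative to the fly divergence is simply the ratio of two heights. The function names and the toy heights below are made up for illustration and are not data from McLysaght et al.

```python
def relative_duplication_age(duplication_height, fly_split_height):
    """Age of a duplication as a fraction of D (the human-fly divergence)."""
    if fly_split_height <= 0:
        raise ValueError("fly_split_height must be positive")
    return duplication_height / fly_split_height

def fraction_in_window(ages, low=0.4, high=0.7):
    """Fraction of duplications whose relative age lies in [low, high]."""
    return sum(low <= age <= high for age in ages) / len(ages)

# (duplication node height, fly split height) per gene family; toy values
toy_heights = [(0.1, 1.0), (0.5, 1.0), (0.6, 1.2), (0.9, 1.0)]
ages = [relative_duplication_age(d, f) for d, f in toy_heights]
print(ages, fraction_in_window(ages))
```

Expressing ages as fractions of D rather than in years sidesteps the need for an absolute clock calibration, at the cost of assuming a roughly constant rate within each gene family.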

 

In the genome of a unicellular eukaryote, the ciliate Paramecium tetraurelia, 3 successive genome duplications have been detected (Aury et al Nature 2006 444 171-178). The most recent was detected by first finding pairs of syntenic regions accounting for most of the genome. The next genome duplication was found by collapsing the pairs into one inferred ancestral region and finding pairs among these. The third was found by repeating this last step. Since the number of genes prior to the three genome duplications is estimated as 20,000 and the present number is not 80,000, many of the genes were lost following duplication. The authors suggest that genes with certain dosage requirements relative to one another would be maintained in the genome. In other words, duplicates of interacting genes that need to have the correct stoichiometry may be maintained because deleting them would lead to a decrease in fitness. Overall though, 95% of the duplicated genes are removed. The next cycle again duplicates the genome and leads to further gene loss.

 

Van de Peer and colleagues have recently suggested (Van de Peer, Nature Reviews Genetics, 2009) that although the descendants of a genome duplication are usually destined for extinction, when a duplication does lead to a successful lineage it is one that is marked by higher complexity. This is based upon evidence that polyploidy is extremely common, corresponding to 2-3% of speciation events. On the other hand, detected ancient genome duplication events, despite the many discoveries described here, are relatively rare.