Course notes for the
November 22nd 2009 lecture of the Genome Evolution Course
Itai Yanai
Department of Biology
Technion – Israel
Institute of Technology
yanai@technion.ac.il
Phylogenetics and the Tree of Life
Evolution
Trees are the icons of evolutionary biology, and not
haphazardly. While evolution is the organizing principle of life, trees are its
embodiment. Trees follow directly from
Ernst Haeckel became an evolutionist at the age of 25 upon
reading
Today
A Tree is a Mathematical Model of Evolutionary History
What exactly is a tree? Since the true evolution of life is not directly perceived, any inference of it is a model. A tree is a mathematical structure representing such a model. In other words, a tree is an evolutionary hypothesis regarding the history of a set of sequences or organisms.
A tree is a system of nodes and edges. The nodes correspond to organisms and the edges show their relationships. The terminal nodes – those with only one linking edge – typically correspond to the organisms for which we have data, commonly referred to as operational taxonomical units (OTU’s). Internal nodes – those with three linking edges – correspond to the hypothetical ancestors of our set of organisms. And one unique internal node with only two edges represents the root of the tree, the one ancestral organism of all of the organisms.
Trees may be formulated as rooted or unrooted. An unrooted is identical to a rooted tree with the exception that no node exists with only two connecting edges. Thus, while a rooted tree with M terminal nodes has M – 1 internal nodes, an unrooted has M – 2 internal nodes. Similarly, an unrooted tree has one less internal edge relative to a rooted tree with the same number of M terminal nodes, M-3 and M-2, respectively.
For a given number of OTU’s the number of possible trees increases very sharply. For example, while only three possible unrooted trees (evolutionary scenarios) can represent the relationships among four organisms, there are 945 possible trees for 7 organisms. A given unrooted tree with M OTU’s has 2M-3 internal branches, and thus this many possible ways to root the tree. So there are 10395 (945*11) rooted trees for a tree with 7 OTU’s.
The kind of trees presented here are called bifurcating trees because at every divergence exactly two lineages separate. Sometimes, it is convenient to draw three or more lineages which stem from the same ancestor called a polytomy, forming a multifurcating tree. Polytomy can represent one of two things. 1) a simple lack of information about the order of divergence of the multifurcating organisms, or 2) the plausible hypothesis that the divergence of the multiple species occurred simultaneously.
A tree as we have defined is distinct from a network. A tree is a network in which there are no closed loops, i.e. there is only one path between any two nodes. A network on the other hand emerges when one OTU as multiple ancestral organisms.
A convenient notation for writing tree is the Newick format. By this notation a tree is written as a phrase where the topology of the tree is given by the unions of nodes. Although the tree is written as a “one-dimensional” phrase it is actually a bit of a fractal structure. An example of a tree in Newick format is: ((1,2),((3,4),5),6).
There are three basic kinds of trees that can be used to depict different aspects of evolutionary history. A cladogram simply shows the relative of common order of common ancestry. In other words the evolutionary hypothesis is limited to the topology of the tree. In most cases a cladogram is already very valuable. An additive tree is one where along with the topology of the tree edge lengths are also specified corresponding to distances among the organisms. An ultrametric tree is a special kind of a rooted additive tree where the terminal nodes are all equally distant from the root.
Haeckel began building trees examining the similarities and differences of organisms. But is this an objective method? The trouble with using morphological features such as tail length and skin color for building morphologies is that the relative importance of the two features needs to be decided in order to resolve the phylogeny.
With the availability of genetic sequences it became possible to ask if molecular information can be used to construct evolutionary trees. In 1965, Zuckerkandl and Pauling proposed that DNA and protein molecules can actually be used documents of evolutionary history to reconstruct phylogeny. Based upon the notion of molecular clocks the relationships among different organisms as recorded in their genetic material can be solved.
Constructing Trees with the Molecular Clock
The problem of constructing evolutionary trees using genetic sequences can be stated as follows: given a multiple alignment of a set of sequences (representing perhaps a group of organisms) how can their relationships be represented in the form of a tree? All proposed tree building methods fall into two general classes: algorithmic and optimality criteria. With an algorithmic approach one unique tree is built by a series of steps while using the optimality criteria all possible trees are examined and the tree best meeting a certain criteria is chosen. In the former the algorithm plays a fundamental role but in the optimal criteria approach the algorithm is simply a tool to evaluate the criteria among the possibilities.
It is worth pointing out the circularity inherent in building a tree according to some evolutionary model: the model itself required knowing something of the phylogeny of a set of sequences. We would like to think of this tight relationship as “reciprocal illumintation” (Hennig, 1966). With a model you can build a tree that may help you devise a better model and so forth.
Genetic sequences have the nice property of being discrete: only four possible states. Although they are given to us as linear sequences we make the additional simplifying assumptions that the order does not matter and that there is no relationship among the positions. Both are of course are dead wrong in general but we will see that they are useful in building good trees.
We will first discuss two algorithmic approaches to tree building. Both are known as distance methods because they compute a tree via a distance matrix composed of the distances between each pair of sequences. Given the distance matrix, a distance based algorithm is basically a recipe for building a tree based upon the matrix. If the outputted tree is to be additive it must obey four basic rules. 1) All distances are positive. 2) A distance between two points can be zero only if the points are actually the same. 3) Distances are symmetrical. 4) There are no shortcuts in the tree, i.e. the distance a-c cannot be longer than the sum of distances a-b and b-c. An ultrametric tree follows these and one additional rule known as the tree point condition: distance a-b cannot be larger than their distance to a third point ( the maximum of distances a-c and b-c ).
UPGMA (Unweighted Pair Group Method with Arithmatic Mean) is an ultrametric tree building algorithm. UPGMA proceeds by inferring one ancestral sequence per step. In the first round UPGMA selects the least distant pair of sequences (or one of them), summarizes their distance as the first branches of a new tree, and recalculates the entire matrix with the pair as one entity (taking the mean of distances). After N – 1 steps (where N is the number of sequences) the matrix is reduced to just one element. The last inferred ancestor is taken as the root of the tree.
As an example of using UPGMA, we compare the mitochondrial
sequences (16.5 kb) of 86 humans from diverse backgrounds (Gyllensten,
2000). In the UPGMA tree we find that the deepest branch leads exclusively to
sub-Saharan mtDNAs, with the second branch containing
both Africans and non-Africans. This is evidence for a human mitochondrial
origin in
UPGMA implicitly assumes a molecular clock. This can be easily appreciated by considering that its output is an ultrametric tree. In accordance with the prediction of a molecular clock the distances of terminal nodes from the root are identical. If the tree point condition does not hold, an ultrametric tree is inappropriate. In other words, when evolutionary rates are unequal (no molecular clock) UPGMA is likely to fail.
Constructing additive trees by neighbor joining
We define two nodes as neighbors if they are connected though a single internal node. If the tree is additive (following the four rules we stipulated above) the following four-point condition will hold. Given a quartet of points that are two sets of neighbors, the sum of distances between neighbors is smaller than the sum of distances between non-neighbors. In plain words, the four-point condition states that neighbors are closer in the tree than non-neighbors.
The neighbor joining algorithm makes use of this condition by clustering terminal nodes not only because they are similar but also because they are distant from other terminal nodes. The algorithm evolves the tree from the most ambiguous tree possible, a star. Knowing the pairwise distances it is possible to calculate the sum of such a tree:
![]()
Based upon this star, we can detect the closest pair of
neighbors by examining each possible star tree with one additional internal
node clustering the neighbors together. For a tree with
![]()
Together with the distances L1X + L2X and the DiY’s we can calculate the length of the tree. After calculating this for all pairs of neighbors the shortest tree is selected and is the base for the subsequent step. Thus, as in UPGMA an internal branch is added at each step until, after N-2 steps the entire tree is resolved.
Note that the root is not specified by a NJ tree. A common method for identifying the root is to add a sequence to the set that is known to be more different from all the others, an outgroup. The position of the outgroup on the tree indicates the root. For example, in the case of the human mitochondria, a chimpanzee outgroup was used to identify the root.
The tree of life
What are the general features of the tree of life? The most fundamental aspects of our view of the tree of life have undergone many revisions in the past centuries. Until recently, two basic divisions of life depended upon the presence of a nucleus: the eukaryotes have a true (‘eu’) nucleus (‘karyote’) while the prokaryotes do not (‘pro’ – before). However, the ability to construct trees based upon genetic sequences forever altered this dichotomy.
The 16S ribosomal RNA molecule is extremely well conserved across organisms. Because distant organisms such as an elephant and a bacterium have very different rates of evolution, no molecular clock can be used to calculate their time of divergence. However, the topology of the tree constructed from the 16S molecules are expected to be conserved. In the late 1970’s Carl Woese and George Fox determined the 16S sequences from a wide range of organisms including prokaryotes and eukaryotes are constructed for them a similarity matrix. The results were shocking. Not two, but three clear clusters were present in the matrix. While it had been assumed that all prokaryotes were equally distant from the eukaryotes, Woese and Fox discovered a new kingdom, which they named the archaea. Consequently, the traditionally named bacteria are referred to now as “eubacteria” – the true bacteria.
It is of particular interest for us to identify the root of the tree of life. However, because all organisms are present in the tree of life we cannot choose any other organism as an outgroup by which to identify the root. A clever trick however can be used. The tree Woese and Fox constructed was based upon a single molecule, the 16S ribosomal RNA. But a revealing tree can be constructed from a family of sequences. In particular, consider a gene family that underwent its duplication very long ago: before the divergence of the three kingdoms, the gene duplicates, such that each kingdom inherits both copies of the gene. The six sequences can then be combined to form one tree whereby the subtree of each set of orthologs is the root of the other subset. EF-TU and EF-G are ancient paralogs whose function is involved in translation. When examining their combined tree of representatives from all kingdoms, we come to the surprising result that the root is between the eubacteria and the two other kingdoms. In other words, archaea and eukaryotes are sister taxa, and not eubacteria and eukaryotes as the old division among prokaryotes and eukaryotes might lead you to believe.
Gene trees and the history of the
mitochondria.
It is important to remember that when we construct a tree for an orthologous set of sequences we are first and foremost inferring the history of the gene family which may or may not correlate with the history of the corresponding organisms. For example, the tree of the Mn-dependent transcriptional regulator gene family even contradicts the holiest of all principles: that the archaea and eubacteria should cluster together. A phenomena known has horizontal gene transfer is invoked to explain such a situation. This phenomenon is so important that we devote an entire lecture to it soon. In the meantime though we examine perhaps the most famous instance of horizontal transfer.
On the face of it the mitochondria is a strange organelle. Most strangely it has its own genome, which is quite different from the genome of its host. The mitochondria’s DNA is not even organized by histone proteins basic to all eukaryotes. In 1970, Lynn Margulis, then of Boston University, put forth a bold theory for the origin of the mitochondria. She postulated that the mitochondria was once a free-living bacteria. At one point a eukaryote swallowed it and the two live in symbiosis ever since. This theory is known as the endosymbiotic theory of eukaryote evolution. Margulis’ theory was not fully accepted nor appreciated at first but is now universally embraced.
The most convincing piece of evidence in the puzzle had to await the age of genomics. In 1998 the complete genomic sequence of the a proteobacteria, Rickettsia prowazekii was determined. It was gloriously observed that the genes in R. prowazekii - such as ribosomal proteins and ATP synthesis (the crucial genes of the mitochondria) – are much closer to the mitochondria than to other bacteria. That is, in the tree of life, the mitochondria are positioned not with the genome of their host but with the a proteobacteria.
Margulis’ endosymbiotic theory also applies to the chloroplasts which are now known to stem from cyanobacteria. Thus when comparing orthologous genes from different organisms it is important to understand that the history of the gene does not always run parallel to that of the organisms.
Maximum parsimony as a character-state phylogeny method
A drawback of the distance based methods such as UPGMA and NJ is that the sequences are summarized by a distance matrix, never to be examined in detail. Another school of phylogenetic thought – the character state methods – examines each site separately. Such an approach considers each site as a potential clue towards choosing the best tree out of the complete space of possible trees among the number of OTU’s.
The method we will discuss, maximum parsimony, bases its tree decision for each site upon Occam’s razor: stating that simpler hypotheses are preferable to more complicated ones. Thus in our context, Occam’s razor translates to finding a tree that requires the smallest number of evolutionary changes among the sequences.
Maximum parsimony proceeds by identifying those sequences that are informative (offer clues). Clearly sites that non-variant across the sequences are not helpful. However, even the variant ones may not be informative. We decide if the site is informative by reviewing each possible tree and mapping the locations of the fewest number of mutations required to explain the observed pattern of sequences. Comparing the number of inferred mutations for each possible tree, we define an informative site as one that favors one of the possible trees in terms of fewer mutations. Thus, each informative site “votes” for one of the trees. The most parsimonious tree is said to be the one receiving the most votes.
Ascribing confidence values to the tree by bootstrapping
Are all the branches of a tree created equal? Less cryptically, how confident are we in the tree? While the tree is a collective whole it can also be seen as a collection of hypotheses. Essentially, each internal branch “splits” the tree in two. For example, in a tree of microbial organisms, an internal branch might cluster E. coli and H. influenza and separate them from the main tree. We might then ask how confident we are in this particular split as well as each of the other internal branches?
Bootstrapping is a statistical method that can be used to place confidence intervals on phylogenies. It is particularly convenient because it does not require anything more than the actual data used to build the tree. It call also be used in conjunction with any tree building method. Bootstrapping proceeds by resampling from the data. Starting with the multiple alignment it builds a random multiple alignment by sampling with replacement the different sites of the original multiple alignment. For a large number, say 10,000, of such pseudo multiple alignments a tree is built by the same method used to build the original tree.
Based upon the 10,000 trees, bootstrapping can be used to
annotate the internal branches of the original tree. For each internal branch,
we count how many times the same particular “split” is found among the 10,000
trees. The higher this number the more we are confident in the robustness of
the split.