Course notes for the October 25th 2009 lecture of the Genome Evolution Course

 

Itai Yanai

Department of Biology

Technion – Israel Institute of Technology

yanai@technion.ac.il

 

Human Genetic Variations

 

That humans differ from one another is rather self-evident. Each one of us is unique in terms of our memories, behaviors, and personality. It has also long been clear to humanity that some of our distinguishing characteristics, such as height, skin color and facial looks, are inherited from our parents. In other words, the variations that make each one of us unique have a genetic component. Recall the amazing similarity in looks between monozygotic twins to quickly appreciate the degree of that genetic inheritance.

 

Why do people have different skin colors? Jablonski and colleagues have identified that skin color has evolved according to two selective pressures: 1. photoprotection from the sun’s UV radiation, and 2. UV-induced biosynthesis of vitamin D3, which is one of the rare benefits of UV radiation. Close to the equator there is a need for photoprotection from the sun’s strong UV radiation, and dark skin (rich in melanin) provides it. At higher latitudes (>30°N), on the other hand, vitamin D3 biosynthesis becomes the greater concern, because the UV radiation it requires is weakened as a consequence of the angle at which radiation enters Earth’s atmosphere. Jablonski and Chaplin thus devised a model predicting skin color based upon UV radiation and skin reflectance. Remarkably, this model agrees well with observed skin colors around the world. This shows that human skin color is a highly adaptive phenotype that has evolved to accommodate the physiological environments of our ancestors.

 

The differences among individuals, however, run much deeper than superficial looks. Of particular importance to the field of medicine, and pharmacogenomics in particular, are the ways in which such variations can affect our health. The existence of inter-individual differences undermines the notion of a “one-size-fits-all” approach to medicine. A relevant example is the story of the introduction of succinylcholine in operating rooms. In the 1950s this molecule was found to be a muscle relaxant and thus became a useful medication for anesthetists. Succinylcholine worked fine for the overwhelming majority but proved almost life-threatening to a small portion of the population. The phenomenon turned out to be purely genetic. Succinylcholine is broken down by the protein cholinesterase, and some individuals do not have a working version of the gene encoding this enzyme.

 

Most importantly for medicine, our genetic makeup may predispose us towards diseases. For example, a single mutation in a hemoglobin gene causes sickle-cell anemia in the unlucky individuals that carry two copies of it. This is just one of over 11,000 genes known to be involved in diseases when mutated. These are stored in the Online Mendelian Inheritance in Man (OMIM) database.

 

As a cautionary note, it behooves us to keep in mind that the genotype, together with the environment under which it unfolds, defines the phenotype of an organism. For example, consider the growth of 7 plant variants of Achillea at the low elevation of 30 meters above sea level. The fact that these variants differ in terms of the height to which they rise might lead us to denote one variant as tall and another as relatively shorter. Without adding the clause ‘under this environment’, however, we would mislead ourselves into thinking that these differences do not themselves vary. Interestingly, in this case the height of the Achillea clones at low elevation has no predictive power at medium elevation, 1,400 meters in the foothills of the Sierra Nevada. There the plant that was tallest at low elevation actually grows the shortest.

 

The nature of human genetic differences

 

The genetic differences among individuals are encoded in our genomes. What does it mean exactly, then, that the Human Genome Project has determined the sequence of the human genome? Whose genome, exactly, since each individual’s genome is different? The answer is two-fold since there were really two genome projects. As for the public effort, the genome sequence is actually an amalgam of the genomes of 10-20 anonymous individuals of different backgrounds. Celera’s private effort was also supposed to involve the DNA of a range of people but, as has been recently revealed by Celera’s former head Craig Venter, their assembled sequence corresponded overwhelmingly to his own genome. Even though the complete genome assembly by itself gives only a consensus of the human genome, much information is known about the variations among the genomes of individuals.

 

A fundamental question then is how genetic variation is encoded in the genome. In other words, given an alignment of the genomic sequences of two individuals, what shape do the differences assume? Currently, a database containing the human variations detected by various methods has been accumulating and now stands at over 6 million entries. We find that a large portion of the human genetic variation now recorded turns out to be remarkably simple. Approximately 93% corresponds to simple point mutations, that is, single nucleotide polymorphisms (SNPs, pronounced “SNiPs”). For example, the 1 millionth nucleotide in your chromosome one may be an ‘A’ while for me it is a ‘T’. A distant second type of variation in terms of frequency is called an ‘indel’, for insertion/deletion. Amounting to 7% of the known variation, indels are extra bases inserted into the text. For example, the sequence “CCAT” may be found after my millionth chromosome one nucleotide before my sequence continues to what is immediately the next base in your sequence. Because it is not obvious whether the extra sequence was inserted in mine or removed from yours, both terms are used in the name: “indel”.

 

A much rarer form of variation involves simple repeats. As was discussed in the previous section, 3% of the human genome consists of simple repeats, also called microsatellites, such as “TTAGGG”. There are currently ~2,500 known variations involving the length of such microsatellites. For example, at one position you might have the triplet “CAC” repeat itself 10 times while in my genome only 8 repeats are present. Finally, genomes also occasionally differ in terms of longer stretches of insertions or deletions. The most representative example of such an occasion is the ALU sequence, which we will discuss in greater depth in the “Evolution of Selfishness” section.

 

So how different are any two individuals? One way to answer this is to determine the sequence of a genomic region in many individuals and look for differences. In the following we will narrow our question to how different, on average, any two individuals are in terms of SNPs. The sequences from the multiple individuals are aligned and the differences between all pairs of sequences are tabulated. By dividing the total number of differences by the number of pairs and then by the length of the genomic region we examined, we obtain the fraction of sites that differ on average between two individuals. This statistic is the nucleotide diversity measure, denoted π. Such analyses have been performed for a number of genomic regions and the numbers tend to center around 0.1%. In other words, one out of every thousand base pairs differs between two humans.
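
The pairwise tabulation described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `nucleotide_diversity` and the toy alignment are hypothetical (not real data), and the sketch assumes the sequences are already aligned, of equal length, and free of gaps.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average fraction of sites that differ between all pairs of aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    pairs = list(combinations(sequences, 2))
    # Count mismatching positions for every pair of sequences
    total_differences = sum(
        sum(1 for x, y in zip(a, b) if x != y) for a, b in pairs
    )
    # Divide by the number of pairs, then by the length of the region
    return total_differences / len(pairs) / length


# Hypothetical toy alignment of four 10-bp sequences (made up for illustration)
toy_alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "ACGTACGTAA"]
print(nucleotide_diversity(toy_alignment))
```

With real data the region would be thousands of base pairs long and the result would be expected to land near the 0.001 quoted in the text; the toy sequences here are far too short and too variable to reproduce that number.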

 

While one in a thousand sounds like a rather mild number, because the genome is 3.2 billion base pairs long, a 0.1% difference actually amounts to 3.2 million differences between a pair of individuals. Now that is a much more respectable representation of the differences! How could there be this many differences? And, if these are only the differences between two individuals, the total number of variations must be astounding. In fact it is, as the following quick calculation shows.

 

The origin of a difference is a base mutation, typically occurring during DNA replication. Based upon previous estimates, each base has a two in 100 million chance of being mutated in the next generation. Since there are about 14 billion genomes on earth today (twice the number of people), each base pair in the genome is newly mutated in about 280 genomes with respect to the previous generation. Thus, an extraordinary number of mutations are present. In fact, nearly all possible SNPs across the 3.2 billion base pairs are probably present in the global human population (nearly, because some mutations are deadly).
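
The back-of-the-envelope numbers in the last two paragraphs can be reproduced directly. The sketch below uses only the figures quoted in the text; the variable names are illustrative, and the per-genome mutation count is simply derived from the same inputs rather than taken from the notes.

```python
genome_size = 3.2e9         # base pairs in the genome, as quoted in the text
pairwise_diversity = 0.001  # 0.1% average difference between two individuals
mutation_rate = 2e-8        # per base per generation ("two in 100 million")
genomes_on_earth = 14e9     # twice the number of people, as quoted in the text

# Expected number of differences between a pair of individuals
differences_per_pair = genome_size * pairwise_diversity

# Expected number of genomes in which any given base is newly mutated
new_mutations_per_site = genomes_on_earth * mutation_rate

# Derived from the same inputs: new mutations carried by a single genome
new_mutations_per_genome = genome_size * mutation_rate

print(f"differences between two individuals: {differences_per_pair:,.0f}")
print(f"new mutations per site, worldwide:   {new_mutations_per_site:,.0f}")
print(f"new mutations per genome:            {new_mutations_per_genome:,.0f}")
```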

 

The important thing to realize about these variations is that most mutations have a negligible frequency in the population. To stress this point, only if a mutation (or SNP) is present in at least 1% of the population is it promoted to the standing of a ‘polymorphism’. The number of polymorphisms present is then of course dependent upon this cutoff. Classical population genetics calculations have estimated that 11 million SNPs are to be expected in our population at the 1% cutoff, while only 7.1 million are expected at a frequency of at least 5%.

 

What this trend suggests is that there are more rare mutations than common polymorphisms. A study involving the SNPs of chromosome 21 in 20 individuals has verified this trend. It was found that there are nearly three times fewer SNPs with a frequency of 0.41-0.5 than SNPs with a frequency of 0.05-0.1. The shape of this distribution, with its decreasing number of common polymorphisms, will be important in the discussion of the neutral theory of molecular evolution.

 

Although we have calculated that only 11 million SNPs are frequent enough to be considered polymorphisms, this is still a huge number. Consider for a moment that sickle cell anemia is caused by just one mutation. Could each of the 11 million SNPs cause such differences? It appears that in actuality most of the 11 million SNPs do not make much of a difference because they are located in introns, in other untranslated regions, or at positions in the coding region that can be altered without any (significant) phenotypic effect. Of the SNPs that may make a difference, there are ~30 thousand that alter a coding nucleotide such that it codes for a different amino acid. Another ~500 SNPs alter a nucleotide that is positioned at a splice junction. Both types of mutations can easily be imagined to cause phenotypic differences.

 

The number of known SNPs is rising extremely fast. This year, for example, 17.3 million new SNPs will be deposited (~20% of the known SNPs, not all are new) thanks to the 1000 Genomes Project. Last year, millions of SNPs were deposited following the sequencing of the genomes of Craig Venter, James Watson, an anonymous Chinese individual, and an anonymous Korean individual. Altogether there are today about 18 million known SNPs.

 

However, while SNPs were thought to be the dominant form of differences between individuals, it is now clear that this is not the full picture. Recent technology has enabled a whole-genome assay of duplications of entire segments (>1 kb) of the genome. This was first achieved using tiling arrays but is now mostly done using next-generation sequencing. A study examining 270 individuals identified 1,447 of these so-called copy number variations (or CNVs), which together encompass 360 Mb, or 12% of the genome. Between individuals there are an estimated 600-900 of these CNVs which, when accounting for the base pairs they cover, account for about 4 times more variation than SNPs. For example, in Craig Venter’s genome, when examining the two haploid genomes there are 2,894,929 SNPs relative to the reference genome, 939,799 multiple nucleotide polymorphisms, and 10 Mb of large insertions and deletions. Together these amount to a 0.5% difference with the reference genome. Thus, in this case newer technology has alerted us to a new and significant form of genetic variation between people.

 

Introduction to Beanbag genetics

 

A reasonable amount of insight into the fate of mutations can be gathered from simple so-called “beanbag genetics”. The etymology of this term will be clear after a few definitions. In this section we focus on one region, or locus, on the chromosome. Let us suppose that at this particular locus there is no general consensus on the exact DNA sequence. Instead each individual in the population has one of two variants, or alleles, of possible sequences at the locus. Let’s name the two alleles ‘A’ and ‘a’. Since every individual has two genomes, as far as alleles A and a are concerned there are three types of individuals: AA, Aa and aa. To calculate the number of copies of the A allele we simply multiply the number of AA individuals by two and add the number of Aa individuals. The frequency is computed by normalizing this number by the total number of allele copies present, which is twice the total number of individuals. Another way to look at it is that the allele frequency of A is the fraction of AA individuals plus half the fraction of Aa individuals.
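
As a small worked example of this counting rule, the sketch below computes both allele frequencies from genotype counts. The function name `allele_frequencies` and the counts are hypothetical and chosen only for illustration.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q), the frequencies of alleles A and a, from genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)  # every individual carries two copies
    p = (2 * n_AA + n_Aa) / total_alleles     # AA contributes two A's, Aa one
    q = (2 * n_aa + n_Aa) / total_alleles     # aa contributes two a's, Aa one
    return p, q


# Hypothetical sample of 100 individuals: 30 AA, 50 Aa, 20 aa
p, q = allele_frequencies(n_AA=30, n_Aa=50, n_aa=20)
print(p, q)
```

The same value of p follows from the second formulation in the text: the fraction of AA individuals (0.30) plus half the fraction of Aa individuals (0.25).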

 

But this situation involving the A and a alleles is a static picture, since we examined just one moment in time. What concerns us mainly are the changes in allele frequencies. In fact, we can go as far as saying that the change in allele frequencies is, at one level, evolution itself. In order to reason about changes across generations we need to make an important assumption regarding exactly how the next generation is conceived. We can imagine that breeding among individuals occurs by random mating. While this is strikingly different from our reality, where partners are chosen in anything but a random manner, we will make this assumption to start with a simple scenario.

 

Assuming random mating, the logical step is to imagine a gene pool. The pool is composed of the sperm and eggs that the males and females respectively produce. Each sex cell contains one genome and thus either a copy of A or a copy of a. Visualize each sex cell as a ball that is orange if it carries the A allele, or blue if it carries the a allele. The next generation is produced by pairing up alleles. For example, if we happen to pick from the gene pool an orange and a blue ball then the individual would be of the Aa genotype. In such a fashion, we begin with, say, a billion individuals, with a 1 to 1 ratio of blue to orange balls, and choose from their gene pool a new set of a billion individuals which we term the second generation.

 

When selecting from the gene pool to produce the next generation, a trend develops. If the orange balls (A allele) have a 50% frequency we can predict that 25% of the next generation (0.5*0.5) will be of genotype AA. The same logic leads to a 25% frequency of the aa genotype. Since there are two ways to choose a blue and an orange ball (Aa or aA), the frequency of heterozygotes will be twice the product of the two allele frequencies (2*0.5*0.5). It essentially does not matter what the starting frequencies of AA, Aa, and aa are. As long as there is random mating in the gene pool, the frequencies of A and a alone can predict the frequencies of the genotypes. Starting with a billion individuals that are all of the Aa genotype, the next generation is destined to have ~250 million AA, 500 million Aa, and 250 million aa individuals.

 

An interesting thing occurs when we continue breeding to the next generation: nothing changes! The genotypes will not change in frequencies: the system has equilibrated. Given only the frequency of A, henceforth p, we are able to give the distribution of genotypes in the next and future generations. This principle is known as the Hardy-Weinberg equilibrium. Its logic, to recapitulate, is that under random mating in a large population operating under the scenario we described, the fractions of the population corresponding to the AA, Aa, and aa genotypes are $p^2$, $2pq$, and $q^2$, respectively, where p and q are the frequencies of the A and a alleles. An important corollary to the Hardy-Weinberg equilibrium is that p does not change with generations. To see this directly, we calculate p’, the frequency of A in the next generation. We know that all of the alleles in AA individuals are A and that these individuals have a frequency of $p^2$. The Aa individuals, with frequency $2pq$, also have the A allele but at only half the frequency. Summing these terms and normalizing by the sum of the genotype frequencies (and remembering that p+q = 1) we have:

$$p' = \frac{p^2 + \tfrac{1}{2}(2pq)}{p^2 + 2pq + q^2} = \frac{p(p+q)}{(p+q)^2} = p$$

Thus, the frequency of the A allele in the next generation is the same as its frequency in the present generation.
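
The equilibrium argument can be checked numerically. The following is a minimal sketch, assuming an effectively infinite population with random mating and no selection; the function names `hardy_weinberg` and `next_generation_p` are illustrative, not part of the original notes.

```python
def hardy_weinberg(p):
    """Genotype frequencies (AA, Aa, aa) expected under random mating."""
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


def next_generation_p(f_AA, f_Aa, f_aa):
    """Frequency of A among the gametes produced by the given genotype mix."""
    return (f_AA + 0.5 * f_Aa) / (f_AA + f_Aa + f_aa)


# Start with a population made up entirely of Aa individuals
p0 = next_generation_p(0.0, 1.0, 0.0)

# One round of random mating gives the Hardy-Weinberg proportions
generation_1 = hardy_weinberg(p0)

# A second round leaves the allele frequency unchanged
p1 = next_generation_p(*generation_1)
generation_2 = hardy_weinberg(p1)

print(p0, generation_1, p1, generation_2)
```

Any other starting mix of genotypes with the same p reaches the same proportions after a single round of random mating, which is the sense in which the starting genotype frequencies do not matter.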

 

Consequently, the situation we have described is one in which there is no evolution. Under the assumptions of the game as we have played it so far, namely a large number of individuals (a billion), random mating (gene pool), constant population size (no migrations, population explosions), no novelty (only blue and orange balls forever), and neutrality (all genotypes are effectively equal), no evolution occurs.

 

As abstract as the model seems, evidence for its predictive power is ubiquitous. In fact, a check for Hardy-Weinberg frequency (HWF) distributions of a two-allele genotype is a typical test for a neutral mode of evolution; in other words, a case without selection. Moreover, the frequency of a rare allele in a population is typically estimated by taking the square root of the fraction of aa genotypes, which assumes HWF.
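
The square-root estimate mentioned above is a one-liner, sketched below under the assumption that the population really is at Hardy-Weinberg frequencies. The function names and the 1-in-10,000 figure are hypothetical and used only for illustration.

```python
import math


def rare_allele_frequency(fraction_aa):
    """Estimate q from the observed fraction of aa individuals, assuming HWF."""
    if not 0.0 <= fraction_aa <= 1.0:
        raise ValueError("fraction_aa must lie between 0 and 1")
    return math.sqrt(fraction_aa)  # because the aa fraction equals q squared


def heterozygote_fraction(fraction_aa):
    """Expected fraction of Aa individuals (2pq) implied by the same estimate."""
    q = rare_allele_frequency(fraction_aa)
    return 2 * (1.0 - q) * q


# Hypothetical example: 1 in 10,000 individuals is of the aa genotype
q_estimate = rare_allele_frequency(1 / 10_000)
carriers = heterozygote_fraction(1 / 10_000)
print(q_estimate, carriers)
```

The example illustrates why the estimate is useful: even when aa individuals are very rare, the heterozygotes that carry a single copy of the allele are roughly two hundred times more common.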

 

So when does evolution occur? One reason for a change in frequencies has to do with the unequal survival of genotypes. To account for this we need to alter our model. Before a generation disseminates its alleles to the next generation’s gene pool, we can stipulate a selection step such that some genotypes are fractionally overrepresented in the gene pool. Whereas originally p’ was based solely upon the p of the current generation, the fractions of the genotypes can now be altered according to selective pressures. Three weights are added, one for each genotype, and the general case for p in the next generation is thus:

$$p' = \frac{w_{AA}\,p^2 + w_{Aa}\,pq}{w_{AA}\,p^2 + w_{Aa}\,2pq + w_{aa}\,q^2}$$

Here $w_{AA}$, $w_{Aa}$, and $w_{aa}$ denote the weights of the three genotypes. The equation is the same as the previous equation but now with a selection parameter for each genotype. It is clear that the neutral case we have described earlier, summarized by the Hardy-Weinberg equilibrium, formally corresponds to all weights set to 1. When the weights are unevenly distributed, frequencies begin to change and consequently evolution occurs. We formulate changes in frequency in terms of the current and future generation, $\Delta q = q' - q$, where $q' = 1 - p'$.

 

Armed with this general equation for changes in q, the frequency of the ‘a’ allele, we investigate the effects of different weight settings. Consider co-dominant selection for the a allele. By this we refer to a slight preference of the Aa genotype over the AA genotype, and a slight preference of aa over Aa. To code this in weights we assign $w_{AA} = 1$, $w_{Aa} = 1 + s$, $w_{aa} = 1 + 2s$. Solving for $\Delta q$, the change in q from one generation to the next, we find $\Delta q \approx s\,p\,q$ (assuming that s is small). So for example, with s set to 0.01, and q starting out as 0.01, q rises quickly to almost 1 (under this model it never actually wipes out the other allele completely).

 

For the dominant case, where the aa and Aa genotypes are selectively equal, but favorable relative to AA, $w_{AA} = 1$, $w_{Aa} = 1 + s$, $w_{aa} = 1 + s$, and $\Delta q \approx s\,p^2 q$ (again for small s). Here, q also rises at the expense of p, but a bit slower, since the disfavored A allele can effectively hide in the heterozygotes.

 

For the recessive case, only the homozygote aa is considered selectively advantageous, i.e. two copies are required for the selective difference to take hold: $w_{AA} = 1$, $w_{Aa} = 1$, $w_{aa} = 1 + s$, and $\Delta q \approx s\,p\,q^2$. Here the rise is of a qualitatively different shape: the a allele spends many generations with only very minor increases in frequency. However, once it achieves a critical mass of about 0.1, q jumps very quickly to near 1, or fixation in the population. The reason is that when a is very rare, very few individuals carry two copies of it and thus the selective advantage of the homozygotes is hardly felt. However, once there is a good number of them, selection is able to effectively increase the allele frequency.

                       

Consider the case of balancing selection where the heterozygotes are better fit, $w_{AA} = 1$, $w_{Aa} = 1 + s$, $w_{aa} = 1$, and $\Delta q \approx s\,p\,q\,(1 - 2q)$. Here, instead of rising to near fixation, q converges to a frequency of 0.5. Analogously, if q begins with a high frequency it will decrease to 0.5.
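
The four selection regimes described above can be explored numerically by iterating the weighted recursion one generation at a time. The sketch below is an illustration under stated assumptions, not the original course material: the weight values are assumed settings chosen to match the verbal descriptions (with s = 0.01), and the names `next_q`, `trajectory`, and `SCENARIOS` are hypothetical.

```python
def next_q(q, w_AA, w_Aa, w_aa):
    """Frequency of allele a after one generation of random mating plus selection."""
    p = 1.0 - q
    # Sum of the weighted genotype frequencies (the normalizing term)
    mean_weight = w_AA * p * p + w_Aa * 2 * p * q + w_aa * q * q
    # Aa individuals pass on a half the time, aa individuals always
    return (w_Aa * p * q + w_aa * q * q) / mean_weight


def trajectory(q0, weights, generations):
    """List of q values from generation 0 through the requested generation."""
    qs = [q0]
    for _ in range(generations):
        qs.append(next_q(qs[-1], *weights))
    return qs


s = 0.01  # selection coefficient used in the text's example

# Assumed weights (w_AA, w_Aa, w_aa) for each regime
SCENARIOS = {
    "codominant": (1.0, 1.0 + s, 1.0 + 2 * s),
    "dominant": (1.0, 1.0 + s, 1.0 + s),
    "recessive": (1.0, 1.0, 1.0 + s),
    "balancing": (1.0, 1.0 + s, 1.0),
}

for name, weights in SCENARIOS.items():
    final_q = trajectory(0.01, weights, 2000)[-1]
    print(f"{name:>10}: q after 2000 generations = {final_q:.3f}")
```

Plotting the full lists returned by `trajectory` rather than just the final value shows the differently shaped curves discussed in the text, and setting all three weights to 1 recovers the neutral case in which q never changes.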

 

Enter the law of chance

 

We now introduce a major topic, genetic drift, that we have until now ignored by way of a certain innocent-sounding assumption. The model we have been working with has applied to large populations: a billion individuals. With this assumption we were able to escape the laws of chance that become important when the numbers are smaller. We will see that chance itself is an evolutionary force able to drastically change frequencies.

 

To appreciate the important role of chance, let’s build a slightly different model of evolution. We represent the population as N individuals. The next generation is composed from the previous population by random selection with replacement. As an example, imagine 100 individuals where 99 are orange and one is blue. Because the next generation is composed of 100 individuals chosen with replacement, the blue ball may be selected twice and be represented twice in the next generation, or alternatively, it may not be selected at all and consequently be removed altogether from the population.

 

Because N=100 forms such a small population, in less than 1,000 generations only two outcomes are possible for a new mutation: loss or fixation. The frequency of 1% (1 in a population of size N=100) characterizes a new mutation. In the overwhelming majority of instances, the mutant blue ball will be lost in the first few generations. However, in rare instances, by a lucky turn of events the blue ball will make gains in frequency until it actually wipes out the initially wildtype orange ball. This is called a fixation event. We say that the blue ball rose in frequency by genetic drift, that is, by random sampling in a finite population.
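
The resampling model just described is easy to simulate. The sketch below follows a single blue ball in a population of N = 100 until it is either lost or fixed; the function names are illustrative, and the fixed random seed is an arbitrary choice made only so that a run can be repeated.

```python
import random


def drift_step(count, n, rng):
    """Sample the next generation: n draws with replacement from the current one."""
    freq = count / n
    # Each of the n new individuals is blue with probability equal to the
    # current frequency of blue balls
    return sum(1 for _ in range(n) if rng.random() < freq)


def simulate_until_absorbed(n, initial_count, rng, max_generations=100_000):
    """Follow the blue-ball count until it reaches 0 (loss) or n (fixation)."""
    count = initial_count
    generations = 0
    while 0 < count < n and generations < max_generations:
        count = drift_step(count, n, rng)
        generations += 1
    return count, generations


rng = random.Random(1)  # arbitrary seed, for repeatability only
final_count, generations = simulate_until_absorbed(n=100, initial_count=1, rng=rng)

if final_count == 0:
    outcome = "lost"
elif final_count == 100:
    outcome = "fixed"
else:
    outcome = "still segregating"
print(f"blue ball {outcome} after {generations} generations")
```

Running the simulation many times with different seeds gives a feel for how often the new mutation disappears within the first few generations and how rarely it takes over the population.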

 

This aspect of populations did not escape the attention of the great population geneticist R.A. Fisher. He calculated a 78% chance of loss for a mutant allele with a selective advantage of s=0.01. This is very surprising and runs directly counter to our earlier formulation of the deterministic rise in frequency of an advantageous allele. In fact, Fisher continued, a mutant allele has only a 2% chance of spreading to the entire population. In other words, most beneficial mutations are actually lost by mere chance!

 

An important aspect of genetic drift is its removal of variation from the population. Consider starting with two variants, both at 50% frequency. According to our N=100 model, very quickly one of them will be lost. Running the simulation 10 times, we find that after a little more than 4,000 generations all variation has been removed in every run, with one allele or the other lost.

 

What are David’s chances of beating Goliath? Our stochastic model allows us to estimate the probability of a new mutation achieving fixation. Through simulations we find that a new mutation with frequency 1/N has exactly this probability, 1/N, of achieving fixation. Thus, in the absence of selection, the probability of fixation is related only to the size of the population. As may be intuitively expected, smaller populations make it easier for a new mutation to achieve fixation than larger populations. Simulations as well as analytical work with diffusion equations have yielded the result that at any point in time the chance of an allele achieving fixation is exactly equal to its current frequency. For those mutations that eventually fix, the rise from rags to riches takes on average 4N generations. If a selective advantage is introduced, Kimura and Ohta have shown (also applying diffusion equations) that the time to fixation is shortened to (2/s)*ln(2N) generations.
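
The 1/N result can be checked with a short Monte Carlo experiment using the same resampling model. This is a sketch under stated assumptions: the function names, the population size of 20, the number of trials, and the seed are all arbitrary choices made so that the example runs quickly. The two helper functions simply evaluate the fixation-time formulas quoted above.

```python
import math
import random


def new_mutation_fixes(n, rng):
    """Follow one new neutral mutation (a single copy among n) until loss or fixation."""
    count = 1
    while 0 < count < n:
        freq = count / n
        # Next generation: n draws with replacement from the current population
        count = sum(1 for _ in range(n) if rng.random() < freq)
    return count == n


def estimate_fixation_probability(n, trials, seed=0):
    """Fraction of independent trials in which the new mutation reaches fixation."""
    rng = random.Random(seed)
    return sum(new_mutation_fixes(n, rng) for _ in range(trials)) / trials


def neutral_fixation_time(n):
    """Average generations to fixation for a neutral mutation that fixes (4N)."""
    return 4 * n


def selected_fixation_time(n, s):
    """Generations to fixation with selective advantage s: (2/s)*ln(2N)."""
    return (2 / s) * math.log(2 * n)


n = 20  # small population so that the simulation finishes quickly
estimate = estimate_fixation_probability(n, trials=2000)
print(f"simulated fixation probability: {estimate:.3f} (1/N = {1 / n:.3f})")
print(f"neutral fixation time for N=100: {neutral_fixation_time(100)} generations")
print(f"with s=0.01 and N=100: {selected_fixation_time(100, 0.01):.0f} generations")
```

Because the estimate comes from random sampling it fluctuates around 1/N from run to run; increasing the number of trials narrows the spread.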

 

A real-world example of genetic drift is family names in China, where they have been passed down for over 2,000 years. Interestingly, entire small villages sometimes share the same last name. In addition, although there are a billion people, there are only about 100 names. It is thus evident that drift has acted to remove variation from the population.

 

Both genetic drift and selection (with the exception of balancing selection) remove variation from the population. With these forces, how is it possible that we find any variation at all? One might think all variation would have been wiped out by now. Mutations, however, make sure that this never occurs. While old variations may be removed, new variations are always in the making. In a sense, mutations are thus the engines of evolution, making sure it never grinds to a halt.

                                                 

Reading human history in our genomes

 

Differences among the genomes of our global populations are helpful for unraveling our history. Consider the question: how did agriculture spread 10,000 years ago? As irrelevant to our subject as this question seems, remember that when people migrate they take their genes with them. Thus, whether agriculture was spread by the spread of ideas or of farmers can be investigated using the different alleles that individuals in the populations carry. In a classic study, Cavalli-Sforza accumulated genotypes at 95 alleles for several hundred individuals from 26 present-day European populations. Allele frequencies among populations were summarized by principal components. The first component of the principal components analysis recovered 28% of the variation in the data. When mapped onto the globe, the first component showed migration from the Middle East, reflecting the spread of agriculture by migration.

 

How much variation is there between continents? Very little as it turns out. Given one population, ~85% of the total genetic variation found in the species is found just there. Considering all of the populations in one continent adds another 5%. Thus only the remaining 10% of the total variation is detected between different continents.