Course notes for the January 3rd 2010 lecture of the Genome Evolution Course

 

Itai Yanai

Department of Biology

Technion – Israel Institute of Technology

yanai@technion.ac.il

 

Evolution of Biological Networks

 

Things derive their being and nature by mutual dependence and are nothing in themselves.
-Nagarjuna, second century Buddhist philosopher

 

What is a network and why is it a useful object in our understanding of biological systems? A network is defined by a set of nodes and the edges that connect them. For example, a network of genes and their functional relationships to one another may be captured by ‘nodes’ representing genes and ‘edges’ among the nodes representing observed relationships between pairs of genes. Here we will use the terms “networks”, “graphs”, and “pathways” interchangeably. Traditionally, a gene was studied rather independently of other genes: an enzyme would be described in terms of its substrate and product. However, a gene is not an island; its function might only be understood in the context of the interactions of its product with other proteins in the cell. This notion is at the core of the field of systems biology as well as some East-Asian philosophies, as hinted by the quote above. In the first part of this lecture we discuss two specific biological systems – the flagellum and blood clotting – and examine perspectives on their evolution. In the second part we consider more theoretical aspects of network structure, function, and evolution.

 

Evolution of the flagellum system

 

The flagellum of bacteria was the first microbial property observed. Antony van Leeuwenhoek, the discoverer of the microbial world, first described microbes and their flagella in the following words: “I must say, for my part, that no more pleasant sight has ever yet come before my eye than these many thousands of living creatures, seen all alive in a little drop of water, moving among one another, each several creature having its own proper motion.” The simplicity and elegance of the flagellum have made it a kind of icon for ‘intelligent design’, the movement that seeks to falsify Darwinism. The argument of this movement essentially boils down to a disbelief that beautiful biological structures such as the flagellum could have evolved. The flagellum appears perfectly designed, with a filament, a hook, and a basal body that acts as a motor. Who can imagine how such a structure evolved?

 

We first came across this sort of argument in the first lecture. William Paley applied the ‘argument from design’ to organisms as a whole. With the flagellum the argument has effectively moved to the molecular level. As Richard Dawkins has pointed out, this argument can also be referred to as the “argument from personal incredulity” – essentially reducing to “I cannot believe that something this nice could have evolved, therefore it cannot have”. Michael Behe has formulated a more modern version of this argument with the notion of ‘irreducible complexity’. A system that is irreducibly complex is one that “cannot be produced directly by numerous, successive, slight modifications of a precursor system, because any precursor to an irreducibly complex system that is missing a part is by definition nonfunctional. .... Since natural selection can only choose systems that are already working, then if a biological system cannot be produced gradually it would have to arise as an integrated unit, in one fell swoop, for natural selection to have anything to act on.” Behe’s argument therefore is that if a system can be shown to be irreducibly complex then it has been shown that it could not have evolved. This argument seems to rise above the “argument from personal incredulity” by asserting that a system could not have evolved because it is a structure “in which the removal of an element would cause the whole system to cease functioning”. A mouse trap is an example of an ‘irreducibly complex’ system because without any one of its parts the system ceases to function as a mouse trap. The stakes for the irreducible complexity argument are high because Darwin wrote: “If it could be demonstrated that any complex organ existed which could not possibly have been formed by numerous, successive, slight modifications, my theory would absolutely break down.” What then might explain the evolution of the flagellum?

 

A detour into bacterial pathogenicity brings us to a seemingly unrelated system: the type III secretion system. This system allows Gram-negative bacteria to translocate proteins directly into the cytoplasm of a host cell. It is thus crucial for the pathogenic potential of bacteria such as Yersinia pestis. Strikingly, genes involved in type III secretion are homologous to flagellum genes. Specifically, the motor of both the flagellum and the type III secretion system – including the MS ring, C ring, and export apparatus – is encoded by homologous genes. Phylogenetic profiling of flagellar ortholog families helps us to see this pattern more clearly. When examining the ortholog families with the keyword “flagellar” in their description, we find that generally an organism either has all of the components or none at all. However, there are exceptions. For example, the Chlamydia genome, represented by the letter ‘i’, contains only those genes that also have “type III secretion system” in their description. Thus it seems probable that Chlamydia uses these genes for secretion as opposed to movement.
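
The all-or-none pattern and its exceptions can be illustrated with a minimal sketch of a phylogenetic profile. The family names follow the structures mentioned above, but the genome labels and the presence/absence values are invented toy data, not the profiles shown in the lecture.

```python
# Toy presence/absence profiles (1 = ortholog family found in the genome).
# All values below are invented for illustration only.
FAMILIES = ["MS_ring", "C_ring", "export_apparatus", "hook", "filament"]
SECRETION_HOMOLOGS = {"MS_ring", "C_ring", "export_apparatus"}

profiles = {
    "genome_a": [1, 1, 1, 1, 1],
    "genome_b": [0, 0, 0, 0, 0],
    "genome_i": [1, 1, 1, 0, 0],
}

def classify(profile):
    """Label a genome by which flagellar ortholog families it contains."""
    present = {fam for fam, hit in zip(FAMILIES, profile) if hit}
    if not present:
        return "none"
    if len(present) == len(FAMILIES):
        return "all"
    if present <= SECRETION_HOMOLOGS:
        return "secretion-related subset"
    return "other partial set"

summary = {genome: classify(profile) for genome, profile in profiles.items()}
for genome, label in summary.items():
    print(genome, label)
```

In this toy example only genome_i falls into the "secretion-related subset" class, which is the kind of exception the Chlamydia genome represents in the text.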

 

In summary, the existence of the type III secretion system in a wide variety of bacteria demonstrates that a small portion of the “irreducibly complex” flagellum can indeed carry out an important biological function. Since such a function is clearly favored by natural selection, the contention that the flagellum must be fully assembled before any of its component parts can be useful is incorrect.

 

Evolution of the blood clotting pathway

 

The ability of the body to control the flow of blood following vascular injury is paramount to continued survival. Upon such an injury, a clot is formed from the protein fibrin. Fibrin is a modified version of fibrinogen, a fibrous soluble protein, which makes up 3% of blood plasma. The heart of the blood clot reaction involves just two molecules: fibrinogen and thrombin. Thrombin converts fibrinogen to fibrin in the following manner: it removes portions of the fibrinogen protein called A’s and B’s. Fibrin molecules then clump together because other parts of the molecule – the a and b parts – have an affinity for the sticky crevices left by the excision of the A’s and B’s. The rest of the blood clotting pathway resembles a Rube Goldberg machine, defined in Wikipedia as “a deliberately over engineered machine that performs a very simple task in a very complex fashion, usually including a chain reaction.” This is because the pathway proceeds through a long chain of activations: factor XII is converted to XIIa by the kininogen protein, XIIa then converts factor XI to XIa, and so on, until prothrombin is converted to thrombin and fibrin is made. This system was also proposed to be irreducibly complex.

 

Examining the domain architecture of the genes in this pathway reveals a history of exon shuffling: many of the genes involved in the pathway share the same domains. Most of the enzymes involved in clotting are homologous serine proteases. They are also homologous to the pancreatic serine proteases trypsin, chymotrypsin, and elastase. Therefore, the serine protease domain has a history of functions quite distinct from blood clotting. The N-terminal segments of the proteases are thought to be responsible, at least in part, for the specificities of the proteolytic blood clotting factors.

 

A tree of serine proteases reveals a history of duplications. One branch of this tree corresponds to all of the serine proteases with a gamma domain and two EGF domains. Thus, a single event can account for the introduction of this domain combination into factor IX, factor X, factor VII, prothrombin and protein C. Prothrombin is the deepest division, suggesting that it is the most ancestral member of the pathway. Based upon the sequence comparisons, a scenario for when the clotting proteins made their appearance can be inferred. In this scenario fibrinogen, prothrombin, tissue factor and plasminogen are the most ancestral (oldest), and the other factors entered the pathway later.

 

Further evidence for the ancestral role of fibrinogen comes from an analysis of the three different proteins that make it up: alpha, beta, and gamma. These share low sequence similarity, suggesting that the gene duplication giving rise to beta and gamma ought to have occurred at least 600 million years ago. If so, invertebrates should have at least one fibrinogen-like gene, because the duplication would predate even the divergence of humans and lampreys (jawless vertebrates), which last shared a common ancestor 450 million years ago. In fact, invertebrates do have fibrinogen: the warty sea cucumber (Parastichopus parvimensis) has been found to have a fibrinogen-like sequence.

 

So how could the blood clotting pathway have evolved by gene duplications? The following model was proposed by Ken Miller. Most serine proteases, including trypsin and thrombin, are auto-catalytic. The inactive form of the protease (A) is changed into the active form (A*) when two things happen: it is bound to tissue factor (TF) and it is activated by tissue proteases, including our protease itself (that's the autocatalytic part). This means - and this is important - that our protease is actually involved in cutting two things: fibrinogen, and also itself, converting A's inactive precursor protein into A*. Imagine now that a gene duplication occurs in the gene for our protease, producing a new (B) version of the gene. Proteins A and B are initially identical. Each can bind to TF, each can cleave fibrinogen into fibrin, and each can activate itself or its sister serum protease. So nothing has really changed - we've just got two copies of the same gene. However, a mutation in the active site of B changes its behavior, making it a little less likely to cut fibrinogen and a little more likely to activate protease A. But why would natural selection favor a mutation like this in B's active site? Because B now acts upstream of A, the pathway has become a cascade, and the multiple steps of a cascade amplify the signal from the first stimulus. A cascade therefore increases the efficiency of the clotting process. With so many more active proteases in the neighborhood of the injury, clotting can occur more quickly, increasing the chances of surviving a hemorrhage. Thus the principal parts of this model for the evolution of the blood clotting pathway are gene duplications, mutations that modulate the function of the duplicate, and regulation, where natural selection favors better control of the homologous factors.

 

In summary, the blood clotting pathway evolved by a process of gene duplications from serine proteases that once were digestive enzymes. The argument of irreducible complexity is ruled out by the discovery of fibrinogen in other contexts, the homology of the clotting serine proteases to digestive enzymes, the domain composition being consistent with an exon shuffling model, the branching pattern of the proteins involved being consistent with a gene duplication model, and a plausible scenario of cascade evolution.

 

The structure of biological networks

 

We now shift our focus to the analysis of general pathways in biology. What is the structure of these networks? We should first define a random network so that we have something to compare with. The simplest way to produce a random network was proposed by Erdős and Rényi: N nodes are given, and each pair of nodes is connected with probability p. Thus, to make a network with p=0.15, we go through every pair of nodes and add an edge between them with a 15% chance. We can now compare the properties of this random network with those of a metabolic network. To generate the metabolic network, we link two metabolites whenever there is a reaction in the cell that links them. We then compute the degree distribution of this network, as was described in the domains lecture. The degree distribution of the metabolic network is a straight line on a log-log scale, suggesting a power-law relationship. The same kind of degree distribution is found for protein-protein interaction networks, where proteins that interact are linked by an edge. Such a degree distribution differs from that of a random network, whose degrees cluster around the mean and whose tail would be fit by an exponential of the form y=a^x.
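
The random-network construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the lecture; the network size (200 nodes), the seed, and the function names are assumptions chosen for the example.

```python
import random
from collections import Counter

def erdos_renyi(n, p, seed=0):
    """Link each of the n*(n-1)/2 node pairs independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

def degree_distribution(n, edges):
    """Map each degree k to the number of nodes that have exactly k edges."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return Counter(degree)

n, p = 200, 0.15  # n is an assumed size; p = 0.15 as in the text
edges = erdos_renyi(n, p)
dist = degree_distribution(n, edges)
mean_degree = 2 * len(edges) / n  # expected value is p * (n - 1) = 29.85

for k in sorted(dist):
    print(k, dist[k])
```

Printing the distribution shows the degrees clustered around the mean, with no nodes of very high degree. This is the behavior that the power-law networks discussed in the text do not share.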

 

Networks with a power-law degree distribution may be referred to as scale-free because no matter which scale is chosen the same distribution of degrees is observed among nodes. How can a scale-free network arise? A simple model gives us some insight. Imagine that a network is generated not by the Erdős-Rényi model but instead by the following two rules:

1.      Evolution:  the network expands continuously by the addition of new vertices, and

2.      Preferential-attachment (rich get richer): new vertices attach preferentially to sites that are already well connected.

To incorporate the growing character of the network, starting with a small number (m0) of vertices, at every time step we add a new vertex with m (< m0) edges that link the new vertex to m different vertices already present in the system. In other words, the network grows by introducing, at every step, a new node that preferentially attaches to nodes that already have more attachments. Strikingly, this network evolves into a scale-free network in which the probability that a vertex has k edges follows a power law with an exponent of 2.9 ± 0.1. After t time steps, the model leads to a random network with t + m0 vertices and mt edges (m edges for each of the t added vertices). This model is known as the Barabási-Albert model.
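
The two rules can be simulated directly. The sketch below is one possible implementation, not the original authors' code. It assumes that the m0 starting vertices are joined in a ring, since the text does not specify the starting graph, so the total edge count is m0 more than the mt quoted above. The parameter values are arbitrary choices for the example.

```python
import random
from collections import Counter

def barabasi_albert(m0, m, t, seed=0):
    """Grow a network by preferential attachment; return its list of edges."""
    rng = random.Random(seed)
    # Assumption: the m0 seed vertices (m0 >= 3) start out joined in a ring,
    # so that every vertex has a nonzero degree before growth begins.
    edges = [(i, (i + 1) % m0) for i in range(m0)]
    # Each vertex appears in `stubs` once per edge it touches, so a uniform
    # draw from `stubs` picks a vertex with probability proportional to degree.
    stubs = [v for e in edges for v in e]
    for new in range(m0, m0 + t):
        targets = set()
        while len(targets) < m:  # m different existing vertices
            targets.add(rng.choice(stubs))
        for old in targets:
            edges.append((new, old))
            stubs.extend((new, old))
    return edges

m0, m, t = 5, 3, 2000  # arbitrary example values
edges = barabasi_albert(m0, m, t)
degree = Counter(v for e in edges for v in e)
dist = Counter(degree.values())

for k in sorted(dist)[:10]:
    print(k, dist[k])
print("largest degree:", max(degree.values()))
```

Most vertices keep a degree close to m, while a few early vertices become highly connected hubs. Plotting `dist` on a log-log scale is the natural next step for checking whether the tail looks like a straight line.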

 

Another form of biological network involves gene regulation. In these networks the edges have a direction: A regulates B (the relationship is not symmetrical). Since in this kind of network the nodes are the coding regions and the edges are encoded in the promoter sequences, they can be represented by a useful notation indicating this (slide 55). It is interesting to remember that while the genome contains all of the connections, this ‘view from the genome’ is an integration of the relationships over space and time. The first significant gene regulatory network in biology was constructed for the developmental process of specifying the sea urchin endoderm by Eric Davidson’s lab at Caltech.

 

The interactions of a small set of regulators can lead to interesting dynamics that encode useful properties. For example, a lock-in switch can be encoded for genes B and C, where A activates B and B activates C, by simply adding a feedback loop in which C also activates B. Thus, upon induction of B by A, B and C proceed to lock in their own activity in the ‘on’ state.
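
The lock-in behavior can be seen in a toy Boolean model. This sketch makes assumptions that the text does not state: genes are simply on or off, and all genes are updated at the same time from the previous state.

```python
def simulate(a_signal, feedback=True):
    """Synchronous Boolean model: A activates B, B activates C,
    and, if feedback is True, C also activates B."""
    b = c = False
    states = []
    for a in a_signal:
        # Both new values are computed from the previous state of B and C.
        b, c = (a or (feedback and c)), b
        states.append((b, c))
    return states

# A is on for two time steps and then switched off for good.
pulse = [True, True] + [False] * 6
with_loop = simulate(pulse, feedback=True)
without_loop = simulate(pulse, feedback=False)

print("with feedback:   ", with_loop[-1])
print("without feedback:", without_loop[-1])
```

With the feedback loop, B and C remain on long after A has disappeared. Without it, both return to the off state. In this synchronous toy model the pulse of A has to last at least two steps; a one-step pulse makes B and C alternate instead of locking in, which is an artifact of the update scheme rather than a claim about real circuits.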

 

Uri Alon’s lab sought to identify small sets of regulators whose topology, that is, the set of relationships among them, is abundant in the network. These so-called network motifs include the feedforward loop, comprising A inducing B, B inducing C, and A also inducing C. It was proposed that gene regulatory networks are made up of these kinds of motifs, each performing a kind of coherent operation.
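
As a rough sketch of the first step in such an analysis, the code below lists every feedforward loop in a small directed network. The toy network is hypothetical and the function name is an assumption. Deciding whether a motif is truly abundant would also require comparing the count against randomized networks, which is not shown here.

```python
from itertools import permutations

def feedforward_loops(edges):
    """Return every ordered triple (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = set(edges)
    nodes = sorted({n for e in edges for n in e})
    return [(a, b, c) for a, b, c in permutations(nodes, 3)
            if (a, b) in edge_set
            and (b, c) in edge_set
            and (a, c) in edge_set]

# Hypothetical toy network; each pair (x, y) means "x induces y".
toy = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("B", "D")]
loops = feedforward_loops(toy)
print(loops)  # [('A', 'B', 'C'), ('B', 'C', 'D')]
```

Checking all ordered triples is slow for large networks but keeps the definition of the motif easy to read.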

 

Davidson and Erwin proposed that gene regulatory networks (GRNs) are composed of four classes of components. Kernels are evolutionarily inflexible subcircuits that perform essential upstream functions in building given body parts. Plug-ins are small subcircuits that have been repeatedly co-opted to diverse developmental purposes. Switches allow or disallow developmental subcircuits to function in a given context and so act as input/output (I/O) devices within the network. Differentiation gene batteries are sets of genes that are co-induced and act as a concerted module, such as the genes expressed in a maturely specified muscle tissue. Interestingly, they proposed that changes to the different classes lead to different kinds of evolutionary consequences. For example, since a kernel is such a fundamental circuit, any change to it might lead to a new phylum. Changes to plug-ins or switches may change the sizes of body parts or elaborate the morphological pattern, which would define new classes, orders, or families. Alterations in the gene batteries, by contrast, may lead to different functional capabilities and define the differences among species.