Course notes for
the January 3rd 2010 lecture of the Genome Evolution Course
Itai Yanai
Department of
Biology
Technion – Israel
Institute of Technology
yanai@technion.ac.il
Evolution of
Biological Networks
Things derive their being and
nature by mutual dependence and are nothing in themselves.
-Nagarjuna, second century Buddhist philosopher
What is a network and why is
it a useful object in our understanding of biological systems? A network is
defined by a set of nodes and connecting edges. For example, a network of genes
and their functional relationships to one another may be captured by ‘nodes’
representing genes and ‘edges’ among the nodes representing observed
relationships between pairs of genes. Here we will use the terms “networks”,
“graphs”, and “pathways” interchangeably. Traditionally a gene was studied rather
independently of other genes. An enzyme would be described in terms of its
substrate and product. However, a gene is not an island; rather, its function
might only be understood in the context of its interactions with other genes and proteins
in the cell. This notion is at the core of the field of systems biology as well
as some East-Asian philosophies, as hinted by the quote above. In this lecture
we discuss two specific biological systems (the flagellum and blood clotting)
and examine perspectives on their evolution. In the second part
we consider more theoretical aspects of network structure, function, and
evolution.
Evolution of the flagellum
system
The flagellum of bacteria was
the first microbial property observed. Antony van Leeuwenhoek, the discoverer of
the microbial world, first described microbes and their flagella in the
following words: “I must say, for my part, that no more pleasant
sight has ever yet come before my eye than these many thousands of
living creatures, seen all alive in a little drop of water, moving
among one another, each several creature having its own proper
motion.” The simplicity and elegance of the flagellum have made it a kind
of icon for ‘intelligent design’, the movement that seeks to falsify Darwinism.
The argument of this movement essentially boils down to a disbelief that
beautiful biological structures such as the flagellum could have evolved. The
flagellum appears perfectly designed with a filament, a hook, and a basal body
that acts as a motor. Who can imagine how such a structure evolved?
We first came across this
sort of argument in the first lecture. William Paley applied the ‘argument
from design’ to whole organisms. With the flagellum the argument has
effectively moved to the molecular level. As Richard Dawkins has pointed out, this
argument can also be referred to as the “argument from personal incredulity”,
essentially reducing to “I cannot believe that something this nice could have
evolved, therefore it cannot have”. Michael Behe
has formalized a more modern version of this argument with the notion of ‘irreducible
complexity’. A system that is irreducibly complex is one that “cannot be
produced directly by numerous, successive, slight modifications of a
precursor system, because any precursor to an irreducibly complex system that
is missing a part is by definition nonfunctional. .... Since natural selection
can only choose systems that are already working, then if a biological system
cannot be produced gradually it would have to arise as an integrated unit, in
one fell swoop, for natural selection to have anything to act on.” Behe’s argument therefore is that if a system can be shown
to be irreducibly complex then it has been shown that it could not have
evolved. This argument seems to rise above the “argument from personal
incredulity” by asserting that a system could not have evolved because it is a
structure “in which the removal of an element would cause the whole system to
cease functioning”. A mouse trap is an example of an ‘irreducibly complex’ system
because without any of the parts the system ceases to function as a mouse trap.
The stakes for the irreducible complexity argument are high because Darwin wrote:
“If it could be demonstrated that any complex organ existed which could not
possibly have been formed by numerous, successive, slight modifications,
my theory would absolutely break down.” What then might explain the evolution
of the flagellum?
A detour into bacterial pathogenicity brings us to a seemingly unrelated system:
the type III secretion system. This system allows Gram-negative bacteria to translocate proteins directly into the cytoplasm of a host
cell. It is thus crucial for the pathogenic potential of bacteria such as Yersinia pestis.
Strikingly, genes involved in type III secretion are homologous to flagellum
proteins. Specifically, the motor of both the flagellum and the type III secretion system (including the MS ring, C ring, and
export apparatus) is encoded by homologous genes. Phylogenetic profiling of flagellar ortholog families helps us to see this pattern
more clearly. When examining the ortholog families with the keyword “flagellar” in their description, we find that generally an
organism either has all of the components or none at all. However, there are
exceptions. For example, the Chlamydia genome (represented by the letter ‘i’) contains only those genes that also have “type III secretion
system” in their description. Thus it seems probable that Chlamydia
uses these genes for secretion as opposed to movement.
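The all-or-none pattern and its exception can be sketched as a small presence/absence table. The following is only a toy illustration: the genome names and the presence sets are invented, and the flag marking families that are shared with type III secretion simply mirrors the components named above.

```python
from typing import Dict, Set

# Ortholog families named in the text. The flag marks families whose
# description also mentions "type III secretion system" (toy annotation).
FAMILIES: Dict[str, bool] = {
    "MS ring": True,
    "C ring": True,
    "export apparatus": True,
    "hook": False,
    "filament": False,
}

# Hypothetical genomes with invented presence sets (not real annotation data).
GENOMES: Dict[str, Set[str]] = {
    "motile_genome": set(FAMILIES),
    "nonmotile_genome": set(),
    "secretion_only_genome": {"MS ring", "C ring", "export apparatus"},
}


def classify(present: Set[str]) -> str:
    """Label a phylogenetic profile as all, none, or a partial pattern."""
    if present == set(FAMILIES):
        return "all"
    if not present:
        return "none"
    shared = {name for name, is_t3ss in FAMILIES.items() if is_t3ss}
    # A partial profile made up only of secretion-related families is the
    # pattern described for Chlamydia.
    return "type III subset" if present <= shared else "other partial"


for genome, present in GENOMES.items():
    print(genome, classify(present))
```

A real analysis would read the presence sets from an ortholog database rather than hard-coding them.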
In summary, the existence of
the type III secretion system in a wide variety of
bacteria demonstrates that a small portion of the “irreducibly complex”
flagellum can indeed carry out an important biological function. Since such a
function is clearly favored by natural selection, the contention that the
flagellum must be fully assembled before any of its component parts can be
useful is obviously incorrect.
Evolution of the blood
clotting pathway
The ability of the body to
control the flow of blood following vascular injury is paramount to continued
survival. Upon such an injury, a clot is formed from the protein fibrin. Fibrin is
a modified version of fibrinogen, a fibrous soluble protein that makes up about 3%
of the protein in blood plasma. The heart of the blood clotting reaction involves just two
molecules: fibrinogen and thrombin. Thrombin converts fibrinogen to fibrin in
the following manner. It removes portions of the fibrinogen protein
called A’s and B’s. The resulting fibrin proteins clump together because
other parts of the molecule (the a and b
parts) have an affinity for the sticky crevices left by the excision of the A’s and B’s. The rest of the blood
clotting pathway resembles a Rube Goldberg machine, defined in Wikipedia as “a
deliberately over engineered machine that performs a very simple task
in a very complex fashion, usually including a chain reaction.” The resemblance arises because
factor XII is converted to XIIa by the kininogen protein, the XIIa
protein then converts factor XI to XIa, and so on down the cascade, until prothrombin is converted to thrombin and fibrin is made. This
system was also proposed to be irreducibly complex.
The domain
architecture of the genes in this pathway
reveals a history of exon shuffling: many of the
genes involved share the same domains. Most of the enzymes
involved in clotting are homologous serine proteases. They are also homologous
to the pancreatic serine proteases trypsin, chymotrypsin, and elastase.
Therefore, this domain has a history of functions quite distinct from blood
clotting. The N-terminal segments of the proteases are thought to be
responsible, at least in part, for the specificities of the proteolytic
blood clotting factors.
A tree of serine proteases
reveals a history of duplications. One branch of this tree corresponds
to all of the serine proteases with a gamma-carboxyglutamate (Gla) domain and two EGF domains. Thus, a
single event can account for the introduction of these domains into factor IX, factor X, factor
VII, prothrombin and protein C. Prothrombin
is the deepest division, suggesting that it is the most ancestral member of the
pathway. Based upon the sequence comparisons, a scenario for when clotting
proteins made their appearance can be inferred: fibrinogen, prothrombin, tissue factor and plasminogen
are the most ancestral (oldest), and the other factors entered the
pathway later in time.
Further evidence for the
ancestral role of fibrinogen comes from an analysis of the three different
proteins that make it up: alpha, beta, and gamma. These share low sequence
similarity, suggesting that the gene duplication giving rise to beta and gamma
occurred at least 600 million years ago. In this case invertebrates
should have at least one fibrinogen, because the duplication would predate even the split between humans and the lamprey
(a jawless fish), which last shared a common ancestor about 450 million years ago. In
fact, invertebrates do have fibrinogen: the warty sea cucumber (Parastichopus parvimensis)
has been found to have a fibrinogen-like sequence.
So how could the blood
clotting pathway have evolved by gene duplications? The following model was
proposed by Ken Miller. Most serine proteases, including trypsin
and thrombin, are autocatalytic. The inactive form of the protease (A) is
changed into the active form (A*) when two things happen: it is bound to tissue
factor (TF) and it is activated by tissue proteases, including our protease
itself (that's the autocatalytic part). This means - and this is important -
that our protease is actually involved in cutting two things:
fibrinogen, and also its own inactive precursor, which it converts into A*.
Imagine now that a gene duplication occurs in the gene
for our protease, producing a new (B) version of the gene. Proteins A and B are
initially identical. Each can bind to TF, each can cleave fibrinogen into
fibrin, and each can activate itself or its sister serum protease. So nothing
has really changed - we've just got two copies of the same gene. However, a
mutation in the active site of B changes its behavior, making it a little less
likely to cut fibrinogen and a little more likely to activate protease A. But
why would natural selection favor a mutation like this in B's active site? The
multiple steps of a cascade amplify the signal from the first stimulus,
so a cascade increases the efficiency of the clotting process. With so many more
active proteases in the neighborhood of the injury, clotting can occur more
quickly, increasing the chances of surviving a hemorrhage.
Thus the principal parts of this model for the evolution of the blood clotting
pathway involve gene duplications, mutations that
modulate the function of the duplicate, and regulation, where natural selection
favors better control of the homologous factors.
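The amplification argument can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every active protease activates the same fixed number of molecules of the next factor; the function name and the gain value of 10 are invented and are not measured properties of the clotting cascade.

```python
def active_molecules(steps: int, gain: int, initial: int = 1) -> int:
    """Molecules active at the last step of a cascade.

    Assumes each active molecule activates `gain` molecules of the next
    factor (a deliberately crude, hypothetical model).
    """
    count = initial
    for _ in range(steps):
        count *= gain
    return count


# One protease acting directly versus the same signal passed through
# a two-step and a three-step cascade.
for steps in (1, 2, 3):
    print(steps, active_molecules(steps, gain=10))
```

Under this assumption each added step multiplies the response by the gain, which is why a duplicate that activates its sister protease can be favored over one that only cuts fibrinogen.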
In summary, the blood
clotting pathway evolved by a process of gene duplications
from serine proteases that once were digestive enzymes. The argument of
irreducible complexity is ruled out by the discovery of fibrinogen in other
contexts, the homology of serine protease to digestive enzymes, the domain
composition being consistent with an exon shuffling
model, the branching pattern of the proteins involved being consistent with a
gene duplication model, and a plausible scenario of cascade
evolution.
The structure of biological
networks
We now shift our focus to the
analysis of general pathways in biology. What is the structure of these
networks? We should first define a random network so that we have something
to compare with. The simplest way to produce a random network was proposed by Erdős and Rényi: each pair among N nodes is connected independently with
probability p. Thus to make a
network with p=0.15, where there is a 15% chance of an edge existing between
any given pair of nodes, we use this probability to establish a random set of edges
among the nodes. We can now compare the properties of this network to those of a metabolic network.
To generate the metabolic network, we link each pair of metabolites
in the cell for which there is a reaction that links
them. We then compute the degree distribution of this network, as was described
in the domains lecture. The degree distribution of the metabolic network is a straight
line on a log-log scale, suggesting a power-law relationship. The same kind of degree
distribution is found for protein-protein interaction networks, where proteins
that interact are linked by an edge. Such a degree distribution differs from
that of a random network, which is peaked around the average degree and whose tail drops off rapidly, roughly like an exponential of the form y=a^x with a<1.
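The random-network construction described above can be sketched in a few lines. This is a minimal illustration, not the code used for the lecture figures; the function names, the network size of 200 nodes, and the seed are arbitrary choices.

```python
import random
from collections import Counter
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Return the edge list of a random network on nodes 0..n-1.

    Every pair of nodes is linked independently with probability p.
    """
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def degree_distribution(n, edges):
    """Map each degree k to the number of nodes that have k edges."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree)


n, p = 200, 0.15
edges = erdos_renyi(n, p, seed=1)
dist = degree_distribution(n, edges)
mean_degree = 2 * len(edges) / n  # expected value is p * (n - 1)

for k in sorted(dist):
    print(k, dist[k])
```

Printing the table shows the degrees clustered around the mean, with no nodes of very high degree; a power-law network of the same size would instead contain a few highly connected hubs.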
Networks with a power-law
degree distribution may be referred to as scale-free because no matter which
scale is chosen the same distribution of degrees is observed among nodes. How
can a scale-free network arise? A simple model gives us some insight. Imagine
a network generated not by the Erdős-Rényi model
but instead by the following two rules:
1. Evolution: the network expands continuously by the
addition of new vertices, and
2. Preferential-attachment (rich
get richer): new vertices attach preferentially to sites that are already well
connected.
To incorporate the growing
character of the network, we start with a small number (m0)
of vertices, and at every time step we add a new vertex with m (≤ m0) edges that link the new vertex
to m different vertices already present in the system. In other words,
the network grows so that at every step we introduce a new node that
preferentially attaches to nodes that have more attachments. Strikingly, this
network evolves into a scale-free network in which the probability that a vertex
has k edges follows a power law with an exponent of 2.9 +/- 0.1. After
t time steps, the model leads to a random network with t + m0
vertices and mt (m times t) edges. This model is
known as the Barabási-Albert model.
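The growth and preferential-attachment rules can be simulated directly. The sketch below is one possible implementation under stated assumptions: it requires 1 ≤ m ≤ m0, and because the initial m0 vertices have no edges yet, the very first new vertex chooses its targets uniformly at random (a choice made here, not something specified in the text). The function name and parameter values are illustrative.

```python
import random
from collections import Counter


def barabasi_albert(m0, m, steps, seed=None):
    """Grow a network by preferential attachment and return its edge list."""
    if not 1 <= m <= m0:
        raise ValueError("need 1 <= m <= m0")
    rng = random.Random(seed)
    edges = []
    # `endpoints` lists every vertex once per edge it touches, so a uniform
    # draw from it picks a vertex with probability proportional to its degree.
    endpoints = []
    # Assumption: with no edges yet, the first targets are chosen uniformly.
    targets = rng.sample(range(m0), m)
    for new in range(m0, m0 + steps):
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
        # Pick m distinct, degree-weighted targets for the next vertex.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = sorted(chosen)
    return edges


edges = barabasi_albert(m0=5, m=3, steps=1000, seed=0)
degree = Counter(v for edge in edges for v in edge)
print("edges:", len(edges))              # m * t edges
print("max degree:", max(degree.values()))
```

Counting how many vertices have each degree and plotting the counts on a log-log scale is the natural next step for checking the power-law behavior. Note that when m < m0, some of the initial vertices may never receive an edge in this sketch.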
Another form of biological
networks involves gene regulation. In these networks the edges in the network
have a direction: A regulates B (not symmetrical). Since in this kind of
network the nodes are the coding regions and the edges are encoded in the
promoter sequences, they can be drawn with a notation that indicates
this (slide 55). It is interesting to remember that while the genome contains
all of the connections, this ‘view from the genome’ is an integration of the
relationships over space and time. The first significant gene regulatory
network in biology was worked out by Eric Davidson’s lab at Caltech for the developmental process that specifies the
sea urchin endoderm.
The interactions of a small
set of regulators can produce interesting dynamics.
For example, a lock-in switch can be encoded for genes B and C, where A
activates B and B activates C, by simply adding a
feedback loop in which C also activates B. Thus upon induction of B by A, B and C
proceed to lock in their own activity in the ‘on’ state.
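The lock-in behavior can be illustrated with a very simple Boolean model. This is a sketch under strong assumptions: genes are either on or off, all genes update at the same time in discrete steps, and B is on whenever A or C was on in the previous step. None of these modeling choices comes from the lecture; they are just the simplest way to show the effect of the feedback edge.

```python
def step(a, b, c, feedback=True):
    """One synchronous update; returns the new states of B and C.

    B is activated by A, and by C when the feedback edge is present.
    C is activated by B.
    """
    new_b = a or (feedback and c)
    new_c = b
    return new_b, new_c


def simulate(a_inputs, feedback=True):
    """Run the circuit on a sequence of A states, starting with B and C off."""
    b = c = False
    history = []
    for a in a_inputs:
        b, c = step(a, b, c, feedback)
        history.append((a, b, c))
    return history


# A is switched on transiently, then removed.
pulse = [False, True, True, False, False, False]
print(simulate(pulse, feedback=True)[-1])   # B and C remain on
print(simulate(pulse, feedback=False)[-1])  # B and C fall back to off
```

In this synchronous model the pulse of A must last long enough for C to switch on before A disappears; with a pulse of a single step, B and C would alternate rather than lock in.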
Uri Alon’s
lab sought to identify small sets of regulators whose topology, that is, the set of relationships
among them, is abundant in the network. These so-called network motifs include
the feedforward loop, comprising A
inducing B, B inducing C, and A also inducing C. It was proposed that gene
regulatory networks are made up of these kinds of motifs, each performing a
kind of coherent operation.
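Finding feedforward loops in a directed network amounts to searching for ordered triples of nodes with the three required edges. The following is a brute-force sketch suitable only for small networks; the function name and the toy edge list are invented for illustration and are not taken from any real regulatory network.

```python
from itertools import permutations


def feedforward_loops(edges):
    """Return all (A, B, C) with edges A->B, B->C, and A->C."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    return sorted(
        (a, b, c)
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set
    )


# Toy directed network: each pair (X, Y) means "X regulates Y".
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(feedforward_loops(toy_edges))
```

To judge whether a motif is abundant, its count would then have to be compared with the count expected in suitably randomized networks, which this sketch does not attempt.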
Davidson and Erwin proposed
that gene regulatory networks are composed of four classes of components. Kernels
are evolutionarily inflexible subcircuits that
perform essential upstream functions in building given body parts.
Plug-ins are small subcircuits that have been repeatedly co-opted
to diverse developmental purposes. Switches allow or disallow developmental subcircuits to function in a given context and so act as
input/output (I/O) devices within the network. Differentiation gene batteries
are sets of genes that are co-induced and act as a concerted module, such as
the genes expressed in a maturely specified muscle tissue.
Interestingly, they proposed that changes to different classes lead to different
kinds of evolutionary consequences. For example, since a kernel is such a
fundamental circuit, any change to it might lead to a new phylum. Changes to
plug-ins or switches may change the sizes of body parts or elaborate the
morphological pattern, which would define new classes, orders, or
families. Alterations in the gene batteries, in turn, may lead to different
functional capabilities and define the differences among species.