Course notes for the December 13th 2009 lecture of the
Genome Evolution Course
Itai Yanai
Department of Biology
Technion – Israel Institute of Technology
yanai@technion.ac.il
Notes drafted by Michal Levin
Evolution by Protein Domains
A protein domain is an evolutionary unit of
sequence that can evolve, function, and fold independently of
the rest of the protein of which it is a part. Each domain forms a compact
three-dimensional structure and is often stable and folded on its own.
Many proteins consist of several domains. Domains are the
evolutionary units of sequence that make up gene coding regions. Novel
genes can be created by recombination of domains into new domain arrangements.
The structure of protein domains
Proteins fold into distinct secondary structures. For example, the main
secondary structures of hemoglobin are alpha-helices, while thrombin
is built mainly of beta-sheets. Investigation of different proteins revealed spatially
distinct structural units. These units were observed to recur
in different structural contexts in different proteins. For instance, the
“Rossmann fold” is found in two distinct proteins, lactate
dehydrogenase and alcohol dehydrogenase. Domains can also recur in multiple
copies in the same protein.
The structural definition of a domain
From a structural perspective, a protein domain may be defined as
a distinct, compact, and stable protein structural unit that folds
independently of other such units. Many domains of multidomain proteins may
also be found as independently functioning proteins. For example,
the diphtheria toxin shows an interesting domain architecture. It is made
up of three domains, each of which is involved in a different stage of
infection (receptor binding, membrane penetration, and catalysis of ADP-ribosylation
of elongation factor), and each of which shares similarity with a distinct protein.
A particular topology of secondary structures, i.e. a tertiary
structure, is also called a domain fold type. In 1996 Holm and Sander showed
that forty percent of the folds are covered by 16 fold classes. Although each
class has individual features, most fold classes map to five attractor regions
that capture structural similarities among protein structures.
The Structural Classification of Proteins (SCOP) database
is a largely manual classification of protein structural
domains based on similarities of their amino
acid sequences and three-dimensional structures. Originally
published in 1995, SCOP relies heavily on human expertise, which is often
required to decide whether certain proteins are evolutionarily related
and should therefore be assigned to the same superfamily, or whether their
similarity is a result of structural constraints, in which case they merely belong to
the same fold.
SCOP thus distinguishes three hierarchical levels of
structural classification: family, superfamily, and fold.
Domains are clustered into families in which significant sequence
similarity is detected as well as conservation of biochemical activity. Families
are in turn grouped into superfamilies where sequence similarity is still
recognizable and basic biochemical properties are conserved. Superfamilies and families are monophyletic (derived
from a common ancestor). A fold is a topology of the folded protein backbone. It
is unknown whether superfamilies of the same fold are monophyletic.
Homology is inferred if 35-100% sequence similarity
can be detected.
In general, proteins with similar structures
are not necessarily homologous, i.e. they do not necessarily share common ancestry. Two proteins can
arrive at the same structure either by common ancestry or by convergence,
i.e. by acquiring the same sequence/structure independently. Protein structure
is extremely robust to changes in sequence, which means that most changes are
permissible. It has been estimated that only one in every ten amino acids in the
sequence is important for the structural aspects of the protein.
How many families can be supported by a single fold? In other
words, how many families of proteins can we find to have the same fold? It
appears that most folds support only one family, while a few folds support
many families.
The number of protein folds is not a static measure; it must be
updated constantly as more protein information is acquired. The total
number of folds in globular, water-soluble proteins is estimated at about
1000. The sequenced genomes of unicellular organisms encode between approximately
25% of the total number of folds, for the minimal genomes of the Mycoplasmas, and 70-80%, for larger genomes
such as those of Escherichia coli and yeast. The number of protein families with
significant sequence conservation was estimated to be between 4000 and 7000,
with structures available for about 20% of these.
Domains from a sequence aspect
A domain at the sequence level is defined in terms of its
recurrence across different sequences. Several databases, such as SMART, Pfam,
and ProSite, construct sequence families at the domain level. Thus, sequence
biology proceeds predominantly by decomposing proteins into their domains. On
average, domains are about 100 amino acids long.
Proteins can contain several different or similar domains. When mapping
domains to sequence, we might expect a colinear configuration in which one
domain is encoded completely before the start of the next one. Alternatively,
the domains might be non-colinear, with a domain folded from distant
regions of the sequence. Most domains are colinear within multidomain proteins.
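The colinear versus non-colinear distinction can be illustrated with a minimal sketch. It assumes that each domain is described by the sequence segments (start and end residue positions) that fold into it; the function name, the data layout, and the toy coordinates are hypothetical and chosen only for illustration.

```python
def is_colinear(domains):
    """Return True if every domain is one contiguous segment and no
    domain starts before the previous one ends.

    `domains` maps a domain name to a list of (start, end) residue
    ranges (1-based, inclusive). This layout is an assumption of the
    sketch, not a standard file format.
    """
    # A domain assembled from more than one segment is non-colinear.
    if any(len(segments) != 1 for segments in domains.values()):
        return False
    # Order the single segments along the sequence and check for overlap.
    ordered = sorted(segments[0] for segments in domains.values())
    return all(prev_end < next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))


# Toy examples (invented coordinates).
colinear = {"A": [(1, 120)], "B": [(121, 260)]}
non_colinear = {"A": [(1, 80), (200, 260)], "B": [(81, 199)]}

print(is_colinear(colinear))
print(is_colinear(non_colinear))
```

In the second toy protein, domain A is folded from two distant stretches of sequence that flank domain B, which is the non-colinear case described above.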
Traditionally, multidomain proteins have made the classification of
proteins difficult. Misannotations due to multidomain proteins can arise from
a faulty syllogism (a logical argument of the form: if A=B and B=C, then A must
equal C). Consider a multidomain protein C that consists of two domains. The first
is similar to a domain known to function as a kinase. The second domain is
similar to another domain of unknown function. It is clearly incorrect to
deduce that the second domain must be a kinase-like domain because it is connected
to the kinase domain in the multidomain protein, yet this is precisely the
error that many annotation software programs propagated in the early days of
functional annotation.
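One way to avoid this error is to transfer a function only from the domain that the similarity actually covers. The following is a minimal sketch of that idea, not the method of any particular annotation program. The function name, the 80% coverage threshold, and the coordinates of protein C are all assumptions made for illustration.

```python
def transfer_annotation(hit_start, hit_end, annotated_domains, min_coverage=0.8):
    """Return the functions of those domains that the alignment covers.

    hit_start, hit_end : region of the annotated protein matched by the
                         query (1-based, inclusive).
    annotated_domains  : list of (start, end, function) on the annotated
                         protein.
    min_coverage       : fraction of a domain that must be covered
                         (an arbitrary threshold assumed for this sketch).
    """
    transferred = []
    for start, end, function in annotated_domains:
        # Number of residues shared by the alignment and the domain
        # (negative if the two regions do not touch).
        overlap = min(hit_end, end) - max(hit_start, start) + 1
        if overlap / (end - start + 1) >= min_coverage:
            transferred.append(function)
    return transferred


# Protein C from the text: a kinase domain followed by a domain of
# unknown function (coordinates are invented).
protein_c = [(1, 250, "kinase"), (251, 400, "unknown")]

# A query that matches only the second domain must not inherit "kinase".
print(transfer_annotation(260, 395, protein_c))
```

A whole-protein transfer would have labeled the query a kinase simply because protein C carries that label, which is the syllogism error described above.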
Domains often fuse to form a new multidomain protein. An
example is glycyl-tRNA synthetase, where the two separate E. coli proteins glyQ
and glyS are found fused as a single multidomain protein in C. trachomatis.
In the case of domain fusion there could well be a functional relationship
between the two fused proteins, such as catalysis of subsequent steps along a biochemical
cascade. Fusion links between different domains can thus be used to predict
protein-protein interactions. Other examples are genes involved in glycolysis
in different microorganisms and thymidylate synthesis genes in different yeast
genomes.
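The use of fusion links for prediction can be sketched as follows. This is a simplified illustration that assumes each protein has already been decomposed into named domains. The domain labels, the name of the fused protein, and the unrelated third protein are hypothetical placeholders.

```python
from itertools import combinations


def fusion_links(separate, fused):
    """Predict links between separate proteins of one genome whose
    domains are found together in a single protein of another genome.

    Both arguments map a protein name to a tuple of domain names.
    Returns a set of (protein_1, protein_2, fused_protein) triples.
    """
    links = set()
    for (p, p_domains), (q, q_domains) in combinations(sorted(separate.items()), 2):
        p_set, q_set = set(p_domains), set(q_domains)
        if p_set & q_set:
            continue  # shared domains would make the link ambiguous
        for fused_name, fused_domains in fused.items():
            # Link p and q if one protein contains all of their domains.
            if p_set | q_set <= set(fused_domains):
                links.add((p, q, fused_name))
    return links


# Toy data modeled on the glycyl-tRNA synthetase example
# (domain labels and the name "fused_GlyRS" are placeholders).
e_coli = {
    "glyQ": ("GlyRS_alpha",),
    "glyS": ("GlyRS_beta",),
    "other": ("Unrelated",),
}
c_trachomatis = {"fused_GlyRS": ("GlyRS_alpha", "GlyRS_beta")}

print(fusion_links(e_coli, c_trachomatis))
```

On the toy data only the glyQ and glyS pair is linked, because only their domains co-occur in one fused protein.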
Many of the extracellular matrix proteins are multidomain proteins.
Having multiple domains increases the interaction surface and allows for a larger protein
with more complex binding capabilities.
The domain architecture
of a protein contains not only information about the domain composition of the
protein but also information about domain arrangements and domain duplications. The
accumulation of additional domains in a protein is called domain accretion. Domain accretion may have
contributed to the evolution of complexity in the pathway in which it occurs.
Most genes have more than one domain. The number of domain types
per gene is exponentially distributed, whereas the total number of domains per
gene follows a power-law distribution. Power laws describe the distributions of a number
of quantities in biological and other contexts, e.g., the node degrees (number
of connections) in metabolic and protein interaction networks, the Internet
and social networks, citations of scientific papers, populations of cities, and
personal wealth. Networks described by power laws are known as scale-free:
they look the same at different scales.
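The two distributions can be told apart by how they straighten out under a logarithmic transformation: an exponential distribution is linear when log(count) is plotted against k, whereas a power law is linear when log(count) is plotted against log(k). The sketch below applies this crude comparison to synthetic data. It is a heuristic for illustration only (maximum-likelihood fitting is the more rigorous approach), and the function names and toy parameters are assumptions.

```python
import math


def r_squared(xs, ys):
    """Coefficient of determination of a least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def better_fit(counts):
    """Say which transformation makes the data more linear.

    `counts` maps k (e.g. domains per gene, k >= 1) to the number of
    genes with that k.
    """
    ks = sorted(k for k, c in counts.items() if c > 0)
    log_counts = [math.log(counts[k]) for k in ks]
    r2_exponential = r_squared(ks, log_counts)                        # log N vs k
    r2_power_law = r_squared([math.log(k) for k in ks], log_counts)   # log N vs log k
    return "power law" if r2_power_law > r2_exponential else "exponential"


# Synthetic, noise-free examples (parameters chosen arbitrarily).
power_like = {k: 10000 * k ** -2.5 for k in range(1, 21)}
exp_like = {k: 10000 * math.exp(-0.5 * k) for k in range(1, 21)}

print(better_fit(power_like))
print(better_fit(exp_like))
```

Real counts are noisy and sparse in the tail, so a comparison of this kind should be read as a first look rather than a test.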
Domain Architecture Networks
We may abstract the domains composing a set of genes as forming a
network. A pair of domains is linked if there is at least one
protein containing both of the domains. The distribution of the number of links per domain in such a
network within a genome follows a power law, i.e. most domains are
linked with only one other domain, while very few domains are connected to
many other domains.
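A minimal sketch of this abstraction is given below. It builds the network from a set of domain architectures and tabulates how many domains have each number of partners. The protein names and architectures are invented toy data, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def domain_network(proteins):
    """Return the set of domain pairs that co-occur in at least one protein.

    `proteins` maps a protein name to a tuple of its domain names.
    """
    edges = set()
    for domains in proteins.values():
        # Every unordered pair of distinct domains in a protein is an edge.
        for a, b in combinations(sorted(set(domains)), 2):
            edges.add((a, b))
    return edges


def degree_distribution(edges):
    """Map each degree (number of partner domains) to the number of domains having it."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


# Invented toy architectures with one hub domain.
toy_proteins = {
    "p1": ("SH3", "SH2"),
    "p2": ("SH2", "Kinase"),
    "p3": ("SH2", "PH"),
    "p4": ("SH2", "PTP"),
    "p5": ("EGF", "ANK"),
}

distribution = degree_distribution(domain_network(toy_proteins))
print(dict(distribution))
```

In the toy data six domains have a single partner and one domain has four, a miniature version of the skewed pattern described above.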
Supra-domains
A supra-domain is defined as a domain combination in a particular
N-to-C-terminal orientation that occurs in at least two different domain
architectures in different proteins with: (i) different types of domains at the
N and C-terminal end of the combination; or (ii) different types of domains at
one end and no domain at the other. Supra-domains are thus combinations of domains that
almost always occur together. For example, the P-loop containing nucleotide
triphosphate (NTP) hydrolase domain and the translation protein domain occur as
one combination in several different translation factors. This supra-domain
occurs in 35 different domain architectures. Another example is the SH3-SH2
clamp.
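The definition can be turned into a small check for two-domain combinations. The sketch below is one reading of the definition, with two assumptions: only adjacent pairs are considered, and condition (ii) is taken to mean that both architectures lack a domain at the same end. The function names and the toy architectures (including the flanking domain labels) are hypothetical.

```python
from itertools import combinations


def occurrences(architectures, pair):
    """Collect (architecture, left neighbor, right neighbor) for every
    occurrence of the adjacent, N-to-C ordered domain pair."""
    found = set()
    for arch in set(map(tuple, architectures)):
        for i in range(len(arch) - 1):
            if arch[i:i + 2] == tuple(pair):
                left = arch[i - 1] if i > 0 else None
                right = arch[i + 2] if i + 2 < len(arch) else None
                found.add((arch, left, right))
    return found


def is_supra_domain(architectures, pair):
    """Apply conditions (i) and (ii) to a two-domain combination."""
    for (a1, l1, r1), (a2, l2, r2) in combinations(occurrences(architectures, pair), 2):
        if a1 == a2:
            continue  # the two occurrences must be in different architectures
        # (i) a different flanking domain at both ends
        both_differ = None not in (l1, r1, l2, r2) and l1 != l2 and r1 != r2
        # (ii) a different flanking domain at one end, no domain at the other
        n_differs = r1 is None and r2 is None and None not in (l1, l2) and l1 != l2
        c_differs = l1 is None and l2 is None and None not in (r1, r2) and r1 != r2
        if both_differ or n_differs or c_differs:
            return True
    return False


# Toy architectures; the C-terminal domain labels are placeholders.
translation_factors = [
    ("P-loop", "Translation", "C_term_A"),
    ("P-loop", "Translation", "C_term_B"),
]
same_context = [("A", "B", "C"), ("A", "B", "C", "D")]

print(is_supra_domain(translation_factors, ("P-loop", "Translation")))
print(is_supra_domain(same_context, ("A", "B")))
```

The second toy set shows why the flanking context matters: the pair occurs in two architectures, but always followed by the same domain, so it does not qualify under this reading.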
Evolution by domain reshuffling: Signaling and Multicellularity
One of the key challenges in becoming a multicellular organism is
cell signaling. Monosiga brevicollis is a
choanoflagellate; the choanoflagellates are the closest known relatives of metazoans. They
are free-living unicellular and colonial flagellate eukaryotes. The M.
brevicollis genome is approximately 42 Mb in size and
contains approximately 9,200 genes.
Metazoan sequence-specific transcription factors are absent from
the M. brevicollis gene catalogue, but the genome encodes cell adhesion
and signaling protein domains formerly thought to be restricted to metazoans. For
example, protein domains found in the metazoan Notch receptor (EGF, NL, and ANK
(ankyrin repeats)) are encoded by separate M. brevicollis genes in
arrangements that differ from metazoan Notch proteins, and definitive domains,
such as the NOD domain, are absent. Apparently, domain shuffling brought about
the evolution of Notch signaling.
Another example of domain reshuffling and functional
reorganization can be found in the large number of different phosphorylation-related
signaling enzymes.
Phosphorylation can reversibly alter the activity of an enzyme through the
combined action of a protein kinase and a protein phosphatase. Tyrosine
phosphorylation is a major mechanism of transmembrane signaling and regulates
protein–protein association. SH2 domains are modules of ~100 amino acids that
bind to specific phosphotyrosine (pY)-containing peptide motifs.
The SH2 domain is found embedded in a wide variety of metazoan
proteins that regulate functionally diverse processes, and it therefore serves as a
linker between proteins involved in signaling. Like SH2 domains, several other
modular domains have been identified that recognize specific sequences on their
target acceptor proteins and are used in different proteins with different
biological functions. These domains function as modular building blocks that
serve as interaction domains in signal transduction. M. brevicollis contains
genes involved in these signaling pathways; however, the physical linkages among
the protein domains differ significantly between M. brevicollis and metazoans. Thus, abundant
domain shuffling followed the separation of the choanoflagellate and metazoan
lineages.