Course notes for the December 13th 2009 lecture of the Genome Evolution Course

 

Itai Yanai

Department of Biology

Technion – Israel Institute of Technology

yanai@technion.ac.il

 

Notes drafted by Michal Levin

 

Evolution by Protein Domains

A protein domain is an evolutionary unit of sequence that can evolve, function, and fold independently of the rest of the protein that contains it. Each domain forms a compact three-dimensional structure and can often be independently stable and folded. Many proteins consist of several domains. Domains are the evolutionary units of sequence that make up gene coding regions. Novel genes can be created by recombining domains into new domain arrangements.

The structure of protein domains

Proteins fold into distinct secondary structures. The main secondary structures of hemoglobin, for example, are alpha-helices, while thrombin is built mainly of beta-sheets. Investigation of different proteins revealed spatially distinct structural units. These units were observed to recur in different structural contexts in different proteins. For instance, the “Rossmann fold” is found in two distinct proteins, lactate dehydrogenase and alcohol dehydrogenase. Domains can also recur in multiple copies in the same protein.

The structural definition of a domain

From a structural perspective, a protein domain may be defined as a distinct, compact, and stable protein structural unit that folds independently of other such units. Many domains of multidomain proteins may also be found as independently functioning proteins. For example, the diphtheria toxin enzyme shows an interesting domain architecture. It is made up of three domains, each of which is involved in a different stage of infection (receptor binding, membrane penetration, and catalysis of ADP-ribosylation of elongation factor), yet each shares similarity with distinct proteins.

A certain topology of secondary structures, i.e. a tertiary structure, is also called a domain fold type. In 1996 Holm and Sander showed that forty percent of the folds are covered by 16 fold classes. Although each class has individual features, most fold classes map to five attractor regions that capture structural similarities among protein structures.

The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their amino acid sequences and three-dimensional structures. Originally published in 1995, SCOP relies heavily on human expertise, which is often required to decide whether certain proteins are evolutionarily related and should therefore be assigned to the same superfamily, or whether their similarity is a result of structural constraints and they therefore merely belong to the same fold.

SCOP thus distinguishes between these three levels of hierarchic structural classification:

  1. Family - Sequence similarity can be detected among proteins of the same family. They have the same fold due to this high sequence similarity.
  2. Superfamily - Sufficient structural and functional similarity between proteins of different families may be used to infer a divergent evolutionary relationship, even without detectable sequence homology. Members of a superfamily show lower similarity than members of a family, but they are involved in similar biochemical functions.
  3. Fold - A fold is a topology of the folded protein backbone. A similar arrangement of regular secondary structures may place proteins belonging to different superfamilies in the same fold, but without evidence of evolutionary relatedness. The same shape can be used for different functions.

Domains are clustered into families in which significant sequence similarity is detected as well as conservation of biochemical activity. Families are in turn grouped into superfamilies, where sequence similarity may still be recognizable and basic biochemical properties are conserved. Families and superfamilies are monophyletic (derived from a common ancestor). A fold, the topology of the folded protein backbone, may contain several superfamilies, and it is unknown whether superfamilies of the same fold are monophyletic. Homology is inferred if 35-100% sequence similarity can be detected.

In general, proteins with similar structures are not necessarily homologous, i.e. they do not necessarily share common ancestry. Two proteins can arrive at the same structure either by common ancestry or by convergence, i.e. by acquiring the same sequence or structure independently. Protein structure is extremely robust to changes in sequence, which means that most changes are permissible. It has been estimated that only about one in ten amino acids in a sequence is important for the structure of the protein.

How many families can be supported by a single fold? In other words, how many families of proteins share the same fold? It appears that most folds support only one family, while a few folds support many families.

The number of known protein folds is not static; it must be updated constantly as more protein structures are determined. The total number of folds in globular, water-soluble proteins is estimated at about 1000. Of this total, the sequenced genomes of unicellular organisms encode from approximately 25%, for the minimal genomes of the Mycoplasmas, to 70-80% for larger genomes such as those of Escherichia coli and yeast. The number of protein families with significant sequence conservation was estimated to be between 4000 and 7000, with structures available for about 20% of these.

Domains from a sequence aspect

A domain at the sequence level is defined in terms of its recurrence across different sequences. Several databases, such as SMART, Pfam, and ProSite, construct sequence families at the domain level. Thus, sequence analysis proceeds predominantly by decomposing proteins into their domains. On average, a domain is about 100 amino acids long.

Proteins can contain several different or similar domains. When mapping domains to sequence, we might expect a collinear configuration in which one domain is encoded completely before the start of the next one. Alternatively, the domains might be non-collinear, with a domain folded from distant regions of the sequence. Most domains are collinear within multidomain proteins.

Traditionally, multidomain proteins have made the classification of proteins difficult. Misannotations of multidomain proteins can arise from a faulty syllogism (a logical argument of the form: if A=B and B=C, then A must equal C). Consider a multidomain protein C that consists of two domains. The first is similar to a domain known to function as a kinase. The second is similar to another domain of unknown function. It is clearly incorrect to deduce that the second domain must be a kinase-like domain because it is connected to the kinase domain in the multidomain protein, yet this is precisely the error that many annotation programs propagated in the early days of functional annotation.
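The annotation error described above can be illustrated with a minimal Python sketch. All names here (the domain labels, the lookup table, and both functions) are hypothetical and exist only for illustration; they do not correspond to any real annotation tool. The sketch contrasts transferring a function to each domain separately with transferring one hit to the whole protein.

```python
# Hypothetical lookup table: domain family -> known function (toy data).
KNOWN_DOMAIN_FUNCTIONS = {"kinase_like": "protein kinase"}


def annotate_per_domain(architecture, known=KNOWN_DOMAIN_FUNCTIONS):
    """Annotate each domain on its own; domains without a hit stay 'unknown'."""
    return [(domain, known.get(domain, "unknown")) for domain in architecture]


def annotate_whole_protein(architecture, known=KNOWN_DOMAIN_FUNCTIONS):
    """The flawed approach: a single hit is copied onto every domain."""
    hits = [known[d] for d in architecture if d in known]
    label = hits[0] if hits else "unknown"
    return [(domain, label) for domain in architecture]


# Protein C from the text: a kinase-like domain plus a domain of unknown function.
protein_c = ["kinase_like", "domain_x"]
print(annotate_per_domain(protein_c))
print(annotate_whole_protein(protein_c))
```

The per-domain version leaves the second domain unannotated, whereas the whole-protein version wrongly labels it as a kinase.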

Domains often fuse to compose a new multidomain protein. One example is glycyl-tRNA synthetase, where two separate E. coli proteins, glyQ and glyS, are fused into a single multidomain protein in C. trachomatis. In the case of domain fusion there could well be a functional relationship between the two fused proteins, such as subsequent steps along a biochemical cascade. Fusion links between different domains can thus be used to predict protein-protein interactions. Other examples are genes involved in glycolysis in different microorganisms and thymidylate synthesis genes in different yeast genomes.
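The idea of using fusion links as evidence for a functional relationship can be sketched as follows. This is a simplified illustration, assuming each genome is given as a mapping from protein name to the set of domain families it contains; the function name `fusion_links` and the family labels `GlyRS_alpha`, `GlyRS_beta`, and `unrelated` are hypothetical placeholders, not real database identifiers.

```python
from itertools import combinations


def fusion_links(genome_a, genome_b):
    """Find pairs of genome_a proteins whose domains are fused in genome_b.

    Each genome maps a protein name to the set of domain families it
    contains. Returns a set of (protein_1, protein_2, fused_protein) tuples.
    """
    links = set()
    for fused_name, fused_domains in genome_b.items():
        if len(fused_domains) < 2:
            continue  # a single-domain protein cannot be a fusion
        # Proteins of genome_a that match a component of the fused protein.
        matches = [name for name, doms in genome_a.items()
                   if doms and doms <= fused_domains]
        for p, q in combinations(sorted(matches), 2):
            # The two proteins must cover different parts of the fusion.
            if genome_a[p].isdisjoint(genome_a[q]):
                links.add((p, q, fused_name))
    return links


# Toy data modeled on the glyQ/glyS example; family labels are made up.
genome_a = {
    "glyQ": {"GlyRS_alpha"},
    "glyS": {"GlyRS_beta"},
    "other": {"unrelated"},
}
genome_b = {"fused_glyQS": {"GlyRS_alpha", "GlyRS_beta"}}
print(fusion_links(genome_a, genome_b))
```

A reported link is only a hypothesis that the two separate proteins interact or act in the same pathway; it would still need experimental support.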

Many of the extracellular matrix proteins are multidomain proteins. Multiple domains increase the interaction surface and allow for a larger protein with more complex binding capabilities.

The domain architecture of a protein contains not only information about the domain composition of the protein but also information about domain arrangements and domain duplications. The accumulation of additional domains in a protein is called domain accretion. Domain accretion may have contributed to the evolution of complexity in the pathways in which it occurs.

Most genes have more than one domain. The number of domain types per gene is exponentially distributed, whereas the total number of domains per gene follows a ‘power-law’ distribution. Power laws describe the distributions of a number of quantities in biological and other contexts, e.g., the node degrees (number of connections) in metabolic and protein interaction networks, the Internet and social networks, citations of scientific papers, populations of cities, and personal wealth. Networks described by power laws are known as scale-free: they look the same at different scales.
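The difference between the two quantities compared above, the number of domain types per gene and the total number of domains per gene, can be made concrete with a small Python sketch. The gene names and architectures are invented toy data, and `domain_counts` is a hypothetical helper; the sketch only tabulates the two histograms and does not attempt to fit either distribution.

```python
from collections import Counter


def domain_counts(architecture):
    """Return (number of distinct domain types, total number of domains)."""
    return len(set(architecture)), len(architecture)


# Hypothetical gene architectures, listed from N- to C-terminus.
genes = {
    "gene1": ["EGF", "EGF", "EGF", "ANK"],  # 2 types, 4 domains
    "gene2": ["SH3", "SH2", "kinase"],      # 3 types, 3 domains
    "gene3": ["EGF"],                       # 1 type, 1 domain
}

types_hist = Counter()  # number of domain types -> number of genes
total_hist = Counter()  # total number of domains -> number of genes
for arch in genes.values():
    n_types, n_total = domain_counts(arch)
    types_hist[n_types] += 1
    total_hist[n_total] += 1

print(sorted(types_hist.items()))
print(sorted(total_hist.items()))
```

Repeated domains, as in gene1, are what separate the two counts: they raise the total without raising the number of types.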

Domain Architecture Networks

We may abstract the domains composing a set of genes as forming a network. A pair of domains is linked if there is at least one protein containing both domains. The degree distribution of such a domain network within a genome follows a power law, i.e. most domains are linked to only one other domain, while very few domains are connected to many other domains.
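A minimal Python sketch of this network construction is given below, assuming proteins are supplied as lists of domain names. The protein names and architectures are toy data, and the function names `domain_network` and `degree_distribution` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def domain_network(proteins):
    """Link two domains if at least one protein contains both of them."""
    neighbors = defaultdict(set)
    for domains in proteins.values():
        for a, b in combinations(sorted(set(domains)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def degree_distribution(neighbors):
    """Map each degree to the number of domains with that many partners."""
    return Counter(len(partners) for partners in neighbors.values())


# Toy proteins; a domain that never co-occurs (EGF here) gets no node.
proteins = {
    "p1": ["SH3", "SH2", "kinase"],
    "p2": ["SH2", "PTB"],
    "p3": ["SH2", "PH"],
    "p4": ["EGF"],
}
net = domain_network(proteins)
print(sorted(net["SH2"]))
print(sorted(degree_distribution(net).items()))
```

Even in this tiny example one domain (SH2) acts as a hub while the others have only one or two partners, which is the qualitative pattern the text describes for real genomes.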

Supra-domains

A supra-domain is defined as a domain combination in a particular N-to-C-terminal orientation that occurs in at least two different domain architectures in different proteins with: (i) different types of domains at the N- and C-terminal ends of the combination; or (ii) different types of domains at one end and no domain at the other. Supra-domains are thus domain combinations that almost always occur together. For example, the P-loop containing nucleoside triphosphate (NTP) hydrolase domain and the translation protein domain occur as one combination in several different translation factors. This supra-domain occurs in 35 different domain architectures. Another example is the SH3-SH2 clamp.
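The definition above can be approximated in a few lines of Python. This is a simplified sketch, not the published procedure: it considers only two-domain combinations, and it accepts a pair when it appears in at least two different architectures with at least two different flanking contexts, which is looser than conditions (i) and (ii). The function name and the flanking domain labels `EF_C` and `IF2_N` are hypothetical.

```python
from collections import defaultdict


def candidate_supra_domains(architectures):
    """Return ordered domain pairs seen in >= 2 architectures and >= 2 contexts.

    Each architecture is a sequence of domain names from N- to C-terminus.
    A context is the (left neighbor, right neighbor) of the pair, with None
    marking the absence of a flanking domain.
    """
    contexts = defaultdict(set)
    seen_in = defaultdict(set)
    for arch in {tuple(a) for a in architectures}:
        for i in range(len(arch) - 1):
            pair = (arch[i], arch[i + 1])
            left = arch[i - 1] if i > 0 else None
            right = arch[i + 2] if i + 2 < len(arch) else None
            contexts[pair].add((left, right))
            seen_in[pair].add(arch)
    return {pair for pair in contexts
            if len(seen_in[pair]) >= 2 and len(contexts[pair]) >= 2}


# Toy architectures; the flanking domains are illustrative placeholders.
architectures = [
    ["P-loop_NTPase", "Translation_protein", "EF_C"],
    ["IF2_N", "P-loop_NTPase", "Translation_protein"],
    ["SH3", "SH2", "kinase"],
]
print(candidate_supra_domains(architectures))
```

In the toy data only the P-loop NTPase plus translation protein pair qualifies, because it is the only ordered pair that recurs with different neighbors.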

Evolution by domain reshuffling: Signaling and Multicellularity

One of the key problems in becoming a multicellular organism is cell signaling. M. brevicollis is a choanoflagellate; the choanoflagellates are the closest known relatives of metazoans. They are free-living unicellular and colonial flagellate eukaryotes. The M. brevicollis genome is 42 Mb in size and contains approximately 9,200 genes.

Metazoan sequence-specific transcription factors are absent from the M. brevicollis gene catalogue, but the genome encodes cell adhesion and signaling protein domains formerly thought to be restricted to metazoans. For example, protein domains found in the metazoan Notch receptor (EGF, NL, and ANK (ankyrin repeats)) are encoded on separate M. brevicollis genes in arrangements that differ from metazoan Notch proteins, and definitive domains, such as the NOD domain, are absent. Apparently, domain shuffling brought about the evolution of Notch signaling.

Another example of domain reshuffling and functional reorganization can be found in the large number of different phosphorylation-related signaling enzymes. Phosphorylation can reversibly alter the activity of an enzyme through the combined action of a protein kinase and a protein phosphatase. Tyrosine phosphorylation is a major mechanism of transmembrane signaling and regulates protein–protein association. SH2 domains are modules of ~100 amino acids that bind to specific phosphotyrosine (pY)-containing peptide motifs. The SH2 domain is found embedded in a wide variety of metazoan proteins that regulate functionally diverse processes, and it therefore serves as a linker between proteins involved in signaling. Like SH2 domains, several other modular domains have been identified that recognize specific sequences on their target proteins and are used in different proteins with different biological functions. These domains function as modular building blocks that serve as interaction domains in signal transduction. M. brevicollis contains genes involved in these signaling pathways; however, the physical linkages among the protein domains differ significantly between M. brevicollis and metazoans. Thus, abundant domain shuffling followed the separation of the choanoflagellate and metazoan lineages.