Course notes for the January 10th 2009 lecture of the Genome Evolution Course

 

Itai Yanai

Department of Biology

Technion – Israel Institute of Technology

yanai@technion.ac.il

 

Evolution of Genome Regulation

 

We learned that only about 3% of the genome consists of exons. What does the rest do? This talk is about the regulatory aspect of the genome. The “Matrix” image may be appropriate to capture the problem of identifying the regulatory signals hidden in the genome, which requires filtering out most, but not all, of the sequence.

 

In 1960, François Jacob and Jacques Monod proposed a model for the E. coli lac operon. The main conceptual breakthrough was a distinction between the DNA used for encoding protein sequences and that used to encode regulatory signals. In the lac operon this occurs, for example, in the form of an operator sequence that is bound by the lacI repressor whenever lactose is unavailable; when lactose is present, the repressor releases the operator and the operon is transcribed.

 

Sydney Brenner began in 1966 to work with the nematode C. elegans to study development and the nervous system. One question was whether regulation in more complex organisms differs from that of the lac operon of E. coli. We consider three examples of promoter sequences from this organism to observe how they are used as regulatory signals. The following is adapted from Okkema and Krause, “Transcriptional regulation” chapter in WormBook.

 

Our first example of genome regulation in an animal is the C. elegans myo-2 gene, which encodes a myosin heavy chain expressed exclusively in the pharyngeal muscles as these cells undergo terminal differentiation. The activity of this gene depends upon distinct cell-type-specific and organ-specific subelements, termed B and C, that can separately activate gene expression either specifically in the pharyngeal muscles (B) or more broadly throughout the pharynx (C). In their endogenous context within the myo-2 gene, these subelements synergistically activate pharyngeal muscle gene expression. Consistent with their distinct activities, the B and C subelements are targeted by transcription factors expressed in different spatial patterns in the pharynx. The cell-type-specific B subelement binds and is activated by the pharyngeal-muscle-specific homeodomain factor CEH-22. The organ-specific C subelement binds and is activated by the pan-pharyngeal FoxA family transcription factor PHA-4, which is required for formation of pharyngeal muscle and all other pharyngeal cell types during embryonic development. Thus in this case we find synergistic induction of a gene by two factors.

 

In the second example, hlh-1 (the C. elegans ortholog of MyoD) encodes a basic helix-loop-helix transcription factor expressed in all body wall muscle cells and their precursors. The body wall muscle cells are derived from multiple cell lineages. Dissection of the hlh-1 promoter shows that gene expression can be properly regulated by multiple elements spanning ~3 kb upstream of the ATG. A core element required for all expression resides just upstream of the ATG (the star in the figure). In addition, there are several individual elements that drive expression preferentially in one or more lineages. However, no single element is specific for expression in just one lineage. Furthermore, expression during embryogenesis is controlled by a different region than the one controlling postembryonic expression. The overall pattern of hlh-1 expression is thus a composite of the action of several lineage-preference elements with overlapping domains of action, working in concert with an essential core element.

 

Our third example, lin-26, is expressed in epithelial tissues. Analysis of its promoter revealed regulation by a core element required for all expression, working in concert with tissue-specific elements rather than lineage-preference elements (as in hlh-1). lin-26 is the downstream gene in an alternatively spliced operon that includes lir-1, and proper expression of lin-26 requires an 11 kb upstream region including most of the lir-1 gene itself. Within this region are tissue-specific regulatory modules that activate gene expression in subsets of lin-26-expressing tissues. For example, separable modules control expression in the major hypodermal cells, in the minor hypodermal cells and sheath and socket support cells, in rectal cells, or in the somatic gonad. In some cases, redundant elements contribute to expression in particular tissues (e.g., major hypodermal cells), and, in the case of the minor hypodermis and support cells located at the worm's anterior and posterior ends, separable elements active in either the anterior or the posterior end were identified. Thus, the lin-26 promoter region contains cis-regulatory elements active in cells that belong to the same organ, are functionally related, or have similar positions along the body, and these elements together produce the full lin-26 expression pattern in a piecemeal fashion.

 

What we may conclude from these examples is that, in its basic principles, the lac operon model holds for higher organisms. This is because in all cases there exist sequences housing regulatory functions that are separate from the structural properties of the gene. What does appear different, however, is how finely the regulatory signals used in an animal are tuned across space and time.

 

How many transcription factors (TFs) does an organism have altogether? In C. elegans, ~1000 have been detected. There are several supergroups, such as zinc finger, homeodomain, and helix-loop-helix TFs. Each comes in sub-groups, such as GATA zinc finger TFs or C2H2 zinc finger TFs. Further, each of these may have multiple copies due to recent gene duplications. Thus, the repertoire of an organism’s TFs is not a set of independent genes but rather a large inter-related family. We saw this previously in the genome duplication lecture when discussing HOX genes.

 

How does the number of TFs scale with the total number of genes in the organism? Larger genomes have more regulatory genes (TFs) than smaller genomes, but is the fraction of TFs the same in each? It is not. Consider a power law of the form y = (constant)*x^a, where x is the number of genes in the genome and y is the number of genes in a given functional category. When a=1, the category makes up a constant fraction of the genome regardless of genome size. Metabolism genes behave this way: different organisms appear to have about the same fraction of metabolism genes, which corresponds to an exponent of around a=1. The exponent for TFs, however, is almost 2, implying that as the number of genes in the genome doubles, the number of TFs roughly quadruples.
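The scaling argument can be checked with a few lines of arithmetic. The following is a minimal sketch: the genome sizes and constants are made-up illustrative values (not measurements), and the function names are hypothetical. It generates category counts from y = (constant)*x^a and then recovers the exponent a as the slope of log(y) against log(x).

```python
import math

def scaled_count(n_genes, exponent, constant=1.0):
    """Number of genes in a category under y = constant * x**exponent."""
    return constant * n_genes ** exponent

def fit_exponent(gene_counts, category_counts):
    """Least-squares slope of log(y) against log(x), i.e. the exponent a."""
    xs = [math.log(x) for x in gene_counts]
    ys = [math.log(y) for y in category_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical genome sizes and constants, chosen only for illustration.
genome_sizes = [1000, 2000, 4000, 8000]
tf_counts = [scaled_count(x, 2, constant=1e-5) for x in genome_sizes]
metabolism_counts = [scaled_count(x, 1, constant=0.1) for x in genome_sizes]

# Effect of doubling the genome size on each category.
tf_ratio = scaled_count(2000, 2) / scaled_count(1000, 2)          # a = 2
metabolism_ratio = scaled_count(2000, 1) / scaled_count(1000, 1)  # a = 1

print(fit_exponent(genome_sizes, tf_counts))
print(fit_exponent(genome_sizes, metabolism_counts))
print(tf_ratio, metabolism_ratio)
```

With an exponent of 2 the doubling ratio is 4, and with an exponent of 1 it is 2, so the fraction of TFs grows with genome size while the fraction of metabolism genes stays fixed.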

 

What does each TF bind? Tiling arrays can be used to answer this question. Tiling arrays are DNA microarrays composed of thousands of probes of approximately 60 base-pairs, each of which corresponds to a sequence in the genome. By performing a chromatin immunoprecipitation (ChIP) of a TF bound to the genome and then hybridizing the microarray only with DNA enriched for association with the TF, the binding sites of the TF can be determined. This was carried out for essentially all of the TFs of the yeast S. cerevisiae in 2004 by Harbison et al. They found that many promoters have binding sites for multiple TFs. In another pattern, a gene’s promoter has only a single regulator, with either a single binding site or repeated binding sites. Furthermore, some binding was environment-dependent, while other binding was unchanged across the environments tested.

 

Overall, over half of the yeast genes were not detected as having any regulatory sites; i.e., 55% of the genes had 0 binding sites. For a particular regulator, its binding sites throughout the genome were found to vary in how well they fit the motif, where the fit measures how well, on average, a site's sequence matches the motif. Interestingly, binding sites that appear in combination with other sites tend to be less specific (lower fit). For the TF Reb1, for example, there is a clear correlation: for genes with only one Reb1-binding site, the specificity (or fit) of the motif is high, whereas for genes with a large number of sites, the Reb1 motifs are much less specific. The researchers describing this phenomenon, Bilu and Barkai, suggest as one possible explanation that, for genes with multiple regulators, each site should be somewhat “fuzzy” to ensure that transcription proceeds only when the regulators act in combination.
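To make the notion of “fit” concrete, here is a minimal sketch of one simple way to score a site against a motif. It is an assumption for illustration, not necessarily the measure used by Bilu and Barkai: the motif is a hypothetical position frequency matrix with made-up values (it is not the real Reb1 motif), and the fit of a site is taken to be the average, over positions, of the frequency the motif assigns to the observed base.

```python
# Hypothetical 4-bp position frequency matrix (illustrative values only).
# Each column gives the frequency of each base at that motif position.
MOTIF = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

def site_fit(site, motif=MOTIF):
    """Average per-position frequency of the observed bases (higher = better fit)."""
    if len(site) != len(motif):
        raise ValueError("site and motif must have the same length")
    return sum(column[base] for base, column in zip(site, motif)) / len(motif)

def best_site(sequence, motif=MOTIF):
    """Return (fit, start) of the best-fitting window in a longer sequence."""
    width = len(motif)
    windows = [
        (site_fit(sequence[i:i + width], motif), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)

print(site_fit("TACG"))       # matches the most frequent base at every position
print(site_fit("TACC"))       # one mismatch, so a lower ("fuzzier") fit
print(best_site("GGTACGGG"))  # best window and its start position
```

Under this scoring, a “fuzzy” site is simply one whose fit is well below that of the consensus sequence but still above the background.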

 

Another way to identify motifs is by using comparative genomics. Since the genome sequence evolves quickly except at functional elements, one can identify motifs as phylogenetically conserved sub-sequences. The first large-scale application of this principle occurred in 2003 with the analysis of four fungal genomes. Examining the intergenic regions indeed revealed conserved elements that corresponded to known functional sites, such as the TATA-box and the binding sites for Gal4 and Mig1.

 

Applying the same principle to the human, mouse and rat genomes led to a striking discovery. “There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes” (Bejerano et al. Science 2004). This can be shown to be highly statistically significant given the overall level of divergence among these genomes. Furthermore, the ultra-conserved sequences that do not correspond to exons but lie in the vicinity of genes tend to be near genes involved in transcriptional regulation. As may have been guessed, the ultra-conserved elements are also under strong selection (otherwise they would probably not be ultra-conserved), as evidenced by the lack of observed alleles with appreciable frequencies at these sites. Thus, if a mutation appears at an ultra-conserved locus, it tends not to rise in frequency, perhaps due to a strong effect on the phenotype.
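The definition of an ultra-conserved segment (100% identity with no insertions or deletions) translates directly into a scan over an alignment. The sketch below is a toy illustration, not the pipeline of Bejerano et al.: it assumes the orthologous sequences have already been aligned to equal length with "-" marking gaps, the function name is hypothetical, and the example sequences are invented.

```python
def ultraconserved_segments(aligned, min_length=200):
    """Return (start, end) column ranges, end exclusive, where every aligned
    sequence carries the same base and no sequence has a gap."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    segments = []
    run_start = None
    # Iterate one column past the end so a run reaching the end is closed.
    for i in range(length + 1):
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned)
        )
        if conserved:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_length:
                segments.append((run_start, i))
            run_start = None
    return segments

# Invented toy alignment; a tiny threshold is used so the example is readable.
human = "ACGTACGTAAGG"
mouse = "ACGTACGTCAGG"
rat   = "ACGTACGT-AGG"
print(ultraconserved_segments([human, mouse, rat], min_length=5))
```

In the toy alignment, the first eight columns are identical in all three sequences, column 8 contains both a substitution and a gap, and the final three identical columns are too short to pass a threshold of 5.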

 

Starting in 1997, evidence began to accumulate that genes are not the only sequences being transcribed in the genome. Many yeast transcripts did not correspond to any annotated gene. One-tenth of the intergenic sequences were estimated to exhibit some transcriptional activity. In 2002, the FANTOM project, which sought to characterize full-length transcripts in mouse, found that almost half of the poly(A)-tailed RNAs detected are non-protein-coding transcripts that do not match any annotated sequences. In the same year, tiling arrays of human chromosomes 21 and 22 were used to examine the genomic locations of transcripts. Shockingly, most hybridizing probes lay outside the positions of known exons. In fact, of the probes that were positive when hybridized to samples isolated from at least one of 11 cell lines, 94% lay outside known exon positions. Since then, dozens of other reports have confirmed the observation of widespread transcription of the genome. This expression may correspond to yet-to-be-identified protein-coding genes, but that seems unlikely. Another intriguing explanation is that the expression is a biological artifact: expressed but not needed, and invisible to natural selection.

 

How could this widespread expression come about? A recent study examining the binding sites of six important developmental regulators in Drosophila provides a clue. These regulators were found to bind thousands of regions across the genome. Are all of these regions functional targets of the regulators? Additional work will shed light on how the genome is regulated and, in particular, on how we might distinguish which portions of the regulation are under selection.