Course notes for the January 10th 2009 lecture of the
Genome Evolution Course
Itai Yanai
Department of Biology
Technion – Israel Institute of Technology
yanai@technion.ac.il
Evolution of Genome Regulation
We learned that only about 3% of the genome consists of exons. What does the rest do? This talk is about the
regulatory aspect of the genome. The “Matrix” image may be appropriate to
capture the problem of identifying the regulatory signals present in the genome:
they must be found by filtering out most, but not all, of the sequence.
In 1960, Francois Jacob and Jacques Monod proposed a model for the E.
coli lac operon. The main
conceptual breakthrough was a distinction between the DNA used for encoding
protein sequences and that used to encode regulatory signals. In the lac operon this takes the form,
for example, of an operator sequence that is bound by the lacI repressor whenever lactose is unavailable.
Sydney Brenner began in 1966 to work with the nematode C. elegans
to study development and the nervous system. One question was whether
regulation in such organisms is more complex than that of the lac operon
of E. coli. We consider three examples of promoter sequences of this organism
to observe how they are used as regulatory signals. The following is adapted
from Okkema and Krause, “Transcriptional
regulation” chapter in WormBook.
Our first example of regulation of
the genome in an animal is the C. elegans myo-2 gene, which encodes a myosin
heavy chain expressed exclusively in the pharyngeal muscles as these cells
undergo terminal differentiation. The activity of this gene depends upon distinct
cell-type-specific and organ-specific subelements,
termed B and C, that can separately activate gene expression
either specifically in the pharyngeal muscles or more broadly throughout the pharynx. In their endogenous context
within the myo-2 gene, these subelements
synergistically activate pharyngeal muscle gene expression. Consistent with
their distinct activities, the B and C subelements
are targeted by transcription factors expressed in different spatial patterns
in the pharynx. The cell-type-specific B subelement
binds and is activated by the pharyngeal muscle specific homeodomain
factor CEH-22. The organ-specific C subelement
binds and is activated by the pan-pharyngeal FoxA
family transcription factor PHA-4, which is required for formation of
pharyngeal muscle and all other pharyngeal cell types during embryonic
development. Thus in this case we find synergistic induction of a gene by two
factors.
In the second example, hlh-1 (the
C. elegans ortholog of MyoD) encodes a basic helix-loop-helix
transcription factor expressed in all body wall muscle cells and their
precursors. The body wall muscle cells are derived from multiple cell lineages.
Dissection of the hlh-1 promoter shows that gene expression can be
properly regulated by multiple elements spanning ~3 kb upstream of the ATG. A
core element required for all expression resides just upstream of the ATG (the
star in the figure). In addition, there are several individual elements that
drive expression preferentially in one or more lineages. However, no single element
is specific for expression in just one lineage. In addition, the expression
during embryogenesis is controlled by a different region than that controlling
postembryonic expression. The overall pattern of hlh-1 expression is
thus a composite of the action of several lineage-preference elements with
overlapping domains of action, working in concert with an essential core
element.
In the third example, lin-26 is expressed in
epithelial tissues. Analysis of its promoter revealed regulation by a core element required
for all expression working in concert with tissue-specific elements, rather
than lineage-preference elements (as in hlh-1). lin-26
is the downstream gene in an alternatively spliced operon
including lir-1, and proper expression of lin-26 requires an 11 kb
upstream region including most of the lir-1 gene itself. Within this
region are tissue-specific regulatory modules that activate gene expression in
subsets of lin-26-expressing tissues. For example, separable modules
control expression in the major hypodermal cells, in the minor hypodermal cells
and sheath and socket support cells, in rectal cells, or in the somatic gonad.
In some cases, redundant elements contribute to expression in particular
tissues (e.g., major hypodermal cells), and, in the case of the minor
hypodermis and support cells located at the worm's anterior and posterior ends,
separable elements active at either the anterior or the posterior end were identified.
Thus, the lin-26 promoter region contains cis-regulatory
elements active in cells that belong to the same organ, are functionally
related, or have similar positions along the body, and these elements together
produce the full lin-26 expression pattern in a piecemeal fashion.
What we may conclude from these examples is that, in its basic
principles, the lac operon
model holds for higher organisms. This is because in all cases there exist sequences housing regulatory functions separate from
the structural properties of the gene. What does appear different, however, is the
refinement across space and time of the regulatory signals used in an animal.
How many transcription factors (TFs) does an organism have
altogether? In C. elegans, ~1000 have been detected. There are several supergroups, such as zinc finger, homeodomain,
and helix-loop-helix. Each comes in sub-groups such as GATA zinc finger TFs or
C2H2 zinc finger TFs. Further, each of these may have multiple copies due to
recent gene duplications. Thus, the repertoire of an organism’s TFs is not a
set of independent genes but rather a large inter-related family. We
previously appreciated this in the genome duplication lecture when discussing HOX
genes.
How does the number of TFs scale with the total number of genes in
the organism? Larger genomes have more regulatory genes (TFs) than
smaller genomes, but is the fraction of genes that are TFs the same? It is not. Different
organisms do appear to devote the same fraction of their genes to metabolism. This corresponds to an
exponent of around a=1 in a power law of the form y = (constant)*x^a, where x is the number of genes in the genome and y is the
number of genes in the functional category (here, metabolism): when a=1, y is a constant fraction
of x in genomes of different sizes. The exponent for TFs, however, is almost 2,
implying that as the number of genes in the genome doubles, the number of TFs roughly quadruples.
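To make the scaling argument concrete, here is a minimal Python sketch. The function names, the constants, and the gene counts are illustrative assumptions rather than measured values, and the exponent of exactly 2 stands in for the "almost 2" quoted above.

```python
import math

def category_size(total_genes, constant, exponent):
    """Power law y = constant * x**exponent."""
    return constant * total_genes ** exponent

def fitted_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the exponent a."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    den = sum((a - mean_x) ** 2 for a in log_x)
    return num / den

# Hypothetical genome sizes (total gene counts), each double the previous one.
genome_sizes = [1000, 2000, 4000, 8000]

# Assumed constants, chosen only to give plausible-looking numbers.
metabolism = [category_size(x, 0.1, 1.0) for x in genome_sizes]  # a = 1
tfs = [category_size(x, 1e-5, 2.0) for x in genome_sizes]        # a = 2

# With a = 1 the fraction y/x is the same in every genome;
# with a = 2 the fraction itself grows with genome size.
metabolism_fraction = [y / x for x, y in zip(genome_sizes, metabolism)]
tf_fraction = [y / x for x, y in zip(genome_sizes, tfs)]

# Ratio of TF counts between a genome and one with half as many genes.
tf_doubling_ratio = tfs[1] / tfs[0]

print(metabolism_fraction)
print(tf_fraction)
print(tf_doubling_ratio, fitted_exponent(genome_sizes, tfs))
```

In practice the exponent would be estimated from real gene counts by the same log-log regression; the synthetic data here only shows what the two exponents imply.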
What does each TF bind? Tiling arrays can be used to answer this
question. Tiling arrays are DNA microarrays composed of thousands of probes of approximately
60 base pairs, each of which corresponds to a sequence in the genome. By
performing a chromatin immunoprecipitation (ChIP) for a TF and then hybridizing the
microarray only with the DNA enriched for association with that TF, the binding
sites of the TF can be determined. This was carried out for essentially all of
the TFs of the yeast S. cerevisiae in 2004 by Harbison
et al. They found that many promoters have binding sites for multiple TFs.
Another pattern is a promoter with only a single regulator,
in the form of either a single binding site or repeated binding sites.
Furthermore, some regulation was environment-dependent, while other regulation was
the same across the environments tested.
Overall, over half of the yeast genes were not detected as having
any regulatory sites; i.e., 55% of the genes had 0 binding sites. For a
particular regulator, the set of its binding sites throughout the genome was
found to vary in how well each site fits the regulator's motif (the fit being calculated
as the average match of the site's sequence to the motif). Interestingly, binding sites
that appear in combination with other sites tend to be less specific (lower
fit). For the TF Reb1, for example, there is a clear correlation:
for genes with only one Reb1-binding site the specificity (or fit) of
the motif is high, whereas for genes with a large number of sites the
Reb1 motifs are much less specific. The researchers describing this phenomenon,
Bilu and Barkai, suggest that one possible
explanation is that, for genes with multiple regulators, each site should be somewhat “fuzzy”
to ensure that transcription proceeds only when the regulators act in combination.
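One common way to quantify the fit of a sequence to a motif is a log-odds score against a position weight matrix (PWM). The sketch below illustrates that general technique; it is not necessarily the measure used by Bilu and Barkai. The example sites are invented for illustration and are not the real Reb1 motif, and the uniform background and pseudocount are assumptions.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from aligned, equal-length sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2(counts[base] / total / background)
                    for base in BASES})
    return pwm

def score(pwm, seq):
    """Sum of per-position log-odds; higher means a better fit to the motif."""
    return sum(column[base] for column, base in zip(pwm, seq))

def best_site(pwm, promoter):
    """Highest-scoring window of motif width within a promoter sequence."""
    width = len(pwm)
    windows = [promoter[i:i + width] for i in range(len(promoter) - width + 1)]
    return max(windows, key=lambda window: score(pwm, window))

# Hypothetical aligned binding sites for a made-up regulator.
toy_sites = ["TGACTA", "TGACTC", "TGACTA", "TGAGTA", "TGACTA"]
pwm = build_pwm(toy_sites)

strong = score(pwm, "TGACTA")      # matches the consensus at every position
fuzzy = score(pwm, "TGAGTC")       # two minority bases: a "fuzzier" site
unrelated = score(pwm, "CCCCCC")   # poor match

print(strong, fuzzy, unrelated)
print(best_site(pwm, "AAATGACTAGG"))
```

With scores like these, one could compare the fit of sites in promoters holding a single site against the fit of sites in promoters holding many.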
Another way to identify motifs is by using comparative genomics.
Since non-functional sequence diverges relatively quickly while functional
elements are conserved, one can identify motifs as phylogenetically
conserved sub-sequences. The first large-scale application of this principle occurred
in 2003 with the analysis of four fungal genomes. Examining the intergenic regions indeed revealed conserved elements that
corresponded to known functional sites such as the TATA-box and the binding sites
for Gal4 and Mig1.
Applying the same principle to the human, mouse and rat genomes led
to a striking discovery. “There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no
insertions or deletions) between orthologous regions
of the human, rat, and mouse genomes” (Bejerano et
al. Science 2004). This can be shown to be highly statistically significant
given the overall level of divergence among these genomes. Furthermore,
ultra-conserved sequences that do not correspond to exons
tend to lie in the vicinity
of genes involved in transcriptional regulation. As might be expected,
the ultra-conserved elements are also under strong selection (otherwise they
would probably not be ultra-conserved), as evidenced by the lack of
alleles observed at appreciable frequencies at these sites. Thus, if a mutation
appears at an ultra-conserved locus it tends not to
rise in frequency, perhaps due to a strong effect on the phenotype.
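The definition quoted above (100% identity with no insertions or deletions over more than 200 bp) can be sketched as a scan for long runs of identical, gap-free columns in a multiple alignment. This is a minimal illustration, not the pipeline used by Bejerano et al.; the function name, the toy sequences, and the short threshold used in the example are assumptions.

```python
def ultraconserved_segments(aligned, min_length=200):
    """Find runs of absolutely conserved alignment columns.

    `aligned` is a list of equal-length aligned sequences with '-' for gaps.
    Returns half-open (start, end) column ranges in which every sequence has
    the same base and no gap, for runs of at least `min_length` columns.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    segments = []
    run_start = None
    # Iterate one column past the end so a run reaching the last column is closed.
    for col in range(length + 1):
        conserved = (
            col < length
            and aligned[0][col] != "-"
            and all(seq[col] == aligned[0][col] for seq in aligned)
        )
        if conserved:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start >= min_length:
                segments.append((run_start, col))
            run_start = None
    return segments

# Toy three-species alignment (invented sequences, far shorter than real data).
toy_alignment = [
    "ACGTACGTTAGC-ATTGCA",  # "human"
    "ACGTACGATAGCCATTGCA",  # "mouse"
    "ACGTACGTTAGC-ATTGCA",  # "rat"
]

# A threshold of 5 columns is used only so the toy example yields results.
segments = ultraconserved_segments(toy_alignment, min_length=5)
print(segments)
```

On genome-scale alignments the same scan would be run with the 200-column threshold. Both the mismatch at column 7 and the gap at column 12 break a run, which is what "no insertions or deletions" requires.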
Starting in 1997, evidence began to accumulate that genes are not the only sequences
being transcribed in the genome. Many yeast transcripts did
not correspond to any annotated gene. One-tenth of the intergenic
sequences were estimated to exhibit some transcriptional activity. In 2002, the
FANTOM project, seeking to characterize full-length transcripts
in mouse, found that almost half of the poly(A)-tailed
RNAs detected are non-protein-coding transcripts that do not match any
annotated sequences. In the same year, tiling arrays of human chromosomes 21 and 22 were
used to examine the genomic locations of transcripts. Shockingly, most hybridizing
probes lay outside the positions of known exons. In
fact, of the probes that were positive when hybridized to samples isolated from at
least one of 11 cell lines, 94% lay outside known exon
positions. Since then, dozens of other reports have confirmed the observation of
widespread transcription of the genome. This expression may correspond to yet-to-be-identified
protein-coding genes; however, that seems unlikely. Another intriguing
explanation is that the expression is a biological artifact: expressed but not
needed, and invisible to natural selection.
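The "fraction of positive probes outside known exons" statistic is an interval-overlap calculation. The sketch below shows the idea on invented coordinates; the function names, the half-open coordinate convention, and all numbers are assumptions and are unrelated to the published data.

```python
def overlaps_any_exon(probe, exons):
    """True if the half-open interval `probe` overlaps any exon interval."""
    start, end = probe
    return any(start < exon_end and exon_start < end
               for exon_start, exon_end in exons)

def fraction_outside_exons(positive_probes, exons):
    """Fraction of positive probes that overlap no annotated exon."""
    if not positive_probes:
        raise ValueError("no positive probes given")
    outside = sum(1 for probe in positive_probes
                  if not overlaps_any_exon(probe, exons))
    return outside / len(positive_probes)

# Hypothetical annotation and probe positions on one chromosome,
# given as (start, end) half-open coordinates.
exons = [(100, 200), (500, 650)]
positive_probes = [(90, 115), (300, 325), (640, 665), (900, 925), (1200, 1225)]

fraction = fraction_outside_exons(positive_probes, exons)
print(fraction)
```

For a real annotation with many exons, sorting the intervals and using a binary search or an interval tree would avoid comparing every probe against every exon.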
How could this widespread expression come about? A recent study examining
the binding sites of six important developmental regulators in Drosophila
provides a clue. It was found that these regulators bind to thousands of regions across
the genome. Are all of these functional targets of the regulators? Additional
work will shed light on how the genome is regulated and in particular how we
might distinguish which portions of the regulation are under selection.