Genome Evolution Course 2009-2010
www.yanaiweb.com/genome
Itai Yanai, Technion – Israel Institute of Technology
Tutorial
Presentation as PDF or PP.
Problem Set #12 assigned January 11th, 2010
To be submitted as a hard copy, in English or Hebrew, on January 17th, 2010
(at the beginning of class, 9:30am).
Problem 1: The probability of finding an ultraconserved sequence.
You are given two genome sequences of the same length, 3 Gb, in which each
pair of aligned positions has a 66% chance of carrying the identical base pair.
This is roughly the situation with the human and mouse genomes. How long do you
expect the longest stretch of identical sequence to be?
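One way to sanity-check an analytic estimate is to simulate the setup at a much smaller scale. The sketch below is only an illustration: it assumes that positions match independently of one another (the problem statement does not say so explicitly), and the names `longest_run` and `simulate_longest_run` are made up for this example. It does not replace the derivation asked for above.

```python
import random


def longest_run(matches):
    """Return the length of the longest stretch of consecutive True values."""
    longest = 0
    current = 0
    for is_match in matches:
        if is_match:
            current += 1
            if current > longest:
                longest = current
        else:
            current = 0
    return longest


def simulate_longest_run(length, p_identical, seed=0):
    """Simulate `length` aligned positions and return the longest identical run.

    Assumption: each position is identical with probability `p_identical`,
    independently of every other position.
    """
    rng = random.Random(seed)
    matches = (rng.random() < p_identical for _ in range(length))
    return longest_run(matches)


if __name__ == "__main__":
    # Toy lengths only; a pure-Python loop over 3 Gb would be very slow.
    for n in (10_000, 100_000, 1_000_000):
        print(n, simulate_longest_run(n, 0.66, seed=1))
```

Running it for a few increasing lengths shows how the longest run grows with sequence length, which can then be compared with your formula.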
Problem 2: Pick a human gene (any human gene) from a deck of GeneCards.
Use the GeneCards database to find a gene for study.
1. Go to: http://www.genecards.org/
2. At the bottom of the page, click on one of the 26 letters.
3. You will arrive at a page listing all of the genes whose names start with
   that letter. Click on one to see its ‘gene card’. Take a look at all of the
   useful information. In particular, take note of the Ensembl identifier
   (listed under ‘external IDs’ in the first white block of the page). For
   example, for the BRCA1 gene the Ensembl ID is ENSG00000012048.
4. What is your gene’s Ensembl identifier?
Problem 3. Find the mouse, rat, horse, and rabbit orthologs of your chosen gene.
The Ensembl database makes available the genomes and annotations of many
eukaryotes. You can reach the Ensembl page for your gene by clicking on the
GeneCards link or by searching for the gene on the Ensembl website
(http://www.ensembl.org/index.html). To find the orthologs, click on the
‘Orthologues’ link on the left. Then make note of the Ensembl identifiers of the
mouse, rat, horse, and rabbit orthologs. What are the Ensembl identifiers of
your gene’s orthologs (they each begin with “ENS”)?
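Before pasting identifiers into another tool, it can help to check that each one at least looks like a gene identifier rather than a transcript ID or a gene symbol. The sketch below assumes the common form of Ensembl gene IDs: "ENS", an optional species code, the letter "G", then 11 digits. That pattern and the helper name `looks_like_ensembl_gene_id` are assumptions made for illustration, not an official validator, and passing the check does not mean the gene exists.

```python
import re

# Assumed shape of an Ensembl gene ID: "ENS" + optional species code
# (absent for human) + "G" for gene + 11 digits.
GENE_ID_PATTERN = re.compile(r"ENS[A-Z]*G\d{11}")


def looks_like_ensembl_gene_id(identifier):
    """Return True if the text has the assumed Ensembl gene ID shape."""
    return GENE_ID_PATTERN.fullmatch(identifier.strip()) is not None


if __name__ == "__main__":
    # ENSG00000012048 is the BRCA1 example from the text; the second entry
    # only illustrates the format of a non-human identifier.
    for candidate in ("ENSG00000012048", "ENSMUSG00000000001", "BRCA1"):
        print(candidate, looks_like_ensembl_gene_id(candidate))
```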
Problem 4. Retrieving the upstream (promoter) sequences of your chosen gene
and its orthologs. To find the sequences, use the BioMart server:
1. Go to: http://www.ensembl.org/biomart/martview/ae61dc4ffc69cd6843ce5db62c36500d
2. In ‘Choose database’, select “Ensembl 56”
3. In ‘Choose dataset’, select “Homo sapiens genes (GRCh37)”
4. Now you need to go through 3 steps (specify which genes you want to learn
   about, specify what you want to learn about them, and output the results)
   a. Specifying which gene:
      i. Click the “Filters” tab on the left
      ii. Click ‘Gene’ in the main region
      iii. Check the ‘ID list limit’ box
      iv. In the edit box, enter the Ensembl identifier of your human gene
          (for example, ENSG00000012048 for BRCA1)
   b. Specifying which attributes:
      i. Click the ‘Attributes’ tab on the left
      ii. Select the ‘Sequences’ radio button
      iii. Click on SEQUENCES
      iv. Select the ‘Flank (Gene)’ radio button
      v. Check the “Upstream flank” box
      vi. Enter 1000 (for 1000 base pairs upstream of your gene)
      vii. Click “Header Information” (you may have to scroll down the page)
      viii. Uncheck “Ensembl Transcript ID” (not important for you)
   c. Outputting the results:
      i. Click “Results” towards the top of the page
5. Repeat this for the orthologs, remembering to change the dataset to the
   respective genome each time.
Problem 5. Detect cis-regulatory motifs in your promoter sequences.
Now you can try to identify the motifs common to the five promoter sequences.
Describe the motifs you find. In particular, state which motifs are present in
all five sequences.
1. Use MEME: http://meme.sdsc.edu/meme4_3_0/cgi-bin/meme.cgi
2. Enter your email address (twice)
3. Paste in the five sequences in FASTA format, such as:
>human
ACCGT…
>mouse
GTTGGT…
>rat
…
4. Set the parameters to:
5. Click “Start search”
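The FASTA input described in step 3 can be assembled with a few lines of Python instead of by hand. This is a minimal sketch: the function name `to_fasta`, the 60-character line width, and the short placeholder sequences are all assumptions for illustration. In practice you would paste in the 1000 bp upstream sequences retrieved in Problem 4.

```python
def to_fasta(records, width=60):
    """Format (label, sequence) pairs as FASTA text.

    Whitespace inside each sequence is removed, the sequence is upper-cased,
    and it is wrapped at `width` characters per line.
    """
    lines = []
    for label, sequence in records:
        lines.append(">" + label)
        cleaned = "".join(sequence.split()).upper()
        for start in range(0, len(cleaned), width):
            lines.append(cleaned[start:start + width])
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Placeholder sequences only; replace them with your real promoters.
    promoters = [
        ("human", "ACCGT"),
        ("mouse", "GTTGGT"),
    ]
    print(to_fasta(promoters), end="")
```

Using one short label per species, as in the example in step 3, makes it easier to tell in the MEME output which sequences contain each motif.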