Genome Evolution Course 2009-2010

www.yanaiweb.com/genome

Itai Yanai, Technion – Israel Institute of Technology

 

Tutorial Presentation as PDF or PP.

 

Problem Set #12 assigned January 11th, 2010

 

To be submitted as a hard copy, in English or Hebrew, on January 17th, 2010 (at the beginning of class, 9:30am).

 

Problem 1: The probability of finding an ultraconserved sequence. You are given two genome sequences of equal length (3 Gb each). At each position, there is a 66% chance that the two genomes carry the same base pair. This is roughly the situation with the human and mouse genomes. How long do you expect the longest stretch of identical sequence to be?
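One way to sanity-check an answer to Problem 1 is to simulate it at a smaller scale. The sketch below assumes that positions match independently with probability p (an assumption the problem does not state explicitly). It compares a simulated longest run of matches with the standard leading-order estimate, log base 1/p of n(1-p), for the longest run of successes in n independent trials. The function names are illustrative only.

```python
import math
import random


def approx_longest_run(n, p):
    """Leading-order estimate, log base 1/p of n*(1-p), for the longest
    run of matches among n independent positions with match probability p."""
    return math.log(n * (1 - p)) / math.log(1 / p)


def simulate_longest_run(n, p, rng):
    """Draw n independent match/mismatch positions and return the longest
    stretch of consecutive matches."""
    longest = current = 0
    for _ in range(n):
        if rng.random() < p:      # match at this position
            current += 1
            longest = max(longest, current)
        else:                     # mismatch resets the current stretch
            current = 0
    return longest


if __name__ == "__main__":
    rng = random.Random(0)        # fixed seed so the run is repeatable
    n, p = 1_000_000, 0.66        # small test size, not the full 3 Gb
    print("simulated:", simulate_longest_run(n, p, rng))
    print("estimate: ", round(approx_longest_run(n, p), 1))
```

Simulating all 3 Gb in pure Python would be slow, so the example runs on a shorter sequence; the estimate function can be evaluated at any n.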

 

Problem 2: Pick a human gene (any human gene) from a deck of GeneCards. Use the GeneCards database to find a gene for study.

1.      Go to: http://www.genecards.org/

2.      At the bottom of the page, click on one of the 26 letters.

3.      You will arrive at a page listing all of the genes whose names start with that letter. Click on one to see its ‘gene card’. Take a look at all of the useful information. In particular, take note of the Ensembl identifier (listed under ‘external ID’s’ in the first white block of the page). For example, the Ensembl ID of the BRCA1 gene is ENSG00000012048.

4.      What is your gene’s Ensembl identifier?

 

Problem 3. Find the mouse, rat, horse, and rabbit orthologs of your chosen gene. The Ensembl database provides the genomes and annotations of many eukaryotes. You can reach the Ensembl page for your gene by clicking on the link in its gene card or by searching for the gene on the Ensembl website (http://www.ensembl.org/index.html). To find the orthologs, click on the ‘Orthologues’ link on the left. Then note the Ensembl identifiers of the mouse, rat, horse, and rabbit orthologs. What are the Ensembl identifiers of your gene’s orthologs (each begins with “ENS”)?

 

Problem 4. Retrieve the upstream (promoter) sequences of your chosen gene and its orthologs. To find the sequences, use the BioMart server:

1.       Go to: http://www.ensembl.org/biomart/martview/ae61dc4ffc69cd6843ce5db62c36500d

2.       In ‘Choose database’, select “Ensembl 56”

3.       In ‘Choose dataset’, select “Homo sapiens genes (GRCh37)”

4.       Now you need to go through 3 steps (specify which genes you want to learn about, specify what you want to learn about them, and output the results)

a.       Specifying which gene:

        i.      Click the “Filters” tab on the left

        ii.     Click ‘Gene’ in the main region

        iii.    Check the ‘ID list limit’ box

        iv.     In the edit box, enter the Ensembl identifier of your human gene (for example, ENSG00000012048 for BRCA1)

b.      Specifying which Attributes:

        i.      Click the ‘Attributes’ tab on the left

        ii.     Select the ‘Sequences’ radio button

        iii.    Click on SEQUENCES

        iv.     Select the ‘Flank (Gene)’ radio button

        v.      Check the “Upstream flank” box

        vi.     Enter 1000 (for 1000 base pairs upstream of your gene)

        vii.    Click “Header Information” (you may have to scroll down the page)

        viii.   Uncheck “Ensembl Transcript ID” (it is not needed here)

c.       Outputting the results:

        i.      Click “Results” towards the top of the page

5.       Repeat this for each ortholog, remembering to change the dataset to the respective genome.
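The same selections (gene ID filter, 1000 bp upstream flank, FASTA output) can also be written as a BioMart XML query, which is convenient when repeating the request for five species. The sketch below only builds the query and the request URL; it does not contact the server. The endpoint URL, the dataset names, and the filter and attribute names (`ensembl_gene_id`, `upstream_flank`, `gene_flank`) are assumptions to be checked against the Ensembl release you are using, and the helper names are illustrative.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed programmatic endpoint; the tutorial itself uses the martview web page.
BIOMART_URL = "http://www.ensembl.org/biomart/martservice"

# Assumed dataset names, following the "<species>_gene_ensembl" pattern.
DATASETS = {
    "human": "hsapiens_gene_ensembl",
    "mouse": "mmusculus_gene_ensembl",
    "rat": "rnorvegicus_gene_ensembl",
    "horse": "ecaballus_gene_ensembl",
    "rabbit": "ocuniculus_gene_ensembl",
}


def build_flank_query(species, gene_id, upstream=1000):
    """Return a BioMart XML query asking for the upstream flank of one gene."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "FASTA",
        "header": "0",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    dataset = ET.SubElement(
        query, "Dataset", {"name": DATASETS[species], "interface": "default"}
    )
    # Filters: which gene, and how many base pairs of upstream flank.
    ET.SubElement(dataset, "Filter", {"name": "ensembl_gene_id", "value": gene_id})
    ET.SubElement(dataset, "Filter", {"name": "upstream_flank", "value": str(upstream)})
    # Attributes: the flank sequence, plus the gene ID for the FASTA header.
    ET.SubElement(dataset, "Attribute", {"name": "gene_flank"})
    ET.SubElement(dataset, "Attribute", {"name": "ensembl_gene_id"})
    return ET.tostring(query, encoding="unicode")


def build_request_url(species, gene_id, upstream=1000):
    """Return a URL that carries the XML query in its 'query' parameter."""
    xml = build_flank_query(species, gene_id, upstream)
    return BIOMART_URL + "?" + urllib.parse.urlencode({"query": xml})


if __name__ == "__main__":
    # BRCA1 is the example gene used in the steps above.
    print(build_request_url("human", "ENSG00000012048"))
```

For the orthologs, call `build_request_url` with the matching species key and the ortholog's own Ensembl identifier from Problem 3.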

 

Problem 5. Detect cis-regulatory motifs in your promoter sequences. Now you can try to identify the motifs that the sequences have in common. Describe the motifs you find. In particular, state which motifs are present in all five sequences.

1.      Use MEME http://meme.sdsc.edu/meme4_3_0/cgi-bin/meme.cgi

2.      Enter your email address (twice)

3.      Paste in the five sequences in FASTA format, for example:

>human

ACCGT…

>mouse

GTTGGT…

>rat

4.      Set the parameters to:

    1. Distribution of motif occurrences: Any number of repetitions
    2. Number of different motifs: 5
    3. Minimum motif width: 6
    4. Maximum motif width: 10

5.      Click “Start search”
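Before pasting into MEME, it can help to assemble the five promoter sequences into one FASTA block and check that they contain only DNA letters. The following is a minimal sketch; the function name, the 60-character line width, and the short sequences in the usage example are placeholders, not real promoter data.

```python
VALID_BASES = set("ACGTN")


def to_fasta(sequences, width=60):
    """Format a {label: DNA string} mapping as FASTA text.

    Whitespace inside each sequence is removed, letters are upper-cased,
    and each sequence is wrapped to `width` characters per line.
    """
    lines = []
    for label, seq in sequences.items():
        seq = "".join(seq.split()).upper()
        if not seq:
            raise ValueError("empty sequence for " + label)
        bad = set(seq) - VALID_BASES
        if bad:
            raise ValueError(
                "unexpected characters in " + label + ": " + "".join(sorted(bad))
            )
        lines.append(">" + label)
        # Split the sequence into fixed-width lines.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder sequences; substitute the 1000 bp flanks from Problem 4.
    print(to_fasta({"human": "ACCGT", "mouse": "GTTGGT"}))
```

Using the species name as the label keeps the MEME output easy to read when checking which motifs occur in all five sequences.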