Genome Evolution Course 2009-2010

www.yanaiweb.com/genome

Itai Yanai, Technion – Israel Institute of Technology

 

Tutorial presentation available as PDF or PowerPoint.

 

Problem Set #7 assigned December 6th, 2009

 

To be submitted as a hard copy in English or Hebrew on December 13th, 2009 (at the beginning of class, 9:30am).

 

Gene Duplications

 

I have downloaded the COG database of unicellular organisms and placed it in matrix format. Each row corresponds to a cluster of orthologous genes (a COG), that is, a gene family across genomes, and each column represents a genome. For example, the first row (COG00001) is the gene family Glutamate-1-semialdehyde aminotransferase, which is found as 61 different genes in 48 different genomes. A COG is constructed when orthology is detected among at least 3 organisms. Thus, only those genes in an organism that have orthologs in at least two other organisms are represented in this matrix. However, since most of a genome’s genes are represented in COGs, for this exercise ignore the genes not included in COGs; that is, assume that the sum of each column is equal to the total number of genes in that genome.
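
To make the matrix layout concrete, here is a minimal Python sketch. The file format of the real matrix is not specified here, so the sketch uses a small invented table; the COG counts, the genome names, and the helper names (`genome_column`, `total_genes`, `row_summary`) are all assumptions for illustration only.

```python
# Toy stand-in for the COG matrix: rows are COGs, columns are genomes.
# All counts and genome names below are invented for illustration.
genomes = ["genomeA", "genomeB", "genomeC"]
cog_matrix = {
    "COG00001": [2, 1, 1],
    "COG00002": [0, 1, 3],
    "COG00003": [1, 1, 1],
}

def genome_column(matrix, genomes, name):
    """Return {cog_id: gene_count} for one genome (one matrix column)."""
    j = genomes.index(name)
    return {cog: row[j] for cog, row in matrix.items()}

def total_genes(column):
    """Column sum, taken as the genome's total gene count in this exercise."""
    return sum(column.values())

def row_summary(row):
    """Return (number of genes, number of genomes containing the family)."""
    return sum(row), sum(1 for n in row if n > 0)

col = genome_column(cog_matrix, genomes, "genomeC")
print(total_genes(col))                     # 1 + 3 + 1 = 5
print(row_summary(cog_matrix["COG00001"]))  # (4, 3)
```

In the real matrix, the same row summary for COG00001 would correspond to the 61 genes in 48 genomes quoted above.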

 

You have been assigned a genome.

 

Problem 1: Show the distribution of gene family sizes for your genome.
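
One possible way to tabulate such a distribution is sketched below in Python. The column values are invented, the name `family_size_distribution` is hypothetical, and leaving out families that are absent from the genome (size 0) is an assumption about what the distribution should cover.

```python
from collections import Counter

# Hypothetical column for one genome: gene count per COG (toy values).
my_column = {
    "COG00001": 2, "COG00002": 0, "COG00003": 1, "COG00004": 5,
    "COG00005": 1, "COG00006": 2, "COG00007": 1,
}

def family_size_distribution(column):
    """Map family size -> number of families of that size.

    Families with zero genes in this genome are skipped (an assumption).
    """
    return Counter(size for size in column.values() if size > 0)

dist = family_size_distribution(my_column)
for size in sorted(dist):
    print(size, dist[size])
```

The resulting size/count pairs can then be drawn as a histogram with any plotting tool.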

 

Problem 2: Identify the biggest (or one of the biggest) paralog family (COG) in your genome. What is its size distribution across the other genomes?
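
A rough Python sketch of this lookup, on an invented four-genome matrix, might look as follows. The function names are hypothetical, and ties for the largest family are resolved by simply taking the first one encountered.

```python
from collections import Counter

# Toy matrix: rows are COGs, columns are genomes (all values invented).
genomes = ["genomeA", "genomeB", "genomeC", "genomeD"]
cog_matrix = {
    "COG00001": [2, 1, 1, 0],
    "COG00002": [6, 0, 3, 6],
    "COG00003": [1, 1, 1, 1],
}

def largest_family(matrix, j):
    """Return the COG with the most genes in column j (first one on ties)."""
    return max(matrix, key=lambda cog: matrix[cog][j])

def sizes_in_other_genomes(matrix, cog, j):
    """Map family size -> number of other genomes in which it has that size."""
    return Counter(n for k, n in enumerate(matrix[cog]) if k != j)

j = genomes.index("genomeA")
top = largest_family(cog_matrix, j)
others = sizes_in_other_genomes(cog_matrix, top, j)
print(top, dict(others))
```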

 

Problem 3: How many of your genome’s genes are paralogs (have duplicates in the genome assigned to the same gene family)?
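
The counting can be sketched in Python as below, using invented values. Two readings of "paralog" are shown, and which one is intended is an assumption: either every gene in a family with two or more members in the genome counts, or only the extra copies beyond the first. Both function names are hypothetical.

```python
# Hypothetical column for one genome: gene count per COG (toy values).
my_column = {
    "COG00001": 2, "COG00002": 0, "COG00003": 1,
    "COG00004": 5, "COG00005": 1,
}

def count_paralogs(column):
    """Count every gene that shares its family with another gene in the genome."""
    return sum(n for n in column.values() if n >= 2)

def count_extra_copies(column):
    """Alternative reading: count only copies beyond the first in each family."""
    return sum(n - 1 for n in column.values() if n >= 2)

total = sum(my_column.values())
print(count_paralogs(my_column), "of", total)  # 2 + 5 = 7 of 9
print(count_extra_copies(my_column))           # 1 + 4 = 5
```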

 

Problem 4: What fraction of your genome’s genes belong to families with representatives across all other genomes?
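
As a sketch under stated assumptions, the fraction could be computed in Python as follows. The matrix is invented, `universal_fraction` is a hypothetical name, and a family is treated as having representatives across all other genomes when every other column has a nonzero count.

```python
# Toy matrix: rows are COGs, columns are genomes (all values invented).
genomes = ["genomeA", "genomeB", "genomeC"]
cog_matrix = {
    "COG00001": [2, 1, 1],  # present in every other genome
    "COG00002": [3, 0, 3],  # missing from genomeB
    "COG00003": [1, 1, 1],  # present in every other genome
    "COG00004": [0, 2, 1],  # absent from genomeA, so it contributes no genes
}

def universal_fraction(matrix, j):
    """Fraction of column j's genes in families found in all other genomes."""
    total = sum(row[j] for row in matrix.values())
    universal = sum(
        row[j]
        for row in matrix.values()
        if all(n > 0 for k, n in enumerate(row) if k != j)
    )
    return universal / total

frac = universal_fraction(cog_matrix, genomes.index("genomeA"))
print(frac)  # (2 + 1) / (2 + 3 + 1 + 0) = 0.5
```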

 

Problem 5: In comparison with the genome next to yours in the matrix (your column + 1), which families are significantly larger (by more than three members) and which are significantly smaller (by more than three members, yet still present)?
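
The comparison can be sketched in Python as below. This is only an illustration on invented data: it assumes "larger" and "smaller" are measured from your genome's point of view, and that "still present" means the family has at least one member in your genome. The name `compare_with_neighbor` and its `threshold` parameter are hypothetical.

```python
# Toy matrix with two adjacent columns: yours, then the neighbor (values invented).
genomes = ["genomeA", "genomeB"]
cog_matrix = {
    "COG00001": [7, 2],  # 5 more members in genomeA
    "COG00002": [1, 6],  # 5 fewer members in genomeA, but still present
    "COG00003": [0, 9],  # absent from genomeA, so not reported as smaller
    "COG00004": [4, 1],  # exactly 3 more, which is not more than 3
}

def compare_with_neighbor(matrix, j, threshold=3):
    """Return (larger, smaller) COG lists for column j versus column j + 1."""
    larger, smaller = [], []
    for cog, row in matrix.items():
        mine, neighbor = row[j], row[j + 1]
        if mine - neighbor > threshold:
            larger.append(cog)
        elif neighbor - mine > threshold and mine > 0:
            smaller.append(cog)
    return larger, smaller

larger, smaller = compare_with_neighbor(cog_matrix, genomes.index("genomeA"))
print(larger, smaller)
```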