Genome Evolution Course
2009-2010
www.yanaiweb.com/genome
Itai Yanai, Technion – Israel Institute of Technology
Tutorial presentation available as PDF or PowerPoint.
Problem Set #7, assigned December 6th, 2009
To be submitted as a hard copy, in English or Hebrew, on December 13th, 2009
(at the beginning of class, 9:30am).
Gene Duplications
I have downloaded the COG database of unicellular organisms and placed it in
matrix format. Each row corresponds to a cluster of orthologous genes (a COG),
that is, a gene family across genomes, and each column represents a genome. For
example, the first row (COG00001) is the gene family Glutamate-1-semialdehyde
aminotransferase, which is found as 61 different genes in 48 different genomes.
A COG is constructed when orthology is detected among at least 3 organisms.
Thus, only those genes in an organism that have orthologs in at least two other
organisms are represented in this matrix. However, since most of a genome’s
genes are represented in COGs, for this exercise ignore the genes not included
in COGs; that is, assume that the sum of each column equals the total number of
genes in that genome.
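The matrix layout described above can be sketched in a few lines of Python. This is only an illustration: the COG identifiers, genome names, and counts are invented, and the real file format of the course matrix is not specified here, so the dictionary-of-rows representation is an assumption.

```python
# Toy stand-in for the COG matrix: rows are COGs, columns are genomes.
# All identifiers and counts are invented for illustration only.
genomes = ["genomeA", "genomeB", "genomeC", "genomeD"]
matrix = {
    "COG_x1": [2, 1, 0, 1],
    "COG_x2": [1, 1, 1, 0],
    "COG_x3": [0, 3, 1, 1],
}


def row_summary(counts):
    """Return (total genes in the family, number of genomes containing it)."""
    return sum(counts), sum(1 for c in counts if c > 0)


def genome_gene_totals(matrix, genomes):
    """Sum each column; under the exercise's assumption this is the gene count."""
    totals = [0] * len(genomes)
    for counts in matrix.values():
        for i, c in enumerate(counts):
            totals[i] += c
    return dict(zip(genomes, totals))


summary = row_summary(matrix["COG_x1"])
totals = genome_gene_totals(matrix, genomes)
```

In this toy example the first row plays the role of COG00001 in the text: it reports how many genes the family has in total and how many genomes carry at least one copy.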
You have been assigned a genome.
Problem 1: Show the distribution of gene family sizes for your genome.
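One possible way to tabulate such a distribution is sketched below, assuming your genome's column is available as a list of per-COG gene counts (the values here are made up). Whether to include families that are absent from the genome (size 0) is a judgment call; the hypothetical `include_absent` flag leaves both options open.

```python
from collections import Counter


def family_size_distribution(column, include_absent=False):
    """Map each family size to the number of COGs of that size in one genome."""
    sizes = column if include_absent else [c for c in column if c > 0]
    return dict(sorted(Counter(sizes).items()))


# Invented counts for one genome, one entry per COG.
column = [1, 1, 2, 0, 5, 1, 2, 0]
distribution = family_size_distribution(column)
with_absent = family_size_distribution(column, include_absent=True)
```

The resulting dictionary can be passed directly to a bar plot or printed as a table.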
Problem 2: Identify the biggest (or one of the biggest) paralog family (COG) in
your genome. What is its size distribution across the other genomes?
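A minimal sketch of this lookup, assuming the matrix is held as a dictionary from COG identifier to a list of per-genome counts (identifiers and numbers are invented). Ties for the largest family are broken arbitrarily by `max`, which matches the "one of the biggest" wording.

```python
from collections import Counter

# Invented toy matrix: rows are COGs, columns follow the order in `genomes`.
genomes = ["genomeA", "genomeB", "genomeC", "genomeD"]
matrix = {
    "COG_x1": [2, 1, 0, 1],
    "COG_x2": [5, 1, 1, 0],
    "COG_x3": [1, 3, 1, 1],
}


def largest_family(matrix, col):
    """Return the COG with the most members in the genome at column `col`."""
    return max(matrix, key=lambda cog: matrix[cog][col])


def sizes_in_other_genomes(matrix, cog, col):
    """Distribution of the family's size over all genomes except `col`."""
    others = [c for i, c in enumerate(matrix[cog]) if i != col]
    return dict(sorted(Counter(others).items()))


col = genomes.index("genomeA")
top = largest_family(matrix, col)
dist = sizes_in_other_genomes(matrix, top, col)
```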
Problem 3: How many of your genome’s genes are paralogs (that is, have
duplicates in your genome assigned to the same gene family)?
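Under one reading of this definition, every member of a family with two or more copies in the genome counts as a paralog. The sketch below follows that reading, which is an assumption; the column values are invented.

```python
def count_paralogs(column):
    """Count genes that share their COG with at least one other gene."""
    return sum(c for c in column if c >= 2)


# Invented counts for one genome, one entry per COG.
column = [1, 1, 2, 0, 5, 1, 2, 0]
paralogs = count_paralogs(column)
total_genes = sum(column)
paralog_fraction = paralogs / total_genes
```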
Problem 4: What fraction of your genome’s genes belong to families with
representatives across all other genomes?
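One way to compute this fraction is sketched below, assuming a dictionary-of-rows matrix with invented values. A family is treated as universal when every other genome has at least one member; families absent from your own genome contribute no genes either way.

```python
def fraction_in_universal_families(matrix, col):
    """Fraction of the genome's genes in COGs present in every other genome."""
    total = 0
    universal = 0
    for counts in matrix.values():
        own = counts[col]
        total += own
        others = counts[:col] + counts[col + 1:]
        if own > 0 and all(c > 0 for c in others):
            universal += own
    return universal / total if total else 0.0


# Invented toy matrix; column 0 plays the role of the assigned genome.
matrix = {
    "COG_x1": [2, 1, 1, 1],  # present everywhere
    "COG_x2": [1, 1, 1, 0],  # missing from the last genome
    "COG_x3": [3, 2, 1, 4],  # present everywhere
    "COG_x4": [0, 1, 1, 1],  # absent from the assigned genome
}
fraction = fraction_in_universal_families(matrix, 0)
```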
Problem 5: In comparison with the genome next to yours in the matrix (your
column + 1), which families are significantly larger (more than three additional
members) and which are significantly smaller (more than three fewer members,
yet still present)?
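The comparison can be sketched as a simple difference between two adjacent columns. Note that "significantly" here means the fixed cutoff stated in the problem, not a statistical test. The matrix values are invented, and the `threshold` parameter name is hypothetical.

```python
def compare_with_neighbor(matrix, col, threshold=3):
    """Return (larger, smaller) COG lists relative to the genome at col + 1.

    larger:  more than `threshold` additional members in the genome at `col`.
    smaller: more than `threshold` fewer members, but still present at `col`.
    """
    larger, smaller = [], []
    for cog, counts in matrix.items():
        own, neighbor = counts[col], counts[col + 1]
        if own - neighbor > threshold:
            larger.append(cog)
        elif neighbor - own > threshold and own > 0:
            smaller.append(cog)
    return larger, smaller


# Invented toy matrix with two columns: the assigned genome and its neighbor.
matrix = {
    "COG_x1": [6, 1],  # 5 more members: larger
    "COG_x2": [1, 5],  # 4 fewer members, still present: smaller
    "COG_x3": [0, 4],  # 4 fewer members but absent: excluded
    "COG_x4": [4, 1],  # exactly 3 more: not "more than three"
    "COG_x5": [2, 2],  # no difference
}
larger, smaller = compare_with_neighbor(matrix, 0)
```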