Due Wednesday, February 28th. You can work alone or in a group of up to 3 people of your choosing. As usual, do not hesitate to ask me for help in the lab, by email, or during office hours.
  1. Find the distribution of amino acids in the Plasmodium genome using the sequences in this file. It may be useful to use the dictionary data type for counting all the amino acid letters. How does the distribution compare to the BLOSUM62 background frequencies (there is a convenient table in this week's reading, the Yu and Altschul paper)? Does the AT-rich nature of the genome explain all the differences?
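
    As a starting point for the dictionary-based counting mentioned above, here is a minimal sketch, not a required solution. It assumes the sequence file is in FASTA format (header lines beginning with '>'), which the assignment does not state explicitly, and it is written for Python 3. The function name amino_acid_frequencies and the toy records are made up for illustration.

```python
def amino_acid_frequencies(fasta_lines):
    """Return {letter: fraction} over all sequence lines (FASTA format assumed)."""
    counts = {}
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and record headers
        for letter in line.upper():
            if letter == "*":
                continue  # '*' sometimes marks a stop codon; it is not an amino acid
            # dict.get supplies 0 the first time a letter is seen
            counts[letter] = counts.get(letter, 0) + 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: n / total for letter, n in counts.items()}


# Toy input, NOT real Plasmodium data.
example = [">seq1 toy record", "MKNN", ">seq2 toy record", "KKNI"]
freqs = amino_acid_frequencies(example)
for letter in sorted(freqs):
    print(letter, round(freqs[letter], 3))
```

    To run it on a real file, pass an open file object in place of the list, since iterating over a file yields its lines.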

  2. Optional but recommended: recompute the amino acid distribution from the above file, but this time filter out any records that have the word 'hypothetical' in them. How is the distribution different?
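
    One possible way to do the filtering is sketched below, again assuming FASTA-formatted input in which the word 'hypothetical' would appear in a record's header line. The helper names read_fasta_records and without_hypothetical are hypothetical, and the case-insensitive match is a design choice of this sketch, not a requirement of the exercise.

```python
def read_fasta_records(fasta_lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines (format assumed)."""
    header, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the record just finished
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


def without_hypothetical(records):
    """Keep only records whose header does not contain 'hypothetical'."""
    return [(h, s) for h, s in records if "hypothetical" not in h.lower()]


# Toy input, NOT real Plasmodium data.
example = [
    ">p1 hypothetical protein", "MKNN",
    ">p2 toy kinase", "KKNI", "LL",
]
kept = without_hypothetical(read_fasta_records(example))
print(kept)
```

    The sequences of the kept records can then be counted exactly as in exercise 1.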

  3. Write a function that computes the entropy of a discrete distribution. It may be helpful to use the math module for logarithms (just use
     import math
    and then
     math.log(x,2)
    will compute the base-2 log of x). For example, if the function were called 'entropy', it would be used like this:
     entropy([.75,.25]) 
    0.81127812445913283
    That is, the function should take a list of numbers [p_1, ..., p_n] (lists are denoted by square brackets in Python), each between 0 and 1 and together summing to 1, and return -sum_i p_i log2(p_i). In case the example function from the Python tutorial seems complicated, here is a simpler function that computes 1/x after checking that x is nonzero:
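     def Reciprocal(x):
         # Only divide when x is nonzero; for x == 0 the function returns None.
         if x != 0:
             return 1.0 / x  # 1.0 forces floating-point division in older Pythons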
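
    As a check on the formula, the value in the usage example can be reproduced term by term. This is only the arithmetic for the two-outcome case [.75, .25], not the general function the exercise asks for; the variable names are illustrative.

```python
import math

p = [0.75, 0.25]

# One term per outcome: -p_i * log2(p_i)
terms = [-p_i * math.log(p_i, 2) for p_i in p]
h = sum(terms)

print(terms)  # roughly 0.311 and 0.5
print(h)      # roughly 0.811, matching the usage example
```

    Note that math.log(0, 2) raises a ValueError, so a general function needs a check similar to the nonzero test in Reciprocal. The usual convention is that an outcome with p_i = 0 contributes nothing to the sum.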


    Use your entropy function to compute the information content (entropy) of the Plasmodium genome from the distribution of single nucleotides (which you calculated last week; the frequency of A would be half of the A+T frequency). Then calculate the information content of the protein file from the previous exercise using its amino acid distribution. How do these numbers compare? Does your answer make sense to you?
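
    The single-nucleotide distribution can be built from the A+T frequency along the following lines. This sketch assumes, by analogy with the hint about A, that G and C each take half of the remaining G+C frequency. The value 0.8 is only a placeholder for the number you computed last week, and nucleotide_distribution is a made-up name.

```python
def nucleotide_distribution(at_fraction):
    """Return [p_A, p_C, p_G, p_T], splitting A+T and G+C evenly (an assumption)."""
    if not 0.0 <= at_fraction <= 1.0:
        raise ValueError("at_fraction must be between 0 and 1")
    p_a_or_t = at_fraction / 2.0          # A and T each get half of A+T
    p_g_or_c = (1.0 - at_fraction) / 2.0  # G and C each get half of the rest
    return [p_a_or_t, p_g_or_c, p_g_or_c, p_a_or_t]


at_fraction = 0.8  # placeholder value; substitute your own measurement
dist = nucleotide_distribution(at_fraction)
print(dist, sum(dist))
```

    The resulting list has the form your entropy function expects: four numbers between 0 and 1 that sum to 1.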

  4. Extra credit mystery identification: what is this protein?