Due Wednesday, February 21st. You should write and hand in individual assignments, but you are encouraged to work together.
  1. Often in bioinformatics one has to deal with large files. We will practice with a relatively small file and a medium-sized (23 Mb) file. Download each of these. They are nucleotide files for Plasmodium falciparum. The larger one is an attempt at assembling all 14 of the chromosomes of the 3D7 strain. The smaller one is for one of the apicoplasts, an organelle with its own DNA. Suppose you want to compute the A+T percentage of the genome of P. falciparum. (If the four bases were equally distributed, the A+T percentage would be 50%.) Warm up by computing this for the small apicoplast DNA and, if all goes well, do the larger file. To get a feel for the methods needed, open an interactive Python shell (such as PyCrust if it's available; otherwise just type python at a Terminal). To open a text file for reading, you can use something like:
     f = open('/Users/lab_user/Desktop/PlasX95275.fasta', 'r')
    The path to the file might be different if you put it somewhere else or if you are on a different machine. To read large files, it is usually best to do it line by line. To get a line of text from the file, you can assign it to a string (let's call it 'TextLine' for now):
     TextLine = f.readline() 
    To see the line, just type TextLine and Enter.
    As an example, let's see how often the substring 'TATAGTTA' is present in the file. First we go back to the beginning of the file:
     f.seek(0) 
    Initialize a counter to 0:
     MyCounter = 0 
    Now we loop through the lines of the file and increment the counter every time we find that substring:
     for line in f:
         MyCounter = MyCounter + line.count('TATAGTTA')
    This method might be a little inaccurate - do you see why? Now try to get the A+T percentage in the small file. If that seems to work, find out the percentage for the large file. It may be helpful to consult the Python Tutorial, or ask me for help.
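    For reference, the interactive steps above can be collected into one short script. This is only a sketch: it writes a tiny made-up FASTA file to a temporary location (the sequence and header are invented for illustration, not taken from the assignment files) so that it runs anywhere, and it uses open() inside a with block so the file is closed automatically.

```python
import os
import tempfile

# Hypothetical stand-in for the downloaded file: one header line
# followed by two short sequence lines.
demo_text = ">demo sequence\nACGTATAGTTAACG\nTTTATAGTTAGGCA\n"

with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(demo_text)
    demo_path = tmp.name

MyCounter = 0
with open(demo_path, "r") as f:        # closed automatically at the end of the block
    TextLine = f.readline()            # read a single line of text
    f.seek(0)                          # go back to the beginning of the file
    for line in f:                     # loop through the file line by line
        MyCounter = MyCounter + line.count("TATAGTTA")

os.remove(demo_path)                   # clean up the temporary demo file
print(TextLine.strip())
print(MyCounter)
```

    To run this on the real data, replace demo_path with the path to the file you downloaded and drop the temporary-file lines.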
    Note that you have to account for newline characters (usually '\n') and ambiguous nucleic acid codes (see here for example). The big file uses lower-case letters, while the smaller one uses upper-case.
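    One possible way to handle these details is sketched below; it is not the only approach. The function name at_percentage and the demo lines are made up for illustration. The sketch assumes that header lines begin with '>' (the usual FASTA convention) and that ambiguous codes such as N or R should simply be left out of both the numerator and the denominator; you may prefer a different convention, so state which one you used.

```python
from collections import Counter


def at_percentage(lines):
    """Return the A+T percentage over the unambiguous bases in FASTA-style lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):          # assumed FASTA header line: skip it
            continue
        # strip() drops the trailing newline; upper() makes 'a' and 'A' equivalent
        counts.update(line.strip().upper())
    at = counts["A"] + counts["T"]
    total = at + counts["C"] + counts["G"]  # ambiguous codes are not counted
    if total == 0:
        raise ValueError("no A, C, G, or T characters found")
    return 100.0 * at / total


# Made-up lines mixing lower case, upper case, and ambiguity codes (N, R).
demo_lines = [">demo header\n", "aattNcg\n", "ATRAT\n"]
print(at_percentage(demo_lines))
```

    Because an open file yields its lines one at a time when you loop over it, the same function can be given a file object in place of the demo list, which keeps memory use low for the large file.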

    In addition to computing your percentages, hand in (or email) a copy of the Python code that you used.

  2. How does the A+T percentage of the genome compare with the Plasmodium falciparum gene you studied last week? Can you think of any reasons for the high A+T percentage of this genome?

  3. Optional problem: if the previous exercises didn't cause you to sweat, compute the percentage of 'CG' + 'GC' dinucleotides in the P. falciparum genome. How does this compare to what you would expect given the answer to the first exercise?