Math 5233 lab assignment for Monday February 26th

Due Wednesday, March 28th.

Groups: You can work with other people of your choice, in groups of up to three.

This assignment will increase your familiarity with patterns in protein sequences, specifically those found in the Prosite database. You will also learn to use a little of Python's regular expression module, 're' (described in more detail here).

Go to Prosite and find the entry PS00198. Read the documentation page, and find the consensus pattern. We want to search the Plasmodium falciparum proteins for this pattern. To do this, we will need to load the re module:
```
import re 
```
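Next we need to convert the Prosite pattern into standard regular expression syntax. The following function will convert most patterns. It strips out the redundant ' - ' separators and translates the Prosite notation: 'x' (any residue) becomes '.', '{...}' (excluded residues) becomes '[^...]', and '(n)' (a repeat count) becomes '{n}'. It does not translate the '<' and '>' anchors that some patterns use for the N- and C-terminus: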
```
def PStoRE(PrositePattern):
    rePattern = PrositePattern
    rePattern = rePattern.replace(' - ','')   # drop the element separators
    rePattern = rePattern.replace('x','.')    # x = any residue
    rePattern = rePattern.replace('{','[^')   # {ABC} = any residue except A, B, C
    rePattern = rePattern.replace('}',']')
    rePattern = rePattern.replace('(','{')    # (n) or (n,m) = repeat count; must come after the {} step
    rePattern = rePattern.replace(')','}')
    return rePattern
```
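As a rough illustration of the same translation idea, here is a sketch of a slightly more general converter. The name `prosite_to_regex` is hypothetical and not part of the lab code. It assumes patterns may be written with or without spaces around the hyphens, may end in a period, and may use '<' and '>' as anchors outside of brackets. It does not handle a '>' written inside square brackets.

```python
import re

def prosite_to_regex(pattern):
    """Translate a Prosite pattern into Python regex syntax (illustrative sketch)."""
    # Remove spaces, hyphen separators, and an optional trailing period.
    stripped = pattern.replace(' ', '').replace('-', '').rstrip('.')
    # Character-by-character translation of the Prosite notation.
    table = {
        'x': '.',    # any residue
        '{': '[^',   # start of an excluded-residue set
        '}': ']',    # end of an excluded-residue set
        '(': '{',    # start of a repeat count
        ')': '}',    # end of a repeat count
        '<': '^',    # N-terminal anchor (assumed to appear outside brackets)
        '>': '$',    # C-terminal anchor (assumed to appear outside brackets)
    }
    return ''.join(table.get(ch, ch) for ch in stripped)

# The PS00198 pattern, written without spaces around the hyphens.
converted = prosite_to_regex('C-x-{P}-C-{C}-x-C-{CP}-x-{C}-C-[PEG]')
print(converted)  # C.[^P]C[^C].C[^CP].[^C]C[PEG]

# A made-up pattern that exercises repeat counts and an anchor.
print(prosite_to_regex('<A-x-[ST](2)-x(0,1)-V.'))  # ^A.[ST]{2}.{0,1}V

# The result should be accepted by re.compile.
compiled = re.compile(converted)
```

Because each Prosite symbol maps to a fixed piece of regex syntax, a single lookup table is enough here, and the order of the replacements no longer matters.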
For example, given the Prosite consensus pattern for PS00198 as a string, we would convert it:
```
PS00198 = 'C - x - {P} - C - {C} - x - C - {CP} - x - {C} - C - [PEG]'
PS00198re = PStoRE(PS00198)
```
And now we 'compile' it into a searchable regular expression:
```
PS00198compiled = re.compile(PS00198re)
```
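Before searching real proteins, it can help to try the compiled expression on a short string where the answer is known. The sketch below uses two made-up toy sequences (not real proteins) and writes out the regular expression that the conversion of PS00198 above should produce. `findall` returns the matched substrings, while `finditer` also gives their positions.

```python
import re

# Regular expression obtained by converting the PS00198 pattern.
ps00198 = re.compile('C.[^P]C[^C].C[^CP].[^C]C[PEG]')

# Made-up toy sequences for illustration only.
toy_match = 'MKCAACAACAAACPLL'      # contains one occurrence of the pattern
toy_no_match = 'MKCAPCAACAAACPLL'   # a P in the excluded {P} position

print(ps00198.findall(toy_match))     # ['CAACAACAAACP']
print(ps00198.findall(toy_no_match))  # []

# finditer yields match objects, which record where each match starts and ends.
positions = [(m.start(), m.end(), m.group()) for m in ps00198.finditer(toy_match)]
print(positions)  # [(2, 14, 'CAACAACAAACP')]
```

Note that `findall` and `finditer` report non-overlapping matches only, scanning the sequence from left to right.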
To search the proteins, first download this file as before (the February 19th lab). Then load the Fasta package from Biopython and use it to iterate over the sequences. For example, you could do something like this:
```
from Bio import Fasta

# Open the FASTA file of Plasmodium proteins for reading.
PlasProtsFile = open('/Users/lab_user/Desktop/PlasProts.fasta','r')
Parser = Fasta.RecordParser()
FastaIterator = Fasta.Iterator(PlasProtsFile, parser = Parser)
for Record in FastaIterator:
    # findall returns every non-overlapping match in the sequence.
    Matches = PS00198compiled.findall(Record.sequence)
    if len(Matches) != 0:
        print(Record.title)
        print(Matches)
```
This prints the description line of each protein record that has at least one match to the regular expression, followed by the list of matched substrings.
Do the matches seem correct? Why or why not?
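On an installation where the `Bio.Fasta` module is not available, the same search can be sketched with only the standard library. The helper names `read_fasta` and `find_pattern` below are hypothetical, and the two records are made-up toy data standing in for the real protein file. This is one possible approach under those assumptions, not a replacement for the Biopython parser.

```python
import io
import re

def read_fasta(handle):
    """Yield (title, sequence) pairs from an open FASTA file handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            # A new header line ends the previous record, if there was one.
            if title is not None:
                yield title, ''.join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    # Emit the final record.
    if title is not None:
        yield title, ''.join(chunks)

def find_pattern(handle, compiled):
    """Return (title, matches) for every record with at least one match."""
    hits = []
    for title, sequence in read_fasta(handle):
        matches = compiled.findall(sequence)
        if matches:
            hits.append((title, matches))
    return hits

# Toy FASTA data; the first sequence is split across two lines on purpose.
toy_fasta = io.StringIO(
    ">toy1 hypothetical protein\nMKCAACAA\nCAAACPLL\n"
    ">toy2 no match\nMKKLLAAG\n"
)
ps00198 = re.compile('C.[^P]C[^C].C[^CP].[^C]C[PEG]')
hits = find_pattern(toy_fasta, ps00198)
print(hits)  # [('toy1 hypothetical protein', ['CAACAACAAACP'])]
```

Joining the sequence lines before searching matters: a match that spans a line break in the file, as in the first toy record, would be missed if each line were searched separately.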

Repeat the above exercise for PS00534. The first loop read the file through to its end, so you will have to reset the file and iterator before searching again:
```
PlasProtsFile.seek(0)
FastaIterator = Fasta.Iterator(PlasProtsFile, parser = Parser)
```
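The need for `seek(0)` can be seen with a small stand-in for the file. The sketch below uses `io.StringIO` with made-up contents in place of the real protein file: once a handle has been read to the end, a second pass finds nothing until the position is moved back to the start.

```python
import io

# A small in-memory stand-in for an open FASTA file (toy contents).
handle = io.StringIO(">toy1\nMKCA\n>toy2\nGGLL\n")

# First pass: reads every line and leaves the handle at the end of the data.
first_pass = [line.strip() for line in handle if line.startswith('>')]

# Second pass without resetting: there is nothing left to read.
second_pass = [line.strip() for line in handle if line.startswith('>')]

# After seek(0) the handle is back at the beginning and can be read again.
handle.seek(0)
third_pass = [line.strip() for line in handle if line.startswith('>')]

print(first_pass)   # ['>toy1', '>toy2']
print(second_pass)  # []
print(third_pass)   # ['>toy1', '>toy2']
```

An alternative design is to read all records into a list once and search that list with each compiled pattern, which avoids rereading the file at the cost of holding every sequence in memory.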
Also, do a blastp search with the full protein sequence for any matching Plasmodium proteins. What are the best hits that are not in the phylum Alveolata?

Repeat the above exercise for PS00078. There is evidence that the original mitochondrial protein split into two pieces at some point, possibly when its gene moved to the nuclear genome. Which protein in P. falciparum is the other half?