Go to Prosite and find the entry PS00198. Read the documentation page, and find the consensus pattern. We want to search the Plasmodium falciparum proteins for this pattern. To do this, we will need to load the re module:
import re
Next we need to convert the Prosite pattern into standard regular expression syntax. The following function will convert most patterns: it strips out the redundant ' - ' separators and translates Prosite's notation (x for any residue, {} for excluded residues, () for repeat counts) into the regular expression equivalents:
def PStoRE(PrositePattern):
    rePattern = PrositePattern
    rePattern = rePattern.replace(' - ','')
    rePattern = rePattern.replace('x','.')
    rePattern = rePattern.replace('{','[^')
    rePattern = rePattern.replace('}',']')
    rePattern = rePattern.replace('(','{')
    rePattern = rePattern.replace(')','}')
    return rePattern
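PStoRE expects the ' - ' separators to be written with spaces around them. As a minimal sketch only, the hypothetical variant below (the name prosite_to_regex is ours, not part of the lab) also accepts hyphens without spaces, drops a trailing period if the pattern was copied with one, and maps the '<' and '>' sequence-end anchors to '^' and '$'. It still does not handle every Prosite construct, for example an anchor placed inside square brackets.

```python
def prosite_to_regex(pattern):
    """Hypothetical variant of PStoRE; a sketch, not a complete Prosite parser."""
    # Remove spaces and the hyphen separators between pattern elements.
    regex = pattern.replace(' ', '').replace('-', '')
    # Drop a trailing period (must happen before 'x' is turned into '.').
    regex = regex.rstrip('.')
    # 'x' means any residue.
    regex = regex.replace('x', '.')
    # {ABC} means "any residue except A, B or C".
    regex = regex.replace('{', '[^').replace('}', ']')
    # (n) or (n,m) is a repeat count.
    regex = regex.replace('(', '{').replace(')', '}')
    # '<' and '>' anchor the pattern to the start and end of the sequence.
    regex = regex.replace('<', '^').replace('>', '$')
    return regex


# Illustrative pattern in Prosite syntax with repeat counts.
print(prosite_to_regex('C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.'))
```

The order of the replacements matters: the curly braces must be converted to character classes before the parentheses are converted to curly braces, otherwise the repeat counts would be rewritten a second time.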
For example, given the Prosite consensus pattern for PS00198 as a string, we would convert it like this:
PS00198 = 'C - x - {P} - C - {C} - x - C - {CP} - x - {C} - C - [PEG]'
PS00198re = PStoRE(PS00198)
And now we 'compile' it into a searchable regular expression:
PS00198compiled = re.compile(PS00198re)
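Before running the compiled pattern over the whole proteome, it can help to try it on a short sequence where the answer is known. The sketch below uses the regular expression that the conversion of the PS00198 pattern produces, and a made-up toy sequence (not a real protein) with one motif instance embedded in it. The lowercase variable names are ours.

```python
import re

# Regular expression obtained by converting the PS00198 pattern as in the text.
ps00198_re = 'C.[^P]C[^C].C[^CP].[^C]C[PEG]'
ps00198_compiled = re.compile(ps00198_re)

# Made-up toy sequence (an assumption for illustration) containing one motif.
toy_sequence = 'MKTCAACAACAAACPLLV'

# finditer yields match objects, so the position of each hit is available
# in addition to the matched text that findall would return.
hits = [(m.start(), m.end(), m.group())
        for m in ps00198_compiled.finditer(toy_sequence)]

for start, end, text in hits:
    print(start, end, text)
```

Positions are 0-based and the end index is exclusive, as with Python slices, so toy_sequence[start:end] gives back the matched text.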
To search the proteins, first download this file as before (the February 19th lab). Then load the Fasta package from biopython and use it to iterate over the sequences. For example, you could do something like this:
PlasProtsFile = file('/Users/lab_user/Desktop/PlasProts.fasta','r')
from Bio import Fasta
Parser = Fasta.RecordParser()
FastaIterator = Fasta.Iterator(PlasProtsFile, parser = Parser)
for Record in FastaIterator:
    Matches = PS00198compiled.findall(Record.sequence)
    if len(Matches) != 0:
        print Record.title
        print Matches
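The loop above depends on the Bio.Fasta interface and on the Python 2 print statement. If either is unavailable, the same search can be sketched with the standard library alone. The read_fasta helper below is a hypothetical stand-in for the Biopython iterator, and the two records are made up so that the example runs without PlasProts.fasta. In a current Biopython installation, Bio.SeqIO.parse(handle, "fasta") should play the same role as the helper.

```python
import io
import re


def read_fasta(handle):
    """Yield (title, sequence) pairs from an open FASTA file handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, ''.join(chunks)
            title, chunks = line[1:], []
        elif line:
            chunks.append(line)
    # Emit the last record in the file.
    if title is not None:
        yield title, ''.join(chunks)


ps00198_compiled = re.compile('C.[^P]C[^C].C[^CP].[^C]C[PEG]')

# Two made-up records standing in for the real protein file.
toy_fasta = io.StringIO(
    '>toy1 made-up record with one motif\n'
    'MKTCAACAAC\n'
    'AAACPLLV\n'
    '>toy2 made-up record without the motif\n'
    'MKTAYIAKQR\n'
)

results = []
for title, sequence in read_fasta(toy_fasta):
    matches = ps00198_compiled.findall(sequence)
    if matches:
        results.append((title, matches))
        print(title)
        print(matches)
```

In the first toy record the motif spans the line break, which is why the sequence lines are joined into one string before searching. To run this on the real data, replace toy_fasta with an open handle to the downloaded file.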
This prints the description line of each protein record that has at least one match to the regular expression, followed by the list of matching substrings found in that record.
Do the matches seem correct? Why or why not?