Download the fasta sequence and clustal-format alignment files for your two proteins. Start up python and load the Clustalw and AlignInfo modules: from Bio.Align import AlignInfo
from Bio import Clustalw
Now you can load a multliple alignment using something like: align1 = Clustalw.parse_file('/Users/lab_user/Desktop/msa.aln')
It is now relatively easy to extract information from the multiple alignment. For example, you can get the number of columns in the alignment with the get_alignment_length() function, or extract column number N with get_column(N).
Your goal is to use the two multiple alignments to construct a scoring matrix that reflects the transitions seen in Plasmodium falciparum.
The scores in your matrix should be the log of the ratio of the observed transitions to the expected transitions (based upon the amino acid frequencies). I believe the easiest way to keep track of what you need is to use dictionaries. I.e., you can score your observed frequencies and scores in a matrix like this one.
I have done the alignments so that the Plasmodium record is the last one. Compute all of your transition frequencies with respect to the Plasmodium record, and do not count gaps. For example, the following code counts would count the transitions for an alignment align1:
QijDict = {}
for i in range(align1.get_alignment_length()):
column = align1.get_column(i)
if column[-1] != '-':
for letters in column[:len(column)-1]:
if letters != '-':
curkey = column[-1] + letters
if QijDict.has_key(curkey):
QijDict[curkey] = QijDict[curkey] + 1
else:
QijDict[curkey] = 1
To check that you have all the amino acid letters, the following code might be convenient:
letterlist = list(set([x[0] for x in QijDict.keys()]))
(Using set() converts its argument into a set, which removes duplicates. Then I change it back to a list.)
Its a good idea to add pseudocounts - that is, add in at least 1 count for each possible transition.
Once you have a dictionary with the transition frequencies, the scores should be computed as the logarithm of (total of transition matrix)*(# transitions from 'i' to 'j')/((the sum of the 'i' row)*(sum of the 'j' column)); this is equivalent to the equations in our readings.