)
username = 'admin' # change this to your own username
filepath = os.environ['HOME'] + '/sage_notebook/worksheets/' + username +'/'
server_file = file(filepath + 'myfilename.aln','w')
server_file.write(online_file.read())
online_file.close()
server_file.close()
The following fasta files have already been aligned, so it is possible to begin to create the scoring matrix before finishing part 1. They also might help understand what to do with the other files.
Group 1: Plasmodium falciparum files: plasmyosins.fa Alignment: plasmyosins.aln
Group 2: Dictylostelium discoideum: dictymyosins.fa Alignment: dictymyosins.aln
Group 3: Mycoplasma mycoides: mycomitot1b.fa Alignment: mycomitot1b.aln
Group 4: Brugia malayi: brugiamyosins.fa Alignment: brugiamyosins.aln
EBI ClustalW server
Part 3
Load the Clustalw and AlignInfo modules:
from Bio.Align import AlignInfo
from Bio import Clustalw
Now you can load a multliple alignment using something like:
align1 = Clustalw.parse_file(filepath + 'myfilename.aln')
It is now relatively easy to extract information from the multiple alignment. For example, you can get the number of columns in the alignment with the get_alignment_length() function, or extract column number N with get_column(N).
To construct the scoring matrix, first we need to compute the transition matrix. We will do this in a way similar to the BLOSUM matrices, as described in Chapter 4. Our matrix will not be symmetric, however - we are counting changes from the last entry to all the other entries. As an example:
Aij_dict = {}
for i in range(align1.get_alignment_length()):
column = align1.get_column(i)
if column[-1] != '-':
for letter in column[:len(column)-1]:
if letter != '-':
curkey = (column[-1], letter)
if Aij_dict.has_key(curkey):
Aij_dict[curkey] += 1
else:
Aij_dict[curkey] = 1
Its a good idea to add pseudocounts - that is, add in at least 1 count for each possible transition:
letterlist = list(union([x[0] for x in Aij_dict.keys()]))
for letter1 in letterlist:
for letter2 in letterlist:
curkey = (letter1, letter2)
if Aij_dict.has_key(curkey):
Aij_dict[curkey] += 1
else:
Aij_dict[curkey] = 1
Now you should divide the Aij_dict entries by the total number of transitions, to get a dictionary of frequencies (corresponding to the book's q_{ij} matrix). Then create the scores as the log(q_{ij}/p_i q_j), where p_i is the frequency of amino acid i in the AT-rich organism (from last week's lab) and q_i is the BLOSUM62 amino acid frequency (from the Yu and Altschul paper).
Finally, scale your matrix so that its entropy is equal to the entropy of the BLOSUM62 matrix. The entropy of a scoring matrix is the sum (over all i and j) of q_{ij} S_{ij} - that is, the sum of all the scores weighted by the transition frequencies.
How does your matrix compare to the BLOSUM62 matrix? (What entries are much higher or lower? How do scores between amino acids with different properties (hydrophobicity, size, charge) compare between the two?)
Here is a way to access the BLOSUM62 matrix in a dictionary format:
from Bio.SubsMat import MatrixInfo
b62 = MatrixInfo.blosum62
for key in b62.keys():
b62[(key[1],key[0])] = b62[key]