Math 5233 lab assignment for Monday February 26th

Due Wednesday, March 7th. We will also use next Monday (March 5th) to work on this.

Groups: Group 1: Bethany, Amanda, Brad. Proteins: G6PD (fasta, alignments), adenylate cyclase (fasta, alignments)
Group 2: Luke, Alayna, Noland. Proteins: Serine/threonine protein phosphatase (fasta, alignments), elongation factor 1 (fasta, alignments)
Group 3: Maria, Eric, Terrence. Proteins: Aconitase(fasta, alignments), Serine/threonine protein phosphatase (fasta, alignments)
Group 4: Shane, Marissa, Doug. Proteins: G6PD (fasta, alignments), Heat shock protein 90(fasta, alignments)

There are several goals of this weeks lab: (1) to begin your acquaintance with biopython, a library of useful bioinformatic python tools; (2) look at some multiple sequence alignments; (3) use multiple sequence alignments to construct a scoring matrix.
You can regenerate or change alignments at the EBI server, which also has Jalview available for viewing alignments.

Do some research on the two proteins and summarize your findings in a paragraph or two; also indicate your sources. It may be helpful to use the accession numbers from the fasta files as search terms in NCBI Entrez.

Download the fasta sequence and clustal-format alignment files for your two proteins. Start up python and load the Clustalw and AlignInfo modules:
```
 from Bio.Align import AlignInfo 
```
```
 from Bio import Clustalw 
```
Now you can load a multliple alignment using something like:
```
 align1 = Clustalw.parse_file('/Users/lab_user/Desktop/msa.aln') 
```
It is now relatively easy to extract information from the multiple alignment. For example, you can get the number of columns in the alignment with the get_alignment_length() function, or extract column number N with get_column(N).
Your goal is to use the two multiple alignments to construct a scoring matrix that reflects the transitions seen in Plasmodium falciparum.
The scores in your matrix should be the log of the ratio of the observed transitions to the expected transitions (based upon the amino acid frequencies). I believe the easiest way to keep track of what you need is to use dictionaries. I.e., you can score your observed frequencies and scores in a matrix like this one. I have done the alignments so that the Plasmodium record is the last one. Compute all of your transition frequencies with respect to the Plasmodium record, and do not count gaps. For example, the following code counts would count the transitions for an alignment align1:
```
 QijDict = {}
```
```
 for i in range(align1.get_alignment_length()): 
     column = align1.get_column(i) 
     if column[-1] != '-': 
         for letters in column[:len(column)-1]: 
             if letters != '-': 
                 curkey = column[-1] + letters 
                 if QijDict.has_key(curkey): 
                     QijDict[curkey] = QijDict[curkey] + 1 
                 else: 
                     QijDict[curkey] = 1 
```
To check that you have all the amino acid letters, the following code might be convenient:
```
 letterlist = list(set([x[0] for x in QijDict.keys()])) 
```
(Using set() converts its argument into a set, which removes duplicates. Then I change it back to a list.)
Its a good idea to add pseudocounts - that is, add in at least 1 count for each possible transition.
Once you have a dictionary with the transition frequencies, the scores should be computed as the logarithm of (total of transition matrix)*(# transitions from 'i' to 'j')/((the sum of the 'i' row)*(sum of the 'j' column)); this is equivalent to the equations in our readings.

How does your matrix compare to the BLOSUM62 matrix? (What entries are much higher or lower? How do scores between amino acids with different properties (hydrophobicity, size, charge) compare between the two? What is the entropy per alignment position for each matrix? What are the expected scores?)
It may be helpful to access the BLOSUM62 matrix in python. If you have biopython installed, you can get dictionaries for many scoring matrices by loading the MatrixInfo module:
```
 from Bio.SubsMat import MatrixInfo 
 b62 = MatrixInfo.blosum62
 b62[('W','W')]
```
Note that the keys are in 2-tuples instead of concatenated as I was doing above. Also, unfortunately, since most scoring matrices are symmetric they only store half the matrix. So, for example, there is no key ('A','P'), only ('P','A'). The following would fix this:
```
for key in b62.keys():
    b62[(key[1],key[0])] = b62[key]
```
In comparing the matrices, it would be good to scale your matrix to be 'close' to BLOSUM62 in some sense. This corresponds to choosing a lambda in most notation (the S. Altschul online Blast statistics page, for example, or in the compositional bias correct paper by Yu and Altschul). Good choices might be: so the two matrices have equal largest entries, equal entropies, or equal expected scores.