Math 5233 lab assignment for Wednesday February 27

Custom scoring matrices for AT-rich genomes, part 4.

Due Monday, March 10th (before class), as an emailed worksheet.

Groups: This lab will be done individually, but some collaboration is encouraged. You must do part 4a yourself - i.e. you can consult with others but you have to write your own code, you cannot just copy it from someone else.

This lab series is somewhat complicated, by far the most complicated that we will do. Here is an overview (more details for each step are given later):

Compute amino acid frequencies for the organism
Generate alignments of some highly conserved proteins

Prepare FASTA sequence files: remove the AT rich genomes of the other three organisms from each sequence file
Create the alignments: use Clustalw to create alignments
Save the alignments on whatever server you are using

Create scoring matrix

Load the appropriate modules from biopython
Count transitions of amino acids from other organisms to your AT-rich organism
Create a log-odds scoring matrix
Normalize the matrix to have the same entropy as BLOSUM62
Save the matrix

Use the scoring matrix (this week)

Modify Needleman-Wunsch program to do Smith-Waterman local alignments.

Choose two known proteins from your organism from the previous labs

Find orthologs in another species

Use your scoring matrix and BLOSUM62 to generate Smith-Waterman alignments of the two proteins

Compare the resulting scores and alignments

Part 4

The first part of this week's lab is to modify the Needleman-Wunsch implementation below so that it implements the Smith-Waterman algorithm. To summarize this: you need to change the score in each cell so that it is at least 0, change the initial edge scores to 0, start the traceback from the maximum score (instead of the bottom corner, and end at a cell with a positive score.
To use the NW function, you need to have the BLOSUM62 matrix defined as a dictionary called b62:

from Bio.SubsMat import MatrixInfo
b62 = MatrixInfo.blosum62
for key in b62.keys():
    b62[(key[1],key[0])] = b62[key]

def NW(string1, string2, scoring_dict = b62, gap = -6, verbose = False):
    """Performs Needleman-Wunsch alignment of string1 and string2.
    Prints out the alignment and returns the array of scores and pointers(arrows).
    Scores, Pointers = NW('Pelican', 'trellisant', verbose = True)
    """   
    # initialize scoring and 'arrow' matrices to 0
    Scores = [[0 for x in range(len(string2)+1)] for y in range(len(string1)+1)]
    Pointers = [[0 for x in range(len(string2)+1)] for y in range(len(string1)+1)]
    # Convert to uppercase:
    string1 = string1.upper()
    string2 = string2.upper()
    # initialize borders
    # for pointers (arrows), use 2 for diagonal, -1 for horizontal, and 1 for vertical moves.
    # I have tried to consistently use i for rows (vertical positions) in the score and pointer tables, and j for columns (horizontal positions).
    for i in range(len(string1)+1):
        Scores[i][0] = gap*i
        Pointers[i][0] = 1
    for j in range(len(string2)+1):
        Scores[0][j] = gap*j
        Pointers[0][j] = -1
    # fill with scores
    for i in range(1,len(string1)+1):
        for j in range(1,len(string2)+1):
            letter1 = string1[i-1]
            letter2 = string2[j-1]
            DiagonalScore = Scores[i-1][j-1] + scoring_dict[(letter1, letter2)]
            HorizontalScore = Scores[i][j-1] + gap
            UpScore = Scores[i-1][j] + gap
            # TempScores is list of the three scores and their pointers
            TempScores = [[DiagonalScore,2],[HorizontalScore,-1],[UpScore,1]]
            # Now we keep the highest score, and the associated direction (pointer)
            Scores[i][j], Pointers[i][j] = max(TempScores)
    # backtrace from the last entry.  
    [i,j] = [len(string1),len(string2)]
    align1 = ''
    align2 = ''
    while [i,j] != [0,0]:
        if Pointers[i][j] == 2:
            align1 = align1 + string1[i-1]
            align2 = align2 + string2[j-1]
            i = i - 1
            j = j - 1
        elif Pointers[i][j] == -1:
            align1 = align1 + '-'
            align2 = align2 + string2[j-1]
            j = j - 1
        else:
            align1 = align1 + string1[i-1]
            align2 = align2 + '-'
            i = i - 1
    # the alignments have been created backwards, so we need to reverse them:
    align1 = align1[::-1]
    align2 = align2[::-1]
    # print out alignment
    print align1
    print align2
    print 'Score: ' + str(Scores[len(string1)][len(string2)])
    # in case you want to look at the scores and pointers, the function returns them
    if verbose: return [Scores,Pointers]

The rest of the lab is well summarized by the outline: find two orthologs to known proteins from your organism in a "normal" species such as yeast, or Homo sapiens. Use the Smith-Waterman function to generate alignments and scores with the BLOSUM62 matrix and your custom scoring matrix. Are the alignments different? If so, does one make more sense than the other?

Extra credit: try to find an unknown or hypothetical protein from the AT-rich organism with a weak match to a known protein, and redo the previous exercise.