Custom scoring matrices for AT-rich genomes, part 4.
Due Monday, March 10th (before class), as an emailed worksheet.
Groups: This lab will be done individually, but some collaboration is encouraged. You must do part 4a yourself - i.e. you can consult with others but you have to write your own code, you cannot just copy it from someone else.
This lab series is somewhat complicated, by far the most complicated that we will do. Here is an overview (more details for each step are given later):
- Compute amino acid frequencies for the organism
- Generate alignments of some highly conserved proteins
- Prepare FASTA sequence files: remove the AT rich genomes of the other three organisms from each sequence file
- Create the alignments: use Clustalw to create alignments
- Save the alignments on whatever server you are using
- Create scoring matrix
- Load the appropriate modules from biopython
- Count transitions of amino acids from other organisms to your AT-rich organism
- Create a log-odds scoring matrix
- Normalize the matrix to have the same entropy as BLOSUM62
- Save the matrix
- Use the scoring matrix (this week)
- Modify Needleman-Wunsch program to do Smith-Waterman local alignments.
- Choose two known proteins from your organism from the previous labs
- Find orthologs in another species
- Use your scoring matrix and BLOSUM62 to generate Smith-Waterman alignments of the two proteins
- Compare the resulting scores and alignments
Part 4
- The first part of this week's lab is to modify the Needleman-Wunsch implementation below so that it implements the Smith-Waterman algorithm. To summarize this: you need to change the score in each cell so that it is at least 0, change the initial edge scores to 0, start the traceback from the maximum score (instead of the bottom corner, and end at a cell with a positive score.
To use the NW function, you need to have the BLOSUM62 matrix defined as a dictionary called b62:
from Bio.SubsMat import MatrixInfo
b62 = MatrixInfo.blosum62
for key in b62.keys():
b62[(key[1],key[0])] = b62[key]
Needleman-Wunsch function:
def NW(string1, string2, scoring_dict = b62, gap = -6, verbose = False):
"""Performs Needleman-Wunsch alignment of string1 and string2.
Prints out the alignment and returns the array of scores and pointers(arrows).
Scores, Pointers = NW('Pelican', 'trellisant', verbose = True)
"""
# initialize scoring and 'arrow' matrices to 0
Scores = [[0 for x in range(len(string2)+1)] for y in range(len(string1)+1)]
Pointers = [[0 for x in range(len(string2)+1)] for y in range(len(string1)+1)]
# Convert to uppercase:
string1 = string1.upper()
string2 = string2.upper()
# initialize borders
# for pointers (arrows), use 2 for diagonal, -1 for horizontal, and 1 for vertical moves.
# I have tried to consistently use i for rows (vertical positions) in the score and pointer tables, and j for columns (horizontal positions).
for i in range(len(string1)+1):
Scores[i][0] = gap*i
Pointers[i][0] = 1
for j in range(len(string2)+1):
Scores[0][j] = gap*j
Pointers[0][j] = -1
# fill with scores
for i in range(1,len(string1)+1):
for j in range(1,len(string2)+1):
letter1 = string1[i-1]
letter2 = string2[j-1]
DiagonalScore = Scores[i-1][j-1] + scoring_dict[(letter1, letter2)]
HorizontalScore = Scores[i][j-1] + gap
UpScore = Scores[i-1][j] + gap
# TempScores is list of the three scores and their pointers
TempScores = [[DiagonalScore,2],[HorizontalScore,-1],[UpScore,1]]
# Now we keep the highest score, and the associated direction (pointer)
Scores[i][j], Pointers[i][j] = max(TempScores)
# backtrace from the last entry.
[i,j] = [len(string1),len(string2)]
align1 = ''
align2 = ''
while [i,j] != [0,0]:
if Pointers[i][j] == 2:
align1 = align1 + string1[i-1]
align2 = align2 + string2[j-1]
i = i - 1
j = j - 1
elif Pointers[i][j] == -1:
align1 = align1 + '-'
align2 = align2 + string2[j-1]
j = j - 1
else:
align1 = align1 + string1[i-1]
align2 = align2 + '-'
i = i - 1
# the alignments have been created backwards, so we need to reverse them:
align1 = align1[::-1]
align2 = align2[::-1]
# print out alignment
print align1
print align2
print 'Score: ' + str(Scores[len(string1)][len(string2)])
# in case you want to look at the scores and pointers, the function returns them
if verbose: return [Scores,Pointers]
The rest of the lab is well summarized by the outline: find two orthologs to known proteins from your organism in a "normal" species such as yeast, or Homo sapiens. Use the Smith-Waterman function to generate alignments and scores with the BLOSUM62 matrix and your custom scoring matrix. Are the alignments different? If so, does one make more sense than the other?
Extra credit: try to find an unknown or hypothetical protein from the AT-rich organism with a weak match to a known protein, and redo the previous exercise.