Processing alignments and the Jukes-Cantor measure of evolutionary change.
Due Wednesday, February 13th (1 per group).
Groups:
Group 1: Seth, Huimin, Kristin
Group 2: Annete, Feng, Gregg
Group 3: Lindsey, Matthew, Hillary
Group 4: Shanshan, Sarah, Jonathan
There are several goals of this week's lab:
(1) to better understand the Jukes-Cantor model of sequence evolution by applying it to some actual sequences;
(2) to learn to use Biopython to parse two of the most common file formats in bioinformatics: FASTA and ClustalW.
As always, do not hesitate to ask for help from me or your partners.
Possibly useful webpages: the EBI ClustalW server (for generating new multiple alignments), NCBI Entrez, Python Tutorial, Biopython docs on sequence input/output.
- As a warmup, check out an entry in the Nucleotide database, and select "FASTA" at the "Display" option. The FASTA format is very simple: there are description lines starting with ">", and all other lines are assumed to contain letters in the sequence(s).
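To make the format description above concrete, here is a minimal sketch of a FASTA reader in plain Python. It is only an illustration of the format, not a replacement for Biopython's parser; the function name parse_fasta and the two example records are made up for this sketch.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line ends the previous record, if there was one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        else:
            # Every other line holds letters of the current sequence.
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Made-up example: two records, the first one split across two lines.
example = """>seq1 made-up example record
ATTGCG
CGT
>seq2 another made-up record
ATTGAGCGT
"""
records = parse_fasta(example)
```

Note that a sequence may be wrapped over several lines; the sketch joins them back into one string per record.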
- Compute the Jukes-Cantor distance between each pair of sequences in this file. You can access the records in the file by executing the following in a cell:
import urllib2
online_file = urllib2.urlopen('http://www.d.umn.edu/~mhampton/nd4s.fa')
from Bio import SeqIO
my_parser = SeqIO.parse(online_file,'fasta')
The parser my_parser can now access the records in the file. The following loop extracts the description and sequence of each record and appends each to a list (copy and execute):
descriptions = []
sequences = []
for record in my_parser:
    descriptions.append(record.description)
    sequences.append(str(record.seq))
online_file.close()
(The last line closes the connection to the online file since we don't need it anymore.) To check that all is well, try:
for stuff in descriptions: print(stuff)
These DNA sequences are all the same length and all code for NADH dehydrogenase subunit 4, a mitochondrial protein. For each pair of sequences, compute the fraction of bases that differ, D, and use that to compute the Jukes-Cantor distance d = -3*log(1 - 4D/3)/4 between them, where log is the natural logarithm.
Do your answers make sense for these species? Why or why not?
With the assumptions of the Jukes-Cantor model, if the human-chimp split occurred 7 million years ago, when did the human-mouse split occur?
Programming note: to access the ith base in the jth sequence use the following syntax: sequences[j][i]
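One possible way to organize this computation is sketched below. It is not the only reasonable structure, and all of the function names (fraction_differing, jukes_cantor, pairwise_distances, scaled_split_time) and the toy sequences are invented for illustration. The last function assumes that substitutions accumulate at the same constant rate on every lineage, so that distance is proportional to the time since the split; that is an assumption of this sketch, not something the data guarantee.

```python
from math import log


def fraction_differing(seq_a, seq_b):
    """Fraction D of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / float(len(seq_a))


def jukes_cantor(D):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4D/3)."""
    if D >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for D >= 3/4")
    return -0.75 * log(1.0 - 4.0 * D / 3.0)


def pairwise_distances(sequences):
    """Map each index pair (j, k) with j < k to its Jukes-Cantor distance."""
    distances = {}
    for j in range(len(sequences)):
        for k in range(j + 1, len(sequences)):
            D = fraction_differing(sequences[j], sequences[k])
            distances[(j, k)] = jukes_cantor(D)
    return distances


def scaled_split_time(known_time, known_distance, other_distance):
    """Scale a known split time by a ratio of distances (constant-rate assumption)."""
    return known_time * other_distance / known_distance


# Made-up toy sequences, used only to exercise the functions.
toy = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTTA"]
toy_distances = pairwise_distances(toy)
```

With the real data, pairwise_distances(sequences) would be called on the list built by the loop above, and the indices in each key line up with the entries of descriptions.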
- Repeat the above exercise for the TRIM5 sequences in this file. This ".aln" file is in ClustalW format, named for a commonly used multiple sequence alignment program. To parse this file, modify a copy of your previous code as follows:
my_parser = SeqIO.parse(online_file,'clustal')
i.e. the parse format is 'clustal' instead of 'fasta'.
The only difference in your analysis is that you should not count gap mismatches - i.e., if the character is '-', ignore it and do not include that in the length of the sequence. For example, the two sequences:
ATTG-G-CT
ATTGCGCGT
would be considered to have 1 mismatch out of 7 bases, so D = 1/7 (NOT 1/9 or 3/9!).
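The gap rule can be sketched as a small change to the mismatch count: skip any column in which either sequence has a gap, and count only the remaining columns toward the length. The function name count_ungapped_mismatches is hypothetical, chosen for this sketch.

```python
def count_ungapped_mismatches(seq_a, seq_b, gap="-"):
    """Return (mismatches, compared) over columns where neither sequence has a gap."""
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # gapped columns count neither as mismatches nor toward the length
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches, compared


# The two sequences from the example above.
mismatches, compared = count_ungapped_mismatches("ATTG-G-CT", "ATTGCGCGT")
D = mismatches / float(compared)
```

For the example pair this gives 1 mismatch out of 7 compared columns, matching D = 1/7.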
- Finally, choose ONE of the following additional exercises:
- Repeat exercise 2 for a different gene. You do not have to use exactly the same species, but you should include at least: human, chimp, another primate, and the mouse, and have at least 5 species total. You will probably want to use the Clustal server linked to at the top of this page to generate an alignment.
- Repeat exercises 1 and 2 using the Kimura model instead of Jukes-Cantor (equation 4.13 on page 63 of our text). How does this affect the results?
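For the Kimura option, mismatches have to be split into transitions (A<->G or C<->T) and transversions (everything else). The sketch below uses the standard Kimura two-parameter distance, d = -(1/2)*log(1 - 2P - Q) - (1/4)*log(1 - 2Q), where P is the fraction of transitions and Q the fraction of transversions. It is assumed, but not verified here, that this is the same formula as equation 4.13, so check it against the text before relying on it. The function names are hypothetical, and skipping every column that contains a character other than A, C, G, or T is a choice made for this sketch.

```python
from math import log

PURINES = set("AG")
PYRIMIDINES = set("CT")
BASES = PURINES | PYRIMIDINES


def transition_transversion_fractions(seq_a, seq_b):
    """Return (P, Q): fractions of transitions and transversions among compared columns."""
    compared = 0
    transitions = 0
    transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption of this sketch: skip gaps and any other non-ACGT character.
        if a not in BASES or b not in BASES:
            continue
        compared += 1
        if a == b:
            continue
        same_class = (a in PURINES and b in PURINES) or (
            a in PYRIMIDINES and b in PYRIMIDINES
        )
        if same_class:
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable columns")
    return transitions / float(compared), transversions / float(compared)


def kimura_distance(P, Q):
    """Kimura two-parameter distance from transition (P) and transversion (Q) fractions."""
    return -0.5 * log(1.0 - 2.0 * P - Q) - 0.25 * log(1.0 - 2.0 * Q)


# Made-up toy pair: one transition (A/G) and one transversion (A/C) in four columns.
P, Q = transition_transversion_fractions("AAAA", "AGAC")
toy_kimura = kimura_distance(P, Q)
```

A useful sanity check: when transversions are exactly twice as frequent as transitions (Q = 2P), this formula reduces to the Jukes-Cantor distance with D = P + Q.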