Due Wednesday, April 11th. Bring a hard copy of your final code and results, and be prepared to share your findings (very briefly) with the class.

Groups:
You will make a first attempt at a phylogenetic hypothesis by picking a representative protein or DNA sequence and using UPGMA (unweighted pair group method with arithmetic mean).
Note that choosing and aligning the sequences is largely independent of the programming part of this assignment, so these parts can mostly be done by different group members.

Feel free to ask me for help.
  1. Choose a DNA sequence with known protein orthologs in each of the five given species. Do not use a cytochrome oxidase or ribosomal protein. You want to choose something with a moderate level of variability between the species - something like a histone might not vary enough, whereas an immunoglobulin might vary too much. Start with the least-studied species of the five. Explain your choice. It may be helpful to use NCBI's Taxonomy database, with some of the optional items checked such as showing the number of Nucleotide and Protein sequences available at each taxonomic level.
  2. Get a multiple alignment of the five DNA sequences. The ClustalW server at EMBL-EBI is one recommended option. To get better results for your phylogeny, you may wish to trim some sequences to a well-aligned region - often some sequences are much longer than others, and the resulting long gaps can distort both the measures of similarity and the quality of the multiple alignment.
  3. Write a program that outputs the UPGMA tree in Newick format from an alignment file. The file here will get you started - it contains a function that outputs a distance matrix from an alignment file, and has other hints; improved version as of 4/3/7. Here is an example DNA .aln file you can use as practice (it has 6 species).
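If the clustering step in item 3 is unclear, the following is a minimal Python sketch of UPGMA, not the starter file and not a complete solution. It assumes the sequences are already aligned and held in a dictionary, and it uses a simple p-distance (fraction of mismatched ungapped columns) as a stand-in for whatever distance your own matrix function produces. The names p_distance, distance_table and upgma are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatched columns, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_table(seqs):
    """Map frozenset({name_i, name_j}) -> p-distance for every pair of aligned sequences."""
    return {frozenset((m, n)): p_distance(seqs[m], seqs[n])
            for m, n in combinations(seqs, 2)}


def upgma(names, dist):
    """Cluster by UPGMA and return the tree as a Newick string.

    names: list of leaf labels
    dist:  dict mapping frozenset({a, b}) -> distance (assumed format)
    """
    # Each cluster is keyed by its Newick text and stores (size, height).
    clusters = {n: (1, 0.0) for n in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters (first one wins on ties).
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (size_a, h_a), (size_b, h_b) = clusters[a], clusters[b]
        height = d[frozenset((a, b))] / 2  # the new node sits halfway between them
        merged = f"({a}:{height - h_a:.4f},{b}:{height - h_b:.4f})"
        del clusters[a], clusters[b]
        # Distance to every other cluster is the size-weighted average,
        # which equals the mean over all leaf pairs.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = (size_a + size_b, height)
    (newick,) = clusters  # exactly one cluster remains
    return newick + ";"


if __name__ == "__main__":
    # Toy distances, made up purely to exercise the code.
    toy = {
        frozenset(("A", "B")): 2.0,
        frozenset(("A", "C")): 4.0,
        frozenset(("B", "C")): 4.0,
    }
    print(upgma(["A", "B", "C"], toy))
```

In the toy example A and B are joined first at height 1, and C joins them at height 2, so every leaf ends up the same distance from the root. That is the molecular-clock assumption built into UPGMA, and it is worth keeping in mind when you interpret your tree.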