Math 5233 lab assignment for Monday February 5th

Intro:
Today we will explore the BLAST algorithm using NCBI's online tools. We will also do a little more with Python to continue building familiarity with it. As usual with the labs, do not hesitate to ask me for help.
Each group should hand in a short summary of its findings (probably 1 or 2 pages). It is due on Friday, February 9th; however, there will be no penalty for handing it in on Monday the 12th.

Groups:
Group 1: Bethany, Eric, Maria. Topic: Cathepsin C (use preprocathepsin C for P. falciparum)
Group 2: Noland, Doug, Terrence. Topic: Dynein light chain 1
Group 3: Marissa, Alayna, Brad. Topic: Huntingtin interacting protein 2
Group 4: Lucas, Shane, Aneerudh, Amanda, Ajit. Topic: Ferlin (myoferlin in humans)

Find some background information about the assigned gene, using databases such as PubMed, OMIM, and Gene. (You may want to complete the other parts of the lab first; they are largely independent of this step.)

Find human and Plasmodium falciparum versions of your group's gene as nucleotide (preferably mRNA) and protein. Use NCBI's Nucleotide and Protein databases; if there are a lot of matches, limit to the RefSeq database. Download the FASTA format of each of these four sequences.

When searching on the NCBI site, it is helpful to limit your search by specifying fields such as gene [gene], organism [orgn], accession number [accn], or on PubMed by specifying a MeSH (medical subject heading) term or author [au]. It is also sometimes helpful to use the Boolean operators AND, OR, and NOT. For example, to search for Plasmodium falciparum genes that don't have the term 'putative' in their description, you could search for: Plasmodium falciparum[orgn] NOT putative. If you put two terms next to each other without a Boolean operator, they are treated as being combined with AND.
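If you find yourself typing many similar searches, the field tags above can be assembled in Python. This is only a sketch: `build_query` is a hypothetical helper written for this handout, not part of any NCBI tool, and it simply produces the text you would paste into the search box.

```python
def build_query(organism=None, gene=None, exclude=None):
    """Compose an NCBI-style search string from field-tagged terms.

    build_query is a hypothetical helper for illustration only.
    """
    parts = []
    if organism:
        parts.append(organism + "[orgn]")
    if gene:
        parts.append(gene + "[gene]")
    # Adjacent terms are ANDed implicitly; here the AND is spelled out.
    query = " AND ".join(parts)
    # Each excluded word is appended with the NOT operator.
    for word in (exclude or []):
        query += " NOT " + word
    return query.strip()


print(build_query(organism="Plasmodium falciparum", exclude=["putative"]))
```

The printed string matches the example search given above, so it can be pasted directly into the NCBI search box.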

Go to the NCBI BLAST page. Under "Special", choose the "Align two sequences (bl2seq)" link. Use bl2seq to align each pair of sequences (the nucleotide pair and the protein pair). How do these two alignments compare?

Use the protein BLAST (blastp) program at NCBI to find matches of the Plasmodium version of the protein sequence in the "nr" (non-redundant) database. Limit your search to (1) just vertebrates, then (2) just Viridiplantae (plants and algae). What were the top hits and their scores? Try changing the scoring matrix to something other than BLOSUM62. Does this affect the results?

Search the nucleotide nr database for matches to this sequence using blastn. Then try blastx (translated queries vs. the nr protein database). How did the results differ? Why?
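To get a feel for what "translated query" means, the following sketch translates a nucleotide string in all six reading frames (three offsets on each strand). It assumes the standard genetic code and is an illustration only, not NCBI's implementation; the names `translate` and `six_frames` are made up for this example.

```python
# Standard genetic code, with each codon position cycling through T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {}
index = 0
for b1 in BASES:
    for b2 in BASES:
        for b3 in BASES:
            CODON_TABLE[b1 + b2 + b3] = AMINO_ACIDS[index]
            index += 1

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'; unknown bases become N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(dna))


def translate(dna):
    """Translate complete codons only; unrecognized codons become 'X'."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


def six_frames(dna):
    """Return the six translations: offsets 0-2 forward, then 0-2 reverse."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


for frame in six_frames("ATGGCCTAA"):
    print(frame)
```

A '*' in the output marks a stop codon. Try it on the first few dozen bases of one of your mRNA sequences.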

Use Python to find the fraction of Cs and Gs in the nucleotide records for your gene. If PyCrust is installed, use it by finding it with Spotlight (upper right-hand corner of the desktop). To read a file, for example one named 'HumanG6PD.fasta' on the desktop, you would begin by typing
```
# open() works in Python 2 and 3; the older file() built-in exists only in Python 2
f = open('/Users/lab_user/Desktop/HumanG6PD.fasta', 'r')
```
If it is a FASTA file, the first line is a header that we don't want to count, so we read it off first:
```
f.readline()
```
Now to read the rest of the file into a string called 'bases' you could do:
```
bases = f.read()
```
Let's close the file now:
```
f.close()
```
We don't want to count extraneous characters such as newlines, so let's strip them out:
```
bases = bases.replace('\n','')
```
Finally we can find the number of total bases, the number of Cs and Gs, and compute the fraction:
```
total = len(bases)
CNumber = bases.count('C')
GNumber = bases.count('G')
CGfraction = float(CNumber + GNumber)/total
```
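In Python 2, dividing one integer by another with / discards the remainder, so without float() the fraction would be rounded down to 0. In Python 3, / always performs floating-point division, so the float() call is harmless but no longer required.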

Record the C+G percentage for each of your nucleotide sequences.
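Since you will repeat these steps for more than one file, it may be convenient to wrap them in a function. The version below is a sketch under a few assumptions: the name `gc_fraction` is made up for this handout, the file is assumed to hold a single FASTA record (bases from several records would be pooled together), and lowercase bases are counted as well.

```python
def gc_fraction(fasta_path):
    """Return the fraction of C and G bases in a single-record FASTA file."""
    pieces = []
    with open(fasta_path, "r") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the header line, which starts with '>'.
            if not line or line.startswith(">"):
                continue
            # Upper-case so that 'c' and 'g' are counted too.
            pieces.append(line.upper())
    sequence = "".join(pieces)
    if not sequence:
        raise ValueError("no sequence data found in " + fasta_path)
    gc_count = sequence.count("C") + sequence.count("G")
    # float() keeps the division exact in Python 2 as well as Python 3.
    return float(gc_count) / len(sequence)
```

For the example file used earlier, you could then type `print(100 * gc_fraction('/Users/lab_user/Desktop/HumanG6PD.fasta'))` to get the percentage directly.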