Appendices: Probability, File Formats, Complexity notation
These appendices provide some reference material that may be useful in understanding this text. If you are unfamiliar with the probability and statistics presented here, it may be very helpful to look at the text mentioned in the introduction: Modern Statistics for Modern Biology, by Susan Holmes and Wolfgang Huber.
Appendix: Useful probability distributions
Binomial and geometric distributions
The binomial distribution gives the probability of m successes out of n trials in which each success occurs with probability p:
For example, the chance of getting exactly 5 heads after flipping a fair coin (so p=0.5) ten times is P(5,10) = \dfrac{10!}{5!5!} (0.5)^5 (0.5)^5 \approx 0.246.
The geometric distribution gives the probability of the first occurence of a success in a series of n trials - that is, the probability of n-1 failures followed by a success. As with the binomial distribution, each trial is independent and is successful with probability p:
The mean of this distribution is \dfrac{1}{p}, and its variance is \dfrac{1 - p}{p^2}.
As an example, if we achieve a success by rolling a 6 with a six-sided die, the expected number of rolls needed will be 6. The probability of needing exactly six rolls is G(6) = (1-\dfrac{1}{6})^5 \dfrac{1}{6} \approx 0.067. (Note that P(1) = \dfrac{1}{6} \approx 0.167, more likely than the probability of the mean itself.)
Poisson and exponential distributions
The Poisson distribution is: $$ P(k;\lambda) = \dfrac{\lambda^k e^{-\lambda}}{k!}. $$
The exponential distribution gives the probability of a time interval x between two events from a Poisson process:
Negative binomial distribution
The negative binomial distribution is a generalization of the geometric distribution, in which we wait for m successes to occur in a series of Bernoulli trials, which have success probability p and failure probability 1-p. (Note that this means that the last event is always a success.) Then the probability of this taking n trials is
This distribution has mean \dfrac{m}{p} and variance \dfrac{m (1-p)}{p^2}. Note that other texts sometimes define this distribution somewhat differently, for example by defining it to be the probability of the number of failures rather than the total number of trials, or by fixing the number of failures instead of successes.
Appendix: File Formats
FASTA and FASTQ
The FASTA format is a very simple file format for sequence information. It was originally developed as a format for a particular software program (also called FASTA), but is now the most standard human-readable format for sequences. Sequences are delimited by description lines which must start with the character >, followed by the sequence with no additional information. Although the description line can be abused to add other metadata, in general this format is not suitable for retaining complex information such as alignment data.
Here are some simple (short) examples of FASTA records of a DNA and a protein sequence. They have line-breaks in the sequence after 60 characters, but this is an arbitrary choice chosen for readability.
>Example - a simple FASTA sequence example
CACACGTAGGGAACCGAAGTTAAAGAGGTGGTAAACAGCGAGTCAACACGAGTTGCGGCG
ACCCGTTGATCTGGTGCCGGGCGGAAAATTCGTTAATGTGCAGTTTCGAACAGATCAAAG
>gi|386828|gb|AAA59172.1| insulin [Homo sapiens]
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
FASTQ extends the FASTA format to allow for quality scores for each letter of sequence (usually nucleotide). It uses four lines per sequence:
-
A description line which starts with the character @.
-
The sequence itself.
-
A line beginning with + followed by additional descriptive information about the sequence
-
A line containing as many characters as the sequence, which represent quality scores. The quality scores are usually encoded by the ACSII ranking, in other words the order is, from high to low:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVW XYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~}
Here is an example of a single entry of a FASTQ file:
@Example sequence ID
CACACGTAGGGAACCGAAGTTAAAGAGGTGGTAAACAGCGAGTCAACACGAGTTGCGGCG
+
-A9FR0?Y7<AHT67CN=-;38/HSWN;X4/14IK/X2A9TSADCD1K97;.55W3VFL9
SAM and BAM
SAM is an acronym for Sequence Alignment/Map format. It is most suitable for alignments of sequencing reads to a reference which is not too different from the mapped reads. BAM is simply the binary version of this format. The formal specification for the format can be found at https://samtools.github.io/hts-specs/.
Currently Biopython does not have extensive support for the SAM/BAM formats, but there is a python module called pysam which wraps the standard commandline utility, samtools.
Genbank
The Genbank format, more formally the DDBJ/EMBL/GenBank Feature Table format, is quite flexible and used for a variety of more descriptive records. It is most commonly used for primary DNA, RNA, and protein sequence data. The full specification can be found at http://www.insdc.org/files/feature_table.html. The Biopython SeqIO module can read and parse the features of Genbank files.
Here is a short example of an mRNA database entry from NCBI in the Genbank format:
LOCUS NM_001040071 651 bp mRNA linear PRI 25-AUG-2012
DEFINITION Homo sapiens seven transmembrane helix receptor (LOC650293), mRNA.
ACCESSION NM_001040071 XM_939379
VERSION NM_001040071.1 GI:91206427
KEYWORDS RefSeq.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 651)
AUTHORS Rose,J.E., Behm,F.M., Drgon,T., Johnson,C. and Uhl,G.R.
TITLE Personalized smoking cessation: interactions between nicotine dose,
dependence and quit-success genotype score
JOURNAL Mol. Med. 16 (7-8), 247-253 (2010)
PUBMED 20379614
REMARK GeneRIF: Clinical trial of gene-disease association and
gene-environment interaction. (HuGE Navigator)
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from AB065698.1.
On Apr 8, 2006 this sequence version replaced gi:88978953.
FEATURES Location/Qualifiers
source 1..651
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="4"
/map="4p16.1"
gene 1..651
/gene="LOC650293"
/note="seven transmembrane helix receptor"
/db_xref="GeneID:650293"
CDS 1..651
/gene="LOC650293"
/codon_start=1
/product="seven transmembrane helix receptor precursor"
/protein_id="NP_001035160.1"
/db_xref="GI:91206428"
/db_xref="GeneID:650293"
/translation="MYLVTVLRNLLSILAVSSDSPLHTPMYFFLSNLCWADIGFTSAM
VPKMIVDMQSHSRVISHEGCLTQMFFLVLFACIEGMILTVMAYDCFVAICRPLNYPVI
VNPHLCVFFILMSFFLSLLDSQLHSWIVLQFTIIKNVEISNFVCDPSQLLKLACSDSV
INSIFTYFHSTMFAFLPISAILLSYYKIVTSILRISSSDGKYKAFSTCDSHLAVVC"
sig_peptide 1..54
/gene="LOC650293"
/inference="COORDINATES: ab initio prediction:SignalP:4.0"
exon 1..651
/gene="LOC650293"
/inference="alignment:Splign:1.39.8"
ORIGIN
1 atgtatctgg tcacggtgct gaggaacctc ctcagtatcc tggctgtcag ctctgactcc
61 cccctccaca cccccatgta cttcttcctc tccaacctgt gctgggctga catcggtttc
121 acctcggcca tggttcccaa gatgattgtg gacatgcagt cgcatagcag agtcatctct
181 catgagggct gcctgacaca gatgtttttc ttggtccttt ttgcatgtat agaaggcatg
241 atcctgactg tgatggccta tgactgcttt gtagccatct gtcgccctct gaattaccca
301 gtcatcgtga atcctcacct ctgtgtcttc ttcattttga tgtccttttt ccttagcctg
361 ttggattccc agctgcacag ttggattgtg ttacaattca caatcatcaa gaatgtggaa
421 atctctaatt ttgtctgtga cccctctcaa cttctcaaac ttgcctgttc tgacagcgtc
481 atcaatagca tattcacata tttccatagt actatgtttg cttttcttcc catttcagca
541 atccttttat cttactataa aatcgtcacc tccattctca ggatttcatc ttcagatggg
601 aagtataaag ccttctccac ctgtgactct cacctagcag ttgtttgctg a
//
Alignment formats: Clustal and Phylip
The Clustal series of programs introduced one of the most common alignment file formats; there are some variations in this format, such as whether or not the alignment site counts are included or not. Here is an example (using sequences of insulin precursors from a fish, a frog, a bird, and a mammal) of the numbered Clustal Omega format:
CLUSTAL O(1.2.4) multiple sequence alignment
NP_571131.1 MAVWLQAGALLVLLVVSSVSTN-PGTPQHLCGSHLVDALYLVCGPTGFFYNPKRDVEPLL 59
XP_002920166.1 MALWTRLLPLLALLAVWAPVPARTFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVED 60
NP_001079350.1 MALWMQCLPLVLVLLFSTPNTE-ALANQHLCGSHLVEALYLVCGDRGFFYYPKIKRDIEQ 59
NP_990553.1 MALWIRSLPLLALLVFSGPGTSYAAANQHLCGSHLVEALYLVCGERGFFYSPKARRDVEQ 60
**:* : *: :* . . *********:******* **** ** :
NP_571131.1 GFLPPKSAQETEVADFAFKDHAELIRKRGIVEQCCHKPCSIFELQNYCN 108
XP_002920166.1 LPAGDAELDRVPGADPQPRALAGALQRRGIVEQCCTSICSLYQLENYCN 109
NP_001079350.1 AQVNGPQDNELDGMQFQPQE--YQKMKRGIVEQCCHSTCSLFQLENYCN 106
NP_990553.1 PLVSSPLRGEAGVLPFQQEE--YEKVKRGIVEQCCHNTCSLYQLENYCN 107
. . :******** . **:::*:****
The Phylip phylogenetic program suite created by Joseph Felsenstein uses its own format, which has the somewhat annoying requirement of only allowing 10 characters for each sequence name. Here is the same alignment of insulin sequences in Phylip format:
4 109
NP_571131. MAVWLQAGAL LVLLVVSSVS TN-PGTPQHL CGSHLVDALY LVCGPTGFFY
XP_0029201 MALWTRLLPL LALLAVWAPV PARTFVNQHL CGSHLVEALY LVCGERGFFY
NP_0010793 MALWMQCLPL VLVLLFSTPN TE-ALANQHL CGSHLVEALY LVCGDRGFFY
NP_990553. MALWIRSLPL LALLVFSGPG TSYAAANQHL CGSHLVEALY LVCGERGFFY
NPKRDVEPLL GFLPPKSAQE TEVADFAFKD HAELIRKRGI VEQCCHKPCS
TPKARREVED LPAGDAELDR VPGADPQPRA LAGALQRRGI VEQCCTSICS
YPKIKRDIEQ AQVNGPQDNE LDGMQFQPQE --YQKMKRGI VEQCCHSTCS
SPKARRDVEQ PLVSSPLRGE AGVLPFQQEE --YEKVKRGI VEQCCHNTCS
IFELQNYCN
LYQLENYCN
LFQLENYCN
LYQLENYCN
Phylogeny formats: Newick, Nexus, NeXML, PhyloXML
The Nexus and Newick formats are used to describe trees (in the graph theory sense of acyclic connected graphs).
The Newick format was created in 1986 by many of the pioneers of molecular phylogenetics; it is named after the restaurant (Newick's) in Dover, New Hampshire, where they met.
The Nexus format was officially described in 1997 although it had been under development for a decade beforehand. It is a much broader and extendable data format than Newick - internally trees are stored in Newick format, along with many other possible types of data and metadata such as sequence alignments . Newick and Nexus formats are analogous to FASTA and GenBank formats with corresponding pros and cons of size, complexity, and flexibility.
Although the Nexus format can contain a variety of information, its format made some types of additions awkward. The more flexible and grammatically precise, but also somewhat more cumbersome format of PhyloXMLwas developed to address these shortcomings. Because it is a well-defined XML language format it can take advantage of many existing XML parsing and processing programming libraries. Similarly, NeXML is another XML-based extension developed around the same time as PhyloXML.
Appendix: Complexity Notation (big O)
There is an extensive notation for efficiently describing the behavior of functions, which is often used to characterize the complexity of algorithms. It is sometimes called "Big-O" notation, after the most commonly used piece of it, but there is quite a wide variety of refinements (see the Wikipedia article for a more extensive description).
A function f(x) is O(g(x)) if it can eventually (for large values of x) be bounded by a fixed multiple of g(x), i.e. if there exists a x_0 such that
for all x > x_0.
Example of big-O notation
For example, the function f(x) = 2x^3 + x + 1 is O(x^3), where in the definition we could choose any M > 2.