

Basic steps for building a phylogenetic tree

A QCBS workshop given by Annie Archambault, to the QCBS students and researchers

Sign up on this page of the QCBS website.

About the workshop

The target audience is researchers and students who will have to build a phylogenetic tree for their coming projects, but who don't quite know which program or which parameter to use.
The workshop will be held Monday, September 16, 2013, from 1:00 PM to 4:00 PM, and continues on the second day, September 17, from 9:30 AM to 12:30 PM.
The computer room is located at Université Laval, Pavillon Paul Comtois, room 3434-E.
The slides are in English, but the presentation will be in French.
No food or drink is allowed in the computer room.

To use your own computer

Students from UdeM, UQAM and Concordia will be able to connect to the internet through Eduroam, with their usual home university connection.
If you plan on using your own computer, simply download the following programs:
- JalView, using the “Launch Jalview Desktop” button that relies on Java Web Start; or BioEdit; or SuiteMSA http://bioinfolab.unl.edu/~canderson/SuiteMSA/
- JModelTest2 https://code.google.com/p/jmodeltest2/
- ClustalX http://www.clustal.org/download/current/
- Muscle http://www.drive5.com/muscle/
- SATe http://phylo.bio.ku.edu/software/sate/sate.html (use the command-line version)
- PRANK http://code.google.com/p/prank-msa/
- FSA http://sourceforge.net/projects/fsa/
- MEGA 5.2 http://www.megasoftware.net/
- raxmlGUI http://sourceforge.net/projects/raxmlgui/files/?source=navbar
- Garli https://code.google.com/p/garli/
- BEAST http://beast2.cs.auckland.ac.nz/index.php/Main_Page
- FigTree http://tree.bio.ed.ac.uk/software/figtree/

Guide to install the programs

I have been able to install the following programs on Mac OS X 10.6.8 (64-bit). I have not tried any other operating system. The exact commands I used are given here, but you may need to adapt them to your own operating system.

JalView Waterhouse, A.M., et al. 2009. Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 25: 1189–1191.
Here, this program is used only to view sequence alignments created by the command-line programs.
Click the “Launch Jalview Desktop” button in the top right corner, and click “Open with Java Web Start”. Allow the application to access your computer.
Close the example windows shown, then click File/Input alignment/From file.

ClustalX Larkin, M.A., G. Blackshields, N.P. Brown, R. Chenna, P.A. McGettigan, H. McWilliam, F. Valentin, et al. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947–2948.
At http://www.clustal.org/ , follow the links to ClustalX. Load your sequences in fasta format.
Choose alignment parameters by selecting: Alignment/Alignment parameters/Multiple Alignment Parameters/
Align the sequences by selecting: Alignment/Do complete alignment/ . You can choose a name for the output file. After the alignment is done, you can save the aligned sequences in the fasta format or the phylip format by selecting File/Save sequences as.

PRANK Löytynoja, A., and N. Goldman. 2010. webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics 11: 579.
Download the program from http://code.google.com/p/prank-msa/downloads/list . The README is here: http://code.google.com/p/prank-msa/wiki/PRANK?tm=6
In the terminal, change directory to where the compressed file is located. Uncompress by typing:
tar -zxvf prank.osx_108.130410.tgz (or the version you downloaded for your own system)
Change directory to where the bin file is by typing:
cd prank/bin
Many options can be applied, but the minimum command to run an alignment with PRANK is:
./prank sequencefilefullpath
I had success by typing:
sudo ./prank -d=/Users/anniearchambault/Documents/learning_teaching/Alignment_and_phylogeny_building/Protea_Faurea_ITS_trnL_75seq.fasta
The output file will be named output.best.fas. Open it in any sequence viewer. My favorite one is Geneious, but it is commercial; JalView is free.

In Windows, type:
prank.exe sequencefilefullpath
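
Before opening an output file such as output.best.fas in a viewer, it can help to confirm that it really is an alignment, i.e. that every sequence has the same length once gaps are included. The following is a minimal Python sketch of that check; the function names and the two-sequence example are illustrative assumptions, not part of PRANK.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping sequence name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect them and join later.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def is_aligned(records):
    """Return True if there is at least one sequence and all have equal length."""
    lengths = {len(seq) for seq in records.values()}
    return len(lengths) == 1


# Hypothetical two-sequence alignment, used only to illustrate the check.
example = """>taxonA
ACGT-
ACGT
>taxonB
ACGTTACGT
"""
records = read_fasta(example)
print(len(records), "sequences; aligned:", is_aligned(records))
```

To check a real file, read it first, for example with `read_fasta(open("output.best.fas").read())`.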

SATe Liu, K., T.J. Warnow, M.T. Holder, S.M. Nelesen, J. Yu, A.P. Stamatakis, and C.R. Linder. 2012. SATé-II: Very Fast and Accurate Simultaneous Estimation of Multiple Sequence Alignments and Phylogenetic Trees. Systematic Biology 61: 90–106.
There is a nice GUI for SATe, but I have been unable to make it work on a Mac, so I used the command-line version instead. This first requires installing setuptools: https://pypi.python.org/pypi/setuptools/0.6c11
Download the file that corresponds to your version of Python. Then, at the terminal, change directory to where the file is located, and type:
sudo sh setuptools-0.6c11-py2.6.egg
sudo easy_install -U dendropy

The SATe homepage is here: http://phylo.bio.ku.edu/software/sate/sate.html
Download the SATé source code (not the .dmg) for your system from this site: http://phylo.bio.ku.edu/software/sate/downloads2/ . At the terminal, uncompress and install by typing:
tar -zxvf satesrc-v2.2.7-2013Feb15.tar.gz
cd satesrc-v2.2.7-2013Feb15/sate-core/
sudo python setup.py develop
To test SATe, change directory to satesrc-v2.2.7-2013Feb15/sate-core and type:
python run_sate.py -i data/small.fasta -t data/small.tree -j test --auto
I aligned the “Protea_Faurea_ITS_trnL_75seq.fasta” file by typing:
python run_sate.py -i /Users/anniearchambault/Documents/learning_teaching/Alignment_and_phylogeny_building/Protea_Faurea_ITS_trnL_75seq.fasta -j test --auto

Muscle Edgar, R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32: 1792–1797.
Download the program at http://www.drive5.com/muscle/ . At the terminal, type:
tar -zxvf muscle3.8.31_i86darwin64.tar.gz
chmod +x muscle3.8.31_i86darwin64
Try to align:
./muscle3.8.31_i86darwin64 -in /Users/anniearchambault/Documents/learning_teaching/Alignment_and_phylogeny_building/Protea_Faurea_ITS_trnL_75seq.fasta -out /Users/anniearchambault/Documents/learning_teaching/Alignment_and_phylogeny_building/seqsmuscleout.afa
or, if the executable is available on your PATH under the name muscle:
muscle -in /Users/anniearchambault/Documents/learning_teaching/Alignment_and_phylogeny_building/Protea_Faurea_ITS_trnL_75seq.fasta -out /Users/anniearchambault/Documents/learning_teaching/Alignment_and_phylogeny_building/seqsmuscleout.afa

To quickly see the alignment in the terminal, type:
./muscle3.8.31_i86darwin64 -in /Users/anniearchambault/Documents/learning_teaching/Alignment_and_phylogeny_building/seqsmuscleout.afa -clw
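
When several files have to be aligned, the same command can be assembled from a script instead of being typed each time. The sketch below only uses the -in and -out flags shown above; the helper names and the short relative file names are assumptions made for the illustration, and `run_muscle` will only work if the MUSCLE executable is actually present at the given location.

```python
import subprocess


def muscle_command(infile, outfile, executable="muscle"):
    """Build the MUSCLE argument list, using the -in/-out flags shown above."""
    return [executable, "-in", str(infile), "-out", str(outfile)]


def run_muscle(infile, outfile, executable="muscle"):
    """Run MUSCLE and raise an error if it exits with a non-zero status."""
    subprocess.run(muscle_command(infile, outfile, executable), check=True)


# Only build and display the command here; nothing is executed.
cmd = muscle_command(
    "Protea_Faurea_ITS_trnL_75seq.fasta",
    "seqsmuscleout.afa",
    executable="./muscle3.8.31_i86darwin64",
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids problems with spaces in folder names, which are common in paths under Documents.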

FSA Bradley, R.K., A. Roberts, M. Smoot, S. Juvekar, J. Do, C. Dewey, I. Holmes, and L. Pachter. 2009. Fast Statistical Alignment. PLoS Comput Biol 5: e1000392.
Download the program at http://sourceforge.net/projects/fsa/ . Uncompress and configure by typing at the terminal:
tar -xvzf fsa-1.15.7.tar.gz
cd fsa-1.15.7
./configure --with-mummer --with-exonerate
sudo make
sudo make install
fsa /Users/anniearchambault/Documents/learning_teaching/Alignment_and_phylogeny_building/Protea_Faurea_ITS_trnL_75seq.fasta >/Users/anniearchambault/Documents/learning_teaching/Alignment_and_phylogeny_building/Protea_Faurea_ITS_trnL_75seq.mfa

This is the only program for which I have been unable to complete the installation. I used the web server instead: http://orangutan.math.berkeley.edu/fsa/
Load your sequences, align, and save the fsa.mfa file to your computer.

jModelTest Darriba, D., G.L. Taboada, R. Doallo, and D. Posada. 2012. jModelTest 2: more models, new heuristics and parallel computing. Nature Methods 9: 772–772.
Download the program at https://code.google.com/p/jmodeltest2/downloads/list . Install, and open.
Load the sequence alignment file, and calculate the likelihood for all models (this will take a long time).

Calculate which model is best for your data by choosing Analysis/Do AIC calculations or Analysis/Do BIC calculations.
To see the results, scroll up in the output window.
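
The two criteria can be written out explicitly. With ln L the maximized log-likelihood of a model, k its number of free parameters, and n the sample size (here taken to be the number of sites in the alignment), AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, and the model with the lowest value is preferred. The sketch below applies these formulas to made-up numbers; the log-likelihoods, parameter counts and alignment length are hypothetical and are not jModelTest output.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical values: model name -> (ln L, number of free parameters).
models = {
    "JC": (-2500.0, 0),
    "HKY": (-2450.0, 4),
    "GTR+G": (-2440.0, 9),
}
n_sites = 1000  # hypothetical alignment length

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(*models[m], n_sites))
print("Best model by AIC:", best_aic)
print("Best model by BIC:", best_bic)
```

With these made-up numbers the two criteria disagree, because BIC penalizes each extra parameter more heavily than AIC whenever ln(n) is larger than 2.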

MEGA Tamura, K., D. Peterson, N. Peterson, G. Stecher, M. Nei, and S. Kumar. 2011. MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Molecular Biology and Evolution 28: 2731–2739.
Download MEGA from http://www.megasoftware.net/ . Uncompress by double-clicking, and then open the program.
First convert the .nex or the .fasta files to the .meg format, then open the newly created .meg file.
Make taxa groups or sequence partitions if relevant for your data.
MEGA is one of the few free programs to offer parsimony analyses. To do parsimony, select Phylogeny/Construct/Test Maximum Parsimony tree. Choose among the “MP search method” options:

Max-mini branch-and-bound: guaranteed to find all the MP trees, but too time-consuming for more than 15 sequences.
Min-mini heuristic: a branch-swapping heuristic search (Close-Neighbor-Interchange) that begins with an initial tree given by the Min-mini algorithm. It is faster than branch-and-bound, and is usually the selected option. SPR (Subtree-Pruning-Regrafting) examines more trees than Close-Neighbor-Interchange and takes longer to complete.
Search level: higher numbers start a more complex search.
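
All of these search methods compare candidate trees by their parsimony score, that is, the minimum number of character changes a tree requires. As an illustration of what is being minimized, here is a minimal sketch of Fitch's algorithm for a single site on a rooted binary tree. It is not MEGA's implementation; the nested-tuple tree representation and the four-taxon example are assumptions made for the illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one site.

    `tree` is either a taxon name (a leaf) or a pair (left, right) of subtrees;
    `states` maps each taxon name to its character state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra change is needed here.
        return common, left_changes + right_changes
    # The children disagree: keep both possibilities and count one change.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_changes(tree, states):
    """Minimum number of changes required by `tree` for one site."""
    return fitch(tree, states)[1]


# Hypothetical site pattern for four taxa.
site = {"A": "A", "B": "A", "C": "G", "D": "G"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_changes(tree_1, site), parsimony_changes(tree_2, site))
```

For this site the first tree needs fewer changes than the second, so it is the more parsimonious of the two; a full analysis sums such counts over all sites of the alignment.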

Garli Zwickl, D. J., 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation, The University of Texas at Austin.
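Download the program from here: http://garli.googlecode.com . At the terminal, uncompress the archive by typing:
tar -xvzf Garli-2.0-anyOSX.tar.gz (this archive name is assumed from the folder name used below; use the name of the file you downloaded)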
Download the “cheat sheet” from here http://www.molecularevolution.org/software/phylogenetics/GARLI or the complete manual from here http://www.nescent.org/informatics/download.php?software_id=4 .

Place your sequence alignment (nexus, phylip, fasta) in the same directory as the garli.conf file you will use; and copy the program Garli-2.0 (located here Applications/Garli-2.0-anyOSX/bin/Garli-2.0 ) also in that same directory.
One configuration file is given as an example in the folder /Garli-2.0-anyOSX/example/basic/garli.conf . Modify the .conf file to suit your needs; this may take some time. Default options will work well for most sequence alignments. A few options (e.g. filename, datatype, etc.) really depend on your data set. As mentioned in the “cheat sheet”, the settings for the four options listed below will call an intensive search, which will take a long time on large datasets. To reduce the search time, reduce the number for each of the options.
attachmentspertaxon = 50
genthreshfortopoterm = 20,000 to 100,000
numberofprecreductions = 20 to 40
treerejectionthreshold = 100

To run Garli, in the terminal, change directory to where your three files are (Garli program, .conf file, alignment file), and type:
sudo ./Garli-2.0
or
./Garli-2.0
or simply
Garli

In Windows, double click on the executable.

For the exercise, create a new folder and copy into it the Garli program, the ProteaFaurea_trnL.nex file, and the garli.conf file located in /Applications/Garli-2.0-anyOSX/example/basic. Open garli.conf in a text editor, and change “datafname = rana.nex” to “datafname = ProteaFaurea_trnL.nex”, and “ofprefix = rana.nuc.GTRIG” to “ofprefix = Protea_Faurea.nuc.GTRIG”. Run Garli!
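
The two changes to garli.conf can also be made with a short script, which is convenient when the same configuration is reused for several alignments. This is a minimal sketch that assumes each option sits on its own line in the form "name = value", as in the two lines quoted above; the `set_option` helper and the three-line excerpt are illustrative, not the full example file.

```python
import re


def set_option(conf_text, key, value):
    """Return conf_text with the value of the `key = ...` line replaced."""
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r"[ \t]*=[ \t]*).*$", re.MULTILINE
    )
    # A function is used as replacement so `value` is inserted literally.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, conf_text)
    if count == 0:
        raise KeyError(f"option {key!r} not found")
    return new_text


# Hypothetical excerpt containing only the two options changed in the exercise.
excerpt = "[general]\ndatafname = rana.nex\nofprefix = rana.nuc.GTRIG\n"

edited = set_option(excerpt, "datafname", "ProteaFaurea_trnL.nex")
edited = set_option(edited, "ofprefix", "Protea_Faurea.nuc.GTRIG")
print(edited)
```

For a real run, read the contents of garli.conf into a string, apply `set_option`, and write the result back to the file. All other lines are left untouched.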

RAxML Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688–2690.

The most up-to-date version is on this page: https://github.com/stamatak/standard-RAxML , while a list of other RAxML flavors and versions is here: http://www.exelixis-lab.org/ . For this workshop, we will work with raxmlGUI, which has a user-friendly graphical interface.
Download raxmlGUI from http://sourceforge.net/projects/raxmlgui/ , uncompress and open.
Make sure your sequence alignment was saved in the Phylip format and load it in raxmlGUI. Choose the desired options, and start the search.
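
If an alignment is only available in fasta format, it can be converted to Phylip with a few lines of code. The sketch below writes the relaxed sequential variant (a header line with the number of sequences and the number of sites, then one "name sequence" line per taxon); the `to_phylip` function and the two-sequence input are assumptions made for the illustration, and some programs expect the strict variant in which names are limited to 10 characters.

```python
def to_phylip(records):
    """Format aligned sequences (dict name -> sequence) as relaxed PHYLIP text."""
    if not records:
        raise ValueError("no sequences given")
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; align them first")
    n_sites = lengths.pop()
    # Whitespace separates name from sequence, so remove it from names.
    names = ["_".join(name.split()) for name in records]
    width = max(len(name) for name in names) + 2
    lines = [f"{len(records)} {n_sites}"]
    for name, seq in zip(names, records.values()):
        lines.append(name.ljust(width) + seq)
    return "\n".join(lines) + "\n"


# Hypothetical two-sequence alignment.
alignment = {"taxonA": "ACGT-A", "taxonB": "ACGTTA"}
phylip_text = to_phylip(alignment)
print(phylip_text)
```

The function refuses unaligned input on purpose, since a Phylip file declares a single number of sites for all sequences.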

RAxML offers a choice among many substitution models, e.g. GTR, BIN, MULTI, or PROT, with GAMMA[I] or CAT[I] rate heterogeneity, or without rate heterogeneity. However, the typical motivation for using RAxML is to take advantage of the CAT rate heterogeneity, which can speed up analyses of thousands of sequences. In the CAT option, individual per-site substitution rates are classified into categories. As explained in the README file, it is a computational work-around for the computationally intensive GTR model with GAMMA rate heterogeneity, but it is an approximation rather than a genuine model.

BEAST2.0 Drummond, A.J., M.A. Suchard, D. Xie, and A. Rambaut. 2012. Bayesian Phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution 29: 1969–1973.

Descriptions and tutorials are available on this page http://beast2.cs.auckland.ac.nz/index.php/Main_Page and the program itself can be downloaded from here: http://code.google.com/p/beast2/

When using BEAST, the biggest part of the work for the user is choosing among the many possible options and priors. As written on the BEAST2.0 main page, the strength of Bayesian phylogenetic methods such as BEAST is that they also provide a framework for testing evolutionary hypotheses without conditioning on a single tree topology.

A detailed description of how to use BEAST2.0 is given on the Example using BEAST2.0 page of the QCBS wiki. To begin, download the two files for the exercise, and rename each with the .nex extension instead of the .txt extension.
http://qcbs.ca/wp-content/uploads/2013/06/ProteaFaurea_ITS.txt
http://qcbs.ca/wp-content/uploads/2013/06/ProteaFaurea_trnL.txt

Download exercise files

Download each of the following files (right-click, save link as), and rename each with the .fasta extension instead of the .txt extension.

http://qcbs.ca/wp-content/uploads/2013/06/Protea_Faurea_ITS_trnL_75seq.txt
http://qcbs.ca/wp-content/uploads/2013/06/PR10_Fabaceae_11seqs.txt
http://qcbs.ca/wp-content/uploads/2013/06/Oxytropis_ITSsequences_84seq.txt
http://qcbs.ca/wp-content/uploads/2013/09/Fungal_ITS_ref_sequence_nl.fasta_.txt
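
Renaming the downloaded files by hand works, but it can also be scripted. The sketch below assumes the files were saved together in one folder; the function names are illustrative, and `rename_downloads` skips any file whose new name already exists rather than overwriting it.

```python
from pathlib import Path


def with_extension(filename, new_suffix=".fasta"):
    """Return the file name with its last extension replaced by new_suffix."""
    return Path(filename).with_suffix(new_suffix).name


def rename_downloads(directory, new_suffix=".fasta"):
    """Rename every .txt file in `directory`; return the list of new paths."""
    renamed = []
    for path in sorted(Path(directory).glob("*.txt")):
        target = path.with_suffix(new_suffix)
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append(target)
    return renamed


print(with_extension("Protea_Faurea_ITS_trnL_75seq.txt"))
print(with_extension("ProteaFaurea_ITS.txt", ".nex"))
```

Calling `rename_downloads` on the download folder renames all the .txt files it contains, so the .nex files for the BEAST exercise should be kept in a separate folder and renamed with `new_suffix=".nex"`.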

Files of already aligned sequences - for cheaters!