From Soil to Sequence Submission - A guide to amplicon sequencing experiments

A guide prepare by Terrence Bell, postdoctoral fellow at projet GenoRem, Institut de recherche en biologie végétale

Terminology note: the sequence data output by massively parallel sequencing instruments (Roche 454, Illumina HiSeq, IonTorrent…) are commonly termed “reads”. In this guide, “sequences” and “reads” are synonymous.

1. Have a real hypothesis/question

  • Why did you sequence the samples you did?
  • What do you hope to find? And don’t say ‘the diversity of the samples’. What would actually be an interesting result?
  • Do you have enough replication to statistically probe the dataset?
  • Are you asking an ecological question (e.g. testing the biodiversity-function correlation, testing the effects of disturbance on diversity etc.)?
  • Are you asking an applied question (e.g. which organisms are associated with efficient biodegradation of petrochemicals)?
  • Are you asking a question at all?

2. Initial Steps

DNA/RNA extraction

To make things easy, use MoBio distributed by VWR in Canada. It’s well-tested, and results compare easily to most of the literature. You can even do dual extractions of DNA/RNA if you need both.

If you have a ton of samples and money is a concern, I have also used the following protocol successfully:

Yergeau E, Bokhorst S, Huiskes AHL, Boschker HTS, Aerts R, Kowalchuk GA. (2007). Size and structure of bacterial, fungal and nematode communities along an Antarctic environmental gradient. FEMS Microbiol Ecol 59: 436-451.

Sample coding

An underrated but important step. Could someone easily identify your samples without you? Is it clear which samples are your DNA/RNA extracts, amplicons, purified amplicons, and amplicon pools? Minimally, your code should indicate the sample, project and MID # (for amplicons). It is also a good idea to put your initials and date on the side of each tube, not just the box.

You should also have well laid out Excel sheets that link each sample to a unique MID identifier, and samples that did not amplify properly or are otherwise excluded from sequencing should be indicated. It is extremely easy to get confused when you are working with hundreds of samples that have unique identifier codes – be extra careful at this step!

3. Selecting a sequencing platform

Criteria to consider:

  • Read length
  • Number of reads required
  • Number of unique MID primers available
  • Do you have amplicons of variable size?
Available platforms

Ion Torrent

  • Now at Genome Quebec
  • Very efficient for 16S rRNA sequences…produces good quality reads of roughly 200-250 bp in length
  • Inexpensive

Roche 454

  • Standard use for pooled amplicon experiments
  • Usually 1.5 million or more reads per run
  • Plate can be split into 8 regions
  • Each region can potentially contain the same MIDs while keeping samples separate

Illumina

  • Very high-throughput
  • Produces shorter reads
  • Used more frequently for full genome and metagenome sequencing

4. Primer ordering

Primer sets adapted to Roche 454 are currently available at Centre sur la Biodiversité for universal 16S rRNA, universal fungal ITS, and 18S rRNA primers that favour arbuscular mycorrhizal fungi (AMF). Other sets are also available for several bacterial functional genes, and possibly other regions. Ask around before ordering a costly huge set of primers!

Primers for amplicon-pool sequencing are similar to regular primers, and can contain the same target sequence that you use for a regular PCR reaction. However, you must also add a sequencing adaptor, a library code, and an MID code. You will order multiple forward primers with unique MID codes, and will then amplify different samples with different MID-coded primers. The purpose is to be able to pool your DNA fragments, and later separate them by MID in silico, with bioinformatics steps. If we did not use this approach, it would cost several thousand dollars to get a ton of reads for only one sample! Reverse primers generally do not contain MID codes, unless you will be sequencing from the reverse direction.

Consult Roche for the latest recommended primers and adaptors (You must create a login to access the documentation). This may vary depending on the chemistry used in the reaction. For the original recommended MIDs, consult ‘454 Sequencing Technical Bulletin No. 005-2009’ .

5. Amplification and pooling of amplicons

Steps to prepare samples for sequencing

  1. PCR amplify your samples
  2. Purify the amplified sample by extracting only the fragments of the expected size. This can be done by extracting band of interest from a gel, and processing with a gel purification kit, but other methods for size fragmentation exist.
  3. Quantify your samples using Picogreen or Qubit (not NanoDrop, as it is not precise).
  4. Pool your samples in an equimolar ratio. Currently Genome Quebec requests a bare minimum of 450 ng in 30 ul for an MID library. Be sure to check with them for changes in their protocol.

Can you add different concentrations of amplicons to the same pool? Yes, BUT…only in very specific circumstances. You should consider that, in many subsequent sequence analyses steps, some calculations will require to standardize all samples, to bring them to reads number equal to the sample with the lowest number of reads. For instance, if you have 2000, 3000, 2500, and 150 reads, you will have to reduce your number of reads to 150 in each sample to compare diversity metrics etc. However, if you have amplicons from two different genes, or from samples that will never be compared directly, you could potentially select two concentrations (e.g. have one gene be 3x less concentrated).

Can you use more cycles to amplify only your samples that produce weak bands? No. More cycles means more error, and more bias towards certain taxa. The PCR-amplification protocol should be standardized between all samples, and you should use as few cycles as is reasonable. If you are getting bands of weak intensity, from samples that had good measurable amount of DNA, try adding BSA or DMSO in your PCR mix, or dilute DNA samples 1:10 or 1:100.

6. Submitting samples for sequencing

If you will sequence at Genome Quebec, you must fill out the Roche Service request form, and you need to obtain a quotation number from them for the order. When your samples are ready to go, your best bet is to walk them over there yourself!

7. Receiving files

When you fill out your order, request the original .sff files. Some sequencing centres will offer to do a certain amount of the pre-processing for you, but you will want to have all of the quality data available to you for your analyses. Furthermore, many analysis pipelines start from this initial file, so it is preferable to work from one than having to redo the same steps on hundreds of pre-separated files!

8. Selecting appropriate pipelines for analysis

You do not have to be a bioinformatics specialist to work with sequence data, especially when you are profiling commonly used phylogenetic markers such as 16S and 18S rRNA. Many pipelines are publicly available, and have excellent tutorials that will allow you to learn how to process your data. Some will provide sample datasets, so that you can walk through each step and make sure that you end up with the results that you expect.

This protocol details the processing of amplicons using Mothur, as I find their Wiki, tutorials and help forum to be the most comprehensive. If you use Mothur, you should cite the following reference: Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB et al. (2009). Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75: 7537-7541.

If you use the protocol detailed here, you should also cite the Schloss standard operating procedure (SOP): Schloss PD, Gevers D, Westcott SL. (2011). Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS One 6: e27310.

Other resources and pipeline

Below are some other resources that you may find useful. Be sure to locate and cite the appropriate reference when using each.

  • CD-HIT A clustering algorithm that does not require reads to be aligned. Useful for ITS analysis, and pretty much the only aspect of initial sequence processing that I conduct outside of Mothur.
  • Qiime Also generally used for the processing of amplicon data produced by next generation sequencing, but can process some shotgun metagenomic data. Apparently has some good visual components, but may currently be less efficient at processing really large sequence sets.
  • MG-RAST Very good for the visualization of metagenomic reads, and allows easy comparison with other public datasets (e.g. show where your data lie compared to metagenomes of similar samples in a PCoA (principal coordinate analysis). They are starting to take amplicon datasets as well, but there are not as many options for manipulating your data as are currently found in Mothur and Qiime. The quality of classification is good, but likely less precise than other methods. They are not well annotated for eukaryotes.
  • Fast UniFrac An interesting and widely used platform for comparing divergence between communities. Their algorithm examines the overall sequence distance between samples, without necessarily grouping reads into OTUs. There are good visualization tools, and you can output the UniFrac distance matrix into graphing programs like R. The UniFrac metric is also available in Mothur.

9. Quality processing of reads and binning by MID

This step is meant to account for errors introduced in the amplification and sequencing steps, and to remove reads that are too short or do not match the MID/oligo requirements. The following sections will describe amplicon processing using Mothur. Sequence processing techniques are changing rapidly, and you should consult the Schloss SOP for the latest standard protocol and for options not described here.

Input files

The easiest way to start is to use a combined .sff (or .fastq) file, and not the files already separated by MID code. If you do not have access to a single file, you can run multiple .sff files in parallel and later merge your separate .fasta and .qual files.

## Split the .sff file into .fasta, .qual, and .flow files (for fastq use fastq.info)##

sffinfo(sff=example.sff, flow=T)

## If you run the above command on multiple .sff files, you can merge the .fasta and .qual files next. Use hyphens to separate each file name.##

merge.files(input=example1.fasta-example2.fasta-example3.fasta, output=example.fasta)
merge.files(input=example1.qual-example2.qual-example3.qual, output=example.qual)

## Take a look at your sequence stats. Do this at any step you want to monitor how many reads you have, the average size, etc. ##

summary.seqs(fasta=example.fasta)

## Now you will trim your reads to remove primers, MID tags, and bad quality segments. Reads that end up being too short will be removed. The oligos file will have to be created (you can find an example in the Schloss SOP sample data), and will contain your primer sequence(s), MID codes, and group names. You can find descriptions for the parameters I used on the Mothur wiki. If you have sequenced in the reverse direction, you need to use the parameter flip=T. ##

trim.seqs(fasta=example.fasta, oligos=example.oligos, qfile=example.qual, maxambig=0, maxhomop=8, bdiffs=1, pdiffs=2, qwindowaverage=30, qwindowsize=50,minlength=250,processors=2)

## One of the great advantages of Mothur is that it becomes possible to work with a much smaller number of reads than were actually contained in your .sff file. This is done by reducing the file to only the unique reads, and creating a names file to map all identical reads to each other. ##

unique.seqs(fasta=example.trim.fasta)
summary.seqs(fasta=example.trim.unique.fasta, name=example.trim.names)
Alignment (for alignable genes such as 16S and 18S rRNA)

If you are using 16S or 18S amplicons, or have another well aligned database, you should use the following steps in order to further improve the quality of your reads, and ensure that you are comparing your reads over the same gene region.

## First you will align your reads to a chosen database. In this case, I have selected the Mothur version of the bacterial Silva database. Other alignment options are explained in the Mothur wiki. ##

align.seqs(candidate=example.trim.unique.fasta, template=silva.bacteria.fasta, ksize=9, align=needleman, gapopen=-1,processors=2)

## It is very important to look at your reads at this point to make sure that everything appears to have aligned properly. ##

summary.seqs(fasta=example.trim.unique.align, name=example.trim.names, processors=2)

## Most of your reads should be either starting or ending in the same place, since you used primers that target a single region. If not, something has probably gone wrong earlier on (check trim.seqs parameters). In order to make sure that we are comparing reads over the same region, we have to screen out those that did not align well. If most of your reads (e.g. 90%) start around 1044, you should set start=1044 and the 10% that start after that position will be eliminated. At the other end, you will use ‘optimize’ in order to keep the best aligned reads. In the example below, I discard the 5% of the reads that end first in the alignment. ##

screen.seqs(fasta=example.trim.unique.align, name=example.trim.names,  start=1044, group=example.groups, optimize=end, criteria=95, processors=2)
summary.seqs(fasta=example.trim.unique.good.align, name=example.trim.good.names)

## This step is extremely important, and is also the most likely to give you problems. Because the alignment database is many columns long and your aligned reads only cover a region of 1000-2000 bases, you need to remove the mostly blank columns. The best option for downstream applications is to use the trump=. option in which every column with even one blank is removed. The problem is that even one poor sequence can lead to you eliminating all or most columns. Read more about this in the Mothur help forum. ##

filter.seqs(fasta=example.trim.unique.good.align,vertical=T, trump=., processors=2)
summary.seqs(fasta=example.trim.unique.good.filter.fasta, name=example.trim.good.names)

## You should now run unique.seqs again, as some of the differences between reads earlier may have been in regions in which they did not overlap. ##

unique.seqs(fasta=example.trim.unique.good.filter.fasta, name=example.trim.good.names)
summary.seqs(fasta=example.trim.unique.good.filter.unique.fasta, name=example.trim.unique.good.filter.names)

## To account for slight mismatches due to sequencing error, we use the pre.cluster command to group reads that are likely from the same OTU. ##

pre.cluster(fasta=example.trim.unique.good.filter.unique.fasta, name=example.trim.unique.good.filter.names, group=example.good.groups,diffs=2)
summary.seqs(fasta=example.trim.unique.good.filter.unique.precluster.fasta, name=example.trim.unique.good.filter.unique.precluster.names)

## As a last quality-check step, we will remove potential chimeras (a single amplicon originating from multiple amplicons) from our sequence data. ##

chimera.uchime(fasta=example.trim.unique.good.filter.unique.precluster.fasta, name=example.trim.unique.good.filter.unique.precluster.names, group=example.good.groups, processors=2)
remove.seqs(accnos=example.trim.unique.good.filter.unique.precluster.uchime.accnos, fasta=example.trim.unique.good.filter.unique.precluster.fasta, name=example.trim.unique.good.filter.unique.precluster.names, group=example.good.groups, dups=T)
summary.seqs(fasta=example.trim.unique.good.filter.unique.precluster.pick.fasta, name=example.trim.unique.good.filter.unique.precluster.pick.names)

10. Classification, OTU assignment and downstream analyses

See the Mothur wiki for more information on what comes next. You can basically decide at this point whether you want to work with OTU data (clustered reads), phylotype data (assigned sequence names), or a mix of both. There are plenty of downstream analyses built into Mothur to allow you to compare your samples and treatments.

11. Going further

Simply listing species names is not enough. You need to come up with new ways to relate community composition to function, or to answer real ecological questions. Try scanning the most recent articles of your target journal(s) to see what is new and exciting, and try to emulate some of the more interesting figures. Or come up with your own!

Accessory software

For working with data, you will minimally want to be comfortable using R and Excel. If you can think of something to do, it can be done in R, but you often have to do some searching to figure out how. If you have a copy of Adobe Illustrator available, this is very useful for cleaning up your figures, or combining multiple figure types.

Take Analyse quantitative des données biologiques course with Pierre Legendre, or learn how to use R on your own with his Numerical Ecology with R book.

12. Submitting reads to Sequence Reads Archive (SRA)

When you do massively parallel sequencing, you are technically supposed to submit all of your reads to SRA. This is because readers of your article need to be able to access not only your fasta reads, but the quality data that is associated with your sequencing runs, so that they can combine these in meta-analyses. The SRA may cease to be the go-to place for large datasets in the near future, but you will not be able to publish without making your data publically available somehow. Make sure that you do not make your data available before you publish. You can leave the accession number blank in your manuscript until the last stage of the review process, and do the submission at that point. This is preferred anyway, as they like to link your data to your article.

The procedure changes often, but ensure that you are providing all of your MID data so that other users are able to process your .sff file as you have. You can put this in one of the descriptive paragraphs that they provide for your experiment.