Introduction


DADA2 is a bioinformatics pipeline created by Callahan et al. (2016). It consists of a series of steps that filter the raw sequences obtained from Illumina sequencing. The final step assigns taxonomy to the filtered sequences so that the microbial community can be studied.

DADA2 has two major features that distinguish it from other commonly used pipelines. First, it models the sequencing error, which is meant to make it possible to distinguish mutant sequences (true biological variants) from erroneous sequences. Second, unlike other pipelines such as QIIME or Mothur, DADA2 does not cluster sequences that are 97% similar into Operational Taxonomic Units (OTUs). Its Amplicon Sequence Variants (ASVs) are not grouped unless the sequences are 100% identical. See the figure below.
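
To make the contrast between a 97% threshold and exact matching concrete, here is a minimal base-R sketch. The two sequences are made up, and `percent_identity()` is a hypothetical helper, not a DADA2 function. It assumes the sequences are already aligned and of equal length, which real OTU clustering tools do not require.

```r
# Toy illustration only: hypothetical helper, not part of DADA2.
# Assumes both sequences are already aligned and have the same length.
percent_identity <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  stopifnot(length(a_chars) == length(b_chars))
  100 * sum(a_chars == b_chars) / length(a_chars)
}

# Two made-up 100-nt sequences that differ at exactly two positions.
seq_a <- strrep("ACGTACGTAC", 10)
seq_b <- seq_a
substr(seq_b, 10, 10) <- "G"   # position 10 was "C"
substr(seq_b, 50, 50) <- "T"   # position 50 was "C"

pct <- percent_identity(seq_a, seq_b)   # 98

# OTU-style grouping: sequences at or above the 97% threshold share a group.
same_otu_97 <- pct >= 97                 # TRUE

# ASV-style comparison: only exactly identical sequences are the same variant.
same_asv <- identical(seq_a, seq_b)      # FALSE
```

With two mismatches out of 100 positions, the pair would fall into one OTU at 97% but would remain two distinct variants under exact matching. Whether the second sequence is a real variant or a sequencing error is precisely what the error model is meant to decide.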

DADA2 was originally built for 16S marker gene sequences (Bacteria); here we use it with ITS marker gene sequences (Fungi) from Illumina MiSeq 2x300 bp paired-end sequencing. To speed up the execution of each step, we randomly subsampled the dataset to keep only 1000 sequences per sample. Finally, credit where credit is due: this tutorial was largely inspired by the original DADA2 tutorial.

In general, before starting this pipeline, we must take some precautions:

  1. Samples must be demultiplexed: split into individual per-sample fastq files.
  2. If the sequences are paired-end, the forward and reverse sequences have to be in distinct fastq files but must contain reads in matched order.
  3. The nucleotides that are not part of the amplicon (primers and adapters) have to be removed. They can also be removed at the filtering step.
  4. Most functions have a multithreading option that speeds up computation by using multiple processors. Just specify multithread = TRUE to enable it. Warning: this option does not work under Windows.
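
The matched-order requirement for paired-end files can be checked before running anything else. The following is a minimal base-R sketch, not a DADA2 function: `read_ids()` and `reads_in_matched_order()` are hypothetical helpers, and the code assumes uncompressed FASTQ files with exactly four lines per record and Illumina-style headers (a read name followed by a space-separated comment or a `/1`, `/2` suffix). The example reads are made up and written to temporary files.

```r
# Hypothetical helpers (not DADA2 functions). Assumes uncompressed FASTQ
# files with exactly 4 lines per record.
read_ids <- function(fastq_file) {
  lines <- readLines(fastq_file)
  # Header lines are the 1st, 5th, 9th, ... lines of the file.
  headers <- lines[seq_along(lines) %% 4 == 1]
  # Keep only the read name: drop everything after the first space,
  # then any trailing "/1" or "/2" mate suffix.
  ids <- sub(" .*$", "", headers)
  sub("/[12]$", "", ids)
}

# TRUE only if both files list the same reads in the same order.
reads_in_matched_order <- function(fwd_file, rev_file) {
  identical(read_ids(fwd_file), read_ids(rev_file))
}

# Tiny made-up example written to temporary files.
fwd_fq <- tempfile(fileext = ".fastq")
rev_fq <- tempfile(fileext = ".fastq")
writeLines(c("@read1 1:N:0:1", "ACGT", "+", "IIII",
             "@read2 1:N:0:1", "TTGA", "+", "IIII"), fwd_fq)
writeLines(c("@read1 2:N:0:1", "ACGT", "+", "IIII",
             "@read2 2:N:0:1", "TCAA", "+", "IIII"), rev_fq)

matched <- reads_in_matched_order(fwd_fq, rev_fq)
```

If the check fails, the forward and reverse files were probably filtered or sorted independently and need to be re-synchronised before going further.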

This figure, taken from Hugerth and Andersson (2017), illustrates the theoretical difference between OTUs and ASVs. Each color represents a clade. Yellow stars indicate mutations; red stars indicate amplification or sequencing errors. The size of the space between the sequences indicates their clustering.


(A) 100 % identity clustered OTUs.
The slightest sequence variation creates a new group. Mutant sequences and erroneous sequences are treated alike.
(B) 97 % identity clustered OTUs.
A wider grouping means that erroneous sequences are no longer treated as separate groups; however, mutant sequences are also clustered into the consensus group.
(C) ASVs.
Learning the error rates theoretically makes it possible to group erroneous sequences with their consensus sequences, while mutant sequences are kept as distinct variants.



Let’s start!


First, we’re going to load the DADA2 package. You should have the latest version, which you can check with packageVersion('dada2'). Then we’re going to create a variable (path) holding the path to the files required for this pipeline.

library(dada2); packageVersion("dada2")
## Loading required package: Rcpp
## [1] '1.18.0'
path <- "data/ITS_sub/"

Let’s check where the path leads to…