Introduction


DADA2 is a bioinformatics pipeline created by Callahan et al., 2016. It consists of a series of steps that filter the raw sequences obtained from Illumina sequencing. The final step assigns a taxonomy to the filtered sequences so that the microbial community can be studied.

DADA2 has two major features that distinguish it from other commonly used pipelines. First, it models the sequencing error rates, which should make it possible to tell true sequence variants apart from erroneous sequences. Second, unlike pipelines such as QIIME or Mothur, DADA2 does not cluster sequences that are 97 % similar into Operational Taxonomic Units (OTUs): its Amplicon Sequence Variants (ASVs) are only grouped when the sequences are 100 % identical. See the figure below.

Although DADA2 was originally developed for 16S marker gene sequences (Bacteria), we will use it with ITS marker gene (Fungi) sequences from Illumina MiSeq 2x300 bp paired-end sequencing. To speed up the execution of each step, we randomly sub-sampled the dataset to keep only 1000 sequences per sample. Finally, to render unto Caesar what is Caesar's: this tutorial was largely inspired by the original DADA2 tutorial.

In general, before starting this pipeline, we must take some precautions:

  1. Samples must be demultiplexed: split into individual per-sample fastq files.
  2. If the sequences are paired-end, the forward and reverse sequences have to be in distinct fastq files but must contain reads in matched order.
  3. The nucleotides that are not part of the amplicon (primers and adapters) have to be removed. They can also be removed at the filtering step.
  4. Most functions have a multithreading option that allows faster computing time by using multiple processors. Just specify multithread = TRUE to enable it. Warning: this option does not work under Windows.

This figure, taken from Hugerth and Andersson, 2017, illustrates the theoretical difference between OTUs and ASVs. Each color represents a clade. Yellow stars indicate mutations; red stars indicate amplification or sequencing errors. The size of the space between the sequences indicates their clustering.


(A) OTUs clustered at 100 % identity.
The slightest sequence variation creates a new group: mutant sequences and erroneous sequences are treated the same way.
(B) OTUs clustered at 97 % identity.
This wider grouping absorbs the erroneous sequences, but the mutant sequences are also clustered into the consensus group.
(C) ASVs.
Learning the error rates theoretically makes it possible to group the erroneous sequences with their consensus sequences, while the mutant sequences are kept as distinct variants.



Let’s start!


First, we’re going to load the DADA2 package. You should have the latest version; you can check it with packageVersion('dada2'). Then we’re going to create a variable (path) containing the path to the files required for this pipeline.

library(dada2); packageVersion("dada2")
## Loading required package: Rcpp
## [1] '1.18.0'
path <- "data/ITS_sub/"

Let’s check where the path leads to…


list.files(path)
##   [1] "cutadapt"           "filtered_pairedend" "filtN"             
##   [4] "S1_R1.fastq.gz"     "S1_R2.fastq.gz"     "S10_R1.fastq.gz"   
##   [7] "S10_R2.fastq.gz"    "S11_R1.fastq.gz"    "S11_R2.fastq.gz"   
##  [10] "S12_R1.fastq.gz"    "S12_R2.fastq.gz"    "S13_R1.fastq.gz"   
##  [13] "S13_R2.fastq.gz"    "S14_R1.fastq.gz"    "S14_R2.fastq.gz"   
##  [16] "S15_R1.fastq.gz"    "S15_R2.fastq.gz"    "S16_R1.fastq.gz"   
##  [19] "S16_R2.fastq.gz"    "S17_R1.fastq.gz"    "S17_R2.fastq.gz"   
##  [22] "S18_R1.fastq.gz"    "S18_R2.fastq.gz"    "S2_R1.fastq.gz"    
##  [25] "S2_R2.fastq.gz"     "S20_R1.fastq.gz"    "S20_R2.fastq.gz"   
##  [28] "S21_R1.fastq.gz"    "S21_R2.fastq.gz"    "S22_R1.fastq.gz"   
##  [31] "S22_R2.fastq.gz"    "S23_R1.fastq.gz"    "S23_R2.fastq.gz"   
##  [34] "S24_R1.fastq.gz"    "S24_R2.fastq.gz"    "S25_R1.fastq.gz"   
##  [37] "S25_R2.fastq.gz"    "S26_R1.fastq.gz"    "S26_R2.fastq.gz"   
##  [40] "S27_R1.fastq.gz"    "S27_R2.fastq.gz"    "S28_R1.fastq.gz"   
##  [43] "S28_R2.fastq.gz"    "S29_R1.fastq.gz"    "S29_R2.fastq.gz"   
##  [46] "S3_R1.fastq.gz"     "S3_R2.fastq.gz"     "S30_R1.fastq.gz"   
##  [49] "S30_R2.fastq.gz"    "S31_R1.fastq.gz"    "S31_R2.fastq.gz"   
##  [52] "S33_R1.fastq.gz"    "S33_R2.fastq.gz"    "S34_R1.fastq.gz"   
##  [55] "S34_R2.fastq.gz"    "S35_R1.fastq.gz"    "S35_R2.fastq.gz"   
##  [58] "S36_R1.fastq.gz"    "S36_R2.fastq.gz"    "S37_R1.fastq.gz"   
##  [61] "S37_R2.fastq.gz"    "S39_R1.fastq.gz"    "S39_R2.fastq.gz"   
##  [64] "S4_R1.fastq.gz"     "S4_R2.fastq.gz"     "S41_R1.fastq.gz"   
##  [67] "S41_R2.fastq.gz"    "S42_R1.fastq.gz"    "S42_R2.fastq.gz"   
##  [70] "S43_R1.fastq.gz"    "S43_R2.fastq.gz"    "S44_R1.fastq.gz"   
##  [73] "S44_R2.fastq.gz"    "S45_R1.fastq.gz"    "S45_R2.fastq.gz"   
##  [76] "S46_R1.fastq.gz"    "S46_R2.fastq.gz"    "S47_R1.fastq.gz"   
##  [79] "S47_R2.fastq.gz"    "S48_R1.fastq.gz"    "S48_R2.fastq.gz"   
##  [82] "S49_R1.fastq.gz"    "S49_R2.fastq.gz"    "S5_R1.fastq.gz"    
##  [85] "S5_R2.fastq.gz"     "S50_R1.fastq.gz"    "S50_R2.fastq.gz"   
##  [88] "S51_R1.fastq.gz"    "S51_R2.fastq.gz"    "S52_R1.fastq.gz"   
##  [91] "S52_R2.fastq.gz"    "S53_R1.fastq.gz"    "S53_R2.fastq.gz"   
##  [94] "S54_R1.fastq.gz"    "S54_R2.fastq.gz"    "S55_R1.fastq.gz"   
##  [97] "S55_R2.fastq.gz"    "S56_R1.fastq.gz"    "S56_R2.fastq.gz"   
## [100] "S58_R1.fastq.gz"    "S58_R2.fastq.gz"    "S59_R1.fastq.gz"   
## [103] "S59_R2.fastq.gz"    "S6_R1.fastq.gz"     "S6_R2.fastq.gz"    
## [106] "S60_R1.fastq.gz"    "S60_R2.fastq.gz"    "S61_R1.fastq.gz"   
## [109] "S61_R2.fastq.gz"    "S62_R1.fastq.gz"    "S62_R2.fastq.gz"   
## [112] "S63_R1.fastq.gz"    "S63_R2.fastq.gz"    "S64_R1.fastq.gz"   
## [115] "S64_R2.fastq.gz"    "S65_R1.fastq.gz"    "S65_R2.fastq.gz"   
## [118] "S66_R1.fastq.gz"    "S66_R2.fastq.gz"    "S67_R1.fastq.gz"   
## [121] "S67_R2.fastq.gz"    "S68_R1.fastq.gz"    "S68_R2.fastq.gz"   
## [124] "S69_R1.fastq.gz"    "S69_R2.fastq.gz"    "S7_R1.fastq.gz"    
## [127] "S7_R2.fastq.gz"     "S72_R1.fastq.gz"    "S72_R2.fastq.gz"   
## [130] "S73_R1.fastq.gz"    "S73_R2.fastq.gz"    "S74_R1.fastq.gz"   
## [133] "S74_R2.fastq.gz"    "S75_R1.fastq.gz"    "S75_R2.fastq.gz"   
## [136] "S76_R1.fastq.gz"    "S76_R2.fastq.gz"    "S77_R1.fastq.gz"   
## [139] "S77_R2.fastq.gz"    "S78_R1.fastq.gz"    "S78_R2.fastq.gz"   
## [142] "S79_R1.fastq.gz"    "S79_R2.fastq.gz"    "S8_R1.fastq.gz"    
## [145] "S8_R2.fastq.gz"     "S80_R1.fastq.gz"    "S80_R2.fastq.gz"   
## [148] "S81_R1.fastq.gz"    "S81_R2.fastq.gz"    "S9_R1.fastq.gz"    
## [151] "S9_R2.fastq.gz"

You should see the names of the fastq files.

Now, we’re going to read in the names of the fastq files, and perform some string manipulation to get lists of the forward and reverse fastq files. The sort function ensures forward/reverse reads are in the same order.

fnFs <- sort(list.files(path, pattern="_R1.fastq"))
fnRs <- sort(list.files(path, pattern="_R2.fastq"))

Given that the forward/reverse fastq pairs belong to the same sample, we are going to extract the name and save it in a variable. In this case, we assume that the filenames have this type of format: SAMPLENAME_XXX.fastq.

sample.names <- sapply(strsplit(fnFs, "_"), `[`, 1)
sample.names 
##  [1] "S1"  "S10" "S11" "S12" "S13" "S14" "S15" "S16" "S17" "S18" "S2"  "S20"
## [13] "S21" "S22" "S23" "S24" "S25" "S26" "S27" "S28" "S29" "S3"  "S30" "S31"
## [25] "S33" "S34" "S35" "S36" "S37" "S39" "S4"  "S41" "S42" "S43" "S44" "S45"
## [37] "S46" "S47" "S48" "S49" "S5"  "S50" "S51" "S52" "S53" "S54" "S55" "S56"
## [49] "S58" "S59" "S6"  "S60" "S61" "S62" "S63" "S64" "S65" "S66" "S67" "S68"
## [61] "S69" "S7"  "S72" "S73" "S74" "S75" "S76" "S77" "S78" "S79" "S8"  "S80"
## [73] "S81" "S9"

Now, specify the full path to the fnFs and fnRs.

fnFs <- file.path(path, fnFs)
fnRs <- file.path(path, fnRs)

Quality profile


This first step allows us to visualize sequence quality based on the individual Q score of each nucleotide.

plotQualityProfile(fnFs[1]) # 1st Forward sample

plotQualityProfile(fnRs[1]) # 1st Reverse sample

In these figures, the median is shown in green and the quartiles as dotted orange lines. Here we only plotted the first forward and reverse fastq files (fnFs[1] and fnRs[1]), but it is possible to plot several at once (fnFs[x:y]) or to aggregate them as follows.

plotQualityProfile(fnFs, aggregate = TRUE)

plotQualityProfile(fnRs, aggregate = TRUE)

The analysis of these figures helps to choose the filtering and trimming parameters for the next step. The Q score tells us about the accuracy of each base call (see table below).

Q score   Accuracy
10        90 %
20        99 %
30        99.9 %
40        99.99 %
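
The Q score converts to a per-base accuracy as 1 - 10^(-Q/10); a quick check in R reproduces the table:

q <- c(10, 20, 30, 40)
round((1 - 10^(-q/10)) * 100, 2) # per-base accuracy in %
## [1] 90.00 99.00 99.90 99.99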

Another tool for evaluating sequence quality: FastQC.

Filtering and Trimming


First we will create a directory (filtered_pairedend) and objects (filtFs and filtRs) to store the filtered sequences.

filt_path <- file.path(path, "filtered_pairedend") 
filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))

Let’s proceed with the filterAndTrim function; its output will be stored in the out object.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, 
                     truncQ=6,
                     truncLen = c(280,280),
                     trimLeft=c(18,20),
                     maxEE=c(2,2))
                     #multithread=TRUE)


First, the function needs the unfiltered sequences (fnFs and fnRs) as well as the file paths where the filtered sequences will be written (filtFs and filtRs). Several parameters can then be adjusted as we wish:

  - truncQ: truncates each read at the first nucleotide with a quality score less than or equal to this value (here 6).
  - truncLen: truncates reads to this length and discards reads that are shorter (here 280 bp for both forward and reverse reads).
  - trimLeft: removes this many nucleotides from the start of each read (here 18 for the forward reads and 20 for the reverse reads).
  - maxEE: discards reads with more than this number of expected errors (here 2 for both forward and reverse reads).

Other settings can also be changed; they are listed on the function's help page: ?filterAndTrim.

Filtering visualization

pourc <- cbind((out[,2]/out[,1])*100) # Percentage of reads kept after filtering (filtered / input)
pourc_disc <- cbind(out, pourc) # combines out and pourc
pourc_disc 
##                 reads.in reads.out     
## S1_R1.fastq.gz      1000       595 59.5
## S10_R1.fastq.gz     1000       366 36.6
## S11_R1.fastq.gz     1000       432 43.2
## S12_R1.fastq.gz     1000       546 54.6
## S13_R1.fastq.gz     1000       417 41.7
## S14_R1.fastq.gz     1000       229 22.9
## S15_R1.fastq.gz     1000       512 51.2
## S16_R1.fastq.gz     1000       481 48.1
## S17_R1.fastq.gz     1000       634 63.4
## S18_R1.fastq.gz     1000       607 60.7
## S2_R1.fastq.gz      1000       482 48.2
## S20_R1.fastq.gz     1000       614 61.4
## S21_R1.fastq.gz     1000       598 59.8
## S22_R1.fastq.gz     1000       539 53.9
## S23_R1.fastq.gz     1000       561 56.1
## S24_R1.fastq.gz     1000       599 59.9
## S25_R1.fastq.gz     1000       598 59.8
## S26_R1.fastq.gz     1000       567 56.7
## S27_R1.fastq.gz     1000       650 65.0
## S28_R1.fastq.gz     1000       607 60.7
## S29_R1.fastq.gz     1000       526 52.6
## S3_R1.fastq.gz      1000       556 55.6
## S30_R1.fastq.gz     1000       551 55.1
## S31_R1.fastq.gz     1000       571 57.1
## S33_R1.fastq.gz     1000       482 48.2
## S34_R1.fastq.gz     1000       644 64.4
## S35_R1.fastq.gz     1000       558 55.8
## S36_R1.fastq.gz     1000       453 45.3
## S37_R1.fastq.gz     1000       446 44.6
## S39_R1.fastq.gz     1000       521 52.1
## S4_R1.fastq.gz      1000       425 42.5
## S41_R1.fastq.gz     1000       517 51.7
## S42_R1.fastq.gz     1000       447 44.7
## S43_R1.fastq.gz     1000       500 50.0
## S44_R1.fastq.gz     1000       369 36.9
## S45_R1.fastq.gz     1000       359 35.9
## S46_R1.fastq.gz     1000       540 54.0
## S47_R1.fastq.gz     1000       501 50.1
## S48_R1.fastq.gz     1000       391 39.1
## S49_R1.fastq.gz     1000       516 51.6
## S5_R1.fastq.gz      1000       387 38.7
## S50_R1.fastq.gz     1000       422 42.2
## S51_R1.fastq.gz     1000       567 56.7
## S52_R1.fastq.gz     1000       521 52.1
## S53_R1.fastq.gz     1000       599 59.9
## S54_R1.fastq.gz     1000       472 47.2
## S55_R1.fastq.gz     1000       571 57.1
## S56_R1.fastq.gz     1000       416 41.6
## S58_R1.fastq.gz     1000       525 52.5
## S59_R1.fastq.gz     1000       559 55.9
## S6_R1.fastq.gz      1000       619 61.9
## S60_R1.fastq.gz     1000       593 59.3
## S61_R1.fastq.gz     1000       537 53.7
## S62_R1.fastq.gz     1000       470 47.0
## S63_R1.fastq.gz     1000       644 64.4
## S64_R1.fastq.gz     1000       595 59.5
## S65_R1.fastq.gz     1000       561 56.1
## S66_R1.fastq.gz     1000       551 55.1
## S67_R1.fastq.gz     1000       544 54.4
## S68_R1.fastq.gz     1000       532 53.2
## S69_R1.fastq.gz     1000       372 37.2
## S7_R1.fastq.gz      1000       521 52.1
## S72_R1.fastq.gz     1000       558 55.8
## S73_R1.fastq.gz     1000       581 58.1
## S74_R1.fastq.gz     1000       431 43.1
## S75_R1.fastq.gz     1000       356 35.6
## S76_R1.fastq.gz     1000       403 40.3
## S77_R1.fastq.gz     1000       387 38.7
## S78_R1.fastq.gz     1000       400 40.0
## S79_R1.fastq.gz     1000       560 56.0
## S8_R1.fastq.gz      1000       423 42.3
## S80_R1.fastq.gz     1000       457 45.7
## S81_R1.fastq.gz     1000       594 59.4
## S9_R1.fastq.gz      1000       492 49.2
(mean(out[,2])/mean(out[,1]))*100 # Mean percentage 
## [1] 50.98243


Almost half of the reads did not pass the filtering parameters!
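
If such a loss seems too severe for your own data, the parameters can be relaxed. Here is a purely illustrative sketch with more permissive settings (these are not the settings used in the rest of this tutorial):

out_relaxed <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                             truncQ = 2,          # truncate at the first base with Q <= 2
                             truncLen = c(280,280),
                             trimLeft = c(18,20),
                             maxEE = c(5,5))      # tolerate more expected errors per read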


CHALLENGE

Draw the quality profile of the first sample once filtered and compare it to its unfiltered quality profile. Use the plotQualityProfile function.


Other filtering tools exist like: Trimmomatic.


Error Rates Learning


This step consists of estimating the sequencing error rates. Its purpose is to make it possible to distinguish true sequence variants from sequences containing errors. The error model is computed by alternating the estimation of the error rates and the inference of sample composition until they converge on a jointly consistent solution.

errF <- learnErrors(filtFs)
## 9884474 total bases in 37727 reads from 74 samples will be used for learning the error rates.
errR <- learnErrors(filtRs)
## 9809020 total bases in 37727 reads from 74 samples will be used for learning the error rates.
#multithread=TRUE

The minimum amount of data used for error rate learning can be adjusted with the nbases parameter (a number of bases; the older nreads parameter, which set a number of reads instead, is deprecated).
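
For instance, a quicker but rougher error model could be learned from less data; a minimal sketch, assuming you are willing to trade precision for speed:

errF_quick <- learnErrors(filtFs, nbases = 1e7) # the default is 1e8 bases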


Error rate visualization

plotErrors(errF, nominalQ=TRUE)

plotErrors(errR, nominalQ=TRUE)

The error rates for each possible transition (e.g., A->C, A->G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence. The red line shows the error rates expected under the nominal definition of the Q-value (for Illumina technology).


Dereplicating


This step combines all identical sequencing reads into unique sequences, each with a corresponding abundance. It reduces the subsequent computation time by eliminating redundant comparisons. The dereplicated sequences take the names of the samples they come from.

derepFs <- derepFastq(filtFs)
names(derepFs) <- sample.names

derepRs <- derepFastq(filtRs)
names(derepRs) <- sample.names

The advantage of DADA2 lies in the fact that it retains a summary of the quality information associated with each unique sequence. The consensus quality profile of a unique sequence is the average of the positional qualities from the dereplicated reads. These quality profiles inform the error model of the subsequent sample inference step, significantly increasing DADA2’s accuracy.
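
We can peek at the derep-class object of the first sample to see this information (a quick sketch; the exact numbers will depend on your data):

derepFs[[1]]            # summary: how many reads were collapsed into how many unique sequences
dim(derepFs[[1]]$quals) # consensus quality matrix: unique sequences x read positions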

Unique sequences


Sample Inference


The dada function now applies the core sample inference (denoising) algorithm: using the error model learned above, it determines which sequences are genuine amplicon sequence variants. With pool=TRUE, all samples are pooled before inference, which improves the detection of rare variants shared across samples, at the cost of computation time.

dadaFs <- dada(derepFs, 
               err = errF, 
               #multithread=TRUE,
               pool=TRUE)
## 74 samples were pooled: 37727 reads in 20364 unique sequences.
dadaRs <- dada(derepRs, 
               err=errR,
               #multithread=TRUE,
               pool=TRUE)
## 74 samples were pooled: 37727 reads in 18908 unique sequences.
dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 70 sequence variants were inferred from 364 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
dadaRs[[1]]
## dada-class: object describing DADA2 denoising results
## 73 sequence variants were inferred from 340 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
#save(dadaRs, file="data/dadaRs.rdata")
#save(dadaFs, file="data/dadaFs.rdata")


Merging paired reads


FUSION!


The whole point of paired-end sequencing is to merge the forward and reverse reads, which increases our confidence in the resulting sequences and yields longer amplicons.
The mergePairs function needs the objects computed in the two preceding steps (derep and dada). The parameters we can then freely modify are:

  - minOverlap: the minimum length of overlap required to merge the forward and reverse reads (here 12).
  - maxMismatch: the maximum number of mismatches allowed within the overlap region (here 0).

Other settings can also be changed; they are listed on the function's help page: ?mergePairs. For example, if returnRejects = TRUE, pairs that were rejected because of mismatches in the overlap region are kept in the output.
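
A minimal sketch of that option, for illustration only (the pipeline below keeps the default returnRejects = FALSE):

mergers_all <- mergePairs(dadaFs, derepFs, dadaRs, derepRs,
                          minOverlap = 12,
                          maxMismatch = 0,
                          returnRejects = TRUE) # rejected pairs are kept, flagged accept = FALSE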

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, 
                      minOverlap = 12, 
                      maxMismatch = 0)


Let’s inspect the results!

head(mergers[[1]])
##                                                                                                                                                                                                                                                                                                                                                                                  sequence
## 1    ATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCCCCTTGGTATTCCGAGGGGCACACCCGTTTGAGTGTCGTGAATACTCTCAACCTTCTTGGTTTCTTTGACCACGAAGGCTTGGACTTTGGAGGTTTTTCTTGCTGGCCTCTTTAGAAGCCAGCTCCTCCTAAATGAATGGGTGGGGTCCGCTTTGCTGATCCTCGACGTGATAAGCATCTCTTCTACGTCTCAGTGTCAGCTCGGAACCCCGCTTTCCAACCGTCTTTGGACAAAGACAATGTTCGAGTTGCGACTCGACCTTACAAACCTTGACCTCAAATCGGGTGAGACTACCCGCTGAACTTAA
## 2                                                                                    ATGCGATAAGTAGTGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATCCCGAGGGGCATGCCTGTTCGAGCGTCATTTCACCACTCAAGCCTGGCTTGGTGTTGGGCGACGTCCCCTTTTGGGGACGCGTCTCGAAACGCTCGGCGGCGTGGCACCGGCTTTAAGCGTAGCAGAATCTTTCGCTTTGAAAGTCGGGGCCCCGTCTGCCGGAAGACCTACTCGCAAGGTTGACCTCGGATCAGGCAGGGATACCCGCTGAACTTAA
## 3                                                                                        ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTATAACCACTCAAGCCCCGGCTTGGTCTTGGGGTTCGCGGTCCGCGGCCCTTAAACTCAGTGGCGGTGCCGTCTGGCTCTAAGCGCAGTAATTCTCTCGCTATAGTGTCTAGGTGGTTGCTTGCCATAATCCCCCAATTTTTTACGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA
## 4                                 ATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCACTCCTTGGTATTCCGAGGAGTATGCCTGTTTCAGTATCATGAGCACTCTCACACCTAACCTTTGGGTTTATGGCGTGGAATTGGAATGCGCCGACTGTCATGGTTGGCCCTTCTAAAATGTAGTTCTTGGCTGTCACCTAATACAGCAGTTTGGCCTAATAGTTTTGGCATTCATTGTCAAATCTTTGGCTAACATTTGCTCCAGGAGTCAGTCTTGATAATACAGAAAACTCATTCAAATTTTGATCTGAAATCAGGTAGGGCTACCCGCTGAACTTAA
## 5 ATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATTGAATCTTTGAACGCATCTTGCGCCTCTTGGTATTCCGAGGGGCATGCCTGTTTGAGTGTCATTAGAACTATCAAAAAAATAGATGATTTCAATCGTTAATTTTTTTGGAATTGGAGGTGGTGCTGGTCTTTTTCCATTAATGGCCCAAGCTCCTCCGAAATGCATTAGCGAATGCAGTGCACTTTTTCTCCTTGCTTTTTCTGGGCATTGATAGTTTACTCTCATGCCCTAAGCTGGTAGGGAGGAAGTCACAGAATGCTTCCCGCTCCTGAATGTAATACAAAACTTGACGATCAAACCCCTCAAATCAGGCAGGACTACCCGCTGAACTTAA
## 6                                                                        ATGCGATAAGTAATGCGAATTGCAGAATTCAGTGAGTCATCGAATCTTTGAACGCATATTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTCGAGCGTCATGATCAACCATCAAGCCTGGCTTGTCGTTGGACCCTGTTGTCTCTGGGCGACAGGTCCGAAAGATAATGACGGTGTCATGGCAACCCCGAATGCAACGAGCTTTTTTATAGGCACGCATTTAGTGGTTGGCAAGGCCCCCTCGTGCGTTATTATTTTCTTACGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1       172      14      11    150         0      0      2   TRUE
## 2        43       7       8    230         0      0      1   TRUE
## 3        37      21      17    234         0      0      2   TRUE
## 4        34       2       1    179         0      0      2   TRUE
## 5        32       3       2    147         0      0      1   TRUE
## 6        21      11       9    218         0      0      1   TRUE
max(mergers[[1]]$nmatch) # Largest overlap 
## [1] 257
min(mergers[[1]]$nmatch) # Smallest overlap           
## [1] 113


ASVs table


We finally have our ASVs which we are going to store in the seqtab object.

seqtab <- makeSequenceTable(mergers)
dim(seqtab)
## [1]  74 551
seqtab[,1]
##  S1 S10 S11 S12 S13 S14 S15 S16 S17 S18  S2 S20 S21 S22 S23 S24 S25 S26 S27 S28 
##   0   0   0   0   0   0   0  46 207 127   0   0   1   0   0   1   0   0 532   0 
## S29  S3 S30 S31 S33 S34 S35 S36 S37 S39  S4 S41 S42 S43 S44 S45 S46 S47 S48 S49 
##   0   0  81   7   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0 
##  S5 S50 S51 S52 S53 S54 S55 S56 S58 S59  S6 S60 S61 S62 S63 S64 S65 S66 S67 S68 
##   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0 
## S69  S7 S72 S73 S74 S75 S76 S77 S78 S79  S8 S80 S81  S9 
##   0   0   0   0   0   0   0   6   0   0   0   0   0   0

We get 551 ASVs from the 74,000 raw sequences we had at the beginning. seqtab[,1] gives the number of times the sequence of the first ASV is found in each sample.
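
As a quick sketch, we can also look at the total abundance of the most common ASVs across all samples (unname simply hides the long sequences that label each column):

head(unname(sort(colSums(seqtab), decreasing = TRUE))) # read counts of the 6 most abundant ASVs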

Let’s inspect the ASVs’ lengths


hist(nchar(getSequences(seqtab)),xlab="Size", ylab="Frequency", main = "ASVs length", xlim=c(250,450), ylim=c(0,250)) 


Removing chimeras



This step aims at removing chimeric (non-biological) sequences. With the pooled method, the samples in the ASV table are all pooled together for bimera identification. Other methods can be used, such as the consensus method, where samples are checked individually for bimeras and a consensus decision is made.

seqtab.nochim <- removeBimeraDenovo(seqtab, 
                                    method = "pooled", 
                                    #multithread = TRUE,
                                    verbose = TRUE) 
## Identified 5 bimeras out of 551 input sequences.
#save(seqtab.nochim, file="data/seqtab.nochim.rdata")
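
If you prefer the per-sample consensus approach mentioned above, a minimal sketch would be:

seqtab.nochim.cons <- removeBimeraDenovo(seqtab,
                                         method = "consensus", # samples checked individually, then a consensus decision
                                         verbose = TRUE)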


Let’s inspect the results!


round((sum(seqtab.nochim)/sum(seqtab)*100),2) # Percentage of reads kept after chimera removal
## [1] 99.4
hist(nchar(getSequences(seqtab.nochim)),xlab="Size", ylab="Frequency", main = "Non-chimeric ASVs length", xlim=c(250,450), ylim=c(0,250)) # Length of the non-chimeric sequences

Now we can convert the ASV abundances into presence/absence, which will let us count the number of ASVs per sample.

seqtab.nochim.bin <- ifelse(seqtab.nochim>0,1,0) 
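
The number of ASVs detected in each sample is then simply the row sum of this binary table:

head(rowSums(seqtab.nochim.bin)) # ASV richness of the first six samples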


Track table


The following table tracks how many sequences were retained at each step of the pipeline. An excessive loss of sequences may indicate a problem. For example, if too few sequences pass the chimera removal step, it may mean that primer fragments with ambiguous nucleotides are still present.


getN <- function(x) sum(getUniques(x))
track <- data.frame(Input=as.numeric(out[,1]), # input
                    Filtered=as.numeric(out[,2]), # filtered
                    "Filt//In"=as.numeric(round(((out[,2]/out[,1])*100),2)),# % (Filtered / Input)
                    Merge = as.numeric(sapply(mergers, getN)), # Merged 
                    "Mer//In"=as.numeric(round(((sapply(mergers, getN)/out[,1])*100),2)),# % (Merged / Input)
                    Nonchim = as.numeric(rowSums(seqtab.nochim)),# Non-chimeric                       
                    "Nonchim//In"=as.numeric(round(((rowSums(seqtab.nochim)/out[,1])*100),2)),# % (Non-chimeric / Input)
                    ASV = as.numeric(rowSums(seqtab.nochim.bin))) # Number of ASVs per sample 
rownames(track) <- sample.names # Row names
head(track)
##     Input Filtered Filt..In Merge Mer..In Nonchim Nonchim..In ASV
## S1   1000      595     59.5   564    56.4     564        56.4  65
## S10  1000      366     36.6   273    27.3     269        26.9  43
## S11  1000      432     43.2   342    34.2     342        34.2  58
## S12  1000      546     54.6   489    48.9     484        48.4  55
## S13  1000      417     41.7   340    34.0     338        33.8  71
## S14  1000      229     22.9   128    12.8     107        10.7  27


One picture is worth a thousand words!

library(ggplot2)
library(reshape2)

gtrack<- track[,c(1,2,4,6)]
gtrack$ID <- rownames(gtrack)

lgtrack <- melt(gtrack, id.vars="ID")
bar_track <- ggplot(lgtrack ,aes(x=ID, y=as.numeric(value), fill=variable)) +
      geom_bar(stat="identity", position = "identity") + 
      theme_classic() + # Theme
      theme(axis.ticks.length=unit(0.3,"cm")) + # Ticks size
      theme(axis.text.x = element_text(angle=45) , legend.title = element_blank())+ # Changes the x labels orientation & delete legend title
  scale_x_discrete(name ="Sample ID", limits=rownames(track))+ # Changes x-axis title & sorts the x label names
  scale_y_continuous(name="Abundance", breaks=seq(from = 0, to = 1000, by = 100))+ #Changes y-axis title & sets the y breaks.
  ggtitle("Track")# Main title
bar_track 


Taxonomy assignment


We finally reach the last, and essential, step of the pipeline: taxonomic assignment. Implementing the naïve Bayesian classifier method, the assignTaxonomy function takes as input the sequences to be classified as well as a reference set of sequences with known taxonomy. Taxonomic assignments are reported only when they reach a minimum bootstrap confidence, specified with the minBoot parameter. The reference database used here (UNITE) is available at https://unite.ut.ee/repository.php. Other databases are also available.

taxotab <- assignTaxonomy(seqtab.nochim,
                          refFasta = "reference_database/sh_general_release_dynamic_01.12.2017.fasta",
                          minBoot = 50, # Default 50. The minimum bootstrap confidence for assigning a taxonomic level.
                          multithread=TRUE)
## UNITE fungal taxonomic reference detected.
save(taxotab, file = "data/taxotab.rdata")
#load("data/taxotab.rdata")


Let’s inspect the results!

# view(taxotab) # Full taxonomy table 
write.table(taxotab[1:5,], row.names = FALSE) # First 5 ASVs' taxonomy without the ASV sequence
## "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
## "k__Fungi" "p__Basidiomycota" "c__Agaricomycetes" "o__Agaricales" "f__Tricholomataceae" "g__Mycena" NA
## "k__Fungi" "p__Mucoromycota" "c__Umbelopsidomycetes" "o__Umbelopsidales" "f__Umbelopsidaceae" "g__Umbelopsis" "s__dimorpha"
## "k__Plantae" NA NA NA NA NA NA
## "k__Fungi" "p__Ascomycota" "c__Dothideomycetes" "o__Mytilinidales" "f__Gloniaceae" "g__Cenococcum" "s__geophilum"
## "k__Fungi" "p__Basidiomycota" "c__Agaricomycetes" "o__Agaricales" "f__Hygrophoraceae" "g__Hygrocybe" "s__flavescens"
unique(unname(taxotab[,7])) # Unique species-level assignments
##   [1] NA                    "s__dimorpha"         "s__geophilum"       
##   [4] "s__flavescens"       "s__algarvense"       "s__terricola"       
##   [7] "s__punicea"          "s__elongatum"        "s__humilis"         
##  [10] "s__sylvestris"       "s__opacum"           "s__subvinosa"       
##  [13] "s__lucidum"          "s__abramsii"         "s__rexiana"         
##  [16] "s__ericae"           "s__terminalis"       "s__variata"         
##  [19] "s__conica"           "s__zollingeri"       "s__chlamydosporicum"
##  [22] "s__asperellum"       "s__album"            "s__deciduus"        
##  [25] "s__saponaceum"       "s__rufescens"        "s__populi"          
##  [28] "s__auratus"          "s__reidii"           "s__podzolica"       
##  [31] "s__brunneoviolacea"  "s__simile"           "s__heterochroma"    
##  [34] "s__subsulphurea"     "s__pseudozygospora"  "s__mutabilis"       
##  [37] "s__microspora"       "s__humicola"         "s__camphoratus"     
##  [40] "s__acicola"          "s__fuckelii"         "s__var._bulbopilosa"
##  [43] "s__miniata"          "s__maius"            "s__splendens"       
##  [46] "s__echinulatum"      "s__chlorophana"      "s__australis"       
##  [49] "s__lignicola"        "s__lubrica"          "s__phyllophila"     
##  [52] "s__cantharellus"     "s__trabinellum"      "s__pygmaeum"        
##  [55] "s__isabellina"       "s__changbaiensis"    "s__fragilis"        
##  [58] "s__asperulatus"      "s__carneum"          "s__fuscella"        
##  [61] "s__bicolor"          "s__nigrella"         "s__bulbillosa"      
##  [64] "s__difforme"         "s__spirale"          "s__finlandica"      
##  [67] "s__myriocarpa"       "s__reginae"          "s__fellea"          
##  [70] "s__vrijmoediae"      "s__lacmus"           "s__spurius"         
##  [73] "s__arachnoidea"      "s__dioscoreae"       "s__laetior"         
##  [76] "s__coccinea"         "s__spadicea"         "s__pyriforme"       
##  [79] "s__macrocystis"      "s__risigallina"      "s__pura"            
##  [82] "s__cinereus"         "s__foliicola"        "s__rebaudengoi"     
##  [85] "s__fusiformis"       "s__metachroides"     "s__diversispora"    
##  [88] "s__fumosa"           "s__hymenocystis"     "s__sublilacina"     
##  [91] "s__rufum"            "s__atrovirens"       "s__mors-panacis"    
##  [94] "s__acerinum"         "s__skinneri"         "s__glacialis"       
##  [97] "s__cygneicollum"     "s__pullulans"        "s__crocea"          
## [100] "s__globulifera"      "s__silvestris"       "s__variabilis"      
## [103] "s__griseoviride"     "s__nitrata"          "s__sphaeroides"     
## [106] "s__renispora"        "s__honrubiae"        "s__eucalyptorum"    
## [109] "s__sindonia"         "s__grovesii"         "s__piceae"          
## [112] "s__stuposa"          "s__moravica"         "s__anomalovelatus"  
## [115] "s__miyagiana"        "s__pilosella"        "s__flavidum"        
## [118] "s__glutinosum"       "s__fallax"           "s__falcata"         
## [121] "s__rhododendri"      "s__fusispora"        "s__scaurus"         
## [124] "s__schulzeri"        "s__lauri"            "s__serotinus"       
## [127] "s__alliacea"         "s__cylichnium"       "s__alpina"          
## [130] "s__miyabei"          "s__rosea"            "s__albicastaneus"   
## [133] "s__sepiacea"         "s__fumosibrunneus"   "s__spinosum"        
## [136] "s__hyalocuspica"     "s__boeremae"         "s__epicalamia"      
## [139] "s__terrestris"       "s__rimosissimus"     "s__bombacina"       
## [142] "s__physodes"         "s__verzuoliana"      "s__acerina"         
## [145] "s__entomopaga"       "s__xylopini"         "s__umbrosum"        
## [148] "s__physaroides"

We obtained 546 different ASVs, of which 147 have been identified at the species level.
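
A minimal sketch to double-check these numbers from the taxonomy table:

nrow(taxotab)                                 # number of non-chimeric ASVs
length(unique(na.omit(unname(taxotab[,7]))))  # number of unique species-level names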

CHALLENGE

Can you find the number of different families?


Nota Bene