Abstract

High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or OTU composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, whether parametric or nonparametric. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2, structSSI and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests and nonparametric testing using community networks and the ggnetwork package.

Introduction

The microbiome is formed of the ecological communities of microorganisms that dominate the living world. Bacteria can now be identified through the use of next generation sequencing applied at several levels. Shotgun sequencing of all bacteria in a sample delivers knowledge of all the genes present. Here we will only be interested in the identification and quantification of individual taxa (or species) through a ‘fingerprint gene’ called 16S rRNA, which is present in all bacteria. This gene contains several variable regions that can be used to identify the different taxa.

Previous standard workflows depended on clustering all 16S rRNA sequences (generated by next-generation amplicon sequencing) that occur within a 97% radius of similarity and then assigning these to ‘OTUs’ from reference trees (Caporaso et al. 2010; Schloss et al. 2009). These approaches do not use all of the data; in particular, sequence quality information and the statistical information available on the reads are not incorporated into the assignments.

In contrast, the de novo read counts used here will be constructed through the incorporation of both the quality scores and sequence frequencies in a probabilistic noise model for nucleotide transitions. For more details on the algorithmic implementation of this step see (Benjamin J Callahan et al. 2016).

After filtering the sequences and removing the chimeras, the data are compared to a standard database of bacteria and labeled. In this workflow, we have used the labeled sequences to build a de novo phylogenetic tree with the phangorn package.

The key step in the sequence analysis is the manner in which reads are denoised and assembled into groups we have chosen to call ASVs (Amplicon Sequence Variants) (Callahan, McMurdie, and Holmes 2017) instead of the traditional OTUs (Operational Taxonomic Units).

A published (but essentially similar) version of this workflow, including reviewer reports and comments, is available on F1000Research (Ben J Callahan et al. 2016).

Extensive documentation and tutorial pages are available for dada2 and phyloseq. If you have questions about this workflow, please start by consulting the relevant GitHub issues pages for dada2 and phyloseq; if the answer is not available there, please post to the issues pages or to the Bioconductor support forum. Note the posting guide for crafting an optimal question for the support site.

This is a workflow for denoising, filtering, performing data transformations, visualization, supervised learning analyses, community network tests, hierarchical testing and linear models. We provide all the code and give several examples of different types of analyses and use-cases. There are often many different objectives in experiments involving microbiome data, and we will only give a flavor of what is possible once the data have been imported into R.

In addition, the code can be easily adapted to accommodate batch effects, covariates and multiple experimental factors.

This workflow is based on software packages from the open-source Bioconductor project (Huber et al. 2015) and some CRAN packages.

We provide all steps necessary from the denoising and identification of the reads input as raw sequences as fastq files to the comparative testing and multivariate analyses of the samples and analyses of the abundances according to multiple available covariates.

Methods

Amplicon bioinformatics: from raw reads to tables

This section demonstrates the “full stack” of amplicon bioinformatics: construction of the sample-by-sequence feature table from the raw reads, assignment of taxonomy, and creation of a phylogenetic tree relating the sample sequences.

First we load the necessary packages.

library("knitr")
library("BiocStyle")
.cran_packages <- c("ggplot2", "gridExtra")
.bioc_packages <- c("dada2", "phyloseq", "DECIPHER", "phangorn")
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
   install.packages(.cran_packages[!.inst])
}
.inst <- .bioc_packages %in% installed.packages()
if(any(!.inst)) {
   source("http://bioconductor.org/biocLite.R")
   biocLite(.bioc_packages[!.inst], ask = F)
}
# Load packages into session, and print package version
sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE)
##   ggplot2 gridExtra     dada2  phyloseq  DECIPHER  phangorn 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE
set.seed(100)
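Note that the biocLite() installer used above has since been superseded in current Bioconductor releases; a hedged equivalent, assuming a recent version of R, is:

# Current Bioconductor releases use BiocManager instead of biocLite()
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(.bioc_packages[!.inst])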

The data we will analyze here are highly-overlapping Illumina MiSeq 2x250 amplicon sequences from the V4 region of the 16S rRNA gene (Kozich et al. 2013). These 360 fecal samples were collected from 12 mice longitudinally over the first year of life, together with one mock community control, to investigate the development and stabilization of the murine microbiome (Schloss et al. 2012). The data should be downloaded from the data location and unzipped; for now just consider them paired-end fastq files to be processed. A hedged download sketch follows; then define the path variable below so that it points to the extracted directory on your machine.
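If you still need to fetch and extract the archive, a minimal sketch using only base R is shown here; the URL is a placeholder and must be replaced by the actual location of the MiSeq SOP archive:

# Placeholder URL -- substitute the real address of the MiSeq SOP data archive
download.file("https://example.org/MiSeqSOPData.zip", destfile = "MiSeqSOPData.zip")
unzip("MiSeqSOPData.zip") # typically extracts to a MiSeq_SOP/ directory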

miseq_path <- "./MiSeq_SOP" # CHANGE to the directory containing the fastq files after unzipping.
list.files(miseq_path)
##  [1] "F3D0_S188_L001_R1_001.fastq"   "F3D0_S188_L001_R2_001.fastq"  
##  [3] "F3D1_S189_L001_R1_001.fastq"   "F3D1_S189_L001_R2_001.fastq"  
##  [5] "F3D141_S207_L001_R1_001.fastq" "F3D141_S207_L001_R2_001.fastq"
##  [7] "F3D142_S208_L001_R1_001.fastq" "F3D142_S208_L001_R2_001.fastq"
##  [9] "F3D143_S209_L001_R1_001.fastq" "F3D143_S209_L001_R2_001.fastq"
## [11] "F3D144_S210_L001_R1_001.fastq" "F3D144_S210_L001_R2_001.fastq"
## [13] "F3D145_S211_L001_R1_001.fastq" "F3D145_S211_L001_R2_001.fastq"
## [15] "F3D146_S212_L001_R1_001.fastq" "F3D146_S212_L001_R2_001.fastq"
## [17] "F3D147_S213_L001_R1_001.fastq" "F3D147_S213_L001_R2_001.fastq"
## [19] "F3D148_S214_L001_R1_001.fastq" "F3D148_S214_L001_R2_001.fastq"
## [21] "F3D149_S215_L001_R1_001.fastq" "F3D149_S215_L001_R2_001.fastq"
## [23] "F3D150_S216_L001_R1_001.fastq" "F3D150_S216_L001_R2_001.fastq"
## [25] "F3D2_S190_L001_R1_001.fastq"   "F3D2_S190_L001_R2_001.fastq"  
## [27] "F3D3_S191_L001_R1_001.fastq"   "F3D3_S191_L001_R2_001.fastq"  
## [29] "F3D5_S193_L001_R1_001.fastq"   "F3D5_S193_L001_R2_001.fastq"  
## [31] "F3D6_S194_L001_R1_001.fastq"   "F3D6_S194_L001_R2_001.fastq"  
## [33] "F3D7_S195_L001_R1_001.fastq"   "F3D7_S195_L001_R2_001.fastq"  
## [35] "F3D8_S196_L001_R1_001.fastq"   "F3D8_S196_L001_R2_001.fastq"  
## [37] "F3D9_S197_L001_R1_001.fastq"   "F3D9_S197_L001_R2_001.fastq"  
## [39] "filtered"                      "HMP_MOCK.v35.fasta"           
## [41] "Mock_S280_L001_R1_001.fastq"   "Mock_S280_L001_R2_001.fastq"  
## [43] "mouse.dpw.metadata"            "mouse.time.design"            
## [45] "stability.batch"               "stability.files"

Filter and Trim

We begin by filtering out low-quality sequencing reads and trimming the reads to a consistent length. While generally recommended filtering and trimming parameters serve as a starting point, no two datasets are identical and therefore it is always worth inspecting the quality of the data before proceeding.

First we read in the names of the fastq files, and perform some string manipulation to get lists of the forward and reverse fastq files in matched order:

# Sort ensures forward/reverse reads are in same order
fnFs <- sort(list.files(miseq_path, pattern="_R1_001.fastq"))
fnRs <- sort(list.files(miseq_path, pattern="_R2_001.fastq"))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sampleNames <- sapply(strsplit(fnFs, "_"), `[`, 1)
# Specify the full path to the fnFs and fnRs
fnFs <- file.path(miseq_path, fnFs)
fnRs <- file.path(miseq_path, fnRs)
fnFs[1:3]
## [1] "./MiSeq_SOP/F3D0_S188_L001_R1_001.fastq"   "./MiSeq_SOP/F3D1_S189_L001_R1_001.fastq"  
## [3] "./MiSeq_SOP/F3D141_S207_L001_R1_001.fastq"
fnRs[1:3]
## [1] "./MiSeq_SOP/F3D0_S188_L001_R2_001.fastq"   "./MiSeq_SOP/F3D1_S189_L001_R2_001.fastq"  
## [3] "./MiSeq_SOP/F3D141_S207_L001_R2_001.fastq"

Most Illumina sequencing data shows a trend of decreasing average quality towards the end of sequencing reads.

The first two forward reads:

plotQualityProfile(fnFs[1:2])

The first two reverse reads:

plotQualityProfile(fnRs[1:2])

Here, the forward reads maintain high quality throughout, while the quality of the reverse reads drops significantly at about position 160. Therefore, we choose to truncate the forward reads at position 240, and the reverse reads at position 160. We also choose to trim the first 10 nucleotides of each read based on empirical observations across many Illumina datasets that these base positions are particularly likely to contain pathological errors.

We define the filenames for the filtered fastq.gz files:

filt_path <- file.path(miseq_path, "filtered") # Place filtered files in filtered/ subdirectory
if(!file_test("-d", filt_path)) dir.create(filt_path)
filtFs <- file.path(filt_path, paste0(sampleNames, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sampleNames, "_R_filt.fastq.gz"))

We combine these trimming parameters with standard filtering parameters, the most important being the enforcement of a maximum of 2 expected errors per-read (Edgar and Flyvbjerg 2015). Trimming and filtering is performed on paired reads jointly, i.e. both reads must pass the filter for the pair to pass.
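The expected errors for a read are computed from its quality scores as the sum of the implied per-base error probabilities; a toy illustration (the quality values here are made up):

q <- c(38, 38, 35, 32, 20, 12)  # hypothetical Phred quality scores for one short read
sum(10^(-q / 10))               # expected number of errors; reads exceeding maxEE are discarded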

Filter the forward and reverse reads:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,160),
              maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
              compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE
head(out)
##                               reads.in reads.out
## F3D0_S188_L001_R1_001.fastq       7793      7113
## F3D1_S189_L001_R1_001.fastq       5869      5299
## F3D141_S207_L001_R1_001.fastq     5958      5463
## F3D142_S208_L001_R1_001.fastq     3183      2914
## F3D143_S209_L001_R1_001.fastq     3178      2941
## F3D144_S210_L001_R1_001.fastq     4827      4312

Infer sequence variants

After filtering, the typical amplicon bioinformatics workflow clusters sequencing reads into operational taxonomic units (OTUs): groups of sequencing reads that differ by less than a fixed dissimilarity threshold. Here we instead use the high-resolution DADA2 method to infer amplicon sequence variants (ASVs) exactly, without imposing any arbitrary threshold, and thereby resolving variants that differ by as little as one nucleotide (Benjamin J Callahan et al. 2016).

The sequence data is imported into R from demultiplexed fastq files (i.e. one fastq for each sample) and simultaneously dereplicated to remove redundancy. We name the resulting derep-class objects by their sample name.

Dereplication

Dereplication combines all identical sequencing reads into “unique sequences” with a corresponding “abundance”: the number of reads with that unique sequence. Dereplication substantially reduces computation time by eliminating redundant comparisons.

derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sampleNames
names(derepRs) <- sampleNames

The DADA2 method relies on a parameterized model of substitution errors to distinguish sequencing errors from real biological variation. Because error rates can (and often do) vary substantially between sequencing runs and PCR protocols, the model parameters can be discovered from the data itself using a form of unsupervised learning in which sample inference is alternated with parameter estimation until both are jointly consistent.

Parameter learning is computationally intensive, as it requires multiple iterations of the sequence inference algorithm, and therefore it is often useful to estimate the error rates from a (sufficiently large) subset of the data.

errF <- learnErrors(filtFs, multithread=TRUE)
## Initializing error rates to maximum possible estimate.
## Sample 1 - 7113 reads in 1979 unique sequences.
## Sample 2 - 5299 reads in 1639 unique sequences.
## Sample 3 - 5463 reads in 1477 unique sequences.
## Sample 4 - 2914 reads in 904 unique sequences.
## Sample 5 - 2941 reads in 939 unique sequences.
## Sample 6 - 4312 reads in 1267 unique sequences.
## Sample 7 - 6741 reads in 1756 unique sequences.
## Sample 8 - 4560 reads in 1438 unique sequences.
## Sample 9 - 15637 reads in 3590 unique sequences.
## Sample 10 - 11413 reads in 2762 unique sequences.
## Sample 11 - 12017 reads in 3021 unique sequences.
## Sample 12 - 5032 reads in 1566 unique sequences.
## Sample 13 - 18075 reads in 3707 unique sequences.
## Sample 14 - 6250 reads in 1479 unique sequences.
## Sample 15 - 4052 reads in 1195 unique sequences.
## Sample 16 - 7369 reads in 1832 unique sequences.
## Sample 17 - 4765 reads in 1183 unique sequences.
## Sample 18 - 4871 reads in 1382 unique sequences.
## Sample 19 - 6504 reads in 1709 unique sequences.
## Sample 20 - 4314 reads in 897 unique sequences.
##    selfConsist step 2 
##    selfConsist step 3 
##    selfConsist step 4 
##    selfConsist step 5 
## 
## 
## Convergence after  5  rounds.
## Total reads used:  139642
errR <- learnErrors(filtRs, multithread=TRUE)
## Initializing error rates to maximum possible estimate.
## Sample 1 - 7113 reads in 1660 unique sequences.
## Sample 2 - 5299 reads in 1349 unique sequences.
## Sample 3 - 5463 reads in 1335 unique sequences.
## Sample 4 - 2914 reads in 853 unique sequences.
## Sample 5 - 2941 reads in 880 unique sequences.
## Sample 6 - 4312 reads in 1286 unique sequences.
## Sample 7 - 6741 reads in 1803 unique sequences.
## Sample 8 - 4560 reads in 1265 unique sequences.
## Sample 9 - 15637 reads in 3414 unique sequences.
## Sample 10 - 11413 reads in 2522 unique sequences.
## Sample 11 - 12017 reads in 2771 unique sequences.
## Sample 12 - 5032 reads in 1415 unique sequences.
## Sample 13 - 18075 reads in 3290 unique sequences.
## Sample 14 - 6250 reads in 1390 unique sequences.
## Sample 15 - 4052 reads in 1134 unique sequences.
## Sample 16 - 7369 reads in 1635 unique sequences.
## Sample 17 - 4765 reads in 1084 unique sequences.
## Sample 18 - 4871 reads in 1161 unique sequences.
## Sample 19 - 6504 reads in 1502 unique sequences.
## Sample 20 - 4314 reads in 732 unique sequences.
##    selfConsist step 2 
##    selfConsist step 3 
##    selfConsist step 4 
##    selfConsist step 5 
##    selfConsist step 6 
## 
## 
## Convergence after  6  rounds.
## Total reads used:  139642
plotErrors(errF)
plotErrors(errR)

Figure 1: Estimated Error rates (both forward and reverse)

In order to verify that the error rates have been reasonably well-estimated, we inspect the fit between the observed error rates (black points) and the fitted error rates (black lines) in Figure 1. These figures show the frequencies of each type of transition as a function of the quality score.

The DADA2 sequence inference method can run in two different modes: independent inference by sample (pool=FALSE, the default), and inference from the pooled sequencing reads from all samples (pool=TRUE). Independent inference has the advantage that computation time is linear in the number of samples, and memory requirements are flat with the number of samples; this allows scaling out to datasets of almost unlimited size. Pooled inference is more computationally taxing, and can become intractable for datasets of tens of millions of reads. However, pooling improves the detection of rare variants that were seen just once or twice in an individual sample but many times across all samples. Here we use the default independent inference by sample, which is adequate for a dataset of this size. As of version 1.2, multithreading can be activated with the argument multithread = TRUE, which substantially speeds up this step.
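If pooled inference were preferred instead, for example to increase sensitivity to rare variants, the only change would be the pool argument; a sketch (not run here):

# Not run: pooled inference across all samples (slower and more memory-hungry)
# dadaFs.pooled <- dada(derepFs, err=errF, pool=TRUE, multithread=TRUE)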

dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
## Sample 1 - 7113 reads in 1979 unique sequences.
## Sample 2 - 5299 reads in 1639 unique sequences.
## Sample 3 - 5463 reads in 1477 unique sequences.
## Sample 4 - 2914 reads in 904 unique sequences.
## Sample 5 - 2941 reads in 939 unique sequences.
## Sample 6 - 4312 reads in 1267 unique sequences.
## Sample 7 - 6741 reads in 1756 unique sequences.
## Sample 8 - 4560 reads in 1438 unique sequences.
## Sample 9 - 15637 reads in 3590 unique sequences.
## Sample 10 - 11413 reads in 2762 unique sequences.
## Sample 11 - 12017 reads in 3021 unique sequences.
## Sample 12 - 5032 reads in 1566 unique sequences.
## Sample 13 - 18075 reads in 3707 unique sequences.
## Sample 14 - 6250 reads in 1479 unique sequences.
## Sample 15 - 4052 reads in 1195 unique sequences.
## Sample 16 - 7369 reads in 1832 unique sequences.
## Sample 17 - 4765 reads in 1183 unique sequences.
## Sample 18 - 4871 reads in 1382 unique sequences.
## Sample 19 - 6504 reads in 1709 unique sequences.
## Sample 20 - 4314 reads in 897 unique sequences.
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
## Sample 1 - 7113 reads in 1660 unique sequences.
## Sample 2 - 5299 reads in 1349 unique sequences.
## Sample 3 - 5463 reads in 1335 unique sequences.
## Sample 4 - 2914 reads in 853 unique sequences.
## Sample 5 - 2941 reads in 880 unique sequences.
## Sample 6 - 4312 reads in 1286 unique sequences.
## Sample 7 - 6741 reads in 1803 unique sequences.
## Sample 8 - 4560 reads in 1265 unique sequences.
## Sample 9 - 15637 reads in 3414 unique sequences.
## Sample 10 - 11413 reads in 2522 unique sequences.
## Sample 11 - 12017 reads in 2771 unique sequences.
## Sample 12 - 5032 reads in 1415 unique sequences.
## Sample 13 - 18075 reads in 3290 unique sequences.
## Sample 14 - 6250 reads in 1390 unique sequences.
## Sample 15 - 4052 reads in 1134 unique sequences.
## Sample 16 - 7369 reads in 1635 unique sequences.
## Sample 17 - 4765 reads in 1084 unique sequences.
## Sample 18 - 4871 reads in 1161 unique sequences.
## Sample 19 - 6504 reads in 1502 unique sequences.
## Sample 20 - 4314 reads in 732 unique sequences.

Inspecting the dada-class object returned by dada:

dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 128 sample sequences were inferred from 1979 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, BAND_SIZE = 16, USE_QUALS = TRUE

The DADA2 algorithm inferred 128 real sequence variants from the 1979 unique sequences in the first sample. The dada-class object contains multiple diagnostics about the quality of each inferred sequence variant (see help("dada-class") for more information).

The DADA2 sequence inference step removed (nearly) all substitution and indel errors from the data (Benjamin J Callahan et al. 2016). We now merge together the inferred forward and reverse sequences, removing paired sequences that do not perfectly overlap as a final control against residual errors.

Construct sequence table and remove chimeras

The DADA2 method produces a sequence table that is a higher-resolution analogue of the common “OTU table”, i.e. a sample by sequence feature table valued by the number of times each sequence was observed in each sample.

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtabAll <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))])
table(nchar(getSequences(seqtabAll)))
## 
## 251 252 253 254 255 
##   1  84 182   5   2

Notably, chimeras have not yet been removed. The error model in the sequence inference algorithm does not include a chimera component, and therefore we expect this sequence table to include many chimeric sequences. We now remove chimeric sequences by comparing each inferred sequence to the others in the table, and removing those that can be reproduced by stitching together two more abundant sequences.

seqtabNoC <- removeBimeraDenovo(seqtabAll)

Although exact numbers vary substantially by experimental condition, it is typical that chimeras comprise a substantial fraction of inferred sequence variants, but only a small fraction of all reads. That is what is observed here: chimeras make up about 22% of the inferred sequence variants, but those variants account for only about 4% of the total sequence reads.
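Those two fractions can be checked directly from the dimensions and totals of the sequence tables; a quick sketch:

# Fraction of sequence variants identified as chimeric
1 - ncol(seqtabNoC) / ncol(seqtabAll)
# Fraction of total reads contained in those chimeric variants
1 - sum(seqtabNoC) / sum(seqtabAll)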

Assign taxonomy

One of the benefits of using well-classified marker loci like the 16S rRNA gene is the ability to taxonomically classify the sequence variants. The dada2 package implements the naive Bayesian classifier method for this purpose (Wang et al. 2007). This classifier compares sequence variants to a training set of classified sequences, and here we use the RDP v16 training set (Cole et al. 2009).

The dada2 tutorial website contains formatted training fastas for the RDP training set, GreenGenes clustered at 97% identity, and the Silva reference database. For fungal taxonomy, the General Fasta release files from the UNITE ITS database can be used as is. To follow this workflow, download the rdp_train_set_16.fa.gz file and place it in the directory with the fastq files.

fastaRef <- "./rdp_train_set_16.fa.gz"
taxTab <- assignTaxonomy(seqtabNoC, refFasta = fastaRef, multithread=TRUE)
unname(head(taxTab))
##      [,1]       [,2]            [,3]          [,4]            [,5]                 [,6]         
## [1,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Porphyromonadaceae" NA           
## [2,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Porphyromonadaceae" NA           
## [3,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Porphyromonadaceae" NA           
## [4,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Porphyromonadaceae" "Barnesiella"
## [5,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Bacteroidaceae"     "Bacteroides"
## [6,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Porphyromonadaceae" NA

Construct phylogenetic tree

Phylogenetic relatedness is commonly used to inform downstream analyses, especially the calculation of phylogeny-aware distances between microbial communities. The DADA2 sequence inference method is reference-free, so we must construct the phylogenetic tree relating the inferred sequence variants de novo. We begin by performing a multiple-alignment using the DECIPHER R package (Wright 2015).

seqs <- getSequences(seqtabNoC)
names(seqs) <- seqs # This propagates to the tip labels of the tree
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA,verbose=FALSE)

The phangorn R package is then used to construct a phylogenetic tree. Here we first construct a neighbor-joining tree, and then fit a GTR+G+I (generalized time-reversible, with Gamma rate variation and a proportion of invariant sites) maximum likelihood tree using the neighbor-joining tree as a starting point.

phangAlign <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phangAlign)
treeNJ <- NJ(dm) # Note, tip order != sequence order
fit = pml(treeNJ, data=phangAlign)
fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
        rearrangement = "stochastic", control = pml.control(trace = 0))
detach("package:phangorn", unload=TRUE)

Combine data into a phyloseq object

The package phyloseq organizes and synthesizes the different data types from a typical amplicon sequencing experiment into a single data object that can be easily manipulated. The last bit of information needed is the sample data contained in a .csv file. This can be downloaded from github:

samdf <- read.csv("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/MIMARKS_Data_combined.csv",header=TRUE)
samdf$SampleID <- paste0(gsub("00", "", samdf$host_subject_id), "D", samdf$age-21)
samdf <- samdf[!duplicated(samdf$SampleID),] # Remove duplicate entries for reverse reads
rownames(seqtabAll) <- gsub("124", "125", rownames(seqtabAll)) # Fix discrepancy
all(rownames(seqtabAll) %in% samdf$SampleID) # TRUE
## [1] TRUE
rownames(samdf) <- samdf$SampleID
keep.cols <- c("collection_date", "biome", "target_gene", "target_subfragment",
"host_common_name", "host_subject_id", "age", "sex", "body_product", "tot_mass",
"diet", "family_relationship", "genotype", "SampleID") 
samdf <- samdf[rownames(seqtabAll), keep.cols]

The full suite of data for this study – the sample-by-sequence feature table, the sample metadata, the sequence taxonomies, and the phylogenetic tree – can now be combined into a single object.

ps <- phyloseq(otu_table(seqtabNoC, taxa_are_rows=FALSE), 
               sample_data(samdf), 
               tax_table(taxTab),phy_tree(fitGTR$tree))
ps <- prune_samples(sample_names(ps) != "Mock", ps) # Remove mock sample
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 215 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 215 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 215 tips and 213 internal nodes ]

Using phyloseq

phyloseq (McMurdie and Holmes 2013) is an R package to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs) or, more appropriately, denoised; it is most useful when there is also associated sample data, phylogeny, and/or taxonomic assignment of each taxon. phyloseq leverages and builds upon many of the tools available in R for ecology and phylogenetic analysis (ape, vegan, ade4), while also using advanced/flexible graphic systems (ggplot2) to easily produce publication-quality graphics of complex phylogenetic data. The phyloseq package uses a specialized system of S4 data classes to store all related phylogenetic sequencing data as a single, self-consistent, self-describing experiment-level object, making it easier to share data and reproduce analyses. In general, phyloseq seeks to facilitate the use of R for efficient interactive and reproducible analysis of amplicon count data jointly with important sample covariates.

This tutorial shows a useful example workflow, but many more analyses are available to you in phyloseq, and R in general, than can fit in a single workflow. The phyloseq home page is a good place to begin browsing additional phyloseq documentation, as are the three vignettes included within the package, and linked directly at the phyloseq release page on Bioconductor.

Loading the data

Many use cases result in the need to import and combine different data into a phyloseq class object. This can be done using the import_biom function to read recent QIIME-format files; older files can still be imported with import_qiime. More complete details can be found on the phyloseq FAQ page.
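As a hedged sketch (the file names here are placeholders standing in for your own QIIME outputs), a BIOM-format table with an associated tree and reference sequences could be imported as:

# Hypothetical file names; replace with your own QIIME outputs
ps_qiime <- import_biom("otu_table.biom",
                        treefilename = "rep_set.tre",
                        refseqfilename = "rep_set.fna")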

In the previous section the results of dada2 sequence processing were organized into a phyloseq object. We have actually run dada2 on a larger set of samples from the same data source. This object was also saved in R-native serialized RDS format. We will re-load this here for completeness as the initial object ps. If you have not downloaded the whole repository, you can access the ps file through github:

ps_connect <-url("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/ps.rds")
ps = readRDS(ps_connect)
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 389 taxa and 360 samples ]
## sample_data() Sample Data:       [ 360 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 389 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 389 tips and 387 internal nodes ]

Shiny-phyloseq

It can be beneficial to start the data exploration process interactively; this often saves time in detecting outliers and specific features of the data. Shiny-phyloseq (McMurdie and Holmes 2015) is an interactive web application that provides a graphical user interface to the phyloseq package. The object just loaded into the R session in this workflow is suitable for graphical exploration with Shiny-phyloseq.
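A minimal way to launch it, assuming the shiny package is installed and that the application repository name and location are unchanged, is:

# Launch the Shiny-phyloseq GUI directly from its GitHub repository (name assumed)
shiny::runGitHub("shiny-phyloseq", "joey711")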

Filtering

phyloseq provides useful tools for filtering, subsetting, and agglomerating taxa – a task that is often appropriate or even necessary for effective analysis of microbiome count data. In this subsection, we graphically explore the prevalence of taxa in the example dataset, and demonstrate how this can be used as a filtering criterion. One of the reasons to filter in this way is to avoid spending much time analyzing taxa that were seen only rarely among samples. This also turns out to be a useful filter of noise (taxa that are actually just artifacts of the data collection process), a step that should probably be considered essential for datasets constructed via heuristic OTU-clustering methods, which are notoriously prone to generating spurious taxa.

Taxonomic Filtering

In many biological settings, the set of all organisms from all samples is well represented in the available taxonomic reference database. When (and only when) this is the case, it is reasonable or even advisable to filter taxonomic features for which a high-rank taxonomy could not be assigned. Such ambiguous features in this setting are almost always sequence artifacts that don’t exist in nature. It should be obvious that such a filter is not appropriate for samples from poorly characterized or novel specimens, at least until the possibility of taxonomic novelty can be satisfactorily rejected. Phylum is a useful taxonomic rank to consider using for this purpose, but others may work effectively for your data.

To begin, create a table of read counts for each Phylum present in the dataset.

# Show available ranks in the dataset
rank_names(ps)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"
# Create table, number of features for each phyla
table(tax_table(ps)[, "Phylum"], exclude = NULL)
## 
##              Actinobacteria               Bacteroidetes Candidatus_Saccharibacteria 
##                          13                          23                           1 
##   Cyanobacteria/Chloroplast         Deinococcus-Thermus                  Firmicutes 
##                           4                           1                         327 
##                Fusobacteria              Proteobacteria                 Tenericutes 
##                           1                          11                           1 
##             Verrucomicrobia                        <NA> 
##                           1                           6

This shows a few phyla for which only one feature was observed. Those may be worth filtering, and we’ll check that next. First, notice that in this case, six features were annotated with a Phylum of NA. These features are probably artifacts in a dataset like this, and should be removed.

The following ensures that features with ambiguous phylum annotation are also removed. Note the flexibility in defining strings that should be considered ambiguous annotation.

ps <- subset_taxa(ps, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized"))

A useful next step is to explore feature prevalence in the dataset, which we will define here as the number of samples in which a taxon appears at least once.

# Compute prevalence of each feature, store as data.frame
prevdf = apply(X = otu_table(ps),
               MARGIN = ifelse(taxa_are_rows(ps), yes = 1, no = 2),
               FUN = function(x){sum(x > 0)})
# Add taxonomy and total read counts to this data.frame
prevdf = data.frame(Prevalence = prevdf,
                    TotalAbundance = taxa_sums(ps),
                    tax_table(ps))

Are there phyla that are comprised of mostly low-prevalence features? Compute the total and average prevalences of the features in each phylum.

plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))})
##                         Phylum         1     2
## 1               Actinobacteria 120.15385  1562
## 2                Bacteroidetes 265.52174  6107
## 3  Candidatus_Saccharibacteria 280.00000   280
## 4    Cyanobacteria/Chloroplast  64.25000   257
## 5          Deinococcus-Thermus  52.00000    52
## 6                   Firmicutes 179.24771 58614
## 7                 Fusobacteria   2.00000     2
## 8               Proteobacteria  59.09091   650
## 9                  Tenericutes 234.00000   234
## 10             Verrucomicrobia 104.00000   104

Deinococcus-Thermus appeared in just over one percent of samples, and Fusobacteria appeared in just 2 samples total. In some cases it might be worthwhile to explore these two phyla in more detail despite this (though probably not Fusobacteria’s two samples). For the purposes of this example, though, they will be filtered from the dataset.

# Define phyla to filter
filterPhyla = c("Fusobacteria", "Deinococcus-Thermus")
# Filter entries with unidentified Phylum.
ps1 = subset_taxa(ps, !Phylum %in% filterPhyla)
ps1
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 381 taxa and 360 samples ]
## sample_data() Sample Data:       [ 360 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 381 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 381 tips and 379 internal nodes ]

Prevalence Filtering

The previous filtering steps are considered supervised, because they relied on prior information that is external to this experiment (a taxonomic reference database). This next filtering step is completely unsupervised, relying only on the data in this experiment, and a parameter that we will choose after exploring the data. Thus, this filtering step can be applied even in settings where taxonomic annotation is unavailable or unreliable.

First, explore the relationship of prevalence and total read count for each feature. Sometimes this reveals outliers that should probably be removed, and also provides insight into the ranges of either feature that might be useful. This aspect depends quite a lot on the experimental design and goals of the downstream inference, so keep these in mind. It may even be the case that different types of downstream inference require different choices here. There is no reason to expect ahead of time that a single filtering workflow is appropriate for all analyses.

# Subset to the remaining phyla
prevdf1 = subset(prevdf, Phylum %in% get_taxa_unique(ps1, "Phylum"))
ggplot(prevdf1, aes(TotalAbundance, Prevalence / nsamples(ps),color=Phylum)) +
  # Include a guess for parameter
  geom_hline(yintercept = 0.05, alpha = 0.5, linetype = 2) +  geom_point(size = 2, alpha = 0.7) +
  scale_x_log10() +  xlab("Total Abundance") + ylab("Prevalence [Frac. Samples]") +
  facet_wrap(~Phylum) + theme(legend.position="none")

Figure 2: Taxa prevalence versus total counts

Each point in Figure 2 is a different taxon. Exploration of the data in this way is often useful for selecting filtering parameters, like the minimum prevalence criterion that we will use to filter the data below.

Sometimes a natural separation in the dataset reveals itself, or at least a conservative choice that lies in a stable region, for which small changes to the choice would have minor or no effect on the biological interpretation (stability). Here no natural separation is immediately evident, but it looks like we might reasonably define a prevalence threshold in a range of zero to ten percent or so. Take care that this choice does not introduce bias into a downstream analysis of association or differential abundance.

The following uses five percent of all samples as the prevalence threshold.

# Define prevalence threshold as 5% of total samples
prevalenceThreshold = 0.05 * nsamples(ps)
prevalenceThreshold
## [1] 18
# Execute prevalence filter, using `prune_taxa()` function
keepTaxa = rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)]
ps2 = prune_taxa(keepTaxa, ps)

Agglomerate taxa

When there is known to be a lot of species or sub-species functional redundancy in a microbial community, it might be useful to agglomerate the data features corresponding to closely related taxa. Ideally we would know the functional redundancies perfectly ahead of time, in which case we would agglomerate taxa using those defined relationships and the merge_taxa() function in phyloseq (a hedged sketch follows). That kind of exquisite functional data is usually not available, and different pairs of microbes will have different sets of overlapping functions, complicating the matter of defining appropriate grouping criteria.
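As a purely illustrative sketch (the choice of which features to merge is arbitrary here), two features could be combined into one with merge_taxa():

# Hypothetical example: merge the first two features of ps2 into a single taxon,
# keeping the first one as the representative (archetype)
ps_merged <- merge_taxa(ps2, taxa_names(ps2)[1:2])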

While not necessarily the most useful or functionally-accurate criterion for grouping microbial features (sometimes far from accurate), taxonomic agglomeration has the advantage of being much easier to define ahead of time. This is because taxonomies are usually defined with a comparatively simple tree-like graph structure that has a fixed number of internal nodes, called “ranks”. This structure is simple enough for the phyloseq package to represent taxonomies as a table of taxonomy labels. Taxonomic agglomeration groups all the “leaves” in the hierarchy that descend from the user-prescribed agglomerating rank; this is sometimes called ‘glomming’.

The following example code shows how one would combine all features that descend from the same genus.

# How many genera would be present after filtering?
length(get_taxa_unique(ps2, taxonomic.rank = "Genus"))
## [1] 49
ps3 = tax_glom(ps2, "Genus", NArm = TRUE)

If taxonomy is not available or not reliable, tree-based agglomeration is a “taxonomy-free” alternative to combine data features corresponding to closely-related taxa. In this case, rather than taxonomic rank, the user specifies a tree height corresponding to the phylogenetic distance between features that should define their grouping. This is very similar to “OTU Clustering”, except that in many OTU Clustering algorithms the sequence distance being used does not have the same (or any) evolutionary definition.

h1 = 0.4
ps4 = tip_glom(ps2, h = h1)

Here phyloseq’s plot_tree() function is used to compare the original unfiltered data, the tree after taxonomic agglomeration, and the tree after phylogenetic agglomeration. These are stored as separate plot objects, then rendered together in one combined graphic using gridExtra::grid.arrange.

multiPlotTitleTextSize = 15
p2tree = plot_tree(ps2, method = "treeonly",
                   ladderize = "left",
                   title = "Before Agglomeration") +
  theme(plot.title = element_text(size = multiPlotTitleTextSize))
p3tree = plot_tree(ps3, method = "treeonly",
                   ladderize = "left", title = "By Genus") +
  theme(plot.title = element_text(size = multiPlotTitleTextSize))
p4tree = plot_tree(ps4, method = "treeonly",
                   ladderize = "left", title = "By Height") +
  theme(plot.title = element_text(size = multiPlotTitleTextSize))
# group plots together
grid.arrange(nrow = 1, p2tree, p3tree, p4tree)

Figure 3: Different types of agglomeration

Figure 3 shows the original tree on the left, taxonomic agglomeration at Genus rank in the middle and phylogenetic agglomeration at a fixed distance of 0.4 on the right.

Abundance value transformation

It is usually necessary to transform microbiome count data to account for differences in library size, variance, scale, etc. The phyloseq package provides a flexible interface for defining new functions to accomplish these transformations of the abundance values via the function transform_sample_counts(). The first argument to this function is the phyloseq object you want to transform, and the second argument is an R function that defines the transformation. The R function is applied sample-wise, expecting that the first unnamed argument is a vector of taxa counts in the same order as the phyloseq object. Additional arguments are passed on to the function specified in the second argument, providing an explicit means to include pre-computed values, previously defined parameters/thresholds, or any other object that might be appropriate for computing the transformed values of interest.
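For example, a pre-chosen detection threshold could be passed through those additional arguments; a hedged sketch (the threshold value here is arbitrary):

# Zero out counts below a chosen detection threshold in every sample;
# 'thresh' is passed on to the transformation function by transform_sample_counts()
ps3thresh <- transform_sample_counts(ps3, function(x, thresh) ifelse(x < thresh, 0, x), thresh = 5)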

This example begins by defining a custom plot function, plot_abundance(), that uses phyloseq’s psmelt() function to define a relative abundance graphic. We will use this to more easily compare differences in scale and distribution of the abundance values in our phyloseq object before and after transformation.

plot_abundance = function(physeq,title = "",
                          Facet = "Order", Color = "Phylum"){
  # Arbitrary subset, based on Phylum, for plotting
  p1f = subset_taxa(physeq, Phylum %in% c("Firmicutes"))
  mphyseq = psmelt(p1f)
  mphyseq <- subset(mphyseq, Abundance > 0)
  ggplot(data = mphyseq, mapping = aes_string(x = "sex",y = "Abundance",
                              color = Color, fill = Color)) +
    geom_violin(fill = NA) +
    geom_point(size = 1, alpha = 0.3,
               position = position_jitter(width = 0.3)) +
    facet_wrap(facets = Facet) + scale_y_log10()+
    theme(legend.position="none")
}

The transformation in this case converts the counts from each sample into their frequencies, often referred to as proportions or relative abundances. This function is so simple that it is easiest to define it within the function call to transform_sample_counts().

# Transform to relative abundance. Save as new object.
ps3ra = transform_sample_counts(ps3, function(x){x / sum(x)})

Now we plot the abundance values before and after transformation.

plotBefore = plot_abundance(ps3,"")
plotAfter = plot_abundance(ps3ra,"")
# Combine each plot into one graphic.
grid.arrange(nrow = 2,  plotBefore, plotAfter)

Figure 4: Comparison of original abundances with transformed data

Figure 4 shows the comparison of original abundances (top panel) and relative abundances (lower).

Subset by taxonomy

Notice on the previous plot that Lactobacillales appears to be a taxonomic Order with a bimodal abundance profile in the data. We can check for a taxonomic explanation of this pattern by plotting just that taxonomic subset of the data. For this, we subset with the subset_taxa() function, and then specify a more precise taxonomic rank to the Facet argument of the plot_abundance() function that we defined above.

psOrd = subset_taxa(ps3ra, Order == "Lactobacillales")
plot_abundance(psOrd, Facet = "Genus", Color = NULL)

Figure 5: Violin plot of relative abundances of Lactobacillales

Figure 5 shows the relative abundances of the Lactobacillales taxonomic Order, grouped by host sex and genus. Here it is clear that the apparent bimodal distribution of Lactobacillales on the previous plot was the result of a mixture of two different genera, with the typical Lactobacillus relative abundance much larger than that of Streptococcus.

At this stage in the workflow, after converting raw reads to interpretable species abundances, and after filtering and transforming these abundances to focus attention on scientifically meaningful quantities, we are in a position to consider more careful statistical analysis. R is an ideal environment for performing these analyses, as it has an active community of package developers building simple interfaces to sophisticated techniques. As a variety of methods are available, there is no need to commit to any rigid analysis strategy a priori. Further, the ability to easily call packages without reimplementing methods frees researchers to iterate rapidly through alternative analysis ideas. The advantage of performing this full workflow in R is that this transition from bioinformatics to statistics is effortless.

Let’s start by installing a few packages that are available for these complementary analyses:

.cran_packages <- c( "shiny","miniUI", "caret", "pls", "e1071", "ggplot2", "randomForest", "dplyr", "ggrepel", "nlme", "devtools",
                  "reshape2", "PMA", "structSSI", "ade4",
                  "ggnetwork", "intergraph", "scales")
.github_packages <- c("jfukuyama/phyloseqGraphTest")
.bioc_packages <- c("genefilter", "impute")
# Install CRAN packages (if not already installed)
.inst <- .cran_packages %in% installed.packages()
if (any(!.inst)){
  install.packages(.cran_packages[!.inst],repos = "http://cran.rstudio.com/")
}
.inst <- .github_packages %in% installed.packages()
if (any(!.inst)){
  devtools::install_github(.github_packages[!.inst])
}

.inst <- .bioc_packages %in% installed.packages()
if(any(!.inst)){
  source("http://bioconductor.org/biocLite.R")
  biocLite(.bioc_packages[!.inst])
}

We back these claims by illustrating several analyses on the mouse data prepared above. We experiment with several flavors of exploratory ordination before shifting to more formal testing and modeling, explaining the settings in which the different points of view are most appropriate. Finally, we provide example analyses of multitable data, using a study in which both metabolomic and microbial abundance measurements were collected on the same samples, to demonstrate that the general workflow presented here can be adapted to the multitable setting.

Preprocessing

Before doing the multivariate projections, we will add a few columns to our sample data, which can then be used to annotate plots. From Figure 6, we see that the ages of the mice come in a couple of groups, and so we make a categorical variable corresponding to young, middle-aged, and old mice. We also record the total number of counts seen in each sample and log-transform the data as an approximate variance stabilizing transformation.

qplot(sample_data(ps)$age, geom = "histogram",binwidth=20) + xlab("age")

Figure 6: Histogram of age groupings

Figure 6 shows that the age covariate belongs to three separate clusters.

qplot(log10(rowSums(otu_table(ps))),binwidth=0.2) +
  xlab("Logged counts-per-sample")

Figure 7: Histograms comparing raw and log transformed read depths

These preliminary plots suggest certain preprocessing steps. The histogram in Figure 6 motivates the creation of a new categorical variable, binning age into one of the three peaks.

The histogram in Figure 7 suggests that a \(\log\left(1 + x\right)\) transformation might be sufficient for normalizing the abundance data for the exploratory analyses.

In fact this transformation is not sufficient for testing purposes; when performing differential abundance testing we recommend the variance stabilizing transformations available in DESeq2 through the phyloseq_to_deseq2 function (see the phyloseq_to_deseq2 tutorial), as sketched below.
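A hedged sketch of that variance-stabilizing route (the design formula is illustrative and uses the age_binned variable created in the next code block):

library("DESeq2")
dds <- phyloseq_to_deseq2(ps, design = ~ age_binned)  # illustrative design; age_binned is defined below
dds <- estimateSizeFactors(dds, type = "poscounts")   # robust to the many zero counts in microbiome data
dds <- estimateDispersions(dds)
abund_vst <- getVarianceStabilizedData(dds)           # matrix of variance-stabilized abundances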

As our first step, we look at principal coordinates analysis (PCoA) with either the Bray-Curtis dissimilarity or the weighted Unifrac distance.

sample_data(ps)$age_binned <- cut(sample_data(ps)$age,
                          breaks = c(0, 100, 200, 400))
levels(sample_data(ps)$age_binned) <- list(Young100="(0,100]", Mid100to200="(100,200]", Old200="(200,400]")
sample_data(ps)$family_relationship=gsub(" ","",sample_data(ps)$family_relationship)
pslog <- transform_sample_counts(ps, function(x) log(1 + x))
out.wuf.log <- ordinate(pslog, method = "MDS", distance = "wunifrac")
evals <- out.wuf.log$values$Eigenvalues
plot_ordination(pslog, out.wuf.log, color = "age_binned") +
  labs(col = "Binned Age") +
  coord_fixed(sqrt(evals[2] / evals[1]))

Figure 8: Exploratory ordination analysis with log abundances

Figure 8, showing the ordination on the logged abundance data, reveals a few outliers.

These turn out to be the samples from females 5 and 6 on day 165 and the samples from males 3, 4, 5, and 6 on day 175. We will take them out, since we are mainly interested in the relationships between the non-outlier points.

Before we continue, we should check the two female outliers – they have been taken over by the same OTU/ASV, which has a relative abundance of over 90% in each of them. This is the only time in the entire data set that this ASV has such a high relative abundance – the rest of the time it is below 20%. In particular, the diversity of these two samples is by far the lowest of all the samples.

rel_abund <- t(apply(otu_table(ps), 1, function(x) x / sum(x)))
qplot(rel_abund[, 12], geom = "histogram",binwidth=0.05) +
  xlab("Relative abundance")

Figure 9: The outlier samples are dominated by a single ASV
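To corroborate the low-diversity claim, one hedged check uses phyloseq's estimate_richness():

# Shannon diversity for every sample; the outlier samples should appear at the bottom
alphadiv <- estimate_richness(ps, measures = "Shannon")
head(alphadiv[order(alphadiv$Shannon), , drop = FALSE])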

Different Ordination Projections

As we have seen, an important first step in analyzing microbiome data is to do unsupervised, exploratory analysis. This is simple to do in phyloseq, which provides many distances and ordination methods.

After documenting the outliers, we are going to compute ordinations with these outliers removed and more carefully study the output.

outliers <- c("F5D165", "F6D165", "M3D175", "M4D175", "M5D175", "M6D175")
ps <- prune_samples(!(sample_names(ps) %in% outliers), ps)

We are also going to remove samples with fewer than 1000 reads:

which(!rowSums(otu_table(ps)) > 1000)
## F5D145 M1D149   M1D9 M2D125  M2D19 M3D148 M3D149   M3D3   M3D5   M3D8 
##     69    185    200    204    218    243    244    252    256    260
ps <- prune_samples(rowSums(otu_table(ps)) > 1000, ps)
pslog <- transform_sample_counts(ps, function(x) log(1 + x))

We’ll first perform a PCoA using Bray-Curtis dissimilarity.

out.pcoa.log <- ordinate(pslog,  method = "MDS", distance = "bray")
evals <- out.pcoa.log$values[,1]
plot_ordination(pslog, out.pcoa.log, color = "age_binned",
                  shape = "family_relationship") +
  labs(col = "Binned Age", shape = "Litter")+
  coord_fixed(sqrt(evals[2] / evals[1]))

Figure 10: A PCoA plot using Bray-Curtis between samples

We see that there is a fairly substantial age effect that is consistent between all the mice, male and female, and from different litters.

Next we look at double principal coordinates analysis (DPCoA) (Pavoine, Dufour, and Chessel 2004; Purdom 2010; Fukuyama et al. 2012), a phylogenetic ordination method that provides a biplot representation of both samples and taxonomic categories. We see again that the second axis corresponds to young vs. old mice, and the biplot suggests an interpretation of the second axis: samples that have larger scores on the second axis have more taxa from Bacteroidetes and one subset of Firmicutes.

out.dpcoa.log <- ordinate(pslog, method = "DPCoA")
evals <- out.dpcoa.log$eig
plot_ordination(pslog, out.dpcoa.log, color = "age_binned", label= "SampleID",
                  shape = "family_relationship") +
  labs(col = "Binned Age", shape = "Litter")+
  coord_fixed(sqrt(evals[2] / evals[1]))

Figure 11: A DPCoA plot incorporates phylogenetic information, but is dominated by the first axis

In Figure 11, the first axis explains 75% of the variability, about 9 times that of the second axis; this translates into the elongated form of the ordination plot.

plot_ordination(pslog, out.dpcoa.log, type = "species", color = "Phylum") +
  coord_fixed(sqrt(evals[2] / evals[1]))

Figure 12: Taxa responsible for Axis 1 and 2

Finally, we can look at the results of PCoA with weighted Unifrac. As before, we find that the second axis is associated with an age effect, which is fairly similar to DPCoA. This is not surprising, because both are phylogenetic ordination methods taking abundance into account. However, when we compare biplots, we see that the DPCoA gave a much cleaner interpretation of the second axis, compared to weighted Unifrac.

out.wuf.log <- ordinate(pslog, method = "PCoA", distance ="wunifrac")
evals <- out.wuf.log$values$Eigenvalues
plot_ordination(pslog, out.wuf.log, color = "age_binned",
                  shape = "family_relationship") +
  coord_fixed(sqrt(evals[2] / evals[1])) +
  labs(col = "Binned Age", shape = "Litter")

Figure 13: The sample positions produced by a PCoA using weighted Unifrac

Why are the ordination plots so far from square?

Aspect ratio of ordination plots

In the ordination plots in Figure 8–Figure 14, you may have noticed, as did the reviewers of the first version of the paper, that the maps are not presented as square representations, as is often the case for standard PCoA and PCA plots in the literature.

The reason is that, since we are trying to represent the distances between samples as faithfully as possible, we have to take into account that the second eigenvalue is always smaller than the first, sometimes considerably so; thus we normalize the axis norm ratios to the relevant eigenvalue ratios. This ensures that the variability represented in the plots is shown faithfully.

PCA on ranks

Microbial abundance data is often heavy-tailed, and sometimes it can be hard to identify a transformation that brings the data to normality. In these cases, it can be safer to ignore the raw abundances altogether, and work instead with ranks. We demonstrate this idea using a rank-transformed version of the data to perform PCA. First, we create a new matrix, representing the abundances by their ranks, where the microbe with the smallest abundance in a sample gets mapped to rank 1, the second smallest to rank 2, etc.

abund <- otu_table(pslog)
abund_ranks <- t(apply(abund, 1, rank))

Naively using these ranks could make differences between pairs of low and high abundance microbes comparable. In the case where many bacteria are absent or present at trace amounts, an artificially large difference in rank could occur (Holmes et al. 2011) for minimally abundant taxa. To avoid this, all those microbes with rank below some threshold are set to be tied at 1. The ranks for the other microbes are shifted down, so there is no large gap between ranks.

abund_ranks <- abund_ranks - 329
abund_ranks[abund_ranks < 1] <- 1
library(dplyr)
library(reshape2)
abund_df <- melt(abund, value.name = "abund") %>%
  left_join(melt(abund_ranks, value.name = "rank"))
colnames(abund_df) <- c("sample", "seq", "abund", "rank")

sample_ix <- sample(1:nrow(abund_df), 8)
ggplot(abund_df %>%
         filter(sample %in% abund_df$sample[sample_ix])) +
  geom_point(aes(x = abund, y = rank, col = sample),
             position = position_jitter(width = 0.2), size = 1.5) +
  labs(x = "Abundance", y = "Thresholded rank") +
  scale_color_brewer(palette = "Set2")

Figure 14: Rank threshold transformation

This transformation is illustrated in Figure 14, which shows the association between abundance and rank for a few randomly selected samples; the numbers on the \(y\)-axis are those supplied to PCA.

We can now perform PCA and study the resulting biplot, given in the Figure below. To produce annotation for this figure, we used the following block.

library(ade4)
ranks_pca <- dudi.pca(abund_ranks, scannf = F, nf = 3)
row_scores <- data.frame(li = ranks_pca$li,
                         SampleID = rownames(abund_ranks))
col_scores <- data.frame(co = ranks_pca$co,
                         seq = colnames(abund_ranks))
tax <- tax_table(ps) %>%
  data.frame(stringsAsFactors = FALSE)
tax$seq <- rownames(tax)
main_orders <- c("Clostridiales", "Bacteroidales", "Lactobacillales",
                 "Coriobacteriales")
tax$Order[!(tax$Order %in% main_orders)] <- "Other"
tax$Order <- factor(tax$Order, levels = c(main_orders, "Other"))
tax$otu_id <- seq_len(ncol(otu_table(ps)))
row_scores <- row_scores %>%
  left_join(sample_data(pslog))
col_scores <- col_scores %>%
  left_join(tax)
evals_prop <- 100 * (ranks_pca$eig / sum(ranks_pca$eig))
ggplot() +
  geom_point(data = row_scores, aes(x = li.Axis1, y = li.Axis2), shape = 2) +
  geom_point(data = col_scores, aes(x = 25 * co.Comp1, y = 25 * co.Comp2, col = Order),
             size = .3, alpha = 0.6) +
  scale_color_brewer(palette = "Set2") +
  facet_grid(~ age_binned) +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  labs(x = sprintf("Axis1 [%s%% variance]", round(evals_prop[1], 2)),
       y = sprintf("Axis2 [%s%% variance]", round(evals_prop[2], 2))) +
  coord_fixed(sqrt(ranks_pca$eig[2] / ranks_pca$eig[1])) +
  theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)))

Figure 15: The biplot resulting from the PCA after the truncated-ranking transformation

The results are similar to the PCoA analyses computed without applying a truncated-ranking transformation, reinforcing our confidence in the analysis on the original data.

Canonical correspondence

Canonical Correspondence Analysis (CCpnA) is an approach to ordination of a species-by-sample table that incorporates supplemental information about the samples. As before, the purpose of creating biplots is to determine which types of bacterial communities are most prominent in different mouse sample types. It can be easier to interpret these biplots when the ordering between samples reflects sample characteristics – variations in age or litter status in the mouse data, for example – and this is central to the design of CCpnA.

CCpnA allows us to create biplots where the positions of samples are determined by similarity in both species signatures and environmental characteristics; in contrast, principal components analysis or correspondence analysis only look at species signatures. More formally, it ensures that the resulting CCpnA directions lie in the span of the environmental variables; thorough treatments are available in (Braak 1985; Greenacre 2007).

Like PCoA and DPCoA, this method can be run using ordinate from the phyloseq package. In order to use supplemental sample data, it is necessary to provide an extra formula argument specifying which of the features to consider – otherwise, ordinate defaults to using all measurements when producing the ordination.

ps_ccpna <- ordinate(pslog, "CCA", formula = pslog ~ age_binned + family_relationship)

To access the positions for the biplot, we can use the scores function in the vegan package. Further, to facilitate figure annotation, we also join the site scores with the environmental data in the sample_data slot. Of the 23 total taxonomic orders, we only explicitly annotate the four most abundant – this makes the biplot easier to read.

library(ggrepel)
ps_scores <- vegan::scores(ps_ccpna)
sites <- data.frame(ps_scores$sites)
sites$SampleID <- rownames(sites)
sites <- sites %>%
  left_join(sample_data(ps))

species <- data.frame(ps_scores$species)
species$otu_id <- seq_along(colnames(otu_table(ps)))
species <- species %>%
  left_join(tax)
evals_prop <- 100 * ps_ccpna$CCA$eig[1:2] / sum(ps_ccpna$CA$eig)
ggplot() +
  geom_point(data = sites, aes(x = CCA1, y = CCA2), shape = 2, alpha = 0.5) +
  geom_point(data = species, aes(x = CCA1, y = CCA2, col = Order), size = 0.5) +
  geom_text_repel(data = species %>% filter(CCA2 < -2),
                    aes(x = CCA1, y = CCA2, label = otu_id),
            size = 1.5, segment.size = 0.1) +
  facet_grid(. ~ family_relationship) +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  labs(x = sprintf("Axis1 [%s%% variance]", round(evals_prop[1], 2)),
        y = sprintf("Axis2 [%s%% variance]", round(evals_prop[2], 2))) +
  scale_color_brewer(palette = "Set2") +
  coord_fixed(sqrt(ps_ccpna$CCA$eig[2] / ps_ccpna$CCA$eig[1])*0.45   ) +
  theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)))

Figure 16: The mouse and bacteria scores generated by CCpnA

Supervised learning

Here we illustrate some supervised learning methods that can be easily run in R. The caret package wraps many prediction algorithms available in R and performs parameter tuning automatically. Since we saw that microbiome signatures change with age, we’ll apply supervised techniques to try to predict age from microbiome composition.

We’ll first look at Partial Least Squares (PLS)(Wold et al. 1984). The first step is to divide the data into training and test sets, with assignments done by mouse, rather than by sample, to ensure that the test set realistically simulates the collection of new data. Once we split the data, we can use the function train to fit the PLS model.

library(caret)
sample_data(pslog)$age2 <- cut(sample_data(pslog)$age, c(0, 100, 400))
dataMatrix <- data.frame(age = sample_data(pslog)$age2, otu_table(pslog))
# take 8 mice at random to be the training set, and the remaining 4 the test set
trainingMice <- sample(unique(sample_data(pslog)$host_subject_id), size = 8)
inTrain <- which(sample_data(pslog)$host_subject_id %in% trainingMice)
training <- dataMatrix[inTrain,]
testing <- dataMatrix[-inTrain,]
plsFit <- train(age ~ ., data = training,
                method = "pls", preProc = "center")

Next we can predict class labels on the test set using the function predict and compare to the truth. We see that the method does an excellent job of predicting age.

plsClasses <- predict(plsFit, newdata = testing)
table(plsClasses, testing$age)
##            
## plsClasses  (0,100] (100,400]
##   (0,100]        66         0
##   (100,400]       2        45

As another example, we can try out random forests. This is run in exactly the same way as PLS, by switching the method argument from pls to rf. Random forests also perform well at the prediction task on this test set, though there are more old mice misclassified as young.

library(randomForest)
rfFit <- train(age ~ ., data = training, method = "rf",
               preProc = "center", proximity = TRUE)
rfClasses <- predict(rfFit, newdata = testing)
table(rfClasses, testing$age)
##            
## rfClasses   (0,100] (100,400]
##   (0,100]        66         1
##   (100,400]       2        44

To interpret these PLS and random forest results, it is standard to produce biplots and proximity plots, respectively. The code below extracts coordinates and supplies annotation for points to include on the PLS biplot.

pls_biplot <- list("loadings" = loadings(plsFit$finalModel),
                   "scores" = scores(plsFit$finalModel))
class(pls_biplot$scores) <- "matrix"

pls_biplot$scores <- data.frame(sample_data(pslog)[inTrain, ],
                                pls_biplot$scores)

tax <- tax_table(ps)@.Data %>%
  data.frame(stringsAsFactors = FALSE)
main_orders <- c("Clostridiales", "Bacteroidales", "Lactobacillales",
                 "Coriobacteriales")
tax$Order[!(tax$Order %in% main_orders)] <- "Other"
tax$Order <- factor(tax$Order, levels = c(main_orders, "Other"))
class(pls_biplot$loadings) <- "matrix"
pls_biplot$loadings <- data.frame(tax, pls_biplot$loadings)
ggplot() +
  geom_point(data = pls_biplot$scores,
             aes(x = Comp.1, y = Comp.2), shape = 2) +
  geom_point(data = pls_biplot$loadings,
             aes(x = 25 * Comp.1, y = 25 * Comp.2, col = Order),
             size = 0.3, alpha = 0.6) +
  scale_color_brewer(palette = "Set2") +
  labs(x = "Axis1", y = "Axis2", col = "Binned Age") +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  facet_grid( ~ age2) +
  theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)))

Figure 17: PLS produces a biplot representation designed to separate samples by a response variable

The resulting biplot is displayed in Figure 17; it can be interpreted similarly to earlier ordination diagrams, with the exception that the projection is chosen with an explicit reference to the binned age variable. Specifically, PLS identifies a subspace to maximize discrimination between classes, and the biplot displays sample projections and ASV coefficients with respect to this subspace.

rf_prox <- cmdscale(1 - rfFit$finalModel$proximity) %>%
  data.frame(sample_data(pslog)[inTrain, ])

ggplot(rf_prox) +
  geom_point(aes(x = X1, y = X2, col = age_binned),
             size = 1, alpha = 0.7) +
  scale_color_manual(values = c("#A66EB8", "#238DB5", "#748B4F")) +
  guides(col = guide_legend(override.aes = list(size = 4))) +
  labs(col = "Binned Age", x = "Axis1", y = "Axis2")

Figure 18: The random forest model determines a distance between samples, which can be input into PCoA to produce a proximity plot

A random forest proximity plot is displayed in Figure 18. To generate this representation, a distance is calculated between samples based on how frequently samples occur in the same tree partition in the random forest’s bootstrapping procedure. If a pair of samples frequently occur in the same partition, the pair is assigned a low distance. The resulting distances are then input to PCoA, giving a glimpse into the random forests’ otherwise complex classification mechanism. The separation between classes is clear, and manually inspecting points would reveal what types of samples are easier or harder to classify.

as.vector(tax_table(ps)[which.max(importance(rfFit$finalModel)), c("Family", "Genus")])
## [1] "Lachnospiraceae" "Roseburia"
impOtu <- as.vector(otu_table(pslog)[,which.max(importance(rfFit$finalModel))])
maxImpDF <- data.frame(sample_data(pslog), abund = impOtu)
ggplot(maxImpDF) +   geom_histogram(aes(x = abund)) +
  facet_grid(age2 ~ .) +
  labs(x = "Abundance of discriminative bacteria", y = "Number of samples")

Figure 19: Random forest aids to interpretation: importance scores

To further understand the fitted random forest model, we identify the microbe with the most influence in the random forest prediction. This turns out to be a microbe in family Lachnospiraceae and genus Roseburia. Figure 19 plots its abundance across samples; we see that it is uniformly very low from age 0 to 100 days and much higher from age 100 to 400 days.

Graph-based analyses

Creating and plotting graphs

Phyloseq has functionality for creating graphs based on thresholding a distance matrix, and the resulting networks can be plotted using the ggnetwork package. This package overlays onto the ggplot syntax, so you can use the function ggplot on an igraph object and add geom_edges and geom_nodes geoms to plot the network. To be able to color the nodes or edges a certain way, we need to add these attributes to the igraph object. Below we create a network by thresholding the Jaccard dissimilarity (the default distance for the function make_network) at .35, and then we add an attribute to the vertices indicating which mouse the sample came from and which litter the mouse was in. Then we can plot the network with the coloring by mouse and shape by litter.

library("phyloseqGraphTest")
library("igraph")
library("ggnetwork")
net <- make_network(ps, max.dist=0.35)
sampledata <- data.frame(sample_data(ps))
V(net)$id <- sampledata[names(V(net)), "host_subject_id"]
V(net)$litter <- sampledata[names(V(net)), "family_relationship"]
ggplot(net, aes(x = x, y = y, xend = xend, yend = yend), layout = "fruchtermanreingold") +
  geom_edges(color = "darkgray") +
  geom_nodes(aes(color = id, shape = litter),  size = 3 ) +
  theme(axis.text = element_blank(), axis.title = element_blank(),
        legend.key.height = unit(0.5,"line")) +
  guides(col = guide_legend(override.aes = list(size = .5)))

Figure 20: A network created by thresholding the Jaccard dissimilarity matrix

We see the resulting network in Figure 20. The colors in the Figure represent which mouse the sample came from and the shape represents which litter the mouse was in. We can see that there is grouping of the samples by both mouse and litter.

Graph-based two-sample tests

Graph-based two-sample tests were introduced by Friedman and Rafsky (Friedman and Rafsky 1979) as a generalization of the Wald-Wolfowitz runs test. They proposed using a minimum spanning tree (MST) based on the distances between the samples, and then counting the number of edges on the tree that join samples in different groups. It is not necessary to use an MST; the graph made by linking nearest neighbors (Schilling 1986) or a graph obtained by distance thresholding can also be used as the input graph. No matter what graph we build between the samples, we can approximate a null distribution by permuting the labels of the nodes of the graph.
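
To make the idea concrete, here is a minimal hand-rolled sketch of the Friedman-Rafsky statistic on made-up data (not the mouse data used elsewhere in this workflow): it builds an MST with igraph, counts the edges joining samples with different labels, and compares that count to a label-permutation null. Unlike graph_perm_test, used below, this sketch ignores any nested grouping structure.

library("igraph")

set.seed(1)
x <- matrix(rnorm(40 * 5), nrow = 40)   # 40 hypothetical samples, 5 features
labels <- rep(c("A", "B"), each = 20)   # two hypothetical groups
# complete weighted graph on the pairwise distances, then its minimum spanning tree
g <- graph_from_adjacency_matrix(as.matrix(dist(x)), mode = "undirected", weighted = TRUE)
mst_g <- mst(g)
# number of "mixed" edges, i.e. edges whose two endpoints carry different labels
count_mixed <- function(graph, labs) {
  e <- ends(graph, E(graph), names = FALSE)
  sum(labs[e[, 1]] != labs[e[, 2]])
}
observed <- count_mixed(mst_g, labels)
null_counts <- replicate(999, count_mixed(mst_g, sample(labels)))
# one-sided permutation p-value: small values mean fewer mixed edges than expected,
# i.e. samples cluster by label more strongly than chance would predict
mean(c(observed, null_counts) <= observed)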

Minimum Spanning Tree (MST)

We first perform a test using an MST with Jaccard dissimilarity. We want to know whether the two litters (family_relationship) come from the same distribution. Since there is a grouping in the data by individual (host_subject_id), we can’t simply permute all the labels, we need to maintain this nested structure – this is what the grouping argument does. Here we permute the labels but keep the structure intact.

gt <- graph_perm_test(ps, "family_relationship", grouping = "host_subject_id",
                      distance = "jaccard", type = "mst")
gt$pval
## [1] 0.01
plotNet1=plot_test_network(gt) + theme(legend.text = element_text(size = 8),
        legend.title = element_text(size = 9))
plotPerm1=plot_permutations(gt)
grid.arrange(ncol = 2,  plotNet1, plotPerm1)

Figure 21: The graph and permutation histogram obtained from the minimum spanning tree with Jaccard dissimilarity

This test has a small \(p\)-value, and we reject the null hypothesis that the two samples come from the same distribution. From the plot of the minimum spanning tree in Figure 21, we see by eye that the samples group by litter more than we would expect by chance.

Nearest neighbors

The \(k\)-nearest neighbors graph is obtained by putting an edge between two samples whenever one of them is in the set of \(k\)-nearest neighbors of the other.
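
As an aside, the following small sketch (on made-up data; graph_perm_test constructs its input graph internally, so this is for illustration only) shows one way such a nearest-neighbor graph could be built by hand from a distance matrix.

library("igraph")

# build a k-nearest-neighbor graph: an edge joins two samples whenever either one
# is among the k closest samples to the other
knn_graph <- function(d, k = 1) {
  d <- as.matrix(d)
  diag(d) <- Inf                             # a sample is never its own neighbor
  adj <- matrix(0, nrow(d), ncol(d))
  for (i in seq_len(nrow(d))) {
    adj[i, order(d[i, ])[seq_len(k)]] <- 1   # mark the k closest samples
  }
  graph_from_adjacency_matrix(pmax(adj, t(adj)), mode = "undirected")
}
g_knn <- knn_graph(dist(matrix(rnorm(20 * 3), nrow = 20)), k = 1)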

gt <- graph_perm_test(ps, "family_relationship", grouping = "host_subject_id",
                      distance = "jaccard", type = "knn", knn = 1)
plotNet2=plot_test_network(gt) + theme(legend.text = element_text(size = 8),
        legend.title = element_text(size = 9))
plotPerm2=plot_permutations(gt)
grid.arrange(ncol = 2,  plotNet2, plotPerm2)

Figure 22: k=1 nearest-neighbor network and permutation histogram

We see from Figure 22 that if a pair of samples has an edge between them in the nearest neighbor graph, they are overwhelmingly likely to be in the same litter.

Linear modeling

It is often of interest to evaluate the degree to which microbial community diversity reflects characteristics of the environment from which it was sampled. Unlike ordination, the purpose of this analysis is not to develop a representation of many bacteria with respect to sample characteristics; rather, it is to describe how a single measure of overall community structure is associated with sample characteristics. This is a somewhat simpler statistical goal, and can be addressed through linear modeling, for which there are a range of approaches in R. As an example, we will use a mixed-effects model to study the relationship between mouse microbial community diversity and the age and litter variables that have been our focus so far. This choice was motivated by the observation that younger mice have noticeably lower Shannon diversities, but that different mice have different baseline diversities. The mixed-effects model is a starting point for formalizing this observation.

We first compute the Shannon diversity associated with each sample and join it with sample annotation.

library("nlme")
library("reshape2")
ps_alpha_div <- estimate_richness(ps, split = TRUE, measure = "Shannon")
ps_alpha_div$SampleID <- rownames(ps_alpha_div) %>%
  as.factor()
ps_samp <- sample_data(ps) %>%
  unclass() %>%
  data.frame() %>%
  left_join(ps_alpha_div, by = "SampleID") %>%
  melt(measure.vars = "Shannon",
       variable.name = "diversity_measure",
       value.name = "alpha_diversity")

# reorder facets from lowest to highest diversity
diversity_means <- ps_samp %>%
  group_by(host_subject_id) %>%
  summarise(mean_div = mean(alpha_diversity)) %>%
  arrange(mean_div)
ps_samp$host_subject_id <- factor(ps_samp$host_subject_id,
                                  diversity_means$host_subject_id)
alpha_div_model <- lme(fixed = alpha_diversity ~ age_binned, data = ps_samp,
                       random = ~ 1 | host_subject_id)
new_data <- expand.grid(host_subject_id = levels(ps_samp$host_subject_id),
                        age_binned = levels(ps_samp$age_binned))
new_data$pred <- predict(alpha_div_model, newdata = new_data)
X <- model.matrix(eval(eval(alpha_div_model$call$fixed)[-2]),
                  new_data[-ncol(new_data)])
pred_var_fixed <- diag(X %*% alpha_div_model$varFix %*% t(X))
new_data$pred_var <- pred_var_fixed + alpha_div_model$sigma ^ 2
# fitted values, with error bars
ggplot(ps_samp %>% left_join(new_data)) +
  geom_errorbar(aes(x = age_binned, ymin = pred - 2 * sqrt(pred_var),
                    ymax = pred + 2 * sqrt(pred_var)),
                col = "#858585", size = .1) +
  geom_point(aes(x = age_binned, y = alpha_diversity,
                 col = family_relationship), size = 0.8) +
  facet_wrap(~host_subject_id) +
  scale_y_continuous(limits = c(2.4, 4.6), breaks = seq(0, 5, .5)) +
  scale_color_brewer(palette = "Set2") +
  labs(x = "Binned Age", y = "Shannon Diversity", color = "Litter") +
  guides(col = guide_legend(override.aes = list(size = 4))) +
  theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)),
        axis.text.x = element_text(angle = -90, size = 6),
        axis.text.y = element_text(size = 6))

Hierarchical multiple testing

Hypothesis testing can be used to identify individual microbes whose abundance relates to sample variables of interest. A standard approach is to compute a test statistic for each bacterium individually, measuring its association with sample characteristics, and then jointly adjust \(p\)-values to ensure a False Discovery Rate upper bound. This can be accomplished through the Benjamini-Hochberg procedure, for example (Benjamini and Hochberg 1995). However, this procedure does not exploit any structure among the tested hypotheses – for example, it is likely that if one Ruminococcus species is strongly associated with age, then others are as well. To integrate this information, (Benjamini and Yekutieli 2003; Benjamini and Bogomolov 2014) proposed a hierarchical testing procedure, where taxonomic groups are only tested if higher levels are found to be associated. In the case where many related species have a slight signal, this pooling of information can increase power.
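
For reference, the standard "flat" Benjamini-Hochberg adjustment described above is a one-liner in base R. The minimal sketch below uses simulated \(p\)-values and is shown only for contrast with the hierarchical procedure applied later in this section.

# flat Benjamini-Hochberg adjustment on made-up p-values, ignoring any hierarchy
set.seed(1)
pvals <- c(runif(50, 0, 0.005), runif(950))   # 50 hypothetical signals among 1000 tests
adj_pvals <- p.adjust(pvals, method = "BH")
sum(adj_pvals < 0.05)                         # number of discoveries at a 5% FDR target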

We apply the hierarchical procedure to test the association between microbial abundance and age. This provides a complementary view of the earlier analyses, identifying individual bacteria that are responsible for the differences between young and old mice.

We digress briefly from hierarchical testing to describe an alternative form of count normalization. Rather than working with the logged data as in our earlier analysis, we consider a variance stabilizing transformation introduced by (Love, Huber, and Anders 2014) for RNA-seq data and applied in (McMurdie and Holmes 2014) to 16S rRNA generated count data; it is available in the DESeq2 package. The two transformations yield similar sets of significant microbes. One difference is that, after accounting for size factors, the histogram of row sums for DESeq2 is more spread out in the lower values, refer to Figure 23. This is the motivation for using such a transformation: for high abundance counts it is equivalent to the log, but for lower and mid-range abundances it does not compress the data and yields more powerful results. The code below illustrates the mechanics of computing DESeq2’s variance stabilizing transformation on a phyloseq object.

library("reshape2")
library("DESeq2")
#New version of DESeq2 needs special levels
sample_data(ps)$age_binned <- cut(sample_data(ps)$age,
                          breaks = c(0, 100, 200, 400))
levels(sample_data(ps)$age_binned) <- list(Young100="(0,100]", Mid100to200="(100,200]", Old200="(200,400]")
sample_data(ps)$family_relationship = gsub(" ", "", sample_data(ps)$family_relationship)
ps_dds <- phyloseq_to_deseq2(ps, design = ~ age_binned + family_relationship)

# geometric mean, set to zero when all coordinates are zero
geo_mean_protected <- function(x) {
  if (all(x == 0)) {
    return (0)
  }
  exp(mean(log(x[x != 0])))
}

geoMeans <- apply(counts(ps_dds), 1, geo_mean_protected)
ps_dds <- estimateSizeFactors(ps_dds, geoMeans = geoMeans)
ps_dds <- estimateDispersions(ps_dds)
abund <- getVarianceStabilizedData(ps_dds)

We use the structSSI package to perform the hierarchical testing (Sankaran and Holmes 2014). We first shorten the names of each taxon/ASV.

short_names <- substr(rownames(abund), 1, 5)%>%
  make.names(unique = TRUE)
rownames(abund) <- short_names
abund_sums <- rbind(data.frame(sum = colSums(abund),
                               sample = colnames(abund),
                               type = "DESeq2"),
                    data.frame(sum = rowSums(otu_table(pslog)),
                               sample = rownames(otu_table(pslog)),
                               type = "log(1 + x)"))

ggplot(abund_sums) +
  geom_histogram(aes(x = sum), binwidth = 20) +
  facet_grid(type ~ .) +
  xlab("Total abundance within sample")

Figure 23: DESeq2 transformation abundance

The histogram on the top gives the total DESeq2-transformed abundance within each sample. The bottom histogram is the same as the one shown during the preprocessing steps earlier in the workflow, and is included to facilitate comparison.

Unlike standard multiple hypothesis testing, the hierarchical testing procedure needs univariate tests for each higher-level taxonomic group, not just every species. A helper function, treePValues, is available for this; it expects an edgelist encoding parent-child relationships, with the first row specifying the root node.

library("structSSI")
el <- phy_tree(pslog)$edge
el0 <- el
el0 <- el0[nrow(el):1, ]
el_names <- c(short_names, seq_len(phy_tree(pslog)$Nnode))
el[, 1] <- el_names[el0[, 1]]
el[, 2] <- el_names[as.numeric(el0[, 2])]
unadj_p <- treePValues(el, abund, sample_data(pslog)$age_binned)

We can now correct the \(p\)-values using the hierarchical testing procedure. The procedure guarantees several variants of FDR control, at different levels; we defer details to (Benjamini and Yekutieli 2003; Benjamini and Bogomolov 2014; Sankaran and Holmes 2014).

hfdr_res <- hFDR.adjust(unadj_p, el, .75)
summary(hfdr_res)
## Number of hypotheses: 764 
## Number of tree discoveries: 579 
## Estimated tree FDR: 1 
## Number of tip discoveries: 280 
## Estimated tips FDR: 1 
## 
##  hFDR adjusted p-values: 
##                 unadjp         adjp adj.significance
## GCAAG.95  1.861589e-82 3.723178e-82              ***
## GCAAG.70  1.141143e-75 2.282287e-75              ***
## GCAAG.187 5.153908e-59 1.030782e-58              ***
## GCAAG.251 3.522324e-50 7.044648e-50              ***
## GCAAG.148 1.277549e-49 2.555098e-49              ***
## GCAAG.30  9.825043e-49 1.965009e-48              ***
## GCGAG.76  1.732076e-46 3.464152e-46              ***
## GCAAG.167 6.268146e-43 1.253629e-42              ***
## 255       8.780683e-40 1.756137e-39              ***
## GCAAG.64  2.717368e-36 5.434736e-36              ***
## [only 10 most significant hypotheses shown] 
## --- 
## Signif. codes:  0 '***' 0.015 '**' 0.15 '*' 0.75 '.' 1.5 '-' 1
#interactive part: not run
plot(hfdr_res, height = 5000) # opens in a browser

A screenshot of a subtree with many differentially abundant bacteria, as determined by the hierarchical testing procedure. Currently the user is hovering over the node associated with bacteria GCGAG.33; this causes the adjusted \(p\)-value (0.0295) to appear.

The plot opens in a new browser – a static screenshot of a subtree is displayed above. Nodes are shaded according to \(p\)-values, from blue to orange, representing the strongest to weakest associations. Grey nodes were never tested, to focus power on more promising subtrees. Scanning the full tree, it becomes clear that the association between age group and bacterial abundance is present in only a few isolated taxonomic groups, but that it is quite strong in those groups. To give context to these results, we can retrieve the taxonomic identity of the rejected hypotheses.

tax <- tax_table(pslog)[, c("Family", "Genus")] %>%
  data.frame()
tax$seq <- short_names
options(digits=3)
hfdr_res@p.vals$seq <- rownames(hfdr_res@p.vals)
tax %>%
  left_join(hfdr_res@p.vals) %>%
  arrange(adjp) %>% head(10)
##             Family            Genus       seq   unadjp     adjp adj.significance
## 1  Lachnospiraceae             <NA>  GCAAG.95 1.86e-82 3.72e-82              ***
## 2  Lachnospiraceae        Roseburia  GCAAG.70 1.14e-75 2.28e-75              ***
## 3  Lachnospiraceae Clostridium_XlVa GCAAG.187 5.15e-59 1.03e-58              ***
## 4  Lachnospiraceae             <NA> GCAAG.251 3.52e-50 7.04e-50              ***
## 5  Lachnospiraceae Clostridium_XlVa GCAAG.148 1.28e-49 2.56e-49              ***
## 6  Lachnospiraceae             <NA>  GCAAG.30 9.83e-49 1.97e-48              ***
## 7  Ruminococcaceae     Ruminococcus  GCGAG.76 1.73e-46 3.46e-46              ***
## 8  Lachnospiraceae Clostridium_XlVa GCAAG.167 6.27e-43 1.25e-42              ***
## 9  Lachnospiraceae        Roseburia  GCAAG.64 2.72e-36 5.43e-36              ***
## 10            <NA>             <NA>   GCAAG.1 5.25e-35 1.05e-34              ***

It seems that the most strongly associated bacteria all belong to family Lachnospiraceae, which is consistent with the random forest results.

Multitable techniques

Many microbiome studies attempt to quantify variation in the microbial, genomic, and metabolic measurements across different experimental conditions. As a result, it is common to perform multiple assays on the same biological samples and ask what features – bacteria, genes, or metabolites, for example – are associated with different sample conditions. There are many ways to approach these questions; which one to apply depends on the study’s focus.

Here, we will focus on one specific workflow that uses sparse Canonical Correlation Analysis (sparse CCA), a method well-suited to both exploratory comparisons between samples and the identification of features with interesting variation. We will use the implementation in the PMA package (Witten et al. 2009).

Since the mouse data used above included only a single table, we use a new data set here, collected in the study of (Kashyap et al. 2013). There are two tables, one for bacteria and another with metabolites. 12 samples were obtained, each with measurements at 637 m/z values and 20,609 OTUs; however, about 96% of the entries of the microbial abundance table are exactly zero. The code below retrieves and filters these data.

metab <- read.csv("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/metabolites.csv",row.names = 1)
microbe_connect <-url("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/microbe.rda")
load(microbe_connect)
microbe
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 20609 taxa and 12 samples ]
## tax_table()   Taxonomy Table:    [ 20609 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 20609 tips and 20607 internal nodes ]

We see that microbe is a phyloseq object. Our preprocessing mirrors that done for the mouse data. We first filter down to microbes and metabolites of interest, removing those that are zero across many samples. Then, we transform them to weaken the heavy tails. We also take the log of the metabolites.

library("genefilter")
keep_ix <- rowSums(metab == 0) <= 3
metab <- metab[keep_ix, ]
microbe <- prune_taxa(taxa_sums(microbe) > 4, microbe)
microbe <- filter_taxa(microbe, filterfun(kOverA(3, 2)), TRUE)
metab <- log(1 + metab, base = 10)
X <- otu_table(microbe)
X[X > 50] <- 50
dim(X)
## [1] 174  12
dim(metab)
## [1] 405  12

We see that both X and metab have 12 columns; these are in fact the samples, so we will transpose the tables before running the CCA.

We can now apply sparse CCA. This method compares sets of features across high-dimensional data tables, where there may be more measured features than samples. In the process, it chooses a subset of available features that capture the most covariance – these are the features that reflect signals present across multiple tables. We then apply PCA to this selected subset of features. In this sense, we use sparse CCA as a screening procedure, rather than as an ordination method.

Our implementation is below. The parameters penaltyx and penaltyz are sparsity penalties. Smaller values of penaltyx will result in fewer selected microbes; similarly, penaltyz modulates the number of selected metabolites. See the documentation of the CCA function in the PMA package.

We tune them manually to facilitate subsequent interpretation – we generally prefer more sparsity than the default parameters would provide.

library(PMA)
cca_res <- CCA(t(X),  t(metab), penaltyx = .15, penaltyz = .15)
## 123456789101112131415
cca_res
## Call: CCA(x = t(X), z = t(metab), penaltyx = 0.15, penaltyz = 0.15)
## 
## 
## Num non-zeros u's:  5 
## Num non-zeros v's:  15 
## Type of x:  standard 
## Type of z:  standard 
## Penalty for x: L1 bound is  0.15 
## Penalty for z: L1 bound is  0.15 
## Cor(Xu,Zv):  0.9735258

With these parameters, 5 microbes and 15 metabolites have been selected, based on their ability to explain covariation between tables. Further, these 20 features result in a correlation of 0.974 between the two tables. We interpret this to mean that the microbial and metabolomic data reflect similar underlying signals, and that these signals can be approximated well by the 20 selected features. Be wary of the correlation value, however, since the scores are far from the usual bivariate normal cloud. Further, note that it is possible that other subsets of features could explain the data just as well – sparse CCA has minimized redundancy across features, but makes no guarantee that these are the “true” features in any sense.
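
To see which features were retained, we can index each table by the nonzero loadings. The short snippet below is not part of the original workflow; it simply uses the objects already defined above.

# identities of the selected features: nonzero entries of the sparse CCA loading vectors
selected_microbes    <- rownames(X)[cca_res$u != 0]
selected_metabolites <- rownames(metab)[cca_res$v != 0]
length(selected_microbes)      # 5 selected microbes
length(selected_metabolites)   # 15 selected metabolites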

Nonetheless, we can still use these 20 features to compress information from the two tables without much loss. To relate the recovered metabolites and OTUs to characteristics of the samples on which they were measured, we use them as input to an ordinary PCA.

combined <- cbind(t(X[cca_res$u != 0, ]),
                  t(metab[cca_res$v != 0, ]))
pca_res <- dudi.pca(combined, scannf = F, nf = 3)
genotype <- substr(rownames(pca_res$li), 1, 2)
sample_type <- substr(rownames(pca_res$l1), 3, 4)
feature_type <- grepl("\\.", colnames(combined))
feature_type <- ifelse(feature_type, "Metabolite", "OTU")
sample_info <- data.frame(pca_res$li, genotype, sample_type)
feature_info <- data.frame(pca_res$c1,
                           feature = substr(colnames(combined), 1, 6))
ggplot() +  geom_point(data = sample_info,
            aes(x = Axis1, y = Axis2, col = sample_type, shape = genotype), size = 3) + 
  geom_label_repel(data = feature_info,
                   aes(x = 5.5 * CS1, y = 5.5 * CS2, label = feature, fill = feature_type),
                   size = 2, segment.size = 0.3,
                   label.padding = unit(0.1, "lines"), label.size = 0) +
  geom_point(data = feature_info,
             aes(x = 5.5 * CS1, y = 5.5 * CS2, fill = feature_type),
             size = 1, shape = 23, col = "#383838") +
  scale_color_brewer(palette = "Set2") +
  scale_fill_manual(values = c("#a6d854", "#e78ac3")) +
  guides(fill = guide_legend(override.aes = list(shape = 32, size = 0))) +
  coord_fixed(sqrt(pca_res$eig[2] / pca_res$eig[2])) +
  labs(x = sprintf("Axis1 [%s%% Variance]",
                   100 * round(pca_res$eig[1] / sum(pca_res$eig), 2)),
       y = sprintf("Axis2 [%s%% Variance]",
                   100 * round(pca_res$eig[2] / sum(pca_res$eig), 2)),
       fill = "Feature Type", col = "Sample Type")

Figure 24: A PCA triplot produced from the CCA-selected features from multiple data types: metabolites and OTUs

Note that we have departed from our convention of fixing the aspect ratio here as the second axis represents very little of the variability and the plot would actually become unreadable.

Figure 24 displays a PCA triplot, where we show different types of samples and the multidomain features (metabolites and OTUs). This allows comparison across the measured samples – triangles for Knockout and circles for wild type – and characterizes the influence of the different features – diamonds with text labels. For example, we see that the main variation in the data is across PD and ST samples, which correspond to the different diets.

Further, large values of 15 of the features are associated with ST status, while small values for 5 of them indicate PD status. The advantage of the sparse CCA screening is now clear – we can display most of the variation across samples using a relatively simple plot, and can avoid plotting the hundreds of additional points that would be needed to display all of the features.

Conclusions

We have shown how a complete workflow in R is now available to denoise, identify and normalize next generation amplicon sequencing reads using probabilistic models with parameters fit using the data at hand.

We have provided a brief overview of all the analyses that become possible once the data has been imported into the R environment. Multivariate projections using the phylogenetic tree as the relevant distance between OTUs/ASVs can be done using weighted unifrac or double principal coordinate analyses using the phyloseq package. Biplots provide the user with an interpretation key. These biplots have been extended to triplots in the case of multidomain data incorporating genetic, metabolic and taxa abundances. We illustrate the use of network based analyses, whether the community graph is provided from other sources or from a taxa co-occurrence computation using a Jaccard distance.

We have briefly covered a small example of using supervised learning functions (random forests and partial least squares) to predict a response variable.

The main challenges in tackling microbiome data come from the many different levels of heterogeneity at both the input and output levels. These are easily accommodated through R’s capacity to combine data into S4 classes. We are able to include layers of information, trees, sample data description matrices, and contingency tables in the phyloseq data structures. The plotting facilities used throughout this workflow allow for the layering of information in the output into plots that combine graphs, multivariate information and maps of the relationships between covariates and taxa abundances. The layering concept allows the user to produce reproducible, publication-level figures with multiple heterogeneous sources of information. Our main goal in providing these tools has been to enhance the statistical power of the analyses by enabling the user to combine frequencies, quality scores and covariate information into complete and testable projections.

Summary

This illustration of possible workflows for microbiome data combining trees, networks, normalized read counts and sample information showcases the capabilities and reproducibility of an R based system for analysing bacterial communities. We have implemented key components in C wrapped within the Bioconductor package dada2 to enable the different steps to be undertaken on a laptop.

Once the sequences have been filtered and tagged, they can be assembled into a phylogenetic tree directly in R using maximum likelihood tree estimation. The sequences are then assembled into a phyloseq object containing all the sample covariates, the phylogenetic tree and the sample-taxa contingency table.

These data can then be visualized interactively with Shiny-phyloseq, plotted with one line wrappers in phyloseq and filtered or transformed very easily.

The last part of the paper shows more complex analyses that require direct plotting and advanced statistical analyses.

Multivariate ordination methods allow useful lower dimensional projections in the presence of phylogenetic information or multi-domain data as shown in an example combining metabolites, OTU abundances and sample covariate information.

Supervised learning methods provide lists of the most relevant taxa in discriminating between groups.

Bacterial communities can be represented as co-occurrence graphs using network-based plotting procedures available in R. We have also provided examples where these graphs can be used to test community structure through nonparametric permutation resampling. This provides implementations of the Friedman-Rafsky (Friedman and Rafsky 1979) tests for microbiome data, which had not been published previously.

Data availability

Intermediary data for the analyses are made available both on GitHub at https://github.com/spholmes/F1000_workflow and at the Stanford digital repository permanent url for this paper:
http://purl.stanford.edu/wh250nn9648. All other data have been previously published and the links are included in the paper.

Operation and Session Information

The programs and source for this article can be run using version 3.1 or above of R and version 3.3 or above of Bioconductor.

devtools::session_info()
##  setting  value                       
##  version  R version 3.4.1 (2017-06-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Los_Angeles         
##  date     2017-07-23                  
## 
##  package              * version  date       source        
##  ade4                 * 1.7-6    2017-03-23 CRAN (R 3.4.0)
##  ape                  * 4.1      2017-02-14 CRAN (R 3.4.0)
##  assertthat             0.2.0    2017-04-11 cran (@0.2.0) 
##  backports              1.1.0    2017-05-22 CRAN (R 3.4.0)
##  base                 * 3.4.1    2017-07-07 local         
##  bindr                  0.1      2016-11-13 cran (@0.1)   
##  bindrcpp             * 0.2      2017-06-17 cran (@0.2)   
##  Biobase                2.37.2   2017-05-10 Bioconductor  
##  BiocGenerics         * 0.23.0   2017-05-10 Bioconductor  
##  BiocParallel           1.11.4   2017-06-27 Bioconductor  
##  BiocStyle            * 2.5.8    2017-07-21 Bioconductor  
##  biomformat             1.5.0    2017-05-10 Bioconductor  
##  Biostrings           * 2.45.3   2017-07-21 Bioconductor  
##  bit                    1.1-12   2014-04-09 CRAN (R 3.4.0)
##  bit64                  0.9-7    2017-05-08 CRAN (R 3.4.0)
##  bitops                 1.0-6    2013-08-17 CRAN (R 3.4.0)
##  blob                   1.1.0    2017-06-17 CRAN (R 3.4.0)
##  bookdown               0.4      2017-05-20 CRAN (R 3.4.0)
##  car                    2.1-5    2017-07-04 CRAN (R 3.4.1)
##  caret                * 6.0-76   2017-04-18 CRAN (R 3.4.0)
##  class                  7.3-14   2015-08-30 CRAN (R 3.4.1)
##  cluster                2.0.6    2017-03-10 CRAN (R 3.4.1)
##  codetools              0.2-15   2016-10-05 CRAN (R 3.4.1)
##  colorspace             1.3-2    2016-12-14 CRAN (R 3.4.0)
##  compiler               3.4.1    2017-07-07 local         
##  curl                   2.8.1    2017-07-21 CRAN (R 3.4.1)
##  dada2                * 1.5.0    2017-05-10 Bioconductor  
##  data.table             1.10.4   2017-02-01 CRAN (R 3.4.0)
##  datasets             * 3.4.1    2017-07-07 local         
##  DBI                    0.7      2017-06-18 CRAN (R 3.4.0)
##  DECIPHER             * 2.5.0    2017-05-10 Bioconductor  
##  DelayedArray           0.3.16   2017-06-22 Bioconductor  
##  devtools               1.13.2   2017-06-02 CRAN (R 3.4.0)
##  digest                 0.6.12   2017-01-27 CRAN (R 3.4.0)
##  dplyr                * 0.7.2    2017-07-20 CRAN (R 3.4.1)
##  e1071                  1.6-8    2017-02-02 CRAN (R 3.4.0)
##  evaluate               0.10.1   2017-06-24 CRAN (R 3.4.1)
##  fastmatch              1.1-0    2017-01-28 CRAN (R 3.4.0)
##  foreach                1.4.3    2015-10-13 CRAN (R 3.4.0)
##  GenomeInfoDb           1.13.4   2017-06-06 Bioconductor  
##  GenomeInfoDbData       0.99.1   2017-07-22 Bioconductor  
##  GenomicAlignments      1.13.4   2017-07-20 Bioconductor  
##  GenomicRanges          1.29.11  2017-07-22 Bioconductor  
##  ggplot2              * 2.2.1    2016-12-30 CRAN (R 3.4.0)
##  ggrepel              * 0.6.5    2016-11-24 CRAN (R 3.4.0)
##  git2r                  0.18.0   2017-01-01 CRAN (R 3.4.0)
##  glue                   1.1.1    2017-06-21 cran (@1.1.1) 
##  graphics             * 3.4.1    2017-07-07 local         
##  grDevices            * 3.4.1    2017-07-07 local         
##  grid                   3.4.1    2017-07-07 local         
##  gridExtra            * 2.2.1    2016-02-29 CRAN (R 3.4.0)
##  gtable                 0.2.0    2016-02-26 CRAN (R 3.4.0)
##  highr                  0.6      2016-05-09 CRAN (R 3.4.0)
##  htmltools              0.3.6    2017-04-28 CRAN (R 3.4.0)
##  httr                   1.2.1    2016-07-03 CRAN (R 3.4.0)
##  hwriter                1.3.2    2014-09-10 CRAN (R 3.4.0)
##  igraph                 1.1.2    2017-07-21 CRAN (R 3.4.1)
##  IRanges              * 2.11.12  2017-07-22 Bioconductor  
##  iterators              1.0.8    2015-10-13 CRAN (R 3.4.0)
##  jsonlite               1.5      2017-06-01 CRAN (R 3.4.0)
##  knitr                * 1.16     2017-05-18 CRAN (R 3.4.0)
##  labeling               0.3      2014-08-23 CRAN (R 3.4.0)
##  lattice              * 0.20-35  2017-03-25 CRAN (R 3.4.1)
##  latticeExtra           0.6-28   2016-02-09 CRAN (R 3.4.0)
##  lazyeval               0.2.0    2016-06-12 CRAN (R 3.4.0)
##  lme4                   1.1-13   2017-04-19 CRAN (R 3.4.0)
##  magrittr               1.5      2014-11-22 CRAN (R 3.4.0)
##  MASS                   7.3-47   2017-02-26 CRAN (R 3.4.1)
##  Matrix                 1.2-10   2017-05-03 CRAN (R 3.4.1)
##  MatrixModels           0.4-1    2015-08-22 CRAN (R 3.4.0)
##  matrixStats            0.52.2   2017-04-14 CRAN (R 3.4.0)
##  memoise                1.1.0    2017-04-21 CRAN (R 3.4.0)
##  methods              * 3.4.1    2017-07-07 local         
##  mgcv                   1.8-17   2017-02-08 CRAN (R 3.4.1)
##  minqa                  1.2.4    2014-10-09 CRAN (R 3.4.0)
##  ModelMetrics           1.1.0    2016-08-26 CRAN (R 3.4.0)
##  multtest               2.33.0   2017-05-10 Bioconductor  
##  munsell                0.4.3    2016-02-13 CRAN (R 3.4.0)
##  nlme                   3.1-131  2017-02-06 CRAN (R 3.4.1)
##  nloptr                 1.0.4    2014-08-04 CRAN (R 3.4.0)
##  nnet                   7.3-12   2016-02-02 CRAN (R 3.4.1)
##  parallel             * 3.4.1    2017-07-07 local         
##  pbkrtest               0.4-7    2017-03-15 CRAN (R 3.4.0)
##  permute                0.9-4    2016-09-09 CRAN (R 3.4.0)
##  phangorn             * 2.2.0    2017-04-03 CRAN (R 3.4.0)
##  phyloseq             * 1.21.0   2017-05-10 Bioconductor  
##  pkgconfig              2.0.1    2017-03-21 CRAN (R 3.4.0)
##  pls                  * 2.6-0    2016-12-18 CRAN (R 3.4.0)
##  plyr                   1.8.4    2016-06-08 CRAN (R 3.4.0)
##  quadprog               1.5-5    2013-04-17 CRAN (R 3.4.0)
##  quantreg               5.33     2017-04-18 CRAN (R 3.4.0)
##  R6                     2.2.2    2017-06-17 CRAN (R 3.4.0)
##  randomForest         * 4.6-12   2015-10-07 CRAN (R 3.4.0)
##  RColorBrewer           1.1-2    2014-12-07 CRAN (R 3.4.0)
##  Rcpp                 * 0.12.12  2017-07-15 CRAN (R 3.4.1)
##  RcppParallel           4.3.20   2016-08-16 CRAN (R 3.4.0)
##  RCurl                  1.95-4.8 2016-03-01 CRAN (R 3.4.0)
##  reshape2             * 1.4.2    2016-10-22 CRAN (R 3.4.0)
##  rhdf5                  2.21.2   2017-06-15 Bioconductor  
##  rlang                  0.1.1    2017-05-18 CRAN (R 3.4.0)
##  rmarkdown            * 1.6      2017-06-15 CRAN (R 3.4.0)
##  rprojroot              1.2      2017-01-16 CRAN (R 3.4.0)
##  Rsamtools              1.29.0   2017-05-10 Bioconductor  
##  RSQLite              * 2.0      2017-06-19 CRAN (R 3.4.1)
##  S4Vectors            * 0.15.5   2017-06-27 Bioconductor  
##  scales                 0.4.1    2016-11-09 CRAN (R 3.4.0)
##  ShortRead              1.35.1   2017-05-10 Bioconductor  
##  SparseM                1.77     2017-04-23 CRAN (R 3.4.0)
##  splines                3.4.1    2017-07-07 local         
##  stats                * 3.4.1    2017-07-07 local         
##  stats4               * 3.4.1    2017-07-07 local         
##  stringi                1.1.5    2017-04-07 CRAN (R 3.4.0)
##  stringr                1.2.0    2017-02-18 CRAN (R 3.4.0)
##  SummarizedExperiment   1.7.5    2017-06-22 Bioconductor  
##  survival               2.41-3   2017-04-04 CRAN (R 3.4.1)
##  tibble                 1.3.3    2017-05-28 CRAN (R 3.4.0)
##  tools                  3.4.1    2017-07-07 local         
##  utils                * 3.4.1    2017-07-07 local         
##  vegan                  2.4-3    2017-04-07 CRAN (R 3.4.0)
##  withr                  1.0.2    2016-06-20 CRAN (R 3.4.0)
##  XVector              * 0.17.0   2017-05-10 Bioconductor  
##  yaml                   2.1.14   2016-11-12 CRAN (R 3.4.0)
##  zlibbioc               1.23.0   2017-05-10 Bioconductor

Author contributions

BJC, KS, JAF, PJM and SPH developed the software tools, BJC, KS, JAF, PJM and SPH developed statistical and bioinformatic methods and tested the workflow on the Mouse data sets. BJC, KS, JAF, PJM and SPH wrote the article.

Competing interests

No competing interests were disclosed.

Grant information

This work was partially supported by the NSF (DMS-1162538 to S.P.H.), the NIH (TR32 to KS and R01AI112401 to SPH), and a Stanford Interdisciplinary Graduate Fellowship supported JAF.

Acknowledgments

The authors would like to thank the members of the Relman Lab for their valuable insights on microbiology and sequencing. We are also grateful to Nandita Garud, Leo Lahti, Zachary Charlop-Powers for reviewing the F1000Research workflow paper and to Nikolaos Ignatiadis, Martina Fu and Shaheen Essabhoy for code testing and suggestions for the revised version.

References

Benjamini, Y., and Y. Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society B 57: 289–300.

Benjamini, Yoav, and Marina Bogomolov. 2014. “Selective Inference on Multiple Families of Hypotheses.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1). Wiley Online Library: 297–318.

Benjamini, Yoav, and Daniel Yekutieli. 2003. “Hierarchical Fdr Testing of Trees of Hypotheses.” Technical report, Department of Statistics; Operations Research, Tel Aviv University.

Braak, C ter. 1985. “Correspondence Analysis of Incidence and Abundance Data: Properties in Terms of a Unimodal Response ….” Biometrics, January. http://links.jstor.org/sici?sici=0006-341X(198512)41%253A4%253C859%253ACAOIAA%253E2.0.CO%253B2-S.

Callahan, Ben J, Kris Sankaran, Julia A Fukuyama, Paul J McMurdie, and Susan P Holmes. 2016. “Bioconductor Workflow for Microbiome Data Analysis: From Raw Reads to Community Analyses.” F1000Research 5. Faculty of 1000 Ltd.

Callahan, Benjamin J, Paul J McMurdie, and Susan P Holmes. 2017. “Exact Sequence Variants Should Replace Operational Taxonomic Units in Marker Gene Data Analysis.” bioRxiv. Cold Spring Harbor Labs Journals, 113597.

Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy J Johnson, and Susan P Holmes. 2016. “DADA2: High Resolution Sample Inference from Amplicon Data.” Nature Methods. Nature Publishing, 1–4.

Caporaso, J.G., J. Kuczynski, J. Stombaugh, K. Bittinger, F.D. Bushman, E.K. Costello, N. Fierer, et al. 2010. “QIIME Allows Analysis of High-Throughput Community Sequencing Data.” Nature Methods 7 (5). Nature Publishing Group: 335–36.

Cole, J.R., Q. Wang, E. Cardenas, J. Fish, B. Chai, R.J. Farris, A.S. Kulam-Syed-Mohideen, et al. 2009. “The Ribosomal Database Project: Improved Alignments and New Tools for rRNA Analysis.” Nucleic Acids Research 37 (Supplement 1): D141–D145.

Edgar, R.C., and H. Flyvbjerg. 2015. “Error Filtering, Pair Assembly and Error Correction for Next-Generation Sequencing Reads.” Bioinformatics 31 (21): 3476–82.

Friedman, Jerome H, and Lawrence C Rafsky. 1979. “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests.” The Annals of Statistics. JSTOR, 697–717.

Fukuyama, Julia, Paul J McMurdie, Les Dethlefsen, David A Relman, and Susan Holmes. 2012. “Comparisons of Distance Methods for Combining Covariates and Abundances in Microbiome Studies.” In Pac Symp Biocomput. World Scientific.

Greenacre, Michael. 2007. Correspondence Analysis in Practice. CRC press.

Holmes, S., A. Alekseyenko, A. Timme, T. Nelson, P.J. Pasricha, and A. Spormann. 2011. “Visualization and Statistical Comparisons of Microbial Communities Using R Packages on Phylochip Data.”

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2). Nature Publishing Group: 115–21.

Kashyap, Purna C, Angela Marcobal, Luke K Ursell, Samuel A Smits, Erica D Sonnenburg, Elizabeth K Costello, Steven K Higginbottom, et al. 2013. “Genetically Dictated Change in Host Mucus Carbohydrate Landscape Exerts a Diet-Dependent Effect on the Gut Microbiota.” Proceedings of the National Academy of Sciences 110 (42). National Acad Sciences: 17059–64.

Kozich, James J, Sarah L Westcott, Nielson T Baxter, Sarah K Highlander, and Patrick D Schloss. 2013. “Development of a Dual-Index Sequencing Strategy and Curation Pipeline for Analyzing Amplicon Sequence Data on the Miseq Illumina Sequencing Platform.” Applied and Environmental Microbiology 79 (17). Am Soc Microbiol: 5112–20.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12). Springer: 1–21.

McMurdie, Paul J, and Susan Holmes. 2013. “phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data.” PLoS One 8 (4): e61217.

———. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Computational Biology 10 (4). Public Library of Science: e1003531.

———. 2015. “Shiny-Phyloseq: Web Application for Interactive Microbiome Analysis with Provenance Tracking.” Bioinformatics 31 (2). Oxford Univ Press: 282–83.

Pavoine, Sandrine, Anne-Béatrice Dufour, and Daniel Chessel. 2004. “From Dissimilarities Among Species to Dissimilarities Among Communities: A Double Principal Coordinate Analysis.” Journal of Theoretical Biology 228 (4). Elsevier: 523–37.

Purdom, Elizabeth. 2010. “Analysis of a Data Matrix and a Graph: Metagenomic Data and the Phylogenetic Tree.” Annals of Applied Statistics, July.

Sankaran, Kris, and Susan Holmes. 2014. “StructSSI: Simultaneous and Selective Inference for Grouped or Hierarchically Structured Data.” Journal of Statistical Software 59 (1): 1–21.

Schilling, Mark F. 1986. “Multivariate Two-Sample Tests Based on Nearest Neighbors.” Journal of the American Statistical Association 81 (395). Taylor & Francis: 799–806.

Schloss, P D, S L Westcott, T Ryabin, J R Hall, M Hartmann, E B Hollister, R A Lesniewski, et al. 2009. “Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities.” Appl. and Environmental Microbiology 75 (23): 7537–41.

Schloss, P.D., A.M. Schuber, J.P. Zackular, K.D. Iverson, Young V.B., and Petrosino J.F. 2012. “Stabilization of the Murine Gut Microbiome Following Weaning.” Gut Microbes 3 (4): 383–93.

Wang, Q., G.M. Garrity, J.M. Tiedje, and J.R. Cole. 2007. “Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.” Applied and Environmental Microbiology 73 (16). Am Soc Microbiol: 5261.

Witten, D, R Tibshirani, S Gross, and B Narasimhan. 2009. “PMA: Penalized Multivariate Analysis.” R Package Version 1 (5).

Wold, Svante, Arnold Ruhe, Herman Wold, and WJ Dunn III. 1984. “The Collinearity Problem in Linear Regression. the Partial Least Squares (Pls) Approach to Generalized Inverses.” SIAM Journal on Scientific and Statistical Computing 5 (3). SIAM: 735–43.

Wright, Erik S. 2015. “DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment.” BMC Bioinformatics 16 (1). BioMed Central: 1.