- 1 Introduction
- 2 Goodness of Fit
- 3 DA methods
- 4 Type I Error Control
- 5 Concordance
- 6 Enrichment analysis
- 7 Session Info
- References

This vignette provides an introductory example on how to work with the analysis framework firstly proposed in (Calgaro *et al.*, 2020).

The package is named benchdamic, acronym for “BENCHmarking of Differential Abundance detection methods for MICrobial data”. Not only does the package structure allow the users to test a variety of commonly used methods for differential abundance analysis, but it also enables them to set benchmarks including custom methods on their datasets. Performances of each method are evaluated with respect to i) suitability of distributional assumptions, ii) ability to control false discoveries, iii) concordance of the findings, and iv) enrichment of differentially abundant microbial species in specific conditions. Each step of the assessment is flexible when it comes to the choice of differential abundance methods, their parameters, and input data types. Various graphic outputs lead the users to an informed decision when evaluating the most suitable method to use for their data.

To install this package, start R (version “4.2”) and enter:

```
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("benchdamic")
```

or use:

```
if (!require("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("mcalgaro93/benchdamic")
```

Then, load some packages for basic functions and data:

```
library(benchdamic)
# Parallel computation
library(BiocParallel)
# Generate simulated data
library(SPsimSeq)
# Data management
library(phyloseq)
library(SummarizedExperiment)
library(plyr)
# Graphics and tables
library(ggplot2)
library(cowplot)
library(kableExtra)
```

All datasets used in `benchdamic`

are downloaded using the `HMP16SData`

Bioconductor package (Schiffer *et al.*, 2019).

For GOF and TIEC analyses a homogeneous group of samples (*e.g.*, only samples from a specific experimental condition, phenotype, treatment, body site, etc.) *ps_stool_16S* is used (use `help("ps_stool_16S")`

for details). It contains 16S data from:

32 stool samples from participants of the Human Microbiome Project;

71 taxa, all features having the same genus-level taxonomic classification are collapsed together (a total of 71 taxa corresponding to 71 genera).

```
data("ps_stool_16S")
ps_stool_16S
```

```
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 71 taxa and 32 samples ]
## sample_data() Sample Data: [ 32 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 71 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 71 tips and 70 internal nodes ]
```

For Concordance and Enrichment analyses the *ps_plaque_16S* dataset is used (use `help("ps_plaque_16S")`

for details). It contains 16S data from:

30 participants of the Human Microbiome Project;

samples collected from subgingival plaque and supragingival plaque for each subject (a total of 60 samples);

88 taxa, all features having the same genus-level taxonomic classification are collapsed together (a total of 88 taxa corresponding to 88 genera).

```
data("ps_plaque_16S")
ps_plaque_16S
```

```
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 88 taxa and 60 samples ]
## sample_data() Sample Data: [ 60 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 88 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 88 tips and 87 internal nodes ]
```

**Assumption**: Many DA detection methods are based on parametric distributions.

**Research question**: Which are the parametric distributions that can fit both the proportion of zeros and the counts in your data?

As different methods rely on different statistical distributions to perform DA analysis, the goodness of fit (GOF) of the statistical models underlying some of the DA methods on a 16S dataset is assessed. For each model, its ability to correctly estimate the average counts and the proportion of zeroes by taxon is evaluated.

Five distributions are considered: (1) the negative binomial (**NB**) used in `edgeR`

and `DeSeq2`

(Robinson *et al.*, 2010; Love *et al.*, 2014), (2) the zero-inflated negative binomial (**ZINB**) used in ZINB-WaVE (Risso *et al.*, 2018), (3) the truncated Gaussian Hurdle model of `MAST`

(Finak *et al.*, 2015), (4) the zero-inflated Gaussian (**ZIG**) mixture model of `metagenomeSeq`

(Paulson *et al.*, 2013), and (5) the Dirichlet-Multinomial (**DM**) distribution underlying `ALDEx2`

Monte-Carlo sampling (Fernandes *et al.*, 2014) and multivariate extension of the beta-binomial distribution used by `corncob`

(Martin *et al.*, 2020).

The relationships between the functions used in this section are explained by the diagram in Figure 1. To help with the reading: green boxes represent the inputs or the outputs, red boxes are the methods and blue boxes are the main parameters of those method.

For any \(\mu \ge 0\) and \(\theta > 0\), let \(f_{NB}(\cdot;\mu,\theta)\) denote the probability mass function (PMF) of the negative binomial (NB) distribution with mean \(\mu\) and inverse dispersion parameter \(\theta\), namely:\[ f_{NB} = \frac{\Gamma(y+\theta)}{\Gamma(y+1)\Gamma(\theta)}\left(\frac{\theta}{\theta+1} \right)^\theta\left(\frac{\mu}{\mu+\theta} \right)^y, \forall y \in \mathbb{N} \]Note that another parametrization of the NB PMF is in terms of the dispersion parameter \(\psi = \theta^{-1}\) (although \(\theta\) is also sometimes called dispersion parameter in the literature). In both cases, the mean of the NB distribution is \(\mu\) and its variance is:\[ \sigma^2 = \mu + \frac{\mu^2}{\theta} = \mu+\psi\mu^2 \]In particular, the NB distribution boils down to a Poisson distribution when \(\psi=0 \iff \theta=+ \infty\).

For any \(\pi\in[0,1]\), let \(f_{ZINB}(\cdot;\mu,\theta,\pi)\) be the PMF of the ZINB distribution given by:

\[ f_{ZINB}(\cdot;\mu,\theta,\pi) = \pi\delta_0(y)+(1-\pi)f_{NB}(y;\mu,\theta), \forall y\in\mathbb{N} \]

where \(\delta_0(\cdot)\) is the Dirac function. Here, \(\pi\) can be interpreted as the probability that a 0 is observed instead of the actual count, resulting in an inflation of zeros compared to the NB distribution, hence the name ZINB.

To fit these distributions on real count data the `edgeR`

(Robinson *et al.*, 2010) and `zinbwave`

(Risso *et al.*, 2018) packages are used. In `benchdamic`

they are implemented in the `fitNB()`

and `fitZINB()`

functions.

The raw count for sample j and feature i is denoted by \(c_{ij}\). The zero-inflated model is defined for the continuity-corrected logarithm of the raw count data: \(y_{ij} = log_2(c_{ij}+1)\) as a mixture of a point mass at zero \(I_{0}(y)\) and a count distribution \(f_{count}(y;\mu,\sigma^2) \sim N(\mu,\sigma^2)\). Given mixture parameters \(\pi_j\), we have that the density of the ZIG distribution for feature i, in sample j with \(s_j\) total counts is: \[f_{ZIG}(y_{ij};s_j,\beta,\mu_i,\sigma^2_i) = \pi_j(s_j)\cdot I_{0}(y_{ij})+(1-\pi_j(s_j))\cdot f_{count}(y_{ij};\mu,\sigma^2)\]

The mean model is specified as:\[E(y_{ij})=\pi_{j} + (1-\pi_j)\cdot\left(b_{i0}+\eta_ilog_2\left( \frac{s_j^{\hat{l}}}{N}+1 \right) \right)\]

In this case, parameter \(b_{i0}\) is the intercept of the model while the term including the logged normalization factor \(log_2\left(\frac{s_j^{\hat{l}}}{N}+1 \right)\) captures feature-specific normalization factors through parameter \(\eta_i\). In details, \(s_j^{\hat{l}}\) is the median scaling factor resulted from the Cumulative Sum Scaling (CSS) normalization procedure. \(N\) is a constant fixed by default at 1000 but it should be a number close to the scaling factors to be used as a reference, for this reason a good choice could be the median of the scaling factors (which is used instead of 1000). The mixture parameters \(\pi_j(s_j)\) are modeled as a binomial process:

\[log\frac{\pi_j}{1-\pi_j} = \beta_0+\beta_1\cdot log(s_j)\]

To fit this distribution on real count data the `metagenomeSeq`

package (Paulson *et al.*, 2013) is used. In `benchdamic`

it is implemented in the `fitZIG()`

function.

The original field of application of this method was the single-cell RNAseq data, where \(y = log_2(TPM+1)\) expression matrix was modeled as a two-part generalized regression model (Finak *et al.*, 2015). In microbiome data that starting point translates to a \(y_{ij} = log_2\left(counts_{ij}\cdot\frac{10^6}{libSize_{j}}+1 \right)\) or a \(log_2\left(counts_{ij}\cdot\frac{ median(libSize)}{libSize_{j}}+1\right)\).

The taxon presence rate is modeled using logistic regression and, conditioning on a sample with the taxon, the transformed abundance level is modeled as Gaussian.

Given normalized, possibly thresholded, abundance \(y_{ij}\), the rate of presence and the level of abundance for the samples were the taxon is present, are modeled conditionally independent for each gene \(i\). Define the indicator \(z_{ij}\), indicating whether taxon \(i\) is expressed in sample \(j\) (*i.e.*, \(z_{ij} = 0\) if \(y_{ij} = 0\) and \(z_{ij} = 1\) if \(y_{ij} > 0\)). We fit logistic regression models for the discrete variable \(Z\) and a Gaussian linear model for the continuous variable \((Y|Z=1)\) independently, as follows:

\[ logit(Pr(Z_{ij}=1))=X_j\beta_i^D \]

\[ P(Y_{ij}=y|Z_{ij}=1) \sim N(X_j\beta^C_i,\sigma^2_i)\]

To estimate this distribution on real count data the `MAST`

package (Finak *et al.*, 2015) is used. In `benchdamic`

it is implemented in the `fitHURDLE()`

function.

The probability mass function of a \(n\) dimensional multinomial sample \(y = (y_1,...,y_n)^T\) with library size \(libSize = \sum_{i=1}^ny_i\) and parameter \(p=(p_1,...,p_n)\) is:

\[ f(y;p)= {libSize\choose y}\prod_{i=1}^np_i^{y_i} \]

The mean-variance structure of the MN model doesn’t allow over-dispersion, which is common in real data. DM distribution models the probability parameter \(p\) in the MN model by a Dirichlet distribution. The probability mass of a n-category count vector \(y\) over \(libSize\) trials under DM with parameter \(\alpha=(\alpha_1,...,\alpha_n)\), \(a_i>0\) and proportion vector \(p \in \Delta_n=\{(p_1,...,p_n):p_i\ge0,\sum_ip_i=1 \}\) is:

\[ f(y|\alpha)={libSize\choose y}\frac{\prod_{i=1}^n(a_i)y_i}{(\sum_i\alpha_i)\cdot libSize} \]

The mean value for the \(i^{th}\) taxon and \(j^{th}\) sample of the count matrix is given by \(libSize_j\cdot \frac{\alpha_{ij}}{\sum_i a_{ij}}\).

To estimate this distribution on real count data the `MGLM`

package (Zhang *et al.*, 2017; Zhang and Zhou, 2022) is used. In `benchdamic`

it is implemented in the `fitDM()`

function.

The goodness of fit for the previously described distributions is assessed comparing estimated and observed values. For each taxon the following measures are compared:

the Mean Difference (MD)

*i.e.*the difference between the estimated mean and the observed mean abundance (log scale);the Zero Probability Difference (ZPD)

*i.e.*the difference between the probability to observe a zero and the observed proportion of samples which have zero counts.

To easily compare estimated and observed mean values the natural logarithm transformation, with the continuity correction (\(log(counts+1)\)), is well suited, indeed it reduces the count range making the differences more stable.

Except for the `fitHURDLE()`

function, which performs a CPM transformation on the counts (or the one with the median library size), and the `fitZIG()`

function which models the \(log_2(counts+1)\), the other methods, `fitNB()`

, `fitZINB()`

, and `fitDM()`

, model the \(counts\) directly. For these reasons, `fitHURDLE()`

’s output should not be compared directly to the observed \(log(counts+1)\) mean values as for the other methods. Instead, the logarithm of the observed CPM (or the one with the median library size) should be used.

Here an example on how to fit a Truncated Gaussian hurdle model:

```
example_HURDLE <- fitHURDLE(
object = ps_stool_16S,
scale = "median"
)
head(example_HURDLE)
```

```
## Y Y0
## OTU_97.192 2.152592 0.312760011
## OTU_97.1666 NA 0.967681879
## OTU_97.31213 4.057621 0.094509772
## OTU_97.39094 1.123198 0.749635900
## OTU_97.40451 7.327534 0.001949454
## OTU_97.19587 4.716914 0.063341989
```

The values above are those estimated by the `fitHURDLE()`

function. Some *NA* values could be present due to taxa sparsity. The internally used function to prepare for the comparisons the observed counts is `prepareObserved()`

, specifying the *scale* parameter if the HURDLE model is considered (if *scale = “median”*, the median library size is used to scale counts instead of \(10^6\)):

```
observed_hurdle <- prepareObserved(
object = ps_stool_16S,
scale = "median")
head(observed_hurdle)
```

```
## Y Y0
## OTU_97.192 3.21387169 0.31250
## OTU_97.1666 0.03923904 0.96875
## OTU_97.31213 4.88350981 0.09375
## OTU_97.39094 4.89778959 0.75000
## OTU_97.40451 7.51646781 0.00000
## OTU_97.19587 5.26661203 0.06250
```

Which are different from the non-scaled observed values:

`head(prepareObserved(object = ps_stool_16S))`

```
## Y Y0
## OTU_97.192 3.34109346 0.31250
## OTU_97.1666 0.03077166 0.96875
## OTU_97.31213 4.77675725 0.09375
## OTU_97.39094 4.44081133 0.75000
## OTU_97.40451 7.53824550 0.00000
## OTU_97.19587 5.36801832 0.06250
```

The function to compute MD and ZPD values, is `meanDifferences()`

:

```
head(meanDifferences(
estimated = example_HURDLE,
observed = observed_hurdle
))
```

```
## MD ZPD
## 1 -1.0612798 0.0002600107
## 2 NA -0.0010681206
## 3 -0.8258883 0.0007597716
## 4 -3.7745917 -0.0003641001
## 5 -0.1889343 0.0019494545
## 6 -0.5496976 0.0008419888
```

A wrapper function to simultaneously perform the estimates and the mean differences is `fitModels()`

:

```
GOF_stool_16S <- fitModels(
object = ps_stool_16S,
models = c("NB", "ZINB", "DM", "ZIG", "HURDLE"),
scale_HURDLE = c("median", "default"),
verbose = FALSE # TRUE is always suggested
)
```

Exploiting the internal structure of the `fitModels()`

’s output the Root Mean Squared Error (RMSE) values for MD values can be extracted (the lower, the better):

`plotRMSE(GOF_stool_16S, difference = "MD", plotIt = FALSE)`

```
## Model RMSE
## 1 NB 0.07057518
## 2 ZINB 0.14298816
## 3 DM 0.97766699
## 4 ZIG 0.53150840
## 5 HURDLE_median 0.85437589
## 6 HURDLE_default 3.11439757
```

Similarly, they are extracted for ZPD values:

`plotRMSE(GOF_stool_16S, difference = "ZPD", plotIt = FALSE)`

```
## Model RMSE
## 1 NB 0.0741340631
## 2 ZINB 0.0219382366
## 3 DM 0.0436436801
## 4 ZIG 0.1129285606
## 5 HURDLE_median 0.0008776059
## 6 HURDLE_default 0.0008832596
```

To plot estimated and observed values the `plotMD()`

function can be used (Figure 2). No systematic trend are expected, moreover, the closer the values to the dotted line are (representing equality between observed and estimated values), the better the goodness of fit relative to the model.

```
plotMD(
data = GOF_stool_16S,
difference = "MD",
split = TRUE
)
```

If some warning messages are shown with this graph, they are likely due to sparse taxa. To address this, the number of NA values generated by each model can be investigated (which are 24 for each HURDLE model):

```
plyr::ldply(GOF_stool_16S, function(model)
c("Number of NAs" = sum(is.na(model))),
.id = "Distribution")
```

To summarize the goodness of fit, the Root Mean Squared Error (RMSE) metric is also displayed for each model. For the *HURDLE_default* model, a quite different range of values of mean differences is displayed because of the excessive *default* scaling proposed (1 million). It is also possible to plot only a subset of the estimated models (Figure 3).

```
plotMD(
data = GOF_stool_16S[1:5],
difference = "MD",
split = TRUE
)
```