DspikeIn with TSE

Mitra Ghotbi

2025-10-23

1 Acknowledgments

The development of the DspikeIn package was made possible through the generous and pioneering efforts of the R and Bioconductor communities. We gratefully acknowledge the developers and maintainers of the following open-source packages, whose tools and infrastructure underpin our work: Core infrastructure & data manipulation: methods, stats, utils, graphics, grDevices, data.table, dplyr, tibble, tidyr, reshape2, matrixStats, rlang, S4Vectors, grid, officer, xml2 Statistical analysis & modeling: DESeq2, edgeR, limma, randomForest, microbiome Phylogenetics & microbiome structure: phyloseq, TreeSummarizedExperiment, SummarizedExperiment, phangorn, ape, DECIPHER, msa, Biostrings Network and graph analysis: igraph, ggraph Visualization & layout design: ggplot2, ggrepel, ggpubr, ggnewscale, ggalluvial, ggtree, ggtreeExtra, ggstar, ggridges, patchwork, scales, RColorBrewer, flextable

These tools collectively empowered us to build a reproducible, modular, and extensible platform for robust absolute abundance quantification in microbial community analysis. We further acknowledge the broader scientific community working on absolute microbial quantification, spike-in calibration, and compositional data analysis, whose foundational insights directly informed the design and conceptual framework of DspikeIn.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DspikeIn")

library(DspikeIn)

2 DspikeIn

The DspikeIn package supports both phyloseq and TreeSummarizedExperiment formats to streamline microbial quantification across diverse experimental setups. It accommodates either a single spike-in taxon or synthetic community taxa with variable or equal spike-in volumes and copy numbers. The package offers a comprehensive suite of tools for absolute abundance (AA) quantification, addressing challenges through ten core functions: 1) validation of spiked species, 2) data preprocessing, 3) system-specific spiked species retrieval, 4) scaling factor calculation, 5) conversion to absolute abundance, 6) bias correction and normalization, 7) performance assessment, and 8) taxa exploration and filtering 9) network topology assessment 10) further analyses and visualization. For more information please check doi.org/10.1093/ismejo/wraf150

3 DspikeIn requirements

Important: DspikeIn operates on a taxonomy table with exactly seven ranks: Kingdom Phylum Class Order Family Genus Species. colnames(taxmat) <- c(“Kingdom”, “Phylum”, “Class”, “Order”, “Family”, “Genus”, “Species”) Please remove or rename extra ranks (e.g., Strain, Subfamily) before running DspikeIn.

library(SummarizedExperiment)
library(DspikeIn)

# DspikeIn requirements
# --------------------
# DspikeIn operates on a taxonomy table with exactly seven ranks:
# Kingdom  Phylum  Class  Order  Family  Genus  Species

expected <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
actual <- colnames(as.data.frame(SummarizedExperiment::rowData(tse)))

if (!identical(expected, actual)) {
  stop(
    paste0(
      "DspikeIn requires exactly 7 taxonomic ranks in this order:\n  ",
      paste(expected, collapse = "  "),
      "\n\nYour taxonomy columns are:\n  ",
      paste(actual, collapse = ", "),
      "\n\nPlease remove or rename extra ranks (e.g., 'Strain', 'Subfamily') ",
      "to match the required format."
    )
  )
}

4 To build TreeSummarizedExperiment file please refer to

github.com/markrobinsonuzh/TreeSummarizedExperiment

library(phyloseq)
# Get path to external data folder
extdata_path <- system.file("extdata", package = "DspikeIn")
list.files(extdata_path)
## [1] "Complete.graphml" "NoBasid.graphml"  "NoHubs.graphml"   "Ref.fasta"       
## [5] "Sample.fasta"
# and
data("physeq_16SOTU", package = "DspikeIn")
length(taxa_names(physeq_16SOTU))
## [1] 9242
new_ids <- paste0("OTU", seq_along(taxa_names(physeq_16SOTU)))
taxa_names(physeq_16SOTU) <- new_ids

# Verify
head(taxa_names(physeq_16SOTU))
## [1] "OTU1" "OTU2" "OTU3" "OTU4" "OTU5" "OTU6"
# [1] "OTU1" "OTU2" "OTU3" "OTU4" "OTU5" "OTU6"



tse_16SOTU <- convert_phyloseq_to_tse(physeq_16SOTU)
tse_16SOTU <- tidy_phyloseq_tse(tse_16SOTU)

Do all detected sample spike-in sequences cluster with the reference, and are their branch lengths statistically similar, supporting a common ancestor?

5 spike-in validation

All sample-derived sequences are forming a clade with the reference. We look for a monophyletic grouping of spike-in OTUs The clade is strongly supported (bootstrap around 100 percentage). The branch lengths and distances are in a biologically plausible range.

# Use the Neighbor-Joining method  based on a Jukes-Cantor distance matrix and plot the tree with bootstrap values.
# we compare the Sanger read of Tetragenococcus halophilus with the FASTA sequence of Tetragenococcus halophilus from our phyloseq object.

library(Biostrings)
library(TreeSummarizedExperiment)
library(SummarizedExperiment)
library(DspikeIn)


# Filter TSE object to keep only Bacteria and Archaea
tse_16SOTU <- tse_16SOTU[
  rowData(tse_16SOTU)$Kingdom %in% c("Bacteria", "Archaea"),]


library(Biostrings)
# 
# # Subset the TSE object to include only Tetragenococcus
# tetragenococcus_tse <- tse_16SOTU[
#   rowData(tse_16SOTU)$Genus == "Tetragenococcus" &
#     !is.na(rownames(tse_16SOTU)) &
#     rownames(tse_16SOTU) != "", ]
# 
# ref_sequences <- referenceSeq(tetragenococcus_tse)
# 
# # Convert to DNAStringSet if needed
# ref_sequences <- Biostrings::DNAStringSet(ref_sequences)
# Biostrings::writeXStringSet(ref_sequences, filepath = "Sample.fasta")

ref_fasta <- system.file("extdata", "Ref.fasta", package = "DspikeIn")
sample_fasta <- system.file("extdata", "Sample.fasta", package = "DspikeIn")



result <- validate_spikein_clade(
  reference_fasta = ref_fasta,
  sample_fasta = sample_fasta,
  bootstrap = 200,
  output_prefix = NULL)
## use default substitution matrix
result$tree_plot

6 Did spike-ins behave as expected across all samples?

Tip labels= OTU/ASV names Branch length numbers= Actual evolutionary distances (small = very similar) Prevalence stars How frequently the OTU occurs across samples Blue bar ring= Log10 mean abundance Outer colored tiles= The metadata variable you choose (e.g., Animal.type)

library(ggstar)
library(ggplot2)

# Filter your object to only include spike-in taxa beforehand:
# We changed the OTU IDs for easy detection
# Big stars = detected in many samples
# Small stars = rarely detected
# log10(Mean Abundance) Bars= Color intensity reflects mean abundance.
# The log-transformed average abundance of each OTU across all samples where it appears.
# Extreme blue may signal unintended over-representation.
# Metadata Ring = factor of your interest e.g. Animal.type
# Each OTU is colored by where it was observed.
# Branch length numbers=    Actual evolutionary distances (small = very similar)


library(DspikeIn)
library(TreeSummarizedExperiment)

# ---- 1. Subset taxa where Genus is Tetragenococcus ----
spikein_tse <- tse_16SOTU[
  rowData(tse_16SOTU)$Genus == "Tetragenococcus", ]


# ---- 2. Diagnostic plot (tree-based) ----
ps <- plot_spikein_tree_diagnostic(
  obj = spikein_tse,
  metadata_var = "Animal.type",
  save_plot = FALSE )

print(ps)

7 Pre_processing

merges monophyletic ASVs/OTUs. Why We Merge Similar ASVs/OTUs? In amplicon sequencing, multiple Amplicon Sequence Variants (ASVs) or Operational Taxonomic Units (OTUs) often map to the same taxonomic species due to: sequencing noise, PCR amplification artifacts, or minor intra-strain variations that do not represent biologically distinct taxa. When performing absolute quantitation (such as spike-in recovery), these subtle sequence variants should not be treated as independent species, since they originate from the same biological organism. Retaining them as separate entries can artificially fragment read counts and distort abundance estimates. Therefore, this function pre-processes the dataset by merging all ASVs/OTUs that belong to the same declared taxon into a single representative unit, while keeping the phylogenetic tree, taxonomy, and reference sequences consistent. This ensures that the subsequent quantitation step reflects true organism-level abundance, not technical redundancy.

When to use: As the first preprocessing step before calculating spike-in recovery percentages or absolute quantitation metrics.

9 Calculate the percentage of spiked species retrieval

Why and What It Measures? Spike-in standards provide an internal quantitative reference for microbial load normalization. After merging the spike-in species, we can evaluate how many sequencing reads in each sample correspond to these known spike-in taxa. This value serves as a recovery control, reflecting both: the accuracy of DNA extraction and sequencing, and the efficiency of downstream normalization. By expressing spike-in reads as a percentage of total reads per sample, we can detect samples with technical failure (too low recovery) or overrepresentation (possible contamination or pipetting bias etc.). This metric provides an essential quality-control checkpoint: samples with spike-in recovery outside the expected range (e.g., <0.1% or >20%) may indicate experimental anomalies or uneven DNA extraction efficiency. The resulting table allows quick identification of samples passing or failing QC thresholds, ensuring that downstream absolute quantitation is based on reliable input data.

# * Calculate the percentage of spiked species retrieval per sample*

library(mia)         
library(dplyr)
library(SummarizedExperiment)

# Subset TSE to remove blanks 
Spiked_16S_sum_scaled_filtered <- Spiked_16S_sum_scaled[, colData(Spiked_16S_sum_scaled)$sample.or.blank != "blank"]

# Calculate spike-in retrieval percentage
Perc <- calculate_spike_percentage(
  Spiked_16S_sum_scaled_filtered,
  merged_spiked_species,
 #output_file = "output_tse.docx",
  passed_range = c(0.1, 20)
)

head(Perc)
## # A tibble: 6 × 5
##   Sample            Total_Reads Spiked_Reads Percentage Result
##   <chr>                   <dbl>        <dbl>      <dbl> <chr> 
## 1 STP1719.20422_S47       19142        14554     76.0   failed
## 2 STP213.20423_S59         8462           83      0.981 passed
## 3 STP268.20424_S71         6968           17      0.244 passed
## 4 STP544.20419_S11         2340         2259     96.5   failed
## 5 STP570.20420_S23         2647          822     31.1   failed
## 6 STP579.20421_S35         5193         1759     33.9   failed

10 spiked species retrieval is system-dependent

Spike-in controls provide a way to assess the quantitative reliability of metagenomic or amplicon sequencing workflows. However, the expected recovery percentage of spike-in reads is not universal, it depends on the sequencing platform, extraction chemistry, and the complexity of the native microbial community. As discussed in our paper (doi.org/10.1093/ismejo/wraf150), the acceptable range of spike-in recovery is system-dependent and should be determined empirically from each dataset. This function (regression_plot()) supports that step by visualizing the relationship between observed spike-in abundance and measured total reads, stratified by recovery percentage ranges. The regression_plot() generates a scatter plot of log-transformed variables (typically Observed vs Total_Reads_spiked), overlays a linear regression fit, and facets the data according to custom recovery ranges (e.g., 010%, 1030%, 3050%, etc.).

Interpretation: A strong linear trend (high R, low p-value) within a range indicates consistent quantitative behavior. Deviations or flattening slopes suggest technical bias or over-amplification. The empirically stable range can then be used to adjust the passed_range argument in calculate_spike_percentage() for future analyses. Why this matters: By empirically determining recovery behavior instead of assuming a fixed threshold, users ensure that spike-in QC is tailored to their experimental conditions, improving both accuracy and comparability of absolute quantitation results.

# The acceptable range of spiked species retrieval is system-dependent
# Spiked species become centroid of the community (Distance to Centroid)
# Spiked species become dominant and imbalance the community (Evenness)

# What range of spiked species retrieval is appropriate for your system?
# Calculate Pielou's Evenness using Shannon index and species richness (Observed)
# Hill number q = 1 = exp(Shannon index), representing the effective number of equally abundant species. Hill is weighted by relative abundance, so dominant species have more influence.
# Unlike Pielou's evenness, this metric is not normalized by richness and it shows Effective number of common species

library(TreeSummarizedExperiment)
library(SummarizedExperiment)
library(dplyr)
library(tibble)
library(microbiome)
library(mia)
library(vegan)
library(S4Vectors)
library(ggplot2)


# --- 1. Extract current metadata from TSE ---
metadata <- colData(Spiked_16S_sum_scaled) %>%
  as.data.frame() %>%
  rownames_to_column("Sample")

# --- 2. Add spike-in reads (Perc) ---
# Ensure Perc has 'Sample' column and matching format
metadata <- dplyr::left_join(metadata, Perc, by = "Sample")

# --- 3. Estimate alpha diversity indices and extract ---
Spiked_16S_sum_scaled <- mia::addAlpha(
  Spiked_16S_sum_scaled,
  index = c("observed", "shannon", "pielou", "hill")
)

alpha_df <- colData(Spiked_16S_sum_scaled) %>%
  as.data.frame() %>%
  rownames_to_column("Sample") %>%
  select(Sample, observed, shannon, pielou, hill)

metadata <- dplyr::left_join(metadata, alpha_df, by = "Sample")

# --- 4. Compute Bray-Curtis distance to centroid ---
otu_mat <- assay(Spiked_16S_sum_scaled, "counts")
otu_mat <- t(otu_mat)  # samples as rows
otu_mat_rel <- vegan::decostand(otu_mat, method = "total")

centroid_profile <- colMeans(otu_mat_rel)

dist_to_centroid <- apply(otu_mat_rel, 1, function(x) {
  vegan::vegdist(rbind(x, centroid_profile), method = "bray")[1]
})

# Match and assign distances
metadata$Dist_to_Centroid <- dist_to_centroid[metadata$Sample]

# --- 5. Assign updated metadata to TSE ---
metadata <- column_to_rownames(metadata, var = "Sample")
metadata <- metadata[colnames(Spiked_16S_sum_scaled), , drop = FALSE]  # ensure correct order and size

colData(Spiked_16S_sum_scaled) <- S4Vectors::DataFrame(metadata)


# 4. Regression Plots: Diversity vs. Spike-in Reads
# ============================================================================

# Pielous Evenness
plot_object_pielou <- regression_plot(
  data = metadata,
  x_var = "pielou",
  y_var = "Spiked_Reads",
  custom_range = c(0.1, 20, 30, 40, 50, 60, 100),
  plot_title = "Pielous Evenness vs. Spike-in Reads"
)

# Hill Number (q = 1)
plot_object_hill <- regression_plot(
  data = metadata,
  x_var = "hill",
  y_var = "Spiked_Reads",
  custom_range = c(0.1, 10, 20, 30, 100),
  plot_title = "Hill Number (q = 1) vs. Spike-in Reads"
)

plot_object_hill

# Distance to Centroid
plot_object_centroid <- regression_plot(
  data = metadata,
  x_var = "Dist_to_Centroid",
  y_var = "Spiked_Reads",
  custom_range = c(0.1, 20, 30, 40, 50, 60, 100),
  plot_title = "Distance to Global Centroid vs. Spike-in Reads"
)


# Interpretation
# ============================================================================
# - Pielou's evenness is normalized by richness; useful for detecting imbalance.
# - Hill number q = 1 gives effective number of common species; sensitive to dominance.
# - Distance to centroid in full Bray-Curtis space shows deviation from the average community.

12 Calculating Scaling Factors

The calculate_spikeIn_factors() function estimates bias-corrected scaling factors for absolute microbiome quantitation using spike-in standards. The ratio between known and observed spike-in counts yields a sample-specific correction factor that normalizes total microbial abundances to absolute scale. This whole-cell spike-in approach corrects for variations in DNA extraction efficiency, amplification yield, and sequencing depth, yielding a unified, biologically interpretable scale for measuring absolute microbial load. The function automatically handles:Species or genus-level spike-in identification, Safe tree and taxonomy synchronization, Volume-based scaling (via the spiked.volume field in metadata), Optional export of intermediate results for traceability (Total_Reads.csv, Spiked_Reads.csv, Scaling_Factors.csv).

#                      CALCULATE SCALING FACTORS

result <- calculate_spikeIn_factors(
  Spiked_16S_sum_scaled_filtered,
  spiked_cells = spiked_cells,
  merged_spiked_species = species_name)

# View extracted outputs
result$spiked_species_reads # Merged spiked species name
## # A tibble: 260 × 2
##    Sample            Spiked_Reads
##    <chr>                    <dbl>
##  1 STP1719.20422_S47        14554
##  2 STP213.20423_S59            83
##  3 STP268.20424_S71            17
##  4 STP544.20419_S11          2259
##  5 STP570.20420_S23           822
##  6 STP579.20421_S35          1759
##  7 STP614.20418_S94             0
##  8 UHM1000.20604_S22          118
##  9 UHM1001.20609_S82           93
## 10 UHM1007.20622_S48          118
## # ℹ 250 more rows
result$total_reads # Total reads detected for the spike
## # A tibble: 260 × 2
##    Sample            Total_Reads
##    <chr>                   <dbl>
##  1 STP1719.20422_S47       19142
##  2 STP213.20423_S59         8462
##  3 STP268.20424_S71         6968
##  4 STP544.20419_S11         2340
##  5 STP570.20420_S23         2647
##  6 STP579.20421_S35         5193
##  7 STP614.20418_S94            5
##  8 UHM1000.20604_S22        5538
##  9 UHM1001.20609_S82        3957
## 10 UHM1007.20622_S48       11042
## # ℹ 250 more rows
scaling_factors <- result$scaling_factors
head(scaling_factors) # Vector of scaling factors per sample
## STP1719.20422_S47  STP213.20423_S59  STP268.20424_S71  STP544.20419_S11 
##         0.1269067        22.2530120       108.6470588         0.8176184 
##  STP570.20420_S23  STP579.20421_S35 
##         2.2469586         1.0500284

13 Convert relative counts to absolute counts

This function translates relative abundance profiles into biologically meaningful absolute counts by applying empirically derived scaling factors. Absolute quantification enables direct comparison of microbial loads across samples and treatments, avoiding compositional artifacts inherent to relative data. This transformation is fundamental for integrative multi-omics analyses, network inference, and ecological interpretation of microbiome dynamics.

#             Convert relative counts to absolute counts


# absolute counts=relative countssample-specific scaling factor

# Convert to absolute counts
library(DspikeIn)
absolute <- convert_to_absolute_counts(Spiked_16S_sum_scaled_filtered, scaling_factors)

# Extract processed data
absolute_counts <- absolute$absolute_counts
tse_absolute <- absolute$obj_adj

tse_absolute <- tidy_phyloseq_tse(tse_absolute)

# View absolute count data
head(tse_absolute)
## class: TreeSummarizedExperiment 
## dim: 6 260 
## metadata(0):
## assays(1): counts
## rownames(6): OTU1 OTU2 ... OTU5 OTU6
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(260): STP1719.20422_S47 STP213.20423_S59 ... UHM998.20618_S95
##   UHM999.20617_S83
## colData names(34): sample.id X16S.biosample ... swab.presence MK.spike
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (6 rows)
## rowTree: 1 phylo tree(s) (9227 leaves)
## colLinks: NULL
## colTree: NULL

14 Summary Stat

This function generates a concise statistical summary of numeric variables, including measures of central tendency (mean, median), dispersion (standard deviation, standard error), and distribution (quartiles). The results can be exported as both .docx and .csv files for reporting.

#                      CALCULATE SPIKE PERCENTAGE & summary stat

# ** Calculate spike percentage & Generate summary statistics for absolute counts**
# absolute_counts is a data.frame of OTU/ASV table
# Generate summary statistics for absolute counts
post_eval_summary <- calculate_summary_stats_table(absolute_counts)

# You may want to Back normal to check calculation accuracy
# the scaling factor was computed based on spiked species reads and fixed cell count.
# Multiplying the spiked species read count by this scaling factor restores the exact spiked cell count.
# lets check it
# BackNormal <- calculate_spike_percentage(
#  tse_absolute,
#  merged_spiked_species,
#  passed_range = c(0.1, 20))

15 Conclusion

conclusion() evaluates spike-in performance and recovery, i.e., how well the spike-in behaved after normalization and how consistent it is across samples. The function relies on outputs from calculate_spike_percentage() and indirectly reflects how scaling factors and normalization affected your dataset. It is adviced to be performed after absolute abundance conversion because -> If you summarize spike-in performance before normalization, youre essentially measuring how biased your workflow was, not how successfully it was corrected.

# * Calculate the percentage of spiked species retrieval per sample*

library(mia)         
library(dplyr)
library(SummarizedExperiment)

# Subset TSE to remove blanks 
tse_absolute_filtered <- tse_absolute[, colData(tse_absolute)$sample.or.blank != "blank"]

# Generate conclusion report
conc <- conclusion(
  tse_absolute_filtered,
  merged_spiked_species,
  max_passed_range = 20,
  output_path = output_path)

conc$full_report
## # A tibble: 260 × 5
##    Sample            Total_Reads Spiked_Reads Percentage Result
##    <chr>                   <dbl>        <dbl>      <dbl> <chr> 
##  1 STP1719.20422_S47        2427         1847     76.1   failed
##  2 STP213.20423_S59       188314         1847      0.981 passed
##  3 STP268.20424_S71       757053         1847      0.244 passed
##  4 STP544.20419_S11         1913         1847     96.5   failed
##  5 STP570.20420_S23         5948         1847     31.1   failed
##  6 STP579.20421_S35         5452         1847     33.9   failed
##  7 STP614.20418_S94            5            0      0     failed
##  8 UHM1000.20604_S22       43347          924      2.13  passed
##  9 UHM1001.20609_S82       39309          924      2.35  passed
## 10 UHM1007.20622_S48       86418          924      1.07  passed
## # ℹ 250 more rows
# Filter to keep only the samples that passed
passed_samples <- Perc$Sample[Perc$Result == "passed"]

# Subset the TSE object to include only passed samples
tse_passed <- tse_absolute_filtered[, colnames(tse_absolute_filtered) %in% passed_samples]
dim(tse_passed)
## [1] 8742  177

16 Abundance-based Core microbiome

Alluvial plot illustrating the distribution of prevalent and core microbiome taxa across animal types, ecoregions, diets, and ecological modes. Flows represent absolute abundance of bacterial classes (colored by taxonomic class) across categorical variables. Dominant lineages such as Bacteroidia, Clostridia, Fusobacteria and Gammaproteobacteria remain prevalent across diverse hosts and habitats.

pps_Abs <- DspikeIn::get_long_format_data(tse_passed)

# calculation for relative abundance needs sum of total reads
# total_reads <- sum(pps_Abs$Abundance)

# Generate an alluvial plot

alluvial_plot_abs <- alluvial_plot(
  data = pps_Abs,
  axes = c("Animal.type", "Ecoregion.III","Diet", "Animal.ecomode"),
  abundance_threshold = 5000,
  fill_variable = "Class",
  silent = TRUE,
  abundance_type = "absolute",
  top_taxa = 10,
  text_size = 3,
  legend_ncol = 1,
  custom_colors = DspikeIn::color_palette$MG_Awesome  # Use the color palette from DspikeIn
)

17 Data transform & Normalization

Normalization is a crucial preprocessing step in microbiome analysis. It corrects for technical artifacts such as unequal sequencing depth, amplification bias, and compositionality, ensuring that observed differences reflect biological variation rather than data imbalance. The choice of normalization depends on the goal of your analysis, differential abundance, compositional analysis, or quick exploratory visualization. Normalization is a crucial preprocessing step in microbiome analysis. It corrects for technical artifacts such as unequal sequencing depth, amplification bias, and compositionality, ensuring that observed differences reflect biological variation rather than data imbalance.
The choice of normalization depends on the goal of your analysis, differential abundance, compositional analysis, or quick exploratory visualization.

Normalization Method Corrects For Best For Notes / Comments
DESeq Library size and count variance Differential abundance testing Robust, model-based normalization from DESeq2.
TMM (Trimmed Mean of M-values) Library size and compositional bias Group comparisons Widely used in edgeR, handles unequal composition.
CSS (Cumulative Sum Scaling) Sparse, compositional count data Microbiome datasets with many zeros Balances low-abundance and high-abundance taxa.
CLR (Centered Log-Ratio) Compositional structure Ordination, diversity, correlation Produces log-ratio data suitable for distance-based methods.
TC (Total Count) Library size Simple exploratory work Quick but sensitive to outliers.
UQ (Upper Quartile) Library size and extreme counts Moderate bias correction Uses the upper quartile of counts as scaling reference.
Median Library size Exploratory visualization Stable for balanced datasets.
RLE (Relative Log Expression) Sample-specific bias Large count matrices Used in RNA-seq and applicable to microbial data.
TSS (Total Sum Scaling) Library size Relative abundance conversion Similar to TC, transforms counts to proportions.
QN (Quantile Normalization) Global distribution differences Multi-omics integration Forces identical distributions across samples.
Poisson Count noise model Statistical modeling Based on Poisson GLM; often paired with inference.
Rarefying Unequal sequencing depth Richness/evenness estimation Not recommended for statistical tests.
# you may need to normalize/transform your data to reduce biases

ps <- tse_passed

# TC Normalization
result_TC <- normalization_set(ps, method = "TC", groups = "Host.species")
normalized_ps_TC <- result_TC$dat.normed
scaling_factors_TC <- result_TC$scaling.factor

# UQ Normalization
# result_UQ <- normalization_set(ps, method = "UQ", groups = "Host.species")
# normalized_ps_UQ <- result_UQ$dat.normed
# scaling_factors_UQ <- result_UQ$scaling.factor
# 
# # Median Normalization
# result_med <- normalization_set(ps, method = "med", groups = "Host.species")
# normalized_ps_med <- result_med$dat.normed
# scaling_factors_med <- result_med$scaling.factor

# DESeq Normalization
# ps_n <- remove_zero_negative_count_samples(ps)
# result_DESeq <- normalization_set(ps_n, method = "DESeq", groups = "Animal.type")
# normalized_ps_DESeq <- result_DESeq$dat.normed
# scaling_factors_DESeq <- result_DESeq$scaling.factor

# Poisson Normalization
# result_Poisson <- normalization_set(ps, method = "Poisson", groups = "Host.genus")
# normalized_ps_Poisson <- result_Poisson$dat.normed
# scaling_factors_Poisson <- result_Poisson$scaling.factor

# Quantile Normalization
# result_QN <- normalization_set(ps, method = "QN")
# normalized_ps_QN <- result_QN$dat.normed
# scaling_factors_QN <- result_QN$scaling.factor

# TMM Normalization
# result_TMM <- normalization_set(ps, method = "TMM", groups = "Animal.type")
# normalized_ps_TMM <- result_TMM$dat.normed
# scaling_factors_TMM <- result_TMM$scaling.factor

# CLR Normalization
# result_clr <- normalization_set(ps, method = "clr")
# normalized_ps_clr <- result_clr$dat.normed
# scaling_factors_clr <- result_clr$scaling.factor

# Rarefying
# result_rar <- normalization_set(ps, method = "rar")
# normalized_ps_rar <- result_rar$dat.normed
# scaling_factors_rar <- result_rar$scaling.factor

# CSS Normalization
# result_css <- normalization_set(ps, method = "css")
# normalized_ps_css <- result_css$dat.normed
# scaling_factors_css <- result_css$scaling.factor
# 
# TSS Normalization
# result_tss <- normalization_set(ps, method = "tss")
# normalized_ps_tss <- result_tss$dat.normed
# scaling_factors_tss <- result_tss$scaling.factor

# RLE Normalization
# result_rle <- normalization_set(ps, method = "rle")
# normalized_ps_rle <- result_rle$dat.normed
# scaling_factors_rle <- result_rle$scaling.factor

Random Forestbased feature selection identifies the most predictive ASVs or OTUs that distinguish samples according to a chosen metadata variable (e.g., host, treatment, or environment). It builds an ensemble of decision trees to evaluate how strongly each taxon contributes to accurate classification, ranking taxa by Mean Decrease in Gini impurity (importance score). This approach helps users pinpoint biologically informative taxa while reducing data dimensionality, noise, and redundancy, enabling focused downstream analyses such as differential abundance, biomarker discovery, or visualization of key community drivers.For methodological details, see Liaw & Wiener (2002), Classification and Regression by randomForest, R News, 2(3):1822. Available at: journal.r-project.org/articles/RN-2002-022/RN-2002-022.pdf

18 Ridge plot

The ridge plot illustrates abundance distributions of the top 10 families across all samples, showing which taxa exhibit narrow, high-density peaks (consistent abundance) versus broader or multimodal patterns (context-dependent abundance)

normalized_tse_TC<-convert_phyloseq_to_tse(normalized_ps_TC)

rf_tse <- RandomForest_selected(
  normalized_ps_TC,
  response_var = "Host.genus",
  na_vars = c("Habitat", "Ecoregion.III", "Host.genus", "Diet")
)

ridge_tse<- ridge_plot_it(rf_tse, taxrank = "Family", top_n = 10) + scale_fill_manual(values = DspikeIn::color_palette$cool_MG)

ridge_tse

19 Differential abundance (Single Pairwise)

remove the spike-in sp before further analysis Note: Before performing differential abundance analysis, it is recommended to remove the spike-in species to avoid artificial bias in normalization. If you select the DESeq2 option embedded in DspikeIn, note that it internally handles all normalization and variance stabilization steps required for modeling raw count data. DESeq2 estimates size factors (accounting for library depth differences) and dispersion parameters (accounting for biological variability). Therefore, users should always provide integer, raw counts, not normalized, transformed, or rarefied data, as input.

library(mia)
library(dplyr)
library(SummarizedExperiment)

# ---Remove unwanted taxa by Genus, Family, or Order ---
tse_filtered <- tse_absolute[
  rowData(tse_absolute)$Genus != "Tetragenococcus" &
  rowData(tse_absolute)$Family != "Chloroplast" &
  rowData(tse_absolute)$Order != "Chloroplast", ]

tse_caudate <- tse_filtered[, colData(tse_filtered)$Clade.Order == "Caudate"]
genera_keep <- c("Desmognathus", "Plethodon", "Eurycea")
tse_three_genera <- tse_caudate[, colData(tse_caudate)$Host.genus %in% genera_keep]
tse_blue_ridge <- tse_three_genera[, colData(tse_three_genera)$Ecoregion.III == "Blue Ridge"]
tse_desmog <- tse_blue_ridge[, colData(tse_blue_ridge)$Host.genus == "Desmognathus"]

# --- Differential Abundance with DESeq2 ---
results_DESeq2 <- perform_and_visualize_DA(
  obj = tse_desmog,
  method = "DESeq2",
  group_var = "Host.taxon",
  contrast = c("Desmognathus monticola", "Desmognathus imitator"),
  output_csv_path = "DA_DESeq2.csv",
  target_glom = "Genus",
  significance_level = 0.01)


results_DESeq2$results
## # A tibble: 960 × 17
##    baseMean     logFC lfcSE      stat pvalue  padj   FDR Significance    group  
##       <dbl>     <dbl> <dbl>     <dbl>  <dbl> <dbl> <dbl> <chr>           <fct>  
##  1     9     9.80e- 8 0.256  3.82e- 7  1.000 1     1     Not Significant Desmog…
##  2     2     1.35e- 7 0.536  2.52e- 7  1.000 1     1     Not Significant Desmog…
##  3     1.36 -1.07e- 6 0.753 -1.42e- 6  1.000 1     1     Not Significant Desmog…
##  4     2     1.35e- 7 0.536  2.52e- 7  1.000 1     1     Not Significant Desmog…
##  5    10     9.29e- 8 0.244  3.81e- 7  1.000 1     1     Not Significant Desmog…
##  6    10.8  -4.62e- 1 0.284 -1.63e+ 0  0.104 0.872 0.872 Not Significant Desmog…
##  7     3     1.43e- 7 0.438  3.27e- 7  1.000 1     1     Not Significant Desmog…
##  8     1     1.22e-16 0.759  1.61e-16  1     1     1     Not Significant Desmog…
##  9    21     6.10e- 8 0.172  3.55e- 7  1.000 1     1     Not Significant Desmog…
## 10     2     1.35e- 7 0.536  2.52e- 7  1.000 1     1     Not Significant Desmog…
## # ℹ 950 more rows
## # ℹ 8 more variables: OTU <chr>, Kingdom <chr>, Phylum <chr>, Class <chr>,
## #   Order <chr>, Family <chr>, Genus <chr>, Species <chr>
results_DESeq2$obj_significant
## class: TreeSummarizedExperiment 
## dim: 22 42 
## metadata(0):
## assays(1): counts
## rownames(22): OTU270 OTU333 ... OTU7070 OTU7691
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(42): UHM1240.20566_S41 UHM1248.20575_S54 ... UHM470.20533_S25
##   UHM476.20414_S46
## colData names(34): sample.id X16S.biosample ... swab.presence MK.spike
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (22 rows)
## rowTree: 1 phylo tree(s) (22 leaves)
## colLinks: NULL
## colTree: NULL
# Optional: Visualization
 results_DESeq2$bar_plot

Differentially abundant bacterial taxa between Desmognathus imitator and D. monticola are shown (adjusted p < 0.05). Purple bars indicate taxa enriched in D. imitator (positive LFC) and teal bars those enriched in D. monticola (negative LFC). Bar length denotes log fold change; lines show 1 SE. Distinct enrichment patterns reveal host-specific bacterial community differentiation.

20 Differential abundance (Multilayer Pairwise)

# DspikeIn: Differential Abundance Examples
# Using perform_and_visualize_DA() with multiple contrast scenarios

# 1. Single Contrast

res_single <- perform_and_visualize_DA(
  obj = tse_absolute,
  method = "DESeq2",
  group_var = "Diet",
  contrast = c("Insectivore", "Carnivore"),
  target_glom = "Genus")



# 2. Single Factor  All Pairwise Contrasts
col_df <- as.data.frame(colData(tse_absolute))
col_df$Host.taxon <- factor(make.names(as.character(col_df$Host.taxon)))
colData(tse_absolute)$Host.taxon <- col_df$Host.taxon

# Get unique DESeq2-compatible factor levels
host_levels <- levels(col_df$Host.taxon)
print(host_levels)

# contrast list
contrast_named <- list(
  Host.taxon = combn(host_levels, 2, simplify = FALSE))

# multiple pairwise contrasts
res_multi <- perform_and_visualize_DA(
  obj = tse_absolute,
  method = "DESeq2",
  group_var = "Host.taxon",
  contrast = contrast_named,
  target_glom = "Genus")



# 3. Single Factor  Selected Contrasts

contrast_list <- list(
  c("Insectivore", "Carnivore"),
  c("Omnivore", "Herbivore"))

res_selected <- perform_and_visualize_DA(
  obj = tse_absolute,
  method = "DESeq2",
  group_var = "Diet",
  contrast = contrast_list,
  global_fdr = TRUE)



# 4. Multiple Factors  Selected Contrasts

contrast_named <- list(
  Diet = list(
    c("Insectivore", "Carnivore"),
    c("Omnivore", "Carnivore") ),
  Animal.type = list(
    c("Frog", "Salamander") ))

res_multi_factor <- perform_and_visualize_DA(
  obj = tse_absolute,
  method = "DESeq2",
  contrast = contrast_named,
  target_glom = "Genus",
  significance_level = 0.01,
  global_fdr = TRUE)



# 5. Multiple Factors  all pairwise contrasts

colData(tse_absolute)$Host.taxon <- droplevels(factor(colData(tse_absolute)$Host.taxon))
colData(tse_absolute)$Habitat <- droplevels(factor(colData(tse_absolute)$Habitat))
host_levels <- levels(colData(tse_absolute)$Host.taxon)
habitat_levels <- levels(colData(tse_absolute)$Habitat)

# Define pairwise contrasts as a named list
contrast_named <- list(
  Host.taxon = combn(host_levels, 2, simplify = FALSE),
  Habitat = combn(habitat_levels, 2, simplify = FALSE))

# Define ComboGroup variable
colData(tse_absolute)$ComboGroup <- factor(interaction(
  colData(tse_absolute)$Animal.type,
  colData(tse_absolute)$Diet,
  drop = TRUE))



# --- Run multi-factor DESeq2 comparisons ---
results_multi_factor <- perform_and_visualize_DA(
  obj = tse_absolute,
  method = "DESeq2",
  contrast = contrast_named,
  target_glom = "Genus",
  significance_level = 0.01)

# --- Combined Group Interactions ---
colData(tse_absolute)$ComboGroup <- factor(interaction(
  colData(tse_absolute)$Animal.type,
  colData(tse_absolute)$Diet,
  drop = TRUE))

# --- Define custom interaction contrasts ---
contrast_list <- list(
  c("Salamander.Insectivore", "Lizard.Insectivore"),
  c("Salamander.Carnivore", "Snake.Carnivore"),
  c("Salamander.Carnivore", "Frog.Carnivore"))

# --- Run combined group comparisons ---
res_combo <- perform_and_visualize_DA(
  obj = tse_absolute,
  method = "DESeq2",
  group_var = "ComboGroup",
  contrast = contrast_list,
  target_glom = "Genus",
  global_fdr = TRUE)

21 Turnover (Presence/absence analysis)

taxonomic detection consistency between relative abundance (RA) and AA tables Presence/absence analysis to detect concordance between AA and RA profiles

Purpose: This analysis compares taxonomic detection consistency between RA and AA datasets. By converting both OTU tables to presence/absence matrices and intersecting shared taxa and samples, we assess how many taxa are consistently detected in both datasets.

Interpretation: The resulting “percent_shared” value represents the proportion of taxa jointly detected in RA and AA profiles, providing an estimate of detection concordance and turnover. Presenceabsence comparison between absolute (AA) and relative (RA) abundance datasets revealed that ~100% of taxa were shared, while the discrepancies likely reflect compositional bias introduced by relative normalization rather than true biological differences.

# --------------------------------------------
#  Extract OTU matrices from TSE objects
# --------------------------------------------
# relative dataset -> tse_desmog_rel
# tse_16SOTU_spiked_samples <- tse_16SOTU[, colData(tse_16SOTU)$spiked.volume %in% c("2", "1") ]

tse_filtered_rel <- tse_16SOTU_spiked_samples[
  rowData(tse_16SOTU_spiked_samples)$Genus != "Tetragenococcus" &
  rowData(tse_16SOTU_spiked_samples)$Family != "Chloroplast" &
  rowData(tse_16SOTU_spiked_samples)$Order != "Chloroplast", ]

tse_caudate_rel <- tse_filtered_rel[, colData(tse_filtered_rel)$Clade.Order == "Caudate"]
genera_keep <- c("Desmognathus", "Plethodon", "Eurycea")
tse_three_genera_rel <- tse_caudate[, colData(tse_caudate_rel)$Host.genus %in% genera_keep]
tse_blue_ridge_rel <- tse_three_genera_rel[, colData(tse_three_genera_rel)$Ecoregion.III == "Blue Ridge"]
tse_desmog_rel <- tse_blue_ridge_rel[, colData(tse_blue_ridge_rel)$Host.genus == "Desmognathus"]

# absolute dataset -> tse_desmog

otu_rel <- assay(tse_desmog_rel, "counts") 
otu_abs <- assay(tse_desmog, "counts")        

# Make sure samples are rows and taxa are columns
otu_rel <- t(otu_rel)
otu_abs <- t(otu_abs)

# Ensure they are matrices
otu_rel <- as.matrix(otu_rel)
otu_abs <- as.matrix(otu_abs)

# --------------------------------------------
# 2. Convert to Presence/Absence 
# --------------------------------------------
otu_rel_pa <- (otu_rel > 0) * 1
otu_abs_pa <- (otu_abs > 0) * 1

# --------------------------------------------
# 3. Identify common samples & taxa
# --------------------------------------------
shared_samples <- intersect(rownames(otu_rel_pa), rownames(otu_abs_pa))
shared_taxa <- intersect(colnames(otu_rel_pa), colnames(otu_abs_pa))

# Subset to common set
otu_rel_pa <- otu_rel_pa[shared_samples, shared_taxa]
otu_abs_pa <- otu_abs_pa[shared_samples, shared_taxa]

# --------------------------------------------
# 4. Compare presence/absence profiles
# --------------------------------------------
shared_pa <- (otu_rel_pa + otu_abs_pa) == 2
total_present <- (otu_rel_pa + otu_abs_pa) >= 1

# Calculate percent agreement (global)
percent_shared <- sum(shared_pa) / sum(total_present)

cat("Percent of shared taxa detections (AA vs RA):",
    round(100 * percent_shared, 1), "%\n")
## Percent of shared taxa detections (AA vs RA): 100 %

22 Visualization

#   Visualization of community composition

# Subset rows where Genus is not Tetragenococcus and not NA
keep_taxa <- !is.na(rowData(tse_absolute)$Genus) & rowData(tse_absolute)$Genus != "Tetragenococcus"
# Subset the TSE by keeping only those taxa (rows)
tse_filtered <- tse_absolute[keep_taxa, ]
# Exclude blanks
tse_filtered <- tse_filtered[, colData(tse_filtered)$sample.or.blank != "blank"]
# subset Salamander samples
tse_sal <- tse_filtered[, colData(tse_filtered)$Animal.type == "Salamander"]
Prok_OTU_sal <- tidy_phyloseq_tse(tse_sal)

TB<-taxa_barplot(
  Prok_OTU_sal,
  target_glom = "Family",
  custom_tax_names = NULL,
  normalize = TRUE,
  treatment_variable = "Habitat",
  abundance_type = "relative",
  x_angle = 15,
  fill_variable = "Family",
  facet_variable = "Diet",
  top_n_taxa = 10,
  palette = DspikeIn::color_palette$cool_MG,
  legend_size = 11,
  legend_columns = 1,
  x_scale = "free",
  xlab = NULL)

23 Microbial dynamics & NetWork comparision

To create a microbial co-occurrence network, you can refer to the SpiecEasi package available at: SpiecEasi GitHub Repository //github.com/zdk123/SpiecEasi The weight_Network() function provides an integrated framework for exploring and quantifying microbial interaction networks. It imports a pre-computed or user-supplied GraphML network, computes key topological descriptors (e.g., modularity, connectivity, transitivity, and path length), and visualizes the graph using a force-directed layout. Node colors are assigned according to modularity classes, while edge thickness and color reflect the magnitude and sign of associations, respectively, enabling intuitive interpretation of network structure. Together, these analyses highlight how community organization and interaction patterns change under different ecological contexts, for instance, after removing hubs or specific taxa (e.g., Basidiobolus) as we showed (doi.org/10.1093/ismejo/wraf150). The resulting visualization and metrics provide complementary insights into network complexity and modular structure, serving as a quantitative basis for assessing microbiome stability and cross-domain connectivity.

# 1. Initialization and loading NetWorks for Comparision


# library(SpiecEasi)
# library(ggnet)
# library(igraph)
library(DspikeIn)
library(tidyr)
library(dplyr)
library(ggpubr)
library(igraph)


# herp.Bas is a merged phyloseq object for both bacterial and fungal domains
# spiec.easi(herp.Bas, method='mb', lambda.min.ratio=1e-3, nlambda=250,pulsar.select=TRUE )

Complete <- DspikeIn::load_graphml("Complete.graphml")
NoBasid <- DspikeIn::load_graphml("NoBasid.graphml")
NoHubs <- DspikeIn::load_graphml("NoHubs.graphml")

# degree_network -> please note that needs full path of your graphml
result_kk <- DspikeIn::degree_network(load_graphml("Complete.graphml"),
  save_metrics = TRUE,
  layout_type = "stress")

result_kk$metrics
## # A tibble: 1 × 8
##   Nodes Edges Modularity Density Transitivity Diameter Avg_Path_Length
##   <dbl> <dbl>      <dbl>   <dbl>        <dbl>    <dbl>           <dbl>
## 1   308  1144      0.736  0.0242        0.179     1.09          0.0982
## # ℹ 1 more variable: Avg_Degree <dbl>
 result <- DspikeIn::weight_Network(graph_path = "Complete.graphml")
 result$plot

Burts Theory of Structural Holes (Ronald S. Burt, 1992, 2004) & Network Closure and Redundancy (1992, 2001) Cohesion and Trust & Brokerage and Network Advantage (Burt, 2005) The BrokerageClosure Tradeoff & Redundancy and Efficiency (20072010) Non-redundant Contacts and Efficiency & Structural Autonomy (Burt, 1997) Freedom through Brokerage

Ronald Burt (1992) proposed that in a network, some nodes occupy structural holes, gaps between otherwise disconnected groups or clusters. A node that bridges these holes plays a unique role: it connects information or resources between otherwise separate modules. Bridging role of Basidiobolus consistent with Burts Structural Hole Theory. The network illustrates taxon-level associations among microbial domains. Distinct bacterial and fungal subnetworks form modular clusters (colored by domain or module). Basidiobolus acts as a bridging node connecting otherwise disconnected clusters, a position corresponding to a structural hole in Burts theory (Burt, 1992). Such brokerage positions facilitate information and metabolite exchange across domains, enhancing functional redundancy and community stability. Nodes are sized by betweenness centrality, emphasizing the high structural influence of Basidiobolus in maintaining network cohesion under environmental stress. Moreover, Basidioboluss links are non-redundant; it connects functionally distinct taxa that otherwise have no direct edge. This makes it ecologically efficient in facilitating novel functional exchanges (e.g., horizontal gene transfer or cross-domain metabolic complementation).

24 Visualizes pairwise relationships between node-level metrics using a quadrant-based plot

node_level_metrics() & quadrant_plot() functions quantify and visualize node-level structural properties within microbial association networks. By integrating multiple centrality and connectivity metrics, it characterizes each nodes influence, redundancy (based on Bert’s theory), and modular affiliation, allowing comparisons of network architecture and community organization. The resulting plots and table provide complementary views of node prominence and connectivity patterns, serving as a foundation for interpreting ecological stability, functional redundancy, and cross-domain interactions.

Metric Description
‘Node’ Node name (character format)
‘Degree’ Number of edges connected to the node
‘Strength’ Sum of edge weights connected to the node
‘Closeness’ Closeness centrality (normalized, based on shortest paths)
‘Betweenness’ Betweenness centrality (normalized, measures control over network flow)
‘EigenvectorCentrality’ Eigenvector centrality (importance based on connections to influential nodes)
‘PageRank’ PageRank score (importance based on incoming links)
‘Transitivity’ Local clustering coefficient (tendency of a node to form triangles)
‘Coreness’ Node’s coreness (from k-core decomposition)
‘Constraint’ Burt’s constraint (measures structural holes in a node’s ego network)
‘EffectiveSize’ Inverse of constraint (larger values = more non-redundant connections)
‘Redundancy’ Sum of constraint values of a node’s alters
‘Community’ Community assignment from Louvain clustering
‘Efficiency’ Global efficiency (average inverse shortest path length)
‘Local_Efficiency’ Local efficiency (subgraph efficiency for a node’s neighbors)
‘Within_Module_Connectivity’ Proportion of neighbors in the same community
‘Among_Module_Connectivity’ Proportion of neighbors in different communities
# Load required libraries
library(igraph)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggrepel)
ggplot2::margin()
## [1] 0points 0points 0points 0points
# 2.  Metrics Calculation

completeMetrics <- DspikeIn::node_level_metrics(Complete)
NoHubsMetrics <- DspikeIn::node_level_metrics(NoHubs)
NoBasidMetrics <- DspikeIn::node_level_metrics(NoBasid)


Complete_metrics <- completeMetrics$metrics
Nohub_metrics <- NoHubsMetrics$metrics
Nobasid_metrics <- NoBasidMetrics$metrics

Complete_metrics <- data.frame(lapply(Complete_metrics, as.character), stringsAsFactors = FALSE)
Nohub_metrics <- data.frame(lapply(Nohub_metrics, as.character), stringsAsFactors = FALSE)
Nobasid_metrics <- data.frame(lapply(Nobasid_metrics, as.character), stringsAsFactors = FALSE)


print(igraph::vcount(Complete)) # Number of nodes in Complete network
## [1] 308
print(igraph::ecount(Complete)) # Number of edges in Complete network
## [1] 1144
print(igraph::vcount(NoBasid))
## [1] 307
print(igraph::ecount(NoBasid))
## [1] 1187
print(igraph::vcount(NoHubs))
## [1] 286
print(igraph::ecount(NoHubs))
## [1] 916
metrics_scaled <- bind_rows(
  Complete_metrics %>% mutate(Network = "Complete"),
  Nohub_metrics %>% mutate(Network = "NoHubs"),
  Nobasid_metrics %>% mutate(Network = "NoBasid")
) %>%
  dplyr::mutate(dplyr::across(where(is.numeric), scale))

metrics_long_scaled <- metrics_scaled %>%
  tidyr::pivot_longer(cols = -c(Node, Network), names_to = "Metric", values_to = "Value")


# Ensure each dataset has a "Network" column before combining
completeMetrics$metrics <- completeMetrics$metrics %>% mutate(Network = "Complete Network")
NoHubsMetrics$metrics <- NoHubsMetrics$metrics %>% mutate(Network = "Network & Module Hubs Removed")
NoBasidMetrics$metrics <- NoBasidMetrics$metrics %>% mutate(Network = "Basidiobolus Subnetwork Removed")

# Combine All Data
combined_data <- bind_rows(
  completeMetrics$metrics,
  NoHubsMetrics$metrics,
  NoBasidMetrics$metrics
)


# Add Node Identifier if missing
if (!"Node" %in% colnames(combined_data)) {
  combined_data <- combined_data %>% mutate(Node = rownames(.))
}

# Convert `Network` into Factor
combined_data$Network <- factor(combined_data$Network, levels = c(
  "Complete Network",
  "Network & Module Hubs Removed",
  "Basidiobolus Subnetwork Removed"))

# Convert Data to Long Format
metrics_long <- combined_data %>%
  pivot_longer(
    cols = c("Redundancy", "Efficiency", "Betweenness"),
    names_to = "Metric", values_to = "Value")


# Define Custom Colors and Shapes
  network_colors <- c(
  "Complete Network" = "#F1E0C5",
  "Network & Module Hubs Removed" = "#D2A5A1",
  "Basidiobolus Subnetwork Removed" = "#B2C3A8")
  
network_shapes <- c(
  "Complete Network" = 21,
  "Network & Module Hubs Removed" = 22,
  "Basidiobolus Subnetwork Removed" = 23)

# Determine Top 30% of Nodes to Label/Optional
metrics_long <- metrics_long %>%
  group_by(Network, Metric) %>%
  mutate(Label = ifelse(rank(-Value, ties.method = "random") / n() <= 0.3, Node, NA))

# ?quadrant_plot() can creat plot for indivisual network
 plot4 <- quadrant_plot(completeMetrics$metrics, x_metric = "Among_Module_Connectivity", y_metric = "Degree")

 
 
# Customized comparision Plots
create_metric_plot <- function(metric_name, data, title) {
  data_filtered <- data %>% filter(Metric == metric_name)
  median_degree <- median(data_filtered$Degree, na.rm = TRUE)
  median_value <- median(data_filtered$Value, na.rm = TRUE)

  ggplot(data_filtered, aes(x = Degree, y = Value, fill = Network)) +
    geom_point(aes(shape = Network), size = 3, stroke = 1, color = "black") +
    geom_text_repel(aes(label = Label), size = 3, max.overlaps = 50) +
    scale_fill_manual(values = network_colors) +
    scale_shape_manual(values = network_shapes) +
    geom_vline(xintercept = median_degree, linetype = "dashed", color = "black", size = 1) +
    geom_hline(yintercept = median_value, linetype = "dashed", color = "black", size = 1) +
    labs(
      title = title,
      x = "Degree",
      y = metric_name,
      fill = "Network",
      shape = "Network" ) +
    theme_minimal() +
    theme(
      plot.title = element_text(
        hjust = 0.5, size = 16, face = "bold",
        margin = margin(t = 10, b = 20) # Moves the title downward
      ),
      axis.title = element_text(size = 14, face = "bold"),
      legend.position = "top",
      legend.title = element_text(size = 14, face = "bold"),
      legend.text = element_text(size = 12)  )
}


# Generate Plots
plot_redundancy <- create_metric_plot("Redundancy", metrics_long, "Redundancy vs. Degree Across Networks")
plot_efficiency <- create_metric_plot("Efficiency", metrics_long, "Efficiency vs. Degree Across Networks")
plot_betweenness <- create_metric_plot("Betweenness", metrics_long, "Betweenness vs. Degree Across Networks")

# Print Plots
 print(plot_betweenness)

Betweenness-degree relationships illustrate how node influence and connectivity shift after hub or Basidiobolus subnetwork removal, identifying taxa that maintain global connectivity and serve as potential ecological keystones.

25 Comparision of node-level metrics

Compares standardized node-level metrics across network configurations (e.g., complete vs. hub-removed). Boxplots and Wilcoxon tests illustrate how structural perturbations alter node influence and connectivity, offering insights into microbial network robustness and modular reorganization.

# 2.  Metrics Calculation

result_Complete <- DspikeIn::node_level_metrics(Complete)
result_NoHubs <- DspikeIn::node_level_metrics(NoHubs)
result_NoBasid <- DspikeIn::node_level_metrics(NoBasid)

Complete_metrics <- result_Complete$metrics
Nohub_metrics <- result_NoHubs$metrics
Nobasid_metrics <- result_NoBasid$metrics

Complete_metrics <- data.frame(lapply(Complete_metrics, as.character), stringsAsFactors = FALSE)
Nohub_metrics <- data.frame(lapply(Nohub_metrics, as.character), stringsAsFactors = FALSE)
Nobasid_metrics <- data.frame(lapply(Nobasid_metrics, as.character), stringsAsFactors = FALSE)


print(igraph::vcount(Complete)) # Number of nodes in Complete network
## [1] 308
print(igraph::ecount(Complete)) # Number of edges in Complete network
## [1] 1144
print(igraph::vcount(NoBasid))
## [1] 307
print(igraph::ecount(NoBasid))
## [1] 1187
print(igraph::vcount(NoHubs))
## [1] 286
print(igraph::ecount(NoHubs))
## [1] 916
metrics_scaled <- bind_rows(
  Complete_metrics %>% mutate(Network = "Complete"),
  Nohub_metrics %>% mutate(Network = "NoHubs"),
  Nobasid_metrics %>% mutate(Network = "NoBasid")
) %>%
  dplyr::mutate(dplyr::across(where(is.numeric), scale))

metrics_long_scaled <- metrics_scaled %>%
  tidyr::pivot_longer(cols = -c(Node, Network), names_to = "Metric", values_to = "Value")


# Remove missing values
metrics_long_scaled <- na.omit(metrics_long_scaled)

# We visualize only six metrics
selected_metrics <- c(
  "Degree", "Closeness", "Betweenness",
  "EigenvectorCentrality", "PageRank", "Transitivity"
)

metrics_long_filtered <- metrics_long_scaled %>%
  filter(Metric %in% selected_metrics) %>%
  mutate(
    Value = as.numeric(as.character(Value)),
    Network = recode(Network,
      "Complete" = "Complete Network",
      "NoHubs" = "Network & Module Hubs Removed",
      "NoBasid" = "Basidiobolus Subnetwork Removed" ) ) %>%
  na.omit() # Remove any NA if any

metrics_long_filtered$Network <- factor(metrics_long_filtered$Network,
  levels = c(
    "Complete Network",
    "Network & Module Hubs Removed",
    "Basidiobolus Subnetwork Removed" ))

# DspikeIn::color_palette$light_MG
network_colors <- c(
  "Complete Network" = "#F1E0C5",
  "Network & Module Hubs Removed" = "#D2A5A1",
  "Basidiobolus Subnetwork Removed" = "#B2C3A8")

# statistical comparisons a vs b
comparisons <- list(
  c("Complete Network", "Network & Module Hubs Removed"),
  c("Complete Network", "Basidiobolus Subnetwork Removed"),
  c("Network & Module Hubs Removed", "Basidiobolus Subnetwork Removed"))

networks_in_data <- unique(metrics_long_filtered$Network)
comparisons <- comparisons[sapply(comparisons, function(pair) all(pair %in% networks_in_data))]

met <- ggplot(metrics_long_filtered, aes(x = Network, y = Value, fill = Network)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(color = Network),
    position = position_jitter(0.2), alpha = 0.2, size = 1.5 ) +
  scale_fill_manual(values = network_colors) +
  scale_color_manual(values = network_colors) +
  facet_wrap(~Metric, scales = "free_y", labeller = label_wrap_gen(width = 20)) +
  ggpubr::stat_compare_means(method = "wilcox.test", label = "p.signif", comparisons = comparisons) +
  theme_minimal() +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(size = 10, angle = 10, hjust = 0.8),
    strip.text = element_text(size = 12, margin = margin(t = 15, b = 5)),
    legend.position = "top",
    legend.text = element_text(size = 13),
    legend.title = element_text(size = 13, face = "bold"),
    plot.title = element_text(size = 13, face = "bold")
  ) + labs(title = "Selected Node Metrics Across Networks", fill = "Network Type", color = "Network Type")
met

Node-level topological metrics were compared across network configurations to assess structural robustness. Removal of the Basidiobolus subnetwork altered centrality measures, particularly closeness and PageRank, while maintaining overall network coherence.

26 To find neighbors of the target node

This function identifies both first- and second-degree neighbors of a given target node within a microbial association network. It enables users to trace direct and indirect ecological connections, revealing potential mediator taxa that bridge interactions between microbial guilds. The resulting summary table facilitates the inspection of a nodes local connectivity and extended influence zone, supporting hypothesis-driven exploration of cross-domain interactions and community structuring.

Complete <- load_graphml("Complete.graphml")
result2 <- DspikeIn::extract_neighbors(
  graph = Complete,
  target_node = "OTU69:Basidiobolus_sp", mode = "all")
head(result2$summary)
## # A tibble: 6 × 2
##   Type           Node                          
##   <chr>          <chr>                         
## 1 First Neighbor OTU8:Mortierella_sp           
## 2 First Neighbor OTU13:Mortierella_sp          
## 3 First Neighbor OTU15:Mortierella_sp          
## 4 First Neighbor OTU16:Ascomycota_sp           
## 5 First Neighbor OTU18:Helotiales_sp           
## 6 First Neighbor OTU19:Margaritispora_monticola

27 Compute degree metrics and visualize the network

This analysis evaluates the global topology of microbial association networks under different structural configurationsnamely, the complete network, a hub-removed network, and a Basidiobolus-excluded subnetwork. Rather than inspecting individual nodes, this step focuses on overall network organization, including metrics that describe connectivity, efficiency, modularity, and cohesion.

The degree_network() function computes degree-based summaries and layouts for visualization, while weight_Network() extracts quantitative descriptors such as total edge weight, mean path length, and network density. Aggregating these metrics across network types enables direct comparison of topological robustness and information flow after targeted node or subnetwork removal.

The resulting bar plot displays log-scaled totals of each topological descriptor, allowing users to identify how network simplification (e.g., hub or guild removal) reshapes the overall architecture and resilience of the microbial network.

# Options: `"stress"` (default), `"graphopt"`, `"fr"`

# Compute degree metrics 
result <- degree_network(graph_path = Complete, save_metrics = TRUE)

# Compute network weights for different graph structures
NH <- weight_Network(graph_path = "NoHubs.graphml")
NB <- weight_Network(graph_path = "NoBasid.graphml")
C  <- weight_Network(graph_path = "Complete.graphml")

# Extract metrics from the computed network weights
CompleteM <- C$metrics
NoHubsM   <- NH$metrics
NoBasidM  <- NB$metrics

# Combine metrics into a single dataframe for comparison
df <- bind_rows(
  CompleteM %>% mutate(Group = "Complete Network"),
  NoHubsM %>% mutate(Group = "Network & Module Hubs Removed"),
  NoBasidM %>% mutate(Group = "Basidiobolus Subnetwork Removed")
) %>%
  pivot_longer(cols = -Group, names_to = "Metric", values_to = "Value")

# Aggregate total values by metric and group
df_bar <- df %>%
  group_by(Metric, Group) %>%
  summarise(Total_Value = sum(Value, na.rm = TRUE), .groups = "drop")

# Order metrics by mean value (optional)
metric_order <- df_bar %>%
  group_by(Metric) %>%
  summarise(mean_value = mean(Total_Value)) %>%
  arrange(desc(mean_value)) %>%
  pull(Metric)

df_bar$Metric <- factor(df_bar$Metric, levels = metric_order)

# Define refined color palette
network_colors <- c(
  "Complete Network" = "#F1E0C5",
  "Network & Module Hubs Removed" = "#D2A5A1",
  "Basidiobolus Subnetwork Removed" = "#B2C3A8"
)

# Create elegant bar plot
pg <- ggplot(df_bar, aes(x = Metric, y = log1p(Total_Value), fill = Group)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7, alpha = 0.9) +
  scale_fill_manual(values = network_colors) +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(size = 11, angle = 15, hjust = 1, vjust = 1, color = "gray20"),
    axis.text.y = element_text(size = 11, color = "gray25"),
    axis.title = element_text(size = 13, face = "bold"),
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5, margin = ggplot2::margin(b = 10)),
    legend.position = "top",
    legend.title = element_blank(),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  labs(
    subtitle = "Comparison of total network-level metrics (log-scaled)",
    x = NULL,
    y = "Total Value (log1p)"
  ) +
  geom_text(
    aes(label = round(log1p(Total_Value), 1)),
    position = position_dodge(width = 0.8),
    vjust = -0.3,
    size = 3.5,
    color = "gray25"
  )

28 Small-World Index Analysis

To determine whether the complete network exhibited small-world topology, we computed the Small-World Index (SWI; ) following the quantitative framework established by Humphries M, Gurney K, 2008. PLoS ONE 4 (Eq.1) (SWI; ) = (Global clustering coefficient real/Global clustering coefficient random)/(Avg Path real/Avg Path random))

library(igraph)
library(tidygraph)
library(ggraph)
library(DspikeIn)


# AA abundance
Complete<-load_graphml("Complete.graphml")

deg <- degree(Complete)

hist(deg, breaks = 30, main = "Degree Distribution", xlab = "Degree")
fit <- fit_power_law(deg + 1)  # Avoid zero degrees
print(fit)


# ---- empirical network ----
g_empirical = Complete
#g_empirical = Complete_Rel



# ---- Calculate real network metrics ----
creal <- transitivity(g_empirical, type = "global")
E(g_empirical)$weight <- abs(E(g_empirical)$weight)
lreal <- mean_distance(g_empirical, directed = FALSE, unconnected = TRUE, weights = E(g_empirical)$weight)

# ---- Generate 1000 Erds-Rnyi random graphs with same size ----
set.seed(42)  # for reproducibility
n_nodes <- vcount(g_empirical)
n_edges <- ecount(g_empirical)

crand_vals <- numeric(1000)
lrand_vals <- numeric(1000)

for (i in 1:1000) {
g_rand <- sample_gnm(n = n_nodes, m = n_edges, directed = FALSE)
if (!is_connected(g_rand)) next
  
  crand_vals[i] <- transitivity(g_rand, type = "global")
  lrand_vals[i] <- mean_distance(g_rand, directed = FALSE, unconnected = TRUE)
}

# ---- Calculate mean values across random graphs ----
crand <- mean(crand_vals, na.rm = TRUE)
lrand <- mean(lrand_vals, na.rm = TRUE)

# ---- Compute Small-World Index () ----
sigma <- (creal / crand) / (lreal / lrand)


cat("Global clustering coefficient (real):", creal, "\n")
cat("Average path length (real):", lreal, "\n")
cat("Mean clustering coefficient (random):", crand, "\n")
cat("Mean path length (random):", lrand, "\n")
cat("Small-World Index ():", round(sigma, 2), "\n")

= 231.83 is extremely high, showing very strong small-world properties. > 1 indicates a small-world network,combining high clustering with short paths.

sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggrepel_0.9.6                   igraph_2.2.0                   
##  [3] ggpubr_0.6.2                    tidyr_1.3.1                    
##  [5] vegan_2.7-2                     permute_0.9-8                  
##  [7] microbiome_1.31.4               tibble_3.3.0                   
##  [9] mia_1.17.10                     MultiAssayExperiment_1.35.9    
## [11] dplyr_1.1.4                     flextable_0.9.10               
## [13] ggplot2_4.0.0                   ggstar_1.0.6                   
## [15] TreeSummarizedExperiment_2.17.1 SingleCellExperiment_1.31.1    
## [17] Biostrings_2.77.2               XVector_0.49.1                 
## [19] phyloseq_1.53.0                 DspikeIn_0.99.29               
## [21] SummarizedExperiment_1.39.2     Biobase_2.69.1                 
## [23] GenomicRanges_1.61.6            Seqinfo_0.99.3                 
## [25] IRanges_2.43.5                  S4Vectors_0.47.4               
## [27] BiocGenerics_0.55.4             generics_0.1.4                 
## [29] MatrixGenerics_1.21.0           matrixStats_1.5.0              
## 
## loaded via a namespace (and not attached):
##   [1] ggtext_0.1.2                fs_1.6.6                   
##   [3] DirichletMultinomial_1.51.0 RColorBrewer_1.1-3         
##   [5] tools_4.5.1                 backports_1.5.0            
##   [7] utf8_1.2.6                  R6_2.6.1                   
##   [9] lazyeval_0.2.2              mgcv_1.9-3                 
##  [11] rhdf5filters_1.21.4         withr_3.0.2                
##  [13] gridExtra_2.3               cli_3.6.5                  
##  [15] textshaping_1.0.4           officer_0.7.0              
##  [17] sandwich_3.1-1              slam_0.1-55                
##  [19] labeling_0.4.3              sass_0.4.10                
##  [21] mvtnorm_1.3-3               S7_0.2.0                   
##  [23] readr_2.1.5                 randomForest_4.7-1.2       
##  [25] ggridges_0.5.7              askpass_1.2.1              
##  [27] systemfonts_1.3.1           yulab.utils_0.2.1          
##  [29] dichromat_2.0-0.1           scater_1.37.0              
##  [31] decontam_1.29.0             parallelly_1.45.1          
##  [33] limma_3.65.7                readxl_1.4.5               
##  [35] fillpattern_1.0.2           gridGraphics_0.5-1         
##  [37] car_3.1-3                   zip_2.3.3                  
##  [39] rbiom_2.2.1                 Matrix_1.7-4               
##  [41] biomformat_1.37.0           ggbeeswarm_0.7.2           
##  [43] DECIPHER_3.5.1              abind_1.4-8                
##  [45] lifecycle_1.0.4             multcomp_1.4-29            
##  [47] yaml_2.3.10                 edgeR_4.7.6                
##  [49] carData_3.0-5               rhdf5_2.53.6               
##  [51] SparseArray_1.9.1           Rtsne_0.17                 
##  [53] grid_4.5.1                  crayon_1.5.3               
##  [55] lattice_0.22-7              beachmat_2.25.5            
##  [57] pillar_1.11.1               knitr_1.50                 
##  [59] estimability_1.5.1          codetools_0.2-20           
##  [61] fastmatch_1.1-6             glue_1.8.0                 
##  [63] ggiraph_0.9.2               ggfun_0.2.0                
##  [65] fontLiberation_0.1.0        data.table_1.17.8          
##  [67] vctrs_0.6.5                 treeio_1.33.0              
##  [69] cellranger_1.1.0            gtable_0.3.6               
##  [71] cachem_1.1.0                xfun_0.53                  
##  [73] S4Arrays_1.9.1              tidygraph_1.3.1            
##  [75] coda_0.19-4.1               survival_3.8-3             
##  [77] iterators_1.0.14            statmod_1.5.1              
##  [79] bluster_1.19.1              TH.data_1.1-4              
##  [81] nlme_3.1-168                ggtree_3.99.2              
##  [83] fontquiver_0.2.1            bslib_0.9.0                
##  [85] irlba_2.3.5.1               vipor_0.4.7                
##  [87] DBI_1.2.3                   ade4_1.7-23                
##  [89] phangorn_2.12.1             DESeq2_1.49.8              
##  [91] tidyselect_1.2.1            emmeans_1.11.2-8           
##  [93] compiler_4.5.1              BiocNeighbors_2.3.1        
##  [95] xml2_1.4.0                  fontBitstreamVera_0.1.1    
##  [97] DelayedArray_0.35.3         scales_1.4.0               
##  [99] quadprog_1.5-8              rappdirs_0.3.3             
## [101] stringr_1.5.2               digest_0.6.37              
## [103] rmarkdown_2.30              htmltools_0.5.8.1          
## [105] pkgconfig_2.0.3             sparseMatrixStats_1.21.0   
## [107] fastmap_1.2.0               rlang_1.1.6                
## [109] htmlwidgets_1.6.4           DelayedMatrixStats_1.31.0  
## [111] farver_2.1.2                jquerylib_0.1.4            
## [113] zoo_1.8-14                  jsonlite_2.0.0             
## [115] BiocParallel_1.43.4         BiocSingular_1.25.0        
## [117] magrittr_2.0.4              polynom_1.4-1              
## [119] Formula_1.2-5               scuttle_1.19.0             
## [121] ggplotify_0.1.3             patchwork_1.3.2            
## [123] Rhdf5lib_1.31.1             Rcpp_1.1.0                 
## [125] ape_5.8-1                   ggnewscale_0.5.2           
## [127] viridis_0.6.5               gdtools_0.4.4              
## [129] stringi_1.8.7               ggalluvial_0.12.5          
## [131] ggraph_2.2.2                MASS_7.3-65                
## [133] plyr_1.8.9                  parallel_4.5.1             
## [135] graphlayouts_1.2.2          splines_4.5.1              
## [137] gridtext_0.1.5              multtest_2.65.0            
## [139] hms_1.1.4                   msa_1.41.3                 
## [141] locfit_1.5-9.12             uuid_1.2-1                 
## [143] ggtreeExtra_1.19.1          ggsignif_0.6.4             
## [145] reshape2_1.4.4              ScaledMatrix_1.17.0        
## [147] evaluate_1.0.5              tzdb_0.5.0                 
## [149] foreach_1.5.2               tweenr_2.0.3               
## [151] openssl_2.3.4               purrr_1.1.0                
## [153] polyclip_1.10-7             BiocBaseUtils_1.11.2       
## [155] ggforce_0.5.0               rsvd_1.0.5                 
## [157] broom_1.0.10                xtable_1.8-4               
## [159] tidytree_0.4.6              rstatix_0.7.3              
## [161] viridisLite_0.4.2           ragg_1.5.0                 
## [163] aplot_0.2.9                 memoise_2.0.1              
## [165] beeswarm_0.4.0              cluster_2.1.8.1

29 DspikeIn volume protocol

Spike-in volume Protocol;

The species Tetragenococcus halophilus (bacterial spike; ATCC33315) and Dekkera bruxellensis (fungal spike; WLP4642-White Labs) were selected as taxa to spike into gut microbiome samples as they were not found in an extensive collection of wildlife skin (GenBank BioProjects: PRJNA1114724, PRJNA 1114659) or gut microbiome samples. Stock cell suspensions of both microbes were grown in either static tryptic soy broth (T. halophilus) or potato dextrose broth (D. bruxellensis) for 72 hours then serially diluted and optical density (OD600) determined on a ClarioStar plate reader. Cell suspensions with an optical density of 1.0, 0.1, 0.01, 0.001 were DNA extracted using the Qiagen DNeasy Powersoil Pro Kit. These DNA isolations were used as standards to determine the proper spike in volume of cells to represent 0.1-10% of a sample (Rao et al., 2021b) Fecal pellets (3.1 1.6 mg; range = 1 5.1 mg) from an ongoing live animal study using wood frogs (Lithobates sylvaticus) were used to standardize the input material for the development of this protocol. A total of (n=9) samples were used to validate the spike in protocol. Each fecal sample was homogenized in 1mL of sterile molecular grade water then 250uL of fecal slurry was DNA extracted as above with and without spiked cells. Two approaches were used to evaluate the target spike-in of 0.1-10%, the range of effective spike-in percentage described in (Rao et al., 2021b), including 1) an expected increase of qPCR cycle threshold (Ct) value that is proportional to the amount of spiked cells and 2) the expected increase in copy number of T. halophilus and D. bruxellensis in spiked vs. unspiked samples. A standard curve was generated using a synthetic fragment of DNA for the 16S-V4 rRNA and ITS1 rDNA regions of T. halophilus and D. bruxellensis, respectively. The standard curve was used to convert Ct values into log copy number for statistical analyses (detailed approach in[2, 3]) using the formula y = -0.2426x + 10.584 for T. halophilus and y = -0.3071x + 10.349 for D. bruxellensis, where x is the average Ct for each unknown sample. Quantitative PCR (qPCR) was used to compare known copy numbers from synthetic DNA sequences of T. halophilus and D. bruxellensis to DNA extractions of T. halophilus and D. bruxellensis independently, and wood frog fecal samples with and without spiked cells. SYBR qPCR assays were run at 20ul total volume including 10ul 2X Quantabio PerfeCTa SYBR Green Fastmix, 1ul of 10uM forward and reverse primers, 1ul of ArcticEnzymes dsDNAse master mix clean up kit, and either 1ul of DNA for D. bruxellensis or 3ul for T. halophilus. Different volumes of DNA were chosen for amplification of bacteria and fungi due to previous optimization of library preparation and sequencing steps [3]. The 515F [4] and 806R [5] primers were chosen to amplify bacteria and ITS1FI2 [6] and ITS2 for fungi, as these are the same primers used during amplicon library preparation and sequencing. Cycling conditions on an Agilent AriaMX consisted of 95 C for 3 mins followed by 40 cycles of 95 C for 15 sec, 60 C for 30 sec and 72 C for 30 sec. Following amplification, a melt curve was generated under the following conditions including 95 C for 30 sec, and a melt from 60 C to 90 C increasing in resolution of 0.5 C in increments of a 5 sec soak time. To validate the spike in protocol we selected two sets of fecal samples including 360 samples from a diverse species pool of frogs, lizards, salamanders and snakes and a more targeted approach of 122 fecal samples from three genera of salamanders from the Plethodontidae. (Supplemental Table #). FFecal samples were not weighed in the field, rather, a complete fecal pellet was diluted in an equal volume of sterile water and standardized volume of fecal slurry (250L) extracted for independent samples.. A volume of 1ul T. halophilus (1847 copies) and 1ul D. bruxellensis (733 copies) were spiked into each fecal sample then DNA was extracted as above, libraries constructed, and amplicon sequenced on an Illumina MiSeq as in [7] .

<github.com/mghotbi/DspikeIn>