With compositional data analysis (CoDA), we can make sample-level comparisons of scRNAseq data based on the relative abundance celltypes that compose them.
In this vignette we show how to:
This example works with the Tabula Muris Senis lung dataset which can be downloaded directly with bioconductor It contains lung single-cell profiles from 14 donor mice (mouse_id). Each mouse is assigned to one of five age groups, from young adult to old, recorded in the age column. Standard cell-type labels (free_annotation) and additional metadata (e.g. sex) are also included.
This vignette is modular; users can update or replace the dataset-specific settings (e.g., metadata column names, reference cell type) as appropriate for their own data.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SETA")
library(SingleCellExperiment)
library(SETA)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(caret)
have_ml <- all(vapply(c("caret","glmnet"), requireNamespace, logical(1), quietly = TRUE))
if (!have_ml) {
warning("Some machine learning packages are not installed. ML chunks will be skipped.")
}
library(TabulaMurisSenisData)
Quick data exploration:
table(sce$free_annotation, sce$age)
##
## 1m 3m 18m 21m 30m
## Adventitial Fibroblast 29 17 56 43 23
## Airway Smooth Muscle 2 0 10 0 0
## Alveolar Epithelial Type 2 7 2 35 18 11
## Alveolar Fibroblast 87 37 166 101 26
## Alveolar Macrophage 517 116 193 91 276
## Artery 13 16 32 26 11
## B 98 135 459 198 180
## Basophil 9 8 19 46 40
## CD4+ T 143 122 72 107 106
## CD8+ T 38 70 234 335 193
## Capillary 194 119 617 522 343
## Capillary Aerocyte 125 35 129 148 93
## Ccr7+ Dendritic 2 2 2 5 7
## Ciliated 10 0 9 4 3
## Classical Monocyte 485 111 968 242 3615
## Club 2 0 4 5 3
## Intermediate Monocyte 111 29 95 19 1495
## Interstitial Macrophage 7 13 16 21 1098
## Ly6g5b+ T 0 0 53 65 42
## Lympatic 6 2 11 12 10
## Myeloid Dendritic Type 1 38 20 31 26 44
## Myeloid Dendritic Type 2 8 10 17 16 26
## Myofibroblast 23 13 9 9 0
## Natural Killer 140 201 153 224 120
## Natural Killer T 20 7 152 50 191
## Neuroendocrine 0 0 0 0 0
## Neutrophil 64 22 413 29 22
## Nonclassical Monocyte 157 160 187 194 287
## Pericyte 11 7 8 1 8
## Plasma 3 1 5 16 23
## Plasmacytoid Dendritic 21 8 15 10 14
## Proliferating Alveolar Macrophage 60 9 7 0 14
## Proliferating Classical Monocyte 1 0 13 0 2473
## Proliferating Dendritic 4 3 8 4 8
## Proliferating NK 5 2 5 3 9
## Proliferating T 15 9 11 18 36
## Regulatory T 8 11 28 20 55
## Vein 19 28 93 107 73
## Zbtb32+ B 26 18 41 216 115
# Set up a color palette for plots
# 3-group distinct categoricals
age_palette <- c(
"1m" = "#90EE90",
"3m" = "#4CBB17",
"18m" = "#228B22",
"21m" = "#355E3B",
"30m" = "#023020")
# continuous palette of similar look
c_palette <- colorRampPalette(c("#3B9AB2", "#78B7C5",
"#EBCC2A", "#E1AF00",
"#F21A00"))(100)
The setaCounts function extracts a cell-type counts matrix given that the object contains cell-level metadata. For this dataset, we want the cell-type annotations to come from Celltype and the sample identifiers from donor_id. If necessary, we reassign these columns before extraction.
df <- data.frame(colData(sce))
df$mouse.id <- gsub("/", "", df$mouse.id)
taxa_counts <- setaCounts(
df,
cell_type_col = "free_annotation",
sample_col = "mouse.id",
bc_col = "cell")
## Converting factor class columns to character: age, cell_ontology_class, cell_ontology_id, free_annotation, method, sex, subtissue, tissue, tissue_free_annotation, louvain, leiden
taxa_counts[1:5, 1:5]
##
## Adventitial Fibroblast Airway Smooth Muscle
## 1-M-62 20 1
## 1-M-63 9 1
## 18-F-50 2 1
## 18-F-51 23 8
## 18-M-52 16 1
##
## Alveolar Epithelial Type 2 Alveolar Fibroblast Alveolar Macrophage
## 1-M-62 7 35 148
## 1-M-63 0 52 369
## 18-F-50 0 4 15
## 18-F-51 4 75 88
## 18-M-52 20 36 49
Extract sample-level metadata from a dataframe. It ensures that each metadata column contains unique values per sample. If a metadata column contains non-unique values within any sample, that column is excluded from the output, and the user is notified via a warning.
meta_df <- setaMetadata(
df,
sample_col = "mouse.id",
meta_cols = c("age", "sex"))
meta_df[1:5, ]
## sample_id age sex
## 1 18-F-50 18m female
## 2 18-F-51 18m female
## 3 18-M-52 18m male
## 4 18-M-53 18m male
## 5 21-F-54 21m female
Normalize the data using Centered Log Ratio transformation (CLR)
clr_transformed <- setaTransform(taxa_counts, method = "CLR")
Compute distance in the CLR space
In compositional analysis, the Euclidean distance is the preferred metric once the bounding problem is solved, in other words, once we transform the data. John Aitchison, who wrote “The Statistical Analysis of Compositional Data” preferred the Euclidean distance of CLR transformed data, and this distance is now referred to as “Aitchison Distance”
dist_df <- setaDistances(clr_transformed)
# Merge metadata
merged_dist <- dist_df |>
left_join(meta_df, by = c("from" = "sample_id")) |>
left_join(meta_df, by = c("to" = "sample_id"), suffix = c(".from", ".to"))
# Create a age-age category for comparison
merged_dist$age_pair <- paste(merged_dist$age.from,
merged_dist$age.to,
sep = "-")
Visualize distances
ggplot(merged_dist, aes(x = age_pair, y = distance)) +
geom_boxplot(fill = "grey90") +
geom_jitter(width = 0.2, color = "black") +
labs(title = "Aitchison Distances Between Age Groups",
x = "Age Pair",
y = "Aitchison Distance") +
theme_minimal(base_size = 16) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
clr_long <- as.data.frame(clr_transformed$counts)
colnames(clr_long) <- c("sample", "Celltype", "CLR")
clr_long <- clr_long |>
left_join(meta_df, by = c("sample" = "sample_id"))
# Apply pairwise Wilcoxon tests and plot using ggpubr
ggplot(clr_long, aes(x = age, y = CLR,
fill = age, color = age)) +
geom_boxplot(position = position_dodge(0.8), alpha = 0.7) +
geom_jitter(size = 1.5, shape = 21) +
# stat_compare_means(method = "wilcox.test",
# label = "p.signif",
# comparisons = list(c("normal", "influenza"),
# c("normal", "COVID-19"),
# c("influenza", "COVID-19")),
# position = position_dodge(0.8)) +
facet_wrap(~ Celltype) +
theme_minimal(base_size = 12) +
scale_fill_manual(values = age_palette) +
scale_color_manual(values = age_palette) +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5, vjust = 1)) +
labs(title = "CLR by Celltype and Age",
x = "Age",
y = "CLR-transformed Composition")
clr_df <- clr_transformed$counts |>
data.frame() |> # as.data.frame converts it to long form
pivot_wider(names_from='Var2',
values_from = "Freq") |>
rename(`mouse.id` = Var1)
clr_metadata <- clr_df |>
left_join(meta_df, by = c("mouse.id" = "sample_id"))
clr_data <- clr_metadata |> select(where(is.numeric))
# One-hot encode 'age' and clean column names
oh <- model.matrix(~age - 1, data = clr_metadata)
colnames(oh) <- sub("^age", "", colnames(oh))
rownames(clr_data) <- clr_metadata$mouse.id
## Warning: Setting row names on a tibble is deprecated.
# Combine CLR data and metadata
combined <- cbind(clr_data, oh)
# Compute full correlation matrix
full_cor_mat <- cor(combined, method = "pearson")
p_mat <- cor.mtest(full_cor_mat)$p
Visualize correlations
corrplot(full_cor_mat,
method = "circle",
type = "full",
addrect = 4,
col = c_palette,
p.mat = p_mat,
sig.level = c(.001, .01, .05),
insig = "label_sig",
tl.cex = 0.8,
pch.cex = 1.5,
tl.col = "black",
order = "hclust",
diag = FALSE)
set.seed(687)
train_df <- clr_metadata |>
select(-`mouse.id`) |>
mutate(age = factor(age))
train_control <- trainControl(method = "cv", number = 5)
model <- train(age ~ ., data = subset(train_df),
method = "glmnet",
trControl = train_control)
importance <- varImp(model)
plot(importance,
main = "Cross-Validated Variable Importance",
sub = "Caret GLMnet model"
)
Compositional data provides an intuitive basis for making sample-level comparisons in scRNAseq data, as it provides a single sample x feature space within which to make comparisons. Let us know if you have more ideas for using CoDA!
sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] rhdf5_2.53.6 TabulaMurisSenisData_1.15.1
## [3] caret_7.0-1 lattice_0.22-7
## [5] corrplot_0.95 tidyr_1.3.1
## [7] dplyr_1.1.4 ggplot2_4.0.0
## [9] SETA_0.99.7 SingleCellExperiment_1.31.1
## [11] SummarizedExperiment_1.39.2 Biobase_2.69.1
## [13] GenomicRanges_1.61.6 Seqinfo_0.99.3
## [15] IRanges_2.43.5 S4Vectors_0.47.4
## [17] BiocGenerics_0.55.4 generics_0.1.4
## [19] MatrixGenerics_1.21.0 matrixStats_1.5.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 pROC_1.19.0.1 httr2_1.2.1
## [4] rlang_1.1.6 magrittr_2.0.4 e1071_1.7-16
## [7] compiler_4.5.1 RSQLite_2.4.3 gdata_3.0.1
## [10] png_0.1-8 vctrs_0.6.5 reshape2_1.4.4
## [13] stringr_1.5.2 pkgconfig_2.0.3 shape_1.4.6.1
## [16] crayon_1.5.3 fastmap_1.2.0 dbplyr_2.5.1
## [19] XVector_0.49.1 labeling_0.4.3 rmarkdown_2.30
## [22] prodlim_2025.04.28 bit_4.6.0 purrr_1.1.0
## [25] xfun_0.53 glmnet_4.1-10 cachem_1.1.0
## [28] jsonlite_2.0.0 blob_1.2.4 recipes_1.3.1
## [31] rhdf5filters_1.21.4 DelayedArray_0.35.3 Rhdf5lib_1.31.1
## [34] parallel_4.5.1 R6_2.6.1 bslib_0.9.0
## [37] stringi_1.8.7 RColorBrewer_1.1-3 parallelly_1.45.1
## [40] rpart_4.1.24 lubridate_1.9.4 jquerylib_0.1.4
## [43] Rcpp_1.1.0 iterators_1.0.14 knitr_1.50
## [46] future.apply_1.20.0 Matrix_1.7-4 splines_4.5.1
## [49] nnet_7.3-20 igraph_2.2.0 timechange_0.3.0
## [52] tidyselect_1.2.1 dichromat_2.0-0.1 abind_1.4-8
## [55] yaml_2.3.10 timeDate_4051.111 codetools_0.2-20
## [58] curl_7.0.0 listenv_0.9.1 tibble_3.3.0
## [61] plyr_1.8.9 KEGGREST_1.49.2 withr_3.0.2
## [64] S7_0.2.0 evaluate_1.0.5 future_1.67.0
## [67] survival_3.8-3 proxy_0.4-27 BiocFileCache_2.99.6
## [70] ExperimentHub_2.99.6 Biostrings_2.77.2 filelock_1.0.3
## [73] pillar_1.11.1 BiocManager_1.30.26 foreach_1.5.2
## [76] BiocVersion_3.22.0 scales_1.4.0 gtools_3.9.5
## [79] BiocStyle_2.37.1 globals_0.18.0 class_7.3-23
## [82] glue_1.8.0 tools_4.5.1 AnnotationHub_3.99.6
## [85] data.table_1.17.8 ModelMetrics_1.2.2.2 gower_1.0.2
## [88] tidygraph_1.3.1 grid_4.5.1 AnnotationDbi_1.71.2
## [91] ipred_0.9-15 nlme_3.1-168 HDF5Array_1.37.0
## [94] cli_3.6.5 rappdirs_0.3.3 S4Arrays_1.9.1
## [97] lava_1.8.1 gtable_0.3.6 sass_0.4.10
## [100] digest_0.6.37 SparseArray_1.9.1 farver_2.1.2
## [103] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
## [106] h5mread_1.1.1 httr_1.4.7 hardhat_1.4.2
## [109] bit64_4.6.0-1 MASS_7.3-65