SETA provides a set of functions for compositional analysis of single-cell RNA-seq data. In earlier vignettes, we introduced the basics of compositional transforms (setaCLR, setaALR, etc.) and distance/metadata handling.
In this vignette, we demonstrate how to create and utilize multi-resolution annotations, here referred to as reference frames, for hierarchical or subcompositional analysis. Our outline:
ggraphThis example works with the Tabula Muris Senis lung dataset which can be downloaded directly with bioconductor It contains lung single-cell profiles from 14 donor mice (mouse_id). Each mouse is assigned to one of five age groups, from young adult to old, recorded in the age column. Standard cell-type labels (free_annotation) and additional metadata (e.g. sex) are also included.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SETA")
library(SingleCellExperiment)
library(SETA)
library(ggplot2)
library(ggraph)
library(tidygraph)
library(dplyr)
library(ggplot2)
library(patchwork)
library(TabulaMurisSenisData)
# Set up a color palette for plots
# 9-group distinct categoricals
d_palette <- c("#3B9AB2", "#6BAFAF", "#9BBDAC", "#C9C171",
"#EBCC2A", "#E6A80F", "#E98903", "#F84C00", "#F21A00",
"#B10026", "#67000D")
## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation
## loading from cache
## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation
## loading from cache
## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation
## loading from cache
## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation
## loading from cache
## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation
## loading from cache
## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation
## loading from cache
Quick data exploration:
table(sce$free_annotation, sce$age)
##
## 1m 3m 18m 21m 30m
## Adventitial Fibroblast 29 17 56 43 23
## Airway Smooth Muscle 2 0 10 0 0
## Alveolar Epithelial Type 2 7 2 35 18 11
## Alveolar Fibroblast 87 37 166 101 26
## Alveolar Macrophage 517 116 193 91 276
## Artery 13 16 32 26 11
## B 98 135 459 198 180
## Basophil 9 8 19 46 40
## CD4+ T 143 122 72 107 106
## CD8+ T 38 70 234 335 193
## Capillary 194 119 617 522 343
## Capillary Aerocyte 125 35 129 148 93
## Ccr7+ Dendritic 2 2 2 5 7
## Ciliated 10 0 9 4 3
## Classical Monocyte 485 111 968 242 3615
## Club 2 0 4 5 3
## Intermediate Monocyte 111 29 95 19 1495
## Interstitial Macrophage 7 13 16 21 1098
## Ly6g5b+ T 0 0 53 65 42
## Lympatic 6 2 11 12 10
## Myeloid Dendritic Type 1 38 20 31 26 44
## Myeloid Dendritic Type 2 8 10 17 16 26
## Myofibroblast 23 13 9 9 0
## Natural Killer 140 201 153 224 120
## Natural Killer T 20 7 152 50 191
## Neuroendocrine 0 0 0 0 0
## Neutrophil 64 22 413 29 22
## Nonclassical Monocyte 157 160 187 194 287
## Pericyte 11 7 8 1 8
## Plasma 3 1 5 16 23
## Plasmacytoid Dendritic 21 8 15 10 14
## Proliferating Alveolar Macrophage 60 9 7 0 14
## Proliferating Classical Monocyte 1 0 13 0 2473
## Proliferating Dendritic 4 3 8 4 8
## Proliferating NK 5 2 5 3 9
## Proliferating T 15 9 11 18 36
## Regulatory T 8 11 28 20 55
## Vein 19 28 93 107 73
## Zbtb32+ B 26 18 41 216 115
Below, we define a custom lineage column that lumps certain cell types together:
All Monocyte variants → Monocytes
All CD4/CD8 variants → T cells
All B cell variants → B cells
Everything else remains as is.
# We'll make a small mapping from original free_annotation -> broader 'Lineage'
lineage_map <- c(
"Adventitial Fibroblast" = "Fibroblast",
"Airway Smooth Muscle" = "SMC",
"Alveolar Epithelial Type 2" = "Epithelial",
"Alveolar Fibroblast" = "Fibroblast",
"Alveolar Macrophage" = "Myeloid",
"Artery" = "Endothelial",
"B" = "B cell",
"Basophil" = "Granulocyte",
"Ccr7+ Dendritic" = "Myeloid",
"CD4+ T" = "T cell",
"CD8+ T" = "T cell",
"Capillary" = "Endothelial",
"Capillary Aerocyte" = "Endothelial",
"Ciliated" = "Epithelial",
"Classical Monocyte" = "Myeloid",
"Club" = "Epithelial",
"Intermediate Monocyte" = "Myeloid",
"Interstitial Macrophage" = "Myeloid",
"Lympatic" = "Endothelial",
"Ly6g5b+ T" = "T cell",
"Myeloid Dendritic Type 1" = "Myeloid",
"Myeloid Dendritic Type 2" = "Myeloid",
"Myofibroblast" = "Fibroblast",
"Natural Killer" = "NK",
"Natural Killer T" = "NK",
"Neuroendocrine" = "Neuroendocrine",
"Neutrophil" = "Granulocyte",
"Nonclassical Monocyte" = "Myeloid",
"Pericyte" = "Endothelial",
"Plasma" = "Plasma Cell",
"Plasmacytoid Dendritic" = "Myeloid",
"Proliferating Alveolar Macrophage" = "Proliferating",
"Proliferating Classical Monocyte" = "Proliferating",
"Proliferating Dendritic" = "Proliferating",
"Proliferating NK" = "Proliferating",
"Proliferating T" = "Proliferating",
"Regulatory T" = "T cell",
"Vein" = "Endothelial",
"Zbtb32+ B" = "B cell"
)
# Add a new column in named 'Lineage'
free_annotations <- as.character(sce$free_annotation)
sce$Lineage <- lineage_map[free_annotations]
Now we can build a taxonomic data frame that has two levels: the fine label (free_annotation) and the broad label (Lineage).
# In a Seurat Object, barcodes are in the rownames of the metadata.
# Similar to setaCounts, taxonomy DF creation can use rownames as bc.
# Then we ensure we have columns for 'free_annotation' and 'Lineage'.
print(head(sce$free_annotation))
## [1] B Nonclassical Monocyte Natural Killer
## [4] Alveolar Macrophage B Nonclassical Monocyte
## 39 Levels: Adventitial Fibroblast ... Zbtb32+ B
print(head(sce$Lineage))
## B Nonclassical Monocyte Natural Killer
## "B cell" "Myeloid" "NK"
## Alveolar Macrophage B Nonclassical Monocyte
## "Myeloid" "B cell" "Myeloid"
# We want a data frame that includes bc + 'free_annotation' + 'Lineage'
# Then we'll pass it to setaTaxonomyDF
tax_df <- setaTaxonomyDF(
obj = data.frame(colData(sce)),
resolution_cols = c("Lineage", "free_annotation"),
bc_col = "cell"
)
## Converting factor columns to character: age, cell_ontology_class, cell_ontology_id, free_annotation, method, sex, subtissue, tissue, tissue_free_annotation, louvain, leiden
# Each row is a unique 'free_annotation'.
# The last column 'free_annotation' is the finer label.
# cols must be entered in order of increasing resolution (broadest to finest)
tax_df
## Lineage
## B B cell
## Nonclassical Monocyte Myeloid
## Natural Killer NK
## Alveolar Macrophage Myeloid
## Myeloid Dendritic Type 2 Myeloid
## CD8+ T T cell
## Classical Monocyte Myeloid
## Capillary Aerocyte Endothelial
## Capillary Endothelial
## Natural Killer T NK
## Proliferating T Proliferating
## Myofibroblast Fibroblast
## Intermediate Monocyte Myeloid
## Neutrophil Granulocyte
## Adventitial Fibroblast Fibroblast
## Ly6g5b+ T T cell
## CD4+ T T cell
## Vein Endothelial
## Zbtb32+ B B cell
## Myeloid Dendritic Type 1 Myeloid
## Club Epithelial
## Basophil Granulocyte
## Alveolar Fibroblast Fibroblast
## Regulatory T T cell
## Proliferating Dendritic Proliferating
## Artery Endothelial
## Pericyte Endothelial
## Airway Smooth Muscle SMC
## Plasma Plasma Cell
## Alveolar Epithelial Type 2 Epithelial
## Proliferating Alveolar Macrophage Proliferating
## Ciliated Epithelial
## Plasmacytoid Dendritic Myeloid
## Proliferating Classical Monocyte Proliferating
## Lympatic Endothelial
## Proliferating NK Proliferating
## Interstitial Macrophage Myeloid
## Ccr7+ Dendritic Myeloid
## free_annotation
## B B
## Nonclassical Monocyte Nonclassical Monocyte
## Natural Killer Natural Killer
## Alveolar Macrophage Alveolar Macrophage
## Myeloid Dendritic Type 2 Myeloid Dendritic Type 2
## CD8+ T CD8+ T
## Classical Monocyte Classical Monocyte
## Capillary Aerocyte Capillary Aerocyte
## Capillary Capillary
## Natural Killer T Natural Killer T
## Proliferating T Proliferating T
## Myofibroblast Myofibroblast
## Intermediate Monocyte Intermediate Monocyte
## Neutrophil Neutrophil
## Adventitial Fibroblast Adventitial Fibroblast
## Ly6g5b+ T Ly6g5b+ T
## CD4+ T CD4+ T
## Vein Vein
## Zbtb32+ B Zbtb32+ B
## Myeloid Dendritic Type 1 Myeloid Dendritic Type 1
## Club Club
## Basophil Basophil
## Alveolar Fibroblast Alveolar Fibroblast
## Regulatory T Regulatory T
## Proliferating Dendritic Proliferating Dendritic
## Artery Artery
## Pericyte Pericyte
## Airway Smooth Muscle Airway Smooth Muscle
## Plasma Plasma
## Alveolar Epithelial Type 2 Alveolar Epithelial Type 2
## Proliferating Alveolar Macrophage Proliferating Alveolar Macrophage
## Ciliated Ciliated
## Plasmacytoid Dendritic Plasmacytoid Dendritic
## Proliferating Classical Monocyte Proliferating Classical Monocyte
## Lympatic Lympatic
## Proliferating NK Proliferating NK
## Interstitial Macrophage Interstitial Macrophage
## Ccr7+ Dendritic Ccr7+ Dendritic
We now have a 2-level taxonomy: free_annotation (fine) -> Lineage (coarse). For demonstration, let’s build a small graph that connects each free_annotation node to its Lineage node. We’ll color edges by the lineage label:
# Create a tidygraph object using SETA built-in utils
tbl_g <- taxonomy_to_tbl_graph(
tax_df,
columns = c("Lineage", "free_annotation"))
# Plot with ggraph
ggraph(tbl_g, layout = "dendrogram") +
geom_edge_diagonal2(aes(color = node.Lineage)) +
geom_node_text(aes(filter = leaf,
label = free_annotation, color = Lineage),
nudge_y = 0.1, vjust = 0.5, hjust = 0, size = 5) +
geom_node_text(aes(filter = !leaf,
label = Lineage),
color = 'black',
size = 5,
repel = TRUE) +
guides(edge_colour = guide_legend(title = "Lineage"),
color = guide_legend(title = "Lineage")) +
theme_linedraw(base_size = 16) +
scale_edge_color_manual(values = d_palette) +
scale_color_manual(values = d_palette) +
scale_y_reverse(breaks = seq(0, 2, by = 1),
labels = c("Type", "Lineage", "Root")) +
theme(axis.text.y = element_blank()) +
expand_limits(y = -5) +
coord_flip() +
ggtitle("SETA: Taxonomy")
Here, each Lineage is a parent node, and each free_annotation is a leaf node. The edges are colored by the lineage name.
Balances can be calculated between clades and species among the taxonomic tree. Currently, SETA supports user-specified balances with the transform “balance”, where balances are calculated as log-ratios between these clades or species. The choice of numerator and denominator groups determine which direction the resulting balance points. If the composition is weighted towards the numerator, the log-ratio will be positive, and negative if towards the denominator.
# Create sample x free_annotation counts:
taxa_counts <- setaCounts(
data.frame(colData(sce)),
cell_type_col = "free_annotation",
sample_col = "mouse.id",
bc_col = "rownames"
)
## Converting factor class columns to character: age, cell_ontology_class, cell_ontology_id, free_annotation, method, sex, subtissue, tissue, tissue_free_annotation, louvain, leiden
# Create SETA sample-level metadata
meta_df <- setaMetadata(data.frame(colData(sce)), sample_col = "mouse.id",
meta_cols = c("age", "sex"))
bal_out <- setaTransform(
counts = taxa_counts,
method = "balance",
balances = list(
epi_vs_myeloid = list(
num = "Epithelial",
denom = "Myeloid"
)
),
taxonomyDF = tax_df,
taxonomy_col = "Lineage"
)
# merge the balances with sample-level metadata
bal_df <- data.frame(
sample_id = rownames(bal_out$counts),
epi_vs_myeloid = as.numeric(bal_out$counts[, "epi_vs_myeloid"]),
stringsAsFactors = FALSE
) |>
dplyr::left_join(meta_df, by = "sample_id")
# Compare age groups by epithelial to myeloid ratios
library(ggplot2)
ggplot(bal_df, aes(x = age, y = epi_vs_myeloid, fill = age)) +
geom_boxplot(outlier.shape = NA, alpha = 0.4) +
geom_jitter(width = 0.15, size = 2) +
scale_fill_manual(values = d_palette) +
labs(title = "Epithelial / Myeloid balance across age groups",
y = "log GM(Epithelial) - log GM(Myeloid)",
x = "Age (months)") +
theme_minimal(base_size = 14) +
theme(legend.position = "none")
The y-axis shows the log-ratio of the geometric means of Epithelial versus Myeloid counts for each mouse. Positive values indicate epithelial enrichment relative to myeloid cells, while negative values indicate the opposite. Because this is a standard log-ratio, the statistic is fully compositional and can be compared directly across groups.
We can now use setaCounts to get the sample × free_annotation matrix, then apply a within-lineage transform. This lumps certain columns (cell types) into subcompositions. For example, a CLR transform within each lineage:
# We'll transform them with a 'Lineage' grouping.
# The taxonomyDF uses rownames = free_annotation
# so that colnames(taxa_counts) align with rownames(tax_df).
refframe_out <- setaTransform(
counts = taxa_counts,
method = "CLR",
taxonomyDF = tax_df,
taxonomy_col = "Lineage",
within_resolution = TRUE
)
refframe_out$within_resolution
## [1] TRUE
refframe_out$grouping_var
## [1] "Lineage"
# Compare to a standard global CLR transform:
global_out <- setaTransform(taxa_counts, method = "CLR")
Below we show how to do PCA on the sample-level coordinates from either the global CLR or the lineage-based CLR. We color points by age to see if the subcompositional approach changes clustering.
# A) Global CLR
latent_global <- setaLatent(global_out, method = "PCA", dims = 2)
pca_global <- latent_global$latentSpace
pca_global$sample_id <- rownames(pca_global)
# B) Within-Lineage CLR
latent_lineage <- setaLatent(refframe_out, method = "PCA", dims = 2)
pca_lineage <- latent_lineage$latentSpace
pca_lineage$sample_id <- rownames(pca_lineage)
# Merge with metadata
pca_global <- left_join(pca_global, meta_df)
## Joining with `by = join_by(sample_id)`
pca_lineage <- left_join(pca_lineage, meta_df)
## Joining with `by = join_by(sample_id)`
# Plot side by side
p1 <- ggplot(pca_global, aes(x = PC1, y = PC2, color = age)) +
geom_text(aes(label = sample_id)) +
scale_color_manual(
values = d_palette
) +
labs(title = "Global CLR PCA",
x = "PC1", y = "PC2", color = "Age (Months)") +
theme_minimal() +
xlab(sprintf("PC1 (%s%%)",
signif(latent_global$varExplained[1], 4) * 100)) +
ylab(sprintf("PC2 (%s%%)",
signif(latent_global$varExplained[2], 4) * 100)) +
theme(legend.position = 'none')
p2 <- ggplot(pca_lineage, aes(x = PC1, y = PC2, color = age)) +
geom_text(aes(label = sample_id)) +
scale_color_manual(
values = d_palette
) +
labs(title = "Within-Lineage CLR PCA",
x = "PC1", y = "PC2", color = "Age (Months)") +
theme_minimal() +
xlab(sprintf("PC1 (%s%%)",
signif(latent_lineage$varExplained[1], 4) * 100)) +
ylab(sprintf("PC2 (%s%%)",
signif(latent_lineage$varExplained[2], 4) * 100))
p1 + p2
sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.3.2 tidygraph_1.3.1
## [3] ggraph_2.2.2 reshape2_1.4.4
## [5] rhdf5_2.53.6 TabulaMurisSenisData_1.15.1
## [7] caret_7.0-1 lattice_0.22-7
## [9] corrplot_0.95 tidyr_1.3.1
## [11] dplyr_1.1.4 ggplot2_4.0.0
## [13] SETA_0.99.7 SingleCellExperiment_1.31.1
## [15] SummarizedExperiment_1.39.2 Biobase_2.69.1
## [17] GenomicRanges_1.61.6 Seqinfo_0.99.3
## [19] IRanges_2.43.5 S4Vectors_0.47.4
## [21] BiocGenerics_0.55.4 generics_0.1.4
## [23] MatrixGenerics_1.21.0 matrixStats_1.5.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 shape_1.4.6.1
## [4] magrittr_2.0.4 farver_2.1.2 rmarkdown_2.30
## [7] vctrs_0.6.5 memoise_2.0.1 htmltools_0.5.8.1
## [10] S4Arrays_1.9.1 AnnotationHub_3.99.6 curl_7.0.0
## [13] Rhdf5lib_1.31.1 SparseArray_1.9.1 pROC_1.19.0.1
## [16] sass_0.4.10 parallelly_1.45.1 bslib_0.9.0
## [19] plyr_1.8.9 httr2_1.2.1 lubridate_1.9.4
## [22] cachem_1.1.0 igraph_2.2.0 lifecycle_1.0.4
## [25] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.7-4
## [28] R6_2.6.1 fastmap_1.2.0 future_1.67.0
## [31] digest_0.6.37 AnnotationDbi_1.71.2 ExperimentHub_2.99.6
## [34] RSQLite_2.4.3 filelock_1.0.3 labeling_0.4.3
## [37] timechange_0.3.0 gdata_3.0.1 polyclip_1.10-7
## [40] httr_1.4.7 abind_1.4-8 compiler_4.5.1
## [43] proxy_0.4-27 bit64_4.6.0-1 withr_3.0.2
## [46] S7_0.2.0 viridis_0.6.5 DBI_1.2.3
## [49] ggforce_0.5.0 HDF5Array_1.37.0 MASS_7.3-65
## [52] lava_1.8.1 rappdirs_0.3.3 DelayedArray_0.35.3
## [55] gtools_3.9.5 ModelMetrics_1.2.2.2 tools_4.5.1
## [58] future.apply_1.20.0 nnet_7.3-20 glue_1.8.0
## [61] h5mread_1.1.1 nlme_3.1-168 rhdf5filters_1.21.4
## [64] grid_4.5.1 recipes_1.3.1 gtable_0.3.6
## [67] class_7.3-23 data.table_1.17.8 XVector_0.49.1
## [70] ggrepel_0.9.6 BiocVersion_3.22.0 foreach_1.5.2
## [73] pillar_1.11.1 stringr_1.5.2 splines_4.5.1
## [76] tweenr_2.0.3 BiocFileCache_2.99.6 survival_3.8-3
## [79] bit_4.6.0 tidyselect_1.2.1 Biostrings_2.77.2
## [82] knitr_1.50 gridExtra_2.3 xfun_0.53
## [85] graphlayouts_1.2.2 hardhat_1.4.2 timeDate_4051.111
## [88] stringi_1.8.7 yaml_2.3.10 evaluate_1.0.5
## [91] codetools_0.2-20 tibble_3.3.0 BiocManager_1.30.26
## [94] cli_3.6.5 rpart_4.1.24 jquerylib_0.1.4
## [97] dichromat_2.0-0.1 Rcpp_1.1.0 globals_0.18.0
## [100] dbplyr_2.5.1 png_0.1-8 parallel_4.5.1
## [103] gower_1.0.2 blob_1.2.4 listenv_0.9.1
## [106] glmnet_4.1-10 viridisLite_0.4.2 ipred_0.9-15
## [109] scales_1.4.0 prodlim_2025.04.28 e1071_1.7-16
## [112] purrr_1.1.0 crayon_1.5.3 BiocStyle_2.37.1
## [115] rlang_1.1.6 KEGGREST_1.49.2