Multi-Resolution Compositional Analysis in scRNA-seq: Reference Frames with SETA

Introduction

SETA provides a set of functions for compositional analysis of single-cell RNA-seq data. In earlier vignettes, we introduced the basics of compositional transforms (setaCLR, setaALR, etc.) and distance/metadata handling.

In this vignette, we demonstrate how to create and utilize multi-resolution annotations, here referred to as reference frames, for hierarchical or subcompositional analysis. Our outline:

Extract a taxonomic data frame of differing cell-type resolutions
Visualize the taxonomy as a tree using ggraph
Transform counts data with a chosen taxonomic reference frame
Compare PCA latent spaces based on different reference frames

This example works with the Tabula Muris Senis lung dataset which can be downloaded directly with bioconductor It contains lung single-cell profiles from 14 donor mice (mouse_id). Each mouse is assigned to one of five age groups, from young adult to old, recorded in the age column. Standard cell-type labels (free_annotation) and additional metadata (e.g. sex) are also included.

Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SETA")

library(SingleCellExperiment)
library(SETA)
library(ggplot2)
library(ggraph)
library(tidygraph)
library(dplyr)
library(ggplot2)
library(patchwork)
library(TabulaMurisSenisData)

# Set up a color palette for plots
# 9-group distinct categoricals
d_palette <- c("#3B9AB2", "#6BAFAF", "#9BBDAC", "#C9C171",
                "#EBCC2A", "#E6A80F", "#E98903", "#F84C00", "#F21A00",
                "#B10026", "#67000D")

Load and prepare data

## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation

## loading from cache

## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation

## loading from cache

## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation

## loading from cache

## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation

## loading from cache

## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation

## loading from cache

## see ?TabulaMurisSenisData and browseVignettes('TabulaMurisSenisData') for documentation

## loading from cache

Quick data exploration:

table(sce$free_annotation, sce$age)

##                                    
##                                       1m   3m  18m  21m  30m
##   Adventitial Fibroblast              29   17   56   43   23
##   Airway Smooth Muscle                 2    0   10    0    0
##   Alveolar Epithelial Type 2           7    2   35   18   11
##   Alveolar Fibroblast                 87   37  166  101   26
##   Alveolar Macrophage                517  116  193   91  276
##   Artery                              13   16   32   26   11
##   B                                   98  135  459  198  180
##   Basophil                             9    8   19   46   40
##   CD4+ T                             143  122   72  107  106
##   CD8+ T                              38   70  234  335  193
##   Capillary                          194  119  617  522  343
##   Capillary Aerocyte                 125   35  129  148   93
##   Ccr7+ Dendritic                      2    2    2    5    7
##   Ciliated                            10    0    9    4    3
##   Classical Monocyte                 485  111  968  242 3615
##   Club                                 2    0    4    5    3
##   Intermediate Monocyte              111   29   95   19 1495
##   Interstitial Macrophage              7   13   16   21 1098
##   Ly6g5b+ T                            0    0   53   65   42
##   Lympatic                             6    2   11   12   10
##   Myeloid Dendritic Type 1            38   20   31   26   44
##   Myeloid Dendritic Type 2             8   10   17   16   26
##   Myofibroblast                       23   13    9    9    0
##   Natural Killer                     140  201  153  224  120
##   Natural Killer T                    20    7  152   50  191
##   Neuroendocrine                       0    0    0    0    0
##   Neutrophil                          64   22  413   29   22
##   Nonclassical Monocyte              157  160  187  194  287
##   Pericyte                            11    7    8    1    8
##   Plasma                               3    1    5   16   23
##   Plasmacytoid Dendritic              21    8   15   10   14
##   Proliferating Alveolar Macrophage   60    9    7    0   14
##   Proliferating Classical Monocyte     1    0   13    0 2473
##   Proliferating Dendritic              4    3    8    4    8
##   Proliferating NK                     5    2    5    3    9
##   Proliferating T                     15    9   11   18   36
##   Regulatory T                         8   11   28   20   55
##   Vein                                19   28   93  107   73
##   Zbtb32+ B                           26   18   41  216  115

Creating a Taxonomic Data Frame

Below, we define a custom lineage column that lumps certain cell types together:

All Monocyte variants → Monocytes

All CD4/CD8 variants → T cells

All B cell variants → B cells

Everything else remains as is.

# We'll make a small mapping from original free_annotation -> broader 'Lineage'
lineage_map <- c(
    "Adventitial Fibroblast" = "Fibroblast",
    "Airway Smooth Muscle" = "SMC",
    "Alveolar Epithelial Type 2" = "Epithelial",
    "Alveolar Fibroblast" = "Fibroblast",
    "Alveolar Macrophage" = "Myeloid",
    "Artery" = "Endothelial",
    "B" = "B cell",
    "Basophil" = "Granulocyte",
    "Ccr7+ Dendritic" = "Myeloid",
    "CD4+ T" = "T cell",
    "CD8+ T" = "T cell",
    "Capillary" = "Endothelial",
    "Capillary Aerocyte" = "Endothelial",
    "Ciliated" = "Epithelial",
    "Classical Monocyte" = "Myeloid",
    "Club" = "Epithelial",
    "Intermediate Monocyte" = "Myeloid",
    "Interstitial Macrophage" = "Myeloid",
    "Lympatic" = "Endothelial",
    "Ly6g5b+ T" = "T cell",
    "Myeloid Dendritic Type 1" = "Myeloid",
    "Myeloid Dendritic Type 2" = "Myeloid",
    "Myofibroblast" = "Fibroblast",
    "Natural Killer" = "NK",
    "Natural Killer T" = "NK",
    "Neuroendocrine" = "Neuroendocrine",
    "Neutrophil" = "Granulocyte",
    "Nonclassical Monocyte" = "Myeloid",
    "Pericyte" = "Endothelial",
    "Plasma" = "Plasma Cell",
    "Plasmacytoid Dendritic" = "Myeloid",
    "Proliferating Alveolar Macrophage" = "Proliferating",
    "Proliferating Classical Monocyte" = "Proliferating",
    "Proliferating Dendritic" = "Proliferating",
    "Proliferating NK" = "Proliferating",
    "Proliferating T" = "Proliferating",
    "Regulatory T" = "T cell",
    "Vein" = "Endothelial",
    "Zbtb32+ B" = "B cell"
)

# Add a new column in  named 'Lineage'
free_annotations <- as.character(sce$free_annotation)
sce$Lineage <- lineage_map[free_annotations]

Now we can build a taxonomic data frame that has two levels: the fine label (free_annotation) and the broad label (Lineage).

# In a Seurat Object, barcodes are in the rownames of the metadata.
# Similar to setaCounts, taxonomy DF creation can use rownames as bc.
# Then we ensure we have columns for 'free_annotation' and 'Lineage'.
print(head(sce$free_annotation))

## [1] B                     Nonclassical Monocyte Natural Killer       
## [4] Alveolar Macrophage   B                     Nonclassical Monocyte
## 39 Levels: Adventitial Fibroblast ... Zbtb32+ B

print(head(sce$Lineage))

##                     B Nonclassical Monocyte        Natural Killer 
##              "B cell"             "Myeloid"                  "NK" 
##   Alveolar Macrophage                     B Nonclassical Monocyte 
##             "Myeloid"              "B cell"             "Myeloid"

# We want a data frame that includes bc + 'free_annotation' + 'Lineage'
# Then we'll pass it to setaTaxonomyDF
tax_df <- setaTaxonomyDF(
    obj = data.frame(colData(sce)),
    resolution_cols = c("Lineage", "free_annotation"),
    bc_col = "cell"
)

## Converting factor columns to character: age, cell_ontology_class, cell_ontology_id, free_annotation, method, sex, subtissue, tissue, tissue_free_annotation, louvain, leiden

# Each row is a unique 'free_annotation'.
# The last column 'free_annotation' is the finer label.
# cols must be entered in order of increasing resolution (broadest to finest)
tax_df

##                                         Lineage
## B                                        B cell
## Nonclassical Monocyte                   Myeloid
## Natural Killer                               NK
## Alveolar Macrophage                     Myeloid
## Myeloid Dendritic Type 2                Myeloid
## CD8+ T                                   T cell
## Classical Monocyte                      Myeloid
## Capillary Aerocyte                  Endothelial
## Capillary                           Endothelial
## Natural Killer T                             NK
## Proliferating T                   Proliferating
## Myofibroblast                        Fibroblast
## Intermediate Monocyte                   Myeloid
## Neutrophil                          Granulocyte
## Adventitial Fibroblast               Fibroblast
## Ly6g5b+ T                                T cell
## CD4+ T                                   T cell
## Vein                                Endothelial
## Zbtb32+ B                                B cell
## Myeloid Dendritic Type 1                Myeloid
## Club                                 Epithelial
## Basophil                            Granulocyte
## Alveolar Fibroblast                  Fibroblast
## Regulatory T                             T cell
## Proliferating Dendritic           Proliferating
## Artery                              Endothelial
## Pericyte                            Endothelial
## Airway Smooth Muscle                        SMC
## Plasma                              Plasma Cell
## Alveolar Epithelial Type 2           Epithelial
## Proliferating Alveolar Macrophage Proliferating
## Ciliated                             Epithelial
## Plasmacytoid Dendritic                  Myeloid
## Proliferating Classical Monocyte  Proliferating
## Lympatic                            Endothelial
## Proliferating NK                  Proliferating
## Interstitial Macrophage                 Myeloid
## Ccr7+ Dendritic                         Myeloid
##                                                     free_annotation
## B                                                                 B
## Nonclassical Monocyte                         Nonclassical Monocyte
## Natural Killer                                       Natural Killer
## Alveolar Macrophage                             Alveolar Macrophage
## Myeloid Dendritic Type 2                   Myeloid Dendritic Type 2
## CD8+ T                                                       CD8+ T
## Classical Monocyte                               Classical Monocyte
## Capillary Aerocyte                               Capillary Aerocyte
## Capillary                                                 Capillary
## Natural Killer T                                   Natural Killer T
## Proliferating T                                     Proliferating T
## Myofibroblast                                         Myofibroblast
## Intermediate Monocyte                         Intermediate Monocyte
## Neutrophil                                               Neutrophil
## Adventitial Fibroblast                       Adventitial Fibroblast
## Ly6g5b+ T                                                 Ly6g5b+ T
## CD4+ T                                                       CD4+ T
## Vein                                                           Vein
## Zbtb32+ B                                                 Zbtb32+ B
## Myeloid Dendritic Type 1                   Myeloid Dendritic Type 1
## Club                                                           Club
## Basophil                                                   Basophil
## Alveolar Fibroblast                             Alveolar Fibroblast
## Regulatory T                                           Regulatory T
## Proliferating Dendritic                     Proliferating Dendritic
## Artery                                                       Artery
## Pericyte                                                   Pericyte
## Airway Smooth Muscle                           Airway Smooth Muscle
## Plasma                                                       Plasma
## Alveolar Epithelial Type 2               Alveolar Epithelial Type 2
## Proliferating Alveolar Macrophage Proliferating Alveolar Macrophage
## Ciliated                                                   Ciliated
## Plasmacytoid Dendritic                       Plasmacytoid Dendritic
## Proliferating Classical Monocyte   Proliferating Classical Monocyte
## Lympatic                                                   Lympatic
## Proliferating NK                                   Proliferating NK
## Interstitial Macrophage                     Interstitial Macrophage
## Ccr7+ Dendritic                                     Ccr7+ Dendritic

Visualize the Taxonomy as a Tree via ggraph

We now have a 2-level taxonomy: free_annotation (fine) -> Lineage (coarse). For demonstration, let’s build a small graph that connects each free_annotation node to its Lineage node. We’ll color edges by the lineage label:

# Create a tidygraph object using SETA built-in utils

tbl_g <- taxonomy_to_tbl_graph(
    tax_df,
    columns = c("Lineage", "free_annotation"))

# Plot with ggraph
ggraph(tbl_g, layout = "dendrogram") +
    geom_edge_diagonal2(aes(color = node.Lineage)) +
    geom_node_text(aes(filter = leaf,
                        label = free_annotation, color = Lineage),
                   nudge_y = 0.1, vjust = 0.5, hjust = 0, size = 5) +
    geom_node_text(aes(filter = !leaf,
                        label = Lineage),
                    color = 'black',
                    size = 5,
                    repel = TRUE) +
    guides(edge_colour = guide_legend(title = "Lineage"),
            color = guide_legend(title = "Lineage")) +
    theme_linedraw(base_size = 16) +
    scale_edge_color_manual(values = d_palette) +
    scale_color_manual(values = d_palette) +
    scale_y_reverse(breaks = seq(0, 2, by = 1),
                    labels = c("Type", "Lineage", "Root")) +
    theme(axis.text.y = element_blank()) +
    expand_limits(y = -5) +
    coord_flip() +
    ggtitle("SETA: Taxonomy")

Here, each Lineage is a parent node, and each free_annotation is a leaf node. The edges are colored by the lineage name.

Taxonomic Balances with SETA

Balances can be calculated between clades and species among the taxonomic tree. Currently, SETA supports user-specified balances with the transform “balance”, where balances are calculated as log-ratios between these clades or species. The choice of numerator and denominator groups determine which direction the resulting balance points. If the composition is weighted towards the numerator, the log-ratio will be positive, and negative if towards the denominator.

# Create sample x free_annotation counts:
taxa_counts <- setaCounts(
    data.frame(colData(sce)),
    cell_type_col = "free_annotation",
    sample_col = "mouse.id",
    bc_col = "rownames"
)

## Converting factor class columns to character: age, cell_ontology_class, cell_ontology_id, free_annotation, method, sex, subtissue, tissue, tissue_free_annotation, louvain, leiden

# Create SETA sample-level metadata
meta_df <- setaMetadata(data.frame(colData(sce)), sample_col = "mouse.id",
                        meta_cols = c("age", "sex"))


bal_out <- setaTransform(
    counts     = taxa_counts,
    method     = "balance",
    balances   = list(
        epi_vs_myeloid = list(
            num   = "Epithelial",
            denom = "Myeloid"
        )
    ),
    taxonomyDF   = tax_df,
    taxonomy_col = "Lineage"
)

# merge the balances with sample-level metadata
bal_df <- data.frame(
    sample_id       = rownames(bal_out$counts),
    epi_vs_myeloid  = as.numeric(bal_out$counts[, "epi_vs_myeloid"]),
    stringsAsFactors = FALSE
) |>
    dplyr::left_join(meta_df, by = "sample_id")

# Compare age groups by epithelial to myeloid ratios
library(ggplot2)
ggplot(bal_df, aes(x = age, y = epi_vs_myeloid, fill = age)) +
    geom_boxplot(outlier.shape = NA, alpha = 0.4) +
    geom_jitter(width = 0.15, size = 2) +
    scale_fill_manual(values = d_palette) +
    labs(title = "Epithelial / Myeloid balance across age groups",
        y = "log GM(Epithelial) - log GM(Myeloid)",
        x = "Age (months)") +
    theme_minimal(base_size = 14) +
    theme(legend.position = "none")

The y-axis shows the log-ratio of the geometric means of Epithelial versus Myeloid counts for each mouse. Positive values indicate epithelial enrichment relative to myeloid cells, while negative values indicate the opposite. Because this is a standard log-ratio, the statistic is fully compositional and can be compared directly across groups.

Transform Counts with a Taxonomic Reference Frame

We can now use setaCounts to get the sample × free_annotation matrix, then apply a within-lineage transform. This lumps certain columns (cell types) into subcompositions. For example, a CLR transform within each lineage:

# We'll transform them with a 'Lineage' grouping.
# The taxonomyDF uses rownames = free_annotation
# so that colnames(taxa_counts) align with rownames(tax_df).
refframe_out <- setaTransform(
    counts            = taxa_counts,
    method            = "CLR",
    taxonomyDF        = tax_df,
    taxonomy_col      = "Lineage",
    within_resolution = TRUE
)

refframe_out$within_resolution

## [1] TRUE

refframe_out$grouping_var

## [1] "Lineage"

# Compare to a standard global CLR transform:
global_out <- setaTransform(taxa_counts, method = "CLR")

Visualize Latent Spaces With Different Reference Frames

Below we show how to do PCA on the sample-level coordinates from either the global CLR or the lineage-based CLR. We color points by age to see if the subcompositional approach changes clustering.

# A) Global CLR
latent_global <- setaLatent(global_out, method = "PCA", dims = 2)
pca_global <- latent_global$latentSpace
pca_global$sample_id <- rownames(pca_global)

# B) Within-Lineage CLR
latent_lineage <- setaLatent(refframe_out, method = "PCA", dims = 2)
pca_lineage <- latent_lineage$latentSpace
pca_lineage$sample_id <- rownames(pca_lineage)

# Merge with metadata
pca_global <- left_join(pca_global, meta_df)

## Joining with `by = join_by(sample_id)`

pca_lineage <- left_join(pca_lineage, meta_df)

## Joining with `by = join_by(sample_id)`

# Plot side by side
p1 <- ggplot(pca_global, aes(x = PC1, y = PC2, color = age)) +
    geom_text(aes(label = sample_id)) +
    scale_color_manual(
        values = d_palette
    ) +
    labs(title = "Global CLR PCA",
         x = "PC1", y = "PC2", color = "Age (Months)") +
    theme_minimal() +
    xlab(sprintf("PC1 (%s%%)",
                signif(latent_global$varExplained[1], 4) * 100)) +
    ylab(sprintf("PC2 (%s%%)",
                signif(latent_global$varExplained[2], 4) * 100)) +
    theme(legend.position = 'none')

p2 <- ggplot(pca_lineage, aes(x = PC1, y = PC2, color = age)) +
    geom_text(aes(label = sample_id)) +
    scale_color_manual(
        values = d_palette
    ) +
    labs(title = "Within-Lineage CLR PCA",
         x = "PC1", y = "PC2", color = "Age (Months)") +
    theme_minimal() +
    xlab(sprintf("PC1 (%s%%)",
                signif(latent_lineage$varExplained[1], 4) * 100)) +
    ylab(sprintf("PC2 (%s%%)",
                signif(latent_lineage$varExplained[2], 4) * 100))


p1 + p2

sessionInfo()

## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.3.2             tidygraph_1.3.1            
##  [3] ggraph_2.2.2                reshape2_1.4.4             
##  [5] rhdf5_2.53.6                TabulaMurisSenisData_1.15.1
##  [7] caret_7.0-1                 lattice_0.22-7             
##  [9] corrplot_0.95               tidyr_1.3.1                
## [11] dplyr_1.1.4                 ggplot2_4.0.0              
## [13] SETA_0.99.7                 SingleCellExperiment_1.31.1
## [15] SummarizedExperiment_1.39.2 Biobase_2.69.1             
## [17] GenomicRanges_1.61.6        Seqinfo_0.99.3             
## [19] IRanges_2.43.5              S4Vectors_0.47.4           
## [21] BiocGenerics_0.55.4         generics_0.1.4             
## [23] MatrixGenerics_1.21.0       matrixStats_1.5.0          
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3   jsonlite_2.0.0       shape_1.4.6.1       
##   [4] magrittr_2.0.4       farver_2.1.2         rmarkdown_2.30      
##   [7] vctrs_0.6.5          memoise_2.0.1        htmltools_0.5.8.1   
##  [10] S4Arrays_1.9.1       AnnotationHub_3.99.6 curl_7.0.0          
##  [13] Rhdf5lib_1.31.1      SparseArray_1.9.1    pROC_1.19.0.1       
##  [16] sass_0.4.10          parallelly_1.45.1    bslib_0.9.0         
##  [19] plyr_1.8.9           httr2_1.2.1          lubridate_1.9.4     
##  [22] cachem_1.1.0         igraph_2.2.0         lifecycle_1.0.4     
##  [25] iterators_1.0.14     pkgconfig_2.0.3      Matrix_1.7-4        
##  [28] R6_2.6.1             fastmap_1.2.0        future_1.67.0       
##  [31] digest_0.6.37        AnnotationDbi_1.71.2 ExperimentHub_2.99.6
##  [34] RSQLite_2.4.3        filelock_1.0.3       labeling_0.4.3      
##  [37] timechange_0.3.0     gdata_3.0.1          polyclip_1.10-7     
##  [40] httr_1.4.7           abind_1.4-8          compiler_4.5.1      
##  [43] proxy_0.4-27         bit64_4.6.0-1        withr_3.0.2         
##  [46] S7_0.2.0             viridis_0.6.5        DBI_1.2.3           
##  [49] ggforce_0.5.0        HDF5Array_1.37.0     MASS_7.3-65         
##  [52] lava_1.8.1           rappdirs_0.3.3       DelayedArray_0.35.3 
##  [55] gtools_3.9.5         ModelMetrics_1.2.2.2 tools_4.5.1         
##  [58] future.apply_1.20.0  nnet_7.3-20          glue_1.8.0          
##  [61] h5mread_1.1.1        nlme_3.1-168         rhdf5filters_1.21.4 
##  [64] grid_4.5.1           recipes_1.3.1        gtable_0.3.6        
##  [67] class_7.3-23         data.table_1.17.8    XVector_0.49.1      
##  [70] ggrepel_0.9.6        BiocVersion_3.22.0   foreach_1.5.2       
##  [73] pillar_1.11.1        stringr_1.5.2        splines_4.5.1       
##  [76] tweenr_2.0.3         BiocFileCache_2.99.6 survival_3.8-3      
##  [79] bit_4.6.0            tidyselect_1.2.1     Biostrings_2.77.2   
##  [82] knitr_1.50           gridExtra_2.3        xfun_0.53           
##  [85] graphlayouts_1.2.2   hardhat_1.4.2        timeDate_4051.111   
##  [88] stringi_1.8.7        yaml_2.3.10          evaluate_1.0.5      
##  [91] codetools_0.2-20     tibble_3.3.0         BiocManager_1.30.26 
##  [94] cli_3.6.5            rpart_4.1.24         jquerylib_0.1.4     
##  [97] dichromat_2.0-0.1    Rcpp_1.1.0           globals_0.18.0      
## [100] dbplyr_2.5.1         png_0.1-8            parallel_4.5.1      
## [103] gower_1.0.2          blob_1.2.4           listenv_0.9.1       
## [106] glmnet_4.1-10        viridisLite_0.4.2    ipred_0.9-15        
## [109] scales_1.4.0         prodlim_2025.04.28   e1071_1.7-16        
## [112] purrr_1.1.0          crayon_1.5.3         BiocStyle_2.37.1    
## [115] rlang_1.1.6          KEGGREST_1.49.2