# Contents

For details on the concept and technicalities of DS analysis, and the methods presented here, consider having a look at our publication:

Crowell HL, Soneson C*, Germain P-L*, Calini D,
Collin L, Raposo C, Malhotra D, and Robinson MD:
muscat detects subpopulation-specific state transitions from
multi-sample multi-condition single-cell transcriptomics data.
Nature Communications 11, 6077 (2020).
DOI: 10.1038/s41467-020-19894-4

library(dplyr)
library(ggplot2)
library(limma)
library(muscat)
library(purrr)

# 1 Introduction

## 1.1 What is DS analysis?

A fundamental task in the analysis of single-cell RNA-sequencing (scRNA-seq) data is the identification of systematic transcriptional changes (Stegle, Teichmann, and Marioni 2015). Such analyses are a critical step in the understanding of molecular responses, and have applications in development, in perturbation studies or in disease.
Most of the current scRNA-seq differential expression (DE) analysis methods are designed to test one set of cells against another (or more generally, multiple sets together), and can be used to compare cell clusters (e.g., for identifying marker genes) or across conditions (cells from one condition versus another) (Soneson and Robinson 2018). In such statistical models, the cells are the experimental units and thus represent the population that inferences will extrapolate to.

Using established terminology, we refer to cell identity as the combination of cell type, a stable molecular signature, and cell state, a transient snapshot of a cell’s molecular events (Wagner, Regev, and Yosef 2016; Trapnell 2015). This classification is inherently arbitrary, but still provides a basis for biological interpretation and a framework for discovering interesting expression patterns from scRNA-seq datasets. For example, T cells could be defined as a single (albeit diverse) cell type or could be divided into discrete subtypes, if relevant information to categorize each cell at this level were available. In either case, the framework presented here would be able to focus on the cell type of interest and look for changes (in expression) across samples.
Given the emergence of multi-sample multi-group scRNA-seq datasets, the goal becomes making sample-level inferences (i.e., experimental units are samples). Thus, differential state (DS) analysis is defined as following a given cell type across a set of samples (e.g., individuals) and experimental conditions (e.g., treatments), in order to identify cell-type-specific responses, i.e., changes in cell state. DS analysis: i) should be able to detect diluted changes that only affect a single cell type, a subset of cell types or even a subset of a single subpopulation; and, ii) is intended to be orthogonal to clustering or cell type assignment.

## 1.2 Starting point

The starting point for a DS analysis is a (sparse) matrix of gene expression, either as counts or some kind of normalized data, where rows = genes and columns = cells. Each cell additionally has a cluster (subpopulation) label as well as a sample label; metadata should accompany the list of samples, such that they can be organized into comparable groups with sample-level replicates (e.g., via a design matrix).

The approach presented here is modular and thus subpopulation labels could originate from an earlier step in the analysis, such as clustering (Duò, Robinson, and Soneson 2018; Freytag et al. 2018), perhaps after integration (Butler et al. 2018; Stuart et al. 2019) or after labeling of clusters (Diaz-Mejia et al. 2019) or after cell-level type assignment (Zhang et al. 2019).

# 2 Getting started

## 2.1 Data description

For this vignette, we will use a SingleCellExperiment (SCE) containing 10x droplet-based scRNA-seq PBCM data from 8 Lupus patients obtained befor and after 6h-treatment with IFN-$$\beta$$ (Kang et al. 2018). The complete raw data, as well as gene and cell metadata is available through the NCBI GEO, accession number GSE96583.

The Kang et al. (2018) dataset has been made available through Biocondcutor’s ExperimentHub and can be loaded into R as follows: We first initialize a Hub instance to search for and load available data with the ExperimentHub function, and store the complete list of records in the variable eh. Using query, we then retrieve any records that match our keyword(s) of interest, as well as their corresponding accession ID (EH1234).

library(ExperimentHub)
eh <- ExperimentHub()
query(eh, "Kang")
## ExperimentHub with 3 records
## # snapshotDate(): 2022-04-26
## # $dataprovider: NCI_GDC, GEO ## #$species: Homo sapiens
## # $rdataclass: character, SingleCellExperiment, BSseq ## # additional mcols(): taxonomyid, genome, description, ## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, ## # rdatapath, sourceurl, sourcetype ## # retrieve records with, e.g., 'object[["EH1661"]]' ## ## title ## EH1661 | Whole Genome Bisulfit Sequencing Data for 47 samples ## EH1662 | Whole Genome Bisulfit Sequencing Data for 47 samples ## EH2259 | Kang18_8vs8 Finally, we load the data of interest into R via [[ and the corresponding accession ID. The dataset contains >35,000 genes and ~29,000 cells: (sce <- eh[["EH2259"]]) ## class: SingleCellExperiment ## dim: 35635 29065 ## metadata(0): ## assays(1): counts ## rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB ## rowData names(2): ENSEMBL SYMBOL ## colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1 ## TTTGCATGTCTTAC-1 ## colData names(5): ind stim cluster cell multiplets ## reducedDimNames(1): TSNE ## mainExpName: NULL ## altExpNames(0): ## 2.3 Preprocessing The scater package (McCarthy et al. 2017) provides a variety of tools for preprocessing and quality control of single-cell transcriptomic data. For completeness, we will apply some minimal filtering steps to • remove undetected genes • remove cells with very few or many detected genes • remove very lowly expressed genes • compute normalized expression values for visualization For more thorough preprocessing, we refer to the Quality control with scater vignette. # remove undetected genes sce <- sce[rowSums(counts(sce) > 0) > 0, ] dim(sce) ## [1] 18890 29065 We use perCellQCMetrics to compute various per-cell quality control metrics, and proceed with filtering cells and genes as noted above: # calculate per-cell quality control (QC) metrics library(scater) qc <- perCellQCMetrics(sce) # remove cells with few or many detected genes ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
sce <- sce[, !ol]
dim(sce)
## [1] 18890 26820
# remove lowly expressed genes
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
dim(sce)
## [1]  7118 26820

Finally, we use logNormCounts to calculate log$$_2$$-transformed normalized expression values by dividing each count by its size factor, adding a pseudo-count of 1, and log-transforming1 Note that, in this workflow, expression values are used for visualization only, and that differential analyses are performed on pseudobulks (section 3.3) or the count data directly (section 3.4)..

# compute sum-factors & normalize
sce <- computeLibraryFactors(sce)
sce <- logNormCounts(sce)

Alternatively, expression values could be obtained via vst (variance stabilizing transformation) from the sctransform package (Hafemeister and Satija 2019), which returns Pearson residuals from a regularized negative binomial regression model that can be interpreted as normalized expression values:

library(sctransform)
assays(sce)$vstresiduals <- vst(counts(sce), verbosity = FALSE)$y

By default, scater’s functions will try to access the assay data specified via argument exprs_values (default logcounts) for e.g. visualization and dimension reduction. When an alternative assay such as the vstresiduals above should be used, it is thus necessary to explicitly specify this, for example, via runUMAP(sce, exprs_values = "vstresiduals") to compute UMAP cell embedings on the assay data compute above.

## 2.4 Data preparation

muscat expects a certain format of the input SCE. Specifically, the following cell metadata (colData) columns have to be provided:

• "sample_id": unique sample identifiers (e.g., PeterPan_ref1, Nautilus_trt3, …)
• "cluster_id": subpopulation (cluster) assignments (e.g., T cells, monocytes, …)
• "group_id": experimental group/condition (e.g., control/treatment, healthy/diseased, …)
sce$id <- paste0(sce$stim, sce$ind) (sce <- prepSCE(sce, kid = "cell", # subpopulation assignments gid = "stim", # group IDs (ctrl/stim) sid = "id", # sample IDs (ctrl/stim.1234) drop = TRUE)) # drop all other colData columns ## class: SingleCellExperiment ## dim: 7118 26820 ## metadata(1): experiment_info ## assays(2): counts logcounts ## rownames(7118): NOC2L HES4 ... S100B PRMT2 ## rowData names(2): ENSEMBL SYMBOL ## colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1 ## TTTGCATGTCTTAC-1 ## colData names(3): cluster_id sample_id group_id ## reducedDimNames(1): TSNE ## mainExpName: NULL ## altExpNames(0): For consistency and easy accession throughout this vignette, we will store cluster and sample IDs, as well as the number of clusters and samples into the following simple variables: nk <- length(kids <- levels(sce$cluster_id))
ns <- length(sids <- levels(sce$sample_id)) names(kids) <- kids; names(sids) <- sids ## 2.5 Data overview ### 2.5.1 Cluster-sample sizes As we will be aggregating measurements at the cluster-sample level, it is of particular importance to check the number of cells captured for each such instance. While aggregateData (see Section 3.1) allows excluding cluster-sample combinations with less than a threshold number of cells, clusters or samples with overall very low cell-counts may be excluded from further analysis at this point already. For the Kang et al. (2018) dataset, for example, one might consider removing the Dendritic cells and Megakaryocytes clusters, as these containg less than 50 cells across all samples. # nb. of cells per cluster-sample t(table(sce$cluster_id, sce$sample_id)) ## ## B cells CD14+ Monocytes CD4 T cells CD8 T cells Dendritic cells ## ctrl101 113 186 336 95 5 ## ctrl1015 476 783 919 226 11 ## ctrl1016 144 419 526 671 10 ## ctrl1039 30 116 202 30 5 ## ctrl107 51 222 197 31 2 ## ctrl1244 134 429 1215 82 28 ## ctrl1256 240 383 1136 156 12 ## ctrl1488 234 317 1343 78 17 ## stim101 144 222 437 121 20 ## stim1015 357 683 814 153 17 ## stim1016 129 361 426 600 10 ## stim1039 39 154 318 40 7 ## stim107 56 185 217 22 7 ## stim1244 94 318 980 46 19 ## stim1256 211 369 1047 133 16 ## stim1488 283 370 1658 73 34 ## ## FCGR3A+ Monocytes Megakaryocytes NK cells ## ctrl101 81 12 84 ## ctrl1015 232 25 208 ## ctrl1016 126 15 151 ## ctrl1039 28 5 20 ## ctrl107 29 5 49 ## ctrl1244 53 19 131 ## ctrl1256 50 15 275 ## ctrl1488 99 21 120 ## stim101 126 7 120 ## stim1015 222 24 224 ## stim1016 124 12 239 ## stim1039 36 13 32 ## stim107 36 4 51 ## stim1244 35 14 136 ## stim1256 73 21 257 ## stim1488 139 35 187 ### 2.5.2 Dimension reduction The dimension reductions (DR) available within the SCE can be accessed via reducedDims from the scater package. The data provided by Kang et al. (2018) already contains t-SNE coordinates; however, we can of course compute additional dimension reductions using one of scater’s runX functions: # compute UMAP using 1st 20 PCs sce <- runUMAP(sce, pca = 20) Using scater’s plotReducedDim function, we can plot t-SNE and UMAP representations colored by cluster and group IDs, respectively. We additionaly create a small wrapper function, .plot_dr(), to improve the readability of color legends and simplify the plotting theme: # wrapper to prettify reduced dimension plots .plot_dr <- function(sce, dr, col) plotReducedDim(sce, dimred = dr, colour_by = col) + guides(fill = guide_legend(override.aes = list(alpha = 1, size = 3))) + theme_minimal() + theme(aspect.ratio = 1) For our dataset, the t-SNE and UMAP colored by cluster_ids show that cell-populations are well-separated from one another. IFN-$$\beta$$ stimulation manifests as a severe shift in the low-dimensional projection of cells when coloring by group_ids, indicating widespread, genome-scale transcriptiontal changes. # downsample to max. 100 cells per cluster cs_by_k <- split(colnames(sce), sce$cluster_id)
cs100 <- unlist(sapply(cs_by_k, function(u)
sample(u, min(length(u), 100))))

# plot t-SNE & UMAP colored by cluster & group ID
for (dr in c("TSNE", "UMAP"))
for (col in c("cluster_id", "group_id"))
.plot_dr(sce[, cs100], dr, col)