1 Installation

The package can be installed with BiocManager, the Bioconductor package manager:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ClusterFoldSimilarity")
library(ClusterFoldSimilarity)

2 Introduction

Comparing single-cell data across different datasets, samples and batches has proven challenging. ClusterFoldSimilarity addresses this problem by computing similarity scores between clusters (or user-defined groups) from any number of independent single-cell experiments, including experiments from different species and sequencing technologies. It does so by identifying analogous fold-change patterns across cell groups that share a common set of features (such as genes). Additionally, it selects and reports the features that contributed most to the observed similarity, so it can also serve as a feature-selection tool.

The output is a table that contains the similarity values for all cluster pairs across the independent datasets. ClusterFoldSimilarity also includes several plotting utilities that make the similarity scores easier to interpret.
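The idea of matching clusters through analogous fold-change patterns can be illustrated with a toy example in base R. This is only a sketch of the intuition, not the score that ClusterFoldSimilarity actually computes: the expression values, the cluster names (beta, alpha, c0, c1), the helper log2FoldChange() and the use of a simple sum of products as an agreement measure are all assumptions made for illustration.

```r
# Toy mean expression (genes x clusters) for two hypothetical datasets.
# All values are invented for illustration.
genes <- c("INS", "GCG", "SST", "PRSS1")
dataset1 <- matrix(c(50, 1, 1, 1,    # cluster "beta"
                     1, 40, 1, 1),   # cluster "alpha"
                   nrow = 4, dimnames = list(genes, c("beta", "alpha")))
dataset2 <- matrix(c(2, 80, 2, 2,    # cluster "c0"
                     90, 2, 2, 2),   # cluster "c1"
                   nrow = 4, dimnames = list(genes, c("c0", "c1")))

# Hypothetical helper: log2 fold-change of one cluster versus another,
# with a pseudocount of 1 to avoid dividing by zero.
log2FoldChange <- function(meanExpr, cluster, versus) {
  log2((meanExpr[, cluster] + 1) / (meanExpr[, versus] + 1))
}

# Fold-change pattern of the annotated "beta" cluster in dataset 1
fcBeta <- log2FoldChange(dataset1, "beta", "alpha")

# Fold-change patterns of each unlabelled cluster in dataset 2
fcC0 <- log2FoldChange(dataset2, "c0", "c1")
fcC1 <- log2FoldChange(dataset2, "c1", "c0")

# Toy agreement measure: fold-changes pointing in the same direction
# contribute positively, opposite directions contribute negatively.
agreement <- c(c0 = sum(fcBeta * fcC0), c1 = sum(fcBeta * fcC1))
bestMatch <- names(which.max(agreement))
bestMatch
```

In this toy data the beta cluster is matched to c1, the cluster in which INS is also up-regulated relative to the other cluster, even though the absolute expression levels differ between the two datasets.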

2.0.1 Cross-species analysis and sequencing technologies (e.g.: Human vs Mouse, ATAC-Seq vs RNA-Seq)

ClusterFoldSimilarity can compare any number of independent experiments, including experiments from different organisms, which makes it useful for matching cell populations in inter-species analyses. It can be used with single-cell RNA-Seq data, single-cell ATAC-Seq data or, more broadly, with any continuous numerical data that shows changes in feature abundance between groups across a set of common features.

2.0.2 Compatibility

It can be easily integrated into any existing single-cell analysis pipeline, and it is compatible with the most widely used single-cell objects: Seurat and SingleCellExperiment.

Parallel computing is available through the option parallel=TRUE, which makes use of BiocParallel.
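As a minimal sketch of how a BiocParallel backend can be configured before requesting parallel=TRUE: the snippet below registers a two-worker SnowParam backend as the session default. The number of workers is an arbitrary choice, and that ClusterFoldSimilarity picks up the registered default backend is an assumption here, so check the package documentation for your version.

```r
library(BiocParallel)

# Assumption: two workers are enough for illustration; adjust to your machine.
backend <- SnowParam(workers = 2)

# Register the backend as the default for functions that rely on BiocParallel.
register(backend)

# The first registered backend is the one returned by bpparam()
bpnworkers(bpparam())
```

SnowParam works on all platforms; on Linux and macOS, MulticoreParam is a common alternative.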

3 Using ClusterFoldSimilarity to find similar clusters/cell-groups across datasets

Typically, ClusterFoldSimilarity receives as input a list of two or more Seurat or SingleCellExperiment objects.

ClusterFoldSimilarity obtains the raw count data from these objects (GetAssayData(assay, slot = "counts") for Seurat, or counts() for SingleCellExperiment) and the group or cluster labels (Idents() for Seurat, or colLabels() for SingleCellExperiment).
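To make the expected input concrete, here is a small sketch that builds a SingleCellExperiment with the two pieces of information described above: a raw count matrix and per-cell group labels. The simulated counts, the gene and cell names, and the labels A and B are all hypothetical.

```r
library(SingleCellExperiment)

set.seed(1)
# Hypothetical raw counts: 20 genes (rows) x 10 cells (columns)
toyCounts <- matrix(rpois(200, lambda = 5), nrow = 20,
                    dimnames = list(paste0("gene", 1:20), paste0("cell", 1:10)))

# Store the matrix in the "counts" assay so that counts() can retrieve it
toySce <- SingleCellExperiment(assays = list(counts = toyCounts))

# Store the group (or cluster) label of each cell so that colLabels() can retrieve it
colLabels(toySce) <- factor(rep(c("A", "B"), each = 5))

# These are the two accessors mentioned above for SingleCellExperiment objects
dim(counts(toySce))
table(colLabels(toySce))
```

For a Seurat object the equivalent check would use GetAssayData() and Idents().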

For the sake of illustration, we will use the scRNAseq package, which provides numerous single-cell datasets ready for download, including samples of both human and mouse origin. In this example, we use two human single-cell datasets obtained from the pancreas.

library(Seurat)
library(scRNAseq)
library(dplyr)
# Human pancreatic single cell data 1
pancreasMuraro <- scRNAseq::MuraroPancreasData(ensembl=FALSE)
# Keep only the cells that have a cell-type label
pancreasMuraro <- pancreasMuraro[, !is.na(colData(pancreasMuraro)$label)]
colData(pancreasMuraro)$cell.type <- colData(pancreasMuraro)$label
# Keep the gene symbol (the part before "__") and make the names unique and syntactically valid
rownames(pancreasMuraro) <- make.names(unlist(lapply(strsplit(rownames(pancreasMuraro), split="__"), function(x)x[[1]])), unique = TRUE)
# Build a Seurat object from the raw counts and the cell metadata
singlecell1Seurat <- CreateSeuratObject(counts=counts(pancreasMuraro), meta.data=as.data.frame(colData(pancreasMuraro)))

Table 1: Cell types in the pancreas dataset from Muraro et al.
Cell type      Number of cells
acinar         219
alpha          812
beta           448
delta          193
duct           245
endothelial    21
epsilon        3
mesenchymal    80
pp             101
unclear        4

# Human pancreatic single cell data 2
pancreasBaron <- scRNAseq::BaronPancreasData(which="human", ensembl=FALSE)
colData(pancreasBaron)$cell.type <- colData(pancreasBaron)$label
rownames(pancreasBaron) <- make.names(rownames(pancreasBaron), unique = TRUE)

singlecell2Seurat <- CreateSeuratObject(counts=counts(pancreasBaron), meta.data=as.data.frame(colData(pancreasBaron)))

Table 2: Cell types in the pancreas dataset from Baron et al.
Cell type             Number of cells
acinar                958
activated_stellate    284
alpha                 2326
beta                  2525
delta                 601
ductal                1077
endothelial           252
epsilon               18
gamma                 255
macrophage            55
mast                  25
quiescent_stellate    173
schwann               13
t_cell                7

Because we want to cluster each dataset and later compare the resulting cluster groups with ClusterFoldSimilarity, we first need to normalize the data and identify variable features for each dataset independently.

Note: these steps should be tailored to each independent dataset. Here we apply the same parameters to both for the sake of simplicity:

# Create a list with the unprocessed single-cell datasets
singlecellObjectList <- list(singlecell1Seurat, singlecell2Seurat)
# Apply the same processing to each dataset and return a list of processed objects
singlecellObjectList <- lapply(X=singlecellObjectList, FUN=function(scObject){
  scObject <- NormalizeData(scObject)
  scObject <- FindVariableFeatures(scObject, selection.method="vst", nfeatures=2000)
  scObject <- ScaleData(scObject, features=VariableFeatures(scObject))
  scObject <- RunPCA(scObject, features=VariableFeatures(object=scObject))
  # Build the neighbor graph on the first 16 principal components, then cluster
  scObject <- FindNeighbors(scObject, dims=seq(16))
  scObject <- FindClusters(scObject, resolution=0.4)
  # Return the processed object explicitly
  scObject
})

Once we have all of our single-cell datasets analyzed independently, we can compute the similarity values. clusterFoldSimilarity() takes as arguments:

  • scList: a list of single-cell objects (mandatory) either of class Seurat or of class SingleCellExperiment.
  • sampleNames: vector with names for each of the datasets. If not set the datasets will be named in the given order as: 1, 2, …, N.
  • topN: the top n most similar clusters/groups to report for each cluster/group (default: 1, the top most similar cluster). If set to Inf it will return the values from all the possible cluster-pairs.
  • topNFeatures: the top n features (e.g.: genes) that contribute to the observed similarity between the pair of clusters (default: 1, the top contributing gene). If a negative number, the tool will report the n most dissimilar features.
  • nSubsampling: number of subsamplings (1/3 of the cells in each iteration) at group level for calculating the fold-changes (default: 15). At startup, the tool reports a message with the recommended number of subsamplings for the given data (the average number of subsamplings needed to observe all cells).
  • parallel: whether to use parallel computing with multiple threads or not (default: FALSE).

If we want to use a specific single-cell experiment for annotation (one for which we know a ground-truth label, e.g. cell type, cell cycle or treatment), we can use that label to directly compare the single-cell datasets.

Here we will use the annotated pancreas cell-type labels from dataset 1 to illustrate how to match clusters to cell types using a reference dataset:

# Use the cell-type annotation from the original study as the cell identities of dataset 1:
Idents(singlecellObjectList[[1]]) <- factor(singlecellObjectList[[1]][[]][, "cell.type"])

library(ClusterFoldSimilarity)
similarityTable <- clusterFoldSimilarity(scList=singlecellObjectList, 
                                         sampleNames=c("human", "humanNA"),
                                         topN=1, 
                                         nSubsampling=24)
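As a rough intuition for the nSubsampling argument, the sketch below simulates how many subsampling iterations are needed, on average, before every cell of a group has been drawn at least once when 1/3 of the cells are sampled per iteration. This is a simulation written for illustration, not the formula the package uses for its recommendation; the helper iterationsToSeeAll(), the number of simulation runs and the choice of a group of 219 cells (the acinar group in Table 1) are assumptions.

```r
set.seed(42)

# Hypothetical helper: count the iterations needed until every cell in a
# group of size nCells has been drawn at least once, sampling a fixed
# fraction of the cells (without replacement) in each iteration.
iterationsToSeeAll <- function(nCells, fraction = 1/3) {
  seen <- rep(FALSE, nCells)
  iterations <- 0
  while (!all(seen)) {
    drawn <- sample.int(nCells, size = ceiling(nCells * fraction))
    seen[drawn] <- TRUE
    iterations <- iterations + 1
  }
  iterations
}

# Example group size: the 219 acinar cells of the Muraro dataset
groupSize <- 219
runs <- replicate(200, iterationsToSeeAll(groupSize))

# Average number of iterations needed to observe all cells in this group
mean(runs)
```

Larger groups generally need more iterations to be fully covered, which is one reason to follow the recommendation printed by the tool rather than relying on the default.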