ClusterFoldSimilarity 1.0.0
The package can be installed using the Bioconductor package manager, BiocManager:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ClusterFoldSimilarity")
library(ClusterFoldSimilarity)
Comparing single-cell data across different datasets, samples and batches has proven challenging. ClusterFoldSimilarity aims to reduce the complexity of comparing different single-cell datasets by computing similarity scores between clusters (or user-defined groups) from any number of independent single-cell experiments, including different species and sequencing technologies. It accomplishes this by identifying analogous fold-change patterns across cell groups that share a common set of features (such as genes). Additionally, it selects and reports the top features that contributed to the observed similarity, serving as a tool for feature selection.
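To make the idea of "analogous fold-change patterns" concrete, here is a toy sketch in base R. It is not the package's actual algorithm: the mean-expression matrices, the pairwise log2 fold-change, and the sum-of-products score are all assumptions chosen for illustration only.

```r
# Toy illustration only: NOT the scoring used by ClusterFoldSimilarity.
genes <- paste0("gene", 1:6)

# Hypothetical mean expression per cluster (rows = genes, columns = clusters)
# for two datasets that share the same six genes.
datasetA <- matrix(c(10, 1, 1, 5, 5, 5,    # cluster A1: gene1 high
                     1, 10, 1, 5, 5, 5),   # cluster A2: gene2 high
                   nrow = 6, dimnames = list(genes, c("A1", "A2")))
datasetB <- matrix(c(1, 8, 2, 4, 4, 4,     # cluster B1: gene2 high
                     9, 1, 2, 4, 4, 4),    # cluster B2: gene1 high
                   nrow = 6, dimnames = list(genes, c("B1", "B2")))

# log2 fold-change of each cluster against the other cluster of its own dataset
foldChange <- function(meanExpr) log2(meanExpr / meanExpr[, 2:1])
fcA <- foldChange(datasetA)
fcB <- foldChange(datasetB)

# Score every cluster pair by summing the per-gene products of fold-changes:
# genes that change in the same direction in both clusters raise the score.
similarity <- crossprod(fcA, fcB)   # rows = clusters of A, columns = clusters of B

# Best-matching cluster in dataset B for each cluster in dataset A
bestMatch <- setNames(colnames(similarity)[apply(similarity, 1, which.max)],
                      rownames(similarity))

# Gene contributing most to the A1 vs B2 score
topFeature <- names(which.max(fcA[, "A1"] * fcB[, "B2"]))
```

In this toy setup, clusters pair up according to which gene is elevated in each, and the per-gene products give a simple notion of "top contributing feature".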
The output is a table that contains the similarity values for all the combinations of cluster-pairs from the independent datasets. ClusterFoldSimilarity also includes various plotting utilities to enhance the interpretability of the similarity scores.
ClusterFoldSimilarity is able to compare any number of independent experiments, including experiments from different organisms, which makes it useful for matching cell populations across species. Additionally, it can be used with single-cell RNA-Seq data, single-cell ATAC-Seq data, or more broadly, with continuous numerical data that shows changes in feature abundance across a set of common features between different groups.
It can be easily integrated into any existing single-cell analysis pipeline, and it is compatible with the most widely used single-cell objects: Seurat and SingleCellExperiment.
Parallel computing is available through the option parallel=TRUE, which makes use of BiocParallel.
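As a hedged sketch of how a BiocParallel backend is usually configured: the snippet below registers a two-worker backend and runs a trivial parallel job. Whether clusterFoldSimilarity() picks up the registered default backend is an assumption here, since the text above only states that parallel=TRUE relies on BiocParallel. The final call is left commented out because it needs real data.

```r
library(BiocParallel)

# Register a two-worker backend as the session default.
# SnowParam works on all platforms; MulticoreParam is an alternative on Linux/macOS.
register(SnowParam(workers = 2))

# Any bplapply() call without an explicit BPPARAM now uses the registered backend.
squares <- bplapply(1:4, function(i) i^2)

# Hypothetical follow-up, assuming `singlecellObjectList` is a list of processed objects:
# similarityTable <- clusterFoldSimilarity(scList = singlecellObjectList, parallel = TRUE)
```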
Typically, ClusterFoldSimilarity will receive as input a list of two or more Seurat or SingleCellExperiment objects.
ClusterFoldSimilarity will obtain the raw count data from these objects (GetAssayData(assay, slot = "counts") in the case of Seurat, or counts() for SingleCellExperiment objects), and the group or cluster label information (using the Idents() function from Seurat, or colLabels() for SingleCellExperiment).
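Because the labels are read through colLabels() for SingleCellExperiment objects, the grouping of interest has to be stored there before calling the package. A minimal sketch on a made-up object (the toy count matrix and the cell.type column are assumptions for illustration):

```r
library(SingleCellExperiment)

set.seed(1)
# Toy raw counts: 20 genes x 10 cells (invented data)
toyCounts <- matrix(rpois(200, lambda = 5), nrow = 20,
                    dimnames = list(paste0("gene", 1:20), paste0("cell", 1:10)))
sce <- SingleCellExperiment(assays = list(counts = toyCounts))

# Hypothetical annotation column holding the groups we want to compare
sce$cell.type <- rep(c("alpha", "beta"), each = 5)

# Store the grouping where colLabels() will find it
colLabels(sce) <- sce$cell.type

# The two accessors mentioned above now return the counts and the labels
rawCounts <- counts(sce)
groupLabels <- colLabels(sce)
```

For a Seurat object, the equivalent step is setting the identities with Idents(), as done later in this example.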
For the sake of illustration, we will use the scRNAseq package, which provides numerous single-cell datasets ready for download, including samples of both human and mouse origin. In this example, we use two human single-cell datasets obtained from the pancreas.
library(Seurat)
library(scRNAseq)
library(dplyr)
# Human pancreatic single cell data 1
pancreasMuraro <- scRNAseq::MuraroPancreasData(ensembl=FALSE)
pancreasMuraro <- pancreasMuraro[,rownames(colData(pancreasMuraro)[!is.na(colData(pancreasMuraro)$label),])]
colData(pancreasMuraro)$cell.type <- colData(pancreasMuraro)$label
rownames(pancreasMuraro) <- make.names(unlist(lapply(strsplit(rownames(pancreasMuraro), split="__"), function(x)x[[1]])), unique = TRUE)
singlecell1Seurat <- CreateSeuratObject(counts=counts(pancreasMuraro), meta.data=as.data.frame(colData(pancreasMuraro)))
Var1 | Freq |
---|---|
acinar | 219 |
alpha | 812 |
beta | 448 |
delta | 193 |
duct | 245 |
endothelial | 21 |
epsilon | 3 |
mesenchymal | 80 |
pp | 101 |
unclear | 4 |
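The table above has the Var1/Freq layout that base R produces when a table() result is converted to a data frame. The call that generated it is not shown here, so the following is only a sketch of how such a per-label cell count can be obtained, using a made-up metadata frame:

```r
# Hypothetical cell metadata standing in for colData(pancreasMuraro)
toyMeta <- data.frame(label = c("alpha", "beta", "alpha", "delta", "beta", "alpha"))

# Count cells per label; converting the table to a data frame yields
# the columns Var1 (the label) and Freq (the number of cells).
labelCounts <- as.data.frame(table(toyMeta$label))
labelCounts
```

Inspecting these counts before the comparison helps spot very small groups (such as epsilon or unclear above), whose fold-changes rest on only a handful of cells.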
# Human pancreatic single cell data 2
pancreasBaron <- scRNAseq::BaronPancreasData(which="human", ensembl=FALSE)
colData(pancreasBaron)$cell.type <- colData(pancreasBaron)$label
rownames(pancreasBaron) <- make.names(rownames(pancreasBaron), unique = TRUE)
singlecell2Seurat <- CreateSeuratObject(counts=counts(pancreasBaron), meta.data=as.data.frame(colData(pancreasBaron)))
Var1 | Freq |
---|---|
acinar | 958 |
activated_stellate | 284 |
alpha | 2326 |
beta | 2525 |
delta | 601 |
ductal | 1077 |
endothelial | 252 |
epsilon | 18 |
gamma | 255 |
macrophage | 55 |
mast | 25 |
quiescent_stellate | 173 |
schwann | 13 |
t_cell | 7 |
Because we want to cluster each dataset and later compare those clusters using ClusterFoldSimilarity, we first need to normalize the data and identify variable features for each dataset independently.
Note: these steps should be tailored to each dataset independently; here we apply the same parameters to both for the sake of simplicity:
# Create a list with the unprocessed single-cell datasets
singlecellObjectList <- list(singlecell1Seurat, singlecell2Seurat)
# Apply the same processing to each dataset and return a list of processed single-cell objects
singlecellObjectList <- lapply(X=singlecellObjectList, FUN=function(scObject){
  scObject <- NormalizeData(scObject)
  scObject <- FindVariableFeatures(scObject, selection.method="vst", nfeatures=2000)
  scObject <- ScaleData(scObject, features=VariableFeatures(scObject))
  scObject <- RunPCA(scObject, features=VariableFeatures(object=scObject))
  scObject <- FindNeighbors(scObject, dims=seq(16))
  scObject <- FindClusters(scObject, resolution=0.4)
  # Return the processed object explicitly
  scObject
})
Once we have all of our single-cell datasets analyzed independently, we can compute the similarity values. clusterFoldSimilarity() takes as arguments:
- scList: a list of single-cell objects (mandatory) either of class Seurat or of class SingleCellExperiment.
- sampleNames: vector with names for each of the datasets. If not set the datasets will be named in the given order as: 1, 2, …, N.
- topN: the top n most similar clusters/groups to report for each cluster/group (default: 1, the top most similar cluster). If set to Inf it will return the values from all the possible cluster-pairs.
- topNFeatures: the top n features (e.g.: genes) that contribute to the observed similarity between the pair of clusters (default: 1, the top contributing gene). If a negative number, the tool will report the n most dissimilar features.
- nSubsampling: number of subsamplings (1/3 of cells on each iteration) at group level for calculating the fold-changes (default: 15). At start, the tool will report a message with the recommended number of subsamplings for the given data (average n of subsamplings needed to observe all cells).
- parallel: whether to use parallel computing with multiple threads or not (default: FALSE).
If we want to use a specific single-cell experiment for annotation (from which we know a ground-truth label, e.g. cell type, cell cycle, treatment… etc.), we can use that label to directly compare the single-cell datasets. Here we will use the annotated pancreas cell-type labels from the dataset 1 to illustrate how to match clusters to cell-types using a reference dataset:
# Assign the cell-type annotations from the original study as the cell labels:
Idents(singlecellObjectList[[1]]) <- factor(singlecellObjectList[[1]][[]][, "cell.type"])
library(ClusterFoldSimilarity)
similarityTable <- clusterFoldSimilarity(scList=singlecellObjectList,
                                         sampleNames=c("human", "humanNA"),
                                         topN=1,
                                         nSubsampling=24)
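The nSubsampling argument is described above in terms of how many subsamplings are needed, on average, to observe all cells when each iteration draws one third of them. The base R simulation below gives a feel for that quantity. It is only a sketch: the helper drawsToSeeAll() is hypothetical, and drawing without replacement within an iteration is an assumption, not the package's documented procedure.

```r
set.seed(1)

# Count how many random draws of a given fraction of the cells are needed
# until every cell has been drawn at least once (hypothetical helper).
drawsToSeeAll <- function(nCells, fraction = 1/3) {
  seen <- logical(nCells)
  nDraws <- 0
  while (!all(seen)) {
    # Draw a subset of cells without replacement and mark them as seen
    seen[sample.int(nCells, size = ceiling(nCells * fraction))] <- TRUE
    nDraws <- nDraws + 1
  }
  nDraws
}

# Repeat the experiment for a group of 300 cells and average the result
nDrawsNeeded <- replicate(200, drawsToSeeAll(nCells = 300))
mean(nDrawsNeeded)
```

Larger groups need more draws before every cell has been sampled, which is consistent with the tool recommending a data-dependent number of subsamplings at start-up.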