ILoReg is a tool for cell population identification from single-cell RNA-seq (scRNA-seq) data. In our paper, we showed that
ILoReg was able to identify, by both unsupervised clustering and visually, rare cell populations that other scRNA-seq data analysis pipelines were unable to identify.
The figure below illustrates the workflows of
ILoReg and a typical pipeline that applies feature selection prior to dimensionality reduction by principal component analysis (PCA).
In contrast to most scRNA-seq data analysis pipelines,
ILoReg does not reduce the dimensionality of the gene expression matrix by feature selection. Instead, it performs probabilistic feature extraction using iterative clustering projection (ICP), generating an \(N \times k\) probability matrix, which contains the probability of each of the \(N\) cells belonging to each of the \(k\) clusters. ICP is a novel self-supervised learning algorithm that iteratively seeks a clustering with \(k\) clusters that maximizes the adjusted Rand index (ARI) between the clustering \(C\) and its projection \(C'\) obtained by L1-regularized logistic regression. In the ILoReg consensus approach, ICP is run \(L\) times, and the \(L\) probability matrices are merged into a joint probability matrix, which is subsequently transformed by principal component analysis (PCA) into a lower-dimensional (\(N \times p\)) matrix (the consensus matrix). The final clustering step is performed using hierarchical clustering with Ward's method, after which the user can extract a clustering with \(K\) consensus clusters. However, the user can also use any other clustering method at this point. Two-dimensional visualization is supported using two popular nonlinear dimensionality reduction methods: t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP). Additionally, ILoReg provides user-friendly functions that enable identification of differentially expressed (DE) genes and visualization of gene expression.
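The ARI that ICP maximizes can be illustrated with a small base R sketch. This is not the package's implementation: the function name `adjusted_rand_index` and the toy label vectors are hypothetical and serve only to show how agreement between a clustering and its projection is scored.

```r
# Adjusted Rand index between two clusterings of the same cells (base R only).
# Hypothetical helper for illustration; not part of the ILoReg API.
adjusted_rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)                          # contingency table of label pairs
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs grouped together in both
  sum_rows <- sum(choose(rowSums(tab), 2))    # pairs grouped together in a
  sum_cols <- sum(choose(colSums(tab), 2))    # pairs grouped together in b
  expected <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Toy labels (made up): a clustering, the same partition with permuted labels,
# and a projection in which two cells changed cluster.
clustering <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
relabelled <- c(3, 3, 3, 1, 1, 1, 2, 2, 2)
projection <- c(1, 1, 2, 2, 2, 2, 3, 3, 1)

ari_same <- adjusted_rand_index(clustering, relabelled)
ari_proj <- adjusted_rand_index(clustering, projection)
```

The ARI does not depend on how the clusters are labelled, so `ari_same` equals 1, whereas `ari_proj` is lower because the projection disagrees with the clustering for two cells. The sketch does not handle the degenerate case in which both clusterings place all cells in a single cluster.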
ILoReg can be downloaded from Bioconductor and installed by executing the following command in the R console.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ILoReg")
In the following, we go through the different steps of
ILoReg’s workflow and demonstrate it using a peripheral blood mononuclear cell (PBMC) dataset. The toy dataset included in the
ILoReg R package (
pbmc3k_500) contains 500 cells that have been downsampled from the pbmc3k dataset. The preprocessing was rerun with a newer reference genome (GRCh38.p12) and Cell Ranger v2.2.0 to identify different immunoglobulin subpopulations in B cells.
The only required input for
ILoReg is a log-transformed, normalized gene expression matrix with genes/features in rows and cells/samples in columns. If the input is not already of
dgCMatrix class, it is transformed into a sparse object of that class. Please note that the method has been designed to work with sparse data, i.e. data with a high proportion of zero values. If, for example, the features of your dataset have been standardized, the run time and the memory usage of
ILoReg will likely be much higher.
suppressMessages(library(ILoReg))
suppressMessages(library(SingleCellExperiment))
suppressMessages(library(cowplot))
# The dataset was normalized using the LogNormalize method from the Seurat R package.
sce <- SingleCellExperiment(assays = list(logcounts = pbmc3k_500))
sce <- PrepareILoReg(sce)
## Data in `logcounts` slot already of `dgCMatrix` class...
## 13865/13865 genes remain after filtering genes with only zero values.
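The earlier note about standardized features can be made concrete with a small sketch that uses the Matrix package. The toy matrix below is simulated (it is not the PBMC data), and the variable names are hypothetical. It shows that standardizing each gene removes almost all zeros, which is why a sparse representation no longer saves memory.

```r
library(Matrix)

set.seed(1)
# Hypothetical toy matrix: 2000 genes x 100 cells, roughly 90% zeros.
n_genes <- 2000
n_cells <- 100
values <- rpois(n_genes * n_cells, lambda = 0.1)
dense <- matrix(log1p(values), nrow = n_genes, ncol = n_cells)

# Sparse storage keeps only the non-zero entries.
sparse <- Matrix(dense, sparse = TRUE)

# Standardizing each gene (row) shifts the zeros to non-zero values.
standardized <- t(scale(t(dense)))
standardized[is.na(standardized)] <- 0   # genes with zero variance, if any

zero_fraction_before <- mean(dense == 0)
zero_fraction_after <- mean(standardized == 0)
size_sparse <- as.numeric(object.size(sparse))
size_dense <- as.numeric(object.size(dense))
```

Comparing `zero_fraction_before` with `zero_fraction_after`, and `size_sparse` with `size_dense`, indicates how much of the memory saving depends on the zeros being preserved.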
Running ICP \(L\) times in parallel is the most computationally demanding part of the workflow.
In the following, we give a brief summary of the parameters.
As general guidelines on how to adjust the parameters, we recommend leaving \(r\) and \(L\) at their defaults (\(r=5\) and \(L=200\)). However, increasing \(k\) from 15 to e.g. 30 can reveal new cell subsets that are of interest. Regarding \(d\), increasing it to somewhere between 0.4 and 0.6 helps if the user wants lower resolution (fewer distinguishable populations). Increasing \(C\) from 0.3 to 1 reduces the number of distinguishable populations, as the logistic regression model filters out fewer genes.
# ICP is stochastic. To obtain reproducible results, use set.seed().
set.seed(1)
# Run ICP L times. This is the slowest step of the workflow,
# and parallel processing can be used to greatly speed it up.
sce <- RunParallelICP(object = sce, k = 15, d = 0.3, L = 30, r = 5, C = 0.3, reg.type = "L1", threads = 0)
## |======================================================================| 100%
The \(L\) probability matrices are merged into a joint probability matrix, which is then reduced to a lower dimensionality by PCA. Before applying PCA, the user can optionally scale the cluster probabilities to unit variance.
# p = number of principal components
sce <- RunPCA(sce, p = 50, scale = FALSE)
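As a rough illustration of what the merging and PCA steps do, the base R sketch below simulates a few probability matrices, binds them column-wise, and reduces the result with `prcomp`. It rests on several assumptions: the matrices are random rather than ICP output, the column-wise merge and the centering are guesses about the internals, and `prcomp` stands in for whatever PCA routine the package uses. All names are hypothetical.

```r
set.seed(1)
n_cells <- 60   # N
k <- 3          # clusters per ICP run
n_runs <- 4     # L

# Simulate one N x k probability matrix whose rows sum to 1.
simulate_probabilities <- function(n, k) {
  raw <- matrix(rexp(n * k), nrow = n, ncol = k)
  raw / rowSums(raw)
}

probability_matrices <- replicate(n_runs,
                                  simulate_probabilities(n_cells, k),
                                  simplify = FALSE)

# Merge the L matrices column-wise into an N x (L * k) joint matrix.
joint <- do.call(cbind, probability_matrices)

# Keep the first p principal components as a stand-in for the consensus matrix.
# scale. = FALSE corresponds to leaving the probabilities unscaled.
p <- 5
pca <- prcomp(joint, center = TRUE, scale. = FALSE)
consensus <- pca$x[, seq_len(p), drop = FALSE]
```

Here `joint` has \(L \times k = 12\) columns and `consensus` has \(p = 5\); setting `scale. = TRUE` would correspond to scaling the cluster probabilities to unit variance before PCA.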
Optional: PCA requires the user to specify the number of principal components, for which we selected the default value \(p=50\). To aid in this decision, an elbow plot is commonly used to find an elbow point, and \(p\) is then chosen close to that point. In this case the point would be close to \(p=10\). Trying both a \(p\) that is close to the elbow point and the default \(p=50\) is recommended.
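The idea behind an elbow plot can be sketched in base R on simulated data. The example assumes a made-up data set with three strong latent directions plus noise; it uses `prcomp` and `plot` and is not the package's own plotting function.

```r
set.seed(1)
# Hypothetical data: 100 observations, 20 variables, 3 strong latent directions.
latent <- matrix(rnorm(100 * 3, sd = 5), nrow = 100)
loadings <- matrix(rnorm(3 * 20), nrow = 3)
x <- latent %*% loadings + matrix(rnorm(100 * 20), nrow = 100)

pca <- prcomp(x, center = TRUE, scale. = FALSE)
sdev <- pca$sdev   # standard deviation of each principal component

# Elbow plot: the curve flattens after the informative components.
plot(seq_along(sdev), sdev, type = "b",
     xlab = "Principal component", ylab = "Standard deviation")
```

In this simulated case the standard deviations drop sharply after the third component, so the elbow sits near 3. With real data the elbow is usually less pronounced, which is one reason to try more than one value of \(p\).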
To visualize the data in two-dimensional space, nonlinear dimensionality reduction is performed using t-SNE or UMAP. The input data for this step is the \(N \times p\) -dimensional consensus matrix.
sce <- RunUMAP(sce)
sce <- RunTSNE(sce, perplexity = 30)
Visualize the t-SNE and UMAP transformations using the
GeneScatterPlot function, highlighting expression levels of CD3D (T cells), CD79A (B cells), CST3 (monocytes, dendritic cells, platelets), and FCER1A (myeloid dendritic cells).
GeneScatterPlot(sce,c("CD3D","CD79A","CST3","FCER1A"), dim.reduction.type = "umap", point.size = 0.3)