ILoReg
is a tool for cell population identification from singlecell RNAseq (scRNAseq) data. In our paper [1], we showed that ILoReg
was able to identify, by both unsupervised clustering and visually, rare cell populations that other scRNAseq data analysis pipelines were unable to identify.
The figure below illustrates the workflows of ILoReg
and a typical pipeline that applies feature selection prior to dimensionality reduction by principal component analysis (PCA).
In contrast to most scRNAseq data analysis pipelines, ILoReg
does not reduce the dimensionality of the gene expression matrix by feature selection. Instead, it performs probabilistic feature extraction using iterative clustering projection (ICP), generating an \(N \times k\) dimensional probability matrix, which contains probabilities of each of the \(N\) cells belonging to the \(k\) clusters. ICP is a novel selfsupervised learning algorithm that iteratively seeks a clustering with \(k\) clusters that maximizes the adjusted Rand index (ARI) between the clustering \(C\) and its projection \(C'\) by L1regularized logistic regression. In the ILoReg consensus approach, ICP is run \(L\) times and the \(L\) probability matrices are merged into a joint probability matrix and subsequently transformed by principal component analysis (PCA) into a lower dimensional (\(N \times p\)) matrix (consensus matrix). The final clustering step is performed using hierarhical clustering by the Ward’s method, after which the user can extract a clustering with \(K\) consensus clusters. However, the user can also use any other clustering method at this point. Twodimensional visualization is supported using two popular nonlinear dimensionality reduction methods: tdistributed stochastic neighbor embedding (tSNE) and uniform manifold approximation and projection (UMAP). Additionally, ILoReg provides userfriendly functions that enable identification of differentially expressed (DE) genes and visualization of gene expression.
ILoReg
can be downloaded from Bioconductor and installed by executing the following command in the R console.
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ILoReg")
In the following, we go through the different steps of ILoReg
’s workflow and demonstrate it using a peripheral blood mononuclear cell (PBMC) dataset. The toy dataset included in the ILoReg
R package (pbmc3k_500
) contains 500 cells that have been downsampled from the pbmc3k dataset [2]. The preprocessing was rerun with a newer reference genome (GRCh38.p12) and Cell Ranger v2.2.0 [3] to identify different immunoglobulin subpopulations in Bcells.
The only required input for ILoReg
is a logtransformed, normalized gene expression matrix that has been, with genes/features in rows and cells/samples in columns. The input can be of matrix
, data.frame
or dgCMatrix
class, which is then transformed into a sparse object of dgCMatrix
class. Please note that the method has been designed to work with sparse data, i.e. with a high proportion of zero values. If, for example, the features of your dataset have been standardized, the run time and the memory usage of ILoReg
will likely be much higher.
suppressMessages(library(ILoReg))
suppressMessages(library(SingleCellExperiment))
suppressMessages(library(cowplot))
# The dataset was normalized using the LogNormalize method from the Seurat R package.
sce < SingleCellExperiment(assays = list(logcounts = pbmc3k_500))
sce < PrepareILoReg(sce)
## Data in `logcounts` slot already of `dgCMatrix` class...
## 13865/13865 genes remain after filtering genes with only zero values.
Running ICP \(L\) times in parallel is the most computationally demanding part of the workflow.
In the following, we give a brief summary of the parameters.
As general guidelines on how to adjust the parameters, we recommend leaving \(r\) and \(L\) to their defaults (\(r=5\) and \(L=200\)). However, increasing \(k\) from 15 to e.g. 30 can reveal new cell subsets that are of interest. Regarding \(d\), increasing it to somewhere between 0.40.6 helps if the user wants lower resolution (less distinguishable populations). Increasing \(C\) from 0.3 to 1 reduces the number of distinguishable populations, as the logistic regression model filters out fewer genes.
# ICP is stochastic. To obtain reproducible results, use set.seed().
set.seed(1)
# Run ICP L times. This is the slowest step of the workflow,
# and parallel processing can be used to greatly speed it up.
sce < RunParallelICP(object = sce, k = 15,
d = 0.3, L = 30,
r = 5, C = 0.3,
reg.type = "L1", threads = 0)
##

  0%

==  3%

=====  7%

=======  10%

==========  14%

============  17%

==============  21%

=================  24%

===================  28%

======================  31%

========================  34%

===========================  38%

=============================  41%

===============================  45%

==================================  48%

====================================  52%

=======================================  55%

=========================================  59%

===========================================  62%

==============================================  66%

================================================  69%

===================================================  72%

=====================================================  76%

========================================================  79%

==========================================================  83%

============================================================  86%

===============================================================  90%

=================================================================  93%

====================================================================  97%

====================================================================== 100%
The \(L\) probability matrices are merged into a joint probability matrix, which is then transformed into a lower dimensionality by PCA. Before applying PCA, the user can optionally scale the cluster probabilities to unitvariance.
# p = number of principal components
sce < RunPCA(sce,p=50,scale = FALSE)
Optional: PCA requires the user to specify the number of principal components, for which we selected the default value \(p=50\). To aid in decision making, the elbow plot is commonly used to seek an elbow point, of which proximity the user selects \(p\). In this case the point would be close to \(p=10\). Trying both a \(p\) that is close to the elbow point and the default \(p=50\) is recommended.
PCAElbowPlot(sce)
To visualize the data in twodimensional space, nonlinear dimensionality reduction is performed using tSNE or UMAP. The input data for this step is the \(N \times p\) dimensional consensus matrix.
sce < RunUMAP(sce)
sce < RunTSNE(sce,perplexity=30)
Visualize the tSNE and UMAP transformations using the GeneScatterPlot
function, highlighting expression levels of CD3D (T cells), CD79A (B cells), CST3 (monocytes, dendritic cells, platelets), FCER1A (myeloid dendritic cells).
GeneScatterPlot(sce,c("CD3D","CD79A","CST3","FCER1A"),
dim.reduction.type = "umap",
point.size = 0.3)