ccfindR 1.25.0
The ccfindR
(Cancer Clone findeR) package1 contains
implementations and utilities for analyzing single-cell
RNA-sequencing data, including quality control,
unsupervised clustering for discovery of cell types,
and visualization of the outcomes. It is especially
suitable for analysis of transcript-count data utilizing
unique molecular identifiers (UMIs), e.g., data derived
from 10x Genomics platform. In these data sets, RNA counts
are non-negative integers, enabling clustering using non-negative
matrix factorization (NMF).2
Input data are UMI counts in the form of a matrix with each genetic
feature (“genes”) in rows and cells (tagged by barcodes) in columns,
produced by read alignment and counting pipelines. The count matrix and
associated gene and cell annotation files are bundled into a main
object of class scNMFSet
, which extends the
SingleCellExperiment
class [http://dx.doi.org/10.18129/B9.bioc.SingleCellExperiment)].
Quality control for both cells and genes can be performed via filtering
steps based on UMI counts and variance of expressions, respectively.
The NMF factorization is first performed for multiple values of
ranks (the reduced dimension of factorization) to find the most
likely value. A production run for the chosen rank then leads to
factor matrices, allowing the user to identify and visualize genes
representative of clusters and assign cells into clusters.
The NMF approach offers a means to identify cell subtypes and classify individual cells into these clusters based on clustering using expression counts. In contrast to alternatives such as principal component analyses,3 NMF leverages the non-negative nature of count data and factorizes the data matrix \(\sf X\) into two factor matrices \(\sf W\) and \(\sf H\):2
\[\begin{equation} \sf{X} \sim {\sf W}{\sf H}. \end{equation}\]
If \(\sf X\) is a \(p\times n\) matrix (\(p\) genes and \(n\) cells), the basis matrix \(\sf W\) is \(p \times r\) and coefficient matrix \(\sf H\) is \(r \times n\) in dimension, respectively, where the rank \(r\) is a relatively small integer. A statistical inference-based interpretation of NMF is to view \(X_{ij}\) as a realization of a Poisson distribution with the mean for each matrix element given by \(({\sf WH})_{ij}\equiv \Lambda_{ij}\), or
\[\begin{equation} \Pr(x_{ij})=\frac{e^{-\Lambda_{ij}}{\Lambda_{ij}}^{x_{ij}}} {\Gamma(1+x_{ij})}. \end{equation}\]
The maximum likelihood inference of the latter is then achieved by maximizing
\[\begin{equation} L = \sum_{ij} \left(X_{ij} \ln \frac{\Lambda_{ij}}{X_{ij}}- \Lambda_{ij}+X_{ij}\right). \end{equation}\]
The Kullback-Leibler measure of the distance between \(\sf X\) and \(\sf \Lambda\), which is minimized, is equal to \(-L\). Lee and Seung’s update rule2 solves this optimization task iteratively.
While also including this classical iterative update
algorithm to find basis and coefficient factors of the count matrix, the
main workhorse in ccfindR
is the variational Bayesian inference
algorithm proposed by Cemgil.4 Thus the key distinguishing
features of ccfindR
1 compared to other existing implementations – NMF
for generic data5 and NMFEM
for single-cell analysis6 –
are
In particular, a traditional way (in maximum likelihood inference) to determine the rank is to evaluate the factorization quality measures (and optionally compare with those from randomized data). The Bayesian formulation of NMF algorithm instead incorporates priors for factored matrix elements \(\sf W\) and \(\sf H\) modeled by gamma distributions. Inference can be combined with hyperparameter update to optimize the marginal likelihood (ML; conditional probability of data under hyperparameters and rank), which provides a statistically well-controlled means to determine the optimal rank describing data.
For large rank values, it can be challenging to interpret clusters identified. To facilitate biological interpretation, we provide a procedure where cluster assignment of cells is repeated for multiple rank values, typically ranging from 2 to the optimal rank, and a phylogenetic tree connecting different clusters at neighboring rank values are constructed. This tree gives an overview of different types of cells present in the system viewed at varying resolution.
We illustrate a typical workflow with a single-cell count data set generated from peripheral blood mononuclear cell (PBMC) data [https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/]. The particular data set used below was created by sampling from five purified immune cell subsets.
To install the package, do the following:
BiocManager::install('ccfindR')
After installation, load the package by
library(ccfindR)
The input data can be a simple matrix:
# A toy matrix for count data
set.seed(1)
mat <- matrix(rpois(n = 80, lambda = 2), nrow = 4, ncol = 20)
ABC <- LETTERS[1:4]
abc <- letters[1:20]
rownames(mat) <- ABC
colnames(mat) <- abc
The main S4
object containing data and subsequent analysis outcomes is of
class scNMFSet
, created by
# create scNMFSet object
sc <- scNMFSet(count = mat)
This class extends SingleCellExperiment
class,
adding extra slots for storing factorization outcomes.
In particular, assays
, rowData
, and colData
slots of
SingleCellExperiment
class are used to store RNA count matrix,
gene, and cell annotation data frames, respectively.
In the simplest initialization above, the named argument count
is
used as the count matrix and is equivalent to
# create scNMFSet object
sc <- scNMFSet(assays = list(counts = mat))
See SingleCellExperiment
documentations for more details of these
main slots. For instance, row and column names can be stored by
# set row and column names
suppressMessages(library(S4Vectors))
genes <- DataFrame(ABC)
rownames(genes) <- ABC
cells <- DataFrame(abc)
rownames(cells) <- abc
sc <- scNMFSet(count = mat, rowData = genes, colData = cells)
sc
## class: scNMFSet
## dim: 4 20
## rownames: [1] "A" "B" "C" "D"
## colnames: [1] "a" "b" "c" "d" "e" "f"
Alternatively, sparse matrix format (of class dgCMatrix
) can be used.
A MatrixMarket
format file can be read directly:
# read sparse matrix
dir <- system.file('extdata', package = 'ccfindR')
mat <- Matrix::readMM(paste0(dir,'/matrix.mtx'))
rownames(mat) <- 1:nrow(mat)
colnames(mat) <- 1:ncol(mat)
sc <- scNMFSet(count = mat, rowData = DataFrame(1:nrow(mat)),
colData = DataFrame(1:ncol(mat)))
sc
## class: scNMFSet
## dim: 1030 450
## rownames: [1] "1" "2" "3" "4" "5" "6"
## colnames: [1] "1" "2" "3" "4" "5" "6"
The number of rows in assays$counts
and rowData
, the number of columns
in assays$counts
and rows in colData
must match.
The gene and barcode meta-data and count files resulting from 10x Genomics’ Cell Ranger pipeline (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) can also be read:
# read 10x files
sc <- read_10x(dir = dir, count = 'matrix.mtx', genes = 'genes.tsv',
barcodes = 'barcodes.tsv')
## 'as(<dgTMatrix>, "dgCMatrix")' is deprecated.
## Use 'as(., "CsparseMatrix")' instead.
## See help("Deprecated") and help("Matrix-deprecated").
sc
## class: scNMFSet
## dim: 1030 450
## rownames: [1] "ENSG00000187608" "ENSG00000186891" "ENSG00000127054" "ENSG00000158109"
## [5] "ENSG00000116251" "ENSG00000074800"
## colnames: [1] "ATGCAGTGCTTGGA-1" "CATGTACTCCATGA-1" "GAGAAATGGCAAGG-1" "TGATATGACGTTAG-1"
## [5] "AGTAGGCTCGGGAA-1" "TGACCGCTGTAGCT-1"
The parameter dir
is the directory containing the files.
Filenames above are the defaults and can be omitted.
The function returns an scNMFSet
object.
By default, any row or column entirely consisting of zeros in counts
and the
corresponding elements in rowData
and colData
slots will be removed.
This feature can be turned off by remove.zeros = FALSE
.
For quality control, cells and genes can be filtered manually using normal
subsetting syntax of R
: the slots in the object sc
are accessed and
edited using accessors and sub-setting rules;
see
SingleCellExperiment
:
# slots and subsetting
counts(sc)[1:7,1:3]
## 7 x 3 sparse Matrix of class "dgCMatrix"
## ATGCAGTGCTTGGA-1 CATGTACTCCATGA-1 GAGAAATGGCAAGG-1
## ENSG00000187608 . 2 .
## ENSG00000186891 . . .
## ENSG00000127054 . . .
## ENSG00000158109 . . .
## ENSG00000116251 . 3 .
## ENSG00000074800 2 . .
## ENSG00000162444 . . .
head(rowData(sc))
## DataFrame with 6 rows and 2 columns
## V1 V2
## <character> <character>
## ENSG00000187608 ENSG00000187608 ISG15
## ENSG00000186891 ENSG00000186891 TNFRSF18
## ENSG00000127054 ENSG00000127054 CPSF3L
## ENSG00000158109 ENSG00000158109 TPRG1L
## ENSG00000116251 ENSG00000116251 RPL22
## ENSG00000074800 ENSG00000074800 ENO1
head(colData(sc))
## DataFrame with 6 rows and 1 column
## V1
## <character>
## ATGCAGTGCTTGGA-1 ATGCAGTGCTTGGA-1
## CATGTACTCCATGA-1 CATGTACTCCATGA-1
## GAGAAATGGCAAGG-1 GAGAAATGGCAAGG-1
## TGATATGACGTTAG-1 TGATATGACGTTAG-1
## AGTAGGCTCGGGAA-1 AGTAGGCTCGGGAA-1
## TGACCGCTGTAGCT-1 TGACCGCTGTAGCT-1
sc2 <- sc[1:20,1:70] # subsetting of object
sc2 <- remove_zeros(sc2) # remove empty rows/columns
## 6 empty rows removed
sc2
## class: scNMFSet
## dim: 14 70
## rownames: [1] "ENSG00000187608" "ENSG00000186891" "ENSG00000127054" "ENSG00000158109"
## [5] "ENSG00000116251" "ENSG00000074800"
## colnames: [1] "ATGCAGTGCTTGGA-1" "CATGTACTCCATGA-1" "GAGAAATGGCAAGG-1" "TGATATGACGTTAG-1"
## [5] "AGTAGGCTCGGGAA-1" "TGACCGCTGTAGCT-1"
We provide two streamlined functions each for cell and gene filtering, which are illustrated below:
sc <- filter_cells(sc, umi.min = 10^2.6, umi.max = 10^3.4)
## 438 cells out of 450 selected
## 21 empty rows removed
markers <- c('CD4','CD8A','CD8B','CD19','CD3G','CD3D',
'CD3Z','CD14')
sc0 <- filter_genes(sc, markers = markers, vmr.min = 1.5,
min.cells.expressed = 50, rescue.genes = FALSE)
## 5 marker genes found
## 297 variable genes out of 1009
## 299 genes selected