The ccfindR (Cancer Clone findeR) package contains implementations and utilities for analyzing single-cell RNA-sequencing data, including quality control, unsupervised clustering for discovery of cell types, and visualization of the outcomes. It is especially suitable for analysis of transcript-count data utilizing unique molecular identifiers (UMIs), e.g., data derived from 10x Genomics platform. In these data sets, RNA counts are non-negative integers, enabling clustering using non-negative matrix factorization (NMF)1.

Input data are UMI counts in the form of a matrix with each genetic feature (“genes”) in rows and cells (tagged by barcodes) in columns, produced by read alignment and counting pipelines. The count matrix and associated gene and cell annotation files are bundled into a main object of class scNMFSet, which extends the SingleCellExperiment class [http://dx.doi.org/10.18129/B9.bioc.SingleCellExperiment)]. Quality control for both cells and genes can be performed via filtering steps based on UMI counts and variance of expressions, respectively. The NMF factorization is first performed for multiple values of ranks (the reduced dimension of factorization) to find the most likely value. A production run for the chosen rank then leads to factor matrices, allowing the user to identify and visualize genes representative of clusters and assign cells into clusters.

1 Algorithm

The NMF approach offers a means to identify cell subtypes and classify individual cells into these clusters based on clustering using expression counts. In contrast to alternatives such as principal component analyses2, NMF leverages the non-negative nature of count data and factorizes the data matrix \(\sf X\) into two factor matrices \(\sf W\) and \(\sf H\)1:

\[\begin{equation} \sf{X} \sim {\sf W}{\sf H}. \end{equation}\]

If \(\sf X\) is a \(p\times n\) matrix (\(p\) genes and \(n\) cells), the basis matrix \(\sf W\) is \(p \times r\) and coefficient matrix \(\sf H\) is \(r \times n\) in dimension, respectively, where the rank \(r\) is a relatively small integer. A statistical inference-based interpretation of NMF is to view \(X_{ij}\) as a realization of a Poisson distribution with the mean for each matrix element given by \(({\sf WH})_{ij}\equiv \Lambda_{ij}\), or \[\begin{equation} \Pr(x_{ij})=\frac{e^{-\Lambda_{ij}}{\Lambda_{ij}}^{x_{ij}}} {\Gamma(1+x_{ij})}. \end{equation}\] The maximum likelihood inference of the latter is then achieved by maximizing \[\begin{equation} L = \sum_{ij} \left(X_{ij} \ln \frac{\Lambda_{ij}}{X_{ij}}- \Lambda_{ij}+X_{ij}\right). \end{equation}\] The Kullback-Leibler measure of the distance between \(\sf X\) and \(\sf \Lambda\), which is minimized, is equal to \(-L\). Lee and Seung’s update rule1 solves this optimization task iteratively.

While also including this classical iterative update algorithm to find basis and coefficient factors of the count matrix, the main workhorse in ccfindR is the variational Bayesian inference algorithm proposed by Cemgil3. Thus the key distinguishing features of ccfindR4 compared to other existing implementations – NMF for generic data5 and NMFEM for single-cell analysis6 – are

  • Bayesian inference allowing for a statistically well-controlled procedure to determine the most likely value of rank \(r\).
  • Procedure to derive hierarchical relationships among clusters identified under different ranks.

In particular, a traditional way (in maximum likelihood inference) to determine the rank is to evaluate the factorization quality measures (and optionally compare with those from randomized data). The Bayesian formulation of NMF algorithm instead incorporates priors for factored matrix elements \(\sf W\) and \(\sf H\) modeled by gamma distributions. Inference can be combined with hyperparameter update to optimize the marginal likelihood (ML; conditional probability of data under hyperparameters and rank), which provides a statistically well-controlled means to determine the optimal rank describing data.

For large rank values, it can be challenging to interpret clusters identified. To facilitate biological interpretation, we provide a procedure where cluster assignment of cells is repeated for multiple rank values, typically ranging from 2 to the optimal rank, and a phylogenetic tree connecting different clusters at neighboring rank values are constructed. This tree gives an overview of different types of cells present in the system viewed at varying resolution.

2 Workflow

We illustrate a typical workflow with a single-cell count data set generated from peripheral blood mononuclear cell (PBMC) data [https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/]. The particular data set used below was created by sampling from five purified immune cell subsets.

2.1 Installation

To install the package, do the following:

BiocManager::install('ccfindR')

After installation, load the package by

library(ccfindR)

2.2 Data input

The input data can be a simple matrix:

# A toy matrix for count data
set.seed(1)
mat <- matrix(rpois(n = 80, lambda = 2), nrow = 4, ncol = 20)
ABC <- LETTERS[1:4]
abc <- letters[1:20]
rownames(mat) <- ABC
colnames(mat) <- abc

The main S4 object containing data and subsequent analysis outcomes is of class scNMFSet, created by

# create scNMFSet object
sc <- scNMFSet(count = mat)

This class extends SingleCellExperiment class, adding extra slots for storing factorization outcomes. In particular, assays, rowData, and colData slots of SingleCellExperiment class are used to store RNA count matrix, gene, and cell annotation data frames, respectively. In the simplest initialization above, the named argument count is used as the count matrix and is equivalent to

# create scNMFSet object
sc <- scNMFSet(assays = list(counts = mat))

See SingleCellExperiment documentations for more details of these main slots. For instance, row and column names can be stored by

# set row and column names
suppressMessages(library(S4Vectors))
genes <- DataFrame(ABC)
rownames(genes) <- ABC
cells <- DataFrame(abc)
rownames(cells) <- abc
sc <- scNMFSet(count = mat, rowData = genes, colData = cells)
sc
## class: scNMFSet 
## dim: 4 20 
## rownames: [1] "A" "B" "C" "D"
## colnames: [1] "a" "b" "c" "d" "e" "f"

Alternatively, sparse matrix format (of class dgCMatrix) can be used. A MatrixMarket format file can be read directly:

# read sparse matrix
dir <- system.file('extdata', package = 'ccfindR')
mat <- Matrix::readMM(paste0(dir,'/matrix.mtx'))
rownames(mat) <- 1:nrow(mat)
colnames(mat) <- 1:ncol(mat)
sc <- scNMFSet(count = mat, rowData = DataFrame(1:nrow(mat)),
               colData = DataFrame(1:ncol(mat)))
sc
## class: scNMFSet 
## dim: 1030 450 
## rownames: [1] "1" "2" "3" "4" "5" "6"
## colnames: [1] "1" "2" "3" "4" "5" "6"

The number of rows in assays$counts and rowData, the number of columns in assays$counts and rows in colData must match.

The gene and barcode meta-data and count files resulting from 10x Genomics’ Cell Ranger pipeline (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) can also be read:

# read 10x files
sc <- read_10x(dir = dir, count = 'matrix.mtx', genes = 'genes.tsv',
               barcodes = 'barcodes.tsv')
sc
## class: scNMFSet 
## dim: 1030 450 
## rownames: [1] "ENSG00000187608" "ENSG00000186891" "ENSG00000127054" "ENSG00000158109"
## [5] "ENSG00000116251" "ENSG00000074800"
## colnames: [1] "ATGCAGTGCTTGGA-1" "CATGTACTCCATGA-1" "GAGAAATGGCAAGG-1"
## [4] "TGATATGACGTTAG-1" "AGTAGGCTCGGGAA-1" "TGACCGCTGTAGCT-1"

The parameter dir is the directory containing the files. Filenames above are the defaults and can be omitted. The function returns an scNMFSet object. By default, any row or column entirely consisting of zeros in counts and the corresponding elements in rowData and colData slots will be removed. This feature can be turned off by remove.zeros = FALSE.

2.3 Quality control

For quality control, cells and genes can be filtered manually using normal subsetting syntax of R: the slots in the object sc are accessed and edited using accessors and sub-setting rules; see SingleCellExperiment:

# slots and subsetting
counts(sc)[1:7,1:3]
## 7 x 3 sparse Matrix of class "dgCMatrix"
##                 ATGCAGTGCTTGGA-1 CATGTACTCCATGA-1 GAGAAATGGCAAGG-1
## ENSG00000187608                .                2                .
## ENSG00000186891                .                .                .
## ENSG00000127054                .                .                .
## ENSG00000158109                .                .                .
## ENSG00000116251                .                3                .
## ENSG00000074800                2                .                .
## ENSG00000162444                .                .                .
head(rowData(sc))
## DataFrame with 6 rows and 2 columns
##                              V1          V2
##                     <character> <character>
## ENSG00000187608 ENSG00000187608       ISG15
## ENSG00000186891 ENSG00000186891    TNFRSF18
## ENSG00000127054 ENSG00000127054      CPSF3L
## ENSG00000158109 ENSG00000158109      TPRG1L
## ENSG00000116251 ENSG00000116251       RPL22
## ENSG00000074800 ENSG00000074800        ENO1
head(colData(sc))
## DataFrame with 6 rows and 1 column
##                                V1
##                       <character>
## ATGCAGTGCTTGGA-1 ATGCAGTGCTTGGA-1
## CATGTACTCCATGA-1 CATGTACTCCATGA-1
## GAGAAATGGCAAGG-1 GAGAAATGGCAAGG-1
## TGATATGACGTTAG-1 TGATATGACGTTAG-1
## AGTAGGCTCGGGAA-1 AGTAGGCTCGGGAA-1
## TGACCGCTGTAGCT-1 TGACCGCTGTAGCT-1
sc2 <- sc[1:20,1:70]       # subsetting of object
sc2 <- remove_zeros(sc2)   # remove empty rows/columns
## 6 empty rows removed
sc2
## class: scNMFSet 
## dim: 14 70 
## rownames: [1] "ENSG00000187608" "ENSG00000186891" "ENSG00000127054" "ENSG00000158109"
## [5] "ENSG00000116251" "ENSG00000074800"
## colnames: [1] "ATGCAGTGCTTGGA-1" "CATGTACTCCATGA-1" "GAGAAATGGCAAGG-1"
## [4] "TGATATGACGTTAG-1" "AGTAGGCTCGGGAA-1" "TGACCGCTGTAGCT-1"

We provide two streamlined functions each for cell and gene filtering, which are illustrated below:

sc <- filter_cells(sc, umi.min = 10^2.6, umi.max = 10^3.4)
Quality control filtering of cells. Histogram of UMI counts is shown. Cells can be selected (red) by setting lower and upper thresholds of the UMI count.

Figure 1: Quality control filtering of cells
Histogram of UMI counts is shown. Cells can be selected (red) by setting lower and upper thresholds of the UMI count.

## 438 cells out of 450 selected
## 21 empty rows removed
markers <- c('CD4','CD8A','CD8B','CD19','CD3G','CD3D',
             'CD3Z','CD14')
sc0 <- filter_genes(sc, markers = markers, vmr.min = 1.5, 
            min.cells.expressed = 50, rescue.genes = FALSE)
## 5 marker genes found
## 297 variable genes out of 1009 
## 299 genes selected