ccfindR 1.6.0

The `ccfindR`

(Cancer Clone findeR) package^{1} contains
implementations and utilities for analyzing single-cell
RNA-sequencing data, including quality control,
unsupervised clustering for discovery of cell types,
and visualization of the outcomes. It is especially
suitable for analysis of transcript-count data utilizing
unique molecular identifiers (UMIs), e.g., data derived
from 10x Genomics platform. In these data sets, RNA counts
are non-negative integers, enabling clustering using non-negative
matrix factorization (NMF)^{2}.

Input data are UMI counts in the form of a matrix with each genetic
feature (“genes”) in rows and cells (tagged by barcodes) in columns,
produced by read alignment and counting pipelines. The count matrix and
associated gene and cell annotation files are bundled into a main
object of class `scNMFSet`

, which extends the
`SingleCellExperiment`

class [http://dx.doi.org/10.18129/B9.bioc.SingleCellExperiment)].
Quality control for both cells and genes can be performed via filtering
steps based on UMI counts and variance of expressions, respectively.
The NMF factorization is first performed for multiple values of
ranks (the reduced dimension of factorization) to find the most
likely value. A production run for the chosen rank then leads to
factor matrices, allowing the user to identify and visualize genes
representative of clusters and assign cells into clusters.

The NMF approach offers a means to identify cell subtypes and classify
individual cells into these clusters based on clustering using
expression counts. In contrast to alternatives such as principal
component analyses^{3}, NMF leverages the non-negative nature
of count data and factorizes the data matrix \(\sf X\) into two
factor matrices
\(\sf W\) and \(\sf H\)^{2}:

If \(\sf X\) is a \(p\times n\) matrix (\(p\) genes and \(n\) cells), the basis matrix \(\sf W\) is \(p \times r\) and coefficient matrix \(\sf H\) is \(r \times n\) in dimension, respectively, where the rank \(r\) is a relatively small integer. A statistical inference-based interpretation of NMF is to view \(X_{ij}\) as a realization of a Poisson distribution with the mean for each matrix element given by \(({\sf WH})_{ij}\equiv \Lambda_{ij}\), or

\[\begin{equation} \Pr(x_{ij})=\frac{e^{-\Lambda_{ij}}{\Lambda_{ij}}^{x_{ij}}} {\Gamma(1+x_{ij})}. \end{equation}\]The maximum likelihood inference of the latter is then achieved by maximizing

\[\begin{equation} L = \sum_{ij} \left(X_{ij} \ln \frac{\Lambda_{ij}}{X_{ij}}- \Lambda_{ij}+X_{ij}\right). \end{equation}\]The Kullback-Leibler measure of the distance between \(\sf X\) and
\(\sf \Lambda\),
which is minimized, is equal to \(-L\).
Lee and Seung’s update rule^{2} solves this optimization task
iteratively.

While also including this classical iterative update
algorithm to find basis and coefficient factors of the count matrix, the
main workhorse in `ccfindR`

is the variational Bayesian inference
algorithm proposed by Cemgil^{4}. Thus the key distinguishing
features of `ccfindR`

^{1} compared to other existing implementations – `NMF`

for generic data^{5} and `NMFEM`

for single-cell analysis^{6} –
are

- Bayesian inference allowing for a statistically well-controlled procedure to determine the most likely value of rank \(r\).
- Procedure to derive hierarchical relationships among clusters identified under different ranks.

In particular, a traditional way (in maximum likelihood inference) to determine the rank is to evaluate the factorization quality measures (and optionally compare with those from randomized data). The Bayesian formulation of NMF algorithm instead incorporates priors for factored matrix elements \(\sf W\) and \(\sf H\) modeled by gamma distributions. Inference can be combined with hyperparameter update to optimize the marginal likelihood (ML; conditional probability of data under hyperparameters and rank), which provides a statistically well-controlled means to determine the optimal rank describing data.

For large rank values, it can be challenging to interpret clusters identified. To facilitate biological interpretation, we provide a procedure where cluster assignment of cells is repeated for multiple rank values, typically ranging from 2 to the optimal rank, and a phylogenetic tree connecting different clusters at neighboring rank values are constructed. This tree gives an overview of different types of cells present in the system viewed at varying resolution.

We illustrate a typical workflow with a single-cell count data set generated from peripheral blood mononuclear cell (PBMC) data [https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/]. The particular data set used below was created by sampling from five purified immune cell subsets.

To install the package, do the following:

`BiocManager::install('ccfindR')`

After installation, load the package by

`library(ccfindR)`

The input data can be a simple matrix:

```
# A toy matrix for count data
set.seed(1)
mat <- matrix(rpois(n = 80, lambda = 2), nrow = 4, ncol = 20)
ABC <- LETTERS[1:4]
abc <- letters[1:20]
rownames(mat) <- ABC
colnames(mat) <- abc
```

The main `S4`

object containing data and subsequent analysis outcomes is of
class `scNMFSet`

, created by

```
# create scNMFSet object
sc <- scNMFSet(count = mat)
```

This class extends `SingleCellExperiment`

class,
adding extra slots for storing factorization outcomes.
In particular, `assays`

, `rowData`

, and `colData`

slots of
`SingleCellExperiment`

class are used to store RNA count matrix,
gene, and cell annotation data frames, respectively.
In the simplest initialization above, the named argument `count`

is
used as the count matrix and is equivalent to

```
# create scNMFSet object
sc <- scNMFSet(assays = list(counts = mat))
```

See `SingleCellExperiment`

documentations for more details of these
main slots. For instance, row and column names can be stored by

```
# set row and column names
suppressMessages(library(S4Vectors))
genes <- DataFrame(ABC)
rownames(genes) <- ABC
cells <- DataFrame(abc)
rownames(cells) <- abc
sc <- scNMFSet(count = mat, rowData = genes, colData = cells)
sc
```

```
## class: scNMFSet
## dim: 4 20
## rownames: [1] "A" "B" "C" "D"
## colnames: [1] "a" "b" "c" "d" "e" "f"
```

Alternatively, sparse matrix format (of class `dgCMatrix`

) can be used.
A `MatrixMarket`

format file can be read directly:

```
# read sparse matrix
dir <- system.file('extdata', package = 'ccfindR')
mat <- Matrix::readMM(paste0(dir,'/matrix.mtx'))
rownames(mat) <- 1:nrow(mat)
colnames(mat) <- 1:ncol(mat)
sc <- scNMFSet(count = mat, rowData = DataFrame(1:nrow(mat)),
colData = DataFrame(1:ncol(mat)))
sc
```

```
## class: scNMFSet
## dim: 1030 450
## rownames: [1] "1" "2" "3" "4" "5" "6"
## colnames: [1] "1" "2" "3" "4" "5" "6"
```

The number of rows in `assays$counts`

and `rowData`

, the number of columns
in `assays$counts`

and rows in `colData`

must match.

The gene and barcode meta-data and count files resulting from 10x Genomics’ Cell Ranger pipeline (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) can also be read:

```
# read 10x files
sc <- read_10x(dir = dir, count = 'matrix.mtx', genes = 'genes.tsv',
barcodes = 'barcodes.tsv')
sc
```

```
## class: scNMFSet
## dim: 1030 450
## rownames: [1] "ENSG00000187608" "ENSG00000186891" "ENSG00000127054" "ENSG00000158109"
## [5] "ENSG00000116251" "ENSG00000074800"
## colnames: [1] "ATGCAGTGCTTGGA-1" "CATGTACTCCATGA-1" "GAGAAATGGCAAGG-1"
## [4] "TGATATGACGTTAG-1" "AGTAGGCTCGGGAA-1" "TGACCGCTGTAGCT-1"
```

The parameter `dir`

is the directory containing the files.
Filenames above are the defaults and can be omitted.
The function returns an `scNMFSet`

object.
By default, any row or column entirely consisting of zeros in `counts`

and the
corresponding elements in `rowData`

and `colData`

slots will be removed.
This feature can be turned off by `remove.zeros = FALSE`

.

For quality control, cells and genes can be filtered manually using normal
subsetting syntax of `R`

: the slots in the object `sc`

are accessed and
edited using accessors and sub-setting rules;
see
`SingleCellExperiment`

:

```
# slots and subsetting
counts(sc)[1:7,1:3]
```

```
## 7 x 3 sparse Matrix of class "dgCMatrix"
## ATGCAGTGCTTGGA-1 CATGTACTCCATGA-1 GAGAAATGGCAAGG-1
## ENSG00000187608 . 2 .
## ENSG00000186891 . . .
## ENSG00000127054 . . .
## ENSG00000158109 . . .
## ENSG00000116251 . 3 .
## ENSG00000074800 2 . .
## ENSG00000162444 . . .
```

`head(rowData(sc))`

```
## DataFrame with 6 rows and 2 columns
## V1 V2
## <character> <character>
## ENSG00000187608 ENSG00000187608 ISG15
## ENSG00000186891 ENSG00000186891 TNFRSF18
## ENSG00000127054 ENSG00000127054 CPSF3L
## ENSG00000158109 ENSG00000158109 TPRG1L
## ENSG00000116251 ENSG00000116251 RPL22
## ENSG00000074800 ENSG00000074800 ENO1
```

`head(colData(sc))`

```
## DataFrame with 6 rows and 1 column
## V1
## <character>
## ATGCAGTGCTTGGA-1 ATGCAGTGCTTGGA-1
## CATGTACTCCATGA-1 CATGTACTCCATGA-1
## GAGAAATGGCAAGG-1 GAGAAATGGCAAGG-1
## TGATATGACGTTAG-1 TGATATGACGTTAG-1
## AGTAGGCTCGGGAA-1 AGTAGGCTCGGGAA-1
## TGACCGCTGTAGCT-1 TGACCGCTGTAGCT-1
```

```
sc2 <- sc[1:20,1:70] # subsetting of object
sc2 <- remove_zeros(sc2) # remove empty rows/columns
```

`## 6 empty rows removed`

`sc2`

```
## class: scNMFSet
## dim: 14 70
## rownames: [1] "ENSG00000187608" "ENSG00000186891" "ENSG00000127054" "ENSG00000158109"
## [5] "ENSG00000116251" "ENSG00000074800"
## colnames: [1] "ATGCAGTGCTTGGA-1" "CATGTACTCCATGA-1" "GAGAAATGGCAAGG-1"
## [4] "TGATATGACGTTAG-1" "AGTAGGCTCGGGAA-1" "TGACCGCTGTAGCT-1"
```

We provide two streamlined functions each for cell and gene filtering, which are illustrated below:

`sc <- filter_cells(sc, umi.min = 10^2.6, umi.max = 10^3.4)`

```
## 438 cells out of 450 selected
## 21 empty rows removed
```

```
markers <- c('CD4','CD8A','CD8B','CD19','CD3G','CD3D',
'CD3Z','CD14')
sc0 <- filter_genes(sc, markers = markers, vmr.min = 1.5,
min.cells.expressed = 50, rescue.genes = FALSE)
```

```
## 5 marker genes found
## 297 variable genes out of 1009
## 299 genes selected
```