Droplet-based scRNA-seq protocols capture cells in droplets for massively multiplexed library prepation (Klein et al. 2015; Macosko et al. 2015). This greatly increases the throughput of scRNA-seq studies, allowing tens of thousands of individual cells to be profiled in a routine experiment. However, it (again) involves some differences from the previous workflows to reflect some unique aspects of droplet-based data.

Here, we describe a brief analysis of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017).
The data are publicly available from the 10X Genomics website,
from which we download the raw gene/barcode count matrices, i.e., before cell calling from the *CellRanger* pipeline.

```
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir="pbmc4k")
```

We load in the raw count matrix using the `read10xCounts()`

function from the *DropletUtils* package.
This will create a `SingleCellExperiment`

object where each column corresponds to a cell barcode.

```
library(DropletUtils)
fname <- "pbmc4k/raw_gene_bc_matrices/GRCh38"
sce <- read10xCounts(fname, col.names=TRUE)
sce
```

```
## class: SingleCellExperiment
## dim: 33694 737280
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
```

Here, each count represents the number of unique molecular identifiers (UMIs) assigned to a gene for a cell barcode.
Note that the counts are loaded as a sparse matrix object - specifically, a `dgCMatrix`

instance from the *Matrix* package.
This avoids allocating memory to hold zero counts, which is highly memory-efficient for low-coverage scRNA-seq data.

`class(counts(sce))`

```
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
```

We relabel the rows with the gene symbols for easier reading.
This is done using the `uniquifyFeatureNames()`

function, which ensures uniqueness in the case of duplicated or missing symbols.

```
library(scater)
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
head(rownames(sce))
```

```
## [1] "RP11-34P13.3" "FAM138A" "OR4F5" "RP11-34P13.7"
## [5] "RP11-34P13.8" "RP11-34P13.14"
```

We also identify the chromosomal location for each gene. The mitochondrial location is particularly useful for later quality control.

```
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce)$ID,
column="SEQNAME", keytype="GENEID")
rowData(sce)$CHR <- location
summary(location=="MT")
```

```
## Mode FALSE TRUE NA's
## logical 33537 13 144
```

An interesting aspect of droplet-based data is that we have no prior knowledge about which droplets (i.e., cell barcodes) actually contain cells, and which are empty. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced. The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure 1), probably corresponding to cell-containing and empty droplets respectively.

```
bcrank <- barcodeRanks(counts(sce))
# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex.lab=1.2)
abline(h=bcrank$inflection, col="darkgreen", lty=2)
abline(h=bcrank$knee, col="dodgerblue", lty=2)
legend("bottomleft", legend=c("Inflection", "Knee"),
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
```

We use the `emptyDrops()`

function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2018).
Any significant deviation indicates that the barcode corresponds to a cell-containing droplet.
We call cells at a false discovery rate (FDR) of 1%, meaning that no more than 1% of our called barcodes should be empty droplets on average.

```
set.seed(100)
e.out <- emptyDrops(counts(sce))
sum(e.out$FDR <= 0.01, na.rm=TRUE)
```

`## [1] 4358`

We then subset our `SingleCellExperiment`

object to retain only the detected cells.

```
# using which() to automatically remove NAs.
sce <- sce[,which(e.out$FDR <= 0.01)]
```

**Comments from Aaron:**

`emptyDrops()`

computes Monte Carlo*p*-values based on a Dirichlet-multinomial model of sampling molecules into droplets. These*p*-values are stochastic so it is important to set the random seed to obtain reproducible results.- The function assumes that cell barcodes with total UMI counts below a certain threshold (
`lower=100`

by default) correspond to empty droplets, and uses them to estimate the ambient expression profile. By definition, these barcodes cannot be cell-containing droplets and are excluded from the hypothesis testing, hence the`NA`

s in the output. - Users wanting to use the cell calling algorithm from the
*CellRanger*pipeline can call`defaultDrops()`

instead. This tends to be quite conservative as it often discards genuine cells with low RNA content (and thus low total counts). It also requires an estimate of the expected number of cells in the experiment.

The number of Monte Carlo iterations (specified by the `niters`

argument in `emptyDrops()`

) determines the lower bound for the _p_values (Phipson and Smyth 2010).
The `Limited`

field in the output indicates whether or not the computed *p*-value for a particular barcode is bounded by the number of iterations.
If any non-significant barcodes are `TRUE`

for `Limited`

, we may need to increase the number of iterations.
A larger number of iterations will often result in a lower *p*-value for these barcodes, which may allow them to be detected after correcting for multiple testing.

`table(Sig=e.out$FDR <= 0.01, Limited=e.out$Limited)`

```
## Limited
## Sig FALSE TRUE
## FALSE 931 0
## TRUE 1745 2613
```

As mentioned above, `emptyDrops()`

assumes that barcodes with low total UMI counts are empty droplets.
Thus, the null hypothesis should be true for all of these barcodes.
We can check whether the hypothesis test holds its size by examining the distribution of *p*-values for low-total barcodes.
Ideally, the distribution should be close to uniform.

```
full.data <- read10xCounts(fname, col.names=TRUE)
set.seed(100)
limit <- 100
all.out <- emptyDrops(counts(full.data), lower=limit, test.ambient=TRUE)
hist(all.out$PValue[all.out$Total <= limit & all.out$Total > 0],
xlab="P-value", main="", col="grey80")
```

Large peaks near zero indicate that barcodes with total counts below `lower`

are not all ambient in origin.
This can be resolved by decreasing `lower`

further to exclude barcodes corresponding to droplets with very small cells.

The previous step only distinguishes cells from empty droplets, but makes no statement about the quality of the cells.
It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis.
We compute some QC metrics using `calculateQCMetrics()`

(McCarthy et al. 2017) and examine their distributions in Figure 3.

```
sce <- calculateQCMetrics(sce, feature_controls=list(Mito=which(location=="MT")))
par(mfrow=c(1,3))
hist(sce$log10_total_counts, breaks=20, col="grey80",
xlab="Log-total UMI count")
hist(sce$log10_total_features_by_counts, breaks=20, col="grey80",
xlab="Log-total number of expressed features")
hist(sce$pct_counts_Mito, breaks=20, col="grey80",
xlab="Proportion of reads in mitochondrial genes")
```