1 Overview

Droplet-based scRNA-seq protocols capture cells in droplets for massively multiplexed library prepation (Klein et al. 2015; Macosko et al. 2015). This greatly increases the throughput of scRNA-seq studies, allowing tens of thousands of individual cells to be profiled in a routine experiment. However, it (again) involves some differences from the previous workflows to reflect some unique aspects of droplet-based data.

Here, we describe a brief analysis of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). The data are publicly available from the 10X Genomics website, from which we download the raw gene/barcode count matrices, i.e., before cell calling from the CellRanger pipeline.

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
    "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir="pbmc4k")

2 Setting up the data

2.1 Reading in a sparse matrix

We load in the raw count matrix using the read10xCounts() function from the DropletUtils package. This will create a SingleCellExperiment object where each column corresponds to a cell barcode.

library(DropletUtils)
fname <- "pbmc4k/raw_gene_bc_matrices/GRCh38"
sce <- read10xCounts(fname, col.names=TRUE)
sce
## class: SingleCellExperiment 
## dim: 33694 737280 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
##   TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):

Here, each count represents the number of unique molecular identifiers (UMIs) assigned to a gene for a cell barcode. Note that the counts are loaded as a sparse matrix object - specifically, a dgCMatrix instance from the Matrix package. This avoids allocating memory to hold zero counts, which is highly memory-efficient for low-coverage scRNA-seq data.

class(counts(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

2.2 Annotating the rows

We relabel the rows with the gene symbols for easier reading. This is done using the uniquifyFeatureNames() function, which ensures uniqueness in the case of duplicated or missing symbols.

library(scater)
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
head(rownames(sce))
## [1] "RP11-34P13.3"  "FAM138A"       "OR4F5"         "RP11-34P13.7" 
## [5] "RP11-34P13.8"  "RP11-34P13.14"

We also identify the chromosomal location for each gene. The mitochondrial location is particularly useful for later quality control.

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce)$ID, 
    column="SEQNAME", keytype="GENEID")
rowData(sce)$CHR <- location
summary(location=="MT")
##    Mode   FALSE    TRUE    NA's 
## logical   33537      13     144

3 Calling cells from empty droplets

3.1 Testing for deviations from ambient expression

An interesting aspect of droplet-based data is that we have no prior knowledge about which droplets (i.e., cell barcodes) actually contain cells, and which are empty. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced. The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure 1), probably corresponding to cell-containing and empty droplets respectively.

bcrank <- barcodeRanks(counts(sce))

# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
    xlab="Rank", ylab="Total UMI count", cex.lab=1.2)

abline(h=bcrank$inflection, col="darkgreen", lty=2)
abline(h=bcrank$knee, col="dodgerblue", lty=2)

legend("bottomleft", legend=c("Inflection", "Knee"), 
    col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown.

Figure 1: Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown.

We use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2018). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. We call cells at a false discovery rate (FDR) of 1%, meaning that no more than 1% of our called barcodes should be empty droplets on average.

set.seed(100)
e.out <- emptyDrops(counts(sce))
sum(e.out$FDR <= 0.01, na.rm=TRUE)
## [1] 4358

We then subset our SingleCellExperiment object to retain only the detected cells.

# using which() to automatically remove NAs.
sce <- sce[,which(e.out$FDR <= 0.01)]

Comments from Aaron:

  • emptyDrops() computes Monte Carlo p-values based on a Dirichlet-multinomial model of sampling molecules into droplets. These p-values are stochastic so it is important to set the random seed to obtain reproducible results.
  • The function assumes that cell barcodes with total UMI counts below a certain threshold (lower=100 by default) correspond to empty droplets, and uses them to estimate the ambient expression profile. By definition, these barcodes cannot be cell-containing droplets and are excluded from the hypothesis testing, hence the NAs in the output.
  • Users wanting to use the cell calling algorithm from the CellRanger pipeline can call defaultDrops() instead. This tends to be quite conservative as it often discards genuine cells with low RNA content (and thus low total counts). It also requires an estimate of the expected number of cells in the experiment.

3.2 Examining cell-calling diagnostics

The number of Monte Carlo iterations (specified by the niters argument in emptyDrops()) determines the lower bound for the _p_values (Phipson and Smyth 2010). The Limited field in the output indicates whether or not the computed p-value for a particular barcode is bounded by the number of iterations. If any non-significant barcodes are TRUE for Limited, we may need to increase the number of iterations. A larger number of iterations will often result in a lower p-value for these barcodes, which may allow them to be detected after correcting for multiple testing.

table(Sig=e.out$FDR <= 0.01, Limited=e.out$Limited)
##        Limited
## Sig     FALSE TRUE
##   FALSE   931    0
##   TRUE   1745 2613

As mentioned above, emptyDrops() assumes that barcodes with low total UMI counts are empty droplets. Thus, the null hypothesis should be true for all of these barcodes. We can check whether the hypothesis test holds its size by examining the distribution of p-values for low-total barcodes. Ideally, the distribution should be close to uniform.

full.data <- read10xCounts(fname, col.names=TRUE)
set.seed(100)
limit <- 100   
all.out <- emptyDrops(counts(full.data), lower=limit, test.ambient=TRUE)
hist(all.out$PValue[all.out$Total <= limit & all.out$Total > 0],
    xlab="P-value", main="", col="grey80") 
Distribution of p-values for the assumed empty droplets.

Figure 2: Distribution of p-values for the assumed empty droplets.

Large peaks near zero indicate that barcodes with total counts below lower are not all ambient in origin. This can be resolved by decreasing lower further to exclude barcodes corresponding to droplets with very small cells.

4 Quality control on the cells

The previous step only distinguishes cells from empty droplets, but makes no statement about the quality of the cells. It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis. We compute some QC metrics using calculateQCMetrics() (McCarthy et al. 2017) and examine their distributions in Figure 3.

sce <- calculateQCMetrics(sce, feature_controls=list(Mito=which(location=="MT")))
par(mfrow=c(1,3))
hist(sce$log10_total_counts, breaks=20, col="grey80",
    xlab="Log-total UMI count")
hist(sce$log10_total_features_by_counts, breaks=20, col="grey80",
    xlab="Log-total number of expressed features")
hist(sce$pct_counts_Mito, breaks=20, col="grey80",
    xlab="Proportion of reads in mitochondrial genes")