1 Introduction

This document introduces the quality control functionality of the scater package. scater contains tools to help with the analysis of single-cell transcriptomic data, focusing on low-level steps such as quality control, normalization and visualization. It is based on the SingleCellExperiment class (from the SingleCellExperiment package), and is therefore interoperable with many other Bioconductor packages such as scran, batchelor and iSEE.

Note: A more comprehensive description of the use of scater (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.

2 Setting up the data

2.1 Generating a SingleCellExperiment object

We assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.). First, we create a SingleCellExperiment object containing the data, as demonstrated below with a famous brain dataset. Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell 'omics data.

library(scRNAseq)
example_sce <- ZeiselBrainData()
example_sce
## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(2): ERCC repeat
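
If you are starting from your own count matrix rather than a packaged dataset, the object can be built directly with the SingleCellExperiment() constructor. The following is a minimal sketch on a made-up matrix; the names toy_counts, toy_sce and the batch column are hypothetical and only serve the illustration.

```r
library(SingleCellExperiment)

# Toy count matrix: 5 features (rows) by 4 cells (columns). Values are simulated.
set.seed(100)
toy_counts <- matrix(rpois(20, lambda = 5), nrow = 5,
    dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4)))

# Store the matrix under the conventional assay name "counts",
# with one row of column metadata per cell.
toy_sce <- SingleCellExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(batch = c("a", "a", "b", "b"))
)

dim(toy_sce)
assayNames(toy_sce)
```

Naming the assay "counts" is what allows the counts() accessor to find it later.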

We usually expect (raw) count data to be labelled as "counts" in the assays, which can be easily retrieved with the counts accessor. Getters and setters are also provided for exprs, tpm, cpm, fpkm and versions of these with the prefix norm_.

str(counts(example_sce))

Row and column-level metadata are easily accessed (or modified) as shown below. There are also dedicated getters and setters for size factor values (sizeFactors()); reduced dimensionality results (reducedDim()); and alternative experimental features (altExp()).

example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE)
colData(example_sce)
## DataFrame with 3005 rows and 11 columns
##                        tissue   group # total mRNA mol      well       sex
##                   <character> <numeric>      <numeric> <numeric> <numeric>
## 1772071015_C02       sscortex         1           1221         3         3
## 1772071017_G12       sscortex         1           1231        95         1
## 1772071017_A05       sscortex         1           1652        27         1
## 1772071014_B06       sscortex         1           1696        37         3
## 1772067065_H06       sscortex         1           1219        43         3
## ...                       ...       ...            ...       ...       ...
## 1772067059_B04 ca1hippocampus         9           1997        19         1
## 1772066097_D04 ca1hippocampus         9           1415        21         1
## 1772063068_D01       sscortex         9           1876        34         3
## 1772066098_A12 ca1hippocampus         9           1546        88         1
## 1772058148_F03       sscortex         9           1970        15         3
##                      age  diameter        cell_id       level1class level2class
##                <numeric> <numeric>    <character>       <character> <character>
## 1772071015_C02         2         1 1772071015_C02      interneurons       Int10
## 1772071017_G12         1       353 1772071017_G12      interneurons       Int10
## 1772071017_A05         1        13 1772071017_A05      interneurons        Int6
## 1772071014_B06         2        19 1772071014_B06      interneurons       Int10
## 1772067065_H06         6        12 1772067065_H06      interneurons        Int9
## ...                  ...       ...            ...               ...         ...
## 1772067059_B04         4       382 1772067059_B04 endothelial-mural       Peric
## 1772066097_D04         7        12 1772066097_D04 endothelial-mural        Vsmc
## 1772063068_D01         7       268 1772063068_D01 endothelial-mural        Vsmc
## 1772066098_A12         7       324 1772066098_A12 endothelial-mural        Vsmc
## 1772058148_F03         7         6 1772058148_F03 endothelial-mural        Vsmc
##                       whee
##                <character>
## 1772071015_C02           F
## 1772071017_G12           A
## 1772071017_A05           H
## 1772071014_B06           X
## 1772067065_H06           X
## ...                    ...
## 1772067059_B04           T
## 1772066097_D04           H
## 1772063068_D01           K
## 1772066098_A12           E
## 1772058148_F03           Y
rowData(example_sce)$stuff <- runif(nrow(example_sce))
rowData(example_sce)
## DataFrame with 20006 rows and 2 columns
##          featureType             stuff
##          <character>         <numeric>
## Tspan12   endogenous 0.531340830726549
## Tshz1     endogenous 0.245747287524864
## Fnbp1l    endogenous 0.841682275990024
## Adamts15  endogenous  0.47632492124103
## Cldn12    endogenous 0.631566006690264
## ...              ...               ...
## mt-Co2          mito 0.542126515181735
## mt-Co1          mito 0.915390015114099
## mt-Rnr2         mito 0.665483738295734
## mt-Rnr1         mito 0.612728938227519
## mt-Nd4l         mito 0.610844046343118

Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner. More details about the SingleCellExperiment class can be found in the documentation for the SingleCellExperiment package.
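
To illustrate what synchronized subsetting means in practice, here is a small sketch on a hypothetical toy object (toy_sce and its tissue and featureType values are invented for the example). A single bracket operation trims the assay, the row metadata and the column metadata together.

```r
library(SingleCellExperiment)

# Toy object: 3 features by 4 cells, with made-up metadata.
toy_counts <- matrix(1:12, nrow = 3,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))
toy_sce <- SingleCellExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(
        tissue = c("cortex", "hippocampus", "cortex", "hippocampus"))
)
rowData(toy_sce)$featureType <- c("endogenous", "endogenous", "mito")

# Keep the first two features and only the cortex cells.
sub_sce <- toy_sce[1:2, toy_sce$tissue == "cortex"]

dim(counts(sub_sce))          # assay is subsetted
sub_sce$tissue                # column metadata follows the columns
rowData(sub_sce)$featureType  # row metadata follows the rows
```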

2.2 Other methods of data import

Count matrices stored as CSV files or equivalent can be easily read into an R session using read.table() from utils or fread() from the data.table package. It is advisable to coerce the resulting object into a matrix before storing it in a SingleCellExperiment object.
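
A minimal sketch of the read.table() route is shown here. It assumes a comma-separated file whose first column holds feature names and whose header holds cell names; the file contents and the names csv_path, counts_df and counts_mat are made up for the example.

```r
# Write a tiny made-up count table to a temporary CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
    "gene,cell1,cell2,cell3",
    "geneA,0,3,1",
    "geneB,5,0,0",
    "geneC,2,2,7"
), csv_path)

# Read the file, using the first column as row names.
counts_df <- read.table(csv_path, header = TRUE, sep = ",", row.names = 1,
    check.names = FALSE)

# Coerce the data frame into a matrix and make sure it is stored as integers.
counts_mat <- as.matrix(counts_df)
storage.mode(counts_mat) <- "integer"

str(counts_mat)
```

The resulting counts_mat can then be supplied as the "counts" assay when constructing a SingleCellExperiment object.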

For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the Matrix package. This is performed using the readSparseCounts() function and reduces memory usage by not explicitly storing zeroes in memory.
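
The idea of chunk-by-chunk reading with progressive sparse coercion can be sketched with base R and the Matrix package. This is an illustration of the general approach only, not the implementation of readSparseCounts(); the file contents, chunk_size and all variable names are hypothetical.

```r
library(Matrix)

# Tiny made-up count table written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
    "gene,cell1,cell2,cell3",
    "geneA,0,3,1",
    "geneB,0,0,0",
    "geneC,2,0,7",
    "geneD,0,0,4"
), csv_path)

con <- file(csv_path, open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]

chunk_size <- 2  # rows per chunk; deliberately tiny for illustration
chunks <- list()
repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # Parse only this chunk, then convert it to a sparse matrix straight away
    # so that the dense version of the full table never exists in memory.
    chunk <- read.table(text = lines, sep = ",", row.names = 1,
        col.names = header)
    chunks[[length(chunks) + 1]] <- Matrix(as.matrix(chunk), sparse = TRUE)
}
close(con)

# Stack the sparse chunks into one sparse matrix.
sparse_counts <- do.call(rbind, chunks)
sparse_counts
```

Only the non-zero entries are stored in sparse_counts, which is where the memory saving comes from for typical single-cell count data.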

Data from 10X Genomics experiments can be read in using the read10xCounts() function from the DropletUtils package. This will automatically generate a SingleCellExperiment with a sparse matrix; see the documentation for more details.

Transcript abundances from the kallisto and Salmon pseudo-aligners can be imported using methods from the tximeta package. This produces a SummarizedExperiment object that can be coerced into a SingleCellExperiment simply with as(se, "SingleCellExperiment").
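
The coercion step itself can be tried on any SummarizedExperiment. In the sketch below, a small hand-made object (toy_se, with invented transcript and cell names) stands in for the output of tximeta, which is not run here.

```r
library(SingleCellExperiment)

# Hand-made SummarizedExperiment standing in for imported abundances.
toy_se <- SummarizedExperiment(
    assays = list(counts = matrix(1:6, nrow = 3,
        dimnames = list(paste0("tx", 1:3), c("cell1", "cell2"))))
)

# Coerce to a SingleCellExperiment; assays and metadata are carried over.
toy_sce <- as(toy_se, "SingleCellExperiment")
class(toy_sce)
```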

3 Quality control

3.1 Background

scater provides functionality for three levels of quality control (QC):

  1. QC and filtering of cells
  2. QC and filtering of features (genes)
  3. QC of experimental variables

3.2 Cell-level QC

3.2.1 Definition of metrics

Cell-level metrics are computed by the perCellQCMetrics() function and include:

  • sum: total number of counts for the cell (i.e., the library size).
  • detected: the number of features for the cell that have counts above the detection limit (default of zero).
  • subsets_X_percent: percentage of all counts that come from the feature control set named X.

library(scater)
per.cell <- perCellQCMetrics(example_sce, 
    subsets=list(Mito=grep("mt-", rownames(example_sce))))
summary(per.cell$sum)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2574    8130   12913   14954   19284   63505
summary(per.cell$detected)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     785    2484    3656    3777    4929    8167
summary(per.cell$subsets_Mito_percent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.992   6.653   7.956  10.290  56.955
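
To make the definitions of these metrics concrete, they can be reproduced by hand with base R on a toy matrix. This is a sketch of what the quantities mean, assuming a detection limit of zero; it is not how perCellQCMetrics() is implemented, and the matrix and variable names are made up.

```r
# Toy counts: 4 features (two of them mitochondrial) by 3 cells.
toy_counts <- matrix(
    c( 8, 12, 1,
       0,  4, 0,
       1,  4, 0,
       1,  0, 1),
    nrow = 4, byrow = TRUE,
    dimnames = list(c("geneA", "geneB", "mt-Co1", "mt-Nd1"),
                    c("cell1", "cell2", "cell3"))
)
is_mito <- grepl("^mt-", rownames(toy_counts))

# sum: total counts per cell (library size).
lib_size <- colSums(toy_counts)

# detected: number of features with counts above the detection limit of zero.
n_detected <- colSums(toy_counts > 0)

# subsets_Mito_percent: share of each cell's counts that come from the Mito subset.
mito_percent <- colSums(toy_counts[is_mito, , drop = FALSE]) / lib_size * 100

data.frame(sum = lib_size, detected = n_detected,
    subsets_Mito_percent = mito_percent)
```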

It is often convenient to store these metrics in the colData() of our SingleCellExperiment object for future reference. (In fact, the addPerCellQC() function will do this automatically.)

colData(example_sce) <- cbind(colData(example_sce), per.cell)
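
For comparison, the addPerCellQC() route might look like the following sketch. It uses a simulated toy object rather than example_sce so that it can be run independently; toy_counts and toy_sce are hypothetical names.

```r
library(scater)

# Simulated toy counts: 8 features (two mitochondrial) by 5 cells.
set.seed(1)
toy_counts <- matrix(rpois(40, lambda = 3), nrow = 8,
    dimnames = list(c(paste0("gene", 1:6), "mt-Co1", "mt-Nd1"),
                    paste0("cell", 1:5)))
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Compute the per-cell metrics and append them to colData() in one step.
toy_sce <- addPerCellQC(toy_sce,
    subsets = list(Mito = grep("^mt-", rownames(toy_sce))))

colnames(colData(toy_sce))
```

This avoids the manual cbind() and keeps the computation and storage of the metrics in a single call.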

3.2.2 Diagnostic plots

Metadata variables can be plotted against each other using the plotColData() function, as shown below. We expect to see an increasing number of detected genes with increasing total count. Each point represents a cell that is coloured according to its tissue of origin.

plotColData(example_sce, x="sum", y="detected", colour_by="tissue")