1 Introduction

As the scale of single cell/nucleus RNA-seq has increased, so has the complexity of study designs. Analysis of datasets with simple study designs can be performed using linear model as in the muscat package. Yet analysis of datsets with complex study designs such as repeated measures or many technical batches can benefit from linear mixed model analysis to model to correlation structure between samples. We previously developed dream to apply linear mixed models to bulk RNA-seq data using a limma-style workflow. Dreamlet extends the previous work of dream and muscat to apply linear mixed models to pseudobulk data. Dreamlet also supports linear models and facilitates application of 1) variancePartition to quantify the contribution of multiple variables to expression variation, and 2) zenith to perform gene set analysis on the differential expression signatures.

2 Installation

To install this package, start R (version “4.3”) and enter:

if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Select release #1 or #2

# 1) Bioconductor release
BiocManager::install("dreamlet")

# 2) Latest stable release
devtools::install_github("DiseaseNeurogenomics/dreamlet")

3 Process single cell count data

Here we perform analysis of PBMCs from 8 individuals stimulated with interferon-β (Kang, et al, 2018, Nature Biotech). This is a small dataset that does not have repeated measures or high dimensional batch effects, so the sophisticated features of dreamlet are not strictly necessary. But this gives us an opportunity to walk through a standard dreamlet workflow.

3.1 Preprocess data

Here, single cell RNA-seq data is downloaded from ExperimentHub.

library(dreamlet)
library(muscat)
library(ExperimentHub)
library(zenith)
library(scater)

# Download data, specifying EH2259 for the Kang, et al study
eh <- ExperimentHub()
sce <- eh[["EH2259"]]

# only keep singlet cells with sufficient reads
sce <- sce[rowSums(counts(sce) > 0) > 0, ]
sce <- sce[, colData(sce)$multiplets == "singlet"]

# compute QC metrics
qc <- perCellQCMetrics(sce)

# remove cells with few or many detected genes
ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
sce <- sce[, !ol]

# set variable indicating stimulated (stim) or control (ctrl)
sce$StimStatus <- sce$stim

3.2 Aggregate to pseudobulk

Dreamlet, like muscat, performs analysis at the pseudobulk-level by summing raw counts across cells for a given sample and cell type. aggregateToPseudoBulk is substantially faster for large on-disk datasets than muscat::aggregateData.

# Since 'ind' is the individual and 'StimStatus' is the stimulus status,
# create unique identifier for each sample
sce$id <- paste0(sce$StimStatus, sce$ind)

# Create pseudobulk data by specifying cluster_id and sample_id
# Count data for each cell type is then stored in the `assay` field
# assay: entry in assayNames(sce) storing raw counts
# cluster_id: variable in colData(sce) indicating cell clusters
# sample_id: variable in colData(sce) indicating sample id for aggregating cells
pb <- aggregateToPseudoBulk(sce,
  assay = "counts",
  cluster_id = "cell",
  sample_id = "id",
  verbose = FALSE
)

# one 'assay' per cell type
assayNames(pb)
## [1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
## [4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
## [7] "Megakaryocytes"    "NK cells"

4 Voom for pseudobulk

Apply voom-style normalization for pseudobulk counts within each cell cluster using voomWithDreamWeights to handle random effects (if specified).

# Normalize and apply voom/voomWithDreamWeights
res.proc <- processAssays(pb, ~StimStatus, min.count = 5)

# the resulting object of class dreamletProcessedData stores
# normalized data and other information
res.proc
## class: dreamletProcessedData 
## assays(8): B cells CD14+ Monocytes ... Megakaryocytes NK cells
## colData(4): ind stim multiplets StimStatus
## metadata(3): cell id cluster
## Samples:
##  min: 13 
##  max: 16
## Genes:
##  min: 164 
##  max: 5262 
## details(7): assay n_retain ... n_errors error_initial

processAssays() retains samples with at least min.cells in a given cell type. While dropping a few samples usually is not a problem, in some cases dropping sames can mean that a variable included in the regression formula no longer has any variation. For example, dropping all stimulated samples from analysis of a given cell type would be mean the variable StimStatus has no variation and is perfectly colinear with the intercept term. This colinearity issue is detected internally and variables with these problem are dropped from the regression formula for that particular cell type. The number of samples retained and the resulting formula used in each cell type can be accessed as follows. In this analysis, samples are dropped from 3 cell types but the original formula remains valid in each case.

# view details of dropping samples
details(res.proc)
##               assay n_retain     formula formDropsTerms n_genes n_errors
## 1           B cells       16 ~StimStatus          FALSE    1961        0
## 2   CD14+ Monocytes       16 ~StimStatus          FALSE    3087        0
## 3       CD4 T cells       16 ~StimStatus          FALSE    5262        0
## 4       CD8 T cells       16 ~StimStatus          FALSE    1030        0
## 5   Dendritic cells       13 ~StimStatus          FALSE     164        0
## 6 FCGR3A+ Monocytes       16 ~StimStatus          FALSE    1160        0
## 7    Megakaryocytes       13 ~StimStatus          FALSE     172        0
## 8          NK cells       16 ~StimStatus          FALSE    1656        0
##   error_initial
## 1         FALSE
## 2         FALSE
## 3         FALSE
## 4         FALSE
## 5         FALSE
## 6         FALSE
## 7         FALSE
## 8         FALSE

Here the mean-variance trend from voom is shown for each cell type. Cell types with sufficient number of cells and reads show a clear mean-variance trend. While in rare cell types like megakaryocytes, fewer genes have sufficient reads and the trend is less apparent.

# show voom plot for each cell clusters
plotVoom(res.proc)