1 Introduction

Flow cytometry and the more recently introduced CyTOF (cytometry by time-of-flight mass spectrometry or mass cytometry) are high-throughput technologies that measure protein abundance on the surface or within cells. In flow cytometry, antibodies are labeled with fluorescent dyes and fluorescence intensity is measured using lasers and photodetectors. CyTOF utilizes antibodies tagged with metal isotopes from the lanthanide series, which have favorable chemistry and do not occur in biological systems; abundances per cell are recorded with a time-of-flight mass spectrometer. In either case, fluorescence intensities (flow cytometry) or ion counts (mass cytometry) are assumed to be proportional to the expression level of the antibody-targeted antigens of interest.

Due to the differences in acquisition, further distinct characteristics should be noted. Conventional fluorophore-based flow cytometry is non-destructive and can be used to sort cells for further analysis. However, because of the spectral overlap between fluorophores, compensation of the data needs to be performed (Roederer 2001), which also limits the number of parameters that can be measured simultaneously. Thus, standard flow cytometry experiments measure 6-12 parameters with modern systems measuring up to 20 channels (Mahnke and Roederer 2007), while new developments (e.g. BD FACSymphony) promise to increase this capacity towards 50. Moreover, flow cytometry offers the highest throughput with tens of thousands of cells measured per second at relatively low operating costs per sample.

By using rare metal isotopes in CyTOF, cell autofluorescence can be avoided and spectral overlap is drastically reduced. However, the sensitivity of mass spectrometry results in the measurement of metal impurities and oxide formations, which need to be carefully considered in antibody panel design (e.g. through antibody concentrations and coupling of antibodies to neighboring metals). Leipold et al. recently commented that minimal spillover does not equal no spillover (Leipold 2015). Nonetheless, CyTOF offers a high dimension of parameters measured per cell, with current panels using ~40 parameters and the promise of up to 100. Throughput of CyTOF is lower, at the rate of hundreds of cells per second, and cells are destroyed during ionization.

The ability of flow cytometry and mass cytometry to analyze individual cells at high-throughput scales has resulted in a wide range of biological and medical applications. For example, immunophenotyping assays are used to detect and quantify cell populations of interest, to uncover new cell populations, and to compare the abundance of cell populations between conditions, such as patient groups (Unen et al. 2016). Such assays can therefore serve as a biomarker discovery tool.

Various methodological approaches aim for biomarker discovery (Saeys, Gassen, and Lambrecht 2016). A common strategy, which we will refer to throughout this workflow as the “classic” approach, is to first identify cell populations of interest by manual gating or automated clustering (Hartmann et al. 2016; Pejoski et al. 2016). Second, using statistical tests, one can determine which of the cell subpopulations or protein markers are associated with a phenotype (e.g. clinical outcome) of interest. Typically, cell subpopulation abundance expressed as cluster cell counts or median marker expression would be used in the statistical model to relate to the sample-level phenotype.

Importantly, there are many alternatives to what we propose below, and several new methods are emerging. For instance, citrus (Bruggner et al. 2014) tackles the differential discovery problem by strong over-clustering of the cells, and by building a hierarchy of clusters from very specific to general ones. Using model selection and regularization techniques, clusters and markers that associate with the outcome are identified. A new machine learning approach, CellCnn (Arvaniti and Claassen 2016), learns the representation of clusters that are associated with the considered phenotype by means of convolutional neural networks, which makes it particularly applicable to detecting discriminating rare cell populations. However, there are tradeoffs to consider. citrus performs feature selection but does not provide significance levels, such as p-values, for the strength of associations. Due to its computational requirements, citrus cannot be run on entire mass cytometry datasets and one typically must analyze a subset of the data. The “filters” from CellCnn may identify one or more cell subsets that distinguish experimental groups, but these subsets may not necessarily correspond to any of the canonical cell types, since they are learned with a data-driven approach.

A noticeable distinction between the machine-learning approaches and our classical regression approach is how the model is designed. citrus and CellCnn model the patient response as a function of the measured HDCyto values, whereas the classical approach models the HDCyto data itself as the response, thus putting the distributional assumptions on the experimental HDCyto data. This carries the distinct advantage that covariates (e.g. age, gender, batch) can be included, together with finding associations of the phenotype to the predictors of interest (e.g. cell type abundance). Specifically, neither citrus nor CellCnn is able to directly account for complex designs, such as paired experiments or presence of batches.

Within the classical approach, hybrid methods are certainly possible, where discovery of interesting cell populations is done with one algorithm, and quantifications or signal aggregations are modeled in standard regression frameworks. For instance, CellCnn provides p-values from a t-test or Mann-Whitney U-test conducted on the frequencies of previously detected cell populations. The models we propose below are flexible extensions of this strategy.

Step by step, this workflow presents differential discovery analyses assembled from a suite of tools and methods that, in our view, lead to a higher level of flexibility and robust, statistically-supported and interpretable results. Cell population identification is conducted by means of unsupervised clustering using the FlowSOM and ConsensusClusterPlus packages, which together were among the best performing clustering approaches for high-dimensional cytometry data (Weber and Robinson 2016). Notably, FlowSOM scales easily to millions of cells and thus no subsetting of the data is required.

To be able to analyze arbitrary experimental designs (e.g. batch effects, paired experiments, etc.), we show how to conduct the differential analysis of cell population abundances using generalized linear mixed models (GLMM) and of marker intensities using linear models (LM) and linear mixed models (LMM). Model fitting is performed with the lme4 and stats packages, and hypothesis testing with the multcomp package.
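
To illustrate why accounting for the design matters, the following minimal sketch compares a linear model that ignores pairing with one that includes the donor as a covariate. It uses only the stats package and made-up values; the data frame toy and its column y (standing in for, say, a per-sample median marker expression) are hypothetical and are not part of the workflow data. It is not the mixed-model analysis described above, only a simplified fixed-effects analogue.

```r
## Toy paired design: 4 hypothetical donors, each measured in two conditions.
## All values are invented for illustration.
toy <- data.frame(
  patient_id = factor(rep(paste0("P", 1:4), times = 2)),
  condition  = factor(rep(c("Ref", "BCRXL"), each = 4),
    levels = c("Ref", "BCRXL")),
  y = c(1.10, 2.90, 5.05, 6.95,   # Ref: large donor-to-donor differences
        1.45, 3.60, 5.40, 7.55)   # BCRXL: same donors, shifted upwards
)

## Model ignoring the pairing
fit_unpaired <- lm(y ~ condition, data = toy)

## Model with the donor as a covariate (fixed effect)
fit_paired <- lm(y ~ condition + patient_id, data = toy)

## Compare estimate and standard error of the condition effect
est_unpaired <- coef(summary(fit_unpaired))["conditionBCRXL", ]
est_paired   <- coef(summary(fit_paired))["conditionBCRXL", ]
rbind(unpaired = est_unpaired, paired = est_paired)
```

In this balanced toy example both models estimate the same condition effect (the mean of the within-donor differences), but the standard error is much smaller once donor-to-donor variation is absorbed by the patient_id term.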

We use the ggplot2 package as our graphical engine. Notably, we propose a suite of useful visual representations of HDCyto data characteristics, such as an MDS (multidimensional scaling) plot of aggregated signal for exploring sample similarities. The obtained cell populations are visualized using dimension reduction techniques (e.g. t-SNE via the Rtsne package) and heatmaps (via the pheatmap package) to represent characteristics of the annotated cell populations and identified biomarkers.

The workflow is intentionally not fully automatic. First, we strongly advocate for exploratory data analysis to get an understanding of data characteristics before formal statistical modeling. Second, the workflow involves an optional step where the user can manually merge and annotate clusters (see Cluster merging and annotation section) but in a way that is easily reproducible. The CyTOF data used here (see Data description section) is already preprocessed; i.e. the normalization and de-barcoding, as well as removal of doublets, debris and dead cells, were already performed. To see how such an analysis could be performed, please see the Data preprocessing section.

Notably, this workflow is equally applicable to flow or mass cytometry datasets, for which the preprocessing steps have already been performed. In addition, the workflow is modular and can be adapted as new algorithms or new knowledge about how to best use existing tools comes to light. Alternative clustering algorithms, such as the popular PhenoGraph algorithm (Levine et al. 2015) (e.g. via the Rphenograph package), and dimensionality reduction techniques, such as diffusion maps (L. Haghverdi, Buettner, and Theis 2015) via the destiny package (Angerer et al. 2016) and SIMLR (Wang et al. 2017) via the SIMLR package, could be inserted into the workflow.

2 Data description

We use a subset of CyTOF data originating from Bodenmiller et al. (Bodenmiller et al. 2012) that was also used in the citrus paper (Bruggner et al. 2014). Specifically, we perform our analysis on samples of peripheral blood mononuclear cells (PBMCs) from 8 healthy donors, where for each individual, an unstimulated and a stimulated sample (for 30 minutes with B cell receptor/Fc receptor crosslinking, known as BCR/FcR-XL) were collected, resulting in 16 samples in total. For each sample, 14 signaling markers and 10 cell surface markers were measured.

The original data is available from the Cytobank report. The subset used here can be downloaded from the Citrus Cytobank repository (files with _BCR-XL.fcs or _Reference.fcs endings) or from our web server (see Data import section).

In both the Bodenmiller et al. and citrus manuscripts, the 10 lineage markers were used to identify cell subpopulations. These were then investigated for differences between reference and stimulated cell subpopulations separately for each of the 14 functional markers. The same strategy is used in this workflow; 10 lineage markers are used for cell clustering and 14 functional markers are tested for differential expression between the reference and BCR/FcR-XL stimulation. Even though differential analysis of cell abundance was not in the scope of the Bodenmiller et al. experiment, we present it here to highlight the generality of the discovery.

3 Data preprocessing

Conventional flow cytometers and mass cytometers produce .fcs files that can be manually analyzed using programs such as FlowJo [TriStar] or Cytobank (Kotecha, Krutzik, and Irish 2001), or using R/Bioconductor packages, such as the flowCore package (Ellis et al. 2017). During this initial analysis step, dead cells are removed, compensation is checked, and marker expression patterns are inspected with simple two-dimensional scatter plots (e.g. marker intensity versus time). Often, modern experiments are barcoded in order to remove analytical biases due to individual sample variation or acquisition time. Preprocessing steps including normalization using bead standards, de-barcoding and compensation can be completed with the CATALYST package, which provides an implementation of the de-barcoding algorithm described by Zunder et al. (Zunder et al. 2015) and the bead-based normalization from Finck et al. (Finck et al. 2013). Of course, preprocessing steps can also be carried out using custom scripts within R or outside of R (e.g. Normalizer (Finck et al. 2013)).

4 Data import

We recommend as standard practice to keep an independent record of all samples collected, with additional information about the experimental condition, including sample or patient identifiers, processing batch and so on. That is, we recommend having a trail of metadata for each experiment. In our example, the metadata file, PBMC8_metadata.xlsx, can be downloaded from the Robinson Lab server with the download.file function. For the workflow, the user should place it in the current working directory (getwd()). Here, we load it into R with the read_excel function from the readxl package and save it into a variable called md, but other file types and interfaces to read them in are also possible.

The data frame md contains the following columns:

file_name with names of the .fcs files corresponding to the reference (suffix “Reference”) and BCR/FcR-XL stimulation (suffix “BCR-XL”) samples,
sample_id with shorter unique names for each sample containing information about conditions and patient IDs,
condition describes whether samples originate from the reference (Ref) or stimulated (BCRXL) condition,
patient_id defines the IDs of patients.

The sample_id variable is used as row names in metadata and will be used throughout the workflow to label the samples. It is important to carefully check whether variables are of the desired type (factor, numeric, character), since input methods may convert columns into different data types. For the statistical modeling, we want to make the condition variable a factor with the reference (Ref) samples being the reference level, where the order of factor levels can be defined with the levels parameter of the factor function. We also specify colors for the different conditions in a variable color_conditions.

library(readxl)
url <- "http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow"
metadata_filename <- "PBMC8_metadata.xlsx"
download.file(paste0(url, "/", metadata_filename), destfile = metadata_filename,
  mode = "wb")
md <- read_excel(metadata_filename)

## Make sure condition variables are factors with the right levels
md$condition <- factor(md$condition, levels = c("Ref", "BCRXL"))
head(md)

## # A tibble: 6 x 4
##                            file_name sample_id condition patient_id
##                                <chr>     <chr>    <fctr>      <chr>
## 1    PBMC8_30min_patient1_BCR-XL.fcs    BCRXL1     BCRXL   Patient1
## 2 PBMC8_30min_patient1_Reference.fcs      Ref1       Ref   Patient1
## 3    PBMC8_30min_patient2_BCR-XL.fcs    BCRXL2     BCRXL   Patient2
## 4 PBMC8_30min_patient2_Reference.fcs      Ref2       Ref   Patient2
## 5    PBMC8_30min_patient3_BCR-XL.fcs    BCRXL3     BCRXL   Patient3
## 6 PBMC8_30min_patient3_Reference.fcs      Ref3       Ref   Patient3

## Define colors for conditions
color_conditions <- c("#6A3D9A", "#FF7F00")
names(color_conditions) <- levels(md$condition)
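
The type and design checks recommended above can be sketched with base R alone. The data frame md_toy below is a hypothetical two-donor stand-in with the same columns as md; its file names are invented, and the checks are only one possible set of sanity tests rather than a required part of the workflow.

```r
## Hypothetical metadata mimicking the structure of `md` (two donors only)
md_toy <- data.frame(
  file_name  = c("p1_BCR-XL.fcs", "p1_Reference.fcs",
                 "p2_BCR-XL.fcs", "p2_Reference.fcs"),
  sample_id  = c("BCRXL1", "Ref1", "BCRXL2", "Ref2"),
  condition  = c("BCRXL", "Ref", "BCRXL", "Ref"),
  patient_id = c("Patient1", "Patient1", "Patient2", "Patient2"),
  stringsAsFactors = FALSE
)
md_toy$condition <- factor(md_toy$condition, levels = c("Ref", "BCRXL"))

## Inspect column types and confirm that Ref is the reference level
sapply(md_toy, class)
stopifnot(levels(md_toy$condition)[1] == "Ref")

## Sample and file identifiers should be unique
stopifnot(!anyDuplicated(md_toy$sample_id), !anyDuplicated(md_toy$file_name))

## In a paired design, each donor contributes one sample per condition
design_tab <- table(md_toy$patient_id, md_toy$condition)
design_tab
stopifnot(all(design_tab == 1))
```

Running the same checks on the real md object would only require replacing md_toy with md.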

The .fcs files listed in the metadata can be downloaded manually from the Citrus Cytobank repository or automatically from the Robinson Lab server where they are saved in a compressed archive file, PBMC8_fcs_files.zip.

fcs_filename <- "PBMC8_fcs_files.zip"
download.file(paste0(url, "/", fcs_filename), destfile = fcs_filename, 
  mode = "wb")
unzip(fcs_filename)

To load the content of the .fcs files into R, we use the flowCore package. Using read.flowSet, we read all files into a flowSet object, which is a general container for HDCyto data. Importantly, read.flowSet and the underlying read.FCS functions, by default, may transform the marker intensities and remove cells with extreme positive values. We keep these options off to be sure that we control the exact preprocessing steps.

library(flowCore)
fcs_raw <- read.flowSet(md$file_name, transformation = FALSE, 
  truncate_max_range = FALSE)
fcs_raw

In our example, information about the panel is also available in a file called PBMC8_panel.xlsx, and can be downloaded from the Robinson Lab server and loaded into a variable called panel. It contains columns for Isotope and Metal that define the atomic mass number and the symbol of the chemical element conjugated to the antibody, respectively, and Antigen, which specifies the protein marker that was targeted; two additional columns, Lineage and Functional, specify whether a channel belongs to the lineage or functional type of marker.

The isotope, metal and antigen information that the instrument receives is also stored in the flowFrame (container for one sample) or flowSet (container for multiple samples) objects. You can type fcs_raw[[1]] to see the first flowFrame, which contains a table with columns name and desc. Their content, which is identical for all the flowFrame objects in the flowSet, can be accessed with pData(parameters()). The variable name corresponds to the column names in the flowSet object, which you can display in R by typing colnames(fcs_raw).

It should be checked that elements from panel can be matched to their corresponding entries in the flowSet object to make the analysis less prone to subsetting mistakes. Here, for example, the entries in panel$Antigen have their exact equivalents in the desc columns of the flowFrame objects. In the following analysis, we will often use marker IDs as column names in the tables containing expression values. As a cautionary note, during object conversion from one type to another (e.g. in the creation of data.frame from a matrix), some characters (e.g. dashes) in the dimension names are replaced with dots, which may cause problems in matching. To avoid this problem, we replace all the dashes with underscores. Also, we define two variables that indicate the lineage and functional markers.

panel_filename <- "PBMC8_panel.xlsx"
download.file(paste0(url, "/", panel_filename), destfile = panel_filename, 
  mode = "wb")
panel <- read_excel(panel_filename)
head(data.frame(panel))

##   Metal Isotope Antigen Lineage Functional
## 1    Cd 110:114     CD3       1          0
## 2    In     115    CD45       1          0
## 3    La     139     BC1       0          0
## 4    Pr     141     BC2       0          0
## 5    Nd     142   pNFkB       0          1
## 6    Nd     144    pp38       0          1

# Replace problematic characters 
panel$Antigen <- gsub("-", "_", panel$Antigen)

panel_fcs <- pData(parameters(fcs_raw[[1]]))
head(panel_fcs)

##               name        desc   range  minRange maxRange
## $P1           Time        Time 2377271   0.00000  2377270
## $P2    Cell_length Cell_length      66   0.00000       65
## $P3 CD3(110:114)Dd         CD3    1212 -13.66756     1211
## $P4  CD45(In115)Dd        CD45    2654   0.00000     2653
## $P5   BC1(La139)Dd         BC1   13357   0.00000    13356
## $P6   BC2(Pr141)Dd         BC2      39 -66.97583       38

# Replace problematic characters 
panel_fcs$desc <- gsub("-", "_", panel_fcs$desc)

# Lineage markers
(lineage_markers <- panel$Antigen[panel$Lineage == 1])

##  [1] "CD3"    "CD45"   "CD4"    "CD20"   "CD33"   "CD123"  "CD14"  
##  [8] "IgM"    "HLA_DR" "CD7"

# Functional markers
(functional_markers <- panel$Antigen[panel$Functional == 1])

##  [1] "pNFkB"  "pp38"   "pStat5" "pAkt"   "pStat1" "pSHP2"  "pZap70"
##  [8] "pStat3" "pSlp76" "pBtk"   "pPlcg2" "pErk"   "pLat"   "pS6"

# Spot checks
all(lineage_markers %in% panel_fcs$desc)

## [1] TRUE

all(functional_markers %in% panel_fcs$desc)

## [1] TRUE
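
The cautionary note about dashes in dimension names can be reproduced with a small base R example. The matrix m and its values are hypothetical; only the column name "HLA-DR" mirrors a marker name that could appear in a panel.

```r
## A tiny matrix whose column names contain a dash
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("HLA-DR", "CD3")))

## By default, data.frame() makes names syntactically valid,
## which replaces the dash with a dot
df_default <- data.frame(m)
colnames(df_default)

## Matching against the original marker names then fails for that column
colnames(m) %in% colnames(df_default)

## Replacing dashes with underscores beforehand keeps names stable
m_safe <- m
colnames(m_safe) <- gsub("-", "_", colnames(m_safe))
df_safe <- data.frame(m_safe)
colnames(df_safe)
```

An alternative would be to pass check.names = FALSE to data.frame, but that has to be remembered at every conversion, whereas cleaning the names once avoids the problem throughout.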

5 Data transformation

Usually, the raw marker intensities read by a cytometer have strongly skewed distributions with varying ranges of expression, thus making it difficult to distinguish between the negative and positive cell populations. It is common practice to transform CyTOF marker intensities using, for example, arcsinh (inverse hyperbolic sine) with cofactor 5 (Bendall et al. 2011 Figure S2; Bruggner et al. 2014) to make the distributions more symmetric and to map them to a comparable range of expression, which is important for clustering. A cofactor of 150 has been promoted for flow cytometry, but users are free to implement alternative transformations, some of which are available from the transform function of the flowCore package. In the following step, we include only those channels that correspond to the lineage and functional markers. We also rename the columns in the flowSet to the antigen names from panel_fcs$desc.

## arcsinh transformation and column subsetting
fcs <- fsApply(fcs_raw, function(x, cofactor = 5){
  colnames(x) <- panel_fcs$desc
  expr <- exprs(x)
  expr <- asinh(expr[, c(lineage_markers, functional_markers)] / cofactor)
  exprs(x) <- expr
  x
})
fcs

## A flowSet with 16 experiments.
## 
##   column names:
##   CD3 CD45 CD4 CD20 CD33 CD123 CD14 IgM HLA_DR CD7 pNFkB pp38 pStat5 pAkt pStat1 pSHP2 pZap70 pStat3 pSlp76 pBtk pPlcg2 pErk pLat pS6
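
To make the effect of the arcsinh transformation concrete, the sketch below applies it to a handful of made-up raw intensities. The vector raw is hypothetical; asinh, sinh and log are base R functions. The transformation is close to linear around zero (so that zero and negative values are handled) and behaves like a logarithm for large values.

```r
cofactor <- 5
raw <- c(-2, 0, 1, 10, 100, 1000, 10000)   # invented raw intensities

## Transformation as applied in the workflow
transformed <- asinh(raw / cofactor)

## asinh(z) is defined as log(z + sqrt(z^2 + 1))
z <- raw / cofactor
manual <- log(z + sqrt(z^2 + 1))

## The transformation is invertible with sinh
back <- sinh(transformed) * cofactor

data.frame(raw, transformed, back)

## For large values, asinh(z) is close to log(2 * z)
large <- raw >= 100
cbind(asinh = transformed[large], log2z = log(2 * z[large]))
```

A larger cofactor widens the near-linear region around zero, which is one way to see why different cofactors are suggested for flow and mass cytometry.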

For some of the further analysis, it is more convenient for us to work using a matrix (called expr) that contains marker expression for cells from all samples. We create such a matrix with the fsApply function that extracts the expression matrices (function exprs) from each element of the flowSet object.

## Extract expression
expr <- fsApply(fcs, exprs)
dim(expr)

## [1] 172791     24

As the ranges of marker intensities can vary substantially, we apply another transformation that scales expression of all markers to values between 0 and 1 using low (e.g. 1%) and high (e.g. 99%) percentiles as the boundary. This additional transformation of the arcsinh-transformed data can sometimes give a better representation of relative differences in marker expression between annotated cell populations; however, it is only used here for visualization.

library(matrixStats)
rng <- colQuantiles(expr, probs = c(0.01, 0.99))
expr01 <- t((t(expr) - rng[, 1]) / (rng[, 2] - rng[, 1]))
expr01[expr01 < 0] <- 0
expr01[expr01 > 1] <- 1
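
For readers who prefer to see the percentile scaling spelled out without matrixStats, the following base R sketch performs the same kind of computation on simulated values. The matrix x and its column names markerA and markerB are hypothetical, and quantile with its default settings is assumed to be an acceptable substitute for colQuantiles here.

```r
set.seed(1)
## Two simulated markers with very different ranges
x <- cbind(markerA = rexp(1000, rate = 1),
           markerB = rnorm(1000, mean = 3, sd = 0.5))

## Per-marker 1% and 99% percentiles
lo <- apply(x, 2, quantile, probs = 0.01)
hi <- apply(x, 2, quantile, probs = 0.99)

## Subtract the lower bound and divide by the range, column by column
x01 <- sweep(sweep(x, 2, lo, "-"), 2, hi - lo, "/")

## Clip values outside the percentile boundaries
x01 <- pmin(pmax(x01, 0), 1)

apply(x01, 2, range)
```

Using percentiles rather than the minimum and maximum makes the scaling less sensitive to a few extreme cells, at the cost of clipping those cells to 0 or 1.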

6 Diagnostic plots

We propose some quick checks to verify whether the data we analyze globally represents what we expect; for example, whether samples that are replicates of one condition are more similar and are distinct from samples from another condition. Another important check is to verify that marker expression distributions do not have any abnormalities such as having different ranges or distinct distributions for a subset of the samples. This could highlight problems with the sample collection or HDCyto acquisition, or batch effects that were unexpected. Depending on the situation, one can then consider removing problematic markers or samples from further analysis; in the case of batch effects, a covariate column could be added to the metadata table and used below in the statistical analyses.

The step below generates a plot with per-sample marker expression distributions, colored by condition (see Figure 1). Here, we can already see distinguishing markers, such as pNFkB and CD20, between stimulated and unstimulated conditions.

## Generate sample IDs corresponding to each cell in the `expr` matrix
sample_ids <- rep(md$sample_id, fsApply(fcs_raw, nrow))

library(ggplot2)
library(reshape2)

ggdf <- data.frame(sample_id = sample_ids, expr)
ggdf <- melt(ggdf, id.vars = "sample_id", 
  value.name = "expression", variable.name = "antigen")
mm <- match(ggdf$sample_id, md$sample_id)
ggdf$condition <- md$condition[mm]

ggplot(ggdf, aes(x = expression, color = condition, 
  group = sample_id)) +
  geom_density() +
  facet_wrap(~ antigen, nrow = 4, scales = "free") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
    strip.text = element_text(size = 7), axis.text = element_text(size = 5)) +
  guides(color = guide_legend(ncol = 1)) +
  scale_color_manual(values = color_conditions)

Figure 1: Per-sample smoothed densities of marker expression (arcsinh-transformed) of 10 lineage markers and 14 functional markers in the PBMC dataset. Two conditions: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) for each of the 8 healthy donors are presented and colored by experimental condition.

6.1 MDS plot

In transcriptomics applications, one of the most utilized exploratory plots is the multi-dimensional scaling (MDS) plot or a principal component analysis (PCA) plot. Such plots show similarities between samples measured in an unsupervised way and give a sense of how much differential expression can be detected before conducting any formal tests. An MDS plot can be generated with the plotMDS function from the limma package. In transcriptomics, distances between samples are calculated based on the expression of the top varying genes. We propose a similar plot for HDCyto data using median marker expression over all cells to calculate dissimilarities between samples (other aggregations are also possible, and one could reduce the number of top varying markers to include in the calculation). Ideally, samples should cluster well within the same condition, although this depends on the magnitude of the difference between experimental conditions. With this diagnostic, one can identify the outlier samples and eliminate them if the circumstances warrant it.

In our MDS plot on median marker expression values (see Figure 2), we can see that the first dimension (MDS1) separates the unstimulated and stimulated samples reasonably well. The second dimension (MDS2) represents, to some degree, differences between patients. Most of the samples that originate from the same patient are placed at a similar point along the y-axis; for example, samples from patients 7, 5, and 8 are at the top of the plot, while samples from patient 4 are located at the bottom. This also indicates that the marker expression of individual patients is driving similarity and perhaps should be formally accounted for in the downstream statistical modeling.

# Get the median marker expression per sample
library(dplyr)

expr_median_sample_tbl <- data.frame(sample_id = sample_ids, expr) %>%
  group_by(sample_id) %>% 
  summarize_all(median)

expr_median_sample <- t(expr_median_sample_tbl[, -1])
colnames(expr_median_sample) <- expr_median_sample_tbl$sample_id

library(limma)
mds <- plotMDS(expr_median_sample, plot = FALSE)

library(ggrepel)
ggdf <- data.frame(MDS1 = mds$x, MDS2 = mds$y, 
  sample_id = colnames(expr_median_sample))
mm <- match(ggdf$sample_id, md$sample_id)
ggdf$condition <- md$condition[mm]

ggplot(ggdf, aes(x = MDS1, y = MDS2, color = condition)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_label_repel(aes(label = sample_id)) +
  theme_bw() +
  scale_color_manual(values = color_conditions)

Figure 2: MDS plot for the unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) samples obtained for each of the 8 healthy donors in the PBMC dataset. Calculations are based on the median (arcsinh-transformed) marker expression of 10 lineage markers and 14 functional markers across all cells measured for each sample. Distances between samples on the plot approximate the typical change in medians. Numbers in the label names indicate patient IDs.
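
The idea of an MDS plot on per-sample medians can also be sketched with base R only. Note that this swaps in classical MDS (cmdscale) on Euclidean distances, which is not the distance used by plotMDS from limma; the matrix med of median expression values is simulated, with a hypothetical stimulation effect on two of six markers.

```r
set.seed(42)
n_markers <- 6

## Simulated donor-specific baselines: markers in rows, 4 donors in columns
baseline <- matrix(rnorm(n_markers * 4, mean = 2), nrow = n_markers)

## Hypothetical stimulation effect on the first two markers only
shift <- c(1, 1, 0, 0, 0, 0)
med <- cbind(baseline, baseline + shift)
colnames(med) <- c(paste0("Ref", 1:4), paste0("BCRXL", 1:4))
rownames(med) <- paste0("marker", 1:n_markers)

## Euclidean distances between samples, then classical MDS in 2 dimensions
d <- dist(t(med))
coords <- cmdscale(d, k = 2)
colnames(coords) <- c("MDS1", "MDS2")
coords
```

The resulting coordinates could be passed to ggplot in the same way as the mds object above; how well conditions separate depends on the size of the shift relative to the donor-to-donor variation.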

7 Cell population identification with FlowSOM and ConsensusClusterPlus

Cell population identification typically has been carried out by manual gating, a method based on visual inspection of a series of two-dimensional scatterplots. At each step, a subset of cells, either positive or negative for the two visualized markers, is selected and further stratified in the subsequent iterations until populations of interest across a range of marker combinations are captured. However, manual gating has drawbacks, such as subjectivity, bias toward well-known cell types, and inefficiency when analyzing large datasets, which also contribute to a lack of reproducibility (Saeys, Gassen, and Lambrecht 2016).

Considerable effort has been made to improve and automate cell population identification, for example through unsupervised clustering (Aghaeepour et al. 2013). However, not all methods scale well in terms of performance and speed from the lower dimensionality flow cytometry data to the higher dimensionality mass cytometry data (Weber and Robinson 2016), since clustering in higher dimensions can suffer from the “curse of dimensionality”.

Besides the mathematical and algorithmic challenges of clustering, cell population identification may be difficult due to the chemical and biological aspects of the cytometry experiment itself. Therefore, panels aimed at detecting rare cell populations should be designed with care, for example by assigning higher sensitivity metals to rare markers. The right choice of a marker panel used for clustering can also be important. It should include all markers that are relevant for cell type identification.

In this workflow, we conduct cell clustering with FlowSOM (Van Gassen et al. 2015) and ConsensusClusterPlus (Wilkerson and Hayes 2010), which appeared amongst the fastest and best performing clustering approaches in a recent study of HDCyto datasets (Weber and Robinson 2016). This ensemble showed strong performance in detecting both high and low frequency cell populations and is one of the fastest methods to run, which enables its interactive usage. We use a slight modification of the original workflow presented in the FlowSOM vignette, which we find more flexible. In particular, we directly call the ConsensusClusterPlus function that is embedded in metaClustering_consensus. Thus, we are able to access all the functionality of the ConsensusClusterPlus package to identify the number of clusters.

The FlowSOM workflow consists of three main steps. First, a self-organizing map (SOM) is built using the BuildSOM function, where cells are assigned according to their similarities to 100 (by default) grid points (so-called codebook vectors or codes) of the SOM. Second, a minimal spanning tree is built, which is mainly used for graphical representation of the clusters; this step is skipped in this pipeline. Finally, metaclustering of the SOM codes is performed directly with the ConsensusClusterPlus function. Additionally, we add an optional round of manual expert-based merging of the metaclusters and allow this to be done in a reproducible fashion.

FlowSOM output can be sensitive to random starts (Weber and Robinson 2016). To make results reproducible, one must specify the seed for the random number generation in R using function set.seed. It is also advisable to rerun analyses with multiple random seeds, for two reasons. First, one can see how robust the detected clusters are, and second, when the goal is to find smaller cell populations, it may happen that, in some runs, random starting points do not represent rare cell populations, as the chance of selecting starting cells from them is low and they are merged into a larger cluster.
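
The role of the random seed can be illustrated with a small stand-in example. The sketch below uses kmeans from the stats package instead of FlowSOM, purely because it is available in base R and also depends on random starting points; the simulated matrix toy_cells (two large groups and one rare group of 10 cells) and the helper run_clustering are hypothetical.

```r
set.seed(10)
## Simulated cells: two abundant populations and one rare population
toy_cells <- rbind(
  matrix(rnorm(400, mean = 0), ncol = 2),
  matrix(rnorm(400, mean = 4), ncol = 2),
  matrix(rnorm(20, mean = 8), ncol = 2)
)

## Hypothetical helper: cluster with a given seed (k-means as a stand-in)
run_clustering <- function(seed) {
  set.seed(seed)
  kmeans(toy_cells, centers = 3)$cluster
}

## The same seed gives identical cluster assignments
a <- run_clustering(1234)
b <- run_clustering(1234)
identical(a, b)

## Different seeds: compare sorted cluster sizes across runs
sizes <- sapply(1:5, function(s) {
  as.vector(sort(table(run_clustering(s)), decreasing = TRUE))
})
colnames(sizes) <- paste0("seed", 1:5)
sizes
```

If the sorted cluster sizes differ between seeds, the rare group was not recovered consistently, which is the kind of instability the multiple-seed check is meant to reveal.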

It is important to point out that we cluster all cells from all samples together. This strategy is beneficial, since we label cell populations only once and the mapping of cell types between samples is automatically consistent. In our analysis, cell populations are identified using only the 10 lineage markers as defined in the BuildSOM function with the colsToUse argument.

library(FlowSOM)

fsom <- ReadInput(fcs, transform = FALSE, scale = FALSE)
set.seed(1234)
som <- BuildSOM(fsom, colsToUse = lineage_markers)

Automatic approaches for selecting the number of clusters in HDCyto data do not always succeed (Weber and Robinson 2016). In general, we therefore recommend some level of over-clustering, and if desired, manual merging of clusters. Such a hierarchical approach is especially suited when the goal is to detect smaller cell populations.

The SPADE analysis performed by Bodenmiller et al. (Bodenmiller et al. 2012) identified 6 main cell types (T-cells, monocytes, dendritic cells, B-cells, NK cells and surface- cells) that were further stratified into 14 more specific subpopulations (CD4+ T-cells, CD8+ T-cells, CD14+ HLA-DR high monocytes, CD14+ HLA-DR med monocytes, CD14+ HLA-DR low monocytes, CD14- HLA-DR high monocytes, CD14- HLA-DR med monocytes, CD14- HLA-DR low monocytes, dendritic cells, IgM+ B-cells, IgM- B-cells, NK cells, surface- CD14+ cells and surface- CD14- cells). In our analysis, we are interested in identifying the 6 main PBMC populations: CD4+ T-cells, CD8+ T-cells, monocytes, dendritic cells, NK cells and B-cells. Following the concept of over-clustering, we perform the metaclustering of the (by default) 100 SOM codes into more groups than the expected number of populations. For example, stratification into 20 groups should give enough resolution. We can explore the clustering in a wide variety of visualizations: t-SNE plots, heatmaps and a plot generated by ConsensusClusterPlus called “delta area”.

We call ConsensusClusterPlus with maximum number of clusters maxK = 20 and other clustering parameters set to the values as in the metaClustering_consensus function. Again, to ensure that the analyses are reproducible, we define the random seed.

## Metaclustering into 20 clusters with ConsensusClusterPlus
library(ConsensusClusterPlus)

codes <- som$map$codes
plot_outdir <- "consensus_plots"
nmc <- 20

mc <- ConsensusClusterPlus(t(codes), maxK = nmc, reps = 100, 
  pItem = 0.9, pFeature = 1, title = plot_outdir, plot = "png", 
  clusterAlg = "hc", innerLinkage = "average", finalLinkage = "average", 
  distance = "euclidean", seed = 1234)

## Get cluster ids for each cell
code_clustering1 <- mc[[nmc]]$consensusClass
cell_clustering1 <- code_clustering1[som$map$mapping[,1]]
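
The last two lines above rely on a simple index lookup: each SOM code carries one metacluster label, and each cell is mapped to one SOM code, so indexing the code-level labels by the cell-to-code mapping yields one label per cell. A minimal sketch with made-up values (the toy_* objects are hypothetical and not part of the workflow data):

```r
## Toy illustration (hypothetical values, not the PBMC data)
## Metacluster assigned to each of 5 SOM codes
toy_code_clustering <- c(1, 1, 2, 3, 3)

## SOM code to which each of 8 cells was mapped
toy_cell_to_code <- c(5, 1, 3, 3, 2, 4, 1, 5)

## Indexing the code-level labels by the cell-to-code mapping
## gives one metacluster label per cell
toy_cell_clustering <- toy_code_clustering[toy_cell_to_code]
toy_cell_clustering

## Number of cells per metacluster
table(toy_cell_clustering)
```

Because the labels are stored at the level of codes, changing the metaclustering only requires recomputing the short code-level vector, not reassigning cells to the SOM.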

We can then investigate characteristics of identified clusters with heatmaps that illustrate median marker expression in each cluster (see Figure 3). As the range of marker expression can vary substantially from marker to marker, we use the 0-1 transformed data for some visualizations. However, to stay consistent with FlowSOM and ConsensusClusterPlus, we use the (arcsinh-transformed) unscaled data to generate the dendrogram of the hierarchical structure of metaclusters.

Since we will use the heatmap plots again later on in this workflow, in code chunks below, we create a wrapper function that generates these plots.

color_clusters <- c("#DC050C", "#FB8072", "#1965B0", "#7BAFDE", "#882E72", 
  "#B17BA6", "#FF7F00", "#FDB462", "#E7298A", "#E78AC3", 
  "#33A02C", "#B2DF8A", "#55A1B1", "#8DD3C7", "#A6761D", 
  "#E6AB02", "#7570B3", "#BEAED4", "#666666", "#999999", 
  "#aa8282", "#d4b7b7", "#8600bf", "#ba5ce3", "#808000", 
  "#aeae5c", "#1e90ff", "#00bfff", "#56ff0d", "#ffff00")

plot_clustering_heatmap_wrapper <- function(expr, expr01, 
  cell_clustering, color_clusters, cluster_merging = NULL){
  
  # Calculate the median expression
  expr_median <- data.frame(expr, cell_clustering = cell_clustering) %>%
    group_by(cell_clustering) %>% 
    summarize_all(median)
  expr01_median <- data.frame(expr01, cell_clustering = cell_clustering) %>%
    group_by(cell_clustering) %>% 
    summarize_all(median)
  
  # Calculate cluster frequencies
  clustering_table <- as.numeric(table(cell_clustering))
  
  # This clustering is based on the markers that were used for the main clustering
  d <- dist(expr_median[, colnames(expr)], method = "euclidean")
  cluster_rows <- hclust(d, method = "average")
  
  expr_heat <- as.matrix(expr01_median[, colnames(expr01)])
  rownames(expr_heat) <- expr01_median$cell_clustering
  
  labels_row <- paste0(rownames(expr_heat), " (", 
    round(clustering_table / sum(clustering_table) * 100, 2), "%)")
  labels_col <- colnames(expr_heat)
  
  # Row annotation for the heatmap
  annotation_row <- data.frame(cluster = factor(expr01_median$cell_clustering))
  rownames(annotation_row) <- rownames(expr_heat)
  
  color_clusters <- color_clusters[1:nlevels(annotation_row$cluster)]
  names(color_clusters) <- levels(annotation_row$cluster)
  annotation_colors <- list(cluster = color_clusters)
  annotation_legend <- FALSE
  
  if(!is.null(cluster_merging)){
    cluster_merging$new_cluster <- factor(cluster_merging$new_cluster)
    annotation_row$cluster_merging <- cluster_merging$new_cluster 
    color_clusters <- color_clusters[1:nlevels(cluster_merging$new_cluster)]
    names(color_clusters) <- levels(cluster_merging$new_cluster)
    annotation_colors$cluster_merging <- color_clusters
    annotation_legend <- TRUE
  }
  
  # Colors for the heatmap
  color <- colorRampPalette(rev(brewer.pal(n = 9, name = "RdYlBu")))(100)
  
  pheatmap(expr_heat, color = color, 
    cluster_cols = FALSE, cluster_rows = cluster_rows, 
    labels_col = labels_col, labels_row = labels_row, 
    display_numbers = TRUE, number_color = "black", 
    fontsize = 8, fontsize_number = 4,
    annotation_row = annotation_row, annotation_colors = annotation_colors, 
    annotation_legend = annotation_legend)
  
}

plot_clustering_heatmap_wrapper(expr = expr[, lineage_markers], 
  expr01 = expr01[, lineage_markers], 
  cell_clustering = cell_clustering1, color_clusters = color_clusters)

Figure 3: Heatmap of the median marker intensities of the 10 lineage markers across the 20 cell populations obtained with FlowSOM after the metaclustering step with ConsensusClusterPlus (PBMC data). The color in the heatmap represents the median of the arcsinh, 0-1 transformed marker expression. The dendrogram on the left represents the hierarchical similarity between the 20 metaclusters (metric: Euclidean distance; linkage: average). Values in the brackets indicate the relative size of a given cluster.

7.1 Visual representation with t-SNE

t-SNE plots are among the most popular plots for representing single-cell data. In these plots, each cell is represented in a lower, usually two-dimensional, space computed using t-distributed stochastic neighbor embedding (t-SNE) (Van Der Maaten and Hinton 2008). More generally, dimensionality reduction techniques represent the similarity of points in 2 or 3 dimensions, such that objects that are similar in the high-dimensional space are also similar in the lower-dimensional space. Mathematically, there are a myriad of ways to define this similarity. For example, principal components analysis (PCA) uses linear combinations of the original features to find orthogonal dimensions that show the highest levels of variability; the top 2 or 3 principal components can then be visualized.
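
As a point of comparison for the PCA description above, the following is a minimal sketch on simulated data using prcomp from the stats package. The data, the group labels and the object names (toy_expr, group, pca) are assumptions made for illustration only.

```r
## Simulated data: 200 "cells" x 5 "markers" (hypothetical)
set.seed(1)
n <- 200
group <- rep(c("A", "B"), each = n / 2)
toy_expr <- matrix(rnorm(n * 5), nrow = n,
  dimnames = list(NULL, paste0("marker", 1:5)))

## Shift two markers in group B so that the two groups differ
toy_expr[group == "B", 1:2] <- toy_expr[group == "B", 1:2] + 3

## PCA: orthogonal linear combinations ordered by explained variance
pca <- prcomp(toy_expr, center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)

## Coordinates of the cells on the first two principal components;
## these two columns are what one would plot
head(pca$x[, 1:2])
```

In this simulation, most of the variance is expected to be captured by the first component, since the only systematic difference between the groups lies along two markers.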

Nevertheless, there are a few notes of caution when using t-SNE or any other dimensionality reduction technique. Since they are based on preserving similarities between cells, those that are similar in the original space will be close in the 2D/3D representation, but the opposite does not always hold. In our experience, t-SNE with default parameters for HDCyto data is often suitable; for more guidance on the specifics of t-SNE, see How to Use t-SNE Effectively (Wattenberg, Viégas, and Johnson 2016). Due to the stochastic nature of t-SNE optimization, rerunning the method will result in different lower dimensional projections; thus, it is advisable to run it a few times to identify the common trends and get a feeling for the variability of the results. As with other methods, to be sure that the analysis is reproducible, the user can define the random seed.

t-SNE is a method that requires significant computational time to process the data even for tens of thousands of cells. CyTOF datasets are usually much larger and thus to keep running times reasonable, one may use a subset of cells; for example, here we use 1000 cells from each sample. The t-SNE map below is colored according to the expression level of the CD4 marker, highlighting that the CD4+ T-cells are placed to the left side of the plot (see Figure 4). In this way, one can use a collection of markers to highlight where cell types of interest are located on the map.

Instead of t-SNE, one could also use other dimension reduction techniques, such as PCA, diffusion maps, SIMLR (Wang et al. 2017) or isomaps, some of which are conveniently available via the cytof_dimReduction function from the cytofkit package (H. Chen et al. 2016). To speed up the t-SNE analysis, one could use a multicore version that is available via the Rtsne.multicore package. Alternative algorithms, such as largeVis (Tang et al. 2016) (available via the largeVis package), can be used for dimensionality reduction of very large datasets without downsampling.

## Indices of non-duplicated cells (duplicates are skipped later on)
dups <- which(!duplicated(expr[, lineage_markers]))

## Data subsampling: create indices by sample
inds <- split(1:length(sample_ids), sample_ids) 

## How many cells to downsample per-sample
tsne_ncells <- pmin(table(sample_ids), 1000)  

## Get subsampled indices
set.seed(1234)
tsne_inds <- lapply(names(inds), function(i){
  s <- sample(inds[[i]], tsne_ncells[i], replace = FALSE)
  intersect(s, dups)
})

tsne_inds <- unlist(tsne_inds)

tsne_expr <- expr[tsne_inds, lineage_markers]

## Run t-SNE
library(Rtsne)

set.seed(1234)
tsne_out <- Rtsne(tsne_expr, check_duplicates = FALSE, pca = FALSE)

## Plot t-SNE colored by CD4 expression
dr <- data.frame(tSNE1 = tsne_out$Y[, 1], tSNE2 = tsne_out$Y[, 2], 
  expr[tsne_inds, lineage_markers])

ggplot(dr,  aes(x = tSNE1, y = tSNE2, color = CD4)) +
  geom_point(size = 0.8) +
  theme_bw() +
  scale_color_gradientn("CD4", 
    colours = colorRampPalette(rev(brewer.pal(n = 11, name = "Spectral")))(50))

Figure 4: t-SNE plot based on the arcsinh-transformed expression of the 10 lineage markers in the cells from the PBMC dataset (t-SNE was run with no PCA step, perplexity = 30, 1000 iterations). From each of the 16 samples, 1000 cells were randomly selected. Cells are colored according to the expression level of the CD4 marker.

We can color the cells by cluster. Ideally, cells of the same color should be close to each other (see Figure 5).

dr$sample_id <- sample_ids[tsne_inds]
mm <- match(dr$sample_id, md$sample_id)
dr$condition <- md$condition[mm]
dr$cell_clustering1 <- factor(cell_clustering1[tsne_inds], levels = 1:nmc)

## Plot t-SNE colored by clusters
ggplot(dr,  aes(x = tSNE1, y = tSNE2, color = cell_clustering1)) +
  geom_point(size = 0.8) +
  theme_bw() +
  scale_color_manual(values = color_clusters) +
  guides(color = guide_legend(override.aes = list(size = 4), ncol = 2))

Figure 5: t-SNE plot based on the arcsinh-transformed expression of the 10 lineage markers in the cells from the PBMC dataset. From each of the 16 samples, 1000 cells were randomly selected. Cells are colored according to the 20 cell populations obtained with FlowSOM after the metaclustering step with ConsensusClusterPlus.

7.2 Cluster merging and annotation

In our experience, manual merging of clusters leads to slightly different results compared to an algorithm with a specified number of clusters. In order to detect somewhat rare populations, some level of over-clustering is necessary so that the more subtle populations become separated from the main populations. In addition, merging can always follow an over-clustering step, but splitting of existing clusters is not generally feasible.

In our setup, over-clustering is also useful when the interest is identifying the “natural” number of clusters present in the data. In addition to the t-SNE plots, one could investigate the delta area plot from the ConsensusClusterPlus package and the hierarchical clustering dendrogram of the over-clustered subpopulations, as shown below.

In our example, we expect around 6 specific cell types, and we have performed FlowSOM clustering into 20 groups as a reasonable over-estimate. After analyzing the heatmaps and t-SNE plots, we can see that stratification of the data into 20 clusters may be too fine. Many clusters are placed very close to each other, indicating that they could be merged together. The same can be deduced from the heatmaps, highlighting that marker expression patterns for some neighboring clusters are very similar. Cluster merging and annotation are somewhat manual, based partially on visual inspection of t-SNE plots and heatmaps, and thus benefit from expert knowledge of the cell types.

7.2.1 Manual cluster merging and annotating based on heatmaps

In our experience, the main reference for manual merging of clusters is the heatmap of marker characteristics across metaclusters, with dendrograms showing the hierarchy of similarities. Such plots aggregate information over many cells and thus show average marker expression for each cluster. Together with dimensionality reduction, these plots give good insight into the relationships between clusters and the marker levels within each cluster. Given expert knowledge of the cell types and markers, it is then left to the researcher to decide how exactly to merge clusters (e.g., with higher weight given to some markers).

The dendrogram highlights the similarity between the metaclusters and can be used explicitly for the merging. However, there are reasons why we would not always follow the dendrogram exactly. In general, when it comes to clustering, blindly following the hierarchy of codes will lead to identification of populations of similar cells, but it does not necessarily mean that they are of biological interest. The distances between metaclusters are calculated across all the markers, and it may be that some markers carry higher weight for certain cell types. In addition, different linkage methods may lead to different hierarchies, especially when clusters are not fully distinct. Another aspect to consider in cluster merging is the cluster size, shown in parentheses next to the cluster label in our plots. If the cluster size is very small, but the cluster seems relevant and distinct, one can keep it separate. However, if it is small and differs from the neighboring cluster only in a somewhat unimportant marker, it could be merged. And, if some of the metaclusters do not represent any specific cell types, they could be dropped from the downstream analysis instead of being merged. However, in case an automated solution for cluster merging is truly needed, one could use the cutree() function applied to the dendrogram.
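
The automated alternative mentioned at the end of the paragraph above could look as follows. This is only a sketch on a small made-up matrix of median marker expression (toy_median and toy_merging are hypothetical objects), using the same distance and linkage as in the heatmap dendrogram; the marker values are invented.

```r
## Hypothetical median expression of 3 markers in 6 metaclusters
toy_median <- rbind(
  c(0.9, 0.1, 0.1),
  c(0.8, 0.2, 0.1),
  c(0.1, 0.9, 0.2),
  c(0.2, 0.8, 0.1),
  c(0.1, 0.1, 0.9),
  c(0.2, 0.1, 0.8))
dimnames(toy_median) <- list(as.character(1:6), c("CD3", "CD20", "CD14"))

## Same metric and linkage as used for the heatmap dendrogram
d <- dist(toy_median, method = "euclidean")
hc <- hclust(d, method = "average")

## Cut the tree into 3 groups; the result maps each metacluster
## to a merged cluster ID
toy_merging <- data.frame(original_cluster = rownames(toy_median),
  new_cluster = cutree(hc, k = 3))
toy_merging
```

The resulting table has the same two columns as the manual merging file used in this workflow, but the merged clusters are numbered rather than annotated, so assigning cell type names still requires inspection of the marker profiles.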

Based on the seed that was set, cluster merging of the 20 metaclusters is defined in the PBMC8_cluster_merging1.xlsx file on the Robinson Lab server with the IDs of the original clusters and new cluster names, and we save it as a cluster_merging1 data frame. The expert has annotated 8 cell populations: CD8 T-cells, CD4 T-cells, B-cells IgM-, B-cells IgM+, NK cells, dendritic cells (DCs), monocytes and surface negative cells; monocytes could be further subdivided based on HLA-DR into high, medium and low subtypes.

cluster_merging1_filename <- "PBMC8_cluster_merging1.xlsx"
download.file(paste0(url, "/", cluster_merging1_filename), 
  destfile = cluster_merging1_filename, mode = "wb")
cluster_merging1 <- read_excel(cluster_merging1_filename)
data.frame(cluster_merging1)

##    original_cluster  new_cluster
## 1                 1 B-cells IgM+
## 2                 2     surface-
## 3                 3     NK cells
## 4                 4  CD8 T-cells
## 5                 5 B-cells IgM-
## 6                 6    monocytes
## 7                 7    monocytes
## 8                 8  CD8 T-cells
## 9                 9  CD8 T-cells
## 10               10    monocytes
## 11               11    monocytes
## 12               12  CD4 T-cells
## 13               13           DC
## 14               14  CD8 T-cells
## 15               15  CD4 T-cells
## 16               16           DC
## 17               17  CD4 T-cells
## 18               18  CD4 T-cells
## 19               19  CD4 T-cells
## 20               20  CD4 T-cells

## New clustering1m
mm <- match(cell_clustering1, cluster_merging1$original_cluster)
cell_clustering1m <- cluster_merging1$new_cluster[mm]

mm <- match(code_clustering1, cluster_merging1$original_cluster)
code_clustering1m <- cluster_merging1$new_cluster[mm]

We update the t-SNE plot with the new annotated cell populations (see Figure 6).

dr$cell_clustering1m <- factor(cell_clustering1m[tsne_inds])
gg_tsne_cl1m <- ggplot(dr,  aes(x = tSNE1, y = tSNE2, color = cell_clustering1m)) +
  geom_point(size = 0.8) +
  theme_bw() +
  scale_color_manual(values = color_clusters) +
  guides(color = guide_legend(override.aes = list(size = 4)))
gg_tsne_cl1m

Figure 6: t-SNE plot based on the arcsinh-transformed expression of the 10 lineage markers in the cells from the PBMC dataset. From each of the 16 samples, 1000 cells were randomly selected. Cells are colored according to the manual merging of the 20 cell populations obtained with FlowSOM into 8 PBMC populations.

When the plots are further stratified by sample (see Figure 7), we can verify whether similar cell populations are present in all replicates, which can help in identifying outlying samples. Optionally, stratification can be done by condition (see Figure 8). With such a spot-check plot, we can inspect whether differences in cell abundance are strong between conditions, and we can identify distinguishing clusters.

## Facet per sample
gg_tsne_cl1m + facet_wrap(~ sample_id)

Figure 7: t-SNE plot as in Figure 6, but stratified by sample.

## Facet per condition
gg_tsne_cl1m + facet_wrap(~ condition)

Figure 8: t-SNE plot as in Figure 6, but stratified by condition.

One useful representation of the merging is a heatmap of median marker expression in each of the original clusters, which are labeled according to the proposed merging (see Figure 9).

plot_clustering_heatmap_wrapper(expr = expr[, lineage_markers],
  expr01 = expr01[, lineage_markers], cell_clustering = cell_clustering1,
  color_clusters = color_clusters, cluster_merging = cluster_merging1)

Figure 9: Heatmap as in Figure 3 with the color bars on the left indicating how the 20 metaclusters obtained with FlowSOM are merged into the 8 PBMC populations.

To get a final summary of the annotated cell types, one can plot a heatmap of median marker expression, calculated based on the cells in each of the annotated populations (see Figure 10).

plot_clustering_heatmap_wrapper(expr = expr[, lineage_markers],
  expr01 = expr01[, lineage_markers], cell_clustering = cell_clustering1m,
  color_clusters = color_clusters)

Figure 10: Heatmap of the median marker intensities of the 10 lineage markers in the 8 PBMC cell populations obtained by manual merging of the 20 metaclusters generated by FlowSOM. The color in the heatmap represents the median of the arcsinh, 0-1 transformed marker expression. Values in the brackets indicate the relative size of each of the cell populations across all the samples.

8 Differential analysis

For the dataset used in this workflow (Bodenmiller et al. 2012; Bruggner et al. 2014), we perform three types of analyses that aim to identify subsets of PBMCs and signaling markers that respond to BCR/FcR-XL stimulation, by comparing stimulated samples to unstimulated samples. We first describe the differential abundance of the defined cell populations, followed by differential analysis of marker expression within each cluster. Finally, differential analysis of the overall aggregated marker expression could also be of interest.

The PBMC subset analyzed in this workflow originates from a paired experiment, where samples from 8 patients were treated with 12 different stimulation conditions for 30 minutes, together with unstimulated reference samples (Bodenmiller et al. 2012). This is a natural example where one would choose a mixed model to model the response (abundance or marker signal), and patients would be treated as a random effect. In this way, one can formally account for within-patient variability, observed to be quite strong in the MDS plot (see MDS plot section), and this should give a gain in power to detect differences between conditions.

We use the stats and lme4 packages to fit the fixed and mixed models, respectively, and the multcomp package for hypothesis testing. In all differential analyses here, we want to test for differences between the reference (Ref) and BCR/FcR-XL treatment (BCRXL). The fixed model formula is straightforward: ~ condition, where condition indicates the treatment group. The corresponding full model design matrix consists of the intercept and dummy variable indicating the treated samples. In the presence of batches, one can include them in the model by using a formula ~ condition + batch, or if they affect the treatment, a formula with interactions ~ condition * batch.

For testing, we use the general linear hypotheses function glht, which allows testing of arbitrary hypotheses. The linfct parameter specifies the linear hypotheses to be tested. It should be a matrix where each row corresponds to one comparison (contrast), and the number of columns must be the same as in the design matrix. In our analysis, the contrast matrix indicates that the regression coefficient corresponding to conditionBCRXL is tested to be equal to zero; i.e. we test the null hypothesis that there is no effect of the BCR/FcR-XL treatment. The result of the test is a p-value, which is the probability of observing a difference between the two conditions that is at least as strong as the observed one, assuming the null hypothesis is true.

Testing is performed on each cluster and marker separately, resulting in 8 tests for differential abundance (one for each merged population), 14 tests for overall differential marker expression analysis and 8 x 14 tests for differential marker expression within populations. Thus, to account for multiple testing, we apply the Benjamini & Hochberg adjustment to each of these sets of tests, using an FDR cutoff of 5%.
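
To make the Benjamini & Hochberg adjustment concrete, the sketch below applies p.adjust from the stats package to 8 made-up p-values and reproduces the result by hand. The p-values and object names (toy_pvals, toy_adjp, manual) are hypothetical and unrelated to the PBMC results.

```r
## Hypothetical raw p-values for 8 tests
toy_pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205)

## BH adjustment as implemented in base R
toy_adjp <- p.adjust(toy_pvals, method = "BH")

## The same by hand: p * n / rank, then enforce monotonicity
## starting from the largest p-value
n <- length(toy_pvals)
oo <- order(toy_pvals, decreasing = TRUE)
manual <- pmin(1, cummin(toy_pvals[oo] * n / (n:1)))[order(oo)]

all.equal(toy_adjp, manual)

## Tests declared significant at an FDR cutoff of 5%
which(toy_adjp < 0.05)
```

Note that several raw p-values below 0.05 are no longer below the cutoff after adjustment, which is the intended effect of controlling the false discovery rate across the whole set of tests.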

library(lme4)
library(multcomp)
## Model formula without random effects
model.matrix( ~ condition, data = md)

##    (Intercept) conditionBCRXL
## 1            1              1
## 2            1              0
## 3            1              1
## 4            1              0
## 5            1              1
## 6            1              0
## 7            1              1
## 8            1              0
## 9            1              1
## 10           1              0
## 11           1              1
## 12           1              0
## 13           1              1
## 14           1              0
## 15           1              1
## 16           1              0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$condition
## [1] "contr.treatment"

## Create contrasts
contrast_names <- c("BCRXLvsRef")
k1 <- c(0, 1)
K <- matrix(k1, nrow = 1, byrow = TRUE, dimnames = list(contrast_names))
K

##            [,1] [,2]
## BCRXLvsRef    0    1

FDR_cutoff <- 0.05
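
The logic of testing a single contrast can be reproduced by hand for an ordinary linear model. The following sketch uses simulated data and lm from the stats package rather than the mixed models and glht used in this workflow; toy_md, K_toy and the simulated response are assumptions made for illustration.

```r
## Simulated response for 16 samples (hypothetical values)
set.seed(1)
toy_md <- data.frame(condition = factor(rep(c("BCRXL", "Ref"), 8),
  levels = c("Ref", "BCRXL")))
toy_md$y <- rnorm(16, mean = ifelse(toy_md$condition == "BCRXL", 1, 0))

fit <- lm(y ~ condition, data = toy_md)

## One row per contrast, one column per model coefficient
K_toy <- matrix(c(0, 1), nrow = 1, dimnames = list("BCRXLvsRef"))

## Estimate of the contrast and its standard error
est <- as.numeric(K_toy %*% coef(fit))
se <- sqrt(as.numeric(K_toy %*% vcov(fit) %*% t(K_toy)))

## Two-sided t-test of the null hypothesis K %*% beta = 0
tstat <- est / se
pval <- 2 * pt(abs(tstat), df = df.residual(fit), lower.tail = FALSE)
c(estimate = est, pval = pval)
```

With the contrast (0, 1), the estimate is simply the conditionBCRXL coefficient, so the result agrees with the corresponding row of summary(fit). Contrasts involving several coefficients follow the same recipe with a different row in the matrix.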

8.1 Differential cell population abundance

Differential analysis of cell population abundance compares the proportions of cell types across experimental conditions and aims to highlight populations that are present at different ratios. First, we calculate two tables: one that contains cell counts for each sample and population and one with the corresponding proportions of cell types by sample. The proportions are used only for plotting, since the statistical modeling takes the cell counts by cluster and sample as input.

counts_table <- table(cell_clustering1m, sample_ids)
props_table <- t(t(counts_table) / colSums(counts_table)) * 100

counts <- as.data.frame.matrix(counts_table)
props <- as.data.frame.matrix(props_table)
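
The double transpose in the calculation of proportions divides each column (sample) of the count table by its column total. A small sketch with made-up labels (all toy_* objects are hypothetical) shows this and an equivalent formulation with prop.table from base R:

```r
## Hypothetical cluster labels and sample IDs for 10 cells
toy_clusters <- c("B", "T", "T", "NK", "T", "B", "T", "NK", "T", "T")
toy_samples <- c("s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2", "s2", "s2")

toy_counts <- table(toy_clusters, toy_samples)

## Divide each column (sample) by its total number of cells
toy_props <- t(t(toy_counts) / colSums(toy_counts)) * 100
toy_props

## Equivalent formulation: proportions within columns
toy_props2 <- prop.table(toy_counts, margin = 2) * 100

## Each sample sums to 100%
colSums(toy_props)
```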

For each sample, we plot its PBMC cell type composition represented with colored bars, where the size of a given stripe reflects the proportion of the corresponding cell type in a given sample (see Figure 11).

ggdf <- melt(data.frame(cluster = rownames(props), props), 
  id.vars = "cluster", value.name = "proportion", variable.name = "sample_id")

## Add condition info
mm <- match(ggdf$sample_id, md$sample_id)
ggdf$condition <- factor(md$condition[mm])

ggplot(ggdf, aes(x = sample_id, y = proportion, fill = cluster)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ condition, scales = "free_x") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values = color_clusters)

Figure 11: Relative abundance of the 8 PBMC populations in each sample (x-axis), in the PBMC dataset, represented with a barplot. The 8 cell populations are a result of manual merging of the 20 FlowSOM metaclusters.

It may be quite hard to see the differences in cluster abundances in the plot above, especially for clusters with very low frequency. And, since boxplots cannot represent multimodal distributions, we show boxplots with jittered points of the sample-level cluster proportions overlaid (see Figure 12). The y-axes are scaled to the range of data plotted for each cluster, to better visualize the differences in frequency of lower abundance clusters. For this experiment, it may be interesting to additionally depict the patient information. We do this by plotting a different point shape for each patient. Indeed, we can see that the direction of abundance changes between the two conditions is often concordant among the patients.

ggdf$patient_id <- factor(md$patient_id[mm])

ggplot(ggdf) +
  geom_boxplot(aes(x = condition, y = proportion, color = condition, 
    fill = condition),  position = position_dodge(), alpha = 0.5, 
    outlier.color = NA) +
  geom_point(aes(x = condition, y = proportion, color = condition, 
    shape = patient_id), alpha = 0.8, position = position_jitterdodge()) +
  facet_wrap(~ cluster, scales = "free", nrow = 2) +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), 
    strip.text = element_text(size = 6)) +
  scale_color_manual(values = color_conditions) +
  scale_fill_manual(values = color_conditions) +
  scale_shape_manual(values = c(16, 17, 8, 3, 12, 0, 1, 2))

Figure 12: Relative abundance of the 8 PBMC populations in each sample, in the PBMC dataset, represented with boxplots. Different colors are used for the two conditions: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL). Values for each patient are indicated with different shape. The 8 cell populations are a result of manual merging of the 20 FlowSOM metaclusters.

As our goal is to compare proportions, one could take these values, transform them (e.g. logit) and use them as a dependent variable in a linear model. However, this approach does not take into account the uncertainty of proportion estimates, which is higher when ratios are calculated for samples with lower total cell counts. A distribution that naturally accounts for such uncertainty is the binomial distribution (i.e. logistic regression), which takes the cell counts as input (relative to the total for each sample). Nevertheless, as in genomic data analysis, pure logistic regression is not able to capture the overdispersion that is present in HDCyto data. A natural extension to model the extra variation is the generalized linear mixed model (GLMM), where the random effect is defined by the sample ID (Zhao et al. 2013; Jia et al. 2014). Additionally, in our example the patient pairing could be accounted for in the model by incorporating a random intercept for each patient. Thus, we present two GLMMs. Both of them comprise a random effect defined by the sample ID to model the overdispersion in proportions. The second model additionally includes a random effect defined by the patient ID to account for the experiment pairing.
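
The dependence of the uncertainty on the total cell count can be quantified with the binomial standard error of an estimated proportion, sqrt(p(1 - p) / m). The sketch below evaluates it for one proportion and three made-up sample sizes (p and m are illustrative values only):

```r
## Same observed proportion (5%) in samples of different size
p <- 0.05
m <- c(200, 2000, 20000)   # hypothetical total cell counts per sample

## Binomial standard error of the estimated proportion
se <- sqrt(p * (1 - p) / m)
data.frame(total_cells = m, proportion = p, se = signif(se, 2))
```

A ten-fold increase in the number of cells reduces the standard error by a factor of sqrt(10), which is the information a linear model on transformed proportions would ignore.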

In our model, the blocking variable is patient ID $i = 1, ..., n$, where $n=8$. For each patient, there are $n_i$ samples measured, and $j = 1,..., n_i$ indicates the sample ID. Here, $n_i=2$ for all $i$ (one reference sample and one BCR/FcR-XL stimulated sample).
We assume that for a given cell population, the cell counts $Y_{ij}$ follow a binomial distribution $Y_{ij} \sim Bin(m_{ij}, \pi_{ij})$, where $m_{ij}$ is the total number of cells in sample $j$ of patient $i$. The generalized linear mixed model with observation-level random effects $\xi_{ij}$ is defined for the proportions $Y_{ij}/m_{ij}$ as follows:

\[E(Y_{ij}/m_{ij}|\beta_0, \beta_1, \xi_{ij}) = logit^{-1}(\beta_0 + \beta_1 x_{ij} + \xi_{ij}),\] where $\xi_{ij} \sim N(0, \sigma^2_\xi)$ and $x_{ij}$ corresponds to the conditionBCRXL column in the design matrix and indicates whether a sample $ij$ belongs to the reference ($x_{ij}=0$) or treated condition ($x_{ij}=1$). Since $E(Y_{ij}/m_{ij}|\beta_0, \beta_1, \xi_{ij}) = \pi_{ij}$, the above formula can be written as follows:

\[ logit(\pi_{ij}) = \beta_0 + \beta_1 x_{ij} + \xi_{ij}.\]

The generalized linear mixed model that furthermore accounts for the patient pairing incorporates additionally a random intercept for each patient $i$:

\[E(Y_{ij}/m_{ij}|\beta_0, \beta_1, \gamma_i, \xi_{ij}) = logit^{-1}(\beta_0 + \beta_1 x_{ij} + \gamma_i + \xi_{ij}),\] where $\gamma_{i} \sim N(0, \sigma^2_\gamma)$.

formula_glmer_binomial1 <- y/total ~ condition + (1|sample_id)
formula_glmer_binomial2 <- y/total ~ condition + (1|patient_id) + (1|sample_id)
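
To illustrate why the observation-level random effect captures overdispersion, one can simulate counts from the first model and compare the variance of the resulting proportions with the variance expected under a pure binomial model. This is a sketch in base R with arbitrarily chosen parameter values (beta0, sigma_xi, m and the number of simulated samples are all assumptions):

```r
## Simulate one cell population under the model with
## observation-level random effects (all values hypothetical)
set.seed(1)
n_samples <- 2000          # many samples, to estimate variances well
m <- 5000                  # total number of cells per sample
beta0 <- qlogis(0.1)       # baseline proportion of 10% on the logit scale
sigma_xi <- 0.3            # SD of the sample-level random effect

xi <- rnorm(n_samples, mean = 0, sd = sigma_xi)
pi_ij <- plogis(beta0 + xi)                # inverse logit
y <- rbinom(n_samples, size = m, prob = pi_ij)

## Variance of observed proportions vs. pure binomial variance
var_observed <- var(y / m)
var_binomial <- 0.1 * 0.9 / m
c(observed = var_observed, binomial = var_binomial)
```

With these settings, the variance of the simulated proportions is far larger than the binomial variance, which is the extra variation that the (1|sample_id) term in the formulas above is meant to absorb.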

The wrapper function below takes as input a data frame with cell counts (each row is a population, each column is a sample), the metadata table, and the formula, and performs the differential analysis specified with contrast K for each population separately, returning a list with coefficients, non-adjusted and adjusted p-values.

differential_abundance_wrapper <- function(counts, md, formula, K){
  ## Fit the GLMM for each cluster separately
  ntot <- colSums(counts)
  
  fit_binomial <- lapply(1:nrow(counts), function(i){

    data_tmp <- data.frame(y = as.numeric(counts[i, md$sample_id]), 
      total = ntot[md$sample_id], md)
    
    fit_tmp <- glmer(formula, weights = total, family = binomial, 
      data = data_tmp)
    
    ## Fit contrasts one by one
    out <- apply(K, 1, function(k){
      contr_tmp <- glht(fit_tmp, linfct = matrix(k, 1))
      summ_tmp <- summary(contr_tmp)
      out <- c(summ_tmp$test$coefficients, summ_tmp$test$pvalues)
      names(out) <- c("coeff", "pval")
      return(out)
    })
    
    return(t(out))
  })
  
  ### Extract fitted contrast coefficients
  coeffs <- lapply(fit_binomial, function(x){
    x[, "coeff"]
  })
  coeffs <- do.call(rbind, coeffs)
  colnames(coeffs) <- paste0("coeff_", contrast_names)
  rownames(coeffs) <- rownames(counts)
  
  ### Extract p-values
  pvals <- lapply(fit_binomial, function(x){
    x[, "pval"]
  })
  pvals <- do.call(rbind, pvals)
  colnames(pvals) <- paste0("pval_", contrast_names)
  rownames(pvals) <- rownames(counts)
  
  ### Adjust the p-values
  adjp <- apply(pvals, 2, p.adjust, method = "BH")
  colnames(adjp) <- paste0("adjp_", contrast_names)
  
  return(list(coeffs = coeffs, pvals = pvals, adjp = adjp))
}

We fit both of the GLMMs specified above. We can see that accounting for the patient pairing increases the sensitivity to detect differentially abundant cell populations.

da_out1 <- differential_abundance_wrapper(counts, md = md, 
  formula = formula_glmer_binomial1, K = K)
apply(da_out1$adjp < FDR_cutoff, 2, table)

##       adjp_BCRXLvsRef
## FALSE               5
## TRUE                3

da_out2 <- differential_abundance_wrapper(counts, md = md, 
  formula = formula_glmer_binomial2, K = K)
apply(da_out2$adjp < FDR_cutoff, 2, table)

##       adjp_BCRXLvsRef
## FALSE               2
## TRUE                6

An output table containing the observed cell population proportions in each sample and p-values can be assembled (and optionally written to a file).

da_output2 <- data.frame(cluster = rownames(props),  props, 
  da_out2$coeffs, da_out2$pvals, da_out2$adjp, row.names = NULL)
print(head(da_output2), digits = 2)

##        cluster BCRXL1 BCRXL2 BCRXL3 BCRXL4 BCRXL5 BCRXL6 BCRXL7 BCRXL8
## 1 B-cells IgM+   3.95   1.43    4.1    3.8    3.9    4.0    2.6    2.7
## 2 B-cells IgM-   1.09   1.01    2.0    1.1    1.5    1.2    2.1    1.7
## 3  CD4 T-cells  26.81  35.78   32.3   31.8   36.7   44.1   26.8   31.5
## 4  CD8 T-cells  46.05  40.87   26.5   25.9   28.2   24.6   34.3   39.7
## 5           DC   0.18   0.83    1.5    1.4    2.1    1.4    1.2    1.2
## 6     NK cells  15.64  11.69   16.6   18.3   12.6    6.8   25.5   12.9
##    Ref1 Ref2 Ref3 Ref4 Ref5  Ref6  Ref7  Ref8 coeff_BCRXLvsRef
## 1  4.82  2.8  8.3  4.7  4.4  5.68  4.34  3.82            -0.42
## 2  1.90  1.3  3.3  1.4  2.5  2.34  2.79  2.19            -0.40
## 3 44.72 49.1 39.7 32.4 38.4 47.33 28.16 36.94            -0.27
## 4 23.66 23.8 15.5 17.6 26.0 25.31 33.49 34.21             0.41
## 5  0.22  0.9  1.2  1.2  1.6  0.86  0.93  0.89             0.22
## 6 14.31  9.7 15.1 14.5 10.2  6.67 22.54 10.99             0.17
##   pval_BCRXLvsRef adjp_BCRXLvsRef
## 1         3.5e-08         9.2e-08
## 2         2.2e-11         8.8e-11
## 3         1.9e-03         2.5e-03
## 4         1.2e-03         1.9e-03
## 5         7.1e-05         1.4e-04
## 6         4.5e-13         3.6e-12

We use a heatmap to report the differential cell populations (see Figure 13). Proportions are first scaled with the arcsine-square-root transformation (as an alternative to logit that does not return infinite values when ratios are equal to zero or one). Then, Z-score normalization is applied to each population to better highlight the relative differences between compared conditions. We created two wrapper functions: normalization_wrapper normalizes the submitted expression expr to mean 0 and standard deviation 1, and plot_differential_heatmap_wrapper generates a heatmap of the submitted expression expr_norm, where samples are grouped by condition, indicated with a color bar on top of the plot. Additionally, labels of clusters contain the adjusted p-values in parentheses.
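
The two steps described above can be checked on a few made-up proportions. The sketch below compares the logit (qlogis in base R) with the arcsine-square-root transformation at the boundary values, and then computes Z-scores for one hypothetical population (toy_p, x and z are illustrative objects):

```r
## Proportions including the boundary values 0 and 1
toy_p <- c(0, 0.01, 0.5, 0.99, 1)

qlogis(toy_p)        # logit: not finite at 0 and 1
asin(sqrt(toy_p))    # arcsine-square-root: finite, from 0 to pi/2

## Z-score of one population across 6 samples (hypothetical values)
x <- asin(sqrt(c(0.02, 0.03, 0.025, 0.05, 0.06, 0.055)))
z <- (x - mean(x)) / sd(x)
round(z, 2)
```

After the Z-score step, every population has mean 0 and standard deviation 1 across samples, so rows of the heatmap are comparable regardless of how abundant the population is.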

normalization_wrapper <- function(expr, th = 2.5){
  expr_norm <- apply(expr, 1, function(x){ 
    sdx <- sd(x, na.rm = TRUE)
    if(sdx == 0){
      x <- (x - mean(x, na.rm = TRUE))
    }else{ 
      x <- (x - mean(x, na.rm = TRUE)) / sdx
    }
    x[x > th] <- th
    x[x < -th] <- -th
    return(x)
  })
  ## apply() returns one column per input row, so transpose back
  return(t(expr_norm))
}

plot_differential_heatmap_wrapper <- function(expr_norm, sign_adjp, 
  condition, color_conditions, th = 2.5){
  ## Order samples by condition
  oo <- order(condition)
  condition <- condition[oo]
  expr_norm <- expr_norm[, oo, drop = FALSE]
  
  ## Create the row labels with adj p-values and other objects for pheatmap
  labels_row <- paste0(rownames(expr_norm), " (", sprintf( "%.02e", sign_adjp), ")")
  labels_col <- colnames(expr_norm)
  annotation_col <- data.frame(condition = factor(condition))
  rownames(annotation_col) <- colnames(expr_norm)
  annotation_colors <- list(condition = color_conditions)
  color <- colorRampPalette(c("#87CEFA", "#56B4E9", "#56B4E9", "#0072B2", 
    "#000000", "#D55E00", "#E69F00", "#E69F00", "#FFD700"))(100)
  breaks = seq(from = -th, to = th, length.out = 101)
  legend_breaks = seq(from = -round(th), to = round(th), by = 1)
  ## Gaps are placed after the last sample of each condition (cumulative counts)
  gaps_col <- cumsum(as.numeric(table(annotation_col$condition)))
  
  pheatmap(expr_norm, color = color, breaks = breaks, 
    legend_breaks = legend_breaks, cluster_cols = FALSE, cluster_rows = FALSE, 
    labels_col = labels_col, labels_row = labels_row, gaps_col = gaps_col, 
    annotation_col = annotation_col, annotation_colors = annotation_colors, 
    fontsize = 8)
}

## Apply the arcsine-square-root transformation
asin_table <- asin(sqrt((t(t(counts_table) / colSums(counts_table)))))
asin <- as.data.frame.matrix(asin_table)
## Keep significant clusters and sort them by significance
sign_clusters <- names(which(sort(da_out2$adjp[, "adjp_BCRXLvsRef"]) < FDR_cutoff))
## Get the adjusted p-values
sign_adjp <- da_out2$adjp[sign_clusters , "adjp_BCRXLvsRef", drop=FALSE]
## Keep the significant clusters and normalize each to mean = 0 and sd = 1
asin_norm <- normalization_wrapper(asin[sign_clusters, ])

mm <- match(colnames(asin_norm), md$sample_id)
plot_differential_heatmap_wrapper(expr_norm = asin_norm, sign_adjp = sign_adjp, 
  condition = md$condition[mm], color_conditions = color_conditions)

Figure 13: Normalized proportions of PBMC cell populations that are significantly differentially abundant between the BCR/FcR-XL stimulated and unstimulated conditions.
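
The effect of the arcsine-square-root transformation at the boundaries can be checked on a few toy proportions. The following is a minimal sketch in base R; the proportions and the object names (p_toy, logit_toy, asin_toy, z_toy) are hypothetical and are not part of the workflow data.

```r
## Toy proportions of one population across six samples (hypothetical values)
p_toy <- c(0, 0.02, 0.10, 0.35, 0.80, 1)

## The logit is not finite at the boundaries: log(0) = -Inf and log(1/0) = Inf
logit_toy <- log(p_toy / (1 - p_toy))

## The arcsine-square-root transformation stays finite on the whole [0, 1] range
asin_toy <- asin(sqrt(p_toy))

print(rbind(proportion = p_toy, logit = logit_toy, asin_sqrt = asin_toy), 
  digits = 3)

## Z-score of the transformed values, truncated to [-th, th] as in the wrapper
th <- 2.5
z_toy <- (asin_toy - mean(asin_toy)) / sd(asin_toy)
z_toy <- pmin(pmax(z_toy, -th), th)
print(round(z_toy, 2))
```

With only six values, no Z-score reaches the truncation threshold; the truncation matters mainly for rows with a single extreme sample, which would otherwise dominate the color scale of the heatmap.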

8.2 Differential analysis of marker expression stratified by cell population

For this part of the analysis, we calculate the median expression of the 14 signaling markers in each cell population (merged cluster) and sample. These will be used as the response variable $Y_{ij}$ in the linear model (LM) or linear mixed model (LMM), for which we assume that the median marker expression follows a Gaussian distribution (on the already arcsinh-transformed scale). The linear model is formulated as follows:

\[Y_{ij} = \beta_0 + \beta_1 x_{ij} + \epsilon_{ij},\] where $\epsilon_{ij} \sim N(0, \sigma^2)$, and the mixed model includes a random intercept for each patient:

\[Y_{ij} = \beta_0 + \beta_1 x_{ij} + \gamma_{i} + \epsilon_{ij},\] where $\gamma_{i} \sim N(0, \sigma^2_\gamma)$. In the current experiment, we have an intercept (basal level) and a single covariate, $x_{ij}$, which is represented as a binary (stimulated/unstimulated) variable. For more complicated designs or batch effects, additional columns of a design matrix can be used.
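
To make the two model formulations concrete, the sketch below simulates data from the mixed model and fits both models with lm and lme4::lmer. It is only an illustration under stated assumptions: the parameter values, the number of patients and all object names (sim, fit_lm_sim, fit_lmer_sim) are hypothetical, and the lme4 package is assumed to be installed.

```r
library(lme4)

set.seed(1)
n_patients <- 8
## Hypothetical parameter values chosen for illustration only
beta0 <- 1.0        # basal level
beta1 <- 0.5        # effect of stimulation
sd_patient <- 1.0   # standard deviation of the random intercept gamma_i
sd_resid <- 0.2     # standard deviation of the residual epsilon_ij

## One observation per patient and condition
sim <- expand.grid(
  patient_id = factor(paste0("patient", seq_len(n_patients))),
  condition = factor(c("Ref", "BCRXL"), levels = c("Ref", "BCRXL")))
gamma <- rnorm(n_patients, mean = 0, sd = sd_patient)
sim$y <- beta0 + beta1 * (sim$condition == "BCRXL") + 
  gamma[as.integer(sim$patient_id)] + rnorm(nrow(sim), mean = 0, sd = sd_resid)

## Linear model (no patient effect) and linear mixed model (random intercept)
fit_lm_sim <- lm(y ~ condition, data = sim)
fit_lmer_sim <- lmer(y ~ condition + (1 | patient_id), data = sim)

## Estimate and standard error of beta_1 under both models
est <- rbind(
  lm = coef(summary(fit_lm_sim))["conditionBCRXL", c("Estimate", "Std. Error")],
  lmer = coef(summary(fit_lmer_sim))["conditionBCRXL", c("Estimate", "Std. Error")])
print(est, digits = 3)
```

In this balanced paired design both models give the same estimate of $\beta_1$, but the mixed model attributes the patient-to-patient differences to $\gamma_i$ rather than to $\epsilon_{ij}$, which yields a smaller standard error for the condition effect.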

One drawback of summarizing the protein marker intensity with a median over cells is that all the other characteristics of the distribution, such as bimodality, skewness and variance, are ignored. On the other hand, it results in a simple, easy to interpret approach, which in many cases will be able to detect interesting changes. Another issue that arises from using a summary statistic is the level of uncertainty, which increases as the number of cells used to calculate it decreases. In the statistical modeling, this problem could be partially handled by assigning observation weights (number of cells) to each cluster and sample. However, since each cluster is tested separately, these weights do not account for the differences in size between clusters.
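
Both points can be illustrated with a small simulation in base R. This is a sketch only: the normal distributions, their parameters and the cell numbers are illustrative assumptions and do not come from the dataset analyzed here.

```r
set.seed(42)

## Variability of the median as a function of the number of cells.
## Values mimic arcsinh-scale marker expression (hypothetical parameters).
n_cells <- c(5, 50, 500)
median_sd <- sapply(n_cells, function(n){
  sd(replicate(1000, median(rnorm(n, mean = 1, sd = 0.5))))
})
names(median_sd) <- paste0("n_", n_cells)
print(round(median_sd, 3))

## Two populations with close medians but very different shapes
x_unimodal <- rnorm(2000, mean = 1, sd = 0.3)
x_bimodal <- c(rnorm(1000, mean = 0, sd = 0.5), rnorm(1000, mean = 2, sd = 0.5))
print(round(c(median_unimodal = median(x_unimodal), 
  median_bimodal = median(x_bimodal),
  sd_unimodal = sd(x_unimodal), sd_bimodal = sd(x_bimodal)), 2))
```

The spread of the simulated medians shrinks as the number of cells grows, and the unimodal and bimodal populations end up with close medians even though their distributions differ markedly.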

There might be instances of small cell populations for which no cells are observed in some samples or where the number of cells is very low. For clusters absent from a sample (e.g. due to biological variance or insufficient sampling), NAs are introduced because no median expression can be calculated; in the case of few cells, the median may be quite variable. Thus, we apply a filter to remove clusters that have fewer than 5 cells in any of the samples. We also remove cluster/marker combinations for which the marker expression is equal to zero in all the samples, as this leads to an error during model fitting.

## Get median marker expression per sample and cluster
expr_median_sample_cluster_tbl <- data.frame(expr[, functional_markers], 
  sample_id = sample_ids, cluster = cell_clustering1m) %>%
  group_by(sample_id, cluster) %>% 
  summarize_all(funs(median))
## Melt
expr_median_sample_cluster_melt <- melt(expr_median_sample_cluster_tbl, 
  id.vars = c("sample_id", "cluster"), value.name = "median_expression", 
  variable.name = "antigen")
## Rearrange so the rows represent clusters and markers
expr_median_sample_cluster <- dcast(expr_median_sample_cluster_melt, 
  cluster + antigen ~ sample_id,  value.var = "median_expression")
rownames(expr_median_sample_cluster) <- paste0(expr_median_sample_cluster$cluster, 
  "_", expr_median_sample_cluster$antigen)
## Eliminate clusters with low frequency
clusters_keep <- names(which((rowSums(counts < 5) == 0)))
keepLF <- expr_median_sample_cluster$cluster %in% clusters_keep
expr_median_sample_cluster <- expr_median_sample_cluster[keepLF, ]
## Eliminate cases with zero expression in all samples
keep0 <- rowSums(expr_median_sample_cluster[, md$sample_id]) > 0
expr_median_sample_cluster <- expr_median_sample_cluster[keep0, ]
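
The two filters can be traced on toy data. The sketch below uses hypothetical matrices (counts_toy, medians_toy) rather than the workflow objects. The "any non-zero" rule in the last step is shown only as a possible alternative and is an assumption, not the rule used in the code above; it differs from the sum-based rule for rows whose medians are all slightly negative.

```r
## Hypothetical cell counts: rows are clusters, columns are samples
counts_toy <- matrix(c(120, 95, 210,
                         3, 40,  18,
                        60, 75,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("clusterA", "clusterB", "clusterC"), c("s1", "s2", "s3")))

## A cluster is kept only if no sample has fewer than min_cells cells
min_cells <- 5
clusters_keep_toy <- names(which(rowSums(counts_toy < min_cells) == 0))
print(clusters_keep_toy)

## Hypothetical medians: rows are cluster/marker combinations
medians_toy <- matrix(c( 0.00,  0.00,  0.00,
                        -0.05, -0.02, -0.01,
                         1.20,  0.90,  1.10),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("all_zero", "all_negative", "expressed"), 
    c("s1", "s2", "s3")))

## Sum-based rule versus an explicit "any non-zero value" rule
keep_sum <- rowSums(medians_toy) > 0
keep_nonzero <- rowSums(medians_toy != 0) > 0
print(rbind(keep_sum, keep_nonzero))
```

Only clusterA passes the count filter, because clusterB and clusterC each have one sample with fewer than 5 cells.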

It is helpful to plot the median expression of all the markers in each cluster for each sample, colored by condition, to get a rough impression of how strong the differences might be (see Figure 14). We do this by combining boxplots with jittered points.

ggdf <- expr_median_sample_cluster_melt[expr_median_sample_cluster_melt$cluster 
  %in% clusters_keep, ]
## Add info about samples
mm <- match(ggdf$sample_id, md$sample_id)
ggdf$condition <- factor(md$condition[mm])
ggdf$patient_id <- factor(md$patient_id[mm])
ggplot(ggdf) +
  geom_boxplot(aes(x = antigen, y = median_expression, 
    color = condition, fill = condition), 
    position = position_dodge(), alpha = 0.5, outlier.color = NA) +
  geom_point(aes(x = antigen, y = median_expression, color = condition, 
    shape = patient_id), alpha = 0.8, position = position_jitterdodge(), 
    size = 0.7) +
  facet_wrap(~ cluster, scales = "free_y", ncol=2) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  scale_color_manual(values = color_conditions) +
  scale_fill_manual(values = color_conditions) +
  scale_shape_manual(values = c(16, 17, 8, 3, 12, 0, 1, 2))

Figure 14: Median (arcsinh-transformed) expression of 14 signaling markers across the 8 identified PBMC cell populations. Different colors are used for the two conditions: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL). Values for each patient are indicated with a different shape. The 8 cell populations are the result of manual merging of the 20 metaclusters.

We created a wrapper function differential_expression_wrapper that performs the differential analysis of marker expression. The user needs to specify a data frame expr_median with marker expression, where each column corresponds to a sample and each row to a cluster/marker combination. One can choose between fitting a regular linear model (model = "lm") or a linear mixed model (model = "lmer"). The formula argument must match the chosen model. The wrapper function returns the contrast coefficients and the non-adjusted and adjusted p-values for each of the contrasts specified in K, for each cluster/marker combination.

differential_expression_wrapper <- function(expr_median, md, model = "lmer", formula, K){
  
  ## Fit LMM or LM for each marker separately
  fit_gaussian <- lapply(1:nrow(expr_median), function(i){
    
    data_tmp <- data.frame(y = as.numeric(expr_median[i, md$sample_id]), md)
    
    switch(model, 
      lmer = {
        fit_tmp <- lmer(formula, data = data_tmp)
      },
      lm = {
        fit_tmp <- lm(formula, data = data_tmp)
      })
    
    ## Fit contrasts one by one
    out <- apply(K, 1, function(k){
      contr_tmp <- glht(fit_tmp, linfct = matrix(k, 1))
      summ_tmp <- summary(contr_tmp)
      out <- c(summ_tmp$test$coefficients, summ_tmp$test$pvalues)
      names(out) <- c("coeff", "pval")
      return(out)
    })
    
    return(t(out))
  })
  
  ### Extract fitted contrast coefficients
  coeffs <- lapply(fit_gaussian, function(x){
    x[, "coeff"]
  })
  coeffs <- do.call(rbind, coeffs)
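  ## NOTE: contrast_names is not an argument of this function; it is looked up
  ## in the enclosing (global) environment and must match the rows of K
  colnames(coeffs) <- paste0("coeff_", contrast_names)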
  rownames(coeffs) <- rownames(expr_median)
  
  ### Extract p-values
  pvals <- lapply(fit_gaussian, function(x){
    x[, "pval"]
  })
  pvals <- do.call(rbind, pvals)
  colnames(pvals) <- paste0("pval_", contrast_names)
  rownames(pvals) <- rownames(expr_median)
  
  ### Adjust the p-values
  adjp <- apply(pvals, 2, p.adjust, method = "BH")
  colnames(adjp) <- paste0("adjp_", contrast_names)
  
  return(list(coeffs = coeffs, pvals = pvals, adjp = adjp))
}
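
The contrast step inside the wrapper can be followed on a single toy model. The sketch below assumes that the multcomp package is installed; the simulated data, the effect size and the names toy, fit_toy and K_toy are hypothetical. Each row of the contrast matrix defines one linear combination of the model coefficients, so with an intercept and one condition coefficient, the row c(0, 1) tests the condition effect.

```r
library(multcomp)

set.seed(7)
## Hypothetical data: 8 unstimulated and 8 stimulated samples
toy <- data.frame(condition = factor(rep(c("Ref", "BCRXL"), each = 8), 
  levels = c("Ref", "BCRXL")))
toy$y <- 1 + 0.5 * (toy$condition == "BCRXL") + rnorm(16, sd = 0.3)

fit_toy <- lm(y ~ condition, data = toy)

## One row per contrast, one column per model coefficient
K_toy <- matrix(c(0, 1), nrow = 1, 
  dimnames = list("BCRXLvsRef", names(coef(fit_toy))))

## Test the contrast and extract the coefficient and the p-value
contr_toy <- summary(glht(fit_toy, linfct = K_toy))
res_toy <- c(coeff = unname(contr_toy$test$coefficients), 
  pval = unname(contr_toy$test$pvalues))
print(res_toy)
```

For a single contrast on a single coefficient, the result matches the corresponding row of summary(fit_toy); the contrast matrix becomes more useful once the design has several coefficients that need to be combined.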

To show how accounting for patient-to-patient variability with the mixed model increases sensitivity, we also fit a regular linear model for comparison. The linear mixed model has a random intercept for each patient.

formula_lm <- y ~ condition
formula_lmer <- y ~ condition + (1|patient_id)

By accounting for the patient effect, the mixed model detects considerably more cases of differential signaling than the regular linear model (70 versus 42 cluster/marker combinations at the chosen FDR cutoff).

de_out1 <- differential_expression_wrapper(expr_median = expr_median_sample_cluster, 
  md = md, model = "lm", formula = formula_lm, K = K)
apply(de_out1$adjp < FDR_cutoff, 2, table)

##       adjp_BCRXLvsRef
## FALSE              51
## TRUE               42

de_out2 <- differential_expression_wrapper(expr_median = expr_median_sample_cluster, 
  md = md, model = "lmer", formula = formula_lmer, K = K)
apply(de_out2$adjp < FDR_cutoff, 2, table)

##       adjp_BCRXLvsRef
## FALSE              23
## TRUE               70
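
The counts above are obtained after Benjamini-Hochberg adjustment of the p-values. As a small worked example of that step, the sketch below applies p.adjust to five hypothetical p-values; the values and the cutoff of 0.05 are assumptions made for illustration.

```r
## Hypothetical raw p-values for five cluster/marker combinations
pvals_toy <- c(0.001, 0.008, 0.039, 0.041, 0.20)
fdr_cutoff_toy <- 0.05   # assumed cutoff for this illustration

## Benjamini-Hochberg adjustment, as used in the wrapper
adjp_toy <- p.adjust(pvals_toy, method = "BH")
print(data.frame(pval = pvals_toy, adjp = adjp_toy))

## Number of calls below the cutoff before and after adjustment
print(table(raw = pvals_toy < fdr_cutoff_toy))
print(table(adjusted = adjp_toy < fdr_cutoff_toy))
```

Four of the raw p-values fall below the cutoff, but only two remain below it after adjustment, because the third and fourth values are both adjusted to 0.05125.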

One can assemble an output table with the median marker expression in each cluster and sample, together with the obtained contrast coefficients and p-values.

de_output2 <- data.frame(expr_median_sample_cluster, 
  de_out2$coeffs, de_out2$pvals, de_out2$adjp, row.names = NULL)
print(head(de_output2), digits = 2)

##        cluster antigen BCRXL1 BCRXL2 BCRXL3 BCRXL4 BCRXL5 BCRXL6 BCRXL7
## 1 B-cells IgM+   pNFkB  1.179  0.880  0.808   1.47  1.361  1.725  1.436
## 2 B-cells IgM+    pp38  0.109 -0.012  0.044   0.24 -0.046  0.083 -0.039
## 3 B-cells IgM+    pAkt  3.247  2.960  2.951   3.26  2.382  3.184  2.762
## 4 B-cells IgM+  pStat1  0.343  0.126  0.242   0.33 -0.010  0.616 -0.050
## 5 B-cells IgM+  pZap70  0.317  0.287  0.351   0.40  0.132  0.604  0.267
## 6 B-cells IgM+  pStat3 -0.047 -0.059  0.451   0.35 -0.058 -0.026  0.534
##    BCRXL8    Ref1   Ref2    Ref3    Ref4   Ref5   Ref6   Ref7   Ref8
## 1  1.5747  1.9639  1.869  1.7726  2.1833  1.861  1.953  1.915  1.979
## 2 -0.0055  0.8891  1.113  0.8534  0.6424  0.126  0.210  0.128  0.126
## 3  3.1439  2.3195  2.310  2.2688  3.0858  1.729  2.024  2.145  2.603
## 4  0.3795 -0.0058  0.064  0.0079  0.5151 -0.047  0.030 -0.034  0.191
## 5  0.3202 -0.0198 -0.033 -0.0336 -0.0056 -0.061 -0.060 -0.032 -0.017
## 6  0.3092 -0.0479 -0.082  0.2652  0.1567 -0.060 -0.066  0.275  0.381
##   coeff_BCRXLvsRef pval_BCRXLvsRef adjp_BCRXLvsRef
## 1           -0.633         6.1e-11         2.7e-10
## 2           -0.463         7.5e-04         1.6e-03
## 3            0.675         2.6e-11         1.3e-10
## 4            0.157         6.2e-02         7.5e-02
## 5            0.367         1.6e-14         1.0e-13
## 6            0.079         5.6e-02         7.1e-02

To report the significant results, we use a heatmap (see Figure 15). Instead of plotting the absolute expression, we display the normalized expression, which better highlights the direction of marker changes. Additionally, we order the cluster-marker instances by their significance and group them by cell type (cluster).

## Keep the significant markers, sort them by significance and group by cluster
sign_clusters_markers <- names(which(de_out2$adjp[, "adjp_BCRXLvsRef"] < FDR_cutoff))
oo <- order(expr_median_sample_cluster[sign_clusters_markers, "cluster"], 
  de_out2$adjp[sign_clusters_markers, "adjp_BCRXLvsRef"])
sign_clusters_markers <- sign_clusters_markers[oo]

## Get the significant adjusted p-values
sign_adjp <- de_out2$adjp[sign_clusters_markers , "adjp_BCRXLvsRef"]

## Normalize expression to mean = 0 and sd = 1
expr_s <- expr_median_sample_cluster[sign_clusters_markers,md$sample_id]
expr_median_sample_cluster_norm <- normalization_wrapper(expr_s)

mm <- match(colnames(expr_median_sample_cluster_norm), md$sample_id)
plot_differential_heatmap_wrapper(expr_norm = expr_median_sample_cluster_norm, 
  sign_adjp = sign_adjp, condition = md$condition[mm],
  color_conditions = color_conditions)

Figure 15: Normalized expression of signaling markers in the 8 PBMC populations that are significantly differentially expressed between the BCR/FcR-XL stimulated and unstimulated conditions.
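
The row order of the heatmap comes from a two-key sort: first by cluster, then by adjusted p-value within each cluster. A minimal sketch of that ordering on a hypothetical result table (toy_res, with made-up adjusted p-values) is given below.

```r
## Hypothetical significant cluster/marker combinations
toy_res <- data.frame(
  cluster = c("NK cells", "B-cells IgM+", "NK cells", "B-cells IgM+"),
  antigen = c("pS6", "pAkt", "pNFkB", "pZap70"),
  adjp = c(1e-3, 1e-10, 1e-6, 1e-13),
  stringsAsFactors = FALSE)

## The first key groups the rows by cluster, the second sorts by significance
oo_toy <- order(toy_res$cluster, toy_res$adjp)
print(toy_res[oo_toy, ])
```

Rows of the same cluster stay together, and within each cluster the most significant marker comes first.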

9 Software availability

All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation. Version numbers of all Bioconductor packages correspond to release version 3.6 of the Bioconductor project.

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2                multcomp_1.4-6             
##  [3] TH.data_1.0-8               MASS_7.3-47                
##  [5] survival_2.41-3             mvtnorm_1.0-6              
##  [7] lme4_1.1-13                 Matrix_1.2-10              
##  [9] cowplot_0.7.0               Rtsne_0.13                 
## [11] ConsensusClusterPlus_1.41.0 FlowSOM_1.9.0              
## [13] igraph_1.1.2                pheatmap_1.0.8             
## [15] RColorBrewer_1.1-2          ggrepel_0.6.5              
## [17] limma_3.33.5                dplyr_0.7.2                
## [19] reshape2_1.4.2              ggplot2_2.2.1              
## [21] matrixStats_0.52.2          flowCore_1.43.5            
## [23] readxl_1.0.0                captioner_2.2.3            
## [25] knitr_1.16                  BiocStyle_2.5.8            
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12        lattice_0.20-35     corpcor_1.6.9      
##  [4] zoo_1.8-0           assertthat_0.2.0    rprojroot_1.2      
##  [7] digest_0.6.12       R6_2.2.2            cellranger_1.1.0   
## [10] plyr_1.8.4          backports_1.1.0     stats4_3.4.1       
## [13] pcaPP_1.9-72        evaluate_0.10.1     highr_0.6          
## [16] rlang_0.1.1         lazyeval_0.2.0      minqa_1.2.4        
## [19] nloptr_1.0.4        rmarkdown_1.6       labeling_0.3       
## [22] splines_3.4.1       stringr_1.2.0       munsell_0.4.3      
## [25] compiler_3.4.1      pkgconfig_2.0.1     BiocGenerics_0.23.0
## [28] htmltools_0.3.6     tibble_1.3.3        codetools_0.2-15   
## [31] XML_3.98-1.9        rrcov_1.4-3         grid_3.4.1         
## [34] nlme_3.1-131        tsne_0.1-3          gtable_0.2.0       
## [37] magrittr_1.5        scales_0.4.1        graph_1.55.0       
## [40] stringi_1.1.5       robustbase_0.92-7   sandwich_2.3-4     
## [43] tools_3.4.1         Biobase_2.37.2      glue_1.1.1         
## [46] DEoptimR_1.0-8      parallel_3.4.1      yaml_2.1.14        
## [49] colorspace_1.3-2    cluster_2.0.6       bindr_0.1

References

Aghaeepour, Nima, Greg Finak, Holger Hoos, Tim R Mosmann, Ryan Brinkman, Raphael Gottardo, and Richard H Scheuermann. 2013. “Critical assessment of automated flow cytometry data analysis techniques.” Nature Methods 10 (3): 228–38. doi:10.1038/nmeth.2365.

Angerer, Philipp, Laleh Haghverdi, Maren Büttner, Fabian J Theis, Carsten Marr, and Florian Buettner. 2016. “destiny: diffusion maps for large-scale single-cell data in R.” Bioinformatics 32 (8): 1241–3. doi:10.1093/bioinformatics/btv715.

Arvaniti, Eirini, and Manfred Claassen. 2016. “Sensitive detection of rare disease-associated cell subsets via representation learning.” BioRxiv, March. http://biorxiv.org/content/early/2016/03/31/046508.abstract.

Bendall, Sean C, Erin F Simonds, Peng Qiu, El-ad D Amir, Peter O Krutzik, Rachel Finck, Robert V Bruggner, et al. 2011. “Single-Cell Mass Cytometry of Differential Immune and Drug Responses Across a Human Hematopoietic Continuum.” Science 332 (6030). American Association for the Advancement of Science: 687–96. doi:10.1126/science.1198704.

Bodenmiller, Bernd, Eli R Zunder, Rachel Finck, Tiffany J Chen, Erica S Savig, Robert V Bruggner, Erin F Simonds, et al. 2012. “Multiplexed mass cytometry profiling of cellular states perturbed by small-molecule regulators.” Nature Biotechnology 30 (9). Nature Publishing Group: 858–67. doi:10.1038/nbt.2317.

Bruggner, Robert V, Bernd Bodenmiller, David L Dill, Robert J Tibshirani, and Garry P Nolan. 2014. “Automated identification of stratifying signatures in cellular subpopulations.” Proceedings of the National Academy of Sciences of the United States of America 111 (26): E2770–7. doi:10.1073/pnas.1408792111.

Chen, Hao, Mai Chan Lau, Michael Thomas Wong, Evan W Newell, Michael Poidinger, and Jinmiao Chen. 2016. “Cytofkit: A Bioconductor Package for an Integrated Mass Cytometry Data Analysis Pipeline.” PLOS Computational Biology 12 (9). Public Library of Science: 1–17. doi:10.1371/journal.pcbi.1005112.

Ellis, B., P. Haaland, F. Hahne, N. Le Meur, N. Gopalakrishnan, J. Spidlen, and M. Jiang. 2017. flowCore: Basic Structures for Flow Cytometry Data.

Finck, Rachel, Erin F Simonds, Astraea Jager, Smita Krishnaswamy, Karen Sachs, Wendy Fantl, Dana Pe’er, Garry P Nolan, and Sean C Bendall. 2013. “Normalization of mass cytometry data with bead standards.” Cytometry Part A 83A: 483–94. doi:10.1002/cyto.a.22271.

Haghverdi, L., F. Buettner, and F. J. Theis. 2015. “Diffusion maps for high-dimensional single-cell analysis of differentiation data.” Bioinformatics 31 (May): 2989–98. doi:10.1093/bioinformatics/btv325.

Hartmann, Felix J, Raphaël Bernard-Valnet, Clémence Quériault, Dunja Mrdjen, Lukas M Weber, Edoardo Galli, Carsten Krieg, et al. 2016. “High-dimensional single-cell analysis reveals the immune signature of narcolepsy.” Journal of Experimental Medicine 213 (12). Rockefeller University Press: 2621–33. doi:10.1084/jem.20160897.

Jia, Cheng, Yu Hu, Yichuan Liu, and Mingyao Li. 2014. “Mapping Splicing Quantitative Trait Loci in RNA-Seq.” Cancer Informatics 13: 35–43. doi:10.4137/CIN.S13971.

Kotecha, Nikesh, Peter O Krutzik, and Jonathan M Irish. 2001. “Web-Based Analysis and Publication of Flow Cytometry Experiments.” In Current Protocols in Cytometry. John Wiley & Sons, Inc. doi:10.1002/0471142956.cy1017s53.

Leipold, Michael D. 2015. “Another step on the path to mass cytometry standardization.” Cytometry Part A 87 (5): 380–82. doi:10.1002/cyto.a.22661.

Levine, Jacob H., Erin F. Simonds, Sean C. Bendall, Kara L. Davis, El-ad D. Amir, Michelle D. Tadmor, Oren Litvin, et al. 2015. “Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis.” Cell 162 (1). Elsevier: 184–97. doi:10.1016/j.cell.2015.05.047.

Mahnke, Yolanda D, and Mario Roederer. 2007. “Optimizing a Multicolor Immunophenotyping Assay.” Clinics in Laboratory Medicine 27 (3): 469–85. doi:10.1016/j.cll.2007.05.002.

Pejoski, David, Nicolas Tchitchek, André Rodriguez Pozo, Jamila Elhmouzi-Younes, Rahima Yousfi-Bogniaho, Christine Rogez-Kreuz, Pascal Clayette, et al. 2016. “Identification of Vaccine-Altered Circulating B Cell Phenotypes Using Mass Cytometry and a Two-Step Clustering Analysis.” The Journal of Immunology 196 (11). American Association of Immunologists: 4814–31. doi:10.4049/jimmunol.1502005.

Roederer, Mario. 2001. “Spectral compensation for flow cytometry: Visualization artifacts, limitations, and caveats.” Cytometry 45 (3). John Wiley & Sons, Inc.: 194–205. doi:10.1002/1097-0320(20011101)45:3<194::AID-CYTO1163>3.0.CO;2-C.

Saeys, Yvan, Sofie Van Gassen, and Bart N Lambrecht. 2016. “Computational flow cytometry: helping to make sense of high-dimensional immunology data.” Nature Reviews Immunology 16 (7): 449–62. doi:10.1038/nri.2016.56.

Tang, Jian, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. 2016. “Visualizing Large-scale and High-dimensional Data.” CoRR abs/1602.00370. http://arxiv.org/abs/1602.00370.

Unen, Vincent van, Na Li, Ilse Molendijk, Mine Temurhan, Thomas Höllt, Andrea E van der Meulen-de Jong, Hein W Verspaget, et al. 2016. “Mass Cytometry of the Human Mucosal Immune System Identifies Tissue- and Disease-Associated Immune Subsets.” Immunity 44 (5): 1227–39. doi:10.1016/j.immuni.2016.04.014.

Van Der Maaten, L J P, and G E Hinton. 2008. “Visualizing data using t-SNE.” Journal of Machine Learning Research 9: 2579–605.

Van Gassen, Sofie, Britt Callebaut, Mary J Van Helden, Bart N Lambrecht, Piet Demeester, Tom Dhaene, and Yvan Saeys. 2015. “FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data.” Cytometry. Part A : The Journal of the International Society for Analytical Cytology 87 (7): 636–45. doi:10.1002/cyto.a.22625.

Wang, Bo, Daniele Ramazzotti, Luca De Sano, Junjie Zhu, Emma Pierson, and Serafim Batzoglou. 2017. “SIMLR: A Tool For Large-Scale Single-Cell Analysis By Multi-Kernel Learning.” BioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/118901.

Wattenberg, Martin, Fernanda Viégas, and Ian Johnson. 2016. “How to Use t-SNE Effectively.” Distill. doi:10.23915/distill.00002.

Weber, Lukas M, and Mark D Robinson. 2016. “Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data.” Cytometry Part A 89 (12): 1084–96. doi:10.1002/cyto.a.23030.

Wilkerson, Matthew D, and D Neil Hayes. 2010. “ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking.” Bioinformatics 26 (12): 1572. doi:10.1093/bioinformatics/btq170.

Zhao, Keyan, Zhi-Xiang Lu, Juw Won Park, Qing Zhou, and Yi Xing. 2013. “GLiMMPS: Robust statistical model for regulatory variation of alternative splicing using RNA-seq data.” Genome Biology 14 (7). BioMed Central Ltd: R74. http://www.ncbi.nlm.nih.gov/pubmed/23876401.

Zunder, Eli R, Rachel Finck, Gregory K Behbehani, El-ad D Amir, Smita Krishnaswamy, Veronica D Gonzalez, Cynthia G Lorang, et al. 2015. “Palladium-based mass tag cell barcoding with a doublet-filtering scheme and single-cell deconvolution algorithm.” Nature Protocols 10 (2). Nature Publishing Group: 316–33. doi:10.1038/nprot.2015.020.

CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets. BioC 2017 workshop.

Malgorzata Nowicka

Carsten Krieg

Lukas M. Weber

Felix J. Hartmann

Silvia Guglietta

Burkhard Becher

Mitchell P. Levesque

Mark D. Robinson
