--- title: "Analysis of Designed Experiments Using OpenCyto" author: "Greg Finak and Mike Jiang" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{AnalysingDesignedExperimentsOpenCyto} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r preamble, echo = FALSE, results = "hide"} library(knitr) opts_chunk$set(error=FALSE, message = FALSE, eval = TRUE, warning = FALSE, cache=FALSE,fig.width=7,fig.height=7) ``` ```{r libs, echo=FALSE, results = "hide"} library(BioC2015OpenCyto) library(ggcyto) data(tbdata) library(openCyto) ``` ## Welcome to the OpenCyto workshop! ### Data The `tbdata` GatingSet is part of a data set of intracellular cytokine staining (ICS) samples from TB-infected and non-infected subjects, stimulated with TB-specific and non-specific peptides. The samples were assayed for Ag-specific response to stimulation. The data were analyzed in @Lin:2015hy, using results from manual gating. ### The data structures and objects underlying OpenCyto and the BioConductor Flow Cytometry Infrastructure 1. `flowFrame` is an R object that corresponds to a single FCS file. 2. `flowSet` is a collection of `flowFrame`s, all with the same markers and channels. 3. `GatingHierarchy` is a `flowFrame` with its associated data transformation, compensation, and gating information defining cell populations. 4. `GatingSet` is a collection of `GatingHierarchy` objects, or equivalently a `flowSet`, with all transformation, compensation, and gating information. #### Gating information stored in a GatingSet The GatingSet associates each sample with a hierarchy of gates that define cell popualtions. The gates can be one, two, three or high-dimensional. Any gates supported by the `flowCore` infrastructure can be attached to a a hierarchy of gates in a `GatingSet`. The GatingSet lets us visualize FCM data analysis at the individual event-level (single-cells). Data are stored on disk using HDF5 and are read-in as needed. This has a very efficient memory footprint and allows us to analyze large data sets on modest hardware. It provides a conveninent and unified way to interact with FCM data for performing hierarchical or non-hierarchical automated gating by managing cell population definitions and their relationships. If your aren't yet using this infrastructure, you should be. #### OpenCyto There are at least 50 different algorithms in the literature for performing automated gating of FCM data. 75% of those are available for BioConductor. There is no "one best algorithm" for FCM gating. Different algorithms have different strengths (Aghaeepour2013a). OpenCyto provides a mechansim to use multiple algorithms to identify different cell populations. Complex high-dimensional methods are rarely necessary, and the focus is on simple methods that work robustly rather than fancy approaches that work on one data set. #### Gating Templates The general idea is that for a given FCM assay utilizing a set staining panel generated by one lab, a general strategy to identify cell populations of interest should be applicable to any set of data produced by the lab. OpenCyto formalizes this in the form of a `gatingTemplate`. This template defines cell populations in terms of their phenotypic markers, parent populations, and gating methods / algorithms used to define them. The template is in the form of a `csv` file that is easier to create than an R script. openCyto will take the template and a data set and produce data-driven gates for each cell population and sample. #### Let's get started! Our goal here will be to gate the `tbdata` data set. We have multiple samples per subject. One sample is unstimulated, the other is stimulated with ESAT-6 peptide. We'll focus on identifying antigen-specific CD4+ T cells. To this end, we need to compare the stimulated and non-stimulated samples within each subject. #### Cleaning up the data. The data were imported from a manual gating analysis. We'll begin by deleting the manual gates, as we're not interested in them for this demo. We'll also rename some phenotypic data columns so that we can use faceting more easily. The general gating scheme we'll apply here is `root/Singlets/CD14-/nonDebris/Lymphocytes/Live/CD3/CD4/Functional markers`. ```{r clean} #Delete existing gates Rm("Singlets",tbdata) ``` Next, we'll use the new API `add_pop` in openCyto to build our new automated gating strategy. We'll grab a subset of the data for testing. ```{r subset} tbdata_subset = subset(tbdata,`PID`%in%unique(pData(tbdata)$`PID`)[1:2]) ``` ### Add a singlet population We'll build up our gating template incrementally. ```{r singlets,warning=FALSE,error=FALSE} template = add_pop( tbdata_subset, alias = "Singlets", pop = "Singlets", parent = "root", dims = "FSC-A,FSC-H",gating_method = "singletGate",gating_args = "wider_gate=TRUE,subsample_pct=0.1" ) ``` Plotting it, we see it looks reasonable. ```{r plot_singlets} ggcyto(tbdata_subset,mapping = aes(x = `FSC-A`,y = `FSC-H`)) + geom_hex(bins = 50) + geom_gate("Singlets") + xlim(c(0,2e5)) ``` #### CD14- Cells Next we add CD14- cells. Lets' see what this dimension looks like and decide on a good gating method. ```{r plot_cd14} ggcyto(tbdata_subset,mapping = aes(x = "CD14"),subset = "Singlets") + geom_histogram(binwidth = 100) + xlim(c(-100,4096)) + facet_wrap( ~ PID,scales = "free") ``` Mindensity should be able to handle this easily. #### mindensity method on CD14- ```{r cd14} template = rbind( template,add_pop( tbdata_subset,alias = "CD14n", "CD14-", parent = "Singlets", dims = "CD14", gating_method = "mindensity",groupBy = "PID",collapseDataForGating = TRUE ) ) ``` ```{r replot_cd14} ggcyto(tbdata_subset,mapping = aes(x = "CD14"),subset = "Singlets") + geom_histogram(binwidth = 100) + xlim(c(-100,4096)) + facet_wrap( ~ PID,scales = "free") + geom_gate("CD14n") ``` ### Next we gate out debris ```{r nondebris} template = template[1:2,] template = rbind( template,add_pop( tbdata_subset,alias = "nonDebris",pop = "nonDebris+",parent = "CD14n",dims = "FSC-A",gating_method = "boundary",collapseDataForGating = FALSE,gating_args = "min=40000,max=2.5e5" ) ) ggcyto(tbdata_subset,mapping = aes(x = "FSC-A",y = "SSC-A"),subset = "CD14n") + geom_hex(bins = 100) + geom_gate() ``` ### And then lymphocytes ```{r lymphocytes} template = rbind( template, add_pop( tbdata_subset,alias = "Lymphocytes",pop = "Lymphocytes+",parent = "nonDebris",dims = "FSC-A,SSC-A",gating_method = "flowClust.2d",preprocessing_method = "prior_flowClust",preprocessing_args="K=3",collapseDataForGating = FALSE,gating_args = "quantile=0.95,target=c(80000,50000),K=3" ) ) ggcyto(tbdata_subset,mapping = aes(x = "FSC-A",y = "SSC-A"),subset = "nonDebris") + geom_hex(bins = 50) + geom_gate("Lymphocytes") ``` ### Next we gate live cells. ```{r live} template = rbind( template, add_pop( tbdata_subset,alias = "Live",pop = "Live+",parent = "Lymphocytes",dims = "",gating_method = "boundary",gating_args = "min=0,max=2000",collapseDataForGating = FALSE ) ) ggcyto(tbdata_subset,mapping = aes(x = "AViD",y="SSC-A"),subset = "Lymphocytes") + geom_hex(bins = 100) + ggcyto_par_set(limits = "instrument") + geom_gate("Live") ``` ### Then CD3+ cells ```{r cd3} template = rbind( template, add_pop( tbdata_subset,alias = "CD3",pop = "CD3+",parent = "Live",dims = "CD3",gating_method = "mindensity" ) ) ggcyto(tbdata_subset,mapping = aes(x = "CD3",y = "FSC-A"),subset = "Live") + geom_hex(bins=100) + ggcyto_par_set(limits = "instrument") + geom_gate("CD3") ``` ### And finally CD4+/CD8- cells ```{r cd4cd8} template = rbind( template, add_pop( tbdata_subset,alias = "*",pop = "CD4+/-CD8+/-",dims = "CD4,CD8",gating_method = "mindensity2",gating_args="gate_range=c(500,4000)",parent = "CD3") ) ggcyto(tbdata_subset,mapping = aes(x = "CD4",y = "CD8"),subset = "CD3") + geom_hex(bins=100) + ggcyto_par_set(limits = "data") + geom_gate()+xlim(0,3000)+ylim(0,4000) ``` ### Next we will gate the cytokine-positive cells In order to make inferences on the proportions of cytokine positive cells in the stimulated and non-stimulated samples within each subject, we should, ideally, apply the same gate to both samples within each subject. We can do this using the `groupBy` column in the template, to specify the Patient ID (`PID`) as the grouping variable, and set `collapseDataForGating` to `TRUE` in order to use both samples together to derive the gate. The cytokines of interest are: ```{r table,results='asis'} kable(pData(parameters(getData(tbdata_subset[[1]])))[,1:2],row.names=FALSE) ``` AViD, CD14, CD3, CD4 and CD8 are phenotypic markers. We want to gate IL2, IFNg, CD154, IL22, IL4, IL17a and TNFa marginally for CD4+/CD8- T cells. #### TNFa ```{r tnf} template = rbind( template, add_pop( tbdata_subset,alias = "TNF",pop = "TNF+",dims = "TNFa",gating_method = "tailgate",gating_args = "auto_tol=TRUE",parent = "CD4+CD8-",collapseDataForGating = TRUE,groupBy = "PID" ) ) ggcyto(tbdata_subset,mapping = aes(x = "TNFa",y = "SSC-A"),subset = "CD4+CD8-") + geom_hex(bins = 100) + ggcyto_par_set(limits = "instrument") + geom_gate() + facet_grid(PID ~ Peptide,scale = "free") + geom_gate() ``` #### IL2 ```{r il2} template = rbind( template, add_pop( tbdata_subset,alias = "IL2",pop = "IL2+",dims = "IL2",gating_method = "tailgate",gating_args = "auto_tol=TRUE",parent = "CD4+CD8-",collapseDataForGating = TRUE,groupBy = "PID" ) ) ggcyto(tbdata_subset,mapping = aes(x = "IL2",y = "SSC-A"),subset = "CD4+CD8-") + geom_hex(bins = 100) + ggcyto_par_set(limits = "instrument") + geom_gate() + facet_grid(PID ~ Peptide,scale = "free") + geom_gate() ``` #### IL22 ```{r il22} template = rbind( template, add_pop( tbdata_subset,alias = "IL22",pop = "IL22+",dims = "IL22",gating_method = "tailgate",gating_args = "auto_tol=TRUE,adjust=2",parent = "CD4+CD8-",collapseDataForGating = TRUE,groupBy = "PID" ) ) ggcyto(tbdata_subset,mapping = aes(x = "IL22",y = "SSC-A"),subset = "CD4+CD8-") + geom_hex(bins = 100) + ggcyto_par_set(limits = "instrument") + geom_gate() + facet_grid(PID ~ Peptide,scale = "free") + geom_gate() ``` #### IL4 ```{r il4} template = rbind( template, add_pop( tbdata_subset,alias = "IL4",pop = "IL4+",dims = "IL4",gating_method = "tailgate",gating_args = "auto_tol=TRUE",parent = "CD4+CD8-",collapseDataForGating = TRUE,groupBy = "PID" ) ) ggcyto(tbdata_subset,mapping = aes(x = "IL4",y = "SSC-A"),subset = "CD4+CD8-") + geom_hex(bins = 100) + ggcyto_par_set(limits = "instrument") + geom_gate() + facet_grid(PID ~ Peptide,scale = "free") + geom_gate() ``` #### CD154 ```{r cd154} template = rbind( template, add_pop( tbdata_subset,alias = "CD154",pop = "CD154+",dims = "CD154",gating_method = "tailgate",gating_args = "auto_tol=TRUE",parent = "CD4+CD8-",collapseDataForGating = TRUE,groupBy = "PID" ) ) ggcyto(tbdata_subset,mapping = aes(x = "CD154",y = "SSC-A"),subset = "CD4+CD8-") + geom_hex(bins = 100) + ggcyto_par_set(limits = "instrument") + geom_gate() + facet_grid(PID ~ Peptide,scale = "free") + geom_gate() ``` #### IFNg ```{r ifng} template = rbind( template, add_pop( tbdata_subset,alias = "IFNg",pop = "IFNg+",dims = "IFNg",gating_method = "tailgate",gating_args = "auto_tol=TRUE",parent = "CD4+CD8-",collapseDataForGating = TRUE,groupBy = "PID" ) ) ggcyto(tbdata_subset,mapping = aes(x = "IFNg",y = "SSC-A"),subset = "CD4+CD8-") + geom_hex(bins = 100) + ggcyto_par_set(limits = "instrument") + geom_gate() + facet_grid(PID ~ Peptide,scale = "free") + geom_gate() ``` #### IL17a ```{r il17a} template = rbind( template, add_pop( tbdata_subset,alias = "IL17a",pop = "IL17a+",dims = "IL17a",gating_method = "tailgate",gating_args = "auto_tol=TRUE",parent = "CD4+CD8-",collapseDataForGating = TRUE,groupBy = "PID" ) ) ggcyto(tbdata_subset,mapping = aes(x = "IL17a",y = "SSC-A"),subset = "CD4+CD8-") + geom_hex(bins = 100) + ggcyto_par_set(limits = "instrument") + geom_gate() + facet_grid(PID ~ Peptide,scale = "free") + geom_gate() ``` ### Gating of the complete data set. Now that we've built up our gating template we can apply it to the remainder of our data set. We write out our gating template. ```{r clean2} tmp = tempfile(fileext=".csv") write.csv(template,tmp,row.names=FALSE) gt = gatingTemplate(tmp) data(tbdata) Rm("Singlets",tbdata) #Remove any gates attached to the GatingSet ``` We do the gating in parallel. ```{r gating} library(parallel) #clean up our gating set gating(gt,tbdata,parallel_type="multicore",mc.cores=7) ``` ### Some QC We can examine the population statistics for the different cell populations. These should be relatively stable across samples. If there are some outliers, that is cause for concern and suggests we need to go back and look at those samples. Barplots of the coefficient of variation of each cell population, faceted by its parent population, and the peptide stimulation. ```{r qc} stats = getPopStats(tbdata) #extract stats stats[,prop := Count/ParentCount] #compute the cell proportions stats = merge(stats,pData(tbdata),by="name") data.table::setDT(stats) ggplot(stats[,.(CV = mad(prop) / median(prop)),.(Parent,Peptide,Population)]) + geom_bar(position = "dodge",stat = "identity") + aes(y = CV,x = Population,fill = Peptide) + theme(axis.text.x = element_text(angle = 90,hjust = 1)) + facet_wrap( ~ Parent,scales = "free_x") ``` Proportions of each cell population. ```{r qc2} ggplot(stats) + geom_boxplot() + aes(y = prop,x = Population) + geom_jitter() + theme(axis.text.x = element_text(angle = 90,hjust = 1)) + facet_wrap( ~ Parent,scales = "free") ``` There are some clear outliers in the CD8 population. We can have a closer look at those samples. ```{r qc3} nm = stats[,.(name,rZ = (prop - median(prop)) / mad(prop)),.(Population)][,.(outlier = abs(rZ) > 3),.(name,Population)][outlier == TRUE][order(name)][Population %in% "CD8+"]$name ggcyto(tbdata[as.character(nm)],mapping = aes(x = "CD8"),subset = "CD3") + geom_histogram(binwidth = 25) + geom_gate("CD8+") ``` We see the issue is one sample where the automated gate completely failed to find the CD8+ cells. For some reason they are really poorly resolved in this sample, even in the original manual gating. ### Manual intervention to adjust the gates on the outlier We can decide to set a hard threshold at 1750 for the CD8+ cells. We adjust all the gates for the CD8 containing cell populations ```{r manual_gating} for(n in nm){ mygate = getGate(tbdata[[n]],"CD8+") #extract a gates mygate@min[""] = 1750 #adjust the min value for CD8 setGate(tbdata[[n]],"CD8+",mygate) #set the gate for the cell population recompute(tbdata[[n]]) # recompute the statistics. #again for the others mygate = getGate(tbdata[[n]],"CD4-CD8+") mygate@min[""]=1750 mygate@max[""]=Inf setGate(tbdata[[n]],"CD4-CD8+",mygate) mygate = getGate(tbdata[[n]],"CD4+CD8-") mygate@max[""]=1750 setGate(tbdata[[n]],"CD4+CD8-",mygate) mygate = getGate(tbdata[[n]],"CD4-CD8-") mygate@max[""]=1750 setGate(tbdata[[n]],"CD4-CD8-",mygate) mygate = getGate(tbdata[[n]],"CD4+CD8+") mygate@min[""]=1750 setGate(tbdata[[n]],"CD4+CD8+",mygate) recompute(tbdata[[n]],"CD4-CD8+") recompute(tbdata[[n]],"CD4+CD8-") recompute(tbdata[[n]],"CD4-CD8-") recompute(tbdata[[n]],"CD4+CD8+") } plotGate(tbdata[as.character(nm)],c("CD4+CD8-","CD4-CD8+","CD4-CD8-","CD4+CD8+"), xbin=256,margin=FALSE) ``` The above at least reflects what we expect from the biology, few double positive T-cells. ### Empirical comparison of responders and non-responders We'll have a look at the proportion of cytokine positive cells in ESAT-6 and DMSO stimulated samples for each subject. ```{r empirical_diff} library(scales) #custom hyperbolic arcsine transformation. asinh_trans = function (c=800) { trans_new(name = "asinh", transform = function(x) asinh(x * c), inverse = function(x) sinh(x)/c) } stats = merge(pData(tbdata),getPopStats(tbdata),by="name") data.table::setDT(stats) stats = stats[Parent=="CD4+CD8-"][,prop:=Count/ParentCount] stats=reshape2::dcast(stats,PID+Population+Parent~Peptide) ggplot(stats) + geom_boxplot(outlier.colour = NA) + geom_jitter(aes(fill = Population),position = position_jitterdodge()) + aes(x = Population,y = pmax(`ESAT-6` - DMSO,0),col = Population) + facet_wrap( ~ Parent,scales = "free") + theme(axis.text.x = element_text(angle = 90,hjust = 1)) + scale_y_continuous("Background Corrected Proportion",trans = "asinh") + ggtitle("Marginal Background Corrected Proportions\nof Cytokine Producing CD4 Cells") + scale_x_discrete("Marginal Cytokine") ``` There are some clear marginal responses in IFNg, IL2, IL17a. We can look at combinatorial subsets of cells using the COMPASS framework, which models antigen-specific responses jointly in all cell subsets. ### Construct a COMPASS Container object from single-cell ICS data COMPASS will test whether the ESAT-6 stimulation is greater than the DMSO stimulation for each subject and each cell subset. It returns a probability of response. ```{r compass} library(COMPASS) data = getSingleCellExpression(tbdata,nodes = c("IL2","IFNg","CD154","IL17a","IL4","TNF","IL22")) #single-cell expression of seven markers counts = as.vector(getPopStats(tbdata,subpopulations = "CD4+CD8-",statistic = "count")[,.(name,Count)]) #parent population cell counts counts = as.vector(as.matrix(counts[,2,with = FALSE])) names(counts) = names(data) meta = pData(tbdata) #metadata CC = COMPASSContainer( data = data,counts = counts,meta = meta,individual_id = "PID",sample_id = "name" ) set.seed(100) compass_result = COMPASS(CC,treatment = Peptide == "ESAT-6",control = Peptide == "DMSO") ``` #### Heatmap of posterior probability of response ```{r heatmap} p = plot(compass_result,threshold=0.01) ``` #### Functionality score ```{r fs} ggplot(cbind(compass_result$data$meta,PFS = FunctionalityScore(compass_result))) + geom_boxplot(outlier.colour = NA) + geom_jitter(aes(col = known_response)) + aes(x = known_response,y = PFS) + theme_gray() ``` The differences between the `known_response` variable and the calculated polyfunctionality scores are due to 1. Different numbers of subjects entering the model. The complete study had 20 subjects, which means some cell subsets may have been filtered inappropriately with fewer subjects. 2. Differences between manual and automated gating. The automated gating scheme here is very rough and meant to give a quick overview of the data. ## That's all! Hopefully this provides you with a sufficient introduction to some of the new features and new API of openCyto. ## References