--- title: "Clustering" author: "Davide Risso" date: "`r Sys.Date()`" bibliography: bibFile.bib output: BiocStyle::html_document: toc: true vignette: > %\VignetteEncoding{UTF-8} --- ```{r options, results="hide", include=FALSE, cache=FALSE, results='hide', message=FALSE} knitr::opts_chunk$set(fig.align="center", cache=FALSE,error=FALSE, #make it stop on error fig.width=7, fig.height=7, autodep=TRUE, out.width="600px", out.height="600px", results="markup", echo=TRUE, eval=TRUE) #knitr::opts_knit$set(stop_on_error = 2L) #really make it stop #knitr::dep_auto() options(getClass.msg=FALSE) #get rid of annoying messages about cache until fixed internally in R set.seed(98883) ## for reproducibility library(bioc2016singlecell) ## for now load individual dependencies library(clusterExperiment) ## use develop for now library(SummarizedExperiment) library(cluster) library(MAST) ``` # Introduction This is the second part of the Bioc2016 workshop "Analysis of single-cell RNA-seq data with R and Bioconductor." In this part we will cover cluster analysis with the `r Githubpkg("epurdom/clusterExperiment")` package. The package is available on [Github](https://github.com/epurdom/clusterExperiment) and will be in Bioconductor devel soon. The goal of `clusterExperiment` is to encourage the user to try many different clustering algorithms in one package structure. We give tools for running many different clusterings and choices of parameters. We also provide visualization to compare many different clusterings and algorithms to find common shared clustering patterns. We implement common post-processing steps unrelated to the specific clustering algorithm (e.g. subsampling the data for stability, finding cluster-specific markers). ## An example dataset We will start from the normalized data obtained in the first part of the workshop with `r Githubpkg("YosefLab/scone")`. The normalized data can be loaded directly from the workshop package. ```{r datain, eval=TRUE} data(scone_res) data(ws_input) # Summarized Experiment se <- SummarizedExperiment(list(counts=norm_logcounts), colData=data.frame(batch=batch[colnames(norm_logcounts)], time_points=bio[colnames(norm_logcounts)])) ``` ## Motivation The most common workflow for single-cell clustering found in the literature is to start from some form of normalized gene-level summaries (often on the log-scale), and perform the following steps: 1. Dimensionality reduction (usually PCA or t-SNE or most variable genes). 2. Compute a distance matrix between samples in the reduced space (usually Euclidean distance). 3. Clustering based on a partitioning method (usually k-means or PAM). Each of these steps forces the researcher to make some choices, e.g., * How many principal components? * Which distance? Euclidean, correlation, rank-based, ... * How many clusters? These choices are very likely to impact the results. ```{r pca} pca <- prcomp(t(assay(se)), center=TRUE, scale=TRUE) plot(pca$x, pch=19, col=bigPalette[bio]) legend("topleft", levels(bio), fill=bigPalette) res1 <- pam(pca$x[,1:2], k=3) table(res1$clustering) res2 <- pam(pca$x[,1:3], k=3) table(res2$clustering) plot(pca$sdev^2/sum(pca$sdev^2), xlab="PC", ylab="Percent of explained variance") res3 <- pam(pca$x[,1:6], k=3) table(res3$clustering) ``` The main idea behind `clusterExperiment` (`r Githubpkg("epurdom/clusterExperiment")`) is to automatically perform and compare several clustering results, based on all possible combinations of parameters, and to find a consensus across the different clusterings. To repeat this simple example within the `clusterExperiment` framework, we can use the function `clusterMany`. ```{r pca_cm} cm <- clusterMany(se, dimReduce="PCA", nPCADims=c(2, 3, 6), ks = 3, clusterFunction = "pam") cm apply(clusterMatrix(cm), 2, table) ``` One of the main features of the package is the ease of visualization: For instance, we can directly compare the three results with `plotClusters`. ```{r plot_cm} defaultMar <- par("mar") plotCMar <- c(1.1,8.1,4.1,1.1) par(mar=plotCMar) plotClusters(cm) ``` We can also find a consensus across the different choices. ```{r combine_cm} cm <- combineMany(cm) plotClusters(cm) cm ``` Notice that samples are now marked as `-1`'s to indicate that these are unclustered samples. In this case, we obtain such samples because of the default option `miSize=5` which discards all the clusters with less than 5 samples. Note that, unlike each individual call to `pam`, we do not obtain `k = 3` clusters. In general, `combineMany` results in a larger number of smaller clusters, that can then be merged with the `mergeClusters` function. The basic premise of our workflow is to find small, robust clusters of samples, and then merge them into larger clusters as relevant. We find that many algorithmic methods for choosing the appropriate number of clusters err on the side of too few clusters. However, we find in practice that we tend to prefer to err on finding many clusters and then merging them based on examining the data. # RSEC: Resampling-based Sequential Ensemble Clustering The main use of the package is to apply the RSEC algorithm to single-cell RNA-seq data. The idea behind RSEC is to find a large number of small, coherent clusters by: * Subsampling of data to find robust clusters. * Perform sequential clustering to find a group of coherent samples, remove them, start over. We perform this routine over many parameters and find a single consensus. ## The `RSEC` workflow The basic intended clustering workflow is implemented in the `RSEC` wrapper function and consists of the following steps (also available as individual functions). * Apply many different clusterings using different choices of parameters using the function `clusterMany`. This results in a large collection of clusterings, where each clustering is based on different parameters. * Find a unifying clustering across these many clusterings using the `combineMany` function. * Determine whether some clusters should be merged together into larger clusters. This involves two steps: - Find a hierarchical clustering of the clusters found by `combineMany` using `makeDendrogram` - Merge together clusters of this hierarchy based on the percentage of differential expression, using `mergeClusters`. ## Subsample clustering Given an underlying clustering strategy, e.g., k-means or PAM with a particular choice of K, we repeat the following: * Subsample the data, e.g. 70% of samples. * Find clusters on the subsample. * Create a co-clustering matrix D: % of subsamples where samples were in the same cluster. We can use the function `clusterSingle` to perform this strategy over a single set of parameters. ```{r subsample} cl3 <- clusterSingle(se, subsample = TRUE, sequential = FALSE, clusterFunction = c("hierarchical01"), clusterDArgs = list(minSize=5, alpha=0.3), subsampleArgs = list("k"=3,"clusterFunction"="pam"), dimReduce = c("PCA"), ndims=10) plotCoClustering(cl3) cl7 <- clusterSingle(se, subsample = TRUE, sequential = FALSE, clusterFunction = c("hierarchical01"), clusterDArgs = list(minSize=5, alpha=0.3), subsampleArgs = list("k"=7,"clusterFunction"="pam"), dimReduce = c("PCA"), ndims=10) plotCoClustering(cl7) ``` Note that the final clustering is performed on the matrix D. This can be done with any clustering method. We use a flexible approach of hierarchical clustering and picking clusters so have at least 1-alpha similarity. Selecting a value for alpha seems more intuitive than selecting a number of desired clusters k. ## Sequential clustering Our sequential clustering works as follows. * Range over K in PAM clustering using the subsampling strategy outlined above. * The cluster that remains stable across values of k is identified and removed. * Repeat until no more stable clusters are found. This strategy draws ideas from the "tight clustering" algorithm used to find groups of genes in microarray data in [@tseng2005]. Again, one can use the `clusterSingle` function to perform sequential clustering. ```{r sequential} cl <- clusterSingle(se, subsample = TRUE, sequential = TRUE, clusterFunction = c("hierarchical01"), clusterDArgs = list(minSize=5, alpha=0.3), subsampleArgs = list("clusterFunction"="pam"), seqArgs = list(k0=5, remain.n=10, top.can=5), dimReduce = c("PCA"), ndims=10) cl ``` ## Finding a consensus The final ingredient is finding a consensus across parameter values. For instance, we can change the values of k0 in the sequential clustering. We could use the function `clusterMany` to perform many clusterings and then `combineMany` to find the consensus clusters. Instead, the package provides a wrapper to perform these steps, where many of the common parameters have reasonable defaults. ## The `RSEC` function The `RSEC` function implements the following steps: 1. Cluster analysis with `clusterMany`. 2. Find a consensus with `combineMany`. 3. Merge clusters together with `makeDendrogram` and `mergeClusters`. ```{r rsec} rs <- RSEC(se, k0s = 4:15, dimReduce = "PCA", nPCADims=c(10, 20), alphas=c(0.2, 0.3), clusterFunction="hierarchical01", betas=0.7, combineProportion=0.5, combineMinSize=3, dendroReduce = "mad", dendroNDims = 1000, mergeMethod = "adjP", mergeCutoff=0.01, seqArgs = list(remain.n=10, top.can=5)) rs ``` There are many arguments to the RSEC function. Argument | Passed to | Meaning -----------|:---------:|-------------------------------------------------------:| dimReduce | transform | Which dimensionality reduction to perform nPCADims | transform | Number of PC's to be used for the clustering clusterFunction | clusterD | Function to use in the clustering of the co-clustering matrix alphas | clusterD | 1 - similarity required between the clusters in the co-clustering matrix k0s | seqCluster | Initial k values for the sequential strategy betas | seqCluster | Stability required across parameters to determine that a cluster is stable combineProportion | combineMany | Minimum proportion of times two samples should be together to be assigned to a cluster. combineMinSize | combineMany | Clusters with size smaller than this will be ignored (resulting in a -1 label) dendroReduce | makeDendrogram | How to reduce the dimension of the data when computing the dendrogram dendroNDims | makeDendrogram | How many dimensions to use when computing the dendrogram mergeMethod | mergeClusters | Method used to merge clusters mergeCutoff | mergeClusters | Cutoff to merge clusters The `plotClusters` function is a good way to get a sense of how many clusterings we tried and to visualize the consensus across parameters. ```{r plotClusterEx1} par(mar=plotCMar) plotClusters(rs, main="Clusters from RSEC", axisLine=-1, sampleData=c("time_points", "batch")) ``` This plot shows the samples in the columns, and different clusterings on the rows. Each sample is color coded based on its clustering for that row, where the colors have been chosen to try to match up clusters across different clusterings that show large overlap. Moreover, the samples have been ordered so that each subsequent clustering (starting at the top and going down) will try to order the samples to keep the clusters together, without rearranging the clustering blocks of the previous clustering/row. We can see that some clusters are fairly stable across different choices of dimensions while others can vary dramatically. Notice that some samples are white. This indicates that they have the value -1, meaning they were not clustered. Another good visualization is the `plotCoClustering` function, which shows how many times samples in each cluster are together across parameters. ```{r plotCoClustering} plotCoClustering(rs) ``` To retrieve the actual results of each clustering, we can use the `clusterMatrix` and `primaryClusters` functions. ```{r clusterMatrix} head(clusterMatrix(rs)[,1:3]) table(primaryClusterNamed(rs)) ``` ## A few details on `mergeClusters` It is not uncommon that `combineMany` will result in too many small clusters, which in practice are too closely related to be useful. Since our final goal is to find gene markers for each clusters, we argue that we can merge clusters that show no or little differential expression (DE) between them. This functionality is implemented in the `mergeClusters` function. `mergeClusters` needs a hierarchical clustering of the clusters; it then goes progressively up that hierarchy, deciding whether two adjacent clusters can be merged. The function `makeDendrogram` makes such a hierarchy between clusters (by applying `hclust` to the medoids of the clusters). Here, we use the 1,000 most variable genes to make the cluster hierarchy. ```{r makeDendrogram} manual <- makeDendrogram(rs, whichCluster = "combineMany", dimReduce="mad", ndims=1000) plotDendrogram(manual) ``` It is useful to first run `mergeClusters` without actually creating any object so as to preview what the final clustering will be (and perhaps to help in setting the cutoff). ```{r mergeClustersPlot} mergeClusters(manual, mergeMethod="adjP", plot="adjP", cutoff=0.01) ``` ```{r mergeClusters} manual <- mergeClusters(manual, mergeMethod="adjP", plot="none", cutoff=0.01) par(mar=plotCMar) plotClusters(manual, whichClusters = c("mergeClusters", "combineMany")) plotCoClustering(manual, whichClusters=c("mergeClusters","combineMany")) ``` Notice that `mergeClusters` combines clusters based on the actual values of the features, while the `coClustering` plot shows how often the samples clustered together. # Find marker genes with `getBestFeatures` (using limma) Once we are satisfied with our clustering, the next step is usually to identify marker genes for each of the clusters. The simplest way is to use differentially expressed (DE) genes to identify such markers. First, we will use `limma` as a way to compute DE genes. When comparing multiple classes (in this case, cell types), the simplest way to identify DE genes is to look for genes DE in at least one class. This can be done using an F-test. The utility function `getBestFeatures` uses the `lmFit` and `topTable` functions from limma to find such DE genes. ```{r dendro_merge} rs <- makeDendrogram(rs, dimReduce="mad", ndims=1000) ## set good breaks for heatmap colors breaks <- c(min(norm_logcounts), seq(0, quantile(norm_logcounts[norm_logcounts > 0], .99, na.rm = TRUE), length = 50), max(norm_logcounts)) ``` ```{r getBestFeatures} genesF <- getBestFeatures(rs, contrastType="F", number=500, isCount=FALSE) head(genesF) ``` ```{r getBestFeatures_heatmap} plotHeatmap(rs, clusterSamplesData="dendrogramValue", clusterFeaturesData=unique(genesF[,"IndexInOriginal"]), main="F statistics", breaks=breaks, sampleData=c("time_points")) ``` The F statistic is not particularly useful to identify markers. The `getBestFeatures` function offers three alternative approaches. * `Pairs`: finds DE genes corresponding to all pairwise comparisons. * `OneAgainstAll`: finds DE genes comparing one cluster vs. the average of all the others. * `Dendro`: uses the cluster hierarchy (from the dendrogram) to compute only important contrasts. ```{r pairwise} genesPairs <- getBestFeatures(rs, contrastType="Pairs", number=50, isCount=FALSE) plotHeatmap(rs, clusterSamplesData="dendrogramValue", clusterFeaturesData=unique(genesPairs[,"IndexInOriginal"]), main="All pairwise comparisons", breaks=breaks, sampleData=c("time_points")) ``` ```{r one_all} genesOneAll <- getBestFeatures(rs, contrastType="OneAgainstAll", number=50, isCount=FALSE) plotHeatmap(rs, clusterSamplesData="dendrogramValue", clusterFeaturesData=unique(genesOneAll[,"IndexInOriginal"]), main="One versus All", breaks=breaks, sampleData=c("time_points")) ``` ```{r dendro} genesDendro <- getBestFeatures(rs, contrastType="Dendro", number=50, isCount=FALSE) plotHeatmap(rs, clusterSamplesData="dendrogramValue", clusterFeaturesData=unique(genesDendro[,"IndexInOriginal"]), main="Constrasts based on dendrogram", breaks=breaks, sampleData=c("time_points")) ``` # Return contrasts for use in external DE analysis ## Example: Account for zero-inflation with MAST Recently, several authors have suggested strategies to account for zero-inflation in differential expression analysis of single-cell data. One such strategy is MAST [@finak2015mast]. We can return the contrast matrix from `clusterExperiment` and use it as an input for a MAST DE analysis. ```{r get_contrasts} ## get contrasts from clusterExperiment object contrMat <- clusterContrasts(rs, contrastType="Dendro")$contrastMatrix wh_rm <- which(primaryCluster(rs)==-1) ``` ```{r mast} ## threshold data to apply MAST norm <- transform(rs) norm[norm<0] <- 0 ## create MAST object sca <- FromMatrix("SingleCellAssay", t(norm[,-wh_rm]), cData=data.frame(Clusters=primaryClusterNamed(rs)[-wh_rm])) ## model fit fit <- zlm(~0+Clusters, sca) ## hp test cont_levels <- paste("Clustersm", 1:max(primaryCluster(rs)), sep="") mast_res <- lapply(1:ncol(contrMat), function(i) { cont <- gsub("X", "Clustersm", colnames(contrMat))[i] hp <- Hypothesis(cont, cont_levels) wald <- waldTest(fit, hp) retval <- data.frame(Feature=rownames(wald), ContrastName=paste("Node", i, sep=""), Contrast=colnames(contrMat)[i], P.Value=wald[,1,3], stringsAsFactors = FALSE) retval <- retval[order(retval$P.Value),] return(retval[1:50,]) }) mast_res <- do.call(rbind, mast_res) plotHeatmap(rs, clusterSamplesData="dendrogramValue", clusterFeaturesData=unique(mast_res$Feature), main="Constrasts based on dendrogram", breaks=breaks, sampleData=c("time_points")) ``` # Session Info ```{r session} sessionInfo() ``` # References