The best advice on using the clustering functions in
clusterExperiment for large datasets is to avoid calculating any \(NxN\) distance or similarity matrix. Such matrices take a long time to calculate and a large amount of memory to store.
The most likely reason such a matrix gets calculated is the clustering routine used. Methods like PAM or hierarchical clustering require a distance matrix and are poor choices for large datasets.
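To make the storage cost concrete, the following base R sketch estimates the memory needed for a dense \(NxN\) matrix of doubles and for the lower-triangle storage used by a `dist` object. The helper names `dense_gib` and `dist_gib` are hypothetical and introduced only for this illustration; the figures ignore object overhead and any temporary copies made during computation.

```r
# Rough memory estimates, assuming 8 bytes per double-precision entry.
# dense_gib() and dist_gib() are illustrative helpers, not package functions.
dense_gib <- function(n) n^2 * 8 / 2^30               # full N x N matrix
dist_gib  <- function(n) n * (n - 1) / 2 * 8 / 2^30   # lower triangle only

n <- c(1e3, 1e4, 1e5, 1e6)
data.frame(N = n, dense_GiB = dense_gib(n), dist_GiB = dist_gib(n))
```

For \(N = 10^5\) samples the dense matrix alone needs roughly 75 GiB, before any clustering has been done.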
Prior to version
2.5.5, our functions would internally calculate a distance matrix if the clustering algorithm needed it, and it would be hard for the user to realize that they selected a clustering routine that needed such a matrix. Now we have added the argument
makeMissingDiss, which, if
FALSE, will not calculate any needed distance matrices and instead return an error. We recommend setting this argument to
FALSE with large datasets as a precaution. If you discover that you are hitting an error, select a different clustering algorithm that does not need an \(NxN\) distance matrix.
Note that this may not work with PAM, because PAM takes as input a matrix \(x\) (see
?pam). But if an \(x\) matrix is given as input, the
pam function simply calculates the distance matrix internally! The option
makeMissingDiss=FALSE may not catch this, since the actual clustering function allows for using an input matrix \(x\). (This is unfortunate for large datasets, and we may in the future reclassify PAM as a method that only accepts distance matrices, so that this option can catch it.)
Similarly, using any options regarding silhouette distance will create an \(NxN\) matrix as part of the silhouette computation in the
cluster package. This includes the
findBestK=TRUE argument. These options should only be considered for moderately sized datasets where the calculation (and storage) of the \(NxN\) matrix is not a problem.
Unfortunately, subsampling and consensus clustering (with
makeConsensus) operate by clustering based on the proportion of shared clusterings per pair of samples, which in past versions was stored by
clusterExperiment in an \(NxN\) matrix (see the main tutorial vignette for an explanation of these methods). While we are working on methods to avoid calculating this matrix, they are not yet fully operational.
We have, however, in version
2.5.5 made some infrastructure changes to allow for avoidance of the \(NxN\) matrix for subsampling and consensus clustering if the user has defined a clustering function to do this (see details below).
We have also in version
2.5.5 changed the clustering functions to allow the user to request clustering of only unique representations of the combinations of clusterings from subsampling or in
makeConsensus, significantly reducing the size of the \(NxN\) matrix used in the actual clustering step (see below).
Here we document some infrastructure changes made to allow for avoidance of the \(NxN\) matrix for subsampling and consensus clustering. These do not, as of yet, actually provide the ability to avoid the \(NxN\) calculation for the clustering, but do set up an infrastructure where the user can now provide the appropriate clustering routine to avoid it.
Prior to version 2.5.5, the results of subsampling would be saved as an \(NxN\) matrix, corresponding to the proportion of times two samples were clustered together across the \(B\) subsamples. As of
2.5.5, the results are simply saved as an \(NxB\) matrix, giving the (integer-valued) cluster assignments of each sample in each subsample. This \(NxB\) matrix will need to be clustered to get anything interesting, and whether the clustering of that matrix will require calculating an \(NxN\) matrix depends on the clustering routine that is used.
The makeConsensus command now expects clustering techniques that will work directly on the \(NxB\) matrices of clusterings, rather than directly calculating the \(NxN\) matrix. Again, this requires a clustering routine that works on an \(NxB\) matrix of clusterings, and whether the clustering of that matrix will require calculating an \(NxN\) matrix depends on the clustering routine (see below).
?ClusterFunctions), they do this by simply internally calculating the \(NxN\) matrix (and this is NOT controlled by the
makeMissingDiss argument, because the actual clustering function that is called calculates it, not the
clusterExperiment infrastructure, similarly to PAM above). We are working on creating a clustering routine that avoids this step; if the user has such a clustering routine, they can provide it to the functions (see the main vignette and
?ClusterFunction for how to integrate a user-defined function).
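The relationship between the \(NxB\) matrix of cluster assignments and the \(NxN\) co-clustering proportions described above can be sketched in base R. This is an illustration only and not the package's internal code: `co_cluster_prop` is a hypothetical helper, and using `NA` to mark a sample that was not assigned in a given subsample is an assumption made for the example.

```r
# Hypothetical helper (not part of clusterExperiment): for each pair of
# samples, the proportion of subsamples in which both were assigned to the
# same cluster, out of the subsamples in which both were assigned at all.
# Assumes `cl` is an N x B integer matrix with NA meaning "not assigned".
co_cluster_prop <- function(cl) {
  n <- nrow(cl)
  together <- matrix(0, n, n)  # times samples i and j shared a cluster
  both     <- matrix(0, n, n)  # times samples i and j were both assigned
  for (b in seq_len(ncol(cl))) {
    lab <- cl[, b]
    present <- !is.na(lab)
    both <- both + outer(present, present, "&")
    same <- outer(lab, lab, "==")
    same[is.na(same)] <- FALSE
    together <- together + same
  }
  together / both  # NaN for pairs that never appear together
}

# Toy input: 4 samples, 3 subsamples.
cl <- cbind(c(1L, 1L, 2L, NA),
            c(1L, 2L, 2L, 2L),
            c(NA, 1L, 1L, 2L))
p <- co_cluster_prop(cl)
dim(p)
```

The input grows linearly in \(N\), while the result has \(N^2\) entries, which is why any routine that builds this matrix internally becomes the bottleneck for large datasets.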
makeConsensus, only the \(M\) unique combinations of clusters are clustered; this can affect the results, since it ignores the number of samples represented by each of the \(M\) combinations (important for methods like kmeans that take averages across the samples). However, it can dramatically reduce the size, no longer requiring calculation or storage of all the dissimilarities between identically clustered samples. To choose this option, set
clusterArgs=list(removeDup=TRUE) in the list of arguments passed to either
subsampleArgs. This can also be done for the clustering function of subsampling, but is likely to lead to much less of a reduction in size.
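The idea of clustering only the \(M\) unique combinations can be illustrated in base R. The sketch below is not the package's implementation; the variable names (`key`, `uniq`, `map`, `counts`) are hypothetical. It collapses identical rows of a matrix of cluster assignments, keeps a mapping back to the original samples, and computes the per-combination counts that are ignored when only the unique rows are clustered.

```r
# Toy N x B matrix of cluster assignments (6 samples, 3 clusterings).
cl <- rbind(c(1L, 1L, 2L),
            c(1L, 1L, 2L),
            c(2L, 1L, 2L),
            c(1L, 1L, 2L),
            c(2L, 1L, 2L),
            c(3L, 2L, 1L))

# One string key per sample, identifying its combination of clusters.
key   <- apply(cl, 1, paste, collapse = "_")
first <- !duplicated(key)

uniq   <- cl[first, , drop = FALSE]          # M x B matrix of unique combinations
map    <- match(key, key[first])             # row of `uniq` for each original sample
counts <- tabulate(map, nbins = nrow(uniq))  # samples represented by each combination

# After clustering the rows of `uniq`, assignments for all N samples
# can be recovered by indexing the result with `map`.
nrow(uniq)
counts
```

Clustering `uniq` needs at most an \(MxM\) dissimilarity matrix rather than an \(NxN\) one, at the price of discarding `counts`.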
ClusterExperiment object. Instead we allow for either storage of the \(NxN\) matrix or the \(NxB\) matrix, or even just the indices of the clusterings that make up the \(NxB\) matrix. This slot was primarily used for the
plotCoClustering command (basically a heatmap of the \(NxN\) matrix), which is unlikely to be of practical use for extremely large datasets. However, the
plotCoClustering command will calculate that \(NxN\) matrix on the fly from the \(NxB\) matrix that is stored, so again should be avoided for large datasets.
The package is compatible with HDF5 Matrices, meaning that the package will run if the data given is a reference to an HDF5 file. However, the code may achieve this compatibility by bringing the full matrix into memory. In particular, the default clustering routines are not compatible with the HDF5 implementation, meaning that they must bring the full dataset into memory for calculations.
The only exception to this is the method “mbkmeans” which calls on the clustering routine (from the package of the same name). This package implements a version of kmeans (“Mini-Batch kmeans”) that truly works with the structure of the HDF5 datasets to avoid bringing the full dataset into memory. “Mini-batch kmeans” refers to only using a proportion of the data (a “batch”) at each iteration of the clustering. The
mbkmeans package integrates this with HDF5 files, among other formats, meaning that mbkmeans actually is written (in C code) so as to not bring the entire dataset into memory but only the subset (or batch) needed for any particular calculation.
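The batch-at-a-time idea can be sketched in plain R. The function below is a minimal in-memory illustration of a generic mini-batch kmeans update (each sampled point pulls its nearest center toward it with a per-center step size of one over the number of points seen so far). It is not the mbkmeans implementation and does not read from HDF5; `mini_batch_kmeans` and its arguments are hypothetical names chosen for this example.

```r
# Minimal mini-batch kmeans sketch (illustration only, not mbkmeans itself).
# x: numeric matrix with samples in rows; k: number of clusters.
mini_batch_kmeans <- function(x, k, batch_size = 50, iters = 100) {
  batch_size <- min(batch_size, nrow(x))
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centers
  n_seen <- numeric(k)                              # points absorbed per center
  for (it in seq_len(iters)) {
    # Only this batch of rows is touched in the current iteration.
    batch <- x[sample(nrow(x), batch_size), , drop = FALSE]
    # Squared Euclidean distance from each batch row to each center.
    d <- vapply(seq_len(k), function(j) {
      rowSums(sweep(batch, 2, centers[j, ])^2)
    }, numeric(nrow(batch)))
    d <- matrix(d, nrow = nrow(batch))
    nearest <- max.col(-d, ties.method = "first")
    for (i in seq_len(nrow(batch))) {
      j <- nearest[i]
      n_seen[j] <- n_seen[j] + 1
      eta <- 1 / n_seen[j]                          # per-center learning rate
      centers[j, ] <- (1 - eta) * centers[j, ] + eta * batch[i, ]
    }
  }
  centers
}

# Toy usage on two simulated groups of points.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(200, mean = 5, sd = 0.1), ncol = 2))
centers <- mini_batch_kmeans(x, k = 2)
dim(centers)
```

Because each iteration only indexes `batch_size` rows of `x`, the same scheme can in principle be applied to data stored on disk, which is the property the text attributes to mbkmeans.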
mbkmeans package, however, the integration in
clusterExperiment has not been tested to ensure that the full dataset is not inadvertently brought into memory by other components of the
clusterExperiment infrastructure. This is an ongoing area for improvement. (So far, integration of
mbkmeans as a built-in option in
clusterExperiment has only been tested to the extent that it successfully runs the clustering routine.)
mbkmeans: if using
mbkmeans with subsample=TRUE, then the ‘classify’ function (i.e. the assignment of samples that were not part of the subsample to a clustering) is not part of
mbkmeans, and may bring the entire matrix into memory (when classify is