1 Introduction

doppelgangR is a package for identifying duplicate samples within or between datasets of transcriptome profiles. It is intended for microarray and RNA-seq gene expression profiles where biological replicates are ordinarily more distinct than technical replicates, as is the case for cancer types with “noisy” genomes. It is designed for situations where per-gene summaries are available but full genotypes are not, which is typical of public databases such as the Gene Expression Omnibus.

The doppelgangR() function identifies duplicates in three different ways:

This vignette focuses on the “expression” type of doppelgänger.

2 Data types

Identification of doppelgängers is effective for both microarray and log-transformed RNA-seq data, and even for matching samples that have been profiled by microarray and RNA-seq.
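As a minimal sketch of the log transformation mentioned above, the following uses a hypothetical raw count matrix named counts (genes in rows, samples in columns). The pseudocount of 1 is a common convention assumed here for illustration, not a requirement stated by doppelgangR:

```r
# Hypothetical RNA-seq count matrix: 3 genes (rows) x 2 samples (columns).
counts <- matrix(c(0, 10, 1000, 3, 15, 900), ncol = 2,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))

# log2 transform with a pseudocount of 1 so that zero counts stay finite.
logcounts <- log2(counts + 1)
```

The transformed matrix keeps the same dimensions and dimnames, so it can be wrapped in an ExpressionSet in the same way as microarray data.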

3 Case Study: Batch correction in Japanese datasets

We load datasets by Yoshihara et al. that have been curated in the curatedOvarianData package. These are objects of class ExpressionSet.


The doppelgangR function requires a list of ExpressionSet objects as input, which we create here:

testesets <- list(JapaneseA=GSE32062.GPL6480_eset,

Now run doppelgangR with default arguments, except for setting phenoFinder.args=NULL, which turns off checking for similar clinical data in the phenoData slot of the ExpressionSet objects:

results1 <- doppelgangR(testesets, phenoFinder.args=NULL)

This creates an object of class DoppelGang, which has print, summary, and plot methods. The output of the summary method is not shown here because it is voluminous.


The plot method creates a histogram of pairwise sample correlations for each within-study and between-study comparison:

par(mfrow=c(2,2), las=1)
plot(results1)

Figure 1: Doppelgängers identified on the basis of similar expression profiles
The vertical red lines indicate samples that were flagged.

One of these histograms can be drawn using the plot.pair argument:

plot(results1, plot.pair=c("JapaneseA", "JapaneseA"))
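To make concrete what "pairwise sample correlations" between two datasets look like, here is a simplified base R sketch on simulated data. It is an illustration only and leaves out any preprocessing that doppelgangR itself applies; the matrices datA and datB and the planted duplicate are hypothetical.

```r
set.seed(1)
# Two hypothetical datasets sharing 200 genes, with 5 samples each.
genes <- paste0("g", 1:200)
datA <- matrix(rnorm(200 * 5), nrow = 200,
               dimnames = list(genes, paste0("A", 1:5)))
datB <- matrix(rnorm(200 * 5), nrow = 200,
               dimnames = list(genes, paste0("B", 1:5)))

# Plant a near-duplicate: sample B3 is A2 plus a little measurement noise.
datB[, "B3"] <- datA[, "A2"] + rnorm(200, sd = 0.1)

# Pearson correlation between every sample in datA and every sample in datB.
cors <- cor(datA, datB)

# Locate the most highly correlated between-dataset pair.
top <- which(cors == max(cors), arr.ind = TRUE)
top.pair <- c(rownames(cors)[top[1, "row"]], colnames(cors)[top[1, "col"]])
```

Calling hist(cors) on this matrix would give a histogram of the same general kind as those drawn by the plot method, with the planted pair sitting far to the right of the other values.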

4 Important options

4.1 Changing sensitivity

If, after inspecting the histograms, you see that some visible outliers were not caught, or that non-outliers exceeded the threshold, you can change the default sensitivity using the argument:

outlierFinder.expr.args = list(bonf.prob = 0.5, transFun = atanh, tail = "upper")

The default bonf.prob of 0.5 is a reasonable but arbitrary trade-off between sensitivity and specificity. We have found that it often selects dataset pairs containing duplicates, but often does not find all the duplicate samples. Sensitivity can be increased by raising the bonf.prob argument, e.g.:

results1 <- doppelgangR(testesets, 
        outlierFinder.expr.args = list(bonf.prob = 1.0, transFun = atanh, 
                                       tail = "upper"))
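To give a feel for how a transformation such as atanh and an upper-tail threshold can interact, here is a simplified sketch. It is an illustration under stated assumptions, not the package's internal code: it Fisher-transforms simulated correlations, summarises them by a normal distribution, and flags values above a quantile whose tail probability is divided by the number of comparisons (a Bonferroni-style correction). The function flag_upper_outliers and its argument tail.prob are hypothetical names introduced for this example.

```r
# Hypothetical helper: flag unusually high correlations.
flag_upper_outliers <- function(cors, tail.prob = 0.5) {
  z <- atanh(cors)  # Fisher z-transformation of the correlations
  n <- length(z)
  # Bonferroni-style cutoff: per-comparison tail probability is tail.prob / n.
  cutoff <- qnorm(1 - tail.prob / n, mean = mean(z), sd = sd(z))
  z > cutoff
}

set.seed(1)
# 500 simulated background correlations plus one planted duplicate at 0.99.
cors <- c(tanh(rnorm(500, mean = 0.3, sd = 0.1)), 0.99)

flagged.default <- flag_upper_outliers(cors, tail.prob = 0.5)
flagged.sensitive <- flag_upper_outliers(cors, tail.prob = 1.0)
```

In this sketch a larger tail.prob lowers the cutoff, so it can only flag the same values or more. That is the sense in which a larger probability gives higher sensitivity at the cost of specificity.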

4.2 Use of the ExpressionSet

The doppelgangR() function takes as its main argument a list of ExpressionSet objects. If you only have matrices, you can easily convert them to ExpressionSet objects, for example:

library(Biobase)
mat <- matrix(1:4, ncol=2)
eset <- ExpressionSet(mat)
class(eset)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"

4.3 Parallelizing

The doppelgangR() function checks all pairwise combinations of datasets in a list of ExpressionSet objects. These dataset pairs can be checked in parallel on multiple processing cores by setting the BPPARAM argument. This functionality is imported from the BiocParallel package; see the “?BiocParallel::`BiocParallelParam-class`” documentation.

library(BiocParallel)
results2 <- doppelgangR(testesets, BPPARAM = MulticoreParam(workers = 8))
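The number of dataset pairs, and therefore the amount of work that can be spread across cores, grows quickly with the number of datasets. The base R sketch below enumerates the within-dataset and between-dataset comparisons for three hypothetical dataset names. It only illustrates the counting and is not taken from the package's code.

```r
# Hypothetical dataset names; any character vector of unique names works.
dataset.names <- c("A", "B", "C")

# Each dataset compared with itself (within-dataset duplicates).
within <- cbind(dataset.names, dataset.names)

# Each unordered pair of distinct datasets (between-dataset duplicates).
between <- t(combn(dataset.names, 2))

# One row per comparison: n * (n + 1) / 2 rows for n datasets.
dataset.pairs <- rbind(within, between)
colnames(dataset.pairs) <- c("first", "second")
n.pairs <- nrow(dataset.pairs)
```

Each row is an independent comparison, which is why the pairs lend themselves to parallel evaluation. Requesting more workers than there are pairs is unlikely to help.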

4.4 Caching

By default, the doppelgangR() function caches intermediate results to make re-running with different arguments faster. Turn caching off by setting the argument cache.dir=NULL.