library("knitr")
library("Biobase")
library("Hiiragi2013")
library("glmnet")
library("mlr")
library("ggplot2")

What you will learn in this lab

We will perform a binary classification of transcriptome (cDNA) samples from single cells into two groups, based on their microarray expression data. We will learn how to use cross-validation to get a reasonable estimate of the misclassification error, free of overfitting bias. Moreover, we will learn how to use cross-validation to optimise the tuning parameters on which the classifier algorithm depends. These two cross-validation loops have independent objectives and are performed in a nested manner: the inner loop optimises the tuning parameters, the outer loop estimates the classifier's performance.

There are many different classification algorithms, based on different mathematical ideas, some more and some less suitable for particular types of data and applications. They come in different R packages and often have their own idiosyncratic ways of accepting data and parameters and of returning results.

We use the mlr package, which provides uniform wrappers around many of these algorithms, so that we can swap them out and try different ones without constantly having to change our code. mlr also provides a powerful meta-language to describe and execute the (possibly nested) cross-validation schemes.

Load the example data from the Hiiragi2013 package.

These are microarray expression profiles of single cells from mouse embryos at stages E3.25, E3.5 and E4.5. See Y. Ohnishi, W. Huber et al., Cell-to-cell expression variability followed by signal reinforcement progressively segregates early mouse lineages. Nature Cell Biology, 16:27 (2014).

data( "x", package = "Hiiragi2013" )
x
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 45101 features, 101 samples 
##   element names: exprs 
## protocolData
##   sampleNames: 1 E3.25 2 E3.25 ... 101 E4.5 (FGF4-KO) (101 total)
##   varLabels: ScanDate
##   varMetadata: labelDescription
## phenoData
##   sampleNames: 1 E3.25 2 E3.25 ... 101 E4.5 (FGF4-KO) (101 total)
##   varLabels: File.name Embryonic.day ... sampleColour (8 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1415670_at 1415671_at ... AFFX-TrpnX-M_at (45101 total)
##   fvarLabels: symbol genename ensembl
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: mouse4302
table( x$sampleGroup )
## 
##           E3.25 E3.25 (FGF4-KO)      E3.5 (EPI)  E3.5 (FGF4-KO)       E3.5 (PE)      E4.5 (EPI) 
##              36              17              11               8              11               4 
##  E4.5 (FGF4-KO)       E4.5 (PE) 
##              10               4

x is an ExpressionSet object that was obtained from the Affymetrix raw data by RMA normalization.
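
For orientation, the main components of an ExpressionSet can be inspected with the standard Biobase accessors; this is a quick look, not part of the analysis proper:

dim( exprs(x) )                                       # matrix of normalized intensities: features x samples
pData(x)[ 1:3, c("Embryonic.day", "sampleGroup") ]    # per-sample annotations
head( fData(x) )                                      # per-feature (probe set) annotations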

Convert it into a data.frame as expected by the mlr functions. Here we also perform independent filtering to remove likely uninformative variables. One motivation for doing this is that probes with very small variation might be the ones (relatively speaking) most affected by batch effects.

rowV <- data.frame( v = rowVars(exprs(x)) )
ggplot( rowV, aes( x = log10(v) ) ) + geom_histogram( binwidth = 0.05, fill = "skyblue" )

selectionThreshold <- 10^(-0.5)
selectedFeatures  <- ( rowV$v > selectionThreshold )
embryoSingleCells <- data.frame( t(exprs(x)[selectedFeatures, ]), check.names = TRUE )
embryoSingleCells$tg <- factor( ifelse( x$Embryonic.day == "E3.25", "E3.25", "other") )
with( embryoSingleCells, table( tg ) )
## tg
## E3.25 other 
##    53    48

Set up the machine learning problem

mlr requires us to create a series of objects that tell it what we want it to do. First, we need to define the task:

task <- makeClassifTask( id = "Hiiragi", data = embryoSingleCells, target = "tg" )

Define the learner:

lrn <- makeLearner( "classif.glmnet", predict.type = "prob" )
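
Because mlr wraps many algorithms behind the same interface, trying a different classifier is a one-line change; for example (a sketch, assuming the randomForest package is installed):

# listLearners( "classif" )   # overview of the classification learners mlr knows about
lrnRF <- makeLearner( "classif.randomForest", predict.type = "prob" )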

Define the resampling strategy:

rdesc <- makeResampleDesc( method = "CV", stratify = TRUE, iters = 12 )
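
Here "CV" requests stratified 12-fold cross-validation. Other resampling schemes are described through the same interface; for instance (sketches, not used below):

rdescRep <- makeResampleDesc( method = "RepCV", folds = 10, reps = 3, stratify = TRUE )
rdescSub <- makeResampleDesc( method = "Subsample", iters = 30, split = 2/3, stratify = TRUE )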

Do the resampling:

r <- resample(learner = lrn, task = task, resampling = rdesc )

This code runs for a while. Now we are ready to explore the results. Get the mean misclassification error:

r
## Resample Result
## Task: Hiiragi
## Learner: classif.glmnet
## mmce.aggr: 0.01
## mmce.mean: 0.01
## mmce.sd: 0.04
head( r$pred$data )
##                    id truth prob.E3.25  prob.other response iter  set
## 7 E3.25             7 E3.25  0.9856487 0.014351322    E3.25    1 test
## 21 E3.25           21 E3.25  0.9814871 0.018512933    E3.25    1 test
## 27 E3.25           27 E3.25  0.9097841 0.090215895    E3.25    1 test
## 75 E3.25 (FGF4-KO) 75 E3.25  0.9975721 0.002427891    E3.25    1 test
## 45 E3.5 (EPI)      45 other  0.1674613 0.832538724    other    1 test
## 49 E3.5 (PE)       49 other  0.2944465 0.705553509    other    1 test
with( r$pred$data, table(truth, response) )
##        response
## truth   E3.25 other
##   E3.25    53     0
##   other     1    47
ggplot( r$pred$data, aes( x = truth, y = prob.E3.25, colour = response ) ) + geom_point()
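
Besides the aggregated error printed above, the ResampleResult object keeps the performance on each individual fold, which helps to judge the variability of the estimate:

r$measures.test                 # misclassification error on each of the 12 folds
mean( r$measures.test$mmce )    # the mean is the aggregated mmce reported above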

Wrapped learning: internal cross-validation to set tuning parameters

The learner glmnet depends on two parameters: the regularisation penalty \(\lambda\), and the parameter \(\alpha\) (alpha) that controls the respective weights of the \(L_1\) and \(L_2\) regularisation terms. So far, we used the defaults, but maybe we can do better.
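
For concreteness: writing \(\beta\) for the coefficient vector, glmnet minimises the negative log-likelihood plus the elastic net penalty

\[ \lambda \left( \frac{1-\alpha}{2} \, \lVert\beta\rVert_2^2 + \alpha \, \lVert\beta\rVert_1 \right), \]

so that \(\alpha = 1\) corresponds to the pure lasso (\(L_1\) only) and \(\alpha = 0\) to ridge regression (\(L_2\) only).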

Another parameter in the above data preprocessing pipeline was the choice of selectionThreshold, which we made by eyeballing the histogram. But we could also choose it in an optimal manner, and avoid overfitting by doing so via cross-validation; one way to set this up is sketched below.
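
A sketch only, not run here: it uses mlr's built-in variance filter together with the tuning machinery that the next paragraphs introduce in detail, and the parameter names fw.method and fw.perc follow the makeFilterWrapper documentation. The idea is to wrap the learner in a feature filter whose stringency is itself tuned by an inner cross-validation:

# note: for this to be meaningful, the task should be built from the unfiltered data
filtLrn <- makeFilterWrapper( lrn, fw.method = "variance" )
filtTuneLrn <- makeTuneWrapper( filtLrn,
  resampling = makeResampleDesc( "CV", iters = 10, stratify = TRUE ),
  par.set    = makeParamSet( makeNumericParam( "fw.perc", lower = 0.01, upper = 0.2 ) ),
  control    = makeTuneControlGrid( resolution = 5 ) )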

In the function predict.glmnet, the chosen value of \(\lambda\) that is used for prediction is called s, and that is also how it is called in the mlr wrappers.
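
To see where s comes from, here is how glmnet could be used directly, outside of mlr (a sketch on the training data, for illustration of the s argument only; the value s = 0.01 is arbitrary):

featureMatrix <- as.matrix( embryoSingleCells[ , colnames(embryoSingleCells) != "tg" ] )
fit <- glmnet( x = featureMatrix, y = embryoSingleCells$tg, family = "binomial" )  # full regularisation path
predict( fit, newx = featureMatrix[ 1:4, ], s = 0.01, type = "class" )             # predict at penalty s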

In the code below, we state that we want to tune s between 0.001 and 0.1 in 6 steps, and that we want to use an inner loop of 10-fold cross-validation for the tuning. Note that this inner resampling description has nothing to do with the object rdesc that we created above, which describes the outer cross-validation.

tuningLrn <- makeTuneWrapper(lrn, 
  resampling = makeResampleDesc("CV", iters = 10,  stratify = TRUE), 
  par.set = makeParamSet(makeNumericParam("s", lower = 0.001, upper = 0.1)), 
  control = makeTuneControlGrid(resolution = 6) )
r2 <- resample(learner = tuningLrn, task = task, resampling = rdesc )

This code runs for a while.

r2
## Resample Result
## Task: Hiiragi
## Learner: classif.glmnet.tuned
## mmce.aggr: 0.01
## mmce.mean: 0.01
## mmce.sd: 0.04
with( r2$pred$data, table(truth, response) )
##        response
## truth   E3.25 other
##   E3.25    53     0
##   other     1    47
ggplot( r2$pred$data, aes( x = truth, y = prob.E3.25, colour = response ) ) + geom_point()
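
If we want to know which value of s the inner cross-validation chose in each of the outer folds, we can ask resample to keep the tuning results via its extract argument (a sketch; getTuneResult is from mlr, and each element of the resulting extract list is a TuneResult whose x slot holds the chosen parameter values):

r3 <- resample( learner = tuningLrn, task = task, resampling = rdesc, extract = getTuneResult )
sapply( r3$extract, function(tr) tr$x$s )   # chosen s, one value per outer fold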

Exercises

Session Info

sessionInfo()
## R Under development (unstable) (2015-05-20 r68389)
## Platform: x86_64-apple-darwin14.3.0/x86_64 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] BiocStyle_1.6.0      mlr_2.3              ggplot2_1.0.1        BBmisc_1.9          
##  [5] ParamHelpers_1.5     glmnet_2.0-2         foreach_1.4.2        Matrix_1.2-0        
##  [9] Hiiragi2013_1.4.0    xtable_1.7-4         RColorBrewer_1.1-2   mouse4302.db_3.1.2  
## [13] org.Mm.eg.db_3.1.2   RSQLite_1.0.0        DBI_0.3.1            MASS_7.3-40         
## [17] KEGGREST_1.8.0       gtools_3.4.2         gplots_2.17.0        geneplotter_1.46.0  
## [21] annotate_1.46.0      XML_3.98-1.1         AnnotationDbi_1.30.1 GenomeInfoDb_1.4.0  
## [25] IRanges_2.2.1        S4Vectors_0.6.0      lattice_0.20-31      genefilter_1.50.0   
## [29] cluster_2.0.1        clue_0.3-49          boot_1.3-16          affy_1.46.0         
## [33] Biobase_2.28.0       BiocGenerics_0.14.0  knitr_1.10.5        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.6           png_0.1-7             Biostrings_2.36.1     digest_0.6.8         
##  [5] plyr_1.8.2            evaluate_0.7          httr_0.6.1            BiocInstaller_1.18.2 
##  [9] zlibbioc_1.14.0       gdata_2.16.1          checkmate_1.5.3       preprocessCore_1.30.0
## [13] rmarkdown_0.6.1       labeling_0.3          proto_0.3-10          splines_3.3.0        
## [17] stringr_1.0.0         munsell_0.4.2         htmltools_0.2.6       codetools_0.2-11     
## [21] bitops_1.0-6          grid_3.3.0            gtable_0.1.2          magrittr_1.5         
## [25] formatR_1.2           scales_0.2.4          KernSmooth_2.23-14    stringi_0.4-1        
## [29] XVector_0.8.0         reshape2_1.4.1        affyio_1.36.0         parallelMap_1.2      
## [33] latticeExtra_0.6-26   iterators_1.0.7       tools_3.3.0           survival_2.38-1      
## [37] yaml_2.1.13           colorspace_1.2-6      caTools_1.17.1