Skip to content.

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.



ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Machine Learning Lab} %\VignetteDepends{ALL, Biobase, genefilter, e1071, geneplotter, hgu95av2} %\VignetteKeywords{Expression Analysis, Reproducible Research} %\VignettePackage{Milan} \documentclass[11pt]{article}

\usepackage{times} \usepackage{hyperref}

\usepackage[authoryear,round]{natbib} \usepackage{comment} \usepackage{amsmath,pstricks,fullpage} \usepackage{theorem} \usepackage{float}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in %\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}}

\theoremstyle{break} \newtheorem{Ex}{Exercise}


\begin{document} \title{Machine Learning Lab} \author{R. Gentleman, V. Carey, W. Huber} \maketitle \tableofcontents


In this lab we explore the use of trees, nnets, support vector machines (SVM) and random forests for classification in microarray experiments. The lab and some of the experiments described here were based on the exercises ``Microarray Analaysis and Classification by SVM and PAM'' by R.~Spang and F.~Markowetz from the NGFN 1 microarray courses.

The data is that reported by Chiaretti \textit{et al} and contained in the package \Rpackage{ALL}. The software for fitting the svms comes from the package \Rpackage{e1071}.


library(ALL) library(e1071) library(Biobase) library(genefilter)


The first step in our process of analysing these data is to perform the basic transformations; we reduce the data to a comparison of BCR/ABL and NEG.

<>= data("ALL") ## subset of interest: 37+42 samples sel1 <- grep("^B", as.character(ALL$BT)) sel2 <- which(as.character(ALL$mol) %in% c("BCR/ABL","NEG")) eset <- ALL[, intersect(sel1, sel2)] @

We first apply a non-specific filtering step to remove those genes which show low values of expression or which show little variation across samples. Note that the non-specific filtering step does \textbf{not} select genes with respect to their ability to classify any particular set of samples.

Note however that, if you want to use cross-validation to estimate the performance of a classifier (e.g. mis-classification rate), the variable selection step should be included in the cross-validation~\citep{Ambroise.2002}.

<>= library(genefilter) ## intensities above 200 in at least 25% of the samples ##just so we have fewer genes and it runs in finite time f1 <- pOverA(.25, log2(200)) f2 <- function(x)(IQR(x)>0.5) ff <- filterfun(f1, f2) selected <- genefilter(eset, ff) sum(selected) esetSub <- eset[selected,] @

The filtering has selected \Sexpr{sum(selected)} genes that seem worthy (according to the criteria imposed) of further investigation.

%----------------------------------------------------------- \section{An analysis using rpart} \label{sec.rpart} %----------------------------------------------------------- First, let's set up a response variable \Robject{Y} and explanatory variables (genes) \Robject{X}:

<<>>= Y <- factor(esetSub$mol.biol) #$ table(Y) X <- t(exprs(esetSub)) @

<>= library(rpart) df <- data.frame(Y=Y, X=X) tr <- rpart(Y ~ ., data=df) @

<>= plotcp(tr) @

This seems strange. The best cross-validated (``X-val'') relative error is obtained with a \textit{two node} tree.

<>= print(tr) print(summary(tr)) @

Now we see what is happening. The chosen gene creates two nearly pure nodes. And several other genes do this.

How strongly conditioned is this result on this particular dataset? To answer that question you can randomly split your data into a test set and a training set. You fit a model on the test set and assess it on the training set.

%--------------------------------------------------------------------- \section{Test and training set} %--------------------------------------------------------------------- So, let us randomly divide the data into two groups, so that we can use them for test and training purposes.

<>= a1=which(as.character(Y)=="NEG") a2 = which(as.character(Y)=="BCR/ABL") set.seed(100) sub1N = sample(a1, 21) sub1B = sample(a2, 19) sub2N = a1[!(a1 %in% sub1N)] sub2B = a2[!(a2 %in% sub1B)] testSet = esetSub[, c(sub1N, sub1B)] testY = Y[c(sub1N, sub1B)] trainSet = esetSub[, c(sub2N, sub2B)] trainY = Y[c(sub2N, sub2B)] @

\begin{Ex} \begin{itemize} \item Write a function that will create a data set divided into two groups at at given proportion (e.g. 75\% training and 25\% test set). It should take an \Robject{exprSet} object and the name of a two class variable in its \Robject{pData} slot on which to balance. \end{itemize} \end{Ex}

\begin{Ex} Now we can return to the classification tree from Section~\ref{sec.rpart}. \begin{enumerate} \item Fit a model using the genes in \Robject{trainSet}. Assess the fit of that model. \item Now use those genes to fit a model using those best performing genes in your test data set. \item Carry out a different split of your data into test and training sets and repeat. \end{enumerate} \end{Ex}

%------------------------------------------------------------ \section{Neural network} %------------------------------------------------------------ Unfortunately, we cannot use all the genes as inputs to the current nnet implementation, at least on a modestly endowed machine. We'll use a randomly reduced set of genes.

<>= library(nnet) df <- data.frame(Y=Y, X=X[, sample(ncol(X), 150)]) nn1 <- nnet(Y~., data=df, size=5, decay=.01, MaxNWts=5000) @

<>= print(nn1) print(table(predict(nn1,type="class"), Y)) @

\begin{Ex} \begin{enumerate} \item Is this a reproducible analysis? How can it be made so? \item Does the number of iterations need to be increased? \item Use the tune.nnet procedure of package e1071 to choose effective size and decay parameters that may lead to a more parsimonious model. \item Apply this procedure to your test set and training set. \end{enumerate} \end{Ex}

%------------------------------------------------------------ \section{$k$ nearest neighbors} %------------------------------------------------------------

The user interface of this procedure is a bit unusual: it asks for the matrix of `training records', a matrix of `test records', the true classes for the training records, and then parameters of the procedure. Here's a simple example.

<<>>= library(class) knn1 <- knn(t(exprs(trainSet)), t(exprs(testSet)), trainY, k=1) table(knn1, testY)


\begin{Ex} \begin{enumerate} \item Repeat the exercise with different values of $k$ and $l$. \item Use $$ on the original data. What does it do? How does its performance compare to that of the separate test and training sets. \end{enumerate} \end{Ex}

%------------------------------------------------------------ \section{Support Vector Machines} %------------------------------------------------------------

The function \Rfunction{svm} asks for the specification of a data matrix \Robject{x} and a response \Robject{y}. The labels in \Robject{y} correspond to the rows of \Robject{x}.

<>= Xm <- t(exprs(trainSet)) svm1 <- svm(Xm, trainY, type="C-classification", kernel="linear")

@ %$

We can now explore the prediction training error.


trpred <- predict(svm1, Xm) sum( trpred != trainY) table(trpred, trainY)


As anticipated there are no errors in the prediction on the training set. We can also employ cross-validation to see what the cross-validated training set error rate is.


trcv <- svm(Xm, trainY, type="C-classification", kernel="linear", cross=10)



Of course we also have a test data set and we can use the svm based on the training set to predict the class of samples in the test set.


Xmtr <- t(exprs(testSet)) tepred <- predict(svm1, Xmtr) sum(tepred != testY) table(tepred, testY)


\begin{Ex} \begin{enumerate} \item How many samples are misclassified? \item Does \Rfunction{svm} seem to work better or worse than \Rfunction{knn}? \item Compare the test and training set error rates. \item Reverse the roles of the test and training data sets. \item Repeat the cross-validation part using the whole data set. \end{enumerate}


If both the test set and the training set represent samples drawn from the same population (which they need to if we are going to use one of them to assess the performance of the other), then we should be able to consider a larger experiment. The two data sets combined simply represent a larger sample. The labels ``training set'' and ``test set'' are irrelevant and one should be able to draw (or create) very many pseudo-test and training sets.

An easy way to start exploring that particular setting is to combine the data into one large dataset.

%% \subsection{How to efficiently do the exercises}

%% The easiest way to do the exercises that are given here is to use %% \Rfunction{Sweave} and \Rfunction{Stangle}. %% \begin{enumerate} %% \item Set the working directory to the location of the directory that %% contains the file \texttt{SVM.Rnw}. %% \item Run the command, \verb+Stangle("SVM.Rnw", driver=Rtangle())+ %% \end{enumerate} %% This will create file called \texttt{SVM.R} which contains all the R %% commands. You can now open this in your favorite editor and change the %% names as appropriate.

\subsection{How good is SVM really?}

Spang and Markowitz suggest the following experiment. Suppose that we take the training data set. In our training set there are 18 samples with BCR/ABL and 21 NEG samples. In the previous exercises we selected genes that were good predictors of the different classes. Suppose instead we make up some class labels that are not associated with any biologically meaningful variables (that we know of).


newlabs <- sample(c(rep("B", 18), rep("N",21)), 39) table(newlabs, trainY) funnysvm <- svm(Xm, newlabs, type="C-classification", kernel="linear")

fpred <- predict(funnysvm, Xm) sum( fpred != newlabs) table(fpred, newlabs)

@ %$

Wow, we can predict perfectly! And as for cross-validation, well, <>=

trfcv <- svm(Xm, newlabs, type="C-classification", kernel="linear", cross=10)

summary(trfcv) @ now we see that the error rate is a bit higher than for the correct categories.

However, this does not always turn out to be the case.

So, what is going on? Basically classifiers such as svm, randomForests and neural networks are very flexible and very good at what they are doing. The ability to correctly classify (on the training set) is no evidence of anything more than the capability of the classifier and the richness of the feature space.

\section{Random Forests}

In this part of the laboratory exercise we will use the random forests \citep{RanForests1999,RanForests1999} and the \Rpackage{randomForest} package to further explore the data in the \Rpackage{golubEsets} package.




Basic use of the random forest technology is fairly straightforward. The only parameter that seems to be very important is \Robject{mtry}. This controls the number of features that are selected for each split. The default value is the square root of the number of features but often a smaller value tends to have better performance.


set.seed(123) rf1 <- randomForest(Xm, trainY, ntree=2000, mtry=55, importance=TRUE) rf1

rf2 <- randomForest(Xm, trainY, ntree=2000, mtry=35, importance=TRUE)



Notice that the predictive capabilities of random forests do not seem to be as good as those of svm. Random forests seems to have some difficulties when the sizes of the groups are not approximately equal. There is a \Robject{weight} argument that can be given to the random forest function but it appears to have little or no effect.

We can use the prediction function to assess the ability of these two forests to predict the class for the test set.


p1 <- predict(rf1, Xm, prox=TRUE) table(trainY, p1$pred) p2 <- predict(rf2, Xm, prox=TRUE) table(trainY, p2$pred) @

\subsection{Feature Selection}

One of the nice things about the random forest technology is that it provides some indication of which variables were most important in the classification process. These features can be compared to those selected by $t$-test or other means.

The current version of \Rpackage{randomForest} produces four different variable importance statistics. Breiman has recently recommended that only two of those be considered (the other two are too unstable). The ones to concentrate on are measures two and four.

In the next code chunk a small function is defined that can be used to extract the most important variables (those with the highest scores).

<>= var.imp.plot(rf1, n.var=15) @ <>= var.imp.plot(rf2, n.var=15) @

<>= impvars <- function(x, which=2, k=10) { v1<-order(x$importance[,which]) l1 <- length(v1) x$importance[v1[(l1-k+1):l1], which] } iv.rf1 <- impvars(rf1, k=25) library("hgu95av2") library(annotate) isyms <- getSYMBOL(names(iv.rf1),data="hgu95av2")


\subsection{More exercises}

Again a number of interesting exercises present themselves. \begin{Ex} \begin{enumerate} \item Reverse the role of the test set and the training set and see how the estimated prediction errors change. \item Use the whole data set to build a random forest. How well does it do? \end{enumerate} \end{Ex}

There are very many other classifiers available to you. \begin{itemize} \item Linear discriminant analysis \Rfunction{lda}, from the \Rpackage{MASS} package. \item Logistic discrimination, \Rfunction{multinom}, in \Rpackage{nnet}. \item Generalized partial least squares, \Rpackage{gpls}. \item So-called ``Prediction Analysis for Microarrays'', \Rpackage{rpam}. \end{itemize}

\begin{thebibliography}{10} \bibitem{RanForests1999} Breiman L. \newblock Random forests - random features. \newblock \textit{Preprint Department of Statistics, U.C. Berkeley, no. 567} (1999).

\bibitem{RanForests2001} Breiman L. \newblock Notes on setting up, using, and understanding random forests V3.0 (2001).

\bibitem{Ambroise.2002} Ambroise C and McLachlan GJ. \newblock Selection bias in gene extraction on the basis of microarray gene-expression data. \newblock\textit{Proc Natl Acad Sci} 99(10): 6562-6 (2002).

\end{thebibliography} \end{document}


BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.


R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.