Skip to content.

bioconductor.org

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.

Sections

Lab1.Rnw

% NOTE
ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \VignetteDepends{Biobase, hgu95av2, GO, class, geneplotter} % \VignetteIndexEntry{Bioconductor R Laboratory Exercises} % \VignetteKeywords{tutorial, technical duplicates, probe data} % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \documentclass[12pt]{article}

\usepackage{amsmath,pstricks,fullpage} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{theorem} \usepackage{float}

\parindent 0in % Left justify

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\textit{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}}

\theoremstyle{break} \newtheorem{Ex}{Exercise}

\bibliographystyle{plainnat}

\title{Lab: Using R and Bioconductor}

\author{Robert Gentleman \\ Wolfgang Huber \\ Paul Murrell}

\begin{document}

\maketitle

\section*{Introduction}

In this lab we will cover some basic uses of R and also begin working with some of the Bioconductor data sets and tools. Topics covered include basic use of R, R graphics, working with environments as hash tables.

\section*{Some Basic R}

First load the \Rpackage{Biobase} package and then the data set \Robject{eset}.

<>= library("Biobase") data(eset) eset @

The expression set is an S4 class and \Robject{eset} is an instance of this class. You can get help (a description of the class) by using the \verb+?+ operator; \verb+class?exprSet+.

<>=

class(eset) slotNames(eset) eset$cov1 eset[1,] eset[,1]

@

You can extract the values in the slots using the \verb+@+ operator, or in many cases accessor functions are available. The names of the slots can be obtained using \Rfunction{slotNames}, as shown above. Extract the values for some of the named slots.

\begin{Ex} What happens when we subset \verb+eset+? What kind of an object do we get? What happened to the phenotypic data? What happened to the expression data? Subset \Robject{eset} by selecting all elements for which \Robject{cov1} has value \verb+1+. \end{Ex}

\subsection*{Environments}

In R an \Robject{environment} is a set of symbol-value pairs. These are very similar to lists, but there is no natural ordering of values and you cannot make use of numeric indices. Otherwise they behave the same way.

We first create an environment and then add, remove, list etc.

<>=

e1 = new.env(hash=TRUE) e1$a = rnorm(10) e1$b = runif(20) ls(e1) xx = as.list(e1) names(xx)

@

\begin{Ex} \begin{itemize} \item Create an environment and put a copy of \Robject{eset} into it. \item Fit a linear model to the data \verb+x=1:10+, \verb+y=2*x+rnorm(10, sd=0.25)+, and also place this into your environment. \item Write a function, \Rfunction{myExtract}, that takes an environment as an argument and returns a list, one element is the variable \Robject{cov2} from \Robject{eset} and the other is the vector of coefficients from the linear model. \end{itemize} \end{Ex}

\subsection*{Something Harder}

Later we will spend some time discussing machine learning (ML), but here we will just use one simple algorithm, $k$-nearest neighbors ($knn$) to make predictions. You should read the R manual page for a description of $knn$.

<>= library("class") apropos("knn") @

The $knn$ algorithm predicts the class of a given observation (the test case) according to a majority vote of the $k$ nearest neighbors in the training set. We will show how you can use this to predict the class of sample 1, given data on samples 2 through 26.

<>=

exprsEset = exprs(eset) classEset = eset$cov2 esub = eset[,-1] pred1 = knn(t(exprs(esub)), exprs(eset)[,1], esub$cov2)

classEset[1] @

\begin{Ex} \begin{itemize} \item Write a function, that takes an \Rclass{exprSet} as its input and carries out a leave-one-out set of predictions. \item Your function should return the vector of predicted values for the given covariate. \item Modify your function to allow the user to specify some of the parameters for $knn$, such as $k$. \end{itemize} \end{Ex}

\subsection*{The \texttt{apply} functions}

In R a great deal of work is done by applying some function to all elements of a list, matrix or array. There are several functions available for you to use, \Rfunction{apply}, \Rfunction{lapply}, \Rfunction{sapply} are the most commonly used. From the next release of R onwards there will also be an \Rfunction{eapply} for use with environments.

To get some understanding of the apply functions we will attempt to extract some information from the Gene Ontology information that is supplied with each data package.

This next code chunk shows how to use apply to extract all the molecular function GO terms for each Affymetrix probe set.

<>= library("GO") library("hgu95av2") affyGO = as.list(hgu95av2GO) #find the MF terms affyMF = lapply(affyGO, function(x) { onts = sapply(x, function(z) z$Ontology) if( is.null(onts) || is.na(onts) ) NA else unique(names(onts)[onts=="MF"]) })

@

\begin{Ex} \begin{itemize} \item How are the GO terms stored? What information is available for each? \item What are the evidence codes and what do they mean? \item Turn this code into a function that would allow users to obtain either the MF, BP or CC data. \item Extend this function to allow the user to include only given evidence codes. (Or if you think it better - to exclude specific codes). \end{itemize} \end{Ex} \section*{Graphics}

In this section you will work through some examples that allow you to create very general plots in R.

Given the following data produce a plot that looks like the one in Figure~\ref{fig1}. The relevant features are the tick marks on the y-axis and the vertical positioning of the data symbols. <>= x1 <- 1:10 y1 <- as.factor(rep(c("A","B"), c(5,5))) @

\begin{figure}[h] \begin{center} <>=

plot(x1, y1, ylim=c(0.5, 2.5), axes=FALSE) box() axis(1) axis(2, at=sort(as.numeric(unique(y1))), labels=levels(y1)) @ \caption{Figure for Graphics Question 1. \label{fig1}} \end{center} \end{figure}

The following data represent a value recorded regularly over time (e.g., television viewing gures). The variable \Robject{v} contains the raw values, and \verb+mean.8+ contains ``moving average'' values (for week $i$, the moving average is average of weeks $i-7, i-6, \ldots, i$).

<>= v <- rnorm(20) + 4 mean.8 <- rep(0, length(v) - 7) for (i in 1:length(mean.8)) mean.8[i] <- mean(v[i:(i+7)])

@

\begin{figure}[h] \begin{center} <>= mp <- barplot(v, density=10, xlab="Week Number", ylab="Number of Viewers", col=1) lines(mp[8:length(v)], mean.8, lty=1, lwd=2, col=2) points(mp[8:length(v)], mean.8, pch=16, col=2) mtext(as.character(1:20), side=1, at=mp, line=0) @ \caption{Figure for Graphics Question 2. \label{fig2}} \end{center} \end{figure}

Now, let us try for something that is related to bioinformatics. We first find out which genes are located in which chromosomes. <>= whCHR = unlist(mget(geneNames(eset), hgu95av2CHR)) table(whCHR) max(table(whCHR)) @ We can see how many genes from each chromosome are included in our data set. We want to plot these data, basically creating plots similar to those in \Rpackage{geneplotter}, such as \Rfunction{alongChrom}.

<>= library("geneplotter") @

\begin{Ex} \begin{itemize} \item Select a chromosome (any one) to produce your plot. \item Find out the length of this chromosome (in bases). [Hint: the necessary data is in \Rpackage{hgu95av2}.] \item Find the position for each gene, on your selected chromosome. [Hint: \Robject{hgu95av2CHRLOC}] \item Create a plot with a single horizontal line and add a tick mark for each gene (perpendicular to the horizontal line). \item Can you color the tick marks according to gene expression? \end{itemize} \end{Ex}

\end{document}

News
2009-10-26

BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.

2009-01-07

R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.