Skills for Intermediate R Users

CSAMA 2014
Author: Martin Morgan (mtmorgan@fhcrc.org)
Date: 23 June, 2014

Refresher: working with data
- Case study: ALL phenotypic data
- Case study: Weighty matters
Making progress: classes, methods, and packages
- Case study: IRanges and GRanges
A sequence analysis package tour
Avoiding deadly sins: efficient code
- Case study: From iteration to vectorization
- Case study: Choosing algorithms

Refresher: working with data

Case study: ALL phenotypic data

This case study servers as a refresher basic input and manipulation of data.

Input a file that contains ALL (acute lymphoblastic leukemia) patient information

fname <- file.choose()   ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)

Check out the help page ?read.delim for input options, and explore basic properties of the object you've created, for instance…

class(pdata)

## [1] "data.frame"

colnames(pdata)

##  [1] "id"             "diagnosis"      "sex"            "age"           
##  [5] "BT"             "remission"      "CR"             "date.cr"       
##  [9] "t.4.11."        "t.9.22."        "cyto.normal"    "citog"         
## [13] "mol.biol"       "fusion.protein" "mdr"            "kinet"         
## [17] "ccr"            "relapse"        "transplant"     "f.u"           
## [21] "date.last.seen"

dim(pdata)

## [1] 127  21

head(pdata)

##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 6 4008 7/30/1997   M  17 B1        CR CR 9/27/1997   FALSE   FALSE
##   cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE  simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 6       FALSE complex alt.      NEG           <NA> NEG hyperd. FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997
## 6    TRUE      FALSE               REL     12/15/1997

summary(pdata$sex)

##    F    M NA's 
##   42   83    2

summary(pdata$cyto.normal)

##    Mode   FALSE    TRUE    NA's 
## logical      69      24      34

Remind yourselves about various ways to subset and access columns of a data.frame

pdata[1:5, 3:4]

##   sex age
## 1   M  53
## 2   M  19
## 3   F  52
## 4   M  38
## 5   M  57

pdata[1:5, ]

##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
##   cyto.normal       citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE     t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA        <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE     t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE     del(6q)      NEG           <NA> NEG dyploid FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997

head(pdata[, 3:5])

##   sex age BT
## 1   M  53 B2
## 2   M  19 B2
## 3   F  52 B4
## 4   M  38 B1
## 5   M  57 B2
## 6   M  17 B1

tail(pdata[, 3:5], 3)

##     sex age BT
## 125   M  19 T2
## 126   M  30 T3
## 127   M  29 T2

head(pdata$age)

## [1] 53 19 52 38 57 17

head(pdata$sex)

## [1] M M F M M M
## Levels: F M

head(pdata[pdata$age > 21,])

##      id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1  1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 3  3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4  4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5  4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 10 8001 1/15/1997   M  40 B2        CR CR 3/26/1997   FALSE   FALSE
## 11 8011 8/21/1998   M  33 B3        CR CR 10/8/1998   FALSE   FALSE
##    cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1        FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 3           NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4        FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5        FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 10       FALSE     del(p15)  BCR/ABL           p190 NEG    <NA> FALSE
## 11       FALSE del(p15/p16)  BCR/ABL      p190/p210 NEG dyploid FALSE
##    relapse transplant               f.u date.last.seen
## 1    FALSE       TRUE BMT / DEATH IN CR           <NA>
## 3     TRUE      FALSE               REL     10/15/1999
## 4     TRUE      FALSE               REL      1/23/1998
## 5     TRUE      FALSE               REL      11/4/1997
## 10    TRUE      FALSE               REL      7/11/1997
## 11   FALSE       TRUE BMT / DEATH IN CR           <NA>

It seems from below that there are 17 females over 40 in the data set, but when sub-setting pdata to contain just those individuals 19 rows are selected. Why? What can we do to correct this?

idx <- pdata$sex == "F" & pdata$age > 40
table(idx)

## idx
## FALSE  TRUE 
##   108    17

dim(pdata[idx,])

## [1] 19 21

Use the mol.biol column to subset the data to contain just individuals with 'BCR/ABL' or 'NEG', e.g.,

bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]

The mol.biol column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels?

bcrabl$mol.biol <- factor(bcrabl$mol.biol)

The BT column is a factor describing B- and T-cell subtypes

levels(bcrabl$BT)

##  [1] "B"  "B1" "B2" "B3" "B4" "T"  "T1" "T2" "T3" "T4"

How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so there are only two subtypes, B and T

table(bcrabl$BT)

## 
##  B B1 B2 B3 B4  T T1 T2 T3 T4 
##  4  9 35 22  9  4  1 15  9  2

levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)

## 
##  B  T 
## 79 31

Use xtabs() (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups

xtabs(~ BT + mol.biol, bcrabl)

##    mol.biol
## BT  BCR/ABL NEG
##   B      37  42
##   T       0  31

Use aggregate() to calculate the average age of males and females in the BCR/ABL and NEG treatment groups.

aggregate(age ~ mol.biol + sex, bcrabl, mean)

##   mol.biol sex   age
## 1  BCR/ABL   F 39.94
## 2      NEG   F 30.42
## 3  BCR/ABL   M 40.50
## 4      NEG   M 27.21

Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

t.test(age ~ mol.biol, bcrabl)

## 
##  Welch Two Sample t-test
## 
## data:  age by mol.biol
## t = 4.817, df = 68.53, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.135 17.224
## sample estimates:
## mean in group BCR/ABL     mean in group NEG 
##                 40.25                 28.07

boxplot(age ~ mol.biol, bcrabl)

plot of chunk ALL-age

Case study: weighty matters

This case study is a second walk through basic data manipulation and visualization skills. We use data from the US Center for Disease Control's Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 10000 observations from each of 1990 and 2010.

Input the data using read.csv(), creating a variable brfss to hold it. Use file.choose() to locate the data file BRFSS-subset.csv

fname <- file.choose()   ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read.csv(fname)

Exercises: base plotting functions

Explore the data using class(), dim(), head(), summary(), etc. Use xtabs() to summarize the number of males and females in the study, in each of the two years.
Use aggregate() to summarize the average weight in each sex and year.
Create a scatterplot showing the relationship between the square root of weight and height, using the plot() function and the main argument to annotate the plot. Note the transformed Y-axis. Experiment with different plotting symbols (try the command example(points) to view different points).
```
plot(sqrt(Weight) ~ Height, brfss, main="All Years, Both Sexes")
```
Color the female and male points differently. To do this, use the col argument to plot(). Provide as a value to that argument a vector of colors, subset by brfss$Sex.
Create a subset of the data containing only observations from 2010.
```
brfss2010 <- brfss[brfss$Year == "2010", ]
```
Create the figure below (two panels in a single figure). Do this by using the par() function with the mfcol argument before calling plot(). You'll need to create two more subsets of data, perhaps when you are providing the data to the function plot.
```
opar <- par(mfcol=c(1, 2))
plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Female", ],
    main="2010, Female")
plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Male", ],
    main="2010, Male")
```
```
par(mfcol=c(1, 1))
```
Plotting large numbers of points means that they are often over-plotted, potentially obscuring important patterns. Experiment with arguments to plot() to address over-plotting, e.g., pch='.' or alpha=.4. Try using the smoothScatter() function (the data have to be presented as x and y, rather than as a formula). Try adding the hexbin library to your R session (using library()) and creating a hexbinplot().

Exercises: Trellis graphics and the lattice package

R has a number of additional plotting facilities, both as part of the 'base' distribution and user-contributed packages. The lattice package adopts a particular philosophy about the presentation of data, and can be a very effective visualization tool.

Use library() to load the lattice package.
```
library(lattice)
```
Create the following figure using the xyplot() function with a formula and the brfss2010 data. The formula is sqrt(Weight) ~ Height | Sex, which can be read as square root of Weight as a function of Height, conditioned on Sex'.
```
xyplot(sqrt(Weight) ~ Height | Sex, brfss2010)
```
Add a background grid and a regression line to each panel using the argument type=c('p', 'g', 'r'); change the width (lwd) and color (col.line) of the regression line. For some hints on other arguments, see the help pages ?xyplot (for overall structure of the plot) and ?panel.xyplot (for how each 'panel' of data is plotted).
Create the following figure. Use the densityplot() function with the formula ~ sqrt(Weight). The group=Sex function argument creates separate lines for each sex. Be sure to use plot.points=FALSE to avoid a rug' of points at the base of the figure. Can you add a key (legend)?
```
densityplot(~sqrt(Weight), brfss2010, group=Sex, plot.points=FALSE)
```
Create the figure below using the bwplot function. The formula requires that Year be coerced to a factor, factor(Year).
```
bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss)
```
Create the Figure below, a violin plot, using bwplot() and the panel argument set to panel.violin. from ?bwplot we learn that panel is a function that determines how each panel is drawn; the details for controling the violin panel plot are described on the help page ?panel.violin.
```
bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss, panel=panel.violin)
```

(Advanced) We can write our own panel argument to lattice functions to influence how each panel is displayed. Here we add a point at the median age and weight.

xyplot(sqrt(Weight) ~ Height|Sex, brfss2010,
    panel = function(x, y, ...) {
        panel.xyplot(x, y, ...)
        panel.points(median(x, na.rm=TRUE), median(y, na.rm=TRUE), 
            cex=2, pch=20, col="red")
    },
    type=c("p", "g", "r"), lwd=2, col.line="red", xlim=c(120, 210))

plot of chunk lattice-panel

Making progress: classes, methods, and packages

This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities.

Motivation

Sequence analysis is specialized

Large data needs to be processed in a memory- and time-efficient manner
Specific algorithms have been developed for the unique characteristics of sequence data

Additional considerations

Re-use of existing, tested code is easier to do and less error-prone than re-inventing the wheel.
Interoperability between packages is easier when the packages share similar data structures.

Solution: use well-defined classes to represent complex data; methods operate on the classes to perform useful functions. Classes and methods are placed together and distributed as packages so that we can all benefit from the hard work and tested code of others.

Case study: IRanges and GRanges

The IRanges package defines an important class for specifying integer ranges, e.g.,

library(IRanges)

## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, as.vector, cbind, colnames, do.call,
##     duplicated, eval, evalq, get, intersect, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rep.int, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unlist

ir <- IRanges(start=c(10, 20, 30), width=5)
ir

## IRanges of length 3
##     start end width
## [1]    10  14     5
## [2]    20  24     5
## [3]    30  34     5

There are many interesting operations to be performed on ranges, e.g, flank() identifies adjacent ranges

flank(ir, 3)

## IRanges of length 3
##     start end width
## [1]     7   9     3
## [2]    17  19     3
## [3]    27  29     3

Consult the help page for flank, ?flank, and explore other range-based operations.

The IRanges class is part of a class hierarchy. To see this, ask R for the class of ir, and for the class definition of the IRanges class

class(ir)

## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"

getClassDef(class(ir))

## Class "IRanges" [package "IRanges"]
## 
## Slots:
##                                                                       
## Name:            start           width           NAMES     elementType
## Class:         integer         integer characterORNULL       character
##                                       
## Name:  elementMetadata        metadata
## Class: DataTableORNULL            list
## 
## Extends: 
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
## 
## Known Subclasses: "NormalIRanges"

Notice that IRanges extends the Ranges class. Now try entering ?"flank,<tab>, where <tab> means to press the tab key to ask for tab completion (may not be necessary in Rstudio). You can see that there are help pages for several different classes. Tab-complete to

?"flank,Ranges-method"

and verify that you're at the page that describes the method relevant to an IRanges instance.

The GenomicRanges package extends the notion of ranges to include features relevant to application of ranges in sequence analysis, particularly the ability to associate a range with a sequence name (e.g., chromosome) and a strand. Create a GRanges instance based on our IRanges instance, as follows

library(GenomicRanges)

## Loading required package: GenomeInfoDb

gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr

## GRanges with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [10, 14]      +
##   [2]     chr1  [20, 24]      -
##   [3]     chr2  [30, 34]      +
##   ---
##   seqlengths:
##    chr1 chr2
##      NA   NA

The notion of flanking sequence has a more nuanced meaning in biology. In particular we might expect that flanking sequence on the + strand would precede the range, but on the minus strand would follow it. Verify that flank applied to a GRanges object has this behavior.

flank(gr, 3)

## GRanges with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [ 7,  9]      +
##   [2]     chr1  [25, 27]      -
##   [3]     chr2  [27, 29]      +
##   ---
##   seqlengths:
##    chr1 chr2
##      NA   NA

Discover what classes GRanges extends, find the help page documenting the behavior of flank when applied to a GRanges object, and verify that the help page documents the behavior we just observed.

class(gr)

## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"

getClassDef(class(gr))

## Class "GRanges" [package "GenomicRanges"]
## 
## Slots:
##                                                                       
## Name:         seqnames          ranges          strand elementMetadata
## Class:             Rle         IRanges             Rle       DataFrame
##                                       
## Name:          seqinfo        metadata
## Class:         Seqinfo            list
## 
## Extends: 
## Class "GenomicRanges", directly
## Class "Vector", by class "GenomicRanges", distance 2
## Class "GenomicRangesORmissing", by class "GenomicRanges", distance 2
## Class "GenomicRangesORGRangesList", by class "GenomicRanges", distance 2
## Class "Annotated", by class "GenomicRanges", distance 3

?"flank,GenomicRanges-method"

Notice that the available flank() methods have been augmented by the methods defined in the GenomicRanges package.

It seems like there might be a number of helpful methods available for working with genomic ranges; we can discover some of these from the command line, indicating that the methods should be on the current search() path

showMethods(class="GRanges", where=search())

Use help() to list the help pages in the GenomicRanges package, and vignettes() to view and access available vignettes; these are also available in the Rstudio 'Help' tab.

help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")

A sequence analysis package tour

This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; many of the material will be covered in greater detail in subsequent labs and lectures.

Basics

Bioconductor packages are listed on the biocViews page. Each package has 'biocViews' (tags from a controlled vocabulary) associated with it; these can be searched to identify appropriately tagged packages, as can the package title and author.
Each package has a 'landing page', e.g., for GenomicRanges. Visit this landing page, and note the description, authors, and installation instructions. Packages are often written up in the scientific literature, and if available the corresponding citation is present on the landing page. Also on the landing page are links to the vignettes and reference manual and, at the bottom, an indication of cross-platform availability and download statistics.
A package needs to be installed once, using the instructions on the landing page. Once installed, the package can be loaded into an R session
```
library(GenomicRanges)
```
and the help system queried interactively, as outlined above:
```
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges
```

Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.

Important packages for analysis of differential expression include edgeR and DESeq2; both have excellent vignettes for exploration. Additional research methods embodied in Bioconductor packages can be discovered by visiting the biocViews web page, searching for the 'DifferentialExpression' view term, and narrowing the selection by searching for 'RNA seq' and similar.
Popular ChIP-seq packages include DiffBind for comparison of peaks across samples, ChIPQC for quality assessment, and ChIPpeakAnno for annotating results (e.g., discovering nearby genes). What other ChIP-seq packages are listed on the biocViews page?
Working with called variants (VCF files) is facilitated by packages such as VariantAnnotation, VariantFiltering, and ensemblVEP; packages for calling variants include, e.g., h5vc and VariantTools.
Several packages identify copy number variants from sequence data, including cn.mops; from the biocViews page, what other copy number packages are available? The CNTools package provides some useful facilities for comparison of segments across samples.
Microbiome and metagenomic analysis is facilitated by packages such as phyloseq and metagenomeSeq.
Metabolomics, chemoinformatics, image analysis, and many other high-throughput analysis domains are also represented in Bioconductor; explore these via biocViews and title searches.

Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.

The Biostrings package is used to represent DNA and other sequences, with many convenient sequence-related functions. Check out the functions documented on the help page ?consensusMatrix, for instance. Also check out the BSgenome package for working with whole genome sequences, e.g., ?"getSeq,BSgenome-method"
The GenomicAlignments package is used to input reads aligned to a reference genome. See for instance the ?readGAlignments help page and vigentte(package="GenomicAlignments", "summarizeOverlaps")
[rtracklayer][]'s import and export functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser. Check out the ?import page for basic usage.
The ShortRead and Rsamtools packages can be used for lower-level access to FASTQ and BAM files, respectively. Explore the ShortRead vignette and Scalable Genomics labs to see approaches to effectively processing the large files.

Annotation: Bioconductor provides extensive access to 'annotation' resources (see the AnnotationData biocViews hierarchy); these are covered in greater detail in Thursday's lab, but some interesting examples to explore during this lab include:

biomaRt, PSICQUIC, KEGGREST and other packages for querying on-line resources; each of these have informative vignettes.
AnnotationDbi is a cornerstone of the Annotation Data packages provided by Bioconductor.
- org packages (e.g., org.Hs.eg.db) contain maps between different gene identifiers, e.g., ENTREZ and SYMBOL. The basic interface to these packages is described on the help page ?select
- TxDb packages (e.g., TxDb.Hsapiens.UCSC.hg19.knownGene) contain gene models (exon coordinates, exon / transcript relationships, etc) derived from common sources such as the hg19 knownGene track of the UCSC genome browser. These packages can be queried, e.g., as described on the ?exonsBy page to retrieve all exons grouped by gene or transcript.
- BSgenome packages (e.g., BSgenome.Hsapiens.UCSC.hg19) contain whole genomes of model organisms.
VariantAnnotation and ensemblVEP provide access to sequence annotation facilities, e.g., to identify coding variants; see the Introduction to VariantAnnotation vignette for a brief introduction; we'll re-visit this during the Thursday lab.
Take a quick look (we'll do more of this in Thursday's lab) at the annotation work flow on the Bioconductor web site.

Avoiding deadly sins: efficient code

The goal of this section is to learn to write correct, robust, simple and efficient R code. We do this through a couple of case studies.

Priorities

Correct: consistent with hand-worked examples
- identical(), all.equal()
Robust: supports realistic inputs, e.g., 0-length vectors, NA values, …
Simple: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance.
Fast, or at least reasonable given the speed of modern computers.

Strategies

Profile
- Look at the script to understand in general terms what it is doing.
- Step through the code to see how it is executed, and to gain an understanding of the speed of each line.
- Time evaluation of select lines or simple chunks of code with system.time() or the microbenchmark package.
- Profile the code with a tool that indicates how much time is spent in each function call or line – the built-in Rprof() function, or packages such as lineprof or aprof

Vectorize – operate on vectors, rather than explicit loops

x <- 1:10
log(x)     ## NOT for (i in seq_along) x[i] <- log(x[i])

##  [1] 0.0000 0.6931 1.0986 1.3863 1.6094 1.7918 1.9459 2.0794 2.1972 2.3026

Pre-allocate memory, then fill in the result

result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
      result[i] <- runif(1) * result[i - 1]
result

##  [1] 0.72601 0.64980 0.44125 0.25742 0.09304 0.08861 0.07351 0.07070
##  [9] 0.05750 0.02301

Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once
- Simple, e.g., 'hoist' constant multiplications from a for loop
- Higher-level, e.g., use lm.fit() rather than repeatedly fitting the same design matrix.
Re-use existing, tested code
- Efficient implementations of common operations – tabulate(), rowSums() and friends, %in%, …
- Efficient domain-specific implementations, e.g., snpStats for GWAS linear models; limma for microarray linear models; edgeR, DESeq2 for negative binomial GLMs relevant to RNASeq.
Re-think how to attack the problem
- Different implementations
- Alternative algorithms
Compile your script with the byte compiler
Use parallel evaluation
Speak in tongues – 'foreign' languages like C, Fortran

Case study: from iteration to vectorization

Here's an obviously inefficient function:

f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}

Use system.time to investigate how this algorithm scales with n, focusing on elapsed time.

system.time(f0(10000))

##    user  system elapsed 
##   0.256   0.003   0.258

n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")

plot of chunk system-time

Remember the current 'correct' value, and an approximate time

n <- 10000
system.time(expected <- f0(n))

##    user  system elapsed 
##   0.252   0.003   0.254

head(expected)

## [1] 0.000 1.386 2.197 2.773 3.219 3.584

Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the 'optimization' and the original calculation are the same. Use the microbenchmark package to compare the two versions

f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))

## [1] TRUE

library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)

## Unit: milliseconds
##   expr   min    lq median    uq   max neval
##  f0(n) 231.6 234.1  235.2 349.7 352.2     5
##  f1(n) 229.6 232.3  348.2 348.8 357.9     5

Adopt a 'pre-allocate and fill' strategy

f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))

## [1] TRUE

microbenchmark(f0(n), f2(n), times=5)

## Unit: milliseconds
##   expr    min     lq median     uq    max neval
##  f0(n) 235.23 236.08 236.31 351.28 363.17     5
##  f2(n)  16.68  16.81  17.15  17.91  18.37     5

Use an *apply() function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent.

f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)

identical(expected, f3(n))

## [1] TRUE

microbenchmark(f0(n), f2(n), f3(n), times=10)

## Unit: milliseconds
##   expr    min     lq median     uq    max neval
##  f0(n) 234.50 237.66 298.54 359.67 365.47    10
##  f2(n)  16.84  17.24  17.32  18.30  21.26    10
##  f3(n)  20.37  21.04  22.21  22.55  27.46    10

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:

f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))

## [1] TRUE

microbenchmark(f0(n), f3(n), f4(n), times=10)

## Unit: microseconds
##   expr      min       lq median       uq      max neval
##  f0(n) 233304.5 237095.1 298124 359998.1 364520.8    10
##  f3(n)  20219.3  20356.3  20679  21235.9  22307.4    10
##  f4(n)    473.4    474.3    488    488.7    527.9    10

Returning to our explicit iteration f2(), in these situations it can be helpful to compile the code to a more efficient representation. Do this using the compiler package.

library(compiler)
f2c <- cmpfun(f2)
n <- 10000
identical(f2(n), f2c(n))

## [1] TRUE

microbenchmark(f2(n), f2c(n), times=10)

## Unit: milliseconds
##    expr    min     lq median     uq    max neval
##   f2(n) 17.214 17.670 18.173 19.025 22.358    10
##  f2c(n)  7.272  7.374  7.533  7.835  9.165    10

f4() definitely seems to be the winner. How does it scale with n? (Repeat several times)

n <- 100000 * seq(1, 20, 2)             # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, type="b")

plot of chunk vectorized-scale

Any explanations for the different pattern of response?

Lessons learned:

Vectorizing offers a huge improvement over iteration
Pre-allocate-and-fill is very helpful when explicit iteration is required.
*apply() functions help avoid need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it
Hoisting common sub-expressions and using the compiler package can be helpful for improving performance when explicit iteration is required.

Case study: Choosing algorithms

It can be very helpful to reason about an algorithm in an abstract sense, to gain understanding about how an operation might scale. Here's an interesting problem, taken from StackOverflow: Suppose one has a very long sorted vector

vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))

with the simple goal being to identify the number of values less than zero. The original post and many responses suggested a variation of scanning the vector for values less than zero, then summing

f0 <- function(v) sum(v < 0)

This algorithm compares each element of vec to zero, creating a logical vector as long as the original, length(v). This logical vector is then scanned by sum to count the number of elements satisfying the condition.

Questions:

How many vectors of length v need to be allocated for this algorithm?
Based on the number of comparisons that need to be performed, how would you expect this algorithm to scale with the length of v? Verify this with a simple figure.

N <- seq(1, 11, 2) * 1e6
Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f0(v))[[3]]
})
plot(Time ~ N, type="b")

plot of chunk algo-scan-time

Is there a better algorithm, i.e., an approach that arrives at the same answer but takes less time (and / or space)? The vector is sorted, and we can take advantage of that by doing a binary search. The algorithm is surprisingly simple: create an index to the minimum (first) element, and the maximum (last) element. Check to see if the element half way between is greater than or equal to zero. If so, move the maximum index to that point. Otherwise, make that point the new minimum. Repeat this procedure until the minimum index is no longer less than the maximum index.

f3 <- function(x, threshold=0) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= threshold)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}

Approximately half of the possible values are discarded each iteration, so we expect on average to arrive at the end after about log2(length(v)) iterations – the algorithm scales with the log of the length of v, rather than with the length of v, and no long vectors are created. These difference become increasingly important as the length of v becomes long.

Questions:

Verify with simple data that f3() and f0() result in identical() answers.
Compare how timing of f3() scales with vector length.

## identity
identical(f0((-2):2), f3((-2):2))

## [1] TRUE

identical(f0(2:4), f3(2:4))

## [1] TRUE

identical(f0(-(4:2)), f3(-(4:2)))

## [1] TRUE

identical(f0(vec), f3(vec))

## [1] TRUE

## scale
N <- 10^(1:7)

Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f3(v))[[3]]
})
plot(Time ~ N, type="b")

plot of chunk algo-binary-perform

Use the microbenchmark package to compare performance of f0() and f3() with the original data, vec.
R code can be compiled, and compilation helps most when doing non-vectorized operations like those in f3(). Use compiler::cmpfun() to compile f3(), and compare the result using microbenchmark.

## relative time
library(microbenchmark)
microbenchmark(f0(vec), f3(vec))

## Unit: microseconds
##     expr      min       lq  median       uq     max neval
##  f0(vec) 26688.44 27822.29 27924.6 28301.45 32728.0   100
##  f3(vec)    52.44    55.19    72.3    79.73   121.5   100

library(compiler)
f3c <- cmpfun(f3)
microbenchmark(f3(vec), f3c(vec))

## Unit: microseconds
##      expr   min    lq median    uq   max neval
##   f3(vec) 53.10 55.28  59.17 60.93 69.20   100
##  f3c(vec) 23.11 23.85  24.83 25.71 35.18   100

We could likely gain additional speed by writing the binary search algorithm in C, but we are already so happy with the performance improvement that we won't do that!

It is useful to ask what is lost by f3() compared to f0(). For instance, does the algorithm work on character vectors? What about when the vector contains NA values? How are ties at 0 treated?

findInterval() is probably a better built-in way to solve the original problem, and generalizes to additional situations. The idea is to take the query that we are interested in, 0, and find the interval specified by vec in which it occurs.

f4 <- function(v, query=0)
    findInterval(query - .Machine$double.eps, v)

identical(f0(vec), f4(vec))

## [1] TRUE

microbenchmark(f0(vec), f3(vec), f4(vec))

## Unit: microseconds
##     expr      min       lq   median      uq     max neval
##  f0(vec) 26911.29 27841.17 27956.62 28343.8 31953.2   100
##  f3(vec)    52.59    58.17    70.79    78.8   137.9   100
##  f4(vec) 21972.92 22035.44 22062.42 22158.6 23855.8   100

The fact that it is flexible and well tested means that it would often be preferred to f3(), even though it is less speedy. For instance, compare the time it takes to query 10000 different points using f3 and iteration, versus findInterval and vectorization.

threshold <- rnorm(10000)
identical(sapply(threshold, f3, x=vec), f4(vec, threshold))

## [1] TRUE

microbenchmark(sapply(x, f3), f4(vec, x))

## Unit: microseconds
##           expr     min      lq  median      uq     max neval
##  sapply(x, f3)   128.2   138.2   181.7   257.9   291.2   100
##     f4(vec, x) 22000.2 22037.3 22091.4 22308.3 23871.0   100

Some R functions that implement efficient algorithms are sort() (including radix sort), match() (hash table look-up), and tabulate(); these can be useful in your own code.

Lessons learned:

Choice of algorithm can be very important
Implementing classical algorithms (like binary search) can be a rewarding learning experience even when, at the end of the day, it may be better to use existing functions.
The built-in R functions that implement efficient algorithms can be important building-blocks for more complicated code.

Skills for Intermediate R Users

Contents

Refresher: working with data

Case study: ALL phenotypic data

Case study: weighty matters

Exercises: base plotting functions

Exercises: Trellis graphics and the lattice package

Making progress: classes, methods, and packages

Motivation

Case study: IRanges and GRanges

A sequence analysis package tour

Avoiding deadly sins: efficient code

Priorities

Strategies

Case study: from iteration to vectorization

Case study: Choosing algorithms