Skills for Intermediate R Users

CSAMA 2014
Author: Martin Morgan (mtmorgan@fhcrc.org)
Date: 23 June, 2014

Contents

Refresher: working with data

Case study: ALL phenotypic data

This case study serves as a refresher on basic input and manipulation of data.

Input a file that contains ALL (acute lymphoblastic leukemia) patient information

fname <- file.choose()   ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)

Check out the help page ?read.delim for input options, and explore basic properties of the object you've created, for instance…

class(pdata)
## [1] "data.frame"
colnames(pdata)
##  [1] "id"             "diagnosis"      "sex"            "age"           
##  [5] "BT"             "remission"      "CR"             "date.cr"       
##  [9] "t.4.11."        "t.9.22."        "cyto.normal"    "citog"         
## [13] "mol.biol"       "fusion.protein" "mdr"            "kinet"         
## [17] "ccr"            "relapse"        "transplant"     "f.u"           
## [21] "date.last.seen"
dim(pdata)
## [1] 127  21
head(pdata)
##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 6 4008 7/30/1997   M  17 B1        CR CR 9/27/1997   FALSE   FALSE
##   cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE  simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 6       FALSE complex alt.      NEG           <NA> NEG hyperd. FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997
## 6    TRUE      FALSE               REL     12/15/1997
summary(pdata$sex)
##    F    M NA's 
##   42   83    2
summary(pdata$cyto.normal)
##    Mode   FALSE    TRUE    NA's 
## logical      69      24      34

Remind yourselves about various ways to subset and access columns of a data.frame

pdata[1:5, 3:4]
##   sex age
## 1   M  53
## 2   M  19
## 3   F  52
## 4   M  38
## 5   M  57
pdata[1:5, ]
##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
##   cyto.normal       citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE     t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA        <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE     t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE     del(6q)      NEG           <NA> NEG dyploid FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997
head(pdata[, 3:5])
##   sex age BT
## 1   M  53 B2
## 2   M  19 B2
## 3   F  52 B4
## 4   M  38 B1
## 5   M  57 B2
## 6   M  17 B1
tail(pdata[, 3:5], 3)
##     sex age BT
## 125   M  19 T2
## 126   M  30 T3
## 127   M  29 T2
head(pdata$age)
## [1] 53 19 52 38 57 17
head(pdata$sex)
## [1] M M F M M M
## Levels: F M
head(pdata[pdata$age > 21,])
##      id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1  1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 3  3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4  4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5  4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 10 8001 1/15/1997   M  40 B2        CR CR 3/26/1997   FALSE   FALSE
## 11 8011 8/21/1998   M  33 B3        CR CR 10/8/1998   FALSE   FALSE
##    cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1        FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 3           NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4        FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5        FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 10       FALSE     del(p15)  BCR/ABL           p190 NEG    <NA> FALSE
## 11       FALSE del(p15/p16)  BCR/ABL      p190/p210 NEG dyploid FALSE
##    relapse transplant               f.u date.last.seen
## 1    FALSE       TRUE BMT / DEATH IN CR           <NA>
## 3     TRUE      FALSE               REL     10/15/1999
## 4     TRUE      FALSE               REL      1/23/1998
## 5     TRUE      FALSE               REL      11/4/1997
## 10    TRUE      FALSE               REL      7/11/1997
## 11   FALSE       TRUE BMT / DEATH IN CR           <NA>

It seems from the code below that there are 17 females over 40 in the data set, but when subsetting pdata to contain just those individuals, 19 rows are selected. Why? What can we do to correct this?

idx <- pdata$sex == "F" & pdata$age > 40
table(idx)
## idx
## FALSE  TRUE 
##   108    17
dim(pdata[idx,])
## [1] 19 21
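
The two extra rows arise because idx contains NA (individuals whose sex or age is missing), and subsetting a data.frame with an NA index yields a row of NAs. One possible correction uses which() to keep only the TRUE positions:

dim(pdata[which(idx),])                      ## 17 rows; which() drops NA indexes
dim(subset(pdata, sex == "F" & age > 40))    ## subset() also treats NA as FALSE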

Use the mol.biol column to subset the data to contain just individuals with 'BCR/ABL' or 'NEG', e.g.,

bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]

The mol.biol column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels?

bcrabl$mol.biol <- factor(bcrabl$mol.biol)
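
droplevels() is a convenient alternative, dropping unused levels in all factor columns at once:

bcrabl <- droplevels(bcrabl)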

The BT column is a factor describing B- and T-cell subtypes

levels(bcrabl$BT)
##  [1] "B"  "B1" "B2" "B3" "B4" "T"  "T1" "T2" "T3" "T4"

How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so there are only two subtypes, B and T?

table(bcrabl$BT)
## 
##  B B1 B2 B3 B4  T T1 T2 T3 T4 
##  4  9 35 22  9  4  1 15  9  2
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
## 
##  B  T 
## 79 31

Use xtabs() (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups

xtabs(~ BT + mol.biol, bcrabl)
##    mol.biol
## BT  BCR/ABL NEG
##   B      37  42
##   T       0  31

Use aggregate() to calculate the average age of males and females in the BCR/ABL and NEG treatment groups.

aggregate(age ~ mol.biol + sex, bcrabl, mean)
##   mol.biol sex   age
## 1  BCR/ABL   F 39.94
## 2      NEG   F 30.42
## 3  BCR/ABL   M 40.50
## 4      NEG   M 27.21

Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

t.test(age ~ mol.biol, bcrabl)
## 
##  Welch Two Sample t-test
## 
## data:  age by mol.biol
## t = 4.817, df = 68.53, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.135 17.224
## sample estimates:
## mean in group BCR/ABL     mean in group NEG 
##                 40.25                 28.07
boxplot(age ~ mol.biol, bcrabl)

[Figure: plot of chunk ALL-age]
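
To re-do the test assuming equal variances, add var.equal=TRUE; the output header changes from 'Welch Two Sample t-test' to 'Two Sample t-test', and the degrees of freedom, p-value, and confidence interval change with it

t.test(age ~ mol.biol, bcrabl, var.equal=TRUE)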

Case study: weighty matters

This case study is a second walk through basic data manipulation and visualization skills. We use data from the US Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the BRFSS web page for a little more information. We are using a small subset of this data: a random sample of 10000 observations from each of 1990 and 2010.

Input the data using read.csv(), creating a variable brfss to hold it. Use file.choose() to locate the data file BRFSS-subset.csv

fname <- file.choose()   ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read.csv(fname)

Exercises: base plotting functions

  1. Explore the data using class(), dim(), head(), summary(), etc. Use xtabs() to summarize the number of males and females in the study, in each of the two years.
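
    One possible cross-tabulation (Sex and Year are columns of brfss, used throughout these exercises):

    xtabs(~ Sex + Year, brfss)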

  2. Use aggregate() to summarize the average weight in each sex and year.
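
    A sketch, analogous to the aggregate() call in the ALL case study:

    aggregate(Weight ~ Year + Sex, brfss, mean)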

  3. Create a scatterplot showing the relationship between the square root of weight and height, using the plot() function and the main argument to annotate the plot. Note the transformed Y-axis. Experiment with different plotting symbols (try example(points) to view the available symbols).

    plot(sqrt(Weight) ~ Height, brfss, main="All Years, Both Sexes")
    

    [Figure: plot of chunk brfss-simple-plot]

  4. Color the female and male points differently. To do this, use the col argument to plot(). Provide as a value to that argument a vector of colors, subset by brfss$Sex.
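
    A sketch: index a two-color vector by the Sex factor, so each point's color follows its sex (the with() form keeps colors aligned with points):

    with(brfss, plot(Height, sqrt(Weight), pch=".",
        col=c("red", "blue")[Sex]))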

  5. Create a subset of the data containing only observations from 2010.

    brfss2010 <- brfss[brfss$Year == "2010", ]
    
  6. Create the figure below (two panels in a single figure). Do this by using the par() function with the mfcol argument before calling plot(). You'll need to create two more subsets of the data, perhaps at the point where you provide the data to plot().

    opar <- par(mfcol=c(1, 2))
    plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Female", ],
        main="2010, Female")
    plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Male", ],
        main="2010, Male")
    

    [Figure: plot of chunk brfss-pair-plot]

    par(mfcol=c(1, 1))
    
  7. Plotting large numbers of points means that they are often over-plotted, potentially obscuring important patterns. Experiment with arguments to plot() to address over-plotting, e.g., pch='.' or a semi-transparent color such as col=rgb(0, 0, 0, alpha=0.4). Try using the smoothScatter() function (the data have to be presented as x and y, rather than as a formula). Try adding the hexbin library to your R session (using library()) and creating a hexbinplot().
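
    A sketch of these possibilities (hexbin is a contributed package; install it first if necessary):

    with(brfss, plot(Height, sqrt(Weight), pch=20,
        col=rgb(0, 0, 0, alpha=0.1)))           ## semi-transparent points
    with(brfss, smoothScatter(Height, sqrt(Weight)))
    library(hexbin)
    hexbinplot(sqrt(Weight) ~ Height, brfss)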

Exercises: Trellis graphics and the lattice package

R has a number of additional plotting facilities, both as part of the 'base' distribution and user-contributed packages. The lattice package adopts a particular philosophy about the presentation of data, and can be a very effective visualization tool.

  1. Use library() to load the lattice package.

    library(lattice)
    
  2. Create the following figure using the xyplot() function with a formula and the brfss2010 data. The formula is sqrt(Weight) ~ Height | Sex, which can be read as 'square root of Weight as a function of Height, conditioned on Sex'.

    xyplot(sqrt(Weight) ~ Height | Sex, brfss2010)
    

    [Figure: plot of chunk lattice-conditioning]

  3. Add a background grid and a regression line to each panel using the argument type=c('p', 'g', 'r'); change the width (lwd) and color (col.line) of the regression line. For some hints on other arguments, see the help pages ?xyplot (for overall structure of the plot) and ?panel.xyplot (for how each 'panel' of data is plotted).
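
    A possible call, reusing the settings that appear in the advanced panel example at the end of this section:

    xyplot(sqrt(Weight) ~ Height | Sex, brfss2010,
        type=c("p", "g", "r"), lwd=2, col.line="red")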

  4. Create the following figure. Use the densityplot() function with the formula ~ sqrt(Weight). The group=Sex argument creates separate lines for each sex. Be sure to use plot.points=FALSE to avoid a 'rug' of points at the base of the figure. Can you add a key (legend)? One approach is sketched below the figure.

    densityplot(~sqrt(Weight), brfss2010, group=Sex, plot.points=FALSE)
    

    [Figure: plot of chunk lattice-density]
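
    One way to add a key is lattice's auto.key argument, e.g.:

    densityplot(~sqrt(Weight), brfss2010, group=Sex,
        plot.points=FALSE, auto.key=TRUE)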

  5. Create the figure below using the bwplot function. The formula requires that Year be coerced to a factor, factor(Year).

    bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss)
    

    [Figure: plot of chunk lattice-bwplot]

  6. Create the figure below, a violin plot, using bwplot() with the panel argument set to panel.violin. From ?bwplot we learn that panel is a function that determines how each panel is drawn; the details for controlling the violin panel plot are described on the help page ?panel.violin.

    bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss, panel=panel.violin)
    

    [Figure: plot of chunk lattice-violin]

  7. (Advanced) We can write our own panel argument to lattice functions to influence how each panel is displayed. Here we add a point at the median age and weight.

    xyplot(sqrt(Weight) ~ Height|Sex, brfss2010,
        panel = function(x, y, ...) {
            panel.xyplot(x, y, ...)
            panel.points(median(x, na.rm=TRUE), median(y, na.rm=TRUE), 
                cex=2, pch=20, col="red")
        },
        type=c("p", "g", "r"), lwd=2, col.line="red", xlim=c(120, 210))
    

    [Figure: plot of chunk lattice-panel]

Making progress: classes, methods, and packages

This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities.

Motivation

Sequence analysis is specialized

Additional considerations

Solution: use well-defined classes to represent complex data; methods operate on the classes to perform useful functions. Classes and methods are placed together and distributed as packages so that we can all benefit from the hard work and tested code of others.

Case study: IRanges and GRanges

The IRanges package defines an important class for specifying integer ranges, e.g.,

library(IRanges)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, as.vector, cbind, colnames, do.call,
##     duplicated, eval, evalq, get, intersect, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rep.int, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unlist
ir <- IRanges(start=c(10, 20, 30), width=5)
ir
## IRanges of length 3
##     start end width
## [1]    10  14     5
## [2]    20  24     5
## [3]    30  34     5

There are many interesting operations to be performed on ranges, e.g., flank() identifies adjacent ranges

flank(ir, 3)
## IRanges of length 3
##     start end width
## [1]     7   9     3
## [2]    17  19     3
## [3]    27  29     3

Consult the help page for flank, ?flank, and explore other range-based operations.
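
A few other operations to try (all defined in the IRanges package):

shift(ir, 2)     ## shift each range right by two positions
reduce(ir)       ## merge overlapping or adjacent ranges
resize(ir, 10)   ## set each width to 10, anchored at the start of the range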

The IRanges class is part of a class hierarchy. To see this, ask R for the class of ir, and for the class definition of the IRanges class

class(ir)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
getClassDef(class(ir))
## Class "IRanges" [package "IRanges"]
## 
## Slots:
##                                                                       
## Name:            start           width           NAMES     elementType
## Class:         integer         integer characterORNULL       character
##                                       
## Name:  elementMetadata        metadata
## Class: DataTableORNULL            list
## 
## Extends: 
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
## 
## Known Subclasses: "NormalIRanges"

Notice that IRanges extends the Ranges class. Now try entering ?"flank,<tab>, where <tab> means pressing the tab key to ask for tab completion (this may not be necessary in RStudio). You can see that there are help pages for several different classes. Tab-complete to

?"flank,Ranges-method" 

and verify that you're at the page that describes the method relevant to an IRanges instance.

The GenomicRanges package extends the notion of ranges to include features relevant to application of ranges in sequence analysis, particularly the ability to associate a range with a sequence name (e.g., chromosome) and a strand. Create a GRanges instance based on our IRanges instance, as follows

library(GenomicRanges)
## Loading required package: GenomeInfoDb
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr
## GRanges with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [10, 14]      +
##   [2]     chr1  [20, 24]      -
##   [3]     chr2  [30, 34]      +
##   ---
##   seqlengths:
##    chr1 chr2
##      NA   NA

The notion of flanking sequence has a more nuanced meaning in biology. In particular, we might expect that flanking sequence on the + strand precedes the range, but on the - strand follows it. Verify that flank() applied to a GRanges object has this behavior.

flank(gr, 3)
## GRanges with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [ 7,  9]      +
##   [2]     chr1  [25, 27]      -
##   [3]     chr2  [27, 29]      +
##   ---
##   seqlengths:
##    chr1 chr2
##      NA   NA

Discover what classes GRanges extends, find the help page documenting the behavior of flank when applied to a GRanges object, and verify that the help page documents the behavior we just observed.

class(gr)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
getClassDef(class(gr))
## Class "GRanges" [package "GenomicRanges"]
## 
## Slots:
##                                                                       
## Name:         seqnames          ranges          strand elementMetadata
## Class:             Rle         IRanges             Rle       DataFrame
##                                       
## Name:          seqinfo        metadata
## Class:         Seqinfo            list
## 
## Extends: 
## Class "GenomicRanges", directly
## Class "Vector", by class "GenomicRanges", distance 2
## Class "GenomicRangesORmissing", by class "GenomicRanges", distance 2
## Class "GenomicRangesORGRangesList", by class "GenomicRanges", distance 2
## Class "Annotated", by class "GenomicRanges", distance 3
?"flank,GenomicRanges-method"

Notice that the available flank() methods have been augmented by the methods defined in the GenomicRanges package.

It seems like there might be a number of helpful methods available for working with genomic ranges; we can discover some of these from the command line, specifying that the methods should be defined in packages on the current search() path

showMethods(class="GRanges", where=search())

Use help() to list the help pages in the GenomicRanges package, and vignette() to view and access available vignettes; these are also available in the RStudio 'Help' tab.

help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")

A sequence analysis package tour

This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; much of this material will be covered in greater detail in subsequent labs and lectures.

Basics

Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.

Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.

Annotation: Bioconductor provides extensive access to 'annotation' resources (see the AnnotationData biocViews hierarchy); these are covered in greater detail in Thursday's lab, but some interesting examples to explore during this lab include:

Avoiding deadly sins: efficient code

The goal of this section is to learn to write correct, robust, simple and efficient R code. We do this through a couple of case studies.

Priorities

  1. Correct: consistent with hand-worked examples
  2. Robust: supports realistic inputs, e.g., 0-length vectors, NA values, …
  3. Simple: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance.
  4. Fast, or at least reasonable given the speed of modern computers.

Strategies

  1. Profile
  2. Vectorize – operate on vectors, rather than explicit loops

    x <- 1:10
    log(x)     ## NOT: for (i in seq_along(x)) x[i] <- log(x[i])
    
    ##  [1] 0.0000 0.6931 1.0986 1.3863 1.6094 1.7918 1.9459 2.0794 2.1972 2.3026
    
  3. Pre-allocate memory, then fill in the result

    result <- numeric(10)
    result[1] <- runif(1)
    for (i in 2:length(result))
          result[i] <- runif(1) * result[i - 1]
    result
    
    ##  [1] 0.72601 0.64980 0.44125 0.25742 0.09304 0.08861 0.07351 0.07070
    ##  [9] 0.05750 0.02301
    
  4. Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once
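
    For example (a minimal sketch; sqrt(2 * pi) does not depend on the loop variable, so it can be computed once, outside the loop):

    x <- runif(10); y <- numeric(10)
    k <- sqrt(2 * pi)              ## hoisted: computed once
    for (i in seq_along(x))
        y[i] <- x[i] / k           ## NOT: y[i] <- x[i] / sqrt(2 * pi)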

  5. Re-use existing, tested code

  6. Re-think how to attack the problem

  7. Compile your script with the byte compiler

  8. Use parallel evaluation

  9. Speak in tongues – 'foreign' languages like C, Fortran

Case study: from iteration to vectorization

Here's an obviously inefficient function:

f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}

Use system.time to investigate how this algorithm scales with n, focusing on elapsed time.

system.time(f0(10000))
##    user  system elapsed 
##   0.256   0.003   0.258
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")

[Figure: plot of chunk system-time]

Remember the current 'correct' value, and an approximate time

n <- 10000
system.time(expected <- f0(n))
##    user  system elapsed 
##   0.252   0.003   0.254
head(expected)
## [1] 0.000 1.386 2.197 2.773 3.219 3.584

Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the 'optimization' and the original calculation are the same. Use the microbenchmark package to compare the two versions

f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))
## [1] TRUE
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
## Unit: milliseconds
##   expr   min    lq median    uq   max neval
##  f0(n) 231.6 234.1  235.2 349.7 352.2     5
##  f1(n) 229.6 232.3  348.2 348.8 357.9     5

Adopt a 'pre-allocate and fill' strategy

f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), times=5)
## Unit: milliseconds
##   expr    min     lq median     uq    max neval
##  f0(n) 235.23 236.08 236.31 351.28 363.17     5
##  f2(n)  16.68  16.81  17.15  17.91  18.37     5

Use an *apply() function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent.

f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)

identical(expected, f3(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), f3(n), times=10)
## Unit: milliseconds
##   expr    min     lq median     uq    max neval
##  f0(n) 234.50 237.66 298.54 359.67 365.47    10
##  f2(n)  16.84  17.24  17.32  18.30  21.26    10
##  f3(n)  20.37  21.04  22.21  22.55  27.46    10

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:

f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))
## [1] TRUE
microbenchmark(f0(n), f3(n), f4(n), times=10)
## Unit: microseconds
##   expr      min       lq median       uq      max neval
##  f0(n) 233304.5 237095.1 298124 359998.1 364520.8    10
##  f3(n)  20219.3  20356.3  20679  21235.9  22307.4    10
##  f4(n)    473.4    474.3    488    488.7    527.9    10

Returning to f2() and its explicit iteration: in situations like this it can be helpful to compile the code to a more efficient representation. Do this using the compiler package.

library(compiler)
f2c <- cmpfun(f2)
n <- 10000
identical(f2(n), f2c(n))
## [1] TRUE
microbenchmark(f2(n), f2c(n), times=10)
## Unit: milliseconds
##    expr    min     lq median     uq    max neval
##   f2(n) 17.214 17.670 18.173 19.025 22.358    10
##  f2c(n)  7.272  7.374  7.533  7.835  9.165    10

f4() definitely seems to be the winner. How does it scale with n? (Repeat several times)

n <- 100000 * seq(1, 20, 2)             # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, type="b")

[Figure: plot of chunk vectorized-scale]

Any explanations for the different pattern of response?

Lessons learned:

  1. Vectorizing offers a huge improvement over iteration
  2. Pre-allocate-and-fill is very helpful when explicit iteration is required.
  3. *apply() functions help avoid the need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it.
  4. Hoisting common sub-expressions and using the compiler package can be helpful for improving performance when explicit iteration is required.

Case study: Choosing algorithms

It can be very helpful to reason about an algorithm in an abstract sense, to gain understanding about how an operation might scale. Here's an interesting problem, taken from StackOverflow: Suppose one has a very long sorted vector

vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))

with the simple goal being to identify the number of values less than zero. The original post and many responses suggested a variation of scanning the vector for values less than zero, then summing

f0 <- function(v) sum(v < 0)

This algorithm compares each element of vec to zero, creating a logical vector as long as the original, length(v). This logical vector is then scanned by sum to count the number of elements satisfying the condition.

Questions:

  1. How many vectors as long as v need to be allocated by this algorithm?
  2. Based on the number of comparisons that need to be performed, how would you expect this algorithm to scale with the length of v? Verify this with a simple figure.
N <- seq(1, 11, 2) * 1e6
Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f0(v))[[3]]
})
plot(Time ~ N, type="b")

[Figure: plot of chunk algo-scan-time]

Is there a better algorithm, i.e., an approach that arrives at the same answer but takes less time (and / or space)? The vector is sorted, and we can take advantage of that by doing a binary search. The algorithm is surprisingly simple: create an index to the minimum (first) element and another to the maximum (last) element. Check whether the element halfway between is greater than or equal to zero. If so, move the maximum index to that point; otherwise, make that point the new minimum. Repeat this procedure until the minimum index is no longer less than the maximum index.

f3 <- function(x, threshold=0) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= threshold)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}

Approximately half of the possible values are discarded on each iteration, so we expect on average to arrive at the answer after about log2(length(v)) iterations – the algorithm scales with the log of the length of v, rather than with the length itself, and no long intermediate vectors are created. These differences become increasingly important as v grows longer.
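
For the vector above this is only about 21 iterations:

log2(length(vec))    ## about 20.9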

Questions:

  1. Verify with simple data that f3() and f0() result in identical() answers.
  2. Compare how timing of f3() scales with vector length.
## identity
identical(f0((-2):2), f3((-2):2))
## [1] TRUE
identical(f0(2:4), f3(2:4))
## [1] TRUE
identical(f0(-(4:2)), f3(-(4:2)))
## [1] TRUE
identical(f0(vec), f3(vec))
## [1] TRUE
## scale
N <- 10^(1:7)

Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f3(v))[[3]]
})
plot(Time ~ N, type="b")

[Figure: plot of chunk algo-binary-perform]

  3. Use the microbenchmark package to compare performance of f0() and f3() with the original data, vec.
  4. R code can be compiled, and compilation helps most when doing non-vectorized operations like those in f3(). Use compiler::cmpfun() to compile f3(), and compare the result using microbenchmark.
## relative time
library(microbenchmark)
microbenchmark(f0(vec), f3(vec))
## Unit: microseconds
##     expr      min       lq  median       uq     max neval
##  f0(vec) 26688.44 27822.29 27924.6 28301.45 32728.0   100
##  f3(vec)    52.44    55.19    72.3    79.73   121.5   100
library(compiler)
f3c <- cmpfun(f3)
microbenchmark(f3(vec), f3c(vec))
## Unit: microseconds
##      expr   min    lq median    uq   max neval
##   f3(vec) 53.10 55.28  59.17 60.93 69.20   100
##  f3c(vec) 23.11 23.85  24.83 25.71 35.18   100

We could likely gain additional speed by writing the binary search algorithm in C, but we are already so happy with the performance improvement that we won't do that!

It is useful to ask what is lost by f3() compared to f0(). For instance, does the algorithm work on character vectors? What about when the vector contains NA values? How are ties at 0 treated?

findInterval() is probably a better built-in way to solve the original problem, and generalizes to additional situations. The idea is to take the query that we are interested in, 0, and find the interval specified by vec in which it occurs.

f4 <- function(v, query=0)
    findInterval(query - .Machine$double.eps, v)

identical(f0(vec), f4(vec))
## [1] TRUE
microbenchmark(f0(vec), f3(vec), f4(vec))
## Unit: microseconds
##     expr      min       lq   median      uq     max neval
##  f0(vec) 26911.29 27841.17 27956.62 28343.8 31953.2   100
##  f3(vec)    52.59    58.17    70.79    78.8   137.9   100
##  f4(vec) 21972.92 22035.44 22062.42 22158.6 23855.8   100

The fact that findInterval() is flexible and well tested means that it would often be preferred to f3(), even though it is less speedy for a single query. For instance, compare the time it takes to query 10000 different points using f3() and iteration, versus findInterval() and vectorization.

threshold <- rnorm(10000)
identical(sapply(threshold, f3, x=vec), f4(vec, threshold))
## [1] TRUE
microbenchmark(sapply(threshold, f3, x=vec), f4(vec, threshold))

With 10000 query points, the single vectorized findInterval() call in f4() should win handily over 10000 separate binary searches with f3().

Some R functions that implement efficient algorithms are sort() (including radix sort), match() (hash table look-up), and tabulate(); these can be useful in your own code.
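
For example:

match(c("b", "z"), letters)              ## hash table look-up: 2 26
tabulate(c(2L, 3L, 3L, 5L), nbins=5)     ## counts in linear time: 0 1 2 0 1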

Lessons learned:

  1. Choice of algorithm can be very important
  2. Implementing classical algorithms (like binary search) can be a rewarding learning experience even when, at the end of the day, it may be better to use existing functions.
  3. The built-in R functions that implement efficient algorithms can be important building-blocks for more complicated code.