Contents

1 Common Bioconductor work flows

This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; many of the material will be covered in greater detail in subsequent labs and lectures.

Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.

Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure.

Visualization

Alt Sequencing Ecosystem

2 Computation: Big Data

2.1 Efficient R code

The goal of this section is to highlight practices for writing correct, robust and efficient R code.

2.2 Priorities

  1. Correct: consistent with hand-worked examples (identical(), all.equal())
  2. Robust: supports realistic inputs, e.g., 0-length vectors, NA values, …
  3. Simple: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance.
  4. Fast, or at least reasonable given the speed of modern computers.

2.3 Strategies

  1. Profile
    • Look at the script to understand in general terms what it is doing.
    • Step through the code to see how it is executed, and to gain an understanding of the speed of each line.
    • Time evaluation of select lines or simple chunks of code with system.time() or the microbenchmark package.
    • Profile the code with a tool that indicates how much time is spent in each function call or line – the built-in Rprof() function, or packages such as lineprof or aprof
  2. Vectorize – operate on vectors, rather than explicit loops

    x <- 1:10
    log(x)     ## NOT for (i in seq_along(x)) x[i] <- log(x[i])
    ##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101 2.0794415 2.1972246
    ## [10] 2.3025851
  3. Pre-allocate memory, then fill in the result

    result <- numeric(10)
    result[1] <- runif(1)
    for (i in 2:length(result))
           result[i] <- runif(1) * result[i - 1]
    result
    ##  [1] 0.4951848784 0.2865352417 0.1203113475 0.0282785626 0.0117993505 0.0040204726 0.0025406804
    ##  [8] 0.0011436998 0.0010444164 0.0002276832
  4. Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once
    • Simple, e.g., ‘hoist’ constant multiplications from a for loop
    • Higher-level, e.g., use lm.fit() rather than repeatedly fitting the same design matrix.
  5. Re-use existing, tested code
    • Efficient implementations of common operations – tabulate(), rowSums() and friends, %in%, …
    • Efficient domain-specific implementations, e.g., snpStats for GWAS linear models; limma for microarray linear models; edgeR, DESeq2 for negative binomial GLMs relevant to RNASeq.
    • Reuse others’ work – DESeq2, GenomicRanges, Biostrings, …, dplyr, data.table, Rcpp

2.3.1 Case Study: Pre-allocate and vectorize

Here’s an obviously inefficient function:

f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}

Use system.time() to investigate how this algorithm scales with n, focusing on elapsed time.

system.time(f0(10000))
##    user  system elapsed 
##   0.284   0.004   0.285
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")

Remember the current ‘correct’ value, and an approximate time

n <- 10000
system.time(expected <- f0(n))
##    user  system elapsed 
##   0.272   0.000   0.270
head(expected)
## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519

Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the ‘optimization’ and the original calculation are the same. Use the microbenchmark package to compare the two versions

f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))
## [1] TRUE
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval cld
##  f0(n) 246.0423 264.9063 261.8002 264.9214 265.3211 267.8099     5   a
##  f1(n) 219.2913 221.7509 246.0515 261.9951 263.2040 264.0162     5   a

Adopt a ‘pre-allocate and fill’ strategy

f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), times=5)
## Unit: milliseconds
##   expr       min        lq      mean   median        uq      max neval cld
##  f0(n) 214.56280 236.09284 268.35744 244.4207 323.00509 323.7058     5   b
##  f2(n)  11.68582  11.75021  12.18971  11.8478  12.04433  13.6204     5  a

Use an *apply() function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent.

f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)

identical(expected, f3(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), f3(n), times=10)
## Unit: milliseconds
##   expr        min         lq       mean     median         uq        max neval cld
##  f0(n) 210.639096 212.705845 254.638491 214.667187 315.358004 321.404387    10   b
##  f2(n)  11.724926  11.803303  12.009622  11.916915  12.079235  12.733139    10  a 
##  f3(n)   6.126175   6.152834   6.335576   6.339932   6.441334   6.720913    10  a

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:

f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))
## [1] TRUE
microbenchmark(f0(n), f3(n), f4(n), times=10)
## Unit: microseconds
##   expr        min         lq       mean     median         uq        max neval cld
##  f0(n) 207472.674 210034.700 251660.864 212280.688 312108.901 317604.029    10   b
##  f3(n)   6043.009   6085.972   6148.863   6122.600   6228.416   6292.337    10  a 
##  f4(n)    364.354    365.775    374.082    373.359    382.867    392.421    10  a

f4() definitely seems to be the winner. How does it scale with n? (Repeat several times)

n <- 10 ^ (5:8)                         # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")

Any explanations for the different pattern of response?

Lessons learned:

  1. Vectorizing offers a huge improvement over iteration
  2. Pre-allocate-and-fill is very helpful when explicit iteration is required.
  3. *apply() functions help avoid need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it
  4. Hoisting common sub-expressions can be helpful for improving performance when explicit iteration is required.

2.4 Parallel evaluation

When data are too large to fit in memory, we can iterate through files in chunks or subset the data by fields or genomic positions.

Iteration

Restriction

Parallel evalutation

BiocParallel provides a standardized interface for simple parallel evaluation. The package builds provides access to the snow and multicore functionality in the parallel package as well as BatchJobs for running cluster jobs.

General ideas:

2.4.1 Exercise: Sleeping serially and in parallel

This small example motivates the use of parallel execution and demonstrates how bplapply() can be a drop in for lapply.

Use system.time() to explore how long this takes to execute as n increases from 1 to 10. Use identical() and microbenchmark to compare alternatives f0() and f1() for both correctness and performance.

fun sleeps for 1 second, then returns i.

library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

## serial
f0 <- function(n)
    lapply(seq_len(n), fun)

## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)