This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; much of the material will be covered in greater detail in subsequent labs and lectures.
Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.
Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure.
- Biostrings classes and methods for manipulating DNA, RNA, and amino acid sequences, e.g., ?consensusMatrix. Also check out the BSgenome package for working with whole genome sequences, e.g., ?"getSeq,BSgenome-method"
- GenomicAlignments for working with aligned reads, e.g., the ?readGAlignments help page and vignette(package="GenomicAlignments", "summarizeOverlaps")
- rtracklayer's import and export functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser. Check out the ?import page for basic usage.
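For a first taste, here is a minimal sketch of the kind of operations these packages support; it assumes the Biostrings and rtracklayer packages are installed, and 'regions.bed' is a hypothetical file name.

library(Biostrings)
dna <- DNAStringSet(c("AACT", "AACG", "ACCT"))    ## three toy sequences
consensusMatrix(dna)[1:4, ]                       ## A/C/G/T counts per position

library(rtracklayer)
## import() infers the format from the file extension; 'regions.bed'
## is a hypothetical file name
## gr <- import("regions.bed")                    ## returns a GRanges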
The goal of this section is to highlight practices for writing correct, robust and efficient R code.
- Correct: compare results against a reference implementation, e.g., with identical() or all.equal()
- Robust: handle edge cases, e.g., NA values, …
- Efficient: measure execution time with system.time() or the microbenchmark package; identify bottlenecks with the Rprof() function, or packages such as lineprof or aprof
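For example, a minimal sketch contrasting identical() (exact equality) with all.equal() (equality up to numerical tolerance):

x <- c(1, 2, NA)
identical(x * 2, x + x)    ## TRUE: exactly equal, NA handled consistently
identical(sqrt(2)^2, 2)    ## FALSE: floating point round-off
all.equal(sqrt(2)^2, 2)    ## TRUE: equal within numeric tolerance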
Vectorize – operate on vectors, rather than explicit loops

x <- 1:10
log(x)     ## NOT for (i in seq_along(x)) x[i] <- log(x[i])

##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101 2.0794415 2.1972246
## [10] 2.3025851

Pre-allocate memory, then fill in the result

result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
       result[i] <- runif(1) * result[i - 1]
result

##  [1] 0.331484123 0.272304255 0.073922382 0.059968133 0.045987249 0.020074915 0.007646036 0.004178788
##  [9] 0.004107127 0.002888223

Hoist common sub-expressions out of repeated calculations: simple, e.g., move a constant multiplication out of a for loop; higher-level, e.g., use lm.fit() rather than repeatedly fitting the same design matrix. Re-use existing, efficient implementations of common operations, e.g., tabulate(), rowSums() and friends, %in%, …
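For instance, a minimal sketch of re-using efficient base R implementations:

m <- matrix(runif(20), nrow=4)
rowSums(m)                    ## instead of apply(m, 1, sum)
tabulate(c(1L, 2L, 2L, 5L))   ## fast counts of positive integers
c("a", "q") %in% letters      ## vectorized membership test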
Here's an obviously inefficient function:

f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    ## inefficient: 'result' grows (and is copied) on each iteration
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}

Use system.time() to investigate how this algorithm scales with n, focusing on elapsed time.
system.time(f0(10000))

##    user  system elapsed 
##   0.187   0.008   0.195

n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")

Remember the current ‘correct’ value, and an approximate time:
n <- 10000
system.time(expected <- f0(n))

##    user  system elapsed 
##   0.179   0.000   0.179

head(expected)

## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519

Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the ‘optimization’ and the original calculation are the same. Use the microbenchmark package to compare the two versions:
f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))

## [1] TRUE

library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)

## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval
##  f0(n) 156.3378 156.4045 175.3319 156.8620 203.4312 203.6239     5
##  f1(n) 155.3136 156.5851 265.3980 202.0044 202.1739 610.9129     5

Hoisting a out of the loop makes little difference here: the dominant cost is growing (and repeatedly copying) result, not the multiplication. Adopt a ‘pre-allocate and fill’ strategy:
f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))

## [1] TRUE

microbenchmark(f0(n), f2(n), times=5)

## Unit: milliseconds
##   expr        min         lq       mean     median         uq        max neval
##  f0(n) 155.993159 156.635264 175.518239 157.970451 203.158903 203.833420     5
##  f2(n)   7.922534   7.930492   7.951939   7.954765   7.961272   7.990633     5

Use an *apply() function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent.
f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)
identical(expected, f3(n))

## [1] TRUE

microbenchmark(f0(n), f2(n), f3(n), times=10)

## Unit: milliseconds
##   expr        min         lq       mean     median         uq        max neval
##  f0(n) 155.782877 155.912423 179.260635 179.434312 202.485429 202.642872    10
##  f2(n)   7.899218   7.910416   8.441093   7.980780   8.133669  10.315199    10
##  f3(n)   3.762718   3.806632   3.836373   3.849915   3.862659   3.880582    10

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:
f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))

## [1] TRUE

microbenchmark(f0(n), f3(n), f4(n), times=10)

## Unit: microseconds
##   expr        min         lq        mean      median         uq        max neval
##  f0(n) 155763.460 155858.448 179344.1292 179768.8200 202611.040 202708.645    10
##  f3(n)   3773.596   3794.231   3907.6489   3848.1350   3865.203   4611.464    10
##  f4(n)    366.626    367.884    373.7918    373.7705    376.339    385.529    10

f4() definitely seems to be the winner. How does it scale with n? (Repeat several times.)
n <- 10 ^ (5:8)                         # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")

Any explanations for the different pattern of response?
Lessons learned:
- *apply() functions help avoid the need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it.

When data are too large to fit in memory, we can iterate through files in chunks or subset the data by fields or genomic positions.
Iteration
- open(), read chunk(s), close()
- yieldSize argument to Rsamtools::BamFile()
- GenomicFiles::reduceByYield() (a minimal sketch follows this list)
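A minimal sketch of chunk-wise iteration through a BAM file, assuming the GenomicFiles, GenomicAlignments, and Rsamtools packages; 'my.bam' is a hypothetical file name:

library(GenomicFiles)
library(GenomicAlignments)
## process 100,000 records per chunk; 'my.bam' is hypothetical
bf <- Rsamtools::BamFile("my.bam", yieldSize=100000)
## YIELD reads a chunk, MAP counts its records, REDUCE sums the counts
n <- reduceByYield(bf, YIELD=readGAlignments, MAP=length, REDUCE=`+`)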
Restriction

- Rsamtools::ScanBamParam() (a minimal sketch follows this list)
- Rsamtools::PileupParam()
- VariantAnnotation::ScanVcfParam()
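A minimal sketch of restricting input to one genomic region and a few fields of each record, assuming Rsamtools; 'my.bam' is again a hypothetical file name:

library(Rsamtools)
## read only chr1:1-1,000,000, and only three fields of each record
param <- ScanBamParam(which=GRanges("chr1", IRanges(1, 1000000)),
                      what=c("qname", "pos", "cigar"))
## aln <- scanBam("my.bam", param=param)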
Parallel evaluation

BiocParallel provides a standardized interface for simple parallel evaluation. The package provides access to the snow and multicore functionality in the parallel package, as well as to BatchJobs for running cluster jobs.
General ideas:
- Use bplapply() instead of lapply()
- The argument BPPARAM influences how parallel evaluation occurs (a minimal sketch follows this list):
  - MulticoreParam(): forked processes on a single (non-Windows) machine
  - SnowParam(): processes on the same or different machines
  - BatchJobsParam(): resource scheduler on a cluster
  - DoparParam(): parallel execution with foreach()
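A minimal sketch of choosing a back-end via BPPARAM (MulticoreParam() is not available on Windows):

library(BiocParallel)
## evaluate with 4 forked worker processes
res <- bplapply(1:8, sqrt, BPPARAM=MulticoreParam(workers=4))
unlist(res)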
This small example motivates the use of parallel execution and demonstrates how bplapply() can be a drop-in replacement for lapply(). fun() sleeps for 1 second, then returns i. Use system.time() to explore how long this takes to execute as n increases from 1 to 10; use identical() and microbenchmark to compare the alternatives f0() (serial) and f1() (parallel) for both correctness and performance.
library(BiocParallel)
fun <- function(i) {
    Sys.sleep(1)
    i
}
## serial
f0 <- function(n)
    lapply(seq_len(n), fun)
## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)
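As a rough sketch of what to expect (actual timings depend on the number of available workers):

system.time(f0(4))         ## about 4 seconds: tasks run one after another
system.time(f1(4))         ## roughly 4 / (number of workers) seconds
identical(f0(4), f1(4))    ## TRUE: same results, different execution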