Scalable computing

Efficient R code

Iteration

Restriction

Sampling

Parallel evaluation

File management

File classes

Type     Example use              Name                       Package
.bed     Range annotations        BEDFile()                  rtracklayer
.wig     Coverage                 WIGFile(), BigWigFile()    rtracklayer
.gtf     Transcript models        GTFFile()                  rtracklayer
                                  makeTxDbFromGFF()          GenomicFeatures
.2bit    Genomic sequence         TwoBitFile()               rtracklayer
.fastq   Reads & qualities        FastqFile()                ShortRead
.bam     Aligned reads            BamFile()                  Rsamtools
.tbx     Indexed tab-delimited    TabixFile()                Rsamtools
.vcf     Variant calls            VcfFile()                  VariantAnnotation
## rtracklayer menagerie
library(rtracklayer)
names(getClass("RTLFile")@subclasses)
##  [1] "UCSCFile"         "GFFFile"          "BEDFile"         
##  [4] "WIGFile"          "BigWigFile"       "ChainFile"       
##  [7] "TwoBitFile"       "FastaFile"        "TabSeparatedFile"
## [10] "CompressedFile"   "GFF1File"         "GFF2File"        
## [13] "GFF3File"         "BEDGraphFile"     "BED15File"       
## [16] "BWFile"           "2BitFile"         "GTFFile"         
## [19] "GVFFile"          "GZFile"           "BGZFile"         
## [22] "BZ2File"          "XZFile"
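
As a quick sketch of how these classes are used (the file name here is hypothetical): construct a File object pointing at the resource, then import() it into an R object such as a GRanges. import() also accepts a plain file path, inferring the type from the extension.

bed <- BEDFile("my.bed")    ## hypothetical local BED file
gr <- import(bed)           ## GRanges of the ranges in the file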

Notes

Managing a collection of files

*FileList() classes

GenomicFiles()
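
A minimal sketch of both ideas (file paths and ranges here are hypothetical): a *FileList() bundles several files of the same type, and GenomicFiles() pairs a collection of files with ranges of interest.

library(Rsamtools)       ## BamFileList()
library(GenomicFiles)    ## GenomicFiles(); attaches GenomicRanges

## hypothetical collection of BAM files
fls <- dir("path/to/bams", ".bam$", full=TRUE)
bfl <- BamFileList(fls)

## pair the files with ranges of interest
rngs <- GRanges("chr1", IRanges(c(1, 1000001), width=500000))
gf <- GenomicFiles(rowRanges=rngs, files=fls)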

Parallel evaluation with BiocParallel

BiocParallel provides a standardized interface for simple parallel evaluation.
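
At its simplest, bplapply() is a drop-in replacement for lapply() that evaluates over the currently registered back-end:

library(BiocParallel)
registered()          ## available back-ends; the first is used by default
bplapply(1:4, sqrt)   ## as lapply(), but potentially in parallel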

Other resources

Practical

Efficient code

Define the following as a function.

f0 <- function(n) {
    ## inefficient!
    ans <- numeric()
    for (i in seq_len(n))
        ans <- c(ans, exp(i))
    ans
}

Use system.time() to explore how long this takes to execute as n increases from 100 to 10000. Use identical() and the microbenchmark package to compare the alternatives f1(), f2(), and f3() for both correctness and performance. What strategy does each function use? (A comparison sketch follows the three definitions below.)

f1 <- function(n) {
    ans <- numeric(n)
    for (i in seq_len(n))
        ans[[i]] <- exp(i)
    ans
}

f2 <- function(n)
    sapply(seq_len(n), exp)

f3 <- function(n)
    exp(seq_len(n))
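
One way to carry out the comparison (assuming the microbenchmark package is installed):

n <- 1000
stopifnot(
    identical(f0(n), f1(n)),    ## all strategies should agree
    identical(f0(n), f2(n)),
    identical(f0(n), f3(n))
)
library(microbenchmark)
microbenchmark(f0(n), f1(n), f2(n), f3(n), times=10)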

Sleeping serially and in parallel

fun() goes to sleep for 1 second, then returns i. Applied serially to 8 elements this takes 8 seconds; run in parallel it should take much less.

library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

## serial
f0 <- function(n)
    lapply(seq_len(n), fun)

## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)
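
Comparing the timings (the parallel elapsed time depends on the number of workers registered):

system.time(f0(8))    ## about 8 seconds elapsed
system.time(f1(8))    ## e.g., ~2 seconds elapsed with 4 workers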

Reads overlapping windows

This exercise uses the following packages:

library(GenomicAlignments)
library(GenomicFiles)
library(BiocParallel)
library(Rsamtools)
library(GenomeInfoDb)

This is a re-implementation of the basic csaw binned-counts algorithm. It assumes that ChIP fragment lengths are 110 nt and that coverage is binned in windows of width 50; we focus on chr1.

frag.len <- 110
spacing <- 50
chr <- "chr1"

Here we point to the BAM files, indicating that we’ll process each file in chunks of 1,000,000 records.

fls <- dir("~/UseBioconductor-data/ChIPSeq/NFYA/", ".BAM$", full=TRUE)
names(fls) <- sub(".fastq.*", "", basename(fls))
bfl <- BamFileList(fls, yieldSize=1000000)

We’ll create the counting bins using tileGenome(), focusing on the ‘standard’ chromosomes.

len <- seqlengths(keepStandardChromosomes(seqinfo(bfl)))[chr]
tiles <- tileGenome(len, tilewidth=spacing, cut.last.tile.in.chrom=TRUE)

We’ll use reduceByYield() to iterate through a single file. We need to tell this function how we’ll YIELD a chunk of the file, how we’ll MAP the chunk from its input representation to per-window counts, and finally how we’ll REDUCE successive chunks into the final representation.

YIELD is a function that takes one argument, the input source, and returns a chunk of records:

yield <- function(x, ...)
    readGAlignments(x)

MAP takes the output of YIELD and perhaps additional arguments, and returns a vector of counts. We resize the genomic ranges describing each alignment so that they have a width equal to the fragment length:

map <- function(x, tiles, frag.len, ...) {
    gr <- keepStandardChromosomes(granges(x))
    countOverlaps(tiles, resize(gr, frag.len))
}

REDUCE takes two results from MAP (in our case, vectors of counts) and combines them into a single result. We simply add our vectors (+ is actually a function!):

reduce <- `+`
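
As a quick check that + really is an ordinary function of two arguments:

`+`(c(1, 2), c(10, 20))
## [1] 11 22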

To process one file, we use reduceByYield(), passing the file we want to process, the yield, map, and reduce functions. Our ‘wrapper’ function passes any additional arguments through to reduceByYield() using ...:

count1file <- function(bf, ...)
    reduceByYield(bf, yield, map, reduce, ...)

Using yieldSize and reduceByYield() means that each file is processed in a modest amount of memory, so we can process several files in parallel using BiocParallel. The simplify2array() function transforms a list of vectors into a matrix.

counts <- bplapply(bfl, count1file, tiles=tiles, frag.len=frag.len)
counts <- simplify2array(counts)
dim(counts)
colSums(counts)

Resources