Scalable computing

Efficient R code

Iteration

Restriction

Sampling

Parallel evaluation

File management

File classes

Type     Example use              Name                       Package
.bed     Range annotations        BEDFile()                  rtracklayer
.wig     Coverage                 WIGFile(), BigWigFile()    rtracklayer
.gtf     Transcript models        GTFFile()                  rtracklayer
                                  makeTxDbFromGFF()          GenomicFeatures
.2bit    Genomic sequence         TwoBitFile()               rtracklayer
.fastq   Reads & qualities        FastqFile()                ShortRead
.bam     Aligned reads            BamFile()                  Rsamtools
.tbx     Indexed tab-delimited    TabixFile()                Rsamtools
.vcf     Variant calls            VcfFile()                  VariantAnnotation
## rtracklayer menagerie
library(rtracklayer)
names(getClass("RTLFile")@subclasses)
##  [1] "UCSCFile"         "GFFFile"          "BEDFile"         
##  [4] "WIGFile"          "BigWigFile"       "ChainFile"       
##  [7] "TwoBitFile"       "FastaFile"        "TabSeparatedFile"
## [10] "CompressedFile"   "GFF1File"         "GFF2File"        
## [13] "GFF3File"         "BEDGraphFile"     "BED15File"       
## [16] "BWFile"           "2BitFile"         "GTFFile"         
## [19] "GVFFile"          "GZFile"           "BGZFile"         
## [22] "BZ2File"          "XZFile"
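
As a quick sketch of how these classes are used (the file name here is hypothetical): construct a File object pointing at the resource, then import() it into an R object such as a GRanges. import() also accepts a plain file path, inferring the type from the extension.

bed <- BEDFile("my.bed")    ## hypothetical local BED file
gr <- import(bed)           ## GRanges of the ranges in the file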

Notes

Managing a collection of files

*FileList() classes

GenomicFiles()
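
A minimal sketch of both ideas (file paths and ranges here are hypothetical): a *FileList() bundles several files of the same type, and GenomicFiles() pairs a collection of files with ranges of interest.

library(Rsamtools)       ## BamFileList()
library(GenomicFiles)    ## GenomicFiles(); attaches GenomicRanges

## hypothetical collection of BAM files
fls <- dir("path/to/bams", ".bam$", full=TRUE)
bfl <- BamFileList(fls)

## pair the files with ranges of interest
rngs <- GRanges("chr1", IRanges(c(1, 1000001), width=500000))
gf <- GenomicFiles(rowRanges=rngs, files=fls)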

Parallel evaluation with BiocParallel

BiocParallel provides a standardized interface for simple parallel evaluation.
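
At its simplest, bplapply() is a drop-in replacement for lapply() that evaluates over the currently registered back-end:

library(BiocParallel)
registered()          ## available back-ends; the first is used by default
bplapply(1:4, sqrt)   ## as lapply(), but potentially in parallel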

Other resources

Practical

Efficient code

Define the following as a function.

f0 <- function(n) {
    ## inefficient!
    ans <- numeric()
    for (i in seq_len(n))
        ans <- c(ans, exp(i))
    ans
}

Use system.time() to explore how long this takes to execute as n increases from 100 to 10000. Use identical() and the microbenchmark package to compare the alternatives f1(), f2(), and f3() for both correctness and performance. What strategy does each function use? (A comparison sketch follows the three definitions below.)

f1 <- function(n) {
    ans <- numeric(n)
    for (i in seq_len(n))
        ans[[i]] <- exp(i)
    ans
}

f2 <- function(n)
    sapply(seq_len(n), exp)

f3 <- function(n)
    exp(seq_len(n))
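
One way to carry out the comparison (assuming the microbenchmark package is installed):

n <- 1000
stopifnot(
    identical(f0(n), f1(n)),    ## all strategies should agree
    identical(f0(n), f2(n)),
    identical(f0(n), f3(n))
)
library(microbenchmark)
microbenchmark(f0(n), f1(n), f2(n), f3(n), times=10)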

Sleeping serially and in parallel

fun() goes to sleep for 1 second, then returns i. Applied serially to 8 elements this takes 8 seconds; run in parallel it should take much less.

library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

## serial
f0 <- function(n)
    lapply(seq_len(n), fun)

## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)
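
Comparing the timings (the parallel elapsed time depends on the number of workers registered):

system.time(f0(8))    ## about 8 seconds elapsed
system.time(f1(8))    ## e.g., ~2 seconds elapsed with 4 workers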

Reads overlapping windows

This exercise uses the following packages:

library(GenomicAlignments)
library(GenomicFiles)
library(BiocParallel)
library(Rsamtools)
library(GenomeInfoDb)

This is a re-implementation of the basic csaw binned-counts algorithm. It assumes that ChIP fragment lengths are 110 nt and that coverage is binned in windows of width 50; we focus on chr1.

frag.len <- 110
spacing <- 50
chr <- "chr1"

Here we point to the BAM files, indicating that we’ll process each file in chunks of 1,000,000 records.

fls <- dir("~/UseBioconductor-data/ChIPSeq/NFYA/", ".BAM$", full=TRUE)
names(fls) <- sub(".fastq.*", "", basename(fls))
bfl <- BamFileList(fls, yieldSize=1000000)

We’ll create the counting bins using tileGenome(), focusing on the ‘standard’ chromosomes.

len <- seqlengths(keepStandardChromosomes(seqinfo(bfl)))[chr]
tiles <- tileGenome(len, tilewidth=spacing, cut.last.tile.in.chrom=TRUE)

We’ll use reduceByYield() to iterate through a single file. We need to tell this function how we’ll YIELD a chunk of the file, how we’ll MAP the chunk from its input representation to per-window counts, and finally how we’ll REDUCE successive chunks into the final representation.

YIELD is a function that takes one argument, the input source, and returns a chunk of records:

yield <- function(x, ...)
    readGAlignments(x)

MAP takes the output of YIELD and perhaps additional arguments, and returns a vector of counts. We resize the genomic ranges describing each alignment so that they have a width equal to the fragment length:

map <- function(x, tiles, frag.len, ...) {
    gr <- keepStandardChromosomes(granges(x))
    countOverlaps(tiles, resize(gr, frag.len))
}

REDUCE takes two results from MAP (in our case, vectors of counts) and combines them into a single result. We simply add our vectors (+ is actually a function!):

reduce <- `+`
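
As a quick check that + really is an ordinary function of two arguments:

`+`(c(1, 2), c(10, 20))
## [1] 11 22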

To process one file, we use reduceByYield(), passing the file we want to process, the yield, map, and reduce functions. Our ‘wrapper’ function passes any additional arguments through to reduceByYield() using ...:

count1file <- function(bf, ...)
    reduceByYield(bf, yield, map, reduce, ...)

Using yieldSize and reduceByYield() means that each file is processed in a modest amount of memory, so we can process several files in parallel using BiocParallel. The simplify2array() function transforms a list of vectors into a matrix.

counts <- bplapply(bfl, count1file, tiles=tiles, frag.len=frag.len)
counts <- simplify2array(counts)
dim(counts)
colSums(counts)

Resources