--- title: "20.1 -- Working with Large Data" author: "Martin Morgan " output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{20.1 -- Working with Large Data} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` Author: [Martin Morgan][]
Date: 26 July, 2019

[Martin Morgan]: mailto:Martin.Morgan@RoswellPark.org

# Motivating example

Attach the [TENxBrainData][] experiment data package

```{r, message = FALSE}
library(TENxBrainData)
```

Load a very large `SummarizedExperiment`

```{r}
tenx <- TENxBrainData()
tenx
assay(tenx)
```

Quickly perform basic operations

```{r}
log1p(assay(tenx))
```

Subset and summarize 'library size' for 1000 cells

```{r}
tenx_subset <- tenx[, 1:1000]
lib_size <- colSums(assay(tenx_subset))
hist(log10(lib_size))
```

The data is sparse, with more than 92% of cells equal to 0

```{r}
sum(assay(tenx_subset) == 0) / prod(dim(tenx_subset))
```

# Write or use efficient _R_ code

- The most important step!

Avoid unnecessary copying

```{r}
n <- 50000

## makes a copy of `res1` on each iteration
set.seed(123)
system.time({
    res1 <- NULL
    for (i in 1:n)
        res1 <- c(res1, rnorm(1))
})

## pre-allocation
set.seed(123)
system.time({
    res2 <- numeric(n)
    for (i in 1:n)
        res2[i] <- rnorm(1)
})
identical(res1, res2)

## no need to think about allocation!
set.seed(123)
system.time({
    res3 <- sapply(1:n, function(i) rnorm(1))
})
identical(res1, res3)
```

_Vectorize_ your own scripts

```{r}
n <- 2000000

## iteration: n calls to `rnorm(1)`
set.seed(123)
system.time({
    res1 <- sapply(1:n, function(i) rnorm(1))
})

## vectorize: 1 call to `rnorm(n)`
set.seed(123)
system.time(
    res2 <- rnorm(n)
)
identical(res1, res2)
```

_Reuse_ other people's efficient code.

- E.g., `limma::lmFit()` to fit 10,000's of linear models very quickly

Examples in the lab this afternoon, and from Tuesday evening

# Use chunk-wise iteration

Example: nucleotide frequency of mapped reads

## An example without chunking

Working through example data; not really large...

```{r, message = FALSE}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
basename(fname)
Rsamtools::countBam(fname)
```

Input into a `GAlignments` object, including `seq` (sequence) of each read.

```{r, message = FALSE}
library(GenomicAlignments)
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
```

Write a function to determine GC content of reads

```{r, message = FALSE}
library(Biostrings)
gc_content <- function(galn) {
    seq <- mcols(galn)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob=TRUE)
    as.vector(gc)
}
```

Calculate and display GC content

```{r}
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
res1 <- gc_content(galn)
hist(res1)
```

## The same example with chunking

Open file for reading, specifying 'yield' size

```{r}
bfl <- BamFile(fname, yieldSize = 100000)
open(bfl)
```

Repeatedly read chunks of data and calculate GC content

```{r}
res2 <- numeric()
repeat {
    message(".")
    galn <- readGAlignments(bfl, param = param)
    if (length(galn) == 0)
        break
    ## inefficient copy of res2, but only a few iterations...
    res2 <- c(res2, gc_content(galn))
}
```
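
The repeated `c(res2, ...)` above copies the result vector on each iteration; that is acceptable here because the file yields only a few chunks. With many more chunks, one alternative (a minimal sketch, not evaluated here) is to collect each chunk's result in a list and combine once after the loop. `bfl2`, `chunks`, and `res2_alt` are illustrative names; `fname`, `param`, and `gc_content()` are reused from above.

```{r, eval = FALSE}
## grow a list of per-chunk results; the numeric results themselves are
## not re-copied on each iteration, and are combined only once at the end
bfl2 <- BamFile(fname, yieldSize = 100000)
open(bfl2)
chunks <- list()
repeat {
    galn <- readGAlignments(bfl2, param = param)
    if (length(galn) == 0)
        break
    chunks[[length(chunks) + 1L]] <- gc_content(galn)
}
close(bfl2)
res2_alt <- unlist(chunks, use.names = FALSE)
```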
Clean up and compare approaches

```{r}
close(bfl)
identical(res1, res2)
```

# Use (classical) parallel evaluation

Many down-sides

- More complicated code, e.g., to distribute data
- Relies on, and requires mastery of, supporting infrastructure

Maximum speed-up

- Proportional to number of parallel computations
- In reality
    - Cost of data movement from 'manager' to 'worker'
    - Additional overhead of orchestrating parallel computations

## [BiocParallel][]

```{r, echo=FALSE}
xx <- gc(); xx <- gc()
```

```{r, message = FALSE}
fun <- function(i) {
    Sys.sleep(1)  # a time-consuming calculation
    i             # and then the result
}

system.time({
    res1 <- lapply(1:10, fun)
})

library(BiocParallel)
system.time({
    res2 <- bplapply(1:10, fun)
})
identical(res1, res2)
```

- 'Forked' processes (non-Windows)
    - No need to distribute data from main thread to workers
- Independent processes
    - Classic clusters, e.g., _slurm_
- Coming: cloud-based solutions

## [GenomicFiles][]

Parallel, chunk-wise iteration through genomic files. Set up:

```{r, message = FALSE}
library(GenomicFiles)
```

Define a `yield` function that provides a chunk of data for processing

```{r}
yield <- function(x) {
    param <- ScanBamParam(what = "seq")
    readGAlignments(x, param = param)
}
```

Define a `map` function that transforms the input data to the desired result

```{r}
map <- function(x) {
    seq <- mcols(x)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}
```

Define a `reduce` function that combines two successive results

```{r}
reduce <- c
```

Perform the calculation, chunk-wise and in parallel

```{r}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bfl <- BamFile(fname, yieldSize = 100000)
res <- reduceByYield(bfl, yield, map, reduce, parallel = TRUE)
hist(res)
```

[TENxBrainData]: https://bioconductor.org/packages/TENxBrainData
[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[GenomicFiles]: https://bioconductor.org/packages/GenomicFiles

# Query specific values

# Provenance

```{r}
sessionInfo()
```