# Skills for Intermediate R Users

CSAMA 2014
Author: Martin Morgan (mtmorgan@fhcrc.org)
Date: 23 June, 2014

## Refresher: working with data

### Case study: ALL phenotypic data

This case study serves as a refresher on basic input and manipulation of data.

Input a file that contains ALL (acute lymphoblastic leukemia) patient information

```r
fname <- file.choose()   ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
```

Input the data using read.delim() (check out the help page ?read.delim for input options), and explore basic properties of the object you've created, for instance

```r
pdata <- read.delim(fname)
class(pdata)
## [1] "data.frame"
colnames(pdata)
##  [1] "id"             "diagnosis"      "sex"            "age"
##  [5] "BT"             "remission"      "CR"             "date.cr"
##  [9] "t.4.11."        "t.9.22."        "cyto.normal"    "citog"
## [13] "mol.biol"       "fusion.protein" "mdr"            "kinet"
## [17] "ccr"            "relapse"        "transplant"     "f.u"
## [21] "date.last.seen"
dim(pdata)
## [1] 127  21
head(pdata)
##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 6 4008 7/30/1997   M  17 B1        CR CR 9/27/1997   FALSE   FALSE
##   cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE  simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 6       FALSE complex alt.      NEG           <NA> NEG hyperd. FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997
## 6    TRUE      FALSE               REL     12/15/1997
summary(pdata$sex)
##    F    M NA's
##   42   83    2
summary(pdata$cyto.normal)
##    Mode   FALSE    TRUE    NA's
## logical      69      24      34
```

Remind yourselves about various ways to subset and access columns of a data.frame

```r
pdata[1:5, 3:4]
##   sex age
## 1   M  53
## 2   M  19
## 3   F  52
## 4   M  38
## 5   M  57
pdata[1:5, ]
##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
##   cyto.normal       citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE     t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA        <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE     t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE     del(6q)      NEG           <NA> NEG dyploid FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997
head(pdata[, 3:5])
##   sex age BT
## 1   M  53 B2
## 2   M  19 B2
## 3   F  52 B4
## 4   M  38 B1
## 5   M  57 B2
## 6   M  17 B1
tail(pdata[, 3:5], 3)
##     sex age BT
## 125   M  19 T2
## 126   M  30 T3
## 127   M  29 T2
head(pdata$age)
## [1] 53 19 52 38 57 17
head(pdata$sex)
## [1] M M F M M M
## Levels: F M
```
```r
head(pdata[pdata$age > 21,])
##      id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1  1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 3  3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4  4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5  4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 10 8001 1/15/1997   M  40 B2        CR CR 3/26/1997   FALSE   FALSE
## 11 8011 8/21/1998   M  33 B3        CR CR 10/8/1998   FALSE   FALSE
##    cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1        FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 3           NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4        FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5        FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 10       FALSE     del(p15)  BCR/ABL           p190 NEG    <NA> FALSE
## 11       FALSE del(p15/p16)  BCR/ABL      p190/p210 NEG dyploid FALSE
##    relapse transplant               f.u date.last.seen
## 1    FALSE       TRUE BMT / DEATH IN CR           <NA>
## 3     TRUE      FALSE               REL     10/15/1999
## 4     TRUE      FALSE               REL      1/23/1998
## 5     TRUE      FALSE               REL      11/4/1997
## 10    TRUE      FALSE               REL      7/11/1997
## 11   FALSE       TRUE BMT / DEATH IN CR           <NA>
```

It seems from below that there are 17 females over 40 in the data set, but when sub-setting pdata to contain just those individuals, 19 rows are selected. Why? What can we do to correct this?

```r
idx <- pdata$sex == "F" & pdata$age > 40
table(idx)
## idx
## FALSE  TRUE
##   108    17
dim(pdata[idx,])
## [1] 19 21
```

Use the mol.biol column to subset the data to contain just individuals with 'BCR/ABL' or 'NEG', e.g.,

```r
bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]
```
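The discrepancy arises because comparisons involving NA yield NA, and subsetting a data.frame with an NA index keeps a row of NAs. One way to correct this (a sketch on a small hypothetical data frame standing in for pdata, not the only solution) is to subset with which(), which keeps only the TRUE positions:

```r
## Toy stand-in for pdata (hypothetical values; two NA entries)
df <- data.frame(sex = c("F", "M", NA, "F", "F"),
                 age = c(45, 50, 60, NA, 41))
idx <- df$sex == "F" & df$age > 40
idx                      ## TRUE FALSE NA NA TRUE: NA wherever sex or age is NA
nrow(df[idx, ])          ## 4: each NA position contributes a row of NAs
nrow(df[which(idx), ])   ## 2: which() drops the NA positions
```

An alternative is to set the NA positions of the index to FALSE explicitly, e.g., `idx[is.na(idx)] <- FALSE`, before subsetting.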

The mol.biol column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels?

```r
bcrabl$mol.biol <- factor(bcrabl$mol.biol)
```
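Calling factor() on a factor rebuilds it with only the levels actually present; droplevels() does the same thing, and also works on every factor column of a data.frame at once. A small sketch on a made-up factor (the level names here are illustrative):

```r
## Factor with an unused level, as after subsetting
f <- factor(c("BCR/ABL", "NEG", "NEG"),
            levels = c("ALL1/AF4", "BCR/ABL", "NEG"))
levels(f)              ## all three levels are retained
levels(droplevels(f))  ## only the levels in use remain
```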

The BT column is a factor describing B- and T-cell subtypes

```r
levels(bcrabl$BT)
##  [1] "B"  "B1" "B2" "B3" "B4" "T"  "T1" "T2" "T3" "T4"
```

How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so that there are only two subtypes, B and T?

```r
table(bcrabl$BT)
##
##  B B1 B2 B3 B4  T T1 T2 T3 T4
##  4  9 35 22  9  4  1 15  9  2
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
##
##  B  T
## 79 31
```

Use xtabs() (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups

```r
xtabs(~ BT + mol.biol, bcrabl)
##    mol.biol
## BT  BCR/ABL NEG
##   B      37  42
##   T       0  31
```

Use aggregate() to calculate the average age of males and females in the BCR/ABL and NEG treatment groups.

```r
aggregate(age ~ mol.biol + sex, bcrabl, mean)
##   mol.biol sex   age
## 1  BCR/ABL   F 39.94
## 2      NEG   F 30.42
## 3  BCR/ABL   M 40.50
## 4      NEG   M 27.21
```

Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that the variance of ages in the two groups is identical. What parts of the test output change?

```r
t.test(age ~ mol.biol, bcrabl)
##
##  Welch Two Sample t-test
##
## data:  age by mol.biol
## t = 4.817, df = 68.53, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.135 17.224
## sample estimates:
## mean in group BCR/ABL     mean in group NEG
##                 40.25                 28.07
boxplot(age ~ mol.biol, bcrabl)
```

### Case study: weighty matters

This case study is a second walk through basic data manipulation and visualization skills. We use data from the US Centers for Disease Control's Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 10000 observations from each of 1990 and 2010.

Input the data using read.csv(), creating a variable brfss to hold it. Use file.choose() to locate the data file BRFSS-subset.csv

```r
fname <- file.choose()   ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read.csv(fname)
```

#### Exercises: base plotting functions

1. Explore the data using class(), dim(), head(), summary(), etc. Use xtabs() to summarize the number of males and females in the study, in each of the two years.

2. Use aggregate() to summarize the average weight in each sex and year.

3. Create a scatterplot showing the relationship between the square root of weight and height, using the plot() function and the main argument to annotate the plot. Note the transformed Y-axis. Experiment with different plotting symbols (try the command example(points) to view different points).

    ```r
    plot(sqrt(Weight) ~ Height, brfss, main="All Years, Both Sexes")
    ```

4. Color the female and male points differently. To do this, use the col argument to plot(). Provide as a value to that argument a vector of colors, subset by brfss$Sex.
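    The color-by-Sex idea in the previous exercise can be sketched with a hypothetical grouping variable (the values and colors are illustrative, not from the data set): index a named vector of colors, one per level, by a character version of the grouping variable to get one color per observation.

    ```r
    ## Hypothetical grouping variable standing in for brfss$Sex
    sex <- factor(c("Female", "Male", "Male", "Female"))
    ## One color per level, expanded to one color per observation by subsetting
    cols <- c(Female = "red", Male = "blue")[as.character(sex)]
    unname(cols)   ## "red" "blue" "blue" "red"
    ```

    Passing such a vector as `col` then colors each point by its group.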

5. Create a subset of the data containing only observations from 2010.

```r
brfss2010 <- brfss[brfss$Year == "2010", ]
```

6. Create the figure below (two panels in a single figure). Do this by using the par() function with the mfcol argument before calling plot(). You'll need to create two more subsets of data, perhaps when you are providing the data to the function plot().

    ```r
    opar <- par(mfcol=c(1, 2))
    plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Female", ],
         main="2010, Female")
    plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Male", ],
         main="2010, Male")
    par(mfcol=c(1, 1))
    ```

7. Plotting large numbers of points means that they are often over-plotted, potentially obscuring important patterns. Experiment with arguments to plot() to address over-plotting, e.g., pch='.' or alpha=.4. Try using the smoothScatter() function (the data have to be presented as x and y, rather than as a formula). Try adding the hexbin library to your R session (using library()) and creating a hexbinplot().

#### Exercises: Trellis graphics and the lattice package

R has a number of additional plotting facilities, both as part of the 'base' distribution and in user-contributed packages. The lattice package adopts a particular philosophy about the presentation of data, and can be a very effective visualization tool.

1. Use library() to load the lattice package.

    ```r
    library(lattice)
    ```

2. Create the following figure using the xyplot() function with a formula and the brfss2010 data. The formula is sqrt(Weight) ~ Height | Sex, which can be read as 'square root of Weight as a function of Height, conditioned on Sex'.

    ```r
    xyplot(sqrt(Weight) ~ Height | Sex, brfss2010)
    ```

3. Add a background grid and a regression line to each panel using the argument type=c('p', 'g', 'r'); change the width (lwd) and color (col.line) of the regression line. For some hints on other arguments, see the help pages ?xyplot (for the overall structure of the plot) and ?panel.xyplot (for how each 'panel' of data is plotted).

4. Create the following figure. Use the densityplot() function with the formula ~ sqrt(Weight). The group=Sex function argument creates separate lines for each sex. Be sure to use plot.points=FALSE to avoid a 'rug' of points at the base of the figure. Can you add a key (legend)?

    ```r
    densityplot(~sqrt(Weight), brfss2010, group=Sex, plot.points=FALSE)
    ```

5. Create the figure below using the bwplot() function. The formula requires that Year be coerced to a factor, factor(Year).
    ```r
    bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss)
    ```

6. Create the figure below, a violin plot, using bwplot() and the panel argument set to panel.violin. From ?bwplot we learn that panel is a function that determines how each panel is drawn; the details for controlling the violin panel plot are described on the help page ?panel.violin.

    ```r
    bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss, panel=panel.violin)
    ```

7. (Advanced) We can write our own panel argument to lattice functions to influence how each panel is displayed. Here we add a point at the median age and weight.

    ```r
    xyplot(sqrt(Weight) ~ Height | Sex, brfss2010,
           panel = function(x, y, ...) {
               panel.xyplot(x, y, ...)
               panel.points(median(x, na.rm=TRUE), median(y, na.rm=TRUE),
                            cex=2, pch=20, col="red")
           }, type=c("p", "g", "r"), lwd=2, col.line="red", xlim=c(120, 210))
    ```

## Making progress: classes, methods, and packages

This section focuses on classes, methods, and packages, with the goal of learning to navigate the help system and interactive discovery facilities.

### Motivation

Sequence analysis is specialized

- Large data needs to be processed in a memory- and time-efficient manner
- Specific algorithms have been developed for the unique characteristics of sequence data

Additional considerations

- Re-use of existing, tested code is easier and less error-prone than re-inventing the wheel.
- Interoperability between packages is easier when the packages share similar data structures.

Solution: use well-defined classes to represent complex data; methods operate on the classes to perform useful functions. Classes and methods are placed together and distributed as packages so that we can all benefit from the hard work and tested code of others.
### Case study: IRanges and GRanges

The IRanges package defines an important class for specifying integer ranges, e.g.,

```r
library(IRanges)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
##     xtabs
##
## The following objects are masked from 'package:base':
##
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, as.vector, cbind, colnames, do.call,
##     duplicated, eval, evalq, get, intersect, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rep.int, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unlist
ir <- IRanges(start=c(10, 20, 30), width=5)
ir
## IRanges of length 3
##     start end width
## [1]    10  14     5
## [2]    20  24     5
## [3]    30  34     5
```

There are many interesting operations to be performed on ranges, e.g., flank() identifies adjacent ranges

```r
flank(ir, 3)
## IRanges of length 3
##     start end width
## [1]     7   9     3
## [2]    17  19     3
## [3]    27  29     3
```

Consult the help page for flank, ?flank, and explore other range-based operations.

The IRanges class is part of a class hierarchy.
To see this, ask R for the class of ir, and for the class definition of the IRanges class

```r
class(ir)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
getClassDef(class(ir))
## Class "IRanges" [package "IRanges"]
##
## Slots:
##
## Name:            start           width           NAMES     elementType
## Class:         integer         integer characterORNULL       character
##
## Name:  elementMetadata        metadata
## Class: DataTableORNULL            list
##
## Extends:
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
##
## Known Subclasses: "NormalIRanges"
```

Notice that IRanges extends the Ranges class. Now try entering ?"flank,<tab>, where <tab> means to press the tab key to ask for tab completion (this may not be necessary in RStudio). You can see that there are help pages for several different classes. Tab-complete to ?"flank,Ranges-method" and verify that you're at the page that describes the method relevant to an IRanges instance.

The GenomicRanges package extends the notion of ranges to include features relevant to the application of ranges in sequence analysis, particularly the ability to associate a range with a sequence name (e.g., chromosome) and a strand. Create a GRanges instance based on our IRanges instance, as follows

```r
library(GenomicRanges)
## Loading required package: GenomeInfoDb
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr
## GRanges with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [10, 14]      +
##   [2]     chr1  [20, 24]      -
##   [3]     chr2  [30, 34]      +
##   ---
##   seqlengths:
##    chr1 chr2
##      NA   NA
```

The notion of flanking sequence has a more nuanced meaning in biology.
In particular we might expect that flanking sequence on the + strand would precede the range, but on the minus strand would follow it. Verify that flank applied to a GRanges object has this behavior.

```r
flank(gr, 3)
## GRanges with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [ 7,  9]      +
##   [2]     chr1  [25, 27]      -
##   [3]     chr2  [27, 29]      +
##   ---
##   seqlengths:
##    chr1 chr2
##      NA   NA
```

Discover what classes GRanges extends, find the help page documenting the behavior of flank when applied to a GRanges object, and verify that the help page documents the behavior we just observed.

```r
class(gr)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
getClassDef(class(gr))
## Class "GRanges" [package "GenomicRanges"]
##
## Slots:
##
## Name:         seqnames          ranges          strand elementMetadata
## Class:             Rle         IRanges             Rle       DataFrame
##
## Name:          seqinfo        metadata
## Class:         Seqinfo            list
##
## Extends:
## Class "GenomicRanges", directly
## Class "Vector", by class "GenomicRanges", distance 2
## Class "GenomicRangesORmissing", by class "GenomicRanges", distance 2
## Class "GenomicRangesORGRangesList", by class "GenomicRanges", distance 2
## Class "Annotated", by class "GenomicRanges", distance 3
?"flank,GenomicRanges-method"
```

Notice that the available flank() methods have been augmented by the methods defined in the GenomicRanges package. It seems like there might be a number of helpful methods available for working with genomic ranges; we can discover some of these from the command line, indicating that the methods should be on the current search() path

```r
showMethods(class="GRanges", where=search())
```

Use help() to list the help pages in the GenomicRanges package, and vignette() to view and access available vignettes; these are also available in the RStudio 'Help' tab.
```r
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
```

## A sequence analysis package tour

This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; much of this material will be covered in greater detail in subsequent labs and lectures.

Basics

- Bioconductor packages are listed on the biocViews page. Each package has 'biocViews' (tags from a controlled vocabulary) associated with it; these can be searched to identify appropriately tagged packages, as can the package title and author.
- Each package has a 'landing page', e.g., for GenomicRanges. Visit this landing page, and note the description, authors, and installation instructions. Packages are often written up in the scientific literature, and if available the corresponding citation is present on the landing page. Also on the landing page are links to the vignettes and reference manual and, at the bottom, an indication of cross-platform availability and download statistics.
- A package needs to be installed once, using the instructions on the landing page. Once installed, the package can be loaded into an R session

    ```r
    library(GenomicRanges)
    ```

    and the help system queried interactively, as outlined above:

    ```r
    help(package="GenomicRanges")
    vignette(package="GenomicRanges")
    vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
    ?GRanges
    ```

Domain-specific analysis: explore the landing pages, vignettes, and reference manuals of two or three of the following packages.

- Important packages for analysis of differential expression include edgeR and DESeq2; both have excellent vignettes for exploration.
  Additional research methods embodied in Bioconductor packages can be discovered by visiting the biocViews web page, searching for the 'DifferentialExpression' view term, and narrowing the selection by searching for 'RNA seq' and similar.

- Popular ChIP-seq packages include DiffBind for comparison of peaks across samples, ChIPQC for quality assessment, and ChIPpeakAnno for annotating results (e.g., discovering nearby genes). What other ChIP-seq packages are listed on the biocViews page?
- Working with called variants (VCF files) is facilitated by packages such as VariantAnnotation, VariantFiltering, and ensemblVEP; packages for calling variants include, e.g., h5vc and VariantTools.
- Several packages identify copy number variants from sequence data, including cn.mops; from the biocViews page, what other copy number packages are available? The CNTools package provides some useful facilities for comparison of segments across samples.
- Microbiome and metagenomic analysis is facilitated by packages such as phyloseq and metagenomeSeq.
- Metabolomics, chemoinformatics, image analysis, and many other high-throughput analysis domains are also represented in Bioconductor; explore these via biocViews and title searches.

Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.

- The Biostrings package is used to represent DNA and other sequences, with many convenient sequence-related functions. Check out the functions documented on the help page ?consensusMatrix, for instance. Also check out the BSgenome package for working with whole genome sequences, e.g., ?"getSeq,BSgenome-method"
- The GenomicAlignments package is used to input reads aligned to a reference genome.
  See for instance the ?readGAlignments help page and vignette(package="GenomicAlignments", "summarizeOverlaps")

- rtracklayer's import and export functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser. Check out the ?import page for basic usage.
- The ShortRead and Rsamtools packages can be used for lower-level access to FASTQ and BAM files, respectively. Explore the ShortRead vignette and Scalable Genomics labs to see approaches to effectively processing large files.

Annotation: Bioconductor provides extensive access to 'annotation' resources (see the AnnotationData biocViews hierarchy); these are covered in greater detail in Thursday's lab, but some interesting examples to explore during this lab include:

- biomaRt, PSICQUIC, KEGGREST and other packages for querying on-line resources; each of these has informative vignettes.
- AnnotationDbi is a cornerstone of the Annotation Data packages provided by Bioconductor.
- org packages (e.g., org.Hs.eg.db) contain maps between different gene identifiers, e.g., ENTREZ and SYMBOL. The basic interface to these packages is described on the help page ?select
- TxDb packages (e.g., TxDb.Hsapiens.UCSC.hg19.knownGene) contain gene models (exon coordinates, exon / transcript relationships, etc.) derived from common sources such as the hg19 knownGene track of the UCSC genome browser. These packages can be queried, e.g., as described on the ?exonsBy page to retrieve all exons grouped by gene or transcript.
- BSgenome packages (e.g., BSgenome.Hsapiens.UCSC.hg19) contain whole genomes of model organisms.
- VariantAnnotation and ensemblVEP provide access to sequence annotation facilities, e.g., to identify coding variants; see the Introduction to VariantAnnotation vignette for a brief introduction; we'll re-visit this during the Thursday lab.
- Take a quick look (we'll do more of this in Thursday's lab) at the annotation work flow on the Bioconductor web site.

## Avoiding deadly sins: efficient code

The goal of this section is to learn to write correct, robust, simple and efficient R code. We do this through a couple of case studies.

### Priorities

1. Correct: consistent with hand-worked examples (identical(), all.equal())
2. Robust: supports realistic inputs, e.g., 0-length vectors, NA values, …
3. Simple: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance.
4. Fast, or at least reasonable given the speed of modern computers.

### Strategies

1. Profile
    - Look at the script to understand in general terms what it is doing.
    - Step through the code to see how it is executed, and to gain an understanding of the speed of each line.
    - Time evaluation of select lines or simple chunks of code with system.time() or the microbenchmark package.
    - Profile the code with a tool that indicates how much time is spent in each function call or line: the built-in Rprof() function, or packages such as lineprof or aprof
2. Vectorize: operate on vectors, rather than explicit loops

    ```r
    x <- 1:10
    log(x)     ## NOT for (i in seq_along(x)) x[i] <- log(x[i])
    ##  [1] 0.0000 0.6931 1.0986 1.3863 1.6094 1.7918 1.9459 2.0794 2.1972 2.3026
    ```

3. Pre-allocate memory, then fill in the result

    ```r
    result <- numeric(10)
    result[1] <- runif(1)
    for (i in 2:length(result))
        result[i] <- runif(1) * result[i - 1]
    result
    ## [1] 0.72601 0.64980 0.44125 0.25742 0.09304 0.08861 0.07351 0.07070
    ## [9] 0.05750 0.02301
    ```

4. Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once
    - Simple, e.g., 'hoist' constant multiplications from a for loop
    - Higher-level, e.g., use lm.fit() rather than repeatedly fitting the same design matrix.
5. Re-use existing, tested code
    - Efficient implementations of common operations: tabulate(), rowSums() and friends, %in%, …
    - Efficient domain-specific implementations, e.g., snpStats for GWAS linear models; limma for microarray linear models; edgeR, DESeq2 for negative binomial GLMs relevant to RNASeq.
6. Re-think how to attack the problem
    - Different implementations
    - Alternative algorithms
7. Compile your script with the byte compiler
8. Use parallel evaluation
9. Speak in tongues: 'foreign' languages like C, Fortran

### Case study: from iteration to vectorization

Here's an obviously inefficient function:

```r
f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}
```

Use system.time() to investigate how this algorithm scales with n, focusing on elapsed time.

```r
system.time(f0(10000))
##    user  system elapsed
##   0.256   0.003   0.258
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
```

Remember the current 'correct' value, and an approximate time

```r
n <- 10000
system.time(expected <- f0(n))
##    user  system elapsed
##   0.252   0.003   0.254
head(expected)
## [1] 0.000 1.386 2.197 2.773 3.219 3.584
```

Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the 'optimization' and the original calculation are the same.
Use the microbenchmark package to compare the two versions

```r
f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))
## [1] TRUE
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
## Unit: milliseconds
##   expr   min    lq median    uq   max neval
##  f0(n) 231.6 234.1  235.2 349.7 352.2     5
##  f1(n) 229.6 232.3  348.2 348.8 357.9     5
```

Adopt a 'pre-allocate and fill' strategy

```r
f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), times=5)
## Unit: milliseconds
##   expr    min     lq median     uq    max neval
##  f0(n) 235.23 236.08 236.31 351.28 363.17     5
##  f2(n)  16.68  16.81  17.15  17.91  18.37     5
```

Use an *apply() function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent.

```r
f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)
identical(expected, f3(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), f3(n), times=10)
## Unit: milliseconds
##   expr    min     lq median     uq    max neval
##  f0(n) 234.50 237.66 298.54 359.67 365.47    10
##  f2(n)  16.84  17.24  17.32  18.30  21.26    10
##  f3(n)  20.37  21.04  22.21  22.55  27.46    10
```

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:

```r
f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))
## [1] TRUE
microbenchmark(f0(n), f3(n), f4(n), times=10)
## Unit: microseconds
##   expr      min       lq median       uq      max neval
##  f0(n) 233304.5 237095.1 298124 359998.1 364520.8    10
##  f3(n)  20219.3  20356.3  20679  21235.9  22307.4    10
##  f4(n)    473.4    474.3    488    488.7    527.9    10
```

Returning to our explicit iteration f2(), in these situations it can be helpful to compile the code to a more efficient representation. Do this using the compiler package.
```r
library(compiler)
f2c <- cmpfun(f2)
n <- 10000
identical(f2(n), f2c(n))
## [1] TRUE
microbenchmark(f2(n), f2c(n), times=10)
## Unit: milliseconds
##    expr    min     lq median     uq    max neval
##   f2(n) 17.214 17.670 18.173 19.025 22.358    10
##  f2c(n)  7.272  7.374  7.533  7.835  9.165    10
```

f4() definitely seems to be the winner. How does it scale with n? (Repeat several times.)

```r
n <- 100000 * seq(1, 20, 2)     # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, type="b")
```

Any explanations for the different pattern of response?

Lessons learned:

1. Vectorizing offers a huge improvement over iteration.
2. Pre-allocate-and-fill is very helpful when explicit iteration is required.
3. *apply() functions help avoid the need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it.
4. Hoisting common sub-expressions and using the compiler package can be helpful for improving performance when explicit iteration is required.

### Case study: Choosing algorithms

It can be very helpful to reason about an algorithm in an abstract sense, to gain understanding about how an operation might scale. Here's an interesting problem, taken from StackOverflow: Suppose one has a very long sorted vector

```r
vec <- c(seq(-100, -1, length.out=1e6), rep(0, 20),
         seq(1, 100, length.out=1e6))
```

with the simple goal being to identify the number of values less than zero. The original post and many responses suggested a variation of scanning the vector for values less than zero, then summing

```r
f0 <- function(v) sum(v < 0)
```

This algorithm compares each element of vec to zero, creating a logical vector as long as the original, length(v). This logical vector is then scanned by sum to count the number of elements satisfying the condition.

Questions:

1. How many vectors of length v need to be allocated for this algorithm?
2.
Based on the number of comparisons that need to be performed, how would you expect this algorithm to scale with the length of v? Verify this with a simple figure.

```r
N <- seq(1, 11, 2) * 1e6
Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f0(v))[[3]]
})
plot(Time ~ N, type="b")
```

Is there a better algorithm, i.e., an approach that arrives at the same answer but takes less time (and / or space)? The vector is sorted, and we can take advantage of that by doing a binary search. The algorithm is surprisingly simple: create an index to the minimum (first) element, and the maximum (last) element. Check to see if the element half way between is greater than or equal to zero. If so, move the maximum index to that point. Otherwise, make that point the new minimum. Repeat this procedure until the minimum index is no longer less than the maximum index.

```r
f3 <- function(x, threshold=0) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= threshold)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}
```

Approximately half of the possible values are discarded each iteration, so we expect on average to arrive at the end after about log2(length(v)) iterations; the algorithm scales with the log of the length of v, rather than with the length of v, and no long vectors are created. These differences become increasingly important as v becomes longer.

Questions:

1. Verify with simple data that f3() and f0() result in identical() answers.
2. Compare how the timing of f3() scales with vector length.

```r
## identity
identical(f0((-2):2), f3((-2):2))
## [1] TRUE
identical(f0(2:4), f3(2:4))
## [1] TRUE
identical(f0(-(4:2)), f3(-(4:2)))
## [1] TRUE
identical(f0(vec), f3(vec))
## [1] TRUE

## scale
N <- 10^(1:7)
Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f3(v))[[3]]
})
plot(Time ~ N, type="b")
```

1. Use the microbenchmark package to compare performance of f0() and f3() with the original data, vec.
2.
R code can be compiled, and compilation helps most when doing non-vectorized operations like those in f3(). Use compiler::cmpfun() to compile f3(), and compare the result using microbenchmark.

```r
## relative time
library(microbenchmark)
microbenchmark(f0(vec), f3(vec))
## Unit: microseconds
##     expr      min       lq  median       uq     max neval
##  f0(vec) 26688.44 27822.29 27924.6 28301.45 32728.0   100
##  f3(vec)    52.44    55.19    72.3    79.73   121.5   100
library(compiler)
f3c <- cmpfun(f3)
microbenchmark(f3(vec), f3c(vec))
## Unit: microseconds
##      expr   min    lq median    uq   max neval
##   f3(vec) 53.10 55.28  59.17 60.93 69.20   100
##  f3c(vec) 23.11 23.85  24.83 25.71 35.18   100
```

We could likely gain additional speed by writing the binary search algorithm in C, but we are already so happy with the performance improvement that we won't do that! It is useful to ask what is lost by f3() compared to f0(). For instance, does the algorithm work on character vectors? What about when the vector contains NA values? How are ties at 0 treated?

findInterval() is probably a better built-in way to solve the original problem, and generalizes to additional situations. The idea is to take the query that we are interested in, 0, and find the interval specified by vec in which it occurs.

```r
f4 <- function(v, query=0)
    findInterval(query - .Machine$double.eps, v)
```

```r
identical(f0(vec), f4(vec))
## [1] TRUE
microbenchmark(f0(vec), f3(vec), f4(vec))
## Unit: microseconds
##     expr      min       lq   median      uq     max neval
##  f0(vec) 26911.29 27841.17 27956.62 28343.8 31953.2   100
##  f3(vec)    52.59    58.17    70.79    78.8   137.9   100
##  f4(vec) 21972.92 22035.44 22062.42 22158.6 23855.8   100
```

The fact that it is flexible and well tested means that it would often be preferred to f3(), even though it is less speedy. For instance, compare the time it takes to query 10000 different points using f3 and iteration, versus findInterval and vectorization.
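A small self-contained sketch of what findInterval() computes (the toy vector here is illustrative, not the vec from above): for a sorted table, it returns the number of table elements at or below each query, so nudging the query just below zero counts strictly negative values, which is why f4() subtracts a tiny offset.

```r
v <- c(-2, -1, 0, 0, 1, 2)   ## small sorted vector, with ties at 0
## Query just below 0, so the 0 ties are not counted
findInterval(0 - .Machine$double.eps, v)   ## 2
sum(v < 0)                                 ## 2, the scan-based answer
```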

```r
threshold <- rnorm(10000)
identical(sapply(threshold, f3, x=vec), f4(vec, threshold))
## [1] TRUE
microbenchmark(sapply(threshold, f3, x=vec), f4(vec, threshold))
```

Some R functions that implement efficient algorithms are sort() (including radix sort), match() (hash table look-up), and tabulate(); these can be useful in your own code.
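A quick illustration of two of these building blocks (a minimal sketch; the vectors are made up): tabulate() counts occurrences of the positive integers 1, 2, …, and match() reports the position of each query in a table, with NA for absent values.

```r
x <- c(2L, 3L, 2L, 1L, 2L)
tabulate(x)                                ## counts of the values 1, 2, 3
match(c("b", "d", "a"), c("a", "b", "c"))  ## positions in the table; NA if absent
```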

Lessons learned:

1. Choice of algorithm can be very important
2. Implementing classical algorithms (like binary search) can be a rewarding learning experience even when, at the end of the day, it may be better to use existing functions.
3. The built-in R functions that implement efficient algorithms can be important building-blocks for more complicated code.