
Original Authors: Martin Morgan
Presenting Authors: Martin Morgan, Lori Shepherd
Date: 22 July, 2019

Objective: An overview of software available in Bioconductor.

Lessons learned:

1 Bioconductor

Analysis and comprehension of high-throughput genomic data

1.1 Packages, vignettes, work flows

[Figure: Sequencing Ecosystem]

  • 1750 software packages
  • Discover and navigate via biocViews
  • Package ‘landing page’, e.g., Gviz
    • Title, author / maintainer, short description, citation, installation instructions, …, download statistics
  • All user-visible functions have help pages, most with runnable examples
  • ‘Vignettes’ are an important feature in Bioconductor: narrative documents illustrating how to use the package, with integrated code
  • ‘Release’ (every six months) and ‘devel’ branches

2 Packages

2.1 Finding packages

‘Domain-specific’ analysis

Exercise

  • Visit the listing of All Packages.
  • Use the ‘Search biocViews’ box in the upper left to identify packages that have been tagged for RNASeq analysis. Explore other analyses, such as ChIPSeq, Epigenetics, VariantAnnotation, Proteomics, SingleCell, etc. Explore the graph of software packages by expanding and contracting individual terms.
  • Return to RNASeq. Two very popular packages are DESeq2 and edgeR. Visit the ‘landing page’ of one of these packages. The landing page has a title, authors, instructions for citing use of the package, etc.
  • Briefly explore the vignette and reference manual links. When would you consult the vignette? When would the reference manual be helpful?

Bioconductor provides ‘infrastructure’ packages for working with genomic data. We’ll explore some of these packages in more detail in a later part of this lab. For now…

Exercise

  • Visit the landing pages of each of the Biostrings, GenomicRanges, VariantAnnotation, and GenomicAlignments packages. Create a short summary describing what each package does, and when it might be useful.
  • Visit the landing page for the SummarizedExperiment package. This package is meant to provide a data representation that helps manage, in a coordinated fashion, ‘assays’ (e.g., gene x sample matrix of RNASeq counts) together with annotations on the rows (e.g., genomic coordinates, P-values) and columns (e.g., sample sheets). Briefly review the user-oriented vignette (‘SummarizedExperiment for Coordinating Experimental Assays, Samples, and Regions of Interest’) to get a sense for how this package might be used.
  • Visit the landing page for the rtracklayer package. From the reference manual, when would the various import() and export() functions be useful?
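The coordination that SummarizedExperiment provides between assays, rows, and columns can be sketched with a toy object. This is only an illustration under stated assumptions: the counts, gene names, coordinates, and treatment labels below are invented, and the SummarizedExperiment package is assumed to be installed.

```r
library(SummarizedExperiment)

# Toy assay: 4 genes x 3 samples of made-up counts
counts <- matrix(
    c(10L, 0L, 5L, 8L, 12L, 1L, 7L, 9L, 30L, 2L, 6L, 20L),
    nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Row annotation: one genomic range per gene (coordinates are invented)
row_ranges <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(100, 500, 900, 1300), width = 200),
    strand = c("+", "-", "+", "-")
)
names(row_ranges) <- rownames(counts)

# Column annotation: a minimal 'sample sheet'
col_data <- DataFrame(
    treatment = c("control", "treated", "treated"),
    row.names = colnames(counts)
)

se <- SummarizedExperiment(
    assays = list(counts = counts),
    rowRanges = row_ranges,
    colData = col_data
)

# Subsetting by a column annotation subsets the assay in the same step
treated <- se[, se$treatment == "treated"]
dim(treated)
assay(treated, "counts")
```

The point of the design is that a single subsetting operation keeps the count matrix, the row ranges, and the sample annotations in sync, so they cannot drift apart during an analysis.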

Annotation packages are data-, rather than software-, centric, providing information about the relationship between different identifiers, gene models, reference genomes, etc.

Exercise

  • On the page listing All Packages, click on the AnnotationData top-level term.
  • Search, using the box on the right-hand side, for annotation packages that start with the following letters to get a sense of the packages and organisms available.

    • org.*: symbol mapping
    • TxDb.* and EnsDb.*: gene models
    • BSgenome.*: reference genomes

We’ll see in a subsequent lab that a wealth of additional annotation resources, including updated EnsDb and reference genomes, are available through AnnotationHub.
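As a small taste of what an org.* package offers, the sketch below maps gene symbols to Entrez identifiers. It assumes that the AnnotationDbi and org.Hs.eg.db packages are installed; the two gene symbols are chosen only for illustration.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)   # assumption: this annotation package is installed

symbols <- c("BRCA1", "TP53")

# Map each symbol to an Entrez gene identifier
entrez <- mapIds(
    org.Hs.eg.db,
    keys = symbols,
    keytype = "SYMBOL",
    column = "ENTREZID"
)
entrez

# Other identifier types that can serve as keys
head(keytypes(org.Hs.eg.db))
```

TxDb.*, EnsDb.*, and BSgenome.* packages have their own accessors; their landing pages and vignettes describe them.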

Workflow packages are meant to provide a comprehensive introduction to work flows that require several different packages. These can be quite extensive documents, providing a very rich source of information.

Exercise

  • Briefly explore the ‘Simple Single Cell’ workflow (or other workflow relevant to your domain of interest) to get a sense of the material the workflow covers.

2.2 Installing packages

Likely the packages needed for this course are already installed. Nonetheless it is useful to know how to install other packages.

Bioconductor has a particular approach to making packages available. We have a ‘devel’ branch where new packages and features are introduced, and a ‘release’ branch where users have access to stable packages. Each six months, in spring and fall, the current ‘devel’ version of packages is branched to become the next ‘release’. Packages within a release are tested with one another, so it is important to install packages from the same release. The BiocManager package tries to make it easy to do this.

The first step to package installation is to make sure that the BiocManager package has been installed using standard R procedures.

## install BiocManager from CRAN only if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager", repos = "https://cran.r-project.org")

Then, install the package(s) you would like to use

BiocManager::install(c("Biostrings", "GenomicRanges"))

BiocManager knows how to install CRAN and GitHub packages, too.
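Before installing, it can be handy to check which packages are actually missing. The helper below, missing_packages(), is a hypothetical name introduced here for illustration and is not part of BiocManager. Note that requireNamespace() loads the namespace of each package it finds, as a side effect.

```r
# Return the subset of `pkgs` that cannot be loaded from the current library.
# (Hypothetical helper, not part of BiocManager.)
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

wanted <- c("Biostrings", "GenomicRanges")
to_install <- missing_packages(wanted)

if (length(to_install) > 0) {
    message("Not yet installed: ", paste(to_install, collapse = ", "))
    # BiocManager::install(to_install)   # uncomment to install
} else {
    message("All requested packages are already installed")
}
```

The install call is left commented out so that running the sketch never changes your library by accident.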

There are several common problems encountered with package installation. Often, packages have been installed using methods different from the one recommended here, and the packages are from different Bioconductor releases. This leads to problems when packages from different releases are incompatible with one another.

Exercise Verify that your packages are current and installed from the same Bioconductor release with

BiocManager::valid()

Two common problems are that some packages are too old (a newer version of the package exists) or too new (some packages have been installed using a method other than BiocManager). If there are packages that are too old or too new, it is almost always a good idea to follow the instructions from BiocManager::valid() to correct the situation.
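The value returned by BiocManager::valid() can also be inspected programmatically. As documented, it is TRUE when everything is consistent, and otherwise a list with elements out_of_date and too_new. The sketch below summarizes such a result; summarize_valid() is a hypothetical helper, and example_result is invented data standing in for a real return value.

```r
# Summarize the value returned by BiocManager::valid().
# (Hypothetical helper, for illustration only.)
summarize_valid <- function(v) {
    if (isTRUE(v))
        return("All packages are valid for this Bioconductor release")
    sprintf(
        "%d package(s) out of date, %d package(s) too new",
        NROW(v$out_of_date), NROW(v$too_new)
    )
}

# Invented result, shaped like the documented return value
example_result <- list(
    out_of_date = data.frame(Package = c("pkgA", "pkgB")),
    too_new = NULL
)

summarize_valid(example_result)
summarize_valid(TRUE)
```

With an internet connection, the same helper could be applied to a real check with summarize_valid(BiocManager::valid()).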

2.3 Loading and using packages

Packages need to be installed only once for each version of R you use, but need to be loaded into each new R session that you start. Packages are loaded using

library(Biostrings)

When a package is loaded, it can sometimes generate messages that are informational only. If you are confident this is the case for the packages you’re loading, use suppressPackageStartupMessages() for a quieter experience:

suppressPackageStartupMessages({
    library(GenomicRanges)
    library(GenomicAlignments)
})

Exercise It is usually very helpful to explore package vignettes.

  • Visit the vignette of the DESeq2 package, and walk through a few steps to understand what the vignette provides in terms of instructions for starting with the package, functionality the package provides, mathematical and statistical details of the implementation, and how the analysis provided by the package might be extended by other packages in the Bioconductor ecosystem. One can visit vignettes through RStudio, or by running commands such as

    vignette(package = "DESeq2")
    browseVignettes("DESeq2")
  • Most vignettes are written in such a way that the R code of the vignette must be correct for the vignette to be produced. The code itself is available in the package. Find the code for the DESeq2 vignette

    dir(system.file(package="DESeq2", "doc"))
    ## [1] "DESeq2.html" "DESeq2.R"    "DESeq2.Rmd"  "index.html"
    vign <- system.file(package="DESeq2", "doc", "DESeq2.R")

    open it in RStudio (e.g., using File -> Open File… menu), step through the first few lines of R code and compare your output to the output in the vignette. Alternatively, run the entire analysis in the vignette with the command

    source(vign, echo = TRUE, max.lines = Inf)

Exercise Help pages provide more focused instructions for use of particular functions. It is often con

  • Load the Biostrings package

    library(Biostrings)
  • Look for help on the function letterFrequency() using the command

    ?letterFrequency

    note that there is tab completion after the ? and first few letters of the command.

  • The help page is quite complicated, documenting several different functions. In the ‘Description’ section, find a description of what letterFrequency() does. In the ‘Usage’ section, find the arguments that can be used with letterFrequency(), and try to understand, from the Arguments section what each argument might be or how it influences the computation. The Value section attempts to describe the return value of the letterFrequency() function.

  • Sometimes an example is worth a thousand words. Can you run the first two sections of the example at the end of the help page (for alphabetFrequency() and letterFrequency()) to arrive at a better understanding of how the letterFrequency() function works?
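If the help page examples feel too large to start with, a much smaller sketch may help. The short sequence below is made up; the arguments used (baseOnly, letters, as.prob) are the ones documented on the help page.

```r
library(Biostrings)

dna <- DNAString("ACGTTGCA")   # made-up sequence

# Counts of the four bases, plus an 'other' category
alphabetFrequency(dna, baseOnly = TRUE)

# Count "A" on its own, and "G" or "C" together
counts <- letterFrequency(dna, letters = c("A", "GC"))
counts

# The same kind of tabulation, expressed as a proportion
gc_content <- letterFrequency(dna, letters = "GC", as.prob = TRUE)
gc_content
```

Comparing the two calls shows the difference in intent: alphabetFrequency() tabulates the whole alphabet, whereas letterFrequency() tabulates only the letters (or groups of letters) you ask for.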

3 Getting help

Where to get help?

What can you get help on?

How to ask a good question

Exercise Visit the support site and review the five most recent questions. Which do you think are ‘good’, from the guidelines offered above? Which have received helpful answers? Can you figure out who the person answering the question is, i.e., why do they think they have an answer?

4 A sequence analysis package tour

This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; much of the material will be covered in greater detail in subsequent labs and lectures.

Basics

library(GenomicRanges)
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges
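After reading ?GRanges, a tiny hands-on sketch can make the class more concrete. The ranges and the score column below are invented for illustration; only functions exported by GenomicRanges (and its IRanges dependency) are used.

```r
library(GenomicRanges)

# Three made-up ranges with a metadata column called 'score'
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(1, 5, 10), width = c(10, 10, 5)),
    strand = c("+", "+", "-"),
    score = c(0.5, 1.2, 3.4)
)

width(gr)

# Merge overlapping ranges on the same sequence and strand
merged <- reduce(gr)
merged

# For each range in 'gr', count overlaps with a query region
query <- GRanges("chr1", IRanges(start = 12, end = 20), strand = "*")
hits <- countOverlaps(gr, query)
hits
```

Operations such as reduce() and countOverlaps() respect both the sequence name and the strand, which is why the chr2 range is never merged with, or counted against, the chr1 ranges.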

Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.

Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.

Annotation: Bioconductor provides extensive access to ‘annotation’ resources (see the AnnotationData biocViews hierarchy); these are covered in greater detail in Thursday’s lab, but some interesting examples to explore during this lab include:

A number of Bioconductor packages help with visualization and reporting, in addition to functions provided by individual packages.

5 End matter

5.1 Session Info

sessionInfo()
## R version 3.6.1 Patched (2019-07-16 r76845)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GenomicAlignments_1.21.4    Rsamtools_2.1.3            
##  [3] SummarizedExperiment_1.15.5 DelayedArray_0.11.4        
##  [5] BiocParallel_1.19.0         matrixStats_0.54.0         
##  [7] Biobase_2.45.0              GenomicRanges_1.37.14      
##  [9] GenomeInfoDb_1.21.1         Biostrings_2.53.2          
## [11] XVector_0.25.0              IRanges_2.19.10            
## [13] S4Vectors_0.23.17           BiocGenerics_0.31.5        
## [15] BiocStyle_2.13.2           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1             knitr_1.23             magrittr_1.5          
##  [4] zlibbioc_1.31.0        lattice_0.20-38        stringr_1.4.0         
##  [7] tools_3.6.1            grid_3.6.1             xfun_0.8              
## [10] htmltools_0.3.6        yaml_2.2.0             digest_0.6.20         
## [13] bookdown_0.12          Matrix_1.2-17          GenomeInfoDbData_1.2.1
## [16] BiocManager_1.30.4     bitops_1.0-6           codetools_0.2-16      
## [19] RCurl_1.95-4.12        evaluate_0.14          rmarkdown_1.14        
## [22] stringi_1.4.3          compiler_3.6.1

5.2 Acknowledgements

Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974).