--- title: "2. Sequences and Strings For Genome-Scale Data" author: "Valerie Obenchain (valerie.obenchain@roswellpark.org)
Lori Shepherd (lori.shepherd@roswellpark.org)
Martin Morgan (martin.morgan@roswellpark.org)
Stanford University, Stanford, CA
25 - 26 June, 2016" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{2. Sequences and Strings For Genome-Scale Data} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({ library(GenomicRanges) library(GenomicAlignments) }) ``` The material in this course requires R version 3.3 and Bioconductor version 3.4 ```{r configure-test} stopifnot( getRversion() >= '3.3' && getRversion() < '3.4', BiocInstaller::biocVersion() == "3.4" ) ``` # _Bioconductor_ 'infrastructure' for sequence analysis ## Classes, methods, and packages This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities. ## Motivation Sequence analysis is specialized - Large data needs to be processed in a memory- and time-efficient manner - Specific algorithms have been developed for the unique characteristics of sequence data Additional considerations - Re-use of existing, tested code is easier to do and less error-prone than re-inventing the wheel. - Interoperability between packages is easier when the packages share similar data structures. Solution: use well-defined _classes_ to represent complex data; _methods_ operate on the classes to perform useful functions. Classes and methods are placed together and distributed as _packages_ so that we can all benefit from the hard work and tested code of others. # Core packages
                   VariantAnnotation
                           |
                           v
                    GenomicFeatures
                           |
                           v
                       BSgenome
                           |
                           v
                      rtracklayer
                           |
                           v
                    GenomicAlignments
                      |           |
                      v           v
     SummarizedExperiment   Rsamtools  ShortRead
                  |         |      |      |
                  v         v      v      v
                GenomicRanges     Biostrings
                        |          |
                        v          v
               GenomeInfoDb   (XVector)
                        |     |
                        v     v
                        IRanges
                           |
                           v 
                      (S4Vectors)
# Core classes ## Case study: _IRanges_ and _GRanges_ The [IRanges][] package defines an important class for specifying integer ranges, e.g., ```{r iranges} library(IRanges) ir <- IRanges(start=c(10, 20, 30), width=5) ir ``` There are many interesting operations to be performed on ranges, e.g, `flank()` identifies adjacent ranges ```{r iranges-flank} flank(ir, 3) ``` The `IRanges` class is part of a class hierarchy. To see this, ask R for the class of `ir`, and for the class definition of the `IRanges` class ```{r iranges-class} class(ir) getClass(class(ir)) ``` Notice that `IRanges` extends the `Ranges` class. Show Now try entering `?flank` (if not using _RStudio_, enter `?"flank,"` where `` means to press the tab key to ask for tab completion). You can see that there are help pages for `flank` operating on several different classes. Select the completion ```{r iranges-flank-method, eval=FALSE} ?"flank,Ranges-method" ``` and verify that you're at the page that describes the method relevant to an `IRanges` instance. Explore other range-based operations. The [GenomicRanges][] package extends the notion of ranges to include features relevant to application of ranges in sequence analysis, particularly the ability to associate a range with a sequence name (e.g., chromosome) and a strand. Create a `GRanges` instance based on our `IRanges` instance, as follows ```{r granges} library(GenomicRanges) gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+")) gr ``` The notion of flanking sequence has a more nuanced meaning in biology. In particular we might expect that flanking sequence on the `+` strand would precede the range, but on the minus strand would follow it. Verify that `flank` applied to a `GRanges` object has this behavior. ```{r granges-flank} flank(gr, 3) ``` Discover what classes `GRanges` extends, find the help page documenting the behavior of `flank` when applied to a `GRanges` object, It seems like there might be a number of helpful methods available for working with genomic ranges; we can discover some of these from the command line, indicating that the methods should be on the current `search()` path ```{r granges-methods} methods(class="GRanges") ``` Notice that the available `flank()` methods have been augmented by the methods defined in the _GenomicRanges_ package, including those that are relevant (via inheritance) to the _GRanges_ class. ```{r granges-flank-method} grep("flank", methods(class="GRanges"), value=TRUE) ``` Verify that the help page documents the behavior we just observed. ```{r granges-flank-method-help, eval=FALSE} ?"flank,GenomicRanges-method" ``` Use `help()` to list the help pages in the `GenomicRanges` package, and `vignettes()` to view and access available vignettes; these are also available in the Rstudio 'Help' tab. ```{r granges-man-and-vignettes, eval=FALSE} help(package="GenomicRanges") vignette(package="GenomicRanges") vignette(package="GenomicRanges", "GenomicRangesHOWTOs") ``` ## _GenomicRanges_ ### The `GRanges` and `GRangesList` classes Aside: 'TxDb' packages provide an R representation of gene models ```{r txdb} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ``` `exons()`: _GRanges_ ```{r txdb-exons} exons(txdb) ``` ![Alt Genomic Ranges](our_figures/GRanges.png) `exonsBy()`: _GRangesList_ ```{r txdb-exonsby} exonsBy(txdb, "tx") ``` ![Alt Genomic Ranges List](our_figures/GRangesList.png) _GRanges_ / _GRangesList_ are incredibly useful - Represent **annotations** -- genes, variants, regulatory elements, copy number regions, ... - Represent **data** -- aligned reads, ChIP peaks, called variants, ... ### Algebra of genomic ranges Many biologically interesting questions represent operations on ranges - Count overlaps between aligned reads and known genes -- `GenomicRanges::summarizeOverlaps()` - Genes nearest to regulatory regions -- `GenomicRanges::nearest()`, [ChIPseeker][] - Called variants relevant to clinical phenotypes -- [VariantFiltering][] _GRanges_ Algebra - Intra-range methods - Independent of other ranges in the same object - GRanges variants strand-aware - `shift()`, `narrow()`, `flank()`, `promoters()`, `resize()`, `restrict()`, `trim()` - See `?"intra-range-methods"` - Inter-range methods - Depends on other ranges in the same object - `range()`, `reduce()`, `gaps()`, `disjoin()` - `coverage()` (!) - see `?"inter-range-methods"` - Between-range methods - Functions of two (or more) range objects - `findOverlaps()`, `countOverlaps()`, ..., `%over%`, `%within%`, `%outside%`; `union()`, `intersect()`, `setdiff()`, `punion()`, `pintersect()`, `psetdiff()` ![Alt Ranges Algebra](our_figures/RangeOperations.png) ## _Biostrings_ (DNA or amino acid sequences) Classes - XString, XStringSet, e.g., DNAString (genomes), DNAStringSet (reads) Methods -- - [Cheat sheat](http://bioconductor.org/packages/release/bioc/vignettes/Biostrings/inst/doc/BiostringsQuickOverview.pdf) - Manipulation, e.g., `reverseComplement()` - Summary, e.g., `letterFrequency()` - Matching, e.g., `matchPDict()`, `matchPWM()` Related packages - [BSgenome][] - Whole-genome representations - Model and custom - [ShortRead][] - FASTQ files Example - Whole-genome sequences are distrubuted by ENSEMBL, NCBI, and others as FASTA files; model organism whole genome sequences are packaged into more user-friendly `BSgenome` packages. The following calculates GC content across chr14. ```{r BSgenome-require, message=FALSE} library(BSgenome.Hsapiens.UCSC.hg19) chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"])) chr14_dna <- getSeq(Hsapiens, chr14_range) letterFrequency(chr14_dna, "GC", as.prob=TRUE) ``` ## _GenomicAlignments_ (Aligned reads) Classes -- GenomicRanges-like behaivor - GAlignments, GAlignmentPairs, GAlignmentsList Methods - `readGAlignments()`, `readGAlignmentsList()` - Easy to restrict input, iterate in chunks - `summarizeOverlaps()` Example - Find reads supporting the junction identified above, at position 19653707 + 66M = 19653773 of chromosome 14 ```{r bam-require} library(GenomicRanges) library(GenomicAlignments) library(Rsamtools) ## our 'region of interest' roi <- GRanges("chr14", IRanges(19653773, width=1)) ## sample data library('RNAseqData.HNRNPC.bam.chr14') bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE) ## alignments, junctions, overlapping our roi paln <- readGAlignmentsList(bf) j <- summarizeJunctions(paln, with.revmap=TRUE) j_overlap <- j[j %over% roi] ## supporting reads paln[j_overlap$revmap[[1]]] ``` ## _VariantAnnotation_ (Called variants) Classes -- GenomicRanges-like behavior - VCF -- 'wide' - VRanges -- 'tall' Functions and methods - I/O and filtering: `readVcf()`, `readGeno()`, `readInfo()`, `readGT()`, `writeVcf()`, `filterVcf()` - Annotation: `locateVariants()` (variants overlapping ranges), `predictCoding()`, `summarizeVariants()` - SNPs: `genotypeToSnpMatrix()`, `snpSummary()` Example - Read variants from a VCF file, and annotate with respect to a known gene model ```{r vcf, message=FALSE} ## input variants library(VariantAnnotation) fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") seqlevels(vcf) <- "chr22" ## known gene model library(TxDb.Hsapiens.UCSC.hg19.knownGene) coding <- locateVariants(rowRanges(vcf), TxDb.Hsapiens.UCSC.hg19.knownGene, CodingVariants()) head(coding) ``` Related packages - [ensemblVEP][] - Forward variants to Ensembl Variant Effect Predictor - [VariantTools][], [h5vc][] - Call variants Reference - Obenchain, V, Lawrence, M, Carey, V, Gogarten, S, Shannon, P, and Morgan, M. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics, first published online March 28, 2014 [doi:10.1093/bioinformatics/btu168](http://bioinformatics.oxfordjournals.org/content/early/2014/04/21/bioinformatics.btu168) ## _rtracklayer_ (Genome annotations) - `import()`: BED, GTF, WIG, 2bit, etc - `export()`: GRanges to BED, GTF, WIG, ... - Access UCSC genome browser ## _SummarizedExperiment_ - Integrate experimental data with sample, feature, and experiment-wide annotations - Matrix where rows are indexed by genomic ranges, columns by a DataFrame. ![Alt SummarizedExperiment](our_figures/SE_Description.png) Functions and methods - Accessors: `assay()` / `assays()`, `rowData()` / `rowRanges()`, `colData()`, `metadata()` - Range-based operations, especially `subsetByOverlaps()` # Input & representation of standard file formats ## BAM files of aligned reads -- `GenomicAlignments` Recall: overall workflow 1. Experimental design 2. Wet-lab preparation 3. High-throughput sequencing 4. Alignment - Whole genome, vs. transcriptome 5. Summary 6. Statistical analysis 7. Comprehension BAM files of aligned reads - Header @HD VN:1.0 SO:coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 ... @SQ SN:chrY LN:59373566 @PG ID:TopHat VN:2.0.8b CL:/home/hpages/tophat-2.0.8b.Linux_x86_64/tophat --mate-inner-dist 150 --solexa-quals --max-multihits 5 --no-discordant --no-mixed --coverage-search --microexon-search --library-type fr-unstranded --num-threads 2 --output-dir tophat2_out/ERR127306 /home/hpages/bowtie2-2.1.0/indexes/hg19 fastq/ERR127306_1.fastq fastq/ERR127306_2.fastq - Alignments - ID, flag, alignment and mate ERR127306.7941162 403 chr14 19653689 3 72M = 19652348 -1413 ... ERR127306.22648137 145 chr14 19653692 1 72M = 19650044 -3720 ... - Sequence and quality ... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%)) ... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)**** - Tags ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:2 CC:Z:chr22 CP:i:16189276 HI:i:0 ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:3 CC:Z:= CP:i:19921600 HI:i:0 - Typically, sorted (by position) and indexed ('.bai' files) [GenomicAlignments][] - Use an example BAM file (`fl` could be the path to your own BAM file) ```{r genomicalignments} ## example BAM data library(RNAseqData.HNRNPC.bam.chr14) ## one BAM file fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1] ## Let R know that this is a BAM file, not just a character vector library(Rsamtools) bfl <- BamFile(fl) ``` - Input the data into R ```{r readgalignments} aln <- readGAlignments(bfl) aln ``` - `readGAlignmentPairs()` / `readGAlignmentsList()` if paired-end data - Lots of things to do, including all the _GRanges_ / _GRangesList_ operations ```{r galignments-methods} methods(class=class(aln)) ``` - **Caveat emptor**: BAM files are large. Normally you will _restrict_ the input to particular genomic ranges, or _iterate_ through the BAM file. Key _Bioconductor_ functions (e.g., `GenomicAlignments::summarizeOverlaps()` do this data management step for you. See next section! ## Other formats and packages ![Alt Files and the Bioconductor packages that input them](our_figures/FilesToPackages.png) # Resources Acknowledgements The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation. ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ``` [AnnotationDbi]: http://bioconductor.org/packages/AnnotationDbi [BSgenome]: http://bioconductor.org/packages/BSgenome [BiocParallel]: http://bioconductor.org/packages/BiocParallel [Biostrings]: http://bioconductor.org/packages/Biostrings [CNTools]: http://bioconductor.org/packages/CNTools [ChIPQC]: http://bioconductor.org/packages/ChIPQC [ChIPpeakAnno]: http://bioconductor.org/packages/ChIPpeakAnno [DESeq2]: http://bioconductor.org/packages/DESeq2 [DiffBind]: http://bioconductor.org/packages/DiffBind [GenomicAlignments]: http://bioconductor.org/packages/GenomicAlignments [GenomicRanges]: http://bioconductor.org/packages/GenomicRanges [IRanges]: http://bioconductor.org/packages/IRanges [KEGGREST]: http://bioconductor.org/packages/KEGGREST [PSICQUIC]: http://bioconductor.org/packages/PSICQUIC [rtracklayer]: http://bioconductor.org/packages/rtracklayer [Rsamtools]: http://bioconductor.org/packages/Rsamtools [ShortRead]: http://bioconductor.org/packages/ShortRead [VariantAnnotation]: http://bioconductor.org/packages/VariantAnnotation [VariantFiltering]: http://bioconductor.org/packages/VariantFiltering [VariantTools]: http://bioconductor.org/packages/VariantTools [biomaRt]: http://bioconductor.org/packages/biomaRt [cn.mops]: http://bioconductor.org/packages/cn.mops [h5vc]: http://bioconductor.org/packages/h5vc [edgeR]: http://bioconductor.org/packages/edgeR [ensemblVEP]: http://bioconductor.org/packages/ensemblVEP [limma]: http://bioconductor.org/packages/limma [metagenomeSeq]: http://bioconductor.org/packages/metagenomeSeq [phyloseq]: http://bioconductor.org/packages/phyloseq [snpStats]: http://bioconductor.org/packages/snpStats [org.Hs.eg.db]: http://bioconductor.org/packages/org.Hs.eg.db [TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [BSgenome.Hsapiens.UCSC.hg19]: http://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19