Contents

Version: 0.1.1
Compiled: Sat Oct 24 10:12:10 2015

Objectives

Time Topic
09:15 - 10:15 Sequencing work flows and file types
10:15 Tea/Coffee break
10:30 - 12:30 Introduction to R and Bioconductor
12:30 Lunch
13:30 -14:00 Scalable computing

1 Sequencing work flows

  1. Experimental design
    • Keep it simple, e.g., ‘control’ and ‘treatment’ groups
    • Replicate within treatments!
  2. Wet-lab sequence preparation (figure from http://rnaseq.uoregon.edu/)

    • Record covariates, including processing day – likely ‘batch effects’
  3. (Illumina) Sequencing (Bentley et al., 2008, doi:10.1038/nature07517

  4. Alignment
    • Choose to match task, e.g., Rsubread, Bowtie2 good for ChIPseq, some forms of RNAseq; BWA, GMAP better for variant calling
    • Primary output: BAM files of aligned reads
  5. Reduction
    • e.g., RNASeq ‘count table’ (simple spreadsheets), DNASeq called variants (VCF files), ChIPSeq peaks (BED, WIG files)
  6. Analysis
    • Differential expression, peak identification, …
  7. Comprehension
    • Biological context

Alt Sequencing Ecosystem

2 Sequence data representations

2.1 DNA / amino acid sequences: FASTA files

Input & manipulation: Biostrings

>NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736
gttggtggcccaccagtgccaaaatacacaagaagaagaaacagcatctt
gacactaaaatgcaaaaattgctttgcgtcaatgactcaaaacgaaaatg
...
atgggtatcaagttgccccgtataaaaggcaagtttaccggttgcacggt
>NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454
ttatttatgtaggcgcccgttcccgcagccaaagcactcagaattccggg
cgtgtagcgcaacgaccatctacaaggcaatattttgatcgcttgttagg
...

Whole genomes: 2bit and .fa formats: rtracklayer, Rsamtools; BSgenome

2.2 Reads: FASTQ files

Input & manipulation: ShortRead readFastq(), FastqStreamer(), FastqSampler()

@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
+
HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
@ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
+
DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################
    

2.3 Aligned reads: BAM files (e.g., ERR127306_chr14.bam)

Input & manipulation: ‘low-level’ Rsamtools, scanBam(), BamFile(); ‘high-level’ GenomicAlignments

2.4 Called variants: VCF files

Input and manipulation: VariantAnnotation readVcf(), readInfo(), readGeno() selectively with ScanVcfParam().

2.5 Genome annotations: BED, WIG, GTF, etc. files

Input: rtracklayer import()

3 R

Language and environment for statistical computing and graphics

Vector, class, object

Function, generic, method

Introspection

Help

Example

x <- rnorm(1000)                   # atomic vectors
y <- x + rnorm(1000, sd=.5)
df <- data.frame(x=x, y=y)         # object of class 'data.frame'
plot(y ~ x, df)                    # generic plot, method plot.formula

fit <- lm(y ~x, df)                # object of class 'lm'
methods(class=class(fit))          # introspection
##  [1] add1           alias          anova          case.names     coerce         confint       
##  [7] cooks.distance deviance       dfbeta         dfbetas        drop1          dummy.coef    
## [13] effects        extractAIC     family         formula        hatvalues      influence     
## [19] initialize     kappa          labels         logLik         model.frame    model.matrix  
## [25] nobs           plot           predict        print          proj           qr            
## [31] residuals      rstandard      rstudent       show           simulate       slotsFromS3   
## [37] summary        variable.names vcov          
## see '?methods' for accessing help and source code

4 Bioconductor

4.1 Overview

Analysis and comprehension of high-throughput genomic data

Packages, vignettes, work flows

Objects

Example

require(Biostrings)                     # Biological sequences
data(phiX174Phage)                      # sample data, see ?phiX174Phage
phiX174Phage
##   A DNAStringSet instance of length 6
##     width seq                                                                   names               
## [1]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA Genbank
## [2]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA RF70s
## [3]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA SS78
## [4]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA Bull
## [5]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA G97
## [6]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA NEB03
m <- consensusMatrix(phiX174Phage)[1:4,] # nucl. x position counts
polymorphic <- which(colSums(m != 0) > 1)
m[, polymorphic]
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## A    4    5    4    3    0    0    5    2    0
## C    0    0    0    0    5    1    0    0    5
## G    2    1    2    3    0    0    1    4    0
## T    0    0    0    0    1    5    0    0    1
methods(class=class(phiX174Phage))
##   [1] !                               !=                             
##   [3] [                               [[                             
##   [5] [[<-                            [<-                            
##   [7] %in%                            <                              
##   [9] <=                              ==                             
##  [11] >                               >=                             
##  [13] $                               $<-                            
##  [15] aggregate                       alphabetFrequency              
##  [17] anyNA                           append                         
##  [19] as.character                    as.complex                     
##  [21] as.data.frame                   as.env                         
##  [23] as.integer                      as.list                        
##  [25] as.logical                      as.matrix                      
##  [27] as.numeric                      as.raw                         
##  [29] as.vector                       c                              
##  [31] chartr                          coerce                         
##  [33] compact                         compare                        
##  [35] compareStrings                  complement                     
##  [37] consensusMatrix                 consensusString                
##  [39] countOverlaps                   countPattern                   
##  [41] countPDict                      dinucleotideFrequencyTest      
##  [43] do.call                         droplevels                     
##  [45] duplicated                      elementLengths                 
##  [47] elementMetadata                 elementMetadata<-              
##  [49] elementType                     endoapply                      
##  [51] eval                            expand                         
##  [53] extractAt                       extractROWS                    
##  [55] Filter                          Find                           
##  [57] findOverlaps                    hasOnlyBaseLetters             
##  [59] head                            high2low                       
##  [61] ifelse                          intersect                      
##  [63] is.na                           is.unsorted                    
##  [65] isEmpty                         isMatchingEndingAt             
##  [67] isMatchingStartingAt            lapply                         
##  [69] length                          lengths                        
##  [71] letterFrequency                 Map                            
##  [73] match                           matchPattern                   
##  [75] matchPDict                      mcols                          
##  [77] mcols<-                         mendoapply                     
##  [79] metadata                        metadata<-                     
##  [81] mstack                          names                          
##  [83] names<-                         narrow                         
##  [85] nchar                           neditEndingAt                  
##  [87] neditStartingAt                 NROW                           
##  [89] nucleotideFrequencyAt           oligonucleotideFrequency       
##  [91] order                           overlapsAny                    
##  [93] PairwiseAlignments              PairwiseAlignmentsSingleSubject
##  [95] parallelSlotNames               PDict                          
##  [97] Position                        PWM                            
##  [99] rank                            Reduce                         
## [101] relist                          relistToClass                  
## [103] rename                          rep                            
## [105] rep.int                         replaceAt                      
## [107] replaceLetterAt                 replaceROWS                    
## [109] rev                             revElements                    
## [111] reverse                         reverseComplement              
## [113] ROWNAMES                        sapply                         
## [115] seqinfo                         seqinfo<-                      
## [117] seqlevelsInUse                  seqtype                        
## [119] seqtype<-                       setdiff                        
## [121] setequal                        shiftApply                     
## [123] show                            showAsCell                     
## [125] sort                            split                          
## [127] split<-                         splitAsList                    
## [129] stack                           stringDist                     
## [131] subseq                          subseq<-                       
## [133] subset                          subsetByOverlaps               
## [135] table                           tail                           
## [137] tapply                          threebands                     
## [139] toString                        translate                      
## [141] trimLRPatterns                  twoWayAlphabetFrequency        
## [143] union                           unique                         
## [145] uniqueLetters                   unlist                         
## [147] unsplit                         unstrsplit                     
## [149] updateObject                    values                         
## [151] values<-                        vcountPattern                  
## [153] vcountPDict                     vmatchPattern                  
## [155] vwhichPDict                     which.isMatchingEndingAt       
## [157] which.isMatchingStartingAt      whichPDict                     
## [159] width                           window                         
## [161] window<-                        with                           
## [163] within                          xtfrm                          
## [165] xvcopy                         
## see '?methods' for accessing help and source code
selectMethod(reverseComplement, class(phiX174Phage))
## Method Definition:
## 
## function (x, ...) 
## xvcopy(x, lkup = getDNAComplementLookup(), reverse = TRUE)
## <environment: namespace:Biostrings>
## 
## Signatures:
##         x             
## target  "DNAStringSet"
## defined "DNAStringSet"

Alt Sequencing Ecosystem

4.2 A sequence analysis package tour

This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; many of the material will be covered in greater detail in subsequent labs and lectures.

Basics

Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.

Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.

Visualization

4.3 DNA or amino acid sequences: Biostrings, ShortRead, BSgenome

Classes

Methods –

Related packages

Example

  require(BSgenome.Hsapiens.UCSC.hg19)
  chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"]))
  chr14_dna <- getSeq(Hsapiens, chr14_range)
  letterFrequency(chr14_dna, "GC", as.prob=TRUE)
##           G|C
## [1,] 0.336276

4.4 Ranges: GenomicRanges, IRanges

Ranges represent: - Data, e.g., aligned reads, ChIP peaks, SNPs, CpG islands, … - Annotations, e.g., gene models, regulatory elements, methylated regions - Ranges are defined by chromosome, start, end, and strand - Often, metadata is associated with each range, e.g., quality of alignment, strength of ChIP peak

Many common biological questions are range-based - What reads overlap genes? - What genes are ChIP peaks nearest? - …

The GenomicRanges package defines essential classes and methods

Alt

Alt

4.4.1 Range operations

Alt Ranges Algebra

Ranges - IRanges - start() / end() / width() - List-like – length(), subset, etc. - ‘metadata’, mcols() - GRanges - ‘seqnames’ (chromosome), ‘strand’ - Seqinfo, including seqlevels and seqlengths

Intra-range methods - Independent of other ranges in the same object - GRanges variants strand-aware - shift(), narrow(), flank(), promoters(), resize(), restrict(), trim() - See ?"intra-range-methods"

Inter-range methods - Depends on other ranges in the same object - range(), reduce(), gaps(), disjoin() - coverage() (!) - see ?"inter-range-methods"

Between-range methods - Functions of two (or more) range objects - findOverlaps(), countOverlaps(), …, %over%, %within%, %outside%; union(), intersect(), setdiff(), punion(), pintersect(), psetdiff()

Example

require(GenomicRanges)
gr <- GRanges("A", IRanges(c(10, 20, 22), width=5), "+")
shift(gr, 1)                            # 1-based coordinates!
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [11, 15]      +
##   [2]        A  [21, 25]      +
##   [3]        A  [23, 27]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
range(gr)                               # intra-range
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [10, 26]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
reduce(gr)                              # inter-range
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [10, 14]      +
##   [2]        A  [20, 26]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
coverage(gr)
## RleList of length 1
## $A
## integer-Rle of length 26 with 6 runs
##   Lengths: 9 5 5 2 3 2
##   Values : 0 1 0 1 2 1
setdiff(range(gr), gr)                  # 'introns'
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [15, 19]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

IRangesList, GRangesList - List: all elements of the same type - Many *List-aware methods, but a common ‘trick’: apply a vectorized function to the unlisted representaion, then re-list

    grl <- GRangesList(...)
    orig_gr <- unlist(grl)
    transformed_gr <- FUN(orig)
    transformed_grl <- relist(, grl)
    

Reference

  • Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118

4.5 Aligned reads: GenomicAlignments, Rsamtools

Classes – GenomicRanges-like behaivor

Methods

Example

require(GenomicRanges)
require(GenomicAlignments)
## Loading required package: GenomicAlignments
## Loading required package: Rsamtools
require(Rsamtools)

## our 'region of interest'
roi <- GRanges("chr14", IRanges(19653773, width=1)) 
## sample data
require('RNAseqData.HNRNPC.bam.chr14')
## Loading required package: RNAseqData.HNRNPC.bam.chr14
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)
## alignments, junctions, overlapping our roi
paln <- readGAlignmentsList(bf)
j <- summarizeJunctions(paln, with.revmap=TRUE)
j_overlap <- j[j %over% roi]

## supporting reads
paln[j_overlap$revmap[[1]]]
## GAlignmentsList object of length 8:
## [[1]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand      cigar qwidth    start      end width njunc
##   [1]    chr14      -  66M120N6M     72 19653707 19653898   192     1
##   [2]    chr14      + 7M1270N65M     72 19652348 19653689  1342     1
## 
## [[2]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand     cigar qwidth    start      end width njunc
##   [1]    chr14      - 66M120N6M     72 19653707 19653898   192     1
##   [2]    chr14      +       72M     72 19653686 19653757    72     0
## 
## [[3]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand     cigar qwidth    start      end width njunc
##   [1]    chr14      +       72M     72 19653675 19653746    72     0
##   [2]    chr14      - 65M120N7M     72 19653708 19653899   192     1
## 
## ...
## <5 more elements>
## -------
## seqinfo: 93 sequences from an unspecified genome

4.6 Called variants: VariantAnnotation, VariantFiltering

Classes – GenomicRanges-like behavior

Functions and methods

Example

  ## input variants
  require(VariantAnnotation)
  fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
  vcf <- readVcf(fl, "hg19")
  seqlevels(vcf) <- "chr22"
  ## known gene model
  require(TxDb.Hsapiens.UCSC.hg19.knownGene)
  coding <- locateVariants(rowRanges(vcf),
      TxDb.Hsapiens.UCSC.hg19.knownGene,
      CodingVariants())
  head(coding)
## GRanges object with 6 ranges and 9 metadata columns:
##     seqnames               ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID
##        <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character>
##   1    chr22 [50301422, 50301422]      - |   coding       939       939        24       75253
##   2    chr22 [50301476, 50301476]      - |   coding       885       885        25       75253
##   3    chr22 [50301488, 50301488]      - |   coding       873       873        26       75253
##   4    chr22 [50301494, 50301494]      - |   coding       867       867        27       75253
##   5    chr22 [50301584, 50301584]      - |   coding       777       777        28       75253
##   6    chr22 [50302962, 50302962]      - |   coding       698       698        57       75253
##             CDSID      GENEID       PRECEDEID        FOLLOWID
##     <IntegerList> <character> <CharacterList> <CharacterList>
##   1        218562       79087                                
##   2        218562       79087                                
##   3        218562       79087                                
##   4        218562       79087                                
##   5        218562       79087                                
##   6        218563       79087                                
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Related packages

Reference

4.7 Integrated data representations: SummarizedExperiment

SummarizedExperiment

4.8 Annotation: org, TxDb, AnnotationHub, biomaRt, …

4.9 Scalable computing

  1. Efficient R code
  1. Iteration
  1. Restriction
  1. Sampling
  1. Parallel evaluation

Parallel evaluation in Bioconductor

5 Resources

R / Bioconductor

Publications (General Bioconductor)

Other