22 July | CSAMA 2019

Introduction

Started 2002 as a platform for understanding analysis of microarray data

More than 1,741 packages. Domains of expertise:

  • Sequencing (RNASeq, ChIPSeq, single-cell, called variants, …)
  • Microarrays (methylation, expression, copy number, …)
  • flow cytometry
  • proteomics

Important themes

  • Reproducible research
  • Interoperability between packages & work kflows
  • Usability

Resources

High-throughput sequence work flow

Genomic Sequences

Genomic Sequences

Biostrings

  • Valid data, e.g., alphabet
  • 'Vector' interface: length(), [, …
  • Specialized operations, e.g,. reverseComplement()

Classes

  • BString / BStringSet
  • DNAString / DNAStringSet
  • RNAString / RNAStringSet
  • AAString / AAStringSet

DNA Sequence Example

library(Biostrings)

dna <- DNAStringSet(c("AAACTG", "CCCTTCAAC", "TACGAA"))
dna
##   A DNAStringSet instance of length 3
##     width seq
## [1]     6 AAACTG
## [2]     9 CCCTTCAAC
## [3]     6 TACGAA
length(dna)
## [1] 3
dna[c(1, 3, 1)]
##   A DNAStringSet instance of length 3
##     width seq
## [1]     6 AAACTG
## [2]     6 TACGAA
## [3]     6 AAACTG

DNA Sequence Example cont.

width(dna)
## [1] 6 9 6
as.character(dna[c(1,2)])
## [1] "AAACTG"    "CCCTTCAAC"
reverseComplement(dna)
##   A DNAStringSet instance of length 3
##     width seq
## [1]     6 CAGTTT
## [2]     9 GTTGAAGGG
## [3]     6 TTCGTA

Importing Genomic Sequence

Import Methods from FASTA/FASTQ

  • readBStringSet / readDNAStringSet / readRNAStringSet / readAAStringSet
filepath1 <- system.file("extdata", "someORF.fa", package="Biostrings")
x1 <- readDNAStringSet(filepath1)
x1
##   A DNAStringSet instance of length 7
##     width seq                                          names               
## [1]  5573 ACTTGTAAATATATCTTTTAT...ATCGACCTTATTGTTGATAT YAL001C TFC3 SGDI...
## [2]  5825 TTCCAAGGCCGATGAATTCGA...AAATTTTTTTCTATTCTCTT YAL002W VPS8 SGDI...
## [3]  2987 CTTCATGTCAGCCTGCACTTC...TACTCATGTAGCTGCCTCAT YAL003W EFB1 SGDI...
## [4]  3929 CACTCATATCGGGGGTCTTAC...CCCGAAACACGAAAAAGTAC YAL005C SSA1 SGDI...
## [5]  2648 AGAGAAAGAGTTTCACTTCTT...TAATTTATGTGTGAACATAG YAL007C ERP2 SGDI...
## [6]  2597 GTGTCCGGGCCTCGCAGGCGT...TTTTGGCAGAATGTACTTTT YAL008W FUN14 SGD...
## [7]  2780 CAAGATAATGTCAAAGTTAGT...AAGGAAGAAAAAAAAATCAC YAL009W SPO7 SGDI...

Importing Genomic Sequences

Utilizing BSgenome Packages

BSgenome packages contain sequence information for a given species/build. There are many such packages - you can get a listing using BiocManager::available("BSgenome")

library(BSgenome)
head(BiocManager::available("BSgenome"))
## [1] "BSgenome"                               
## [2] "BSgenome.Alyrata.JGI.v1"                
## [3] "BSgenome.Amellifera.BeeBase.assembly4"  
## [4] "BSgenome.Amellifera.UCSC.apiMel2"       
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [6] "BSgenome.Aofficinalis.NCBI.V1"

BSgenome

We can load and inspect a BSgenome package

library(BSgenome.Hsapiens.UCSC.hg19)
BSgenome.Hsapiens.UCSC.hg19
## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
## #   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
## #   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
## #   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
## #   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

Subsetting a BSgenome

The main accessor is getSeq, and you can get data by sequence (e.g., entire chromosome or unplaced scaffold), or by passing in a GRanges object, to get just a region.

getSeq(BSgenome.Hsapiens.UCSC.hg19, "chr1")
##   249250621-letter "DNAString" instance
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
getSeq(BSgenome.Hsapiens.UCSC.hg19, GRanges("chr6:35310335-35395968"))
##   A DNAStringSet instance of length 1
##     width seq
## [1] 85634 GCGGAGCGTGTGACGCTGCGGCCGCCGCGGA...TGGGAATTAAATATTTAAGAGCTGACTGGAA

What is a GRanges object?

giphy.com

Genomic Ranges

GenomicRanges

GenomicRanges objects allow for easy selection and subsection of data based on genomic position information.

Where are GenomicRanges used?

Everywhere…

  • Data (e.g., aligned reads, called peaks, copy number)
  • Annotations (e.g., genes, exons, transcripts, TxDb)
  • Close relation to BED files (see rtracklayer::import.bed() and HelloRanges)
  • Anywhere there is Genomic positioning information



Utility

  • Also vector interface – length(), [, etc.
  • Tidyverse: plyranges

GenomicRanges

library(GenomicRanges)
gr <- GRanges(c("chr1:100-120", "chr1:115-130"))
gr
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1   100-120      *
##   [2]     chr1   115-130      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr <- GRanges(c("chr1:100-120", "chr1:115-130", "chr2:200-220"),
strand=c("+","+","-"), GC = seq(1, 0, length=3), id = paste0("id",1:3))
gr
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames    ranges strand |        GC          id
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr1   100-120      + |         1         id1
##   [2]     chr1   115-130      + |       0.5         id2
##   [3]     chr2   200-220      - |         0         id3
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

There are lots of utility functions for GRange objects

Help!

methods(class="GRanges")
methods(class="GRangesList")
?"GRanges"
?"GRangesList"

GRanges functions

Intra-range operations

  • e.g., shift(), narrow(), flank()

Inter-range operations

  • e.g., reduce(), coverage(), gaps(), disjoin()

Between-range operations

  • e.g., countOverlaps(), findOverlaps(), summarizeOverlaps()

GRanges Example

shift(gr, 1)
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames    ranges strand |        GC          id
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr1   101-121      + |         1         id1
##   [2]     chr1   116-131      + |       0.5         id2
##   [3]     chr2   201-221      - |         0         id3
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
reduce(gr)
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1   100-130      +
##   [2]     chr2   200-220      -
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
anno <- GRanges(c("chr1:110-150", "chr2:150-210"))
countOverlaps(anno, gr)
## [1] 2 1

Lists of Genomic Ranges

- e.g., exons-within-transcripts, alignments-within-reads

More examples

Returning to BSGenome: Get the sequences for three UTR regions? threeUTRsByTranscript() returns a GRangesList

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

threeUTR <-  threeUTRsByTranscript(txdb, use.names=TRUE)
threeUTR_seq <- extractTranscriptSeqs(Hsapiens, threeUTR)
options(showHeadLines = 3, showTailLines = 2)
threeUTR_seq
##   A DNAStringSet instance of length 60740
##         width seq                                      names               
##     [1]   770 TGCCCGTTGGAGAAAACAG...GCACACTGTTGGTTTCTG uc010nxq.1
##     [2]  1333 CTGTGAGGCCATTTCCAGG...CCTCCCACGCGGACAGAG uc009vjk.2
##     [3]  2976 CTGTGAGGCCATTTCCAGG...GAAAAATATCGCCCACGA uc001aau.3
##     ...   ... ...
## [60739]  1191 GCACCCTGACCCTATTCAG...GGAATTTGTATTTAATAA uc011mgh.2
## [60740]  1191 GCACCCTGACCCTATTCAG...GGAATTTGTATTTAATAA uc011mgi.2

More examples

How many genes are between 2858473 and 3271812 on chr2 in the hg19 genome?

gns <- genes(txdb)
gns[gns %over%  GRanges("chr2:2858473-3271812")]
## GRanges object with 1 range and 1 metadata column:
##        seqnames          ranges strand |     gene_id
##           <Rle>       <IRanges>  <Rle> | <character>
##   7260     chr2 3192741-3381653      - |        7260
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## OR

subsetByOverlaps(gns, GRanges("chr2:2858473-3271812"))
## GRanges object with 1 range and 1 metadata column:
##        seqnames          ranges strand |     gene_id
##           <Rle>       <IRanges>  <Rle> | <character>
##   7260     chr2 3192741-3381653      - |        7260
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Importing a BED file

We said earlier, GRanges are closely related to bed files. Lets look at the example in the rtracklayer::import.bed help page:

library(rtracklayer)
test_bed <- system.file(package = "rtracklayer", "tests", "test.bed")
test <- import(test_bed)
test
## UCSC track 'ItemRGBDemo'
## UCSCData object with 5 ranges and 5 metadata columns:
##       seqnames              ranges strand |        name     score
##          <Rle>           <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr7 127471197-127472363      + |        Pos1         0
##   [2]     chr7 127472364-127473530      + |        Pos2         2
##   [3]     chr7 127473531-127474697      - |        Neg1         0
##   [4]     chr9 127474698-127475864      + |        Pos3         5
##   [5]     chr9 127475865-127477031      - |        Neg2         5
##           itemRgb               thick                  blocks
##       <character>           <IRanges>           <IRangesList>
##   [1]     #FF0000 127471197-127472363 1-300,501-700,1068-1167
##   [2]     #FF0000 127472364-127473530          1-250,668-1167
##   [3]     #FF0000 127473531-127474697                  1-1167
##   [4]     #FF0000 127474698-127475864                  1-1167
##   [5]     #0000FF 127475865-127477031                  1-1167
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Bed file continued

In fact this class Extends the GenomicRange GRange class

is(test, "GRanges")
## [1] TRUE

So you can use GRange functions

subsetByOverlaps(test, GRanges("chr7:127471197-127472368"))
## UCSC track 'ItemRGBDemo'
## UCSCData object with 2 ranges and 5 metadata columns:
##       seqnames              ranges strand |        name     score
##          <Rle>           <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr7 127471197-127472363      + |        Pos1         0
##   [2]     chr7 127472364-127473530      + |        Pos2         2
##           itemRgb               thick                  blocks
##       <character>           <IRanges>           <IRangesList>
##   [1]     #FF0000 127471197-127472363 1-300,501-700,1068-1167
##   [2]     #FF0000 127472364-127473530          1-250,668-1167
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Side Note:

Utilizing Bioconductor recommended import/export methods, classes, etc. has other benefits as well…

BED files have 0-based half-open intervals (left end point included, right endpoint 'after' the end of the range),

whereas in other parts of the bioinformatic community and in bioc the coordinates are 1-based and closed

Using import() converts BED coordinates into bioc coordinates.

Working with BAM files

library(GenomicAlignments)
fls <- list.files(system.file("extdata", package="GenomicAlignments"),
                  recursive=TRUE, pattern="*bam$", full=TRUE)
names(fls) <- basename(fls)
bf <- BamFileList(fls, index=character(), yieldSize=1000)
genes <- GRanges(
    seqnames = rep(c("chr2L", "chr2R", "chr3L"), c(4, 5, 2)),
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 
                       4000, 7500, 5000, 5400), 
                     width=c(rep(500, 3), 600, 900, 500, 300, 900, 
                             300, 500, 500))) 
se <- summarizeOverlaps(genes, bf)
se
## class: RangedSummarizedExperiment 
## dim: 11 2 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames(2): sm_treated1.bam sm_untreated1.bam
## colData names(0):
# Start differential expression analysis with the DESeq2 or edgeR

What is a RangedSummarizedExperiment?

giphy.com

SummarizedExperiments

SummarizedExperiment objects are popular objects for representing expression data and other rectangular data (feature x sample data). Incoming packages are now strongly recommended to use this class representation instead of ExpressionSet.

three components: underlying 'matrix' 'row' annotations (genomic features) 'column' annotations (sample descriptions)

ExpressionSet

giphy.com

Summarized Experiments

Components 1.

Main matrix of values / count data / expression values / etc …

counts <- as.matrix(read.csv("lecture-02-data/airway_counts.csv", row.names=1))
head(counts, 3)
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003       1047        770        572
## ENSG00000000419        799        417        508
## ENSG00000000457        331        233        229

Component 2.

Sample data / Phenotypic data / Sample specific information / etc …

colData <- read.csv("lecture-02-data/airway_colData.csv", row.names=1)
colData[, 1:4]
##            SampleName    cell   dex albut
## SRR1039508 GSM1275862  N61311 untrt untrt
## SRR1039509 GSM1275863  N61311   trt untrt
## SRR1039512 GSM1275866 N052611 untrt untrt
## SRR1039513 GSM1275867 N052611   trt untrt
## SRR1039516 GSM1275870 N080611 untrt untrt
## SRR1039517 GSM1275871 N080611   trt untrt
## SRR1039520 GSM1275874 N061011 untrt untrt
## SRR1039521 GSM1275875 N061011   trt untrt

Component 3.

Genomic position information / Information about features / etc …

rowRanges <- readRDS("lecture-02-data/airway_rowRanges.rds")
options(showHeadLines = 3, showTailLines = 2)
rowRanges
## GRangesList object of length 33469:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        X 99883667-99884983      - |    667145 ENSE00001459322
##    [2]        X 99885756-99885863      - |    667146 ENSE00000868868
##    [3]        X 99887482-99887565      - |    667147 ENSE00000401072
##    ...      ...               ...    ... .       ...             ...
##   [16]        X 99891790-99892101      - |    667160 ENSE00001863395
##   [17]        X 99894942-99894988      - |    667161 ENSE00001828996
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome
## 
## ...
## <33468 more elements>

Benefit of Containers

Could manipulate independently…

cidx <- colData$dex == "trt"
plot(
    rowMeans(1 + counts[, cidx]) ~ rowMeans(1 + counts[, !cidx]),
    log="xy"
)

  • very fragile, e.g., what if colData rows had been re-ordered?

SummarizedExperiment

Solution: coordinate in a single object – SummarizedExperiment

library(SummarizedExperiment)

se <- SummarizedExperiment(counts, rowRanges = rowRanges, colData = colData)
cidx <- se$dex == "trt"
plot(
    rowMeans(1 + assay(se)[, cidx]) ~ rowMeans(1 + assay(se)[, !cidx]),
    log="xy"
)

Benefits

  • Much more robust to 'bookkeeping' errors
  • matrix-like interface: dim(), two-dimensional [, …
  • accessors: assay(), rowData() / rowRanges(), colData(), …
dim(se)
## [1] 33469     8
colnames(se[1:2,3:4])
## [1] "SRR1039512" "SRR1039513"
names(colData)
## [1] "SampleName" "cell"       "dex"        "albut"      "Run"       
## [6] "avgLength"  "Experiment" "Sample"     "BioSample"

Popular

Help