Contents

1 Project Overview

1.1 About

Bioconductor: Analysis and comprehension of high-throughput genomic data

Packages, vignettes, work flows

Package installation and use

1.2 Key concepts

Goals

What a few lines of R has to say

x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X=x, Y=y)
plot(Y ~ X, df)
fit <- lm(Y ~ X, df)
anova(fit)
## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X           1 966.02  966.02  1039.9 < 2.2e-16 ***
## Residuals 998 927.12    0.93                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
abline(fit)

Classes and methods – “S3”

Bioconductor classes and methods – “S4”

1.3 High-throughput sequence analysis work flows

  1. Experimental design

  2. Wet-lab sequence preparation (figure from http://rnaseq.uoregon.edu/)

  3. (Illumina) Sequencing (Bentley et al., 2008, doi:10.1038/nature07517)

  4. Alignment
    • Choose to match task, e.g., Rsubread, Bowtie2 good for ChIPseq, some forms of RNAseq; BWA, GMAP better for variant calling
    • Primary output: BAM files of aligned reads
    • More recently: kallisto and similar programs that produce tables of reads aligned to transcripts
  5. Reduction
    • e.g., RNASeq ‘count table’ (simple spreadsheets), DNASeq called variants (VCF files), ChIPSeq peaks (BED, WIG files)
  6. Analysis
    • Differential expression, peak identification, …
  7. Comprehension
    • Biological context

1.4 Bioconductor sequencing ecosystem

Alt Sequencing Ecosystem

2 High-Throughput Sequence Data

2.1 DNA (and other) sequences

Biostrings

library(Biostrings)
data(phiX174Phage)
phiX174Phage
##   A DNAStringSet instance of length 6
##     width seq                                                                   names               
## [1]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA Genbank
## [2]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA RF70s
## [3]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA SS78
## [4]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA Bull
## [5]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA G97
## [6]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA NEB03
letterFrequency(phiX174Phage, c("A", "C", "G", "T"))
##         A    C    G    T
## [1,] 1291 1157 1254 1684
## [2,] 1292 1156 1253 1685
## [3,] 1292 1156 1253 1685
## [4,] 1292 1155 1253 1686
## [5,] 1292 1156 1253 1685
## [6,] 1292 1155 1253 1686
letterFrequency(phiX174Phage, "GC", as.prob=TRUE)
##            G|C
## [1,] 0.4476420
## [2,] 0.4472707
## [3,] 0.4472707
## [4,] 0.4470850
## [5,] 0.4472707
## [6,] 0.4470850

2.2 Genomic ranges

GenomicRanges

Operations

Alt GRanges operations

library(GenomicRanges)
gr <- GRanges("A", IRanges(c(10, 20, 22), width=5), "+")
shift(gr, 1)                            # intra-range
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [11, 15]      +
##   [2]        A  [21, 25]      +
##   [3]        A  [23, 27]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
range(gr)                               # inter-range
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [10, 26]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
reduce(gr)                              # inter-range
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [10, 14]      +
##   [2]        A  [20, 26]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
snps <- GRanges("A", IRanges(c(11, 17, 24), width=1))
findOverlaps(snps, gr)                  # between-range
## Hits object with 3 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1           1
##   [2]         3           2
##   [3]         3           3
##   -------
##   queryLength: 3 / subjectLength: 3
setdiff(range(gr), gr)                  # 'introns'
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [15, 19]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

2.3 Summarized experiments

SummarizedExperiment

Alt SummarizedExperiment

2.3.1 SummarizedExperiment exercise

The airway experiment data package summarizes an RNA-seq experiment investigating human smooth-muscle airway cell lines treated with dexamethasone. Load the library and data set.

library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

airway is an example of the SummarizedExperiment class. Explore its assay() (the matrix of counts of reads overlapping genomic regions of interest in each sample), colData() (a description of each sample), and rowRanges() (a description of each region of interest; here each region is an ENSEMBL gene).

x <- assay(airway)
class(x)
## [1] "matrix"
dim(x)
## [1] 64102     8
head(x)
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
## ENSG00000000003        679        448        873        408       1138       1047        770
## ENSG00000000005          0          0          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587        799        417
## ENSG00000000457        260        211        263        164        245        331        233
## ENSG00000000460         60         55         40         35         78         63         76
## ENSG00000000938          0          0          2          0          1          0          0
##                 SRR1039521
## ENSG00000000003        572
## ENSG00000000005          0
## ENSG00000000419        508
## ENSG00000000457        229
## ENSG00000000460         60
## ENSG00000000938          0
colData(airway)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength Experiment    Sample
##              <factor> <factor> <factor> <factor>   <factor> <integer>   <factor>  <factor>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126  SRX384345 SRS508568
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126  SRX384346 SRS508567
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126  SRX384349 SRS508571
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87  SRX384350 SRS508572
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120  SRX384353 SRS508575
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126  SRX384354 SRS508576
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101  SRX384357 SRS508579
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98  SRX384358 SRS508580
##               BioSample
##                <factor>
## SRR1039508 SAMN02422669
## SRR1039509 SAMN02422675
## SRR1039512 SAMN02422678
## SRR1039513 SAMN02422670
## SRR1039516 SAMN02422682
## SRR1039517 SAMN02422673
## SRR1039520 SAMN02422683
## SRR1039521 SAMN02422677
rowRanges(airway)
## GRangesList object of length 64102:
## $ENSG00000000003 
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames               ranges strand |   exon_id       exon_name
##           <Rle>            <IRanges>  <Rle> | <integer>     <character>
##    [1]        X [99883667, 99884983]      - |    667145 ENSE00001459322
##    [2]        X [99885756, 99885863]      - |    667146 ENSE00000868868
##    [3]        X [99887482, 99887565]      - |    667147 ENSE00000401072
##    [4]        X [99887538, 99887565]      - |    667148 ENSE00001849132
##    [5]        X [99888402, 99888536]      - |    667149 ENSE00003554016
##    ...      ...                  ...    ... .       ...             ...
##   [13]        X [99890555, 99890743]      - |    667156 ENSE00003512331
##   [14]        X [99891188, 99891686]      - |    667158 ENSE00001886883
##   [15]        X [99891605, 99891803]      - |    667159 ENSE00001855382
##   [16]        X [99891790, 99892101]      - |    667160 ENSE00001863395
##   [17]        X [99894942, 99894988]      - |    667161 ENSE00001828996
## 
## ...
## <64101 more elements>
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome

It’s easy to subset a SummarizedExperiment on rows, columns and assays, e.g., retaining just those samples in the trt level of the dex factor. Accessing elements of the column data is common, so there is a short-cut.

cidx <- colData(airway)$dex %in% "trt"
airway[, cidx]
## class: RangedSummarizedExperiment 
## dim: 64102 4 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
## shortcut
airway[, airway$dex %in% "trt"]
## class: RangedSummarizedExperiment 
## dim: 64102 4 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

It’s also easy to perform range-based operations on SummarizedExperiment objects, e.g., querying for range of chromosome 14 and then subsetting to contain only genes on this chromosome. Range operations on rows are very common, so there are shortcuts here, too.

chr14 <- as(seqinfo(rowRanges(airway)), "GRanges")["14"]
ridx <- rowRanges(airway) %over% chr14
airway[ridx,]
## class: RangedSummarizedExperiment 
## dim: 2244 8 
## metadata(1): ''
## assays(1): counts
## rownames(2244): ENSG00000006432 ENSG00000009830 ... ENSG00000273259 ENSG00000273307
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
## shortcut
chr14 <- as(seqinfo(airway), "GRanges")["14"]
airway[airway %over% chr14,]
## class: RangedSummarizedExperiment 
## dim: 2244 8 
## metadata(1): ''
## assays(1): counts
## rownames(2244): ENSG00000006432 ENSG00000009830 ... ENSG00000273259 ENSG00000273307
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

Use the assay() and rowSums() function to remove all rows from the airway object that have 0 reads overlapping all samples. Summarize the library size (column sums of assay()) and plot a histogram of the distribution of reads per feature of interest.

2.4 BED, GFF, GTF, WIG import and export

Genome annotations: BED, WIG, GTF, etc. files. E.g., GTF:

rtracklayer

2.5 FASTQ files

Sequenced reads: FASTQ files

@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
+
HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
@ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
+
DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################

ShortRead

2.6 Aligned reads

Aligned reads: BAM files

[GenomicAligments][]

Working with large files

2.6.1 GenomicAlignments exercise

The [RNAseqData.HNRNPC.bam.chr14][] package is an example of an experiment data package. It contains a subset of BAM files used in a gene knock-down experiment, as described in ?RNAseqData.HNRNPC.bam.chr14. Load the package and get the path to the BAM files.

library(RNAseqData.HNRNPC.bam.chr14)
fls = RNAseqData.HNRNPC.bam.chr14_BAMFILES
basename(fls)
## [1] "ERR127306_chr14.bam" "ERR127307_chr14.bam" "ERR127308_chr14.bam" "ERR127309_chr14.bam"
## [5] "ERR127302_chr14.bam" "ERR127303_chr14.bam" "ERR127304_chr14.bam" "ERR127305_chr14.bam"

Create BamFileList(), basically telling R that these are paths to BAM files rather than, say, text files from a spreadsheet.

library(GenomicAlignments)
bfls = BamFileList(fls)
bfl = bfls[[1]]

Input and explore the aligments. See ?readGAlignments and ?GAlignments for details on how to manipulate these objects.

ga = readGAlignments(bfl)
ga
## GAlignments object with 800484 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end     width     njunc
##               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##        [1]    chr14      +         72M        72  19069583  19069654        72         0
##        [2]    chr14      +         72M        72  19363738  19363809        72         0
##        [3]    chr14      -         72M        72  19363755  19363826        72         0
##        [4]    chr14      +         72M        72  19369799  19369870        72         0
##        [5]    chr14      -         72M        72  19369828  19369899        72         0
##        ...      ...    ...         ...       ...       ...       ...       ...       ...
##   [800480]    chr14      -         72M        72 106989780 106989851        72         0
##   [800481]    chr14      +         72M        72 106994763 106994834        72         0
##   [800482]    chr14      -         72M        72 106994819 106994890        72         0
##   [800483]    chr14      +         72M        72 107003080 107003151        72         0
##   [800484]    chr14      -         72M        72 107003171 107003242        72         0
##   -------
##   seqinfo: 93 sequences from an unspecified genome
table(strand(ga))
## 
##      +      -      * 
## 400242 400242      0

Many of the reads have cigar “72M”. What does this mean? Can you create a subset of reads that do not have this cigar? Interpret some of the non-72M cigars. Any hint about what these cigars represent?

tail(sort(table(cigar(ga))))
## 
## 18M123N54M 36M123N36M  64M316N8M 38M670N34M 35M123N37M        72M 
##        225        228        261        264        272     603939
ga[cigar(ga) != "72M"]
## GAlignments object with 196545 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end     width     njunc
##               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##        [1]    chr14      -     64M1I7M        72  19411677  19411747        71         0
##        [2]    chr14      + 55M2117N17M        72  19650072  19652260      2189         1
##        [3]    chr14      - 43M2117N29M        72  19650084  19652272      2189         1
##        [4]    chr14      - 40M2117N32M        72  19650087  19652275      2189         1
##        [5]    chr14      + 38M2117N34M        72  19650089  19652277      2189         1
##        ...      ...    ...         ...       ...       ...       ...       ...       ...
##   [196541]    chr14      -    51M1D21M        72 106950429 106950501        73         0
##   [196542]    chr14      +    31M1I40M        72 106960410 106960480        71         0
##   [196543]    chr14      +    52M1D20M        72 106965156 106965228        73         0
##   [196544]    chr14      -    13M1D59M        72 106965195 106965267        73         0
##   [196545]    chr14      -     6M1D66M        72 106965202 106965274        73         0
##   -------
##   seqinfo: 93 sequences from an unspecified genome

Use the function summarizeJunctions() to identify genomic regions that are spanned by reads with complicated cigars. Can you use the argument with.revmap=TRUE to extract the reads supporting a particular (e.g., first) junction?

summarizeJunctions(ga)
## GRanges object with 4635 ranges and 3 metadata columns:
##          seqnames                 ranges strand |     score plus_score minus_score
##             <Rle>              <IRanges>  <Rle> | <integer>  <integer>   <integer>
##      [1]    chr14   [19650127, 19652243]      * |         4          2           2
##      [2]    chr14   [19650127, 19653624]      * |         1          1           0
##      [3]    chr14   [19652355, 19653624]      * |         8          7           1
##      [4]    chr14   [19652355, 19653657]      * |         1          1           0
##      [5]    chr14   [19653773, 19653892]      * |         9          5           4
##      ...      ...                    ...    ... .       ...        ...         ...
##   [4631]    chr14 [106912703, 106922227]      * |         1          0           1
##   [4632]    chr14 [106938165, 106938301]      * |        10          2           8
##   [4633]    chr14 [106938645, 106944774]      * |        24          7          17
##   [4634]    chr14 [106944969, 106950170]      * |         7          6           1
##   [4635]    chr14 [106950323, 106960260]      * |         1          1           0
##   -------
##   seqinfo: 93 sequences from an unspecified genome
junctions <- summarizeJunctions(ga, with.revmap=TRUE)
ga[ junctions$revmap[[1]] ]
## GAlignments object with 4 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width     njunc
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##   [1]    chr14      + 55M2117N17M        72  19650072  19652260      2189         1
##   [2]    chr14      - 43M2117N29M        72  19650084  19652272      2189         1
##   [3]    chr14      - 40M2117N32M        72  19650087  19652275      2189         1
##   [4]    chr14      + 38M2117N34M        72  19650089  19652277      2189         1
##   -------
##   seqinfo: 93 sequences from an unspecified genome

It is possible to do other actions on BAM files, e.g., calculating the ‘coverage’ (reads overlapping each base).

coverage(bfl)$chr14
## integer-Rle of length 107349540 with 493510 runs
##   Lengths: 19069582       72   294083       17       55 ...       72       19       72   346298
##   Values :        0        1        0        1        2 ...        1        0        1        0

2.7 Called variants: VCF files

VariantAnnotation