Detecting differential binding in ChIP-seq data with csaw

Package: csaw
Author: Aaron Lun (alun@wehi.edu.au)
Compilation date: 2015-07-17

Motivation

I'll assume that everyone here knows what ChIP-seq is - but if not, then you can look at this cute picture.

[Figure: schematic of the ChIP-seq protocol]

  - Why look for differential binding (DB) between biological conditions?
  - Why look for de novo sites?
  - Why use windows?

Overall workflow

It's a fairly modular process:

  1. Counting reads into windows
  2. Computing normalization factors
  3. Filtering out low-abundance windows
  4. Testing for differential binding
  5. Aggregating windows into clusters
  6. Visualization, annotation and other stuff

But before we get into that, we have to set up some data. We'll be using the libraries from Tiwari et al.'s 2012 study, which compares binding of the NF-YA transcription factor in embryonic stem cells (ESCs) and terminal neurons (TNs).

bam.files <- file.path("/home/ubuntu/data/aaronlun", 
    c("es_1.bam", "es_2.bam", "tn_1.bam", "tn_2.bam"))

These are BAM files, so we've already mapped the reads to the mm10 build of the mouse genome. For those who are interested, this was done using subread, with some help from SAMtools and Picard. The full processing pipeline can be found at the location below.

system.file("doc", "sra2bam.sh", package="csaw")
## [1] "/home/ubuntu/R-libs/csaw/doc/sra2bam.sh"

Counting reads into windows

The first step is to count reads into windows. Reads are directionally extended to the average fragment length, to represent the original fragments that were present during immunoprecipitation. Windows of constant size (10 bp) are tiled across the genome at regular intervals (50 bp).

frag.len <- 110
window.width <- 10
spacing <- 50

The number of imputed fragments overlapping each window is then counted in each library. We use the windowCounts function to do the actual counting. We filter out low-quality alignments (mapping quality score less than 50) and we ignore everything except the standard chromosome set.

param <- readParam(minq=50, restrict=paste0('chr', c(1:19, 'X', 'Y')))
data <- windowCounts(bam.files, ext=frag.len, width=window.width, spacing=spacing, param=param)
data
## class: RangedSummarizedExperiment 
## dim: 3950319 4 
## metadata(4): spacing width shift final.ext
## assays(1): counts
## rownames: NULL
## rowRanges metadata column names(0):
## colnames: NULL
## colData names(4): bam.files totals ext param

This gives us a RangedSummarizedExperiment object that we can play around with. Check out rowRanges(data) for the genomic coordinates of the windows, assay(data) for the counts in each window, and colData(data) for library-specific information.
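
For example (these accessors come from the SummarizedExperiment package):

rowRanges(data)    # genomic coordinates of each window
head(assay(data))  # counts, one row per window and one column per library
colData(data)      # library-specific information (totals, extension length, etc.)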

Interesting questions

How do we choose the fragment length? One approach is to compute the cross-correlation function (CCF) between read coverage on the forward and reverse strands; the delay with the strongest correlation corresponds to the average fragment length. Duplicates are removed here to mitigate the spike at the read length.

max.delay <- 500
dedup.on <- readParam(dedup=TRUE, minq=50)
x <- correlateReads(bam.files, max.delay, param=dedup.on)
plot(0:max.delay, x, type="l", ylab="CCF", xlab="Delay (bp)")

[Plot: cross-correlation coefficient (CCF) against delay distance]
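
A quick base-R sketch for extracting the location of the peak, as an estimate of the average fragment length:

which.max(x) - 1L  # delay (bp) with the strongest cross-correlation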

Advanced usage

A number of useful read extraction parameters can be specified within the readParam object. These options allow us to:

  - remove duplicate reads (dedup)
  - filter out poor-quality alignments (minq)
  - restrict extraction to a subset of chromosomes (restrict)
  - discard reads in problematic regions (discard)
  - handle paired-end data (pe)

The idea is to define a single readParam object for use throughout the analysis, to ensure the same reads are being extracted each time (with some exceptions, e.g., correlateReads). For example, look at the one that we previously constructed.

param
##     Extracting reads in single-end mode
##     Duplicate removal is turned off 
##     Minimum allowed mapping score is 50 
##     Reads are extracted from both strands
##     Read extraction is limited to 21 sequences
##     No regions are specified to discard reads
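
If we wanted different behaviour, e.g., duplicate removal or discarding reads in problematic regions, we could construct another readParam. A sketch, where the blacklist coordinates are invented for illustration:

blacklist <- GRanges("chr1", IRanges(1, 3000000))  # hypothetical blacklist
alt.param <- readParam(minq=50, dedup=TRUE, discard=blacklist)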

Computing normalization factors

There are several different normalization options in csaw, but we'll just focus on removing composition bias. This bias is introduced when DB regions take up a bigger slice of the sequencing pie in one library, suppressing the coverage of everything else (e.g., non-DB sites, background regions). We don't want to detect differences due to this suppression, so we normalize under the assumption that most of the genome is non-DB background. First, we count reads into large 10 kbp bins to quantify the coverage of the background regions.

binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)

Any systematic difference between libraries across these background bins must be artifactual and should be eliminated by scaling normalization. The normalize function is an S4 method that wraps the normalizeCounts function, which in turn wraps the calcNormFactors function from the edgeR package. The actual normalization is performed using the trimmed mean of M-values (TMM) method.

normfacs <- normalize(binned)
normfacs
## [1] 1.006 0.973 1.015 1.007

We can have a look at these factors on MA plots. The code below compares the binned counts for libraries 1 and 2, i.e., the two ES replicates. Note the big blob, which corresponds mostly to the background regions of the genome. The location of the red line represents the log-scaling factor used to normalize one library against the other.

# Log-CPM values for each bin in each library.
adj.counts <- cpm(asDGEList(binned), log=TRUE)
cur.x <- adj.counts[,1]
cur.y <- adj.counts[,2]
smoothScatter(x=(cur.x+cur.y)/2+6*log2(10), y=cur.x-cur.y,
  xlab="A", ylab="M", main="1 vs 2")
# The log-ratio of the normalization factors defines the expected M-value
# for non-DB background regions.
all.dist <- diff(log2(normfacs[c(2, 1)]))
abline(h=all.dist, col="red")

[Plot: MA plot for libraries 1 vs 2 using binned counts, with the scaling factor as a red line]

Interesting questions

Advanced usage

Another source of bias comes from differences in immunoprecipitation efficiency between libraries. If such differences are present, we wouldn't want to detect them, because they're not biologically interesting. We can eliminate them by removing any systematic difference in abundance across high-abundance windows (i.e., bound sites). This involves filtering (see the next section) and normalizing on the windows that remain.

keep <- aveLogCPM(asDGEList(data)) > -1
normalize(data[keep,])
## [1] 0.610 0.799 1.533 1.338

Note: we can only pick one set of normalization factors - either to remove composition bias, or to remove efficiency bias. The choice depends on the biological context of the experiment. Specifically, do you expect wholesale changes in binding between conditions?

In this case, we'll go for composition bias, as there's no reason that NF-YA binding should be constant between cell types.

Trended biases may also be apparent in the MA plots. These cannot be eliminated by scaling methods like TMM, but require non-linear methods instead. In csaw, this can be achieved by setting type="loess" in the call to normalize. This performs loess-based normalization to remove the trend, and produces an offset matrix for use in later model fitting.
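
A minimal sketch of how that might look, assuming the MA plots actually show a trend worth removing:

offsets <- normalize(data, type="loess")  # matrix of offsets, one per count
head(offsets)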

Filtering away low-abundance windows

Filtering is done using the average abundance of each window, as computed using aveLogCPM in edgeR. This can be interpreted as the log-(average count)-per-million for each window. In our statistical model, the average abundance is roughly independent of the DB status of each window, so we avoid data snooping when we select for high-abundance windows.

abundance <- aveLogCPM(asDGEList(data))

We then keep the windows that are considered to be high-abundance, according to some threshold. The RangedSummarizedExperiment object can be easily subsetted to achieve this (we'll hold onto the original, just in case).

keep <- abundance > -1
original <- data
data <- data[keep,]
nrow(data)
## [1] 13823

A very large number of low-abundance windows is problematic, as these windows mess up some of edgeR's statistical methods and assumptions. Filtering gets rid of most of them and their associated problems, while also reducing computational work and the severity of the multiple testing correction.

Interesting questions

How do we choose a sensible filter threshold? One option is to estimate the global background coverage from large (2 kbp) bins, and to keep only those windows with at least 3-fold enrichment over this background.

binned.2 <- windowCounts(bam.files, bin=TRUE, width=2000, param=param)
filter.stat.bg <- filterWindows(original, background=binned.2, type="global")
background.keep <- filter.stat.bg$filter > log2(3)
summary(background.keep)
##    Mode   FALSE    TRUE    NA's 
## logical 3930588   19731       0

Advanced usage

We can mimic peak callers like MACS, where the background is estimated locally for each window. In this example, we define the local background as the 2 kbp interval around each window.

surrounds <- 2000
neighbour <- suppressWarnings(resize(rowRanges(original), surrounds, fix="center"))
wider <- regionCounts(bam.files, regions=neighbour, ext=frag.len, param=param)

We then use the filterWindows function to process these into enrichment values for each window over its local background. We keep those windows that show a 3-fold or higher increase in abundance over their neighbours.

filter.stat.loc <- filterWindows(original, wider, type="local")
local.keep <- filter.stat.loc$filter > log2(3)
summary(local.keep)
##    Mode   FALSE    TRUE    NA's 
## logical 3940821    9498       0

Testing for differential binding

We first define the experimental design. This is fairly easy for our set-up, which contains two groups of two biological replicates. More complex designs can also be accommodated.

grouping <- factor(c('es', 'es', 'tn', 'tn'))
design <- model.matrix(~0 + grouping)
colnames(design) <- levels(grouping)

We use the negative binomial (NB) framework in the edgeR package. This accounts for low, discrete counts that exhibit overdispersion between biological replicates. Specifically, variability between replicates is modelled using the NB dispersion parameter, as estimated through the estimateDisp function.

y <- asDGEList(data, norm.factors=normfacs)
y <- estimateDisp(y, design)

We can augment this model with quasi-likelihood (QL) methods. The QL dispersions account for window-specific variability, while the NB dispersions model biological variability between replicates. We fit a generalized linear model (GLM) to the counts for each window, where the QL dispersion for that window is estimated from the GLM deviance.

fit <- glmQLFit(y, design, robust=TRUE)

Finally, we wrap things up by using the QL F-test to test for DB between our two cell types.

contrast <- makeContrasts(es - tn, levels=design)
results <- glmQLFTest(fit, contrast=contrast)
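
If you want a quick peek at the most significant individual windows at this point, edgeR's topTags can be applied to the test results. Keep in mind that these are window-level statistics; proper region-level summaries come in the next section.

topTags(results)  # top windows, ranked by p-value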

Interesting questions

What if we didn't have any replicates? A common fallback is to assume a "reasonable" fixed value for the NB dispersion and to use likelihood ratio tests instead. We can also check that replicates of the same condition cluster together on an MDS plot of the binned counts.

# No replicates: fix the dispersion at a plausible value and use the LRT.
norep.fit <- glmFit(y, design, dispersion=0.05)
norep.results <- glmLRT(norep.fit, contrast=contrast)
# Check sample similarity on an MDS plot of the binned log-CPM values.
bin.adjc <- cpm(asDGEList(binned.2), log=TRUE)
plotMDS(bin.adjc, labels=grouping)

[Plot: MDS plot of binned log-CPM values, labelled by group]

Advanced usage

This isn't really that advanced, but we can plot the NB dispersions with the plotBCV command. In the QL framework, we focus on the trend as only the trended NB dispersions will be used later. We usually see a decreasing trend with abundance, possibly plateauing off at very large abundances.

plotBCV(y)

[Plot: NB dispersion estimates against average abundance, with the fitted trend]

Similarly, for the QL dispersions, we can use the plotQLDisp function. This shows the effect of empirical Bayes shrinkage on the estimates, whereby the raw values are squeezed towards the trend. We use the shrunken estimates for all downstream processing. The idea is to reduce uncertainty in the presence of limited replication, to improve detection power for DB testing.

plotQLDisp(fit)

[Plot: QL dispersion estimates before and after empirical Bayes shrinkage]

If you have more complicated experiments (e.g., multiple groups, blocking factors), you might have to think more carefully about the design matrix and the contrasts. Check out the edgeR user's guide for how to deal with them.
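
As a sketch, a design with a blocking factor might look like this (the batch assignment below is invented for illustration):

batch <- factor(c(1, 2, 1, 2))  # hypothetical batch structure
design.blocked <- model.matrix(~batch + grouping)
# DB between cell types would then be tested on the last coefficient, e.g.,
# glmQLFTest(fit.blocked, coef=ncol(design.blocked)).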

Aggregating windows into clusters

To correct for multiple testing across thousands of windows, we can think about controlling the false discovery rate (FDR). However, controlling the FDR across all windows isn't that useful, as few people would interpret ChIP-seq results in terms of windows. Instead, we want to control the FDR across all genomic regions, i.e., the region-level FDR.

To do that, we can cluster adjacent windows into genomic regions using the mergeWindows function. Windows that are less than tol apart get thrown into the same cluster. Merging of adjacent or overlapping windows reduces redundancy and simplifies interpretation of the results. Each cluster now represents a distinct genomic region.

clustered <- mergeWindows(rowRanges(data), tol=1000)
clustered$region
## GRanges object with 4238 ranges and 0 metadata columns:
##          seqnames               ranges strand
##             <Rle>            <IRanges>  <Rle>
##      [1]     chr1   [6466701, 6466760]      *
##      [2]     chr1   [7088951, 7088960]      *
##      [3]     chr1   [7397901, 7398110]      *
##      [4]     chr1   [9541401, 9541510]      *
##      [5]     chr1   [9545251, 9545360]      *
##      ...      ...                  ...    ...
##   [4234]     chrY [  259151,   259360]      *
##   [4235]     chrY [90761101, 90761110]      *
##   [4236]     chrY [90805151, 90805160]      *
##   [4237]     chrY [90808801, 90808910]      *
##   [4238]     chrY [90812101, 90813810]      *
##   -------
##   seqinfo: 21 sequences from an unspecified genome

We can combine p-values for all windows in each cluster using Simes' method. This computes a single combined p-value for each cluster/region, representing the evidence against the global null. We also get some other statistics, like the total number of windows in each cluster as well as the number that are changing substantially in each direction.

tabcom <- combineTests(clustered$id, results$table)
head(tabcom)
##   nWindows logFC.up logFC.down PValue    FDR
## 1        2        0          0 0.7041 0.7698
## 2        1        0          1 0.0179 0.0527
## 3        5        0          5 0.0800 0.1462
## 4        3        0          3 0.0442 0.0961
## 5        3        0          3 0.0589 0.1181
## 6        2        0          2 0.0121 0.0409

Each row of the output table corresponds to a cluster/region. Controlling the FDR across regions is simple, as we can just apply the Benjamini-Hochberg correction on the combined p-values.
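
For example, using a 5% threshold on the FDR column (which already contains the BH-adjusted combined p-values):

is.sig <- tabcom$FDR <= 0.05
summary(is.sig)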

Interesting questions

What if the clusters are too large, or we only care about the strongest window in each cluster? We can cap the cluster size with max.width in mergeWindows, and we can use getBestTest to report the window with the lowest p-value in each cluster, after adjusting for the number of windows in that cluster.

# Cap the size of each merged cluster at 5 kbp.
clustered2 <- mergeWindows(rowRanges(data), tol=100, max.width=5000)
# Report the strongest window in each of the original clusters.
tab.best <- getBestTest(clustered$id, results$table)
head(tab.best)
##   best  logFC  logCPM     F PValue    FDR
## 1    2 -0.295 -0.9971 0.155 1.0000 1.0000
## 2    3 -1.858 -0.8812 6.091 0.0179 0.0546
## 3    6 -1.211  1.1286 5.999 0.0934 0.1716
## 4   10 -1.454 -0.0262 5.291 0.0798 0.1524
## 5   13 -1.514 -0.7097 4.550 0.1169 0.2025
## 6   16 -2.204 -0.8219 8.383 0.0121 0.0432

Advanced usage

We can use many different clustering strategies, so long as they are blind to the DB status of the windows. One strategy that is particularly useful is to cluster windows according to the promoters in which they are located. This allows for direct coordination with, say, a differential expression analysis.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
gene.bodies <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
prom <- promoters(gene.bodies, upstream=3000, downstream=1000)
head(prom)
## GRanges object with 6 ranges and 1 metadata column:
##             seqnames                 ranges strand |     gene_id
##                <Rle>              <IRanges>  <Rle> | <character>
##   100009600     chr9 [ 21074497,  21078496]      - |   100009600
##   100009609     chr7 [ 84963010,  84967009]      - |   100009609
##   100009614    chr10 [ 77708446,  77712445]      + |   100009614
##   100009664    chr11 [ 45805083,  45809082]      + |   100009664
##      100012     chr4 [144161652, 144165651]      - |      100012
##      100017     chr4 [134767005, 134771004]      - |      100017
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

We then identify the windows overlapping each promoter. The set of windows for the promoter of each gene is defined as a cluster.

olap <- findOverlaps(prom, rowRanges(data))
olap
## Hits object with 12282 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]         7        8461
##       [2]         7        8462
##       [3]        18       10101
##       [4]        18       10102
##       [5]        18       10103
##       ...       ...         ...
##   [12278]     23643        7903
##   [12279]     23643        7904
##   [12280]     23646        8274
##   [12281]     23646        8275
##   [12282]     23646        8276
##   -------
##   queryLength: 23653
##   subjectLength: 13823

It's then a simple matter to combine the p-values, as we did before. This is done using a wrapper function that directly accepts a Hits object. We end up with a table, with one row for each entry of prom.

tabbroad <- combineOverlaps(olap, results$table)
head(tabbroad[!is.na(tabbroad$PValue),])
##    nWindows logFC.up logFC.down   PValue     FDR
## 7         2        0          2 0.000120 0.00261
## 18        5        0          5 0.000924 0.00744
## 22        2        0          2 0.179911 0.24410
## 25        4        0          0 0.935199 0.95201
## 28        3        0          3 0.005547 0.02282
## 36        4        0          4 0.016083 0.04409

This means we can make a direct statement regarding DB in a promoter region. Note that the NAs represent those entries of prom that have no overlapping windows; they are not shown here because they're not particularly exciting.

Other stuff

Simple gene-based annotation can be added to the regions using the detailRanges function. This identifies annotated genomic features (exons, promoters, introns) that are overlapped by each region, as well as those flanking each region (plus the gap distance between each flanking feature and the region).

library(org.Mm.eg.db)
anno <- detailRanges(clustered$region, orgdb=org.Mm.eg.db,
  txdb=TxDb.Mmusculus.UCSC.mm10.knownGene)
head(anno$overlap)
## [1] ""                               "Pcmtd1|0-1|+"                  
## [3] ""                               ""                              
## [5] "Rrs1|0|+,Adhfe1|0|+"            "Vcpip1|0|-,1700034P13Rik|0-1|+"

Some explanation may be in order here. Each string represents a comma-separated list, where each entry is of the form SYMBOL|EXONS|STRAND, i.e., the symbol of the relevant gene, the exons that are overlapped (I for introns, and 0 for promoters) and the strand of the gene. For flanking regions, an additional [DISTANCE] field is present that indicates the distance between the annotated feature and each region.

head(anno$left)
## [1] ""               ""               ""               ""              
## [5] ""               "Vcpip1|1|-[19]"

We can now put together a table to save to disk, for other people to read and whatnot. Of course, you can save to other formats if you wish, using packages like rtracklayer.

all.results <- data.frame(as.data.frame(clustered$region)[,1:3], tabcom, anno)
all.results <- all.results[order(all.results$PValue),]
write.table(all.results, file="saved.tsv", row.names=FALSE, quote=FALSE, sep="\t")
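
As a sketch of the rtracklayer route, the clustered regions can be exported directly to BED (assuming rtracklayer is installed):

library(rtracklayer)
export(clustered$region, con="saved.bed")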

Simple visualization can be performed using the Gviz package. We extract reads from the BAM file using the extractReads function, and we plot coverage in reads-per-million. This is done separately for each strand in blue and red.

library(Gviz)
cur.region <- GRanges("chr18", IRanges(77806807, 77807165))
collected <- list()
for (i in seq_along(bam.files)) {
  reads <- extractReads(bam.files[i], cur.region)
  # Strand-specific coverage, scaled to reads-per-million.
  pcov <- as(coverage(reads[strand(reads)=="+"])/data$totals[i]*1e6, "GRanges")
  ncov <- as(coverage(reads[strand(reads)=="-"])/data$totals[i]*1e6, "GRanges")
  ptrack <- DataTrack(pcov, type="histogram", lwd=0, fill=rgb(0,0,1,.4), ylim=c(0,1),
      name=bam.files[i], col.axis="black", col.title="black")
  ntrack <- DataTrack(ncov, type="histogram", lwd=0, fill=rgb(1,0,0,.4), ylim=c(0,1))
  # Overlay the forward (blue) and reverse (red) strand tracks.
  collected[[i]] <- OverlayTrack(trackList=list(ptrack, ntrack))
}
gax <- GenomeAxisTrack(col="black")
plotTracks(c(gax, collected), from=start(cur.region), to=end(cur.region))

[Plot: coverage tracks for each library around chr18:77806807-77807165, forward strand in blue and reverse strand in red]

Interesting questions

What if we want to tweak the annotation parameters? The dist argument sets the maximum flanking distance, while promoter defines the promoter interval around each transcription start site.

anno2 <- detailRanges(clustered$region, orgdb=org.Mm.eg.db,
  txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, dist=3000, promoter=c(0, 2000))

Advanced usage

It may be useful to store the results as a GRanges for further manipulation. This can be done by stuffing the various statistics into the metadata fields.

all.regions <- clustered$region
elementMetadata(all.regions) <- tabcom
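
For example, we might subset to the regions that are significant at our chosen threshold (5% here, as before):

sig.regions <- all.regions[all.regions$FDR <= 0.05]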

Final comments

You've just completed a window-based de novo analysis of differential binding. For more information, you can have a look at the csaw user's guide, or the documentation for each function.

csawUsersGuide()

For bonus points, guess the foodstuff that the proteins are drawn as in the first picture.