1 Overview

Here, we perform a window-based differential binding (DB) analysis to identify regions of differential H3K9ac enrichment between pro-B and mature B cells (Revilla-I-Domingo et al. 2012). H3K9ac is associated with active promoters and tends to exhibit relatively narrow regions of enrichment relative to other marks such as H3K27me3. We download the BAM files using the relevant function from the chipseqDBData package1 BAM files are cached upon the first call to this function, so subsequent calls do not need to re-download the files.. The experimental design contains two biological replicates for each of the two cell types.

library(chipseqDBData)
acdata <- H3K9acData()
acdata
## DataFrame with 4 rows and 3 columns
##                  Name            Description
##           <character>            <character>
## 1    h3k9ac-proB-8113    pro-B H3K9ac (8113)
## 2    h3k9ac-proB-8108    pro-B H3K9ac (8108)
## 3 h3k9ac-matureB-8059 mature B H3K9ac (8059)
## 4 h3k9ac-matureB-8086 mature B H3K9ac (8086)
##                                                      Path
##                                               <character>
## 1    /tmp/RtmpUOJHLw/file4a3aa901d95/h3k9ac-proB-8113.bam
## 2    /tmp/RtmpUOJHLw/file4a3aa901d95/h3k9ac-proB-8108.bam
## 3 /tmp/RtmpUOJHLw/file4a3aa901d95/h3k9ac-matureB-8059.bam
## 4 /tmp/RtmpUOJHLw/file4a3aa901d95/h3k9ac-matureB-8086.bam

2 Pre-processing checks

2.1 Examining mapping statistics

We use methods from the Rsamtools package to compute some mapping statistics for each BAM file. Ideally, the proportion of mapped reads should be high (70-80% or higher), while the proportion of marked reads should be low (generally below 20%).

library(Rsamtools)
diagnostics <- list()
for (bam in acdata$Path) {
    total <- countBam(bam)$records
    mapped <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE)))$records
    marked <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[basename(bam)]] <- c(Total=total, Mapped=mapped, Marked=marked)
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats
##                            Total  Mapped Marked Prop.mapped Prop.marked
## h3k9ac-proB-8113.bam    10724526 8832006 434884    82.35335    4.923955
## h3k9ac-proB-8108.bam    10413135 7793913 252271    74.84694    3.236770
## h3k9ac-matureB-8059.bam 16675372 4670364 396785    28.00756    8.495805
## h3k9ac-matureB-8086.bam  6347683 4551692 141583    71.70635    3.110558

Note that all csaw functions that read from a BAM file require BAM indices with .bai suffixes. In this case, index files have already been downloaded by H3K9acData(), but users supplying their own files should take care to ensure that BAM indices are available with appropriate names.

2.2 Obtaining the ENCODE blacklist for mm10

A number of genomic regions contain high artifactual signal in ChIP-seq experiments. These often correspond to genomic features like telomeres or microsatellite repeats. For example, multiple tandem repeats in the real genome are reported as a single unit in the genome build. Alignment of all (non-specifically immunoprecipitated) reads from the former will result in artificially high coverage of the latter. Moreover, differences in repeat copy numbers between conditions can lead to detection of spurious DB.

As such, these problematic regions must be removed prior to further analysis. This is done with an annotated blacklist for the mm10 build of the mouse genome, constructed by identifying consistently problematic regions from ENCODE datasets (ENCODE Project Consortium 2012). We download this BED file and save it into a local cache with the BiocFileCache package. This allows it to be used again in later workflows without being re-downloaded.

library(BiocFileCache)
bfc <- BiocFileCache("local", ask=FALSE)
black.path <- bfcrpath(bfc, file.path("https://www.encodeproject.org",
    "files/ENCFF547MET/@@download/ENCFF547MET.bed.gz"))

Genomic intervals in the blacklist are loaded using the import() method from the rtracklayer package. All reads mapped within the blacklisted intervals will be ignored during processing in csaw by specifying the discard= parameter (see below).

library(rtracklayer)
blacklist <- import(black.path)
blacklist
## GRanges object with 164 ranges and 0 metadata columns:
##         seqnames              ranges strand
##            <Rle>           <IRanges>  <Rle>
##     [1]    chr10     3110061-3110270      *
##     [2]    chr10   22142531-22142880      *
##     [3]    chr10   22142831-22143070      *
##     [4]    chr10   58223871-58224100      *
##     [5]    chr10   58225261-58225500      *
##     ...      ...                 ...    ...
##   [160]     chr9     3038051-3038300      *
##   [161]     chr9   24541941-24542200      *
##   [162]     chr9   35305121-35305620      *
##   [163]     chr9 110281191-110281400      *
##   [164]     chr9 123872951-123873160      *
##   -------
##   seqinfo: 19 sequences from an unspecified genome; no seqlengths

Any user-defined set of regions can be used as a blacklist in this analysis.

  • For example, one could use predicted repeat regions from the UCSC genome annotation (Rosenbloom et al. 2015). This tends to remove a greater number of problematic regions (especially microsatellites) compared to the ENCODE blacklist. However, the size of the UCSC list means that genuine DB sites may also be removed. Thus, the ENCODE blacklist is preferred for most applications.
  • Alternatively, if negative control samples are available, they can be used to empirically identify problematic regions with the GreyListChIP package. These regions should be ignored as they have high coverage in the controls and are unlikely to be genuine binding sites.

2.3 Setting up the read extraction parameters

In the csaw package, the readParam object determines which reads are extracted from the BAM files. The intention is to set this up once and to re-use it in all relevant functions. For this analysis, reads are ignored if they map to blacklist regions or do not map to the standard set of mouse nuclear chromosomes2 In this case, we are not interested in the mitochondrial genome, as these should not be bound by histones anyway..

library(csaw)
standard.chr <- paste0("chr", c(1:19, "X", "Y"))
param <- readParam(minq=20, discard=blacklist, restrict=standard.chr)

Reads are also ignored if they have a mapping quality (MAPQ) score below 203 This is more stringent than usual, to account for the fact that the short reads ued here (32-36 bp) are more difficult to accurately align.. This avoids spurious results due to weak or non-unique alignments that should be assigned low MAPQ scores by the aligner. Note that the range of MAPQ scores will vary between aligners, so some inspection of the BAM files is necessary to choose an appropriate value.

3 Computing the average fragment length

Strand bimodality is often observed in ChIP-seq experiments involving narrow binding events like H3K9ac marking. This refers to the presence of distinct subpeaks on each strand and is quantified with cross-correlation plots (Kharchenko, Tolstorukov, and Park 2008). A strong peak in the cross-correlations should be observed if immunoprecipitation was successful. The delay distance at the peak corresponds to the distance between forward- and reverse-strand subpeaks. This is identified from Figure 1 and is used as the average fragment length for this analysis.

x <- correlateReads(acdata$Path, param=reform(param, dedup=TRUE))
frag.len <- maximizeCcf(x)
frag.len
## [1] 154
plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")
abline(v=frag.len, col="red")
text(x=frag.len, y=min(x), paste(frag.len, "bp"), pos=4, col="red")
Cross-correlation function (CCF) against delay distance for the H3K9ac data set. The delay with the maximum correlation is shown as the red line.

Figure 1: Cross-correlation function (CCF) against delay distance for the H3K9ac data set. The delay with the maximum correlation is shown as the red line.

Only unmarked reads (i.e., not potential PCR duplicates) are used to calculate the cross-correlations. This reduces noise from variable PCR amplification and decreases the size of the “phantom” peak at the read length (Landt et al. 2012). However, general removal of marked reads is risky as it caps the signal in high-coverage regions of the genome. This can result in loss of power to detect DB, or introduction of spurious DB when the same cap is applied to libraries of different sizes. Thus, the marking status of each read will be ignored in the rest of the analysis, i.e., no duplicates will be removed in downstream steps.

4 Counting reads into windows

csaw uses a sliding window strategy to quantify protein binding intensity across the genome. Each read is directionally extended to the average fragment length (Figure 2) to represent the DNA fragment from which that read was sequenced. Any position within the inferred fragment is a potential contact site for the protein of interest. To quantify binding in a genomic window, the number of these fragments overlapping the window is counted. The window is then moved to its next position on the genome and counting is repeated4 Each read is usually counted into multiple windows, which will introduce correlations between adjacent windows but will not otherwise affect the analysis.. This is done for all samples such that a count is obtained for each window in each sample.

Directional extension of reads by the average fragment length `ext` in single-end ChIP-seq data. Each extended read represents an imputed fragment, and the number of fragments overlapping a window of a given `width` is counted.

Figure 2: Directional extension of reads by the average fragment length ext in single-end ChIP-seq data. Each extended read represents an imputed fragment, and the number of fragments overlapping a window of a given width is counted.

The windowCounts() function produces a RangedSummarizedExperiment object containing a matrix of such counts. Each row corresponds to a window; each column represents a BAM file corresponding to a single sample5 Counting can be parallelized across files using the BPPARAM= argument.; and each entry of the matrix represents the number of fragments overlapping a particular window in a particular sample.

win.data <- windowCounts(acdata$Path, param=param, width=150, ext=frag.len)
win.data
## class: RangedSummarizedExperiment 
## dim: 1671254 4 
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(4): bam.files totals ext rlen

To analyze H3K9ac data, a window size of 150 bp is used here. This corresponds roughly to the length of the DNA in a nucleosome (Humburg et al. 2011), which is the smallest relevant unit for studying histone mark enrichment. The spacing between windows is set to the default of 50 bp, i.e., the start positions for adjacent windows are 50 bp apart. Smaller spacings can be used to improve spatial resolution, but will increase memory usage and runtime by increasing the number of windows required to cover the genome. This is unnecessary as increased resolution confers little practical benefit for this data set – counts for very closely spaced windows will be practically identical. Finally, windows with very low counts (by default, less than a sum of 10 across all samples) are removed to reduce memory usage. This represents a preliminary filter to remove uninteresting windows corresponding to likely background regions.

5 Filtering windows by abundance

As previously mentioned, low-abundance windows contain no binding sites and need to be filtered out. This improves power by removing irrelevant tests prior to the multiple testing correction; avoids problems with discreteness in downstream statistical methods; and reduces computational work for further analyses. Here, filtering is performed using the average abundance of each window (McCarthy, Chen, and Smyth 2012), which is defined as the average log-count per million for that window. This performs well as an independent filter statistic for NB-distributed count data (Lun and Smyth 2014).

The filter threshold is defined based on the assumption that most regions in the genome are not marked by H3K9ac. Reads are counted into large bins and the median coverage across those bins is used as an estimate of the background abundance6 Large bins are necessary to obtain a precise estimate of background coverage, which would otherwise be too low in individual windows.. This estimate is then compared to the average abundances of the windows, after rescaling to account for differences in the window and bin sizes. A window is only retained if its coverage is 3-fold higher than that of the background regions, i.e., the abundance of the window is greater than the background abundance estimate by log2(3) or more. This removes a large number of windows that are weakly or not marked and are likely to be irrelevant.

bins <- windowCounts(acdata$Path, bin=TRUE, width=2000, param=param)
filter.stat <- filterWindowsGlobal(win.data, bins)
min.fc <- 3
keep <- filter.stat$filter > log2(min.fc)
summary(keep)
##    Mode   FALSE    TRUE 
## logical  982167  689087

The effect of the fold-change threshold is examined visually in Figure 3. The chosen threshold is greater than the abundances of most bins in the genome – presumably, those that contain background regions. This suggests that the filter will remove most windows lying within background regions.

hist(filter.stat$back.abundances, main="", breaks=50,
    xlab="Background abundance (log2-CPM)")
threshold <- filter.stat$abundances[1] - filter.stat$filter[1] + log2(min.fc)
abline(v=threshold, col="red")
Histogram of average abundances across all 2 kbp genomic bins. The filter threshold is shown as the red line.

Figure 3: Histogram of average abundances across all 2 kbp genomic bins. The filter threshold is shown as the red line.

The filtering itself is done by simply subsetting the RangedSummarizedExperiment object.

filtered.data <- win.data[keep,]

6 Normalizing for sample-specific trended biases

Normalization is required to eliminate confounding sample-specific biases prior to any comparisons between samples. Here, a trended bias is present between samples in Figure 4. This refers to a systematic fold-difference in per-window coverage between samples that changes according to the average abundance of the window.

win.ab <- scaledAverage(filtered.data)
adjc <- calculateCPM(filtered.data, use.offsets=FALSE)
logfc <- adjc[,4] - adjc[,1]
smoothScatter(win.ab, logfc, ylim=c(-6, 6), xlim=c(0, 5),
    xlab="Average abundance", ylab="Log-fold change")

lfit <- smooth.spline(logfc~win.ab, df=5)
o <- order(win.ab)
lines(win.ab[o], fitted(lfit)[o], col="red", lty=2)