1 Introduction

As sequencing techniques have improved, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) has become a popular way to study genome-wide protein-DNA interactions. To address the lack of powerful ChIP-Seq analysis methods, we presented Model-based Analysis of ChIP-Seq (MACS) for identifying transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions, and it improves the spatial resolution of binding sites by combining information on both sequencing tag position and orientation. MACS can be used on ChIP-Seq data alone, or with a control sample to increase specificity. Moreover, as a general peak caller, MACS can be applied to any “DNA enrichment assay” where the question is simply: where is read coverage significantly higher than the random background?

This package is a wrapper of the MACS toolkit based on basilisk.

2 Load the package

The package is built on basilisk. The Python library macs3, on which it depends, is installed automatically inside its own conda environment.

library(MACSr)

3 Usage

3.1 MACS3 functions

There are 14 functions imported from MACS3. Details of each function can be found in its manual page.

Functions     Description
callpeak      Main MACS3 Function to call peaks from alignment results.
bdgpeakcall   Call peaks from bedGraph output.
bdgbroadcall  Call broad peaks from bedGraph output.
bdgcmp        Comparing two signal tracks in bedGraph format.
bdgopt        Operate the score column of bedGraph file.
cmbreps       Combine BEDGraphs of scores from replicates.
bdgdiff       Differential peak detection based on paired four bedGraph files.
filterdup     Remove duplicate reads, then save in BED/BEDPE format.
predictd      Predict d or fragment size from alignment results.
pileup        Pileup aligned reads (single-end) or fragments (paired-end).
randsample    Randomly choose a number/percentage of total reads.
refinepeak    Take raw reads alignment, refine peak summits.
callvar       Call variants in given peak regions from the alignment BAM files.
hmmratac      Dedicated peak calling based on Hidden Markov Model for ATAC-seq data.

3.2 Function callpeak

We have uploaded multiple test datasets from MACS to the data package MACSdata on ExperimentHub. For example, here we download a pair of single-end BED files to run the callpeak function.

eh <- ExperimentHub::ExperimentHub()
eh <- AnnotationHub::query(eh, "MACSdata")
CHIP <- eh[["EH4558"]]
#> see ?MACSdata and browseVignettes('MACSdata') for documentation
#> loading from cache
CTRL <- eh[["EH4563"]]
#> see ?MACSdata and browseVignettes('MACSdata') for documentation
#> loading from cache

Here is an example that calls narrow and broad peaks on the single-end BED files.

cp1 <- callpeak(CHIP, CTRL, gsize = 5.2e7, store_bdg = TRUE,
                name = "run_callpeak_narrow0", outdir = tempdir(),
                cutoff_analysis = TRUE)
#> INFO  @ 17 Apr 2024 17:46:39: [614 MB] 
#> # Command line: 
#> # ARGUMENTS LIST:
#> # name = run_callpeak_narrow0
#> # format = AUTO
#> # ChIP-seq file = ['/home/biocbuild/.cache/R/ExperimentHub/1e5a96bd5911c_4601']
#> # control file = ['/home/biocbuild/.cache/R/ExperimentHub/1e5a963d85fa01_4606']
#> # effective genome size = 5.20e+07
#> # band width = 300
#> # model fold = [5.0, 50.0]
#> # qvalue cutoff = 5.00e-02
#> # The maximum gap between significant sites is assigned as the read length/tag size.
#> # The minimum length of peaks is assigned as the predicted fragment length "d".
#> # Larger dataset will be scaled towards smaller dataset.
#> # Range for calculating regional lambda is: 1000 bps and 10000 bps
#> # Broad region calling is off
#> # Additional cutoff on fold-enrichment is: 0.10
#> # Paired-End mode is off
#>  
#> INFO  @ 17 Apr 2024 17:46:39: [614 MB] #1 read tag files... 
#> INFO  @ 17 Apr 2024 17:46:39: [614 MB] #1 read treatment tags... 
#> INFO  @ 17 Apr 2024 17:46:39: [618 MB] Detected format is: BED 
#> INFO  @ 17 Apr 2024 17:46:39: [618 MB] * Input file is gzipped. 
#> INFO  @ 17 Apr 2024 17:46:39: [621 MB] #1.2 read input tags... 
#> INFO  @ 17 Apr 2024 17:46:39: [621 MB] Detected format is: BED 
#> INFO  @ 17 Apr 2024 17:46:39: [621 MB] * Input file is gzipped. 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 tag size is determined as 101 bps 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 tag size = 101.0 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  total tags in treatment: 49622 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 user defined the maximum tags... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  tags after filtering in treatment: 48047 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  Redundant rate of treatment: 0.03 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  total tags in control: 50837 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 user defined the maximum tags... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  tags after filtering in control: 50783 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  Redundant rate of control: 0.00 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 finished! 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 Build Peak Model... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 looking for paired plus/minus strand peaks... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 Total number of paired peaks: 469 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 Model building with cross-correlation: Done 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 finished! 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 predicted fragment length is 228 bps 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 alternative fragment length(s) may be 228 bps 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2.2 Generate R script for model : /tmp/RtmproxEkI/run_callpeak_narrow0_model.r 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #3 Call peaks... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #3 Pre-compute pvalue-qvalue table... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #3 Cutoff vs peaks called will be analyzed! 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3 Analysis of cutoff vs num of peaks or total length has been saved in b'/tmp/RtmproxEkI/run_callpeak_narrow0_cutoff_analysis.txt' 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3 In the peak calling step, the following will be performed simultaneously: 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3   Write bedGraph files for treatment pileup (after scaling if necessary)... run_callpeak_narrow0_treat_pileup.bdg 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3   Write bedGraph files for control lambda (after scaling if necessary)... run_callpeak_narrow0_control_lambda.bdg 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3   Pileup will be based on sequencing depth in treatment. 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3 Call peaks for each chromosome... 
#> INFO  @ 17 Apr 2024 17:46:40: [642 MB] #4 Write output xls file... /tmp/RtmproxEkI/run_callpeak_narrow0_peaks.xls 
#> INFO  @ 17 Apr 2024 17:46:40: [642 MB] #4 Write peak in narrowPeak format file... /tmp/RtmproxEkI/run_callpeak_narrow0_peaks.narrowPeak 
#> INFO  @ 17 Apr 2024 17:46:40: [642 MB] #4 Write summits bed file... /tmp/RtmproxEkI/run_callpeak_narrow0_summits.bed 
#> INFO  @ 17 Apr 2024 17:46:40: [642 MB] Done!
cp2 <- callpeak(CHIP, CTRL, gsize = 5.2e7, store_bdg = TRUE,
                name = "run_callpeak_broad", outdir = tempdir(),
                broad = TRUE)
#> 

Here are the outputs.

cp1
#> macsList class
#> $outputs:
#>  /tmp/RtmproxEkI/run_callpeak_narrow0_control_lambda.bdg
#>  /tmp/RtmproxEkI/run_callpeak_narrow0_cutoff_analysis.txt
#>  /tmp/RtmproxEkI/run_callpeak_narrow0_model.r
#>  /tmp/RtmproxEkI/run_callpeak_narrow0_peaks.narrowPeak
#>  /tmp/RtmproxEkI/run_callpeak_narrow0_peaks.xls
#>  /tmp/RtmproxEkI/run_callpeak_narrow0_summits.bed
#>  /tmp/RtmproxEkI/run_callpeak_narrow0_treat_pileup.bdg 
#> $arguments: tfile, cfile, gsize, outdir, name, store_bdg, cutoff_analysis 
#> $log:
#>  INFO  @ 17 Apr 2024 17:46:39: [614 MB] 
#>  # Command line: 
#>  # ARGUMENTS LIST:
#>  # name = run_callpeak_narrow0
#>  # format = AUTO
#> ...
cp2
#> macsList class
#> $outputs:
#>  /tmp/RtmproxEkI/run_callpeak_broad_control_lambda.bdg
#>  /tmp/RtmproxEkI/run_callpeak_broad_model.r
#>  /tmp/RtmproxEkI/run_callpeak_broad_peaks.broadPeak
#>  /tmp/RtmproxEkI/run_callpeak_broad_peaks.gappedPeak
#>  /tmp/RtmproxEkI/run_callpeak_broad_peaks.xls
#>  /tmp/RtmproxEkI/run_callpeak_broad_treat_pileup.bdg 
#> $arguments: tfile, cfile, gsize, outdir, name, store_bdg, broad 
#> $log:
#> 
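
Comparing the two listings shows which file types each mode produces. The short sketch below does this programmatically in base R. The file paths are copied from the printed outputs above, and the helper strip_prefix() is a hypothetical name introduced only for this illustration.

```r
# Paths copied from the cp1 and cp2 listings above.
narrow_files <- c(
  "/tmp/RtmproxEkI/run_callpeak_narrow0_control_lambda.bdg",
  "/tmp/RtmproxEkI/run_callpeak_narrow0_cutoff_analysis.txt",
  "/tmp/RtmproxEkI/run_callpeak_narrow0_model.r",
  "/tmp/RtmproxEkI/run_callpeak_narrow0_peaks.narrowPeak",
  "/tmp/RtmproxEkI/run_callpeak_narrow0_peaks.xls",
  "/tmp/RtmproxEkI/run_callpeak_narrow0_summits.bed",
  "/tmp/RtmproxEkI/run_callpeak_narrow0_treat_pileup.bdg"
)
broad_files <- c(
  "/tmp/RtmproxEkI/run_callpeak_broad_control_lambda.bdg",
  "/tmp/RtmproxEkI/run_callpeak_broad_model.r",
  "/tmp/RtmproxEkI/run_callpeak_broad_peaks.broadPeak",
  "/tmp/RtmproxEkI/run_callpeak_broad_peaks.gappedPeak",
  "/tmp/RtmproxEkI/run_callpeak_broad_peaks.xls",
  "/tmp/RtmproxEkI/run_callpeak_broad_treat_pileup.bdg"
)

# Hypothetical helper: drop the directory and the "<name>_" prefix so that
# only the file-type suffix remains.
strip_prefix <- function(files, prefix) {
  sub(paste0("^", prefix, "_"), "", basename(files))
}

narrow_types <- strip_prefix(narrow_files, "run_callpeak_narrow0")
broad_types  <- strip_prefix(broad_files, "run_callpeak_broad")

# File types produced by only one of the two runs.
narrow_only <- setdiff(narrow_types, broad_types)
broad_only  <- setdiff(broad_types, narrow_types)
```

Note that cutoff_analysis.txt appears only in the narrow run because cutoff_analysis = TRUE was set only in that call, not because of the peak-calling mode itself.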

3.3 The macsList class

The macsList class is designed to hold everything about an execution, including the function, inputs, outputs and logs, for the purpose of reproducibility.

For example, we can inspect the function and its input arguments.

cp1$arguments
#> [[1]]
#> callpeak
#> 
#> $tfile
#> CHIP
#> 
#> $cfile
#> CTRL
#> 
#> $gsize
#> [1] 5.2e+07
#> 
#> $outdir
#> tempdir()
#> 
#> $name
#> [1] "run_callpeak_narrow0"
#> 
#> $store_bdg
#> [1] TRUE
#> 
#> $cutoff_analysis
#> [1] TRUE
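
The printed list has the shape of as.list(match.call()): the function name comes first, followed by the unevaluated arguments. Assuming that is how the call is recorded (an assumption based only on the printout above), such a list can be turned back into a call and evaluated again. The sketch below uses a hypothetical stand-in function, toy_run(), instead of callpeak so that it runs without MACS3.

```r
# Hypothetical stand-in for a wrapped MACS function: it records its own call
# in the same shape as the printed cp1$arguments.
toy_run <- function(x, scale = 1) {
  list(value = x * scale,
       arguments = as.list(match.call()))
}

input <- 21
res <- toy_run(input, scale = 2)

# The first element is the function name; the rest are the named,
# unevaluated arguments.
fun_name  <- as.character(res$arguments[[1]])
arg_names <- names(res$arguments)[-1]

# Rebuild the call from the stored list and evaluate it again.
rerun <- eval(as.call(res$arguments))
```

Because the arguments are stored unevaluated (CHIP, CTRL and tempdir() above), re-evaluating them only reproduces the run if those objects still refer to the same inputs.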

The paths of all output files are collected.

cp1$outputs
#> [1] "/tmp/RtmproxEkI/run_callpeak_narrow0_control_lambda.bdg" 
#> [2] "/tmp/RtmproxEkI/run_callpeak_narrow0_cutoff_analysis.txt"
#> [3] "/tmp/RtmproxEkI/run_callpeak_narrow0_model.r"            
#> [4] "/tmp/RtmproxEkI/run_callpeak_narrow0_peaks.narrowPeak"   
#> [5] "/tmp/RtmproxEkI/run_callpeak_narrow0_peaks.xls"          
#> [6] "/tmp/RtmproxEkI/run_callpeak_narrow0_summits.bed"        
#> [7] "/tmp/RtmproxEkI/run_callpeak_narrow0_treat_pileup.bdg"
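
The narrowPeak file is plain tab-separated text, so it can be read into a data frame with base R. The sketch below is a minimal illustration, not part of MACSr: read_narrowpeak() is a hypothetical helper, the column names follow the ENCODE narrowPeak (BED6+4) layout, and the two demo records are made up so that the code runs without MACS output.

```r
# Column names follow the ENCODE narrowPeak (BED6+4) layout.
narrowpeak_cols <- c("chrom", "start", "end", "name", "score", "strand",
                     "signalValue", "pValue", "qValue", "summit")

# Hypothetical helper: read a narrowPeak file into a data frame.
read_narrowpeak <- function(path) {
  read.table(path, sep = "\t", header = FALSE,
             col.names = narrowpeak_cols, stringsAsFactors = FALSE)
}

# Made-up two-record file used only to exercise the helper.
demo <- tempfile(fileext = ".narrowPeak")
writeLines(c("chr1\t100\t400\tdemo_peak_1\t250\t.\t8.5\t30.1\t25.2\t120",
             "chr1\t900\t1100\tdemo_peak_2\t40\t.\t3.2\t6.0\t4.1\t80"),
           demo)

peaks <- read_narrowpeak(demo)

# Peak width, and the absolute summit coordinate (the last column is the
# summit offset relative to the peak start).
peaks$width      <- peaks$end - peaks$start
peaks$summit_pos <- peaks$start + peaks$summit
```

With a real run, the path could be picked from the outputs, for example with grep("narrowPeak$", cp1$outputs, value = TRUE).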

The log is especially important to check when running MACS. Detailed information about each step is written to the log during the run.

cat(paste(cp1$log, collapse="\n"))
#> INFO  @ 17 Apr 2024 17:46:39: [614 MB] 
#> # Command line: 
#> # ARGUMENTS LIST:
#> # name = run_callpeak_narrow0
#> # format = AUTO
#> # ChIP-seq file = ['/home/biocbuild/.cache/R/ExperimentHub/1e5a96bd5911c_4601']
#> # control file = ['/home/biocbuild/.cache/R/ExperimentHub/1e5a963d85fa01_4606']
#> # effective genome size = 5.20e+07
#> # band width = 300
#> # model fold = [5.0, 50.0]
#> # qvalue cutoff = 5.00e-02
#> # The maximum gap between significant sites is assigned as the read length/tag size.
#> # The minimum length of peaks is assigned as the predicted fragment length "d".
#> # Larger dataset will be scaled towards smaller dataset.
#> # Range for calculating regional lambda is: 1000 bps and 10000 bps
#> # Broad region calling is off
#> # Additional cutoff on fold-enrichment is: 0.10
#> # Paired-End mode is off
#>  
#> INFO  @ 17 Apr 2024 17:46:39: [614 MB] #1 read tag files... 
#> INFO  @ 17 Apr 2024 17:46:39: [614 MB] #1 read treatment tags... 
#> INFO  @ 17 Apr 2024 17:46:39: [618 MB] Detected format is: BED 
#> INFO  @ 17 Apr 2024 17:46:39: [618 MB] * Input file is gzipped. 
#> INFO  @ 17 Apr 2024 17:46:39: [621 MB] #1.2 read input tags... 
#> INFO  @ 17 Apr 2024 17:46:39: [621 MB] Detected format is: BED 
#> INFO  @ 17 Apr 2024 17:46:39: [621 MB] * Input file is gzipped. 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 tag size is determined as 101 bps 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 tag size = 101.0 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  total tags in treatment: 49622 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 user defined the maximum tags... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  tags after filtering in treatment: 48047 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  Redundant rate of treatment: 0.03 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  total tags in control: 50837 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 user defined the maximum tags... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  tags after filtering in control: 50783 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  Redundant rate of control: 0.00 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1 finished! 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 Build Peak Model... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 looking for paired plus/minus strand peaks... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 Total number of paired peaks: 469 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 Model building with cross-correlation: Done 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 finished! 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 predicted fragment length is 228 bps 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 alternative fragment length(s) may be 228 bps 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2.2 Generate R script for model : /tmp/RtmproxEkI/run_callpeak_narrow0_model.r 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #3 Call peaks... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #3 Pre-compute pvalue-qvalue table... 
#> INFO  @ 17 Apr 2024 17:46:39: [622 MB] #3 Cutoff vs peaks called will be analyzed! 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3 Analysis of cutoff vs num of peaks or total length has been saved in b'/tmp/RtmproxEkI/run_callpeak_narrow0_cutoff_analysis.txt' 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3 In the peak calling step, the following will be performed simultaneously: 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3   Write bedGraph files for treatment pileup (after scaling if necessary)... run_callpeak_narrow0_treat_pileup.bdg 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3   Write bedGraph files for control lambda (after scaling if necessary)... run_callpeak_narrow0_control_lambda.bdg 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3   Pileup will be based on sequencing depth in treatment. 
#> INFO  @ 17 Apr 2024 17:46:40: [640 MB] #3 Call peaks for each chromosome... 
#> INFO  @ 17 Apr 2024 17:46:40: [642 MB] #4 Write output xls file... /tmp/RtmproxEkI/run_callpeak_narrow0_peaks.xls 
#> INFO  @ 17 Apr 2024 17:46:40: [642 MB] #4 Write peak in narrowPeak format file... /tmp/RtmproxEkI/run_callpeak_narrow0_peaks.narrowPeak 
#> INFO  @ 17 Apr 2024 17:46:40: [642 MB] #4 Write summits bed file... /tmp/RtmproxEkI/run_callpeak_narrow0_summits.bed 
#> INFO  @ 17 Apr 2024 17:46:40: [642 MB] Done!
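
Key figures can be pulled out of the log with regular expressions. The sketch below is a minimal illustration in base R: extract_number() is a hypothetical helper, and the three log lines are copied from the output above so that the code runs on its own. It assumes the log is a character vector in which each message is a single element.

```r
# A few lines copied from the log above; with a real run, use cp1$log.
log_lines <- c(
  "INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  tags after filtering in treatment: 48047 ",
  "INFO  @ 17 Apr 2024 17:46:39: [622 MB] #1  tags after filtering in control: 50783 ",
  "INFO  @ 17 Apr 2024 17:46:39: [622 MB] #2 predicted fragment length is 228 bps "
)

# Hypothetical helper: return the number captured by the single capture
# group in `pattern`, taken from the first matching line, or NA if none match.
extract_number <- function(lines, pattern) {
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- Filter(function(m) length(m) == 2, hits)
  if (length(hits) == 0) return(NA_real_)
  as.numeric(hits[[1]][2])
}

fragment_length <- extract_number(log_lines,
                                  "predicted fragment length is ([0-9]+) bps")
treatment_tags  <- extract_number(log_lines,
                                  "tags after filtering in treatment: ([0-9]+)")
control_tags    <- extract_number(log_lines,
                                  "tags after filtering in control: ([0-9]+)")
```

Returning NA for a missing message, rather than raising an error, makes it easy to spot runs where a step such as model building did not report a value.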

4 Resources

More details about MACS3 can be found at https://macs3-project.github.io/MACS/.

5 SessionInfo

sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] MACSdata_1.11.0  MACSr_1.11.2     BiocStyle_2.31.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.43.0         dir.expiry_1.11.0       xfun_0.43              
#>  [4] bslib_0.7.0             Biobase_2.63.1          lattice_0.22-6         
#>  [7] vctrs_0.6.5             tools_4.4.0             generics_0.1.3         
#> [10] stats4_4.4.0            curl_5.2.1              parallel_4.4.0         
#> [13] tibble_3.2.1            fansi_1.0.6             AnnotationDbi_1.65.2   
#> [16] RSQLite_2.3.6           blob_1.2.4              pkgconfig_2.0.3        
#> [19] Matrix_1.7-0            dbplyr_2.5.0            S4Vectors_0.41.6       
#> [22] lifecycle_1.0.4         GenomeInfoDbData_1.2.12 compiler_4.4.0         
#> [25] Biostrings_2.71.5       GenomeInfoDb_1.39.14    htmltools_0.5.8.1      
#> [28] sass_0.4.9              yaml_2.3.8              pillar_1.9.0           
#> [31] crayon_1.5.2            jquerylib_0.1.4         cachem_1.0.8           
#> [34] mime_0.12               ExperimentHub_2.11.3    AnnotationHub_3.11.4   
#> [37] basilisk_1.15.5         tidyselect_1.2.1        digest_0.6.35          
#> [40] purrr_1.0.2             dplyr_1.1.4             bookdown_0.39          
#> [43] BiocVersion_3.19.1      fastmap_1.1.1           grid_4.4.0             
#> [46] cli_3.6.2               magrittr_2.0.3          utf8_1.2.4             
#> [49] withr_3.0.0             filelock_1.0.3          UCSC.utils_0.99.7      
#> [52] rappdirs_0.3.3          bit64_4.0.5             rmarkdown_2.26         
#> [55] XVector_0.43.1          httr_1.4.7              bit_4.0.5              
#> [58] reticulate_1.36.0       png_0.1-8               memoise_2.0.1          
#> [61] evaluate_0.23           knitr_1.46              IRanges_2.37.1         
#> [64] basilisk.utils_1.15.2   BiocFileCache_2.11.2    rlang_1.1.3            
#> [67] Rcpp_1.0.12             glue_1.7.0              DBI_1.2.2              
#> [70] BiocManager_1.30.22     BiocGenerics_0.49.1     jsonlite_1.8.8         
#> [73] R6_2.5.1                zlibbioc_1.49.3