1 Processing Hi-C sequencing libraries with HiCool

The HiCool R/Bioconductor package provides an end-to-end interface to process and normalize Hi-C paired-end fastq reads into .(m)cool files.

  1. The heavy lifting (fastq mapping, pairs parsing and pairs filtering) is performed by the underlying lightweight hicstuff Python library (https://github.com/koszullab/hicstuff).
  2. Pairs filtering is done using the approach described in Cournac et al., 2012 and implemented in hicstuff.
  3. The cooler library (https://github.com/open2c/cooler) is used to parse pairs into a multi-resolution, balanced .mcool file. .(m)cool is a compact, indexed HDF5 file format specifically tailored for efficiently storing Hi-C data. The .(m)cool file format was developed by Abdennur and Mirny and published in 2019.
  4. Internally, all these external dependencies are automatically installed and managed in R by a basilisk environment.

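The main processing function offered in this package is HiCool(). To process .fastq reads into .pairs and .mcool files, one needs to provide the two paired-end fastq files, the restriction enzyme(s), the resolutions of interest and a genome reference: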

x <- HiCool(
    r1 = '<PATH-TO-R1.fq.gz>', 
    r2 = '<PATH-TO-R2.fq.gz>', 
    restriction = '<RE1(,RE2)>', 
    resolutions = "<resolutions of interest>", 
    genome = '<GENOME_ID>'
)

Here is a concrete example of Hi-C data processing.

library(HiCool)
hcf <- HiCool(
    r1 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1'), 
    r2 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2'), 
    restriction = 'DpnII,HinfI', 
    resolutions = c(4000, 8000, 16000), 
    genome = 'R64-1-1', 
    output = './HiCool/'
)
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpKkEikH/VKIZ6F]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Removing unwanted chromosomes...
#> HiCool :: Parsing pairs into .cool file...
#> HiCool :: Generating multi-resolution .mcool file...
#> HiCool :: Balancing .mcool file...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /private/tmp/RtmpGo2NLn/Rbuild128f43e5579df/HiCool/vignettes/HiCool/216b2c062312_7833^mapped-R64-1-1^VKIZ6F.html
#> HiCool :: All processing successfully achieved. Congrats!
hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/216b2c062312_7833^mapped-R64-1-1^VKIZ6F.mcool 
#> resolution: 4000 
#> pairs file: ./HiCool//pairs/216b2c062312_7833^mapped-R64-1-1^VKIZ6F.pairs 
#> metadata(3): log args stats
S4Vectors::metadata(hcf)
#> $log
#> [1] "./HiCool//logs/216b2c062312_7833^mapped-R64-1-1^VKIZ6F.log"
#> 
#> $args
#> $args$r1
#> [1] "/Users/biocbuild/Library/Caches/org.R-project.R/R/ExperimentHub/216b2c062312_7833"
#> 
#> $args$r2
#> [1] "/Users/biocbuild/Library/Caches/org.R-project.R/R/ExperimentHub/216b46e88952_7834"
#> 
#> $args$genome
#> [1] "/private/tmp/RtmpKkEikH/R64-1-1"
#> 
#> $args$resolutions
#> [1] "4000"
#> 
#> $args$resolutions
#> [1] "8000"
#> 
#> $args$resolutions
#> [1] "16000"
#> 
#> $args$restriction
#> [1] "DpnII,HinfI"
#> 
#> $args$iterative
#> [1] TRUE
#> 
#> $args$balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#> 
#> $args$threads
#> [1] 1
#> 
#> $args$output
#> [1] "./HiCool/"
#> 
#> $args$exclude_chr
#> [1] "Mito|chrM|MT"
#> 
#> $args$keep_bam
#> [1] FALSE
#> 
#> $args$scratch
#> [1] "/tmp/RtmpKkEikH"
#> 
#> $args$wd
#> [1] "/private/tmp/RtmpGo2NLn/Rbuild128f43e5579df/HiCool/vignettes"
#> 
#> 
#> $stats
#> $stats$nFragments
#> [1] 1e+05
#> 
#> $stats$nPairs
#> [1] 73993
#> 
#> $stats$nDangling
#> [1] 10027
#> 
#> $stats$nSelf
#> [1] 2205
#> 
#> $stats$nDumped
#> [1] 83
#> 
#> $stats$nFiltered
#> [1] 61678
#> 
#> $stats$nDups
#> [1] 719
#> 
#> $stats$nUnique
#> [1] 60959
#> 
#> $stats$threshold_uncut
#> [1] 7
#> 
#> $stats$threshold_self
#> [1] 7
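
The pair counts reported in $stats can be cross-checked with a little arithmetic. The sketch below copies the values printed above into a plain list (so it runs without HiCool) and assumes, based only on how these particular numbers add up, that nFiltered is what remains of nPairs after removing dangling, self and dumped pairs, and that nUnique is nFiltered minus duplicates. This is an observation on this run, not a documented definition of the fields.

```r
# Values copied from S4Vectors::metadata(hcf)$stats shown above
stats <- list(
    nPairs = 73993, nDangling = 10027, nSelf = 2205, nDumped = 83,
    nFiltered = 61678, nDups = 719, nUnique = 60959
)

# Assumption: filtering removes dangling, self and dumped pairs
after_filter <- with(stats, nPairs - nDangling - nSelf - nDumped)

# Assumption: deduplication removes nDups pairs from the filtered set
after_dedup <- stats$nFiltered - stats$nDups

# Percentage of the initial pairs that end up as unique, retained pairs
pct_unique <- round(100 * stats$nUnique / stats$nPairs, 1)

c(after_filter = after_filter, after_dedup = after_dedup, pct_unique = pct_unique)
```

With a real run, the same fields can be read directly from S4Vectors::metadata(hcf)$stats instead of being typed by hand.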

2 Optional parameters

Extra optional arguments can be passed to the hicstuff workhorse library:
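
As a minimal sketch, the optional arguments can be collected in a list and combined with the required ones. The argument names and default values below are those reported in metadata(hcf)$args in the example above. The comments describing each argument are inferred from the names and from the processing log, and the overridden values (4 threads, keeping the BAM files) are hypothetical choices for illustration.

```r
# Optional arguments, set to the values reported in metadata(hcf)$args
optional_args <- list(
    iterative      = TRUE,                          # presumably: iterative read alignment
    balancing_args = " --min-nnz 10 --mad-max 5 ",  # flags forwarded for matrix balancing
    threads        = 1,                             # number of CPU threads
    exclude_chr    = "Mito|chrM|MT",                # pattern of chromosomes to remove
    keep_bam       = FALSE,                         # whether to keep alignment files
    scratch        = tempdir()                      # folder for temporary files
)

# Hypothetical overrides for a larger run
optional_args$threads  <- 4
optional_args$keep_bam <- TRUE

# Required arguments (placeholders, as in the template above)
required_args <- list(
    r1 = '<PATH-TO-R1.fq.gz>',
    r2 = '<PATH-TO-R2.fq.gz>',
    restriction = '<RE1(,RE2)>',
    resolutions = "<resolutions of interest>",
    genome = '<GENOME_ID>'
)

all_args <- c(required_args, optional_args)

# Not run here: requires HiCool and real fastq files
# hcf <- do.call(HiCool::HiCool, all_args)
```

See ?HiCool for the authoritative description of each argument.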

3 Output files

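The important files generated by HiCool are written to the output directory. In the example above, these are the .mcool file (in matrices/), the .pairs file (in pairs/), the processing log (in logs/) and an HTML report.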
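
The layout seen in the example output can be sketched with a small helper that builds the expected file paths from an output directory and a file prefix. hicool_outputs() is a hypothetical helper written for this illustration, not a HiCool function, and the sub-folder names are taken from the paths printed in the example above.

```r
# Hypothetical helper: expected output paths for a given run prefix,
# following the folder layout printed in the example above
hicool_outputs <- function(output, prefix) {
    list(
        mcool  = file.path(output, "matrices", paste0(prefix, ".mcool")),
        pairs  = file.path(output, "pairs", paste0(prefix, ".pairs")),
        log    = file.path(output, "logs", paste0(prefix, ".log")),
        report = file.path(output, paste0(prefix, ".html"))
    )
}

files <- hicool_outputs("./HiCool", "216b2c062312_7833^mapped-R64-1-1^VKIZ6F")

# Check which of the expected files are present on disk
vapply(files, file.exists, logical(1))
```

The prefix changes from run to run, so in practice it is safer to take the paths from the returned CoolFile object and its metadata than to rebuild them by hand.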

The diagnostic plots illustrate how pairs were filtered during processing, using the strategy described in Cournac et al., BMC Genomics 2012. The event_distance chart represents the frequency of ++, +-, -+ and -- pairs in the library as a function of the number of restriction sites between the two ends of each pair, and shows the inferred filtering threshold. The event_distribution chart indicates the proportion of each type of pair (e.g. dangling, uncut, abnormal, …) and the total number of pairs retained (3D intra + 3D inter).
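
To make the quantity shown in the event_distance chart more concrete, the sketch below tabulates the frequency of each strand orientation as a function of the number of restriction sites separating the two ends of a pair. The pairs table is made up for illustration and the column names (strand1, strand2, nsites) are assumptions. This is not the hicstuff implementation and it does not infer a threshold; it only shows the kind of table such a chart is drawn from.

```r
# Toy pairs table (made-up values, for illustration only)
pairs <- data.frame(
    strand1 = c("+", "+", "-", "-", "+", "-", "+", "-"),
    strand2 = c("-", "+", "+", "-", "-", "+", "+", "-"),
    nsites  = c(0, 1, 1, 2, 0, 0, 2, 1)
)

# Orientation of each pair: one of "++", "+-", "-+", "--"
pairs$orientation <- paste0(pairs$strand1, pairs$strand2)

# Counts of each orientation for each number of restriction sites
counts <- table(nsites = pairs$nsites, orientation = pairs$orientation)

# Row-wise proportions: frequency of each orientation at a given distance
freqs <- prop.table(counts, margin = 1)
freqs
```

In the approach of Cournac et al., the four orientations are expected to become roughly equally frequent once the two ends are far enough apart, which is what the filtering threshold is meant to capture.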

Notes:

4 System dependencies

Processing Hi-C sequencing libraries into .pairs and .mcool files requires several dependencies, to (1) align reads to a reference genome, (2) manage alignment files (SAM), (3) filter pairs, (4) bin them to a specific resolution and (5)

All system dependencies are internally managed by basilisk. HiCool maintains a basilisk environment containing:

The first time HiCool() is executed, a fresh basilisk environment will be created and required dependencies automatically installed. This ensures compatibility between the different system dependencies needed to process Hi-C fastq files.

5 Session info

sessionInfo()
#> R version 4.5.0 Patched (2025-04-21 r88169)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Monterey 12.7.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] HiContactsData_1.11.0 ExperimentHub_2.99.0  AnnotationHub_3.99.0 
#>  [4] BiocFileCache_2.99.0  dbplyr_2.5.0          BiocGenerics_0.55.0  
#>  [7] generics_0.1.3        HiCool_1.9.0          HiCExperiment_1.9.0  
#> [10] BiocStyle_2.37.0     
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.3                   httr2_1.1.2                
#>  [3] rlang_1.1.6                 magrittr_2.0.3             
#>  [5] matrixStats_1.5.0           compiler_4.5.0             
#>  [7] RSQLite_2.3.9               dir.expiry_1.17.0          
#>  [9] png_0.1-8                   vctrs_0.6.5                
#> [11] stringr_1.5.1               pkgconfig_2.0.3            
#> [13] crayon_1.5.3                fastmap_1.2.0              
#> [15] XVector_0.49.0              rmdformats_1.0.4           
#> [17] rmarkdown_2.29              sessioninfo_1.2.3          
#> [19] tzdb_0.5.0                  UCSC.utils_1.5.0           
#> [21] strawr_0.0.92               purrr_1.0.4                
#> [23] bit_4.6.0                   xfun_0.52                  
#> [25] cachem_1.1.0                GenomeInfoDb_1.45.0        
#> [27] jsonlite_2.0.0              blob_1.2.4                 
#> [29] rhdf5filters_1.21.0         DelayedArray_0.35.1        
#> [31] Rhdf5lib_1.31.0             BiocParallel_1.43.0        
#> [33] parallel_4.5.0              R6_2.6.1                   
#> [35] bslib_0.9.0                 stringi_1.8.7              
#> [37] reticulate_1.42.0           GenomicRanges_1.61.0       
#> [39] jquerylib_0.1.4             Rcpp_1.0.14                
#> [41] bookdown_0.43               SummarizedExperiment_1.39.0
#> [43] knitr_1.50                  IRanges_2.43.0             
#> [45] Matrix_1.7-3                tidyselect_1.2.1           
#> [47] abind_1.4-8                 yaml_2.3.10                
#> [49] codetools_0.2-20            curl_6.2.2                 
#> [51] lattice_0.22-7              tibble_3.2.1               
#> [53] withr_3.0.2                 InteractionSet_1.37.0      
#> [55] Biobase_2.69.0              basilisk.utils_1.21.0      
#> [57] KEGGREST_1.49.0             evaluate_1.0.3             
#> [59] Biostrings_2.77.0           pillar_1.10.2              
#> [61] BiocManager_1.30.25         filelock_1.0.3             
#> [63] MatrixGenerics_1.21.0       stats4_4.5.0               
#> [65] plotly_4.10.4               vroom_1.6.5                
#> [67] BiocVersion_3.22.0          S4Vectors_0.47.0           
#> [69] ggplot2_3.5.2               munsell_0.5.1              
#> [71] scales_1.3.0                glue_1.8.0                 
#> [73] lazyeval_0.2.2              tools_4.5.0                
#> [75] BiocIO_1.19.0               data.table_1.17.0          
#> [77] rhdf5_2.53.0                grid_4.5.0                 
#> [79] tidyr_1.3.1                 crosstalk_1.2.1            
#> [81] AnnotationDbi_1.71.0        colorspace_2.1-1           
#> [83] GenomeInfoDbData_1.2.14     basilisk_1.21.0            
#> [85] cli_3.6.5                   rappdirs_0.3.3             
#> [87] S4Arrays_1.9.0              viridisLite_0.4.2          
#> [89] dplyr_1.1.4                 gtable_0.3.6               
#> [91] sass_0.4.10                 digest_0.6.37              
#> [93] SparseArray_1.9.0           htmlwidgets_1.6.4          
#> [95] memoise_2.0.1               htmltools_0.5.8.1          
#> [97] lifecycle_1.0.4             httr_1.4.7                 
#> [99] bit64_4.6.0-1