Contents

1 Introduction

Hello stranger! If you are here, that means you’ve successfully completed the UMI-4C protocol and got some sequencing results! The objective of this vignette is to guide you through a simple analysis of your brand-new UMI-4C contact data. Let’s dive in!

library(UMI4Cats)

1.1 Overview of the package

Overview of the different functions included in the UMI4Cats package to analyze UMI-4C data.

Figure 1: Overview of the different functions included in the UMI4Cats package to analyze UMI-4C data

1.2 About the experimental design

One of the strengths of the UMI-4C assay (Schwartzman et al. 2016) is that of reducing the PCR duplication bias, allowing a more accurate quantification of chromatin interactions. For this reason, UMI-4C is mostly used when trying to compare changes in chromatin interactions between two conditions, cell types or developmental stages.

Taking into account this main application, UMI4Cats has been developed to facilitate the differential analysis between conditions at a given viewpoint of interest. When analyzing your data with this package, you should take into account the following points:

  • Each analysis (and UMI4C object) should correspond to the same viewpoint. If you are analyzing different viewpoints in the same or different loci, you need to analyze them separately.

  • The UMI4Cats package is mostly oriented to the performance of differential analysis. If you don’t have replicates yet or want to focus your analysis on a specific set of regions (like enhancers) we recommend you to use Fisher’s Exact Test (fisherUMI4C()). If, on the other hand, you have several replicates, you might benefit from using a DESeq2 differential test specific for UMI-4C data (differentialNbinomWaldTestUMI4C()) or you can continue to use Fisher’s exact test.

  • When performing the differential analysis, UMI4Cats is only able to deal with a condition with 2 different levels. If you have more than two conditions, you should produce different UMI4C objects and perform pairwise comparisons.

1.3 About the example datasets

The datasets used in this vignette (obtained from Ramos-Rodríguez et al. (2019)) are available for download if you want to reproduce the contents of this vignette through the downloadUMI4CexampleData().

Briefly, the datasets correspond to human pancreatic islets exposed (cyt) or not (ctrl) to pro-inflammatory cytokines for 48 hours. In this example we are using the UMI-4C data generated from two different biological replicates (HI24 and HI32) using the promoter of the CIITA gene as viewpoint.

Sample datasets can be downloaded using the downloadUMI4CexampleData() function. When used without arguments, will download the full sample fastq files containing 200K reads. However, in order to reduce computing time, the Processing UMI-4C FASTQ files section in this vignette uses a reduced sample file containing 100 reads, which can be downloaded using downloadUMI4CexampleData(reduced = TRUE). The following sections addressing analysis and visualization of such data use the processed files from the 200K fastq files, which are also included inside the package and can be accessed using the system.file() function.

2 Quick start

In this section we summarize a complete analysis using the examples provided in this package:

## 0) Download example data -------------------------------
path <- downloadUMI4CexampleData()

## 1) Generate Digested genome ----------------------------
# The selected RE in this case is DpnII (|GATC), so the cut_pos is 0, and the res_enz "GATC".
hg19_dpnii <- digestGenome(
    cut_pos = 0,
    res_enz = "GATC",
    name_RE = "DpnII",
    ref_gen = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
    out_path = file.path(tempdir(), "digested_genome/")
)

## 2) Process UMI-4C fastq files --------------------------
raw_dir <- file.path(path, "CIITA", "fastq")

contactsUMI4C(
    fastq_dir = raw_dir,
    wk_dir = file.path(path, "CIITA"),
    bait_seq = "GGACAAGCTCCCTGCAACTCA",
    bait_pad = "GGACTTGCA",
    res_enz = "GATC",
    cut_pos = 0,
    digested_genome = hg19_dpnii,
    bowtie_index = file.path(path, "ref_genome", "ucsc.hg19.chr16"),
    ref_gen = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
    threads = 5
)

## 3) Get filtering and alignment stats -------------------
statsUMI4C(wk_dir = file.path(path, "CIITA"))

## 4) Analyze the results ---------------------------------
# Load sample processed file paths
files <- list.files(file.path(path, "CIITA", "count"),
                    pattern = "*_counts.tsv.gz",
                    full.names = TRUE
)

# Create colData including all relevant information
colData <- data.frame(
    sampleID = gsub("_counts.tsv.gz", "", basename(files)),
    file = files,
    stringsAsFactors = FALSE
)

library(tidyr)
colData <- colData %>%
    separate(sampleID,
        into = c("condition", "replicate", "viewpoint"),
        remove = FALSE
    )

# Make UMI-4C object including grouping by condition
umi <- makeUMI4C(
    colData = colData,
    viewpoint_name = "CIITA",
    grouping = "condition",
    bait_expansion = 2e6
)

# Plot replicates
plotUMI4C(umi, grouping=NULL)

## 5) Differential testing ----------------------
# Fisher test
umi_fisher <- fisherUMI4C(umi, filter_low = 30, grouping="condition")
plotUMI4C(umi_fisher, ylim = c(0, 10), grouping="condition")

# DESeq2 Wald Test
umi_wald <- differentialNbinomWaldTestUMI4C(umi4c=umi,
                                            design=~condition,
                                            alpha = 20)
plotUMI4C(umi_wald, ylim = c(0, 10), grouping="condition")

3 Preparing necessary files

3.1 Demultiplexing FastQ files containing multiple baits

One of the many advantages of using the UMI-4C protocol is that it allows multiplexing of different baits using the same sample.

To facilitate the analysis, UMI4Cats provides a function for demultiplexing the paired-end FastQ files returned by the sequencing facility: demultiplexFastq.

This function requires the following inputs:

  • Name of the R1 file as input – it will automatically detect the R2.
  • Barcode sequences.
  • Path and name for the output files.

The barcode sequences and names to be used for each output file need to be provided as a data.frame with column names sample and barcode.

## Input files
path <- downloadUMI4CexampleData(reduced=TRUE)
fastq <- file.path(path, "CIITA", "fastq", "ctrl_hi24_CIITA_R1.fastq.gz")

## Barcode info
barcodes <- data.frame(
    sample = c("CIITA"),
    barcode = c("GGACAAGCTCCCTGCAACTCA")
)

## Output path
out_path <- tempdir()

## Demultiplex baits inside FastQ file
demultiplexFastq(
    fastq = fastq,
    barcodes = barcodes,
    out_path = out_path
)

3.2 Reference genome digestion

For the processing of the UMI-4C FastQ files it is necessary to construct a digested genome using the same restriction enzyme that was used in the UMI-4C experiments.

The UMI4Cats package includes the digestGenome() function to make this process easier. The function uses a BSgenome object1 More information on BSgenome package and objects can be found here as the reference genome, which is digested in silico at a given restriction enzyme cutting sequence (res_enz). Besides the restriction sequence, it is also necessary to provide, as a zero-based numeric integer, the position at which the restriction enzyme cuts (cut_pos).

In the following table you can see three examples of the different cutting sequences for DpnII, Csp6I and HindIII.

Restriction enzyme Restriction seq res_enz cut_pos
DpnII :GATC GATC 0
Csp6I G:TAC GTAC 1
HindIII A:AGCTT AAGCTT 1

For this example, we are using the hg19 BSGenome object and we are going to digest it using the DpnII enzyme.

library(BSgenome.Hsapiens.UCSC.hg19)
refgen <- BSgenome.Hsapiens.UCSC.hg19

hg19_dpnii <- digestGenome(
    res_enz = "GATC",
    cut_pos = 0,
    name_RE = "dpnII",
    ref_gen = refgen,
    sel_chr = "chr16", # Select bait's chr (chr16) to make example faster
    out_path = file.path(tempdir(), "digested_genome/")
)

hg19_dpnii
#> [1] "/tmp/RtmpVRGrYC/digested_genome//BSgenome.Hsapiens.UCSC.hg19_dpnII"

The digested genome will be saved in the out_path directory as RData objects, one for each chromosome. The path of the digested genome files is outputed by the function, so you can save it as a variable (in this case hg19_dpnii) and use it for downstream analyses.

4 Processing UMI-4C FASTQ files

Before doing any analysis, we need to convert paired-end reads stored in the FastQ files to UMI counts in the fragments returned from the in silico digestion of the genome, which represent contact frequencies of the viewpoint with that specific fragment. This processing is implemented in the function contactsUMI4C(), which should be ran in samples generated with the same experimental design (same viewpoint and restriction enzyme).

The function will consider all FastQ files in the same folder fastq_dir to be part of the same experiment (viewpoint + restriction enzyme). However, if you want to specify a subset of samples to perform the analysis you can do so by using the file_pattern argument. This way, only the files matching the specified pattern will be used as input.

The R1 and R2 files for the each sample need to contain _R1 or _R2 and one of the following FastQ suffixes: .fastq, .fq, .fq.gz or .fastq.gz.

For each experiment, the user needs to define 3 different sequences: