This vignette is organized hierarchically in terms of level of detail:
In the Quickstart section, we show a basic analysis with chromswitch using a small dataset included in the package.
In the next section, we give a brief overview of the method for detecting chromatin state switches, referencing specific functions in chromswitch which implement the different steps of the method.
In the Walkthrough, we demonstrate a basic analysis chromswitch with a discussion of data import, input, the most important parameters available to the user, and interpretation of chromswitch output. In this section we use the wrapper functions which provide one-line commands to call chromatin state switches based on a single mark or type of input data.
In last section, Step-by-step, we demonstrate how the steps of the method can be run individually to gain finer control over the analysis and show the intermediate results from chromswitch available to the user.
We will use the package rtracklayer to import data from BED files:
We’ll start with a toy dataset containing MACS2 narrow peak calls for H3K4me3 ChIP-seq in 3 brain tissues and 3 other adult tissues from the Roadmap Epigenomics Project, restricted to a short region on chromosome 19. In the code below, we import the input to chromswitch and run chromswitch on our dataset. This involves constructing a feature matrix, and we will describe two ways of doing so. We can then call chromatin state switches by thresholding on the value of the Consensus score, which scores the similarity between the cluster assignments and the biological condition labels.
Chromswitch essentially requires 3 inputs:
GRangesobject storing one or more regions of interest
GRangesobjects, each of which stores peaks or features for one sample, with elements named according to the sample IDs as specified in the metadata
The latter two inputs define the dataset on which the query is performed.
Each of these inputs can be imported from TSV or BED files. Learn more about
GRanges objects by checking out GenomicRanges.
Here we use the
import() function from rtracklayer to import
query regions stored in a BED file.
# Path to BED file in chromswitch installation query_path <- system.file("extdata/query.bed", package = "chromswitch") # Read in BED file, creating a GRanges object query <- rtracklayer::import(con = query_path, format = "BED") query
## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ##  chr19 54924105-54929104 * ##  chr19 54874319-54877536 * ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Path to TSV in chromswitch meta_path <- system.file("extdata/metadata.tsv", package = "chromswitch") # Read in the table from a 2-column TSV file metadata <- read.delim(meta_path, sep = "\t", header = TRUE) metadata
Here we assume that peaks are in the ENCODE narrowPeak format.
Note: If the metadata file has an additional column containing the path for each sample, then that column can be passed to this function,
readNarrowPeak(paths = metadata$path, metadata = metadata).
# Paths to the BED files containing peak calls for each sample peak_paths <- system.file("extdata", paste0(metadata$Sample, ".H3K4me3.bed"), package = "chromswitch") # Import BED files containing MACS2 narrow peak calls using rtracklayer peaks <- readNarrowPeak(paths = peak_paths, # Paths to files, metadata = metadata) # Metadata dataframe
Run chromswitch using the summary strategy:
callSummary(query = query, # Input 1: Query regions metadata = metadata, # Input 2: Metadata dataframe peaks = peaks, # Input 3: Peaks mark = "H3K4me3") # Arbitrary string describing the data type
## Warning: `select_()` was deprecated in dplyr 0.7.0. ## Please use `select()` instead. ## This warning is displayed once every 8 hours. ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated. ## Warning: `funs()` was deprecated in dplyr 0.8.0. ## Please use a list of either functions or lambdas: ## ## # Simple named list: ## list(mean = mean, median = median) ## ## # Auto named with `tibble::lst()`: ## tibble::lst(mean, median) ## ## # Using lambdas ## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE)) ## This warning is displayed once every 8 hours. ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Chromswitch outputs a measure of cluster quality (Average_Silhouette), the score predicting a chromatin state switch (Consensus), and the cluster assignment for each sample.
Looking at the Consensus score in each case, which represents the similarity between the cluster assignments and the biological groups of the samples, here we see a good agreement in the first region, indicating a switch, and poor agreement in the second, indicating the absence of a switch. This score takes on values between -1 and 1, where 1 represents a perfect agreement between cluster assignments and the biological conditions of the sample
Run chromswitch using the binary strategy:
callBinary(query = query, # Input 1: Query regions metadata = metadata, # Input 2: Metadata dataframe peaks = peaks) # Input 3: Peaks
The output has the same format for both strategies. In this case, both strategies predict a switch for the first region, and a non-switch for the second region.
Both of these wrapper functions have additional parameters which allow for greater sensitivity and finer control in the analysis. These are explored in the rest of the vignette.
Our method for detecting chromatin state switches involves three steps. These are illustrated in the figure below, which is a schematic for the analysis performed on one query region, based on the reference metadata and peaks or features.
As input, chromswitch requires epigenetic features represented by their genomic coordinates and optionally, some associated statistics. Possible examples include ChIP-seq or DNase-seq peaks, or previously-learned chromatin state segmentations such as from ChromHMM. Here we’ll refer to peaks for simplicity, but the analysis is the same for other types of epigenetic features given as intervals.
In the pre-processing phase, the user can set thresholds on any statistics
associated with peaks and filter out peaks below these thresholds
filterPeaks()). These statistics can then be normalized genome-wide for each
normalizePeaks()), which is strongly recommended. More detailed discussion of the normalization process
can be found in the documentation of that function.
Both these steps are optional. We then retrieve all the peaks in each sample
which overlap the query region (
Next, chromswitch transforms the peaks in the query region into a
sample-by-feature matrix using one of two strategies. In the summary strategy
summarizePeaks()), we compute a set of summary statistics from the peaks
in each sample in the query region. These can include the mean, median, and max
of the statistics associated with the peaks in the input, as well as the
fraction of the region overlapped by peak and the number of peaks. Genome-wide normalization of the data is therefore extremely
important if choosing this strategy.
In the binary strategy (
construct a binary matrix where each feature corresponds to a unique peak in the
region, and the matrix holds the binary presence or absence calls of each
unique peak in each sample. We obtain the unique peaks by collapsing the union
of all peaks in the region observed in all samples using a parameter
which specifies how much reciprocal overlap is required between two peaks to
call them the same. Since regions corresponding to the same biological event
can occasionally result in separate peaks during the process of interpretation
of raw signal, peak calling, etc., we also introduce an option to combine
peaks which are separated by less than
gap base pairs (
Finally, chromswitch clusters samples hierarchically
and then scores the similarity between the inferred cluster assignments and the
known biological condition labels of the samples (