1 Introduction

methylscaper is an R package for visualizing data that jointly profile endogenous methylation and chromatin accessibility (MAPit, NOMe-seq, scNMT-seq, nanoNOMe, etc.). The package offers pre-processing for single-molecule data and accepts input from Bismark (or similar alignment programs) for single-cell data. Both data types are visualized through a common interface that generates ordered representational methylation-state matrices. The package also provides a Shiny app that supports interactive and optimal ordering of individual DNA molecules to reveal methylation patterns and nucleosome positioning.

Note: If you use methylscaper in your research, please cite our manuscript.

If you have questions after reading this vignette, please submit them on GitHub (Question or Report Issue). This will notify the package maintainers and benefit other users.

2 Getting Started

2.1 Installation

For local use, methylscaper can be installed into R from Bioconductor (requires R version >= 4.4.0):

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("methylscaper")
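Before installing, it can help to confirm that the running R session meets the version requirement stated above. The following is a minimal sketch in base R; the variable names `min_r_version` and `r_ok` are illustrative only.

```r
# Minimal sketch: compare the running R version with the minimum version
# stated in this section (4.4.0).
min_r_version <- "4.4.0"
r_ok <- getRversion() >= min_r_version

if (!r_ok) {
    message(
        "R ", getRversion(), " is older than ", min_r_version,
        "; consider upgrading before installing methylscaper."
    )
}
```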

2.2 Load the package

After successful installation, load the package into the working space.

library(methylscaper)

To access the Shiny app, simply run:

methylscaper()

3 Visualizing single-cell data

To visualize single-cell data from methods such as scNMT-seq, methylscaper begins with pre-aligned data. Each cell should have two files, one for the GCH sites and another for the HCG sites. methylscaper needs at least three columns: chromosome, position, and methylation status. Files of this type are generated by the “bismark_methylation_extractor” script in the Bismark software tool. The extractor writes four- or six-column files (see the bedGraph option described here: https://felixkrueger.github.io/Bismark/options/methylation_extraction/). methylscaper accepts these and converts them to the three-column format internally.
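To make the three-column layout concrete, the sketch below reduces a toy four-column table to chromosome, position, and methylation value. It is not methylscaper's internal conversion. The column layout (chromosome, start, end, methylation percentage), the values, and the helper `to_three_col()` are all assumptions made for illustration.

```r
# Toy four-column, bedGraph-style table (values invented for illustration).
# Assumed layout: chromosome, start, end, methylation percentage.
four_col <- data.frame(
    chr = c("19", "19", "19"),
    start = c(8966850L, 8966901L, 8967010L),
    end = c(8966851L, 8966902L, 8967011L),
    meth_pct = c(100, 0, 100)
)

# Hypothetical helper (not a methylscaper function): keep the chromosome,
# the start position, and the methylation value; drop every other column.
to_three_col <- function(x) {
    data.frame(chr = x[[1]], pos = x[[2]], status = x[[4]])
}

three_col <- to_three_col(four_col)
three_col
```

Because the helper selects columns by position, it would handle a six-column table in the same way, provided the first four columns follow the assumed layout.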

Because these files are large, methylscaper first reduces the data to a single chromosome for visualization. In the Shiny app, first select all files associated with the endogenous methylation and then select all files associated with accessibility. The files should be named so that the file pairs can be inferred (e.g., “Expr1_Sample1_met” pairs with “Expr1_Sample1_acc”). Finally, indicate the desired chromosome so that the data are filtered to that chromosome.
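The naming convention can be checked before the files are uploaded. The sketch below shows one way to pair files by a shared prefix in base R. It does not describe how the Shiny app infers pairs. The file names, the `.tsv` extension, and the helper `sample_key()` are hypothetical.

```r
# Hypothetical file names following the naming pattern described above.
met_files <- c("Expr1_Sample2_met.tsv", "Expr1_Sample1_met.tsv")
acc_files <- c("Expr1_Sample1_acc.tsv", "Expr1_Sample2_acc.tsv")

# Hypothetical helper: strip the extension and the trailing "_met" or
# "_acc" tag so that both files of a pair share the same key.
sample_key <- function(f, tag) {
    sub(paste0("_", tag, "$"), "", tools::file_path_sans_ext(basename(f)))
}

met_keys <- sample_key(met_files, "met")
acc_keys <- sample_key(acc_files, "acc")

# Align the accessibility files to the order of the methylation files.
pairs <- data.frame(
    sample = met_keys,
    met = met_files,
    acc = acc_files[match(met_keys, acc_keys)]
)

# An NA in the acc column would mark a methylation file without a partner.
stopifnot(!anyNA(pairs$acc))
pairs
```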

3.0.1 Example data for single-cell data

Below we walk through an example using data from Clark et al., 2018, obtained from GSE109262. For the sake of this example, we assume that the GSE109262_RAW.tar archive has been downloaded and extracted locally under ~/Downloads/.

3.0.2 Preprocessing in the Shiny app

In the screenshot below, the GSE109262 data on chromosome 19 is ready for processing. When selecting “Browse…”, be sure to select all relevant files for each methylation type.

Screenshot: Preprocessing tab for single-cell data

3.1 Preprocessing using methylscaper functions

The preprocessing can also be done directly in the R console, which allows additional start and end positions to be specified. To create a small example to include in the package, we additionally restricted the data to the region between base pairs 8,947,041 and 8,987,041, which is centered around the Eef1g gene. In practice, we advise users to filter only to the chromosome level so that the region stays relatively large. The Visualization tab allows a more refined search along the chromosome and is described in a section below.
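Conceptually, restricting the data to a chromosome and a start and end position is a row filter on the three-column table. The sketch below illustrates that idea on toy data. It is not the implementation of subsetSC, and the data values and the helper `filter_region()` are hypothetical.

```r
# Toy three-column table (values invented for illustration).
sites <- data.frame(
    chr = c("19", "19", "19", "7"),
    pos = c(8900000L, 8950000L, 8990000L, 8950000L),
    status = c(1, 0, 1, 1)
)

# Hypothetical helper (not a methylscaper function): keep the sites on one
# chromosome whose position lies within [start_pos, end_pos].
filter_region <- function(x, chromosome, start_pos, end_pos) {
    keep <- x$chr == as.character(chromosome) &
        x$pos >= start_pos & x$pos <= end_pos
    x[keep, , drop = FALSE]
}

region <- filter_region(
    sites,
    chromosome = 19, start_pos = 8937041, end_pos = 8997041
)
region
```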

When using methylscaper within R, rather than specifying all the files individually, simply point to a folder which contains two subfolders with the accessibility and endogenous methylation files. These subfolders must be named “acc” and “met”, respectively.

filepath <- "~/Downloads/GSE109262_RAW/"
singlecell_subset <- subsetSC(filepath, chromosome = 19, startPos = 8937041, endPos = 8997041)
# To save for later, save as an rds file and change the folder location as desired:
saveRDS(singlecell_subset, "~/Downloads/singlecell_subset.rds")
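The required folder layout can be prepared in R. The sketch below sorts files into “acc” and “met” subfolders by file name. It assumes that accessibility files contain "GpC-acc" and methylation files contain "CpG-met" in their names, as the GSE109262 file names in this vignette do. A temporary directory with empty placeholder files stands in for the real download folder, and the `cellA` file names are hypothetical.

```r
# Stand-in for the real download folder: a temporary directory holding two
# empty placeholder files (hypothetical names).
raw_dir <- file.path(tempdir(), "GSE109262_RAW_demo")
dir.create(raw_dir, showWarnings = FALSE)
demo_files <- c(
    "cellA_CpG-met_processed.tsv.gz",
    "cellA_GpC-acc_processed.tsv.gz"
)
file.create(file.path(raw_dir, demo_files))

# Create the two subfolders, which must be named "acc" and "met".
for (sub in c("acc", "met")) {
    dir.create(file.path(raw_dir, sub), showWarnings = FALSE)
}

# Classify the top-level files by name and move them into place.
all_files <- list.files(raw_dir, pattern = "\\.tsv\\.gz$")
is_acc <- grepl("GpC-acc", all_files, fixed = TRUE)
is_met <- grepl("CpG-met", all_files, fixed = TRUE)
file.rename(
    file.path(raw_dir, all_files[is_acc]),
    file.path(raw_dir, "acc", all_files[is_acc])
)
file.rename(
    file.path(raw_dir, all_files[is_met]),
    file.path(raw_dir, "met", all_files[is_met])
)

list.files(raw_dir, recursive = TRUE)
```

Files that match neither pattern are left where they are, so nothing is moved on the basis of a guess.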

For a reproducible example, we have provided three cells for download. Below, we read the data directly from the URLs into R and use the subsetSC function. If you choose to download these files instead, follow the directions above and move the files into subfolders named “acc” and “met”.

gse_subset_path <- list(
    c(
        "https://rbacher.rc.ufl.edu/methylscaper/data/GSE109262_SUBSET/GSM2936197_ESC_A08_CpG-met_processed.tsv.gz",
        "https://rbacher.rc.ufl.edu/methylscaper/data/GSE109262_SUBSET/GSM2936196_ESC_A07_CpG-met_processed.tsv.gz",
        "https://rbacher.rc.ufl.edu/methylscaper/data/GSE109262_SUBSET/GSM2936192_ESC_A03_CpG-met_processed.tsv.gz"
    ),
    c(
        "https://rbacher.rc.ufl.edu/methylscaper/data/GSE109262_SUBSET/GSM2936197_ESC_A08_GpC-acc_processed.tsv.gz",
        "https://rbacher.rc.ufl.edu/methylscaper/data/GSE109262_SUBSET/GSM2936196_ESC_A07_GpC-acc_processed.tsv.gz",
        "https://rbacher.rc.ufl.edu/methylscaper/data/GSE109262_SUBSET/GSM2936192_ESC_A03_GpC-acc_processed.tsv.gz"
    ),
    c("GSM2936197_ESC_A08_CpG-met_processed", "GSM2936196_ESC_A07_CpG-met_processed", "GSM2936192_ESC_A03_CpG-met_processed"),
    c("GSM2936197_ESC_A08_GpC-acc_processed", "GSM2936196_ESC_A07_GpC-acc_processed", "GSM2936192_ESC_A03_GpC-acc_processed")
)
# This formatting is a list of 4 objects: the met file urls, the acc file urls, the met file names, and the acc file names.
options(timeout = 1000)
singlecell_subset <- subsetSC(gse_subset_path, chromosome = 19, startPos = 8937041, endPos = 8997041)

# To save for later, save as an rds file and change the folder location as desired:
# saveRDS(singlecell_subset, "~/Downloads/singlecell_subset.rds")
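When building a URL list of this shape by hand, a quick structural check can catch mismatched lengths before any download starts. The sketch below is an illustration only: `check_path_list()` is a hypothetical helper, not part of methylscaper, and the example.org URLs and file names are placeholders.

```r
# Placeholder URLs and names (hypothetical, for illustration only).
toy_paths <- list(
    c("https://example.org/cell1_CpG-met.tsv.gz",
      "https://example.org/cell2_CpG-met.tsv.gz"),
    c("https://example.org/cell1_GpC-acc.tsv.gz",
      "https://example.org/cell2_GpC-acc.tsv.gz"),
    c("cell1_CpG-met", "cell2_CpG-met"),
    c("cell1_GpC-acc", "cell2_GpC-acc")
)

# Hypothetical helper: require a list of four character vectors
# (met URLs, acc URLs, met names, acc names) that all have the same length.
check_path_list <- function(x) {
    stopifnot(
        is.list(x),
        length(x) == 4,
        all(vapply(x, is.character, logical(1))),
        length(unique(lengths(x))) == 1
    )
    invisible(TRUE)
}

check_path_list(toy_paths)
```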

To demonstrate the example end to end using the three-cell subset, we skip some explanations of the functions and show the resulting plot. In this particular region, only one of the three cells has coverage, so only one row is shown in the plot (a cell with no data in the entire region is omitted from the plot rather than being plotted as missing data). All functions are explained in detail in the following sections.

data("mouse_bm")
gene.select <- subset(mouse_bm, mgi_symbol == "Eef1g")

startPos <- 8966841
endPos <- 8967541
prepSC.out <- prepSC(singlecell_subset, startPos = startPos, endPos = endPos)

orderObj <- initialOrder(prepSC.out)
plotSequence(orderObj, Title = "Eef1g gene", plotFast = TRUE, drawKey = FALSE)
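The point above about cells without coverage can be illustrated on toy data. The sketch below counts, for each cell, the sites that fall inside the window and keeps only the cells with at least one site. It is not the package's implementation. The per-cell tables and cell names are invented, and only the window boundaries are taken from the example above.

```r
# Toy per-cell site tables (values invented for illustration).
cells <- list(
    cell_A = data.frame(pos = c(8966900L, 8967200L), status = c(1, 0)),
    cell_B = data.frame(pos = 8950000L, status = 1),
    cell_C = data.frame(pos = integer(0), status = numeric(0))
)
start_pos <- 8966841
end_pos <- 8967541

# Count, for each cell, the sites that fall inside the window.
n_in_window <- vapply(
    cells,
    function(x) sum(x$pos >= start_pos & x$pos <= end_pos),
    integer(1)
)

# Keep only the cells that have at least one covered site in the window.
covered <- cells[n_in_window > 0]
names(covered)
```

In this toy input, cell_B has data only outside the window and cell_C has no data at all, so neither would contribute a row.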