RLSeq is a package for analyzing R-loop mapping data sets, and it is a core component of the RLSuite toolchain. It serves two primary purposes: (1) to facilitate the evaluation of data quality, and (2) to enable R-loop data analysis in the context of genomic annotations and the public data sets in RLBase. The main analysis steps can be conveniently run using the
RLSeq() function. Then, an HTML report can be generated using the
report() function. Individual steps of this pipeline are also accessible through separate functions which provide custom analysis capabilities.
This vignette will showcase the primary functionality of RLSeq with data from a publicly-available R-loop data mapping study in Ewing sarcoma cell lines, GSE68845. We have selected two DNA-RNA Immunoprecipitation sequencing (DRIP-seq) samples for demonstration purposes: (1) SRX1025890, a positive R-loop mapping sample (“POS”; condition: S9.6 -RNaseH1), and (2) SRX1025892, a negative control (“NEG”; condition S9.6 +RNaseH1). We will begin by showing a quick-start analysis on SRX1025890, and then we will proceed to discuss, in detail, the specific steps of this analysis with both samples.
Here, we demonstrate a simple analysis workflow which utilizes a publicly-available data set stored in RLBase (a database of R-loop consensus regions and R-loop-mapping experiments, also part of RLSuite). The commands below download these data, run
RLSeq(), and generate the HTML report.
# Peaks and coverage can be found in RLBase rlbase <- "https://rlbase-data.s3.amazonaws.com" pks <- file.path(rlbase, "peaks", "SRX1025890_hg38.broadPeak") cvg <- file.path(rlbase, "coverage", "SRX1025890_hg38.bw") # Initialize data in the RLRanges object. # Metadata is optional, but improves the interpretability of results rlr <- RLRanges( peaks = pks, coverage = cvg, genome = "hg38", mode = "DRIP", label = "POS", sampleName = "TC32 DRIP-Seq" ) # The RLSeq command performs all analyses rlr <- RLSeq(rlr) # Generate an html report report(rlr, reportPath = "rlseq_report_example.html")
The report generated by this code is found here.
RLSeq should be installed alongside RLHub to facilitate access to the data required for annotation and analysis. When downloading RLSeq from bioconductor, RLHub is already included.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("RLSeq")
Both packages can also be installed from github.
library(remotes) install_github("Bishop-Laboratory/RLHub") install_github("Bishop-Laboratory/RLSeq")
RLSeq is compatible with R-loop data generated from a variety of pipelines and tools. However, it is strongly recommended that you use RLPipes, a snakemake-based CLI pipeline tool built specifically for upstream processing of R-loop datasets.
# conda install -c conda-forge mamba mamba create -n rlpipes -c bioconda -c conda-forge rlpipes conda activate rlpipes
A typical config file (
CSV) should be written as such:
And then the pipeline can be run.
RLPipes build -m DRIP rseq_out/ tests/test_data/samples.csv RLPipes run rseq_out/
The resulting directory will contain
bam/, and other processed data sets which are used in downstream analysis.
Note: If you choose to use a different pipeline, use macs2/macs3 for peak calling to ensure compatibility with RLBase.
Here, we describe each step of the analysis pipeline which is run as part of the
For this example, we will be using DRIP-Seq data from a 2018 Nature paper on R-loops in Ewing sarcoma (Gorthi et al. 2018). The sample has been IP’d for R-loops (S9.6 -RNaseH1; label: “POS”). The data was processed using RLPipes and uploaded to RLBase. Peaks are converted to
GRanges objects using a helper function from
regioneR. URLs and file paths for peak files can also be supplied directly to without this step.
rlbase <- "https://rlbase-data.s3.amazonaws.com" # Get peaks and coverage s96Pks <- regioneR::toGRanges(file.path(rlbase, "peaks", "SRX1025890_hg38.broadPeak")) s96Cvg <- file.path(rlbase, "coverage", "SRX1025890_hg38.bw")
For demonstration purposes, only 10000 ranges are analyzed here.
# For expediency, peaks we filter and down-sampled to the top 10000 by padj (V9) # This is not necessary as part of the typical workflow, however s96Pks <- s96Pks[s96Pks$V9 > 2,] s96Pks <- s96Pks[sample(names(s96Pks), 10000)]
RLRanges objects were constructed. These are the primary objects used in all
RLRanges are an extension of
GRanges which provide additional metadata and validation functions.
## Build RLRanges ## # S9.6 -RNaseH1 rlr <- RLRanges( peaks = s96Pks, coverage = s96Cvg, genome = "hg38", mode = "DRIP", label = "POS", sampleName = "TC32 DRIP-Seq" )
Sample quality is assessed by analyzing the association of peaks with R-loop-forming sequences (RLFS). RLFS are genomic sequences that favor the formation of R-loops (Jenjaroenpun et al. 2015). While R-loops can form outside RLFS, there is a strong relationship between them, which provides an unbiased test of whether a set of peaks actually represents successful R-loop mapping.
RLSeq first implements a permutation test to evaluate the enrichment of peaks within RLFS and build a Z-score distribution around RLFS sites.
# Analyze RLFS for positive sample rlr <- analyzeRLFS(rlr, quiet = TRUE)
The resulting objects now contain the permutation test results. These results can be easily visualized with the
The quality classifier is an ensemble model based on an online-learning scheme. It predicts “POS” for samples which are predicted to show robust R-loop mapping and “NEG” for samples which are not. The latest version can be accessed via RLHub. For greater detail, please see the
RLHub::modes reference. To apply the model and predict sample quality, use the
# Predict rlr <- predictCondition(rlr)
The results from testing our example samples:
# Access results s96_pred <- rlresult(rlr, "predictRes") cat("Prediction: ", s96_pred$prediction)
## Prediction: NEG
The feature enrichment test assesses the enrichment of genomic features within a supplied R-loop dataset. The function queries the RLHub annotation database to retrieve genomic features, and then it performs fisher’s exact test and the relative distance test to assess feature enrichment (Favorov et al. 2012). For a description of all features, see RLHub::annotations.
# Perform test rlr <- featureEnrich( object = rlr, quiet = TRUE )
# View Top Results annoResS96 <- rlresult(rlr, "featureEnrichment") annoResS96 %>% relocate(contains("fisher"), .after = type) %>% arrange(desc(stat_fisher_rl))
## # A tibble: 50 × 13 ## db type stat_fisher_rl stat_fisher_shuf pval_fisher_rl pval_fisher_shuf ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Repeat… SINE Inf 1.20 3.30e-167 0.00819 ## 2 Transc… Intr… Inf 1.00 2.23e-308 0.875 ## 3 Transc… Exon 16.9 1.06 2.23e-308 0.0346 ## 4 Transc… TSS 15.7 1.03 2.23e-308 0.473 ## 5 Transc… TTS 13.3 0.999 2.23e-308 1 ## 6 Encode… enhP 12.1 0.948 2.23e-308 0.203 ## 7 knownG… prot… 11.3 0.987 2.23e-308 0.600 ## 8 PolyA poly… 9.49 0.941 2.23e-308 0.455 ## 9 snoRNA… CDBox 9.48 1.80 5.51e- 8 0.306 ## 10 Transc… five… 9.42 1.01 2.23e-308 0.865 ## # … with 40 more rows, and 7 more variables: num_tested_peaks <int>, ## # num_total_peaks <int>, num_tested_anno_ranges <int>, ## # num_total_anno_ranges <int>, avg_reldist_rl <dbl>, avg_reldist_shuf <dbl>, ## # pval_reldist <dbl>
From the results, we see that there is high enrichment within genic features, such as exons and introns.
RLSeq provides a helper function,
plotEnrichment, to facilitate the visualization of enrichment results.
pltlst <- plotEnrichment(rlr)
This returns a list of plots named according to the corresponding annotation database. For example, Encode cis-regulatory elements (CREs):