1 Introduction

monaLisa is a collection of functions for working with biological sequences and motifs that represent the binding preferences of transcription factors or nucleic acid binding proteins.

For example, monaLisa can be used to conveniently find motif hits in sequences (see section 7), or to identify motifs that are likely associated with observed experimental data. Such analyses are supposed to provide potential answers to the question “Which transcription factors are the drivers of my observed changes in expression/methylation/accessibility?”.

Several other approaches have been described that also address this problem, among them REDUCE (Roven and Bussemaker 2003), AME (McLeay and Bailey 2010) and ISMARA (Balwierz et al. 2014). In monaLisa, we aim to provide a flexible implementation that integrates well with other Bioconductor resources, makes use of the sequence composition correction developed for Homer (Heinz et al. 2010) or stability selection (Meinshausen and Bühlmann 2010) and provides several alternative ways to study the relationship between experimental measurements and sequence motifs.

You can use known motifs from collections of transcription factor binding specificities such as JASPAR2020, also available from Bioconductor. Genomic regions could be for example promoters, enhancers or accessible regions for which experimental data is available.

Two independent approaches are implemented to identify interesting motifs:

  • In the binned motif enrichment analysis (calcBinnedMotifEnrR, see section 4), genomic regions are grouped into bins according to a numerical value assigned to each region, such as the change in expression, accessibility or methylation. Motif enrichments are then calculated for each bin, normalizing for differences in sequence composition in a very similar way as originally done by Homer (Heinz et al. 2010). As a special case, the approach can also be used to do a simple two set comparison (foreground against background sequences, see section 5.1) or to determine motif enrichments in a single set of sequences compared to a suitably matched genomic background set (see section 5.2). The binned motif enrichment approach was first introduced in Ginno et al. (2018) and subsequently applied in e.g. Barisic et al. (2019). To see more details on how calcBinnedMotifEnrR resembles Homer, check the function help page. We recommend using this function to do the binned motif enrichment analysis, since it corrects for sequence composition differences similarly to Homer, but is implemented more efficiently. calcBinnedMotifEnrHomer implements the same analysis using Homer and therefore requires a local installation of Homer, and calcBinnedKmerEnr(see section 6) implements the analysis for k-mers instead of motifs, to study sequence enrichments without the requirement of known motifs.

  • Randomized Lasso stability selection (randLassoStabSel, see the stability selection vignette in monaLisa) uses a robust regression approach (stability selection, Meinshausen and Bühlmann (2010)) to predict what transcription factors can explain experimental measurements, for example changes in chromatin accessibility between two conditions. Also this approach allows to correct for sequence composition. In addition, similar motifs have to “compete” with each other to be selected.

For both approaches, functions that allow visualization of obtained results are provided.

If you prefer to jump right in, you can continue with section 3 that shows a quick hypothetical example of how to run a binned motif enrichment analysis. If you prefer to actually compute enrichments on real data, you can find below a detailed example for a binned motif enrichment analysis (section 4). The special cases of analyzing just two sets of sequences (binary motif enrichment analysis) or a single set of sequences (comparing it to a suitable background sampled from the genome) are illustrated in section 5.

2 Installation

monaLisa can be installed from Bioconductor via the BiocManager package:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("monaLisa")

3 Quick example: Identify enriched motifs in bins

The quick example below, which we do not run, illustrates how a binned motif enrichment analysis can be performed in monaLisa. We assume that you already have a set of peaks. The sequences of the peak regions are stored in a Biostrings::DNAStringSet object (peak_seqs), and additionally each peak is associated with a numeric value (e.g., the change of methylation between two conditions, stored in the peak_change vector), that will be used to bin the regions before finding motifs enriched in each bin.

# load package
library(monaLisa)

# bin regions
# - peak_change is a numerical vector
# - peak_change needs to be created by the user to run this code
peak_bins <- bin(x = peak_change, binmode = "equalN", nElement = 400)

# calculate motif enrichments
# - peak_seqs is a DNAStringSet, pwms is a PWMatrixList
# - peak_seqs and pwms need to be created by the user to run this code
se <- calcBinnedMotifEnrR(seqs = peak_seqs,
                          bins = peak_bins,
                          pwmL = pwms)

The returned se is a SummarizedExperiment with assays negLog10P, negLog10Padj, pearsonResid, expForegroundWgt, log2enr, sumForegroundWgtWithHits and sumBackgroundWgtWithHits, each containing a matrix with motifs (rows) by bins (columns). The values are:

  • negLog10P: the raw P value (\(-\log_{10} p\)) of a given motif enrichment in a given bin. Each P value results from an enrichment calculation comparing occurrences of each motif in the bin to its occurrences in background sequences, defined by the background argument (by default: sequences in all other bins).
  • negLog10Padj: Same as negLog10P but adjusted for multiple testing using the method provided in the p.adjust.method argument, by default: Benjamini and Hochberg, 1995 (p.adjust(..., method="fdr")).
  • pearsonResid: Standardized Pearson residuals, a measure of motif enrichment akin to a z-score for the number of regions in the bin containing the motif. The standardized Pearson residuals are given by \(resid = (o - \mu)/\sigma\), where \(\mu\) is the expected count and \(\sigma\) the standard deviation of the expression in the numerator, under the null hypothesis that the probability of containing a motif is independent of whether the sequence is in the foreground or the background (see e.g. Agresti (2007), section 4.5).
  • expForegroundWgtWithHits: The expected number of regions in the bin containing a given motif.
  • log2enr: Motif enrichments, calculated as: \(log2enr = log2((o + c)/(e + c))\), where \(o\) and \(e\) are the observed and expected numbers of regions in the bin containing a given motif, respectively, and \(c\) is a pseudocount defined by the pseudocount.log2enr argument.
  • sumForegroundWgtWithHits and sumBackgroundWgtWithHits are the sum of foreground and background sequences that have at least one occurrence of the motif, respectively. The background sequences are weighted in order to adjust for differences in sequence composition between foreground and background.

In addition, rowData(se) and colData(se) give information about the used motifs and bins, respectively. In metadata(se) you can find information about parameter values.

4 Binned motif enrichment analysis with multiple sets of sequences (more than two): Finding TFs enriched in differentially methylated regions

This section illustrates the use of monaLisa to analyze regions or sequences with associated numerical values (here: changes of DNA methylation), grouped into several bins according to these values. The special cases of just two sets of sequences (binary motif enrichment analysis) or a single set of sequences (comparing it to a suitable background sampled from the genome) are illustrated in section 5.

This example is based on experimental data from an in vitro differentiation system, in which mouse embryonic stem (ES) cells are differentiated into neuronal progenitors (NP). In an earlier study (Stadler et al. 2011), we have analyzed the genome-wide CpG methylation patterns in these cell types and identified so called low methylated regions (LMRs), that have reduced methylation levels and correspond to regions bound by transcription factors.

We also developed a tool that systematically identifies such regions from genome-wide methylation data (Burger et al. 2013). Interestingly, a change in methylation of LMRs is indicative of altered transcription factor binding. We will therefore use these regions to identify transcription factor motifs that are enriched or depleted in LMRs that change their methylation between ES and NP cell states.

4.1 Load packages

We start by loading the needed packages:

library(GenomicRanges)
library(SummarizedExperiment)
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Mmusculus.UCSC.mm10)
library(monaLisa)
library(ComplexHeatmap)
library(circlize)

4.2 Genomic regions or sequences of interest

monaLisa provides a file with genomic coordinates (mouse mm10 assembly) of LMRs, with the respective changes of methylation. We load this GRanges object into R.

lmrfile <- system.file("extdata", "LMRsESNPmerged.gr.rds", 
                       package = "monaLisa")
lmr <- readRDS(lmrfile)
lmr
#> GRanges object with 45414 ranges and 1 metadata column:
#>           seqnames          ranges strand |    deltaMeth
#>              <Rle>       <IRanges>  <Rle> |    <numeric>
#>       [1]     chr1 3549153-3550201      * |    0.3190299
#>       [2]     chr1 3680914-3682164      * |    0.0657352
#>       [3]     chr1 3913315-3914523      * |    0.4803313
#>       [4]     chr1 3953500-3954157      * |    0.4504727
#>       [5]     chr1 4150457-4151567      * |    0.5014768
#>       ...      ...             ...    ... .          ...
#>   [45410]     chrY 4196254-4196510      * | -0.020020382
#>   [45411]     chrY 4193654-4194152      * | -0.102559935
#>   [45412]     chrY 4190208-4192766      * | -0.031668206
#>   [45413]     chrY 4188072-4188924      * |  0.130623049
#>   [45414]     chrY 4181867-4182624      * |  0.000494588
#>   -------
#>   seqinfo: 21 sequences from an unspecified genome

Alternatively, the user may also start the analysis with genomic regions contained in a bed file, or directly with sequences in a FASTA file. The following example code illustrates how to do this, but should not be run if you are following the examples in this vignette.

# starting from a bed file
#   import as `GRanges` using `rtracklayer::import`
#   remark: if the bed file also contains scores (5th column), these will be
#           also be imported and available in the "score" metadata column,
#           in this example in `lmr$score`
lmr <- rtracklayer::import(con = "file.bed", format = "bed")

# starting from sequences in a FASTA file
#   import as `DNAStringSet` using `Biostrings::readDNAStringSet`
#   remark: contrary to the coordinates in a `GRanges` object like `lmr` above,
#           the sequences in `lmrseqs` can be directly used as input to
#           monaLisa::calcBinnedMotifEnrR (no need to extract sequences from
#           the genome, just skip that step below)
lmrseqs <- Biostrings::readDNAStringSet(filepath = "myfile.fa", format = "fasta")

We can see there are 45414 LMRs, most of which gain methylation between ES and NP stages:

hist(lmr$deltaMeth, 100, col = "gray", main = "",
     xlab = "Change of methylation (NP - ES)", ylab = "Number of LMRs")

In order to keep the computation time reasonable, we’ll select 10,000 of the LMRs randomly:

set.seed(1)
lmrsel <- lmr[ sample(x = length(lmr), size = 10000, replace = FALSE) ]

4.3 Bin genomic regions

Now let’s bin our LMRs by how much they change methylation, using the bin function from monaLisa. We are not interested in small changes of methylation, say less than 0.3, so we’ll use the minAbsX argument to create a no-change bin in [-0.3, 0.3). The remaining LMRs are put into bins of 800 each:

bins <- bin(x = lmrsel$deltaMeth, binmode = "equalN", nElement = 800, 
            minAbsX = 0.3)
table(bins)
#> bins
#> [-0.935,-0.242]  (-0.242,0.327]   (0.327,0.388]   (0.388,0.443]   (0.443,0.491] 
#>             800            4400             800             800             800 
#>   (0.491,0.536]   (0.536,0.585]   (0.585,0.862] 
#>             800             800             800

Generally speaking, we recommend a minimum of ~100 sequences per bin as fewer sequences may lead to small motif counts and thus either small or unstable enrichments.

We can see which bin has been set to be the zero bin using getZeroBin, or set it to a different bin using setZeroBin:

# find the index of the level representing the zero bin 
levels(bins)
#> [1] "[-0.935,-0.242]" "(-0.242,0.327]"  "(0.327,0.388]"   "(0.388,0.443]"  
#> [5] "(0.443,0.491]"   "(0.491,0.536]"   "(0.536,0.585]"   "(0.585,0.862]"
getZeroBin(bins)
#> [1] 2

Because of the asymmetry of methylation changes, there is only a single bin with LMRs that lost methylation and many that gained:

plotBinDensity(lmrsel$deltaMeth, bins, legend = "topleft")