periodicDNA

Jacques Serizay

2022-11-01

Introduction to periodicDNA

Short DNA sequence motifs provide key information for interpreting the instructions in DNA, for example by providing binding sites for proteins or altering the structure of the double-helix. A less studied but important feature of DNA sequence motifs is their periodicity. A famous example is the 10-bp periodicity of many k-mers in nucleosome positioning (reviewed in Travers et al. 2010 or in Struhl and Segal 2013).

periodicDNA provides a framework to quantify the periodicity of k-mers of interest in DNA sequences. For a chosen k-mer, periodicDNA can identify which periods are statistically enriched in a set of sequences, by using a randomized shuffling approach to compute an empirical p-value. It can also generate continuous linear tracks of k-mer periodicity strength over genomic loci.

Internal steps of periodicDNA

To estimate the periodicity strength of a given k-mer in one or several sequences, periodicDNA performs the following steps:

  1. The k-mer occurrences are mapped and their pairwise distances are calculated.
  2. The distribution of all the resulting pairwise distances (also called “distogram”) is generated.
  3. The distogram is transformed into a frequency histogram and smoothed using a moving window of 3 to mask the universal three-base genomic periodicity. To normalize the frequency for distance decay, the local average (obtained by averaging the frequency with a moving window of 10) is then subtracted from the smoothed frequency.
  4. Finally, the power spectral density (PSD) is estimated by applying a Fast Fourier Transform (Figure 1F) over the normalized frequency histogram. The magnitude of the PSD values indicates the contribution of a given period to the overall periodicity of the k-mer of interest.
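
The four steps above can be sketched in base R on a toy sequence. This is a minimal illustration rather than the package's internal code: the helper `moving_average()`, the toy sequence, the 200-bp distance cap and the choice to compute the local average on the smoothed histogram are all assumptions made for the example.

```r
# Toy sequence (assumption): 'TT' occurs exactly once every 10 bp
dna <- paste(rep("TTACGACGAC", 50), collapse = "")

# Hypothetical helper: centered moving average, truncated at the vector edges
moving_average <- function(x, w) {
    n <- length(x)
    half_left <- floor((w - 1) / 2)
    half_right <- ceiling((w - 1) / 2)
    vapply(seq_len(n), function(i) {
        mean(x[max(1, i - half_left):min(n, i + half_right)])
    }, numeric(1))
}

# Step 1: map the k-mer occurrences and compute their pairwise distances
starts <- seq_len(nchar(dna) - 1)
kmer_pos <- which(substring(dna, starts, starts + 1) == "TT")
dists <- as.integer(round(as.vector(dist(kmer_pos))))

# Step 2: distogram, i.e. counts of each pairwise distance (capped at 200 bp here)
max_dist <- 200
counts <- tabulate(dists[dists <= max_dist], nbins = max_dist)

# Step 3: frequency histogram, smoothing (window of 3), then subtraction
# of the local average (window of 10) to correct for distance decay
freq_hist <- counts / sum(counts)
smoothed <- moving_average(freq_hist, 3)
normalized <- smoothed - moving_average(smoothed, 10)

# Step 4: power spectral density from the Fast Fourier Transform
spec <- Mod(fft(normalized))^2 / length(normalized)
k <- seq_len(max_dist / 2)  # positive frequencies, skipping frequency 0
psd <- data.frame(freq = k / max_dist, period = max_dist / k, PSD = spec[k + 1])

# Period with the strongest PSD value
peak <- psd[which.max(psd$PSD), ]
peak
```

Because the motif was placed every 10 bp by construction, the strongest PSD value in this toy example is expected at a period of 10 bp.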

Quantifying k-mer periodicity over a set of sequences

Basic usage

The main goal of periodicDNA is to quantify periodicity of a given k-mer in a set of sequences. For instance, one can assess the periodicity of TT dinucleotides in sequences around TSSs of ubiquitous promoters using getPeriodicity().

In the following example, getPeriodicity() is run directly on a GRanges object, with the genome argument specifying which genome the GRanges comes from.

library(ggplot2)
library(magrittr)
library(periodicDNA)
#
data(ce11_TSSs)
periodicity_result <- getPeriodicity(
    ce11_TSSs[['Ubiq.']][1:500],
    genome = 'BSgenome.Celegans.UCSC.ce11',
    motif = 'TT', 
    BPPARAM = setUpBPPARAM(1)
)
#> - Mapping k-mers.
#> - 523903 pairwise distances measured.
#> - Calculating pairwise distance distribution.
#> - Normalizing distogram vector.
#> - Applying Fast Fourier Transform to the normalized distogram.

The main output of getPeriodicity() is a table of power spectral density (PSD) values associated with discrete frequencies, computed using a Fast Fourier Transform. For a given frequency, a high PSD score indicates a high periodicity of the k-mer of interest.

In the previous example, TT dinucleotides in sequences around TSSs of ubiquitous promoters are highly periodic, with a periodicity of 10 bp.

head(periodicity_result$PSD)
#>    freq    period          PSD
#> 1 0.005 200.00000 6.256976e-08
#> 2 0.010 100.00000 2.204282e-08
#> 3 0.015  66.66667 2.215522e-09
#> 4 0.020  50.00000 1.108237e-08
#> 5 0.025  40.00000 4.649689e-09
#> 6 0.030  33.33333 2.661198e-08
subset(periodicity_result$periodicityMetrics, Period == 10)
#>    Freq Period          PSD
#> 20  0.1     10 3.633071e-06
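
As a small illustration of how such a table can be read, the sketch below rebuilds a data frame from the seven rows printed above (it is not the full output of getPeriodicity()), recomputes the period as the inverse of the frequency, and ranks the rows by PSD. The object names `psd` and `top` are chosen for this example only.

```r
# Rows copied from the output shown above (first six rows plus the 10-bp row)
psd <- data.frame(
    freq = c(0.005, 0.010, 0.015, 0.020, 0.025, 0.030, 0.100),
    PSD = c(6.256976e-08, 2.204282e-08, 2.215522e-09, 1.108237e-08,
            4.649689e-09, 2.661198e-08, 3.633071e-06)
)

# A period (in bp) is the inverse of its frequency
psd$period <- 1 / psd$freq

# Rank the rows by decreasing PSD and keep the three strongest
top <- psd[order(psd$PSD, decreasing = TRUE), ][1:3, ]
top
```

Among these rows, the 10-bp period ranks first, with a PSD more than an order of magnitude above the others.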

Graphical output of getPeriodicity() can be obtained using the plotPeriodicityResults() function:

plotPeriodicityResults(periodicity_result)

The first plot shows the raw distribution of distances between pairs of ‘TT’ in the sequences of the provided GRanges. The second plot shows the decay-normalised distribution. Finally, the third plot shows the PSD scores of the ‘TT’ k-mer, measured from the normalised distribution.

Repeated shuffling of input sequences

periodicDNA provides an approach to compare the periodicity of a given k-mer in a set of sequences to that of a background. For a given k-mer at a period T in a set of input sequences, the fold-change of its PSD over background is estimated by iteratively shuffling the input sequences and estimating the resulting PSD values.
The log2 fold-change (l2FC) between the observed PSD and the median of the PSD values measured after shuffling is then computed as follows:

l2FC = log2(observed PSD / median(shuffled PSDs)).
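
The formula can be applied by hand as a quick check. In the sketch below, the function name `l2fc_over_background()` is hypothetical and the shuffled PSD values are made up for illustration; only the observed value is taken from the 10-bp row reported in this vignette.

```r
# Hypothetical helper implementing the formula above
l2fc_over_background <- function(observed_psd, shuffled_psds) {
    log2(observed_psd / median(shuffled_psds))
}

# Observed PSD at a 10-bp period (value reported in this vignette)
observed_psd <- 3.63e-06

# Made-up PSD values standing in for five shufflings (illustration only)
shuffled_psds <- c(4.1e-09, 5.0e-09, 4.7e-09, 6.2e-09, 3.9e-09)

l2fc <- l2fc_over_background(observed_psd, shuffled_psds)
l2fc
```

Using the median rather than the mean keeps the background estimate robust to an occasional outlying shuffle.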

periodicity_result <- getPeriodicity(
    ce11_TSSs[['Ubiq.']][1:500],
    genome = 'BSgenome.Celegans.UCSC.ce11',
    motif = 'TT', 
    n_shuffling = 5
)
#> - Calculating observed PSD
#> - Mapping k-mers.
#> - 523903 pairwise distances measured.
#> - Calculating pairwise distance distribution.
#> - Normalizing distogram vector.
#> - Applying Fast Fourier Transform to the normalized distogram.
#> - Shuffling 1/5
#> - Shuffling 2/5
#> - Shuffling 3/5
#> - Shuffling 4/5
#> - Shuffling 5/5
#> Only 5 shufflings. Cannot compute accurate empirical p-values. To compute empirical p-values, set up n_shuffling to at least 100. Only l2FC values are returned
head(periodicity_result$periodicityMetrics)
#>    Freq    Period PSD_observed       l2FC pval fdr
#> 1 0.005 200.00000     6.26e-08 -1.2851533   NA  NA
#> 2 0.010 100.00000     2.20e-08  0.3423978   NA  NA
#> 3 0.015  66.66667     2.22e-09 -0.2691410   NA  NA
#> 4 0.020  50.00000     1.11e-08  2.3073608   NA  NA
#> 5 0.025  40.00000     4.65e-09 -1.3508550   NA  NA
#> 6 0.030  33.33333     2.66e-08  1.2743709   NA  NA
subset(periodicity_result$periodicityMetrics, Period == 10)
#>    Freq Period PSD_observed     l2FC pval fdr
#> 20  0.1     10     3.63e-06 9.586643   NA  NA
plotPeriodicityResults(periodicity_result)
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
#> "none")` instead.

If n_shuffling >= 100, an empirical p-value is also calculated (North et al. 2002). For each individual period T, this metric indicates whether the observed PSD is significantly greater than the PSD values measured after shuffling the input sequences. Note that empirical p-values are only an estimate of the true p-value. Notably, small p-values are systematically under-estimated (North et al. 2002).
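
To make the idea concrete, here is a minimal sketch of the estimator recommended by North et al. (2002), (r + 1) / (n + 1), where n is the number of shufflings and r the number of shuffled values at least as large as the observed one. Whether periodicDNA uses exactly this expression internally is an assumption; the function name `empirical_pvalue()` and the background values are illustrative.

```r
# Hypothetical helper: empirical p-value as (r + 1) / (n + 1)
empirical_pvalue <- function(observed, shuffled) {
    r <- sum(shuffled >= observed)  # shuffles at least as extreme as observed
    (r + 1) / (length(shuffled) + 1)
}

# Made-up background of 100 shuffled PSD values (arbitrary units)
shuffled <- seq(1, 100)

# Observed value above every shuffled value: smallest attainable p-value
p_min <- empirical_pvalue(101, shuffled)

# Observed value exceeded by five shuffled values
p_mid <- empirical_pvalue(95.5, shuffled)

c(p_min = p_min, p_mid = p_mid)
```

With this estimator, the smallest p-value that n shufflings can yield is 1 / (n + 1), which is one way to see why a handful of shufflings cannot support a meaningful p-value.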

Note

getPeriodicity() can also be run directly on a set of sequences of interest, as follows:

data(ce11_proms_seqs)
periodicity_result <- getPeriodicity(
    ce11_proms_seqs,
    motif = 'TT', 
    BPPARAM = setUpBPPARAM(1)
)
#> - Mapping k-mers.
#> - 117630 pairwise distances measured.
#> - Calculating pairwise distance distribution.
#> - Normalizing distogram vector.
#> - Applying Fast Fourier Transform to the normalized distogram.
subset(periodicity_result$periodicityMetrics, Period == 10)
#>    Freq Period         PSD
#> 20  0.1     10 1.16806e-06

Track of periodicity over a set of Genomic Ranges

The other aim of periodicDNA is to generate continuous linear tracks of k-mer periodicity strength over genomic loci of interest. getPeriodicityTrack() can be used to generate such genomic tracks. In the following example, a track of 10-bp WW periodicity is computed over the ce11_proms ranges and saved to a bigWig file.

WW_10bp_track <- getPeriodicityTrack(
    genome = 'BSgenome.Celegans.UCSC.ce11',
    granges = ce11_proms, 
    motif = 'WW',
    period = 10,
    BPPARAM = setUpBPPARAM(1),
    bw_file = 'WW-10-bp-periodicity_over-proms.bw'
)

When plotted over sets of ubiquitous, germline or somatic TSSs, the resulting track shows a clear increase in WW 10-bp periodicity at ubiquitous and germline TSSs, whereas somatic TSSs show no such increase.

data(ce11_TSSs)
plotAggregateCoverage(
    WW_10bp_track, 
    ce11_TSSs, 
    xlab = 'Distance from TSS',
    ylab = '10-bp periodicity strength (forward proms.)'
)

Session info

sessionInfo()
#> R Under development (unstable) (2022-10-25 r83175)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] BSgenome.Celegans.UCSC.ce11_1.4.2 periodicDNA_1.9.0                
#>  [3] BiocParallel_1.33.0               BSgenome_1.67.0                  
#>  [5] rtracklayer_1.59.0                Biostrings_2.67.0                
#>  [7] XVector_0.39.0                    magrittr_2.0.3                   
#>  [9] ggplot2_3.3.6                     GenomicRanges_1.51.0             
#> [11] GenomeInfoDb_1.35.0               IRanges_2.33.0                   
#> [13] S4Vectors_0.37.0                  BiocGenerics_0.45.0              
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.29.0 gtable_0.3.1               
#>  [3] rjson_0.2.21                xfun_0.34                  
#>  [5] bslib_0.4.0                 lattice_0.20-45            
#>  [7] Biobase_2.59.0              vctrs_0.5.0                
#>  [9] tools_4.3.0                 bitops_1.0-7               
#> [11] generics_0.1.3              parallel_4.3.0             
#> [13] tibble_3.1.8                fansi_1.0.3                
#> [15] highr_0.9                   pkgconfig_2.0.3            
#> [17] Matrix_1.5-1                assertthat_0.2.1           
#> [19] lifecycle_1.0.3             GenomeInfoDbData_1.2.9     
#> [21] farver_2.1.1                compiler_4.3.0             
#> [23] stringr_1.4.1               Rsamtools_2.15.0           
#> [25] munsell_0.5.0               codetools_0.2-18           
#> [27] htmltools_0.5.3             sass_0.4.2                 
#> [29] RCurl_1.98-1.9              yaml_2.3.6                 
#> [31] pillar_1.8.1                crayon_1.5.2               
#> [33] jquerylib_0.1.4             cachem_1.0.6               
#> [35] DelayedArray_0.25.0         tidyselect_1.2.0           
#> [37] digest_0.6.30               stringi_1.7.8              
#> [39] dplyr_1.0.10                restfulr_0.0.15            
#> [41] labeling_0.4.2              cowplot_1.1.1              
#> [43] fastmap_1.1.0               grid_4.3.0                 
#> [45] colorspace_2.0-3            cli_3.4.1                  
#> [47] XML_3.99-0.12               utf8_1.2.2                 
#> [49] withr_2.5.0                 scales_1.2.1               
#> [51] rmarkdown_2.17              matrixStats_0.62.0         
#> [53] zoo_1.8-11                  evaluate_0.17              
#> [55] knitr_1.40                  BiocIO_1.9.0               
#> [57] rlang_1.0.6                 glue_1.6.2                 
#> [59] DBI_1.1.3                   jsonlite_1.8.3             
#> [61] R6_2.5.1                    MatrixGenerics_1.11.0      
#> [63] GenomicAlignments_1.35.0    zlibbioc_1.45.0