# Contents

ncRNAtools

Lara Selles Vidal, Rafael Ayala, Guy-Bart Stan, Rodrigo Ledesma-Amaro

July 24, 2020

# 1 Abstract

Non-coding RNA (ncRNA) comprise a variety of RNA molecules that are transcribed from DNA but are not translated into proteins.
They include a large diversity of RNA which perform key cellular functions, including ribosomic RNA, transference RNA, ribozymes, small nuclear and small nucleolar RNA and multiple RNAs involved in the regulation of gene expression and chromatin structure, such as cis-regulatory elements, micro RNA, interferenceRNA and long non-coding RNA. A key aspect in determining the function and properties of ncRNAs is their secondary structure. This package provides a set of tools for identifying ncRNAs of interest, determining and plotting their secondary structure and importing and exporting the most common file-types for ncRNA data.

# 2 Introduction

The key role of ncRNAs in cellular processes has been well established over the last two decades, with at least 70% of the human genome reported to be transcribed into RNAs of different sizes. Given the fact that only around 28% of the human genome corresponds to protein-encoding regions (including introns and other untranslated regions), it is currently thought that the rest of the genome encodes a high number of functional RNA molecules that are not translated to proteins (Fu 2014).

One of the main characteristic features of ncRNAs is their secondary structure, defined by the intramolecular pairing of bases through hydrogen bonds. The identification of RNA secondary structure is often the basis for determining essential bases for ncRNA function, as well as for engineering novel or modified ncRNAs, including ribozymes. Furthermore, several methodologies for the prediction of RNA threedimensional structure require secondary structure information as a key input (Wang & Xu 2011; Thiel et al. 2017).

ncRNAtools aims to facilitate the inclusion of ncRNA identification and analysis within existing genomics and transcriptomics workflows, as well as the application of statistical modeling for the prediction of ncRNA properties and structure by providing a set of tools for importing, exporting, plotting and analyzing ncRNAs and their features.

# 3 Installation

To install ncRNAtools, start R (>=4.0) and run:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

# The following line initializes usage of Bioc devel, and therefore should be
# skipped in order to install the stable release version
BiocManager::install(version = "devel")

BiocManager::install("ncRNAtools")

# 4 Software features

A first set of ncRNAtools functionalities allows to search the RNAcentral database of annotated ncRNA (Consortium 2019). The RNAcentral database can be searched in two different ways: by keywords, and by genomic coordinates. Information about a specific entry can also be retrieved.

ncRNAtools also enables the prediction of secondary structure of RNA by simply providing a sequence of interest. Secondary structure predictions are performed through the rtools web server for RNA Bioinformatics (Hamada et al. 2016). Currently supported methodologies for secondary structure prediction include centroidFold (Hamada et al. 2008), centroidHomfold (Hamada et al. 2011) and IPknot (Sato et al. 2011). Only IPknot can detect pseudoknots. Alternative secondary structures can be predicted with the RintW method (Hagio et al. 2018). Utilities to determine paired bases from a secondary structure string and viceversa are also available.

Additionally, ncRNAtools provides utilities to calculate a matrix of base pair probabilities and visualize it as a heatmap, with the possibility to make a composite plot including also a defined secondary structure.

Finally, ncRNAtools supports reading and writing files in the CT and Dot-Bracket formats, the two most commonly employed formats for storing the secondary structure of RNA.

# 5 Examples

## 5.1 Searching the RNAcentral database

The RNAcentral database can be searched by keywords with the “rnaCentralTextSearch” function. The function returns a character vector where each element is a string representing the RNAcentral accession for an entry that matched the search. RNAcentral accessions are always of either the “URSXXXXXXXXXX” form, where X is any hexadecimal numeral, or “URSXXXXXXXXXX_taxid”, where taxid is the NCBI taxonomy identifier for the species or strain corresponding to the entry. Searches can either include only a keyword or phrase, or be given in the field_name:“field value” syntax, with several fields separated by logical operators as described in https://rnacentral.org/help/text-search. It should be noted that logical operators must be capitalized.

For example, the keyword “HOTAIR” can be passed as the query to the search function to identify entries corresponding to the HOTAIR ncRNA:

rnaCentralTextSearch("HOTAIR")
##  [1] "URS000075C808_9606" "URS0000757747_9606" "URS00008B6DC3_9606"
##  [4] "URS0000513030_9606" "URS00001A335C_9606" "URS00008B8F97_9606"
##  [7] "URS000075EF05_9606" "URS0000301B08_9606" "URS000011D1F0_9606"
## [10] "URS000019A694_9606" "URS0000D5DAB3_9606" "URS0000759B00_9606"
## [13] "URS00008BE2FB_9606" "URS00008BE50E_9606" "URS0001BE711E_9606"

With a more refined search, it is possible to identify RNAs corresponding to the FMN riboswitch present in Bacillus subtilis strains only. It should be noted that the quotes enclosing the value for specific fields must be escaped:

rnaCentralTextSearch("FMN AND species:\"Bacillus subtilis\"")
## [1] "URS00007C2D83_224308" "URS000037084E_224308" "URS00007C2D83_1423"
## [4] "URS000037084E_1423"   "URS0001BC2DEF_1423"   "URS0001BC379C_1423"
## [7] "URS0001BC09A1_1423"

Information about the resulting entries can be retrieved with the “rnaCentralRetrieveEntry” function:

rnaCentralRetrieveEntry("URS000037084E_1423")
## $rnaCentralID ## [1] "URS000037084E_1423" ## ##$sequence
## [1] "UGUAUCCUUCGGGGCAGGGUGGAAAUCCCGACCGGCGGUAGUAAAGCACAUUUGCUUUAGAGCCCGUGACCCGUGUGCAUAAGCACGCGGUGGAUUCAGUUUAAGCUGAAGCCGACAGUGAAAGUCUGGAUGGGAGAAGGAUGAU"
##
## $sequenceLength ## [1] 145 ## ##$description
## [1] "Bacillus subtilis FMN"
##
## $species ## [1] "Bacillus subtilis" ## ##$ncbiTaxID
## [1] 1423
##
## $RNATypes ## [1] "misc_RNA" "" "other" It is also possible to perform a search of the RNAcentral database by specifying genomic coordinates with the “rnaCentralGenomicCoordinatesSearch” function. A GRanges object must be provided, with sequence names of the chr? or simply ? format (where ? denotes any number, the X or Y characters or another letter for organisms where chromosomes are named with letters), or, alternatively, “MT” to denote mitochondrial DNA. The species name must also be provided as a string with the scientific name. A list of organisms for which search by genomic coordinates is supported can be found at https://rnacentral.org/help/genomic-mapping. Several ranges can be provided in the same GRanges object, but they must refer to the same organism. In the following example, we aim to retrieve the known ncRNA present in Yarrowia lipolytica (oleaginous yeast) between positions 200000 and 300000 of chromosome C, and positions 500000 and 550000 of chromosome D: ## Generate a GRanges object with the genomic ranges to be used to search the ## RNAcentral database genomicCoordinates <- GenomicRanges::GRanges(seqnames = S4Vectors::Rle(c("chrC", "chrD")), ranges = IRanges::IRanges(c(2e+05, 5e+05), c(3e+05, 550000))) ## Use the generated GRanges object to search the RNAcentral database RNAcentralHits <- rnaCentralGenomicCoordinatesSearch(genomicCoordinates, "Yarrowia lipolytica") ## Check the number of hits in each provided genomic range length(RNAcentralHits[[1]]) # 22 known ncRNA between positions 200000 and 300000 of chromosome C ## [1] 22 length(RNAcentralHits[[2]]) # No known ncRNA between positions 500000 and 550000 of chromosome D ## [1] 0 ## 5.2 Predicting the secondary structure of an RNA The “predictSecondaryStructure” function enables the prediction of the secondary structure of RNA through three different methods: centroidFold (Hamada et al. 2008), centroidHomfold (Hamada et al. 2011) and IPknot (Sato et al. 2011). IPknot has the advantage of being able to detect pseudoknots. The secondary structure predictions by centroidFold and centroidHomfold are output in the Dot-Bracket format, where dots (“.”) represent unpaired bases and pairs of open and closed round brackets (“(”-“)”) represent paired bases. Such format cannot unambiguously represent nested structures such as pseudoknots. Therefore, the output of IPknot is provided, if required, in the extended Dot-Bracket format, where additional pairs of symbols are used to represent such nested structures. In the following example, all three methods are used to predict the structure of a fragment of a tRNA: tRNAfragment <- "UGCGAGAGGCACAGGGUUCGAUUCCCUGCAUCUCCA" centroidFoldPrediction <- predictSecondaryStructure(tRNAfragment, "centroidFold") centroidFoldPrediction$secondaryStructure
## [1] ".....((((..(((((.......)))))..)))).."
centroidHomFoldPrediction <- predictSecondaryStructure(tRNAfragment, "centroidHomFold")
centroidFoldPrediction$secondaryStructure ## [1] ".....((((..(((((.......)))))..)))).." IPknotPrediction <- predictSecondaryStructure(tRNAfragment, "IPknot") IPknotPrediction$secondaryStructure
## [1] "...((((....(((((.......)))))..)))).."

It is also possible to predict not only a single canonical secondary structure, but also a set of possible alternative secondary structures. This is achieved with the RintW method (Hagio et al. 2018), available through the “predictAlternativeSecondaryStructures” function.

tRNAfragment2 <- "AAAGGGGUUUCCC"

RintWPrediction <- predictAlternativeSecondaryStructures(tRNAfragment2)

length(RintWPrediction)  # A total of 2 alternative secondary structures were identified by RintW
## [1] 2
RintWPrediction[[1]]$secondaryStructure ## [1] "...(((....)))" RintWPrediction[[2]]$secondaryStructure
## [1] "....(((...)))"

## 5.3 Calculating and plotting of base pair probability matrices

Both centroidFold and centroidHomfold return not only a prediction of the secondary structure of an RNA, but also a list of probabilities for potential base pairs for each nucleotide. Such a matrix can be useful in assessing the potential presence of alternative secondary structures.

A matrix of base pair probabilities can be generated with the “generatePairsProbabilityMatrix”, which takes as input the probabilities in the format returned by centroidFold and centroidHomFold. The base pair probabilities matrix can then be plotted with the “plotPairsProbabilityMatrix” function:

basePairProbabilityMatrix <- generatePairsProbabilityMatrix(centroidFoldPrediction$basePairProbabilities) plotPairsProbabilityMatrix(basePairProbabilityMatrix) A composite plot, where the top right triangle represents the base pair probabilities while the bottom left triangle represents base pairings in a specific secondary structure, can be generated with the “plotCompositePairsMatrix” function. Usually, base pairs in a secondary structure correspond to pairs with a high probability. The function takes as its input a matrix of base pair probabilities, and a dataframe with bases paired in a given secondary structure. Such a dataframe can be generated from a secondary structure string with the “findPairedBases” function. In the following example, such a composite plot is generated the secondary structure predicted by centroidFold for the tRNA fragment: pairedBases <- findPairedBases(sequence = tRNAfragment, secondaryStructureString = IPknotPrediction$secondaryStructure)

plotCompositePairsMatrix(basePairProbabilityMatrix, pairedBases)

## 5.4 Reading and writing files and other utilities

ncRNAtools supports reading and writing files in the CT and Dot-Bracket format. For a description of each format, see http://projects.binf.ku.dk/pgardner/bralibase/RNAformats.html and https://rna.urmc.rochester.edu/Text/File_Formats.html.

CT files are read with the “readCT” function. Some CT files contain a line for each nucleotide of the RNA sequence, while others contain lines only for paired nucleotides. In the second case, the full sequence must be provided when calling “readCT”.

The list returned by “readCT” includes an element named “pairsTable”, consisting of a dataframe indicating paired nucleotides. A corresponding secondary structure string in the Dot-Bracket format can be generated with the “pairsToSecondaryStructure” function.

CT files are written with the “writeCT” function. A complete CT file is always output by “writeCT”.

Files in the Dot-Bracket format are read and written with the “readDotBracket” and “writeDotBracket” functions respectively. If the read Dot-Bracket file contains an energy value for the represented structure, it will be included in the output of “readDotBracket”.

## Read an example CT file corresponding to E. coli tmRNA

exampleCTFile <- system.file("extdata", "exampleCT.ct", package = "ncRNAtools")

tmRNASequence <- "GGGGCUGAUUCUGGAUUCGACGGGAUUUGCGAAACCCAAGGUGCAUGCCGAGGGGCGGUUGGCCUCGUAAAAAGCCGCAAAAAAUAGUCGCAAACGACGAAAACUACGCUUUAGCAGCUUAAUAACCUGCUUAGAGCCCUCUCUCCCUAGCCUCCGCUCUUAGGACGGGGAUCAAGAGAGGUCAAACCCAAAAGAGAUCGCGUGGAAGCCCUGCCUGGGGUUGAAGCGUUAAAACUUAAUCAGGCUAGUUUGUUAGUGGCGUGUCCGUCCGCAGCUGGCAAGCGAAUGUAAAGACUGACUAAGCAUGUAGUACCGAGGAUGUAGGAAUUUCGGACGCGGGUUCAACUCCCGCCAGCUCCACCA"

## Write a complete CT file for E. coli tmRNA

tempDir <- tempdir()
testCTFile <- paste(tempDir, "testCTfile.ct", sep = "")

tmRNASecondaryStructureString <- pairsToSecondaryStructure(pairedBases = tmRNASecondaryStructure$pairsTable, sequence = tmRNASequence) writeCT(testCTFile, sequence = tmRNASequence, secondaryStructure = tmRNASecondaryStructureString, sequenceName = "tmRNA") ## Read an example Dot-Bracket file exampleDotBracketFile <- system.file("extdata", "exampleDotBracket.dot", package = "ncRNAtools") exampleDotBracket <- readDotBracket(exampleDotBracketFile) exampleDotBracket$freeEnergy  # The structure has a free energy of -41.2 kcal/mol
## [1] -41.2
## Write a Dot-Bracket file

tempDir2 <- tempdir()
testDotBracketFile <- paste(tempDir2, "testDotBracketFile.dot", sep = "")

writeDotBracket(testDotBracketFile, sequence = exampleDotBracket$sequence, secondaryStructure = exampleDotBracket$secondaryStructure,
sequenceName = "Test sequence")

ncRNAtools includes an utility to flatten representations of secondary structure in the extended Dot-Bracket format to basic Dot-Bracket format (i.e., comprising only the “.”, “(” and “)” characters). Although information is potentially lost upon such transformation, this is required for some RNA Bioinformatics software, which can only accept secondary structures in the Dot-Bracket format as input. This can be achieved with the “flattenDotBracket” function.

extendedDotBracketString <- "...((((..[[[.))))]]]..."

plainDotBracketString <- flattenDotBracket(extendedDotBracketString)

# 6 Session info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB              LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
## [1] ggplot2_3.3.5        GenomicRanges_1.46.0 GenomeInfoDb_1.30.0
## [4] IRanges_2.28.0       S4Vectors_0.32.0     BiocGenerics_0.40.0
## [7] ncRNAtools_1.4.0     BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1       xfun_0.27              bslib_0.3.1
##  [4] purrr_0.3.4            colorspace_2.0-2       vctrs_0.3.8
##  [7] generics_0.1.1         htmltools_0.5.2        yaml_2.2.1
## [10] utf8_1.2.2             rlang_0.4.12           jquerylib_0.1.4
## [13] pillar_1.6.4           withr_2.4.2            glue_1.4.2
## [16] DBI_1.1.1              GenomeInfoDbData_1.2.7 lifecycle_1.0.1
## [19] stringr_1.4.0          zlibbioc_1.40.0        munsell_0.5.0
## [22] gtable_0.3.0           evaluate_0.14          knitr_1.36
## [25] fastmap_1.1.0          curl_4.3.2             fansi_0.5.0
## [28] highr_0.9              Rcpp_1.0.7             scales_1.1.1
## [31] BiocManager_1.30.16    formatR_1.11           magick_2.7.3
## [34] jsonlite_1.7.2         XVector_0.34.0         farver_2.1.0
## [37] digest_0.6.28          stringi_1.7.5          bookdown_0.24
## [40] dplyr_1.0.7            grid_4.1.1             tools_4.1.1
## [43] bitops_1.0-7           magrittr_2.0.1         sass_0.4.0
## [46] RCurl_1.98-1.5         tibble_3.1.5           crayon_1.4.1
## [49] pkgconfig_2.0.3        ellipsis_0.3.2         xml2_1.3.2
## [52] assertthat_0.2.1       rmarkdown_2.11         httr_1.4.2
## [55] R6_2.5.1               compiler_4.1.1

# References

Consortium, T.R. (2019). RNAcentral: A hub of information for non-coding rna sequences. Nucleic Acids Research, 47, D221—D229. Retrieved from https://doi.org/10.1093/nar/gky1034

Fu, X.-D. (2014). Non-coding rna: A new frontier in regulatory biology. National Science Review, 1, 190–204. Retrieved from https://doi.org/10.1093/nsr/nwu008

Hagio, T., Sakuraba, S., Iwakiri, J., Mori, R. & Asai, K. (2018). Capturing alternative secondary structures of rna by decomposition of base-pairing probabilities. BMC Bioinformatics, 19, 38. Retrieved from https://doi.org/10.1186/s12859-018-2018-4

Hamada, M., Kiryu, H., Sato, K., Mituyama, T. & Asai, K. (2008). Prediction of rna secondary structure using generalized centroid estimators. Bioinformatics, 25, 465–473. Retrieved from https://doi.org/10.1093/bioinformatics/btn601

Hamada, M., Ono, Y., Kiryu, H., Sato, K., Kato, Y., Fukunaga, T., Mori, R. & Asai, K. (2016). Rtools: A web server for various secondary structural analyses on single rna sequences. Nucleic Acids Research, 44, W302–W307. Retrieved from https://doi.org/10.1093/nar/gkw337

Hamada, M., Yamada, K., Sato, K., Frith, M.C. & Asai, K. (2011). CentroidHomfold-last: Accurate prediction of rna secondary structure using automatically collected homologous sequences. Nucleic Acids Research, 39, W100–W106. Retrieved from https://doi.org/10.1093/nar/gkr290

Sato, K., Kato, Y., Hamada, M., Akutsu, T. & Asai, K. (2011). IPknot: Fast and accurate prediction of rna secondary structures with pseudoknots using integer programming. Bioinformatics, 27, i85–i93. Retrieved from https://doi.org/10.1093/bioinformatics/btr215

Thiel, B.C., Flamm, C. & Hofacker, I.L. (2017). RNA structure prediction: From 2D to 3D. Emerging Topics in Life Sciences, 1, 275–285. Retrieved from https://doi.org/10.1042/ETLS20160027

Wang, Z. & Xu, J. (2011). A conditional random fields method for rna sequence–structure relationship modeling and conformation sampling. Bioinformatics, 27, i102–i110. Retrieved from https://doi.org/10.1093/bioinformatics/btr232