The ORFhunteR package is R and C++ library for an automatic determination and annotation of open reading frames (ORFs) in a large set of RNA molecules. It efficiently implements the machine learning model based on vectorization of nucleotide sequences and the random forest classification algorithm. The ORFhunteR package consists of a set of functions written in the R language in conjunction with C++. The efficiency of the package was confirmed by the examples of the analysis of RNA molecules from the NCBI RefSeq and Ensembl databases. The package can be used in basic and applied biomedical research related to the study of the transcriptome of normal as well as pathological (for example, cancerous) human cells.
ORFhunteR 1.13.0
This document describes the usage of the functions integrated in the package and is meant to be a reference document for the end user.
The ORFhunteR package is considered stable and will undergo few changes from now on. Please, report potential bugs and incompatibilities to bioinformatics.rfct.bio.bsu@gmail.com.
The usage of the package functions for an automatic determination and annotation of open reading frames (ORFs) is shown for an example set of 50 mRNA molecules loaded from the Ensembl. The path to the data set trans_sequences.fasta
is available as
trans <- system.file("extdata", "Set.trans_sequences.fasta",
package = "ORFhunteR")
Load the package with the following command:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("ORFhunteR")
library('ORFhunteR')
#> Loading required package: Biostrings
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#> union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: XVector
#> Loading required package: GenomeInfoDb
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> Loading required package: rtracklayer
#> Loading required package: GenomicRanges
#> Loading required package: Peptides
The most important data sets are available with package as an external data. Furthermore, a number of additional data used in this vignette can be downloaded from our SST Center server.
You can load the data file of the mRNA molecules with the following command:
seq.set <- loadTrExper(tr = trans)
where the function argument tr
is a character string giving the name of file with transcripts of interest. Allowed file formats are fasta
or fa
. Alternative formats are gtf
or gff
. With gtf
or gff
formats, additional argument genome should be specified. This argument gives the name of pre-installed BSgenome data package with full genome sequences. Default value is BSgenome.Hsapiens.UCSC.hg38
.
Irrespective used format of input file, the function loadTrExper
returns a list of loaded transcript sequences. Each element of this list is character string of nucleotides sequence.
All possible in-frame ORF candidates in a sequence of interest can be determined by function findORFs
:
x <- "AAAATGGCTGCGTAATGCAAAATGGCTGCGAATGCAAAATGGCTGCGAATGCCGGCACGTTGCTACGT"
orf <- findORFs(x, codStart="ATG")
where the function argument x
is a character string giving the nucleotide sequence of interest. In current implementation, the function findORFs
scans the sequence for all sub-sequences that start with ATG start codon and end in-frame with a stop codon.
The function findORFs returns a matrix with start and stop positions in RNA molecule, length and sequence of identified variants of ORFs. In fact, the predominant amount of identified ORF candidates are short pseudo-ORFs (see below), as demonstrated in Figure 1.