CircSeqAlignTk is designed for end-to-end RNA-Seq data analysis of circular genome sequences, from alignment to visualization. It mainly targets viroids which are composed of 246-401 nt circular RNAs. In addition, CircSeqAlignTk implements a tidy interface to generate synthetic sequencing data that mimic real RNA-Seq data, allowing developers to evaluate the performance of alignment tools, new alignment algorithms, and new workflows. This vignette provides an overview of CircSeqAlignTk and a detailed explanation of the manner in which it is used.
To install the CircSeqAlignTk package, start R (≥ 4.2) and run the following steps:
Note that to install the latest version of the CircSeqAlignTk package, the latest version of R is required.
CircSeqAlignTk is designed for end-to-end RNA-Seq data analysis
of circular genome sequences, from alignment to visualization.
The whole processes will generate many files
including genome sequence indexes, and intermediate and final alignment results.
Thus, it is recommended to specify a working directory to save these files.
Here, for convenience in package development and validation,
we use a temporary folder
which is automatically arranged by the
as the working directory.
However, instead of using a temporary folder, users can specify a folder on the desktop or elsewhere, depending on the analysis project. For example:
Viroids are composed of 246–401 nt, single-stranded circular non-coding RNAs (Hull 2014; Flores et al. 2015; Gago-Zachert 2016). Sequencing small RNAs from viroid-infected plants could offer insights regarding the mechanisms of infection and eventually help prevent these infections in plants. The common workflow for analyzing such data involves the following steps: (i) limit read-length between 21 and 24 nt, as small RNAs derived from viroids are known to be in this range, (ii) align these reads to viroid genome sequences, and (iii) visualize the coverage of alignment to identify the pathogenic region of the viroid. This section demonstrates the workflow using a sample RNA-Seq dataset. It includes workflow from the FASTQ format file to the visualization of the analyzed results, for analyzing small RNA-seq data sequenced from viroid-infected plants.
The FASTQ format file used in this section
is attached in the CircSeqAlignTk package
and can be obtained using the
This FASTQ format file contains 29,178 sequence reads of small RNAs
that were sequenced from a tomato plant
infected with the potato spindle tuber viroid (PSTVd) isolate Cen-1 (FR851463).
The genome sequence of PSTVd isolate Cen-1 in FASTA format can be downloaded
using the accession number FR851463.
It is also attached in the CircSeqAlignTk package,
and can be obtained using the
To ensure alignment quality, trimming adapter sequences from the sequence reads is required, because most sequence reads in this FASTQ format file contain adapters with sequence “AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC”. Here, we use AdapterRemoval (Schubert, Lindgreen, and Orlando (2016)) implemented in the Rbowtie2 (Wei et al. 2018) package to trim the adapters before aligning the sequence reads. Note that the length of small RNAs derived from viroids is known to be in the range of 21–24 nt. Therefore, we set an argument to remove sequence reads with lengths outside this range after adapter removal.
library(R.utils) library(Rbowtie2) adapter <- 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' # decompressed the gzip file for trimming to avoid errors from `remove_adapters` gunzip(fq, destname=file.path(ws, 'srna.fq'), overwrite = TRUE, remove = FALSE) trimmed_fq <- file.path(ws, 'srna_trimmed.fq') params <- '--maxns 1 --trimqualities --minquality 30 --minlength 21 --maxlength 24' remove_adapters(file1 = file.path(ws, 'srna.fq'), adapter1 = adapter, adapter2 = NULL, output1 = trimmed_fq, params, basename = file.path(ws, 'AdapterRemoval.log'), overwrite = TRUE)
After obtaining the cleaned FASTQ format file (i.e.,
we build index files and perform alignment
implemented in the CircSeqAlignTk package.
To precisely align the reads to the circular genome sequence of the viroid,
the alignment is performed in
The index files are stored in a directory specified by the
The intermediate files (e.g., FASTQ format files used as inputs)
and alignment results (e.g., BAM format files) are stored in
the directory specified by the
output argument of the
BAM format files with the suffixes of
the final results obtained after alignment.
Refer to the sections
4.2 and 4.3
for a detailed description of each of the files generated by each function.
The alignment coverage can be summarized with the
This function loads the alignment results
calculates alignment coverage from these BAM format files,
and combines them into two data frames according to the aligned strands.
## L21 L22 L23 L24 ## [1,] 13 12 1 1 ## [2,] 13 12 1 1 ## [3,] 13 12 1 1 ## [4,] 13 13 1 1 ## [5,] 13 13 1 1 ## [6,] 13 13 1 1
## L21 L22 L23 L24 ## [1,] 7 5 0 1 ## [2,] 7 5 0 1 ## [3,] 7 5 0 1 ## [4,] 7 5 0 1 ## [5,] 7 5 0 1 ## [6,] 7 5 0 1
The alignment coverage can be then visualized using the
The scale of the upper and lower directions indicate alignment coverage of
the forward and reversed strands, respectively.
Circular genome sequences are generally represented as linear sequences in the FASTA format during analysis. Consequently, sequence reads obtained from organelles or organisms with circular genome sequences can be aligned anywhere, including at both ends of the sequence represented in the FASTA format. Using existing alignment tools such as Bowtie2 (Langmead and Salzberg 2012) and HISAT2 (Kim et al. 2019) to align such sequence reads onto circular sequences may fail, because these tools are designed to align sequence reads to linear genome sequences and their implementation does not assume that a single read can be aligned to both ends of a linear sequence. To solve this problem, that is, allowing reads to be aligned to both ends, the CircSeqAlignTk package implements a two-stage alignment process (Figure 2), using these existing alignment tools (Bowtie2 and HISAT2).