Contents

1 Installation

To install the CircSeqAlignTk package, start R (≥ 4.2) and run the following steps:

if (!requireNamespace('BiocManager', quietly = TRUE))
    install.packages('BiocManager')
BiocManager::install('CircSeqAlignTk')

Note that to install the latest version of the CircSeqAlignTk package, the latest version of R is required.

2 Preparation of working directory

CircSeqAlignTk is designed for end-to-end RNA-Seq data analysis of circular genome sequences, from alignment to visualization. The whole processes will generate many files including genome sequence indexes, and intermediate and final alignment results. Thus, it is recommended to specify a working directory to save these files. Here, for convenience in package development and validation, we use a temporary folder which is automatically arranged by the tempdir function as the working directory.

ws <- tempdir()

However, instead of using a temporary folder, users can specify a folder on the desktop or elsewhere, depending on the analysis project. For example:

ws <- '~/desktop/viroid_project'

3 Quick start

Viroids are composed of 246–401 nt, single-stranded circular non-coding RNAs (Hull 2014; Flores et al. 2015; Gago-Zachert 2016). Sequencing small RNAs from viroid-infected plants could offer insights regarding the mechanisms of infection and eventually help prevent these infections in plants. The common workflow for analyzing such data involves the following steps: (i) limit read-length between 21 and 24 nt, as small RNAs derived from viroids are known to be in this range, (ii) align these reads to viroid genome sequences, and (iii) visualize the coverage of alignment to identify the pathogenic region of the viroid. This section demonstrates the workflow using a sample RNA-Seq dataset. It includes workflow from the FASTQ format file to the visualization of the analyzed results, for analyzing small RNA-seq data sequenced from viroid-infected plants.

The FASTQ format file used in this section is attached in the CircSeqAlignTk package and can be obtained using the system.file function. This FASTQ format file contains 29,178 sequence reads of small RNAs that were sequenced from a tomato plant infected with the potato spindle tuber viroid (PSTVd) isolate Cen-1 (FR851463).

fq <- system.file(package = 'CircSeqAlignTk', 'extdata', 'srna.fq.gz')

The genome sequence of PSTVd isolate Cen-1 in FASTA format can be downloaded from GenBank or ENA using the accession number FR851463. It is also attached in the CircSeqAlignTk package, and can be obtained using the system.file function.

genome_seq <- system.file(package = 'CircSeqAlignTk', 'extdata', 'FR851463.fa')

To ensure alignment quality, trimming adapter sequences from the sequence reads is required, because most sequence reads in this FASTQ format file contain adapters with sequence “AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC”. Here, we use AdapterRemoval (Schubert, Lindgreen, and Orlando (2016)) implemented in the Rbowtie2 (Wei et al. 2018) package to trim the adapters before aligning the sequence reads. Note that the length of small RNAs derived from viroids is known to be in the range of 21–24 nt. Therefore, we set an argument to remove sequence reads with lengths outside this range after adapter removal.

library(R.utils)
library(Rbowtie2)
adapter <- 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'

# decompressed the gzip file for trimming to avoid errors from `remove_adapters`
gunzip(fq, destname=file.path(ws, 'srna.fq'), overwrite = TRUE, remove = FALSE)

trimmed_fq <- file.path(ws, 'srna_trimmed.fq')
params <- '--maxns 1 --trimqualities --minquality 30 --minlength 21 --maxlength 24'
remove_adapters(file1 = file.path(ws, 'srna.fq'),
                adapter1 = adapter,
                adapter2 = NULL,
                output1 = trimmed_fq,
                params,
                basename = file.path(ws, 'AdapterRemoval.log'),
                overwrite = TRUE)

After obtaining the cleaned FASTQ format file (i.e., srna_trimmed.fq.gz), we build index files and perform alignment using the build_index and align_reads functions implemented in the CircSeqAlignTk package. To precisely align the reads to the circular genome sequence of the viroid, the alignment is performed in two stages.

ref_index <- build_index(input = genome_seq, 
                         output = file.path(ws, 'index'))
aln <- align_reads(input = trimmed_fq, 
                   index = ref_index,
                   output = file.path(ws, 'align_results'))

The index files are stored in a directory specified by the output argument of the build_index function. The intermediate files (e.g., FASTQ format files used as inputs) and alignment results (e.g., BAM format files) are stored in the directory specified by the output argument of the align_reads function. BAM format files with the suffixes of .clean.t1.bam and .clean.t2.bam are the final results obtained after alignment. Refer to the sections 4.2 and 4.3 for a detailed description of each of the files generated by each function.

The alignment coverage can be summarized with the calc_coverage function. This function loads the alignment results (i.e., *.clean.t1.bam and *.clean.t2.bam), calculates alignment coverage from these BAM format files, and combines them into two data frames according to the aligned strands.

alncov <- calc_coverage(aln)
head(get_slot_contents(alncov, 'forward'))  # alignment coverage in forward strand 
##      L21 L22 L23 L24
## [1,]  13  12   1   1
## [2,]  13  12   1   1
## [3,]  13  12   1   1
## [4,]  13  13   1   1
## [5,]  13  13   1   1
## [6,]  13  13   1   1
head(get_slot_contents(alncov, 'reversed')) # alignment coverage in reversed strand 
##      L21 L22 L23 L24
## [1,]   7   5   0   1
## [2,]   7   5   0   1
## [3,]   7   5   0   1
## [4,]   7   5   0   1
## [5,]   7   5   0   1
## [6,]   7   5   0   1

The alignment coverage can be then visualized using the plot function (Figure 1). The scale of the upper and lower directions indicate alignment coverage of the forward and reversed strands, respectively.

plot(alncov)
Alignment coverage. The alignment coverage of the case study.

Figure 1: Alignment coverage
The alignment coverage of the case study.

4 Implementation

4.1 Two-stage alignment process

Circular genome sequences are generally represented as linear sequences in the FASTA format during analysis. Consequently, sequence reads obtained from organelles or organisms with circular genome sequences can be aligned anywhere, including at both ends of the sequence represented in the FASTA format. Using existing alignment tools such as Bowtie2 (Langmead and Salzberg 2012) and HISAT2 (Kim et al. 2019) to align such sequence reads onto circular sequences may fail, because these tools are designed to align sequence reads to linear genome sequences and their implementation does not assume that a single read can be aligned to both ends of a linear sequence. To solve this problem, that is, allowing reads to be aligned to both ends, the CircSeqAlignTk package implements a two-stage alignment process (Figure 2), using these existing alignment tools (Bowtie2 and HISAT2).