CircSeqAlignTk is designed for end-to-end RNA-Seq data analysis of circular genome sequences, from alignment to visualization. It mainly targets viroids which are composed of 246-401 nt circular RNAs. In addition, CircSeqAlignTk implements a tidy interface to generate synthetic sequencing data that mimic real RNA-Seq data, allowing developers to evaluate the performance of alignment tools, new alignment algorithms, and new workflows. This vignette provides an overview of CircSeqAlignTk and a detailed explanation of the manner in which it is used.
CircSeqAlignTk 1.8.0
To install the CircSeqAlignTk package (Sun and Cao 2024), start R (≥ 4.2) and run the following steps:
if (!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install('CircSeqAlignTk')
Note that to install the latest version of the CircSeqAlignTk package, the latest version of R is required.
CircSeqAlignTk is designed for end-to-end RNA-Seq data analysis
of circular genome sequences, from alignment to visualization.
The whole processes will generate many files
including genome sequence indexes, and intermediate and final alignment results.
Thus, it is recommended to specify a working directory to save these files.
Here, for convenience in package development and validation,
we use a temporary folder
which is automatically arranged by the tempdir
function
as the working directory.
However, instead of using a temporary folder, users can specify a folder on the desktop or elsewhere, depending on the analysis project. For example:
Viroids are composed of 246-401 nt, single-stranded circular non-coding RNAs (Hull 2014; Flores et al. 2015; Gago-Zachert 2016). Sequencing small RNAs from viroid-infected plants could offer insights regarding the mechanisms of infection and eventually help prevent these infections in plants. The common workflow for analyzing such data involves the following steps: (i) limit read-length between 21 and 24 nt, as small RNAs derived from viroids are known to be in this range, (ii) align these reads to viroid genome sequences, and (iii) visualize the coverage of alignment to identify the pathogenic region of the viroid. This section demonstrates the workflow using a sample RNA-Seq dataset. It includes workflow from the FASTQ format file to the visualization of the analyzed results, for analyzing small RNA-seq data sequenced from viroid-infected plants.
The FASTQ format file used in this section
is attached in the CircSeqAlignTk package
and can be obtained using the system.file
function.
This FASTQ format file contains 29,178 sequence reads of small RNAs
that were sequenced from a tomato plant
infected with the potato spindle tuber viroid (PSTVd) isolate Cen-1 (FR851463).
The genome sequence of PSTVd isolate Cen-1 in FASTA format can be downloaded
from GenBank
or ENA
using the accession number FR851463.
It is also attached in the CircSeqAlignTk package,
and can be obtained using the system.file
function.
To ensure alignment quality, trimming adapter sequences from the sequence reads is required, because most sequence reads in this FASTQ format file contain adapters with sequence “AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC”. Here, we use AdapterRemoval (Schubert, Lindgreen, and Orlando (2016)) implemented in the Rbowtie2 package (Wei et al. 2018) to trim the adapters before aligning the sequence reads. Note that the length of small RNAs derived from viroids is known to be in the range of 21–24 nt. Therefore, we set an argument to remove sequence reads with lengths outside this range after adapter removal.
library(R.utils)
library(Rbowtie2)
adapter <- 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
# decompressed the gzip file for trimming to avoid errors from `remove_adapters`
gunzip(fq, destname = file.path(ws, 'srna.fq'), overwrite = TRUE, remove = FALSE)
trimmed_fq <- file.path(ws, 'srna_trimmed.fq')
params <- '--maxns 1 --trimqualities --minquality 30 --minlength 21 --maxlength 24'
remove_adapters(file1 = file.path(ws, 'srna.fq'),
adapter1 = adapter,
adapter2 = NULL,
output1 = trimmed_fq,
params,
basename = file.path(ws, 'AdapterRemoval.log'),
overwrite = TRUE)
After obtaining the cleaned FASTQ format file (i.e., srna_trimmed.fq.gz
),
we build index files and perform alignment
using the build_index
and align_reads
functions
implemented in the CircSeqAlignTk package.
To precisely align the reads to the circular genome sequence of the viroid,
the alignment is performed in
two stages.
ref_index <- build_index(input = genome_seq,
output = file.path(ws, 'index'))
aln <- align_reads(input = trimmed_fq,
index = ref_index,
output = file.path(ws, 'align_results'))
The index files are stored in a directory specified by the output
argument
of the build_index
function.
The intermediate files (e.g., FASTQ format files used as inputs)
and alignment results (e.g., BAM format files) are stored in
the directory specified by the output
argument of the align_reads
function.
BAM format files with the suffixes of .clean.t1.bam
and .clean.t2.bam
are
the final results obtained after alignment.
Refer to the sections
4.2 and 4.3
for a detailed description of each of the files generated by each function.
The alignment coverage can be summarized with the calc_coverage
function.
This function loads the alignment results
(i.e., *.clean.t1.bam
and *.clean.t2.bam
),
calculates alignment coverage from these BAM format files,
and combines them into two data frames according to the aligned strands.
alncov <- calc_coverage(aln)
head(get_slot_contents(alncov, 'forward')) # alignment coverage in forward strand
## L21 L22 L23 L24
## [1,] 13 12 1 1
## [2,] 13 12 1 1
## [3,] 13 12 1 1
## [4,] 13 13 1 1
## [5,] 13 13 1 1
## [6,] 13 13 1 1
## L21 L22 L23 L24
## [1,] 7 5 0 1
## [2,] 7 5 0 1
## [3,] 7 5 0 1
## [4,] 7 5 0 1
## [5,] 7 5 0 1
## [6,] 7 5 0 1
The alignment coverage can be then visualized using the plot
function
(Figure 1).
The scale of the upper and lower directions indicate alignment coverage of
the forward and reverse strands, respectively.
Circular genome sequences are generally represented as linear sequences in the FASTA format during analysis. Consequently, sequence reads obtained from organelles or organisms with circular genome sequences can be aligned anywhere, including at both ends of the sequence represented in the FASTA format. Using existing alignment tools such as Bowtie2 (Langmead and Salzberg 2012) and HISAT2 (Kim et al. 2019) to align such sequence reads onto circular sequences may fail, because these tools are designed to align sequence reads to linear genome sequences and their implementation does not assume that a single read can be aligned to both ends of a linear sequence. To solve this problem, that is, allowing reads to be aligned to both ends, the CircSeqAlignTk package implements a two-stage alignment process (Figure 2), using these existing alignment tools (Bowtie2 and HISAT2).
To prepare for the two-stage alignment process, two types of reference
sequences are generated from the same circular genome sequence.
The type 1 reference sequence is a linear sequence generated
by cutting a circular sequence at an arbitrary location.
The type 2 reference is generated
by restoring the type 1 reference sequence into a circular sequence and
cutting the circle at the opposite position to type 1 reference sequence.
The type 1 reference sequence is the input genome sequence itself,
while the type 2 reference sequence is newly created
(by the build_index
function).
Once the two reference sequences are generated, the sequence reads are aligned to the two types of reference sequences in two stages: (i) aligning all sequence reads onto the type 1 reference sequences, and (ii) collecting the unaligned sequence reads and aligning them to the type 2 reference. Alignment can be performed with Bowtie2 or HISAT2 depending on the options specified by the user.
The build_index
function is designed
to generate type 1 and type 2 reference sequences for alignment.
This function has two required arguments,
input
and output
which are used for
specifying a file path to a genome sequence in FASTA format and
a directory path to save the generated type 1 and type 2 reference sequences,
respectively.
The type 1 and type 2 reference sequences are saved in files
refseq.t1.fa
and refseq.t2.fa
in FASTA format, respectively.
Following the generation of reference sequences,
The build_index
function then creates index files
for each reference sequence for alignment.
The index files are saved with the prefix refseq.t1.*
and refseq.t2.*
.
They correspond to the type 1 and 2 reference sequences
(i.e., refseq.t1.fa
and refseq.t2.fa
), respectively.
The extension of index files depends on the alignment tool.
Two alignment tools
(Bowtie2 and
HISAT2) can be specified for
creating index files through the aligner
argument.
If Bowtie2
is specified, then the extension is .bt2
or .bt2l
;
if HISAT2 is specified,
then the extension is .ht2
or .ht2l
.
By default, Bowtie2
is used.
The build_index
function first attempts to call the specified alignment tool
directly installed on the operation system; however, if the tool is not
installed, the function will then attempt to call the bowtie2_build
or
hisat2_build
functions implemented in Rbowtie2
or Rhisat2 packages for indexing.
For example, to generate reference sequences and index files for alignment
against the viroid PSTVd isolate Cen-1 (FR851463) using
Bowtie2/Rbowtie2,
we set the argument input
to the FASTA format file containing the sequence
of FR851463 and execute the build_index
function.
The generated index files will be saved into the folder
specified by the argument output
.
genome_seq <- system.file(package = 'CircSeqAlignTk', 'extdata', 'FR851463.fa')
ref_index <- build_index(input = genome_seq, output = file.path(ws, 'index'))
The function returns a CircSeqAlignTkRefIndex
class object that contains the
file path to type 1 and 2 reference sequences and corresponding index files.
The data structure of CircSeqAlignTkRefIndex
can be verified using the
str
function.
## Formal class 'CircSeqAlignTkRefIndex' [package "CircSeqAlignTk"] with 6 slots
## ..@ name : chr "FR851463"
## ..@ seq : chr "CGGAACTAAACTCGTGGTTCCTGTGGTTCACACCTGACCTCCTGAGCAGGAAAGAAAAAAGAATTGCGGCTCGGAGGAGCGCTTCAGGGATCCCCGGGGAAACCTGGAGCG"| __truncated__
## ..@ length : int 361
## ..@ fasta : chr [1:2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t1.fa" "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t2.fa"
## ..@ index : chr [1:2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t1" "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t2"
## ..@ cut_loc: num 180
The file path to type 1 and type 2 reference sequences, refseq.type1.fa
and
refseq.type2.fa
, can be checked through the @fasta
slot
using the get_slot_contents
function.
## [1] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t1.fa"
## [2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t2.fa"
The file path (prefix) to the index files, refseq.t1.*.bt2
and
refseq.t2.*.bt2
, can be checked through @index
slot.
## [1] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t1"
## [2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t2"
Note that, users can simply use the slot
function or @
operator
to access these slot contents instead of using the get_slot_contents
function.
For example,
As mentioned previously,
the type 2 reference is generated
by restoring the type 1 reference sequence to a circular sequence
and cutting the circular sequence at the opposite position of type 1.
The cutting position based on the type 1 reference
sequence coordinate can be checked from the @cut_loc
slot.
## [1] 180
By default,
Bowtie2/Rbowtie2
is used for indexing.
This can be changed to
HISAT2/Rhisat2 using
the aligner
argument.
The align_reads
function is used to align sequence reads onto a circular
genome sequence.
This function requires three arguments: input
, index
, and output
,
which are used to specify a file path to RNA-seq reads in FASTQ format,
a CircSeqAlignTkRefIndex
class object generated by the build_index
function,
and a directory path to save the intermediate and final results, respectively.
This function aligns sequence reads within
the two-stage alignment process described above.
Thus,
it (i) aligns reads to the type 1 reference sequence (i.e., refseq.t1.fa
)
and (ii) collects the unaligned reads and aligns
them with the type 2 reference sequence (i.e., refseq.t2.fa
).
Two alignment tools
(Bowtie2
and HISAT2)
can be specified for sequence read alignment.
By default, Bowtie2
is used, and it can be changed with the alinger
argument.
Similar to the build_index
function,
the align_reads
function first attempts to call the specified alignment tool
directly installed on the operation system;
however, if the tool is not installed,
the function then attempts to call
the bowtie2_build
or hisat2_build
function implemented in
Rbowtie2 or Rhisat2 packages for alignment.
The following example is aligning RNA-Seq reads in FASTQ format (fq
)
on the reference index (ref_index
) of PSTVd isolate Cen-1 (FR851463) which was generated
at the section 4.2.
The alignment results will be stored into the folder specified
by the argument output
.
fq <- system.file(package = 'CircSeqAlignTk', 'extdata', 'srna.fq.gz')
# trimming the adapter sequences if needed before alignment, omitted here.
aln <- align_reads(input = fq,
index = ref_index,
output = file.path(ws, 'align_results'))
This function returns a CircSeqAlignTkAlign
class object containing the
path to the intermediate files and final alignment results.
## Formal class 'CircSeqAlignTkAlign' [package "CircSeqAlignTk"] with 6 slots
## ..@ input_fastq: chr "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmplUApSM/Rinst3b1b3945f34265/CircSeqAlignTk/extdata/srna.fq.gz"
## ..@ fastq : chr [1:2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmplUApSM/Rinst3b1b3945f34265/CircSeqAlignTk/extdata/srna.fq.gz" "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/align_results/srna.t2.fq.gz"
## ..@ bam : chr [1:2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/align_results/srna.t1.bam" "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/align_results/srna.t2.bam"
## ..@ clean_bam : chr [1:2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/align_results/srna.clean.t1.bam" "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/align_results/srna.clean.t2.bam"
## ..@ stats :'data.frame': 4 obs. of 5 variables:
## .. ..$ n_reads : num [1:4] 29178 29012 166 30
## .. ..$ aligned_fwd : num [1:4] 89 22 89 21
## .. ..$ aligned_rev : num [1:4] 77 9 77 9
## .. ..$ unaligned : num [1:4] 29012 28981 0 0
## .. ..$ unsorted_reads: num [1:4] 0 0 0 0
## ..@ reference :Formal class 'CircSeqAlignTkRefIndex' [package "CircSeqAlignTk"] with 6 slots
## .. .. ..@ name : chr "FR851463"
## .. .. ..@ seq : chr "CGGAACTAAACTCGTGGTTCCTGTGGTTCACACCTGACCTCCTGAGCAGGAAAGAAAAAAGAATTGCGGCTCGGAGGAGCGCTTCAGGGATCCCCGGGGAAACCTGGAGCG"| __truncated__
## .. .. ..@ length : int 361
## .. .. ..@ fasta : chr [1:2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t1.fa" "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t2.fa"
## .. .. ..@ index : chr [1:2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t1" "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/index/refseq.t2"
## .. .. ..@ cut_loc: num 180
The alignment results are saved as BAM format files
in the specified directory with the suffixes of *.t1.bam
and *.t2.bam
.
The original alignment results may contain mismatches.
Hence, this function performs filtering
to remove alignment with the mismatches over the specified value
from the BAM format file.
Filtering results for *.t1.bam
and *.t2.bam
are saved as *.clean.t1.bam
and *.clean.t2.bam
, respectively.
The path to the original and filtered BAM format files
can be checked using @bam
and @clean_bam
slots, respectively.
## [1] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/align_results/srna.t1.bam"
## [2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/align_results/srna.t2.bam"
## [1] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/align_results/srna.clean.t1.bam"
## [2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/Rtmp414P90/align_results/srna.clean.t2.bam"
The alignment statistics (for example, number of input sequence reads,
number of aligned reads) can be checked using the @stats
slot.
## n_reads aligned_fwd aligned_rev unaligned unsorted_reads
## srna.t1.bam 29178 89 77 29012 0
## srna.t2.bam 29012 22 9 28981 0
## srna.clean.t1.bam 166 89 77 0 0
## srna.clean.t2.bam 30 21 9 0 0
By default, the align_read
function allows a single mismatch in the alignment
of each read (i.e., n_mismatch = 1
).
To forbid a mismatch or allow more mismatches,
assign 0
or a large number to the n_mismatch
argument.
aln <- align_reads(input = fq,
index = ref_index,
output = file.path(ws, 'align_results'),
n_mismatch = 0)
The number of threads for alignment
can be specified using the n_threads
argument.
Setting a large number of threads
(but not exceeding the computer limits)
can accelerate the speed of alignment.
aln <- align_reads(input = fq,
index = ref_index,
output = file.path(ws, 'align_results'),
n_threads = 4)
Additional arguments to be directly passed on to the alignment tool can be
specified with the add_args
argument.
For example, to increase the alignment sensitivity,
we set the maximum number of mismatches to 1
and the length of seed substrings for alignment to 20
during the process of the Bowtie2
multiseed alignment.
See the
Bowtie2 website
to find additional parameters of Bowtie2.
aln <- align_reads(input = fq,
index = ref_index,
output = file.path(ws, 'align_results'),
add_args = '-L 20 -N 1')
To use
HISAT2/Rhisat2,
assign hisat2
to the aligner
argument.
Summarization and visualization of the alignment results can be performed with
the calc_coverage
and plot
functions, respectively.
The calc_coverage
function calculates alignment coverage
from the two BAM files, *.clean.t1.bam
and *.clean.t2.bam
,
generated by the align_reads
function.
This function returns a CircSeqAlignTkCoverage
class object.
Alignment coverage of the reads aligned in the forward and reverse strands
are stored in the @forward
and @reverse
slots, respectively,
as a data frame.
## L21 L22 L23 L24
## [1,] 12 8 1 0
## [2,] 12 8 1 0
## [3,] 12 8 1 0
## [4,] 12 9 1 0
## [5,] 13 9 1 0
## [6,] 13 9 1 0
## L21 L22 L23 L24
## [1,] 5 4 0 0
## [2,] 5 4 0 0
## [3,] 5 4 0 0
## [4,] 5 4 0 0
## [5,] 5 4 0 0
## [6,] 5 4 0 0
Coverage can be visualized with an area chart using the plot
function.
In the chart, the upper and lower directions of the y-axis
represent the alignment coverage of reads with forward and reverse strands,
respectively.