QuasR 1.45.2
The QuasR package (short for Quantify and annotate short reads
in R) integrates the functionality of several R packages (such as IRanges (Lawrence et al. 2013)
and Rsamtools) and external software (e.g. bowtie
, through the
Rbowtie package, and HISAT2
, through the Rhisat2 package).
The package aims to cover the whole analysis workflow of typical high throughput
sequencing experiments, starting from the raw sequence reads, over pre-processing and
alignment, up to quantification. A single R script can contain all steps of a complete
analysis, making it simple to document, reproduce or share the workflow containing all
relevant details.
The current QuasR release supports the analysis of single read and paired-end ChIP-seq (chromatin immuno-precipitation combined with sequencing), RNA-seq (gene expression profiling by sequencing of RNA) and Bis-seq (measurement of DNA methylation by sequencing of bisulfite-converted genomic DNA) experiments. It has been successfully used with data from Illumina, 454 Life Technologies and SOLiD sequencers, the latter by using bam files created externally of QuasR.
If you use QuasR (Gaidatzis et al. 2015) in your work, you can cite it as follows:
citation("QuasR")
## Please use the QuasR reference below to cite the software itself. If
## you were using qAlign with Rbowtie as aligner, it can be cited as
## Langmead et al. (2009) (unspliced alignments) or Au et al. (2010)
## (spliced alignments). If you were using qAlign with Rhisat2 as aligner,
## it can be cited as Kim et al. (2015).
##
## Gaidatzis D, Lerch A, Hahne F, Stadler MB. QuasR: Quantification and
## annotation of short reads in R. Bioinformatics 31(7):1130-1132
## (2015).
##
## Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and
## memory-efficient alignment of short DNA sequences to the human
## genome. Genome Biology 10(3):R25 (2009).
##
## Au KF, Jiang H, Lin L, Xing Y, Wong WH. Detection of splice junctions
## from paired-end RNA-seq data by SpliceMap. Nucleic Acids Research,
## 38(14):4570-8 (2010).
##
## Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with
## low memory requirements. Nat Methods, 12(4):357-60 (2015).
##
## This free open-source software implements academic research by the
## authors and co-workers. If you use it, please support the project by
## citing the appropriate journal articles.
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
QuasR is a package for the R computing environment and it is assumed that you have already installed R. See the R project at (http://www.r-project.org). To install the latest version of QuasR, you will need to be using the latest version of R. QuasR is part of the Bioconductor project at (http://www.bioconductor.org). To get QuasR together with its dependencies you can use
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("QuasR")
Bioconductor works on a 6-monthly official release cycle. As with other Bioconductor
packages, there are always two versions of QuasR. Most users will
use the current official release version, which will be installed by BiocManager::install
if you are using the current release version of R. There is also a development version
of QuasR that includes new features due for the next official release.
The development version will be installed if you are using the development version of
Bioconductor (see version = "devel"
in BiocManager). The official
release version always has an even second number (for example 1.20.1), whereas the
developmental version has an odd second number (for example 1.21.4).
In order to run the code examples in this vignette, the QuasR package and a few additional packages need to be loaded:
suppressPackageStartupMessages({
library(QuasR)
library(BSgenome)
library(Rsamtools)
library(rtracklayer)
library(GenomicFeatures)
library(txdbmaker)
library(Gviz)
})
Most questions about QuasR will hopefully be answered by the documentation or references. If you’ve run into a question which isn’t addressed by the documentation, or you’ve found a conflict between the documentation and software, then there is an active support community which can offer help.
The authors of the package (maintainer: Michael Stadler michael.stadler@fmi.ch) always appreciate receiving reports of bugs in the package functions or in the documentation. The same goes for well-considered suggestions for improvements.
Any other questions or problems concerning QuasR should be posted to the Bioconductor support site (https://support.bioconductor.org). Users posting to the support site for the first time should read the helpful posting guide at (https://support.bioconductor.org/info/faq/). Note that each function in QuasR has it’s own help page, as described in the section 3.1. Posting etiquette requires that you read the relevant help page carefully before posting a problem to the site.
If you already use R and know about its interface, just skip this section and continue with section 3.2.
The structure of this vignette and in particular this section is based on the excellent user guide of the limma package, which we would like to hereby acknowledge. R is a program for statistical computing. It is a command-driven language meaning that you have to type commands into it rather than pointing and clicking using a mouse. In this guide it will be assumed that you have successfully downloaded and installed R from (http://www.r-project.org) as well as QuasR (see section 2.2). A good way to get started is to type
help.start()
at the R prompt or, if you’re using R for Windows, to follow the drop-down menu items Help \(\succ\) Html help. Following the links Packages \(\succ\) QuasR from the html help page will lead you to the contents page of help topics for functions in QuasR.
Before you can use any QuasR commands you have to load the package by typing
library(QuasR)
at the R prompt. You can get help on any function in any loaded package by typing
?
and the function name at the R prompt, for example
?preprocessReads
or equivalently
help("preprocessReads")
for detailed help on the preprocessReads
function. The function help
page is especially useful to learn about its arguments and its return value.
Working with R usually means creating and transforming objects. Objects might include data sets, variables, functions, anything at all. For example
x <- 2
will create a variable x
and will assign it the value 2. At any stage of your
R session you can type
ls()
to get a list of all the objects currently existing in your session. You can
display an object by typing its name on the prompt. The following displays the
object x
:
x
We hope that you can use QuasR without having to spend a lot of time learning about the R language itself but a little knowledge in this direction will be very helpful, especially when you want to do something not explicitly provided for in QuasR or in the other Bioconductor packages. For more details about the R language see An Introduction to R which is available from the online help, or one of the many great online resources, like the documentation at r-project.org, the growing list of free books at bioconductor.org, or the books from rstudio.com (many of which are also available for free). For more background on using R for statistical analysis see (Dalgaard 2002).
This is a quick overview of what an analysis could look like for users preferring
to jump right into an analysis. The example uses data that is provided with the
QuasR package, which is first copied to the current working directory,
into a subfolder called "extdata"
:
file.copy(system.file(package = "QuasR", "extdata"), ".", recursive = TRUE)
## [1] TRUE
The sequence files to be analyzed are listed in sampleFile
(see section
5.1 for details).
The sequence reads will be aligned using bowtie
(Langmead et al. 2009) (from the Rbowtie
package (Hahne, Lerch, and Stadler 2012)) to a small reference genome (consisting of three short segments
from the hg19 human genome assembly, available in full for example in the
BSgenome.Hsapiens.UCSC.hg19 package). Information on selecting
an appropriate reference genome is summarized in section 5.3.
Make sure that you have sufficient disk space, both in your R temporary
directory (tempdir()
) as well as to store the resulting alignments
(see section 7.2).
sampleFile <- "extdata/samples_chip_single.txt"
genomeFile <- "extdata/hg19sub.fa"
proj <- qAlign(sampleFile, genomeFile)
## Creating .fai file for: /tmp/RtmpIlvpBW/Rbuild2330404e7ceba2/QuasR/vignettes/extdata/hg19sub.fa
## alignment files missing - need to:
## create alignment index for the genome
## create 2 genomic alignment(s)
## Creating an Rbowtie index for /tmp/RtmpIlvpBW/Rbuild2330404e7ceba2/QuasR/vignettes/extdata/hg19sub.fa
## Finished creating index
## Testing the compute nodes...OK
## Loading QuasR on the compute nodes...preparing to run on 1 nodes...done
## Available cores:
## nebbiolo2: 1
## Performing genomic alignments for 2 samples. See progress in the log file:
## /tmp/RtmpIlvpBW/Rbuild2330404e7ceba2/QuasR/vignettes/QuasR_log_2345ca1ea08b51.txt
## Genomic alignments have been created successfully
proj
## Project: qProject
## Options : maxHits : 1
## paired : no
## splicedAlignment: FALSE
## bisulfite : no
## snpFile : none
## geneAnnotation : none
## Aligner : Rbowtie v1.45.0 (parameters: -m 1 --best --strata)
## Genome : /tmp/RtmpIlvpBW/Rbuild2330404e7ceba2/QuasR/vig.../hg19sub.fa (file)
##
## Reads : 2 files, 2 samples (fastq format):
## 1. chip_1_1.fq.bz2 Sample1 (phred33)
## 2. chip_2_1.fq.bz2 Sample2 (phred33)
##
## Genome alignments: directory: same as reads
## 1. chip_1_1_2345ca234088fb.bam
## 2. chip_2_1_2345ca7fd8dfff.bam
##
## Aux. alignments: none
The proj
object keeps track of all the information of a sequencing experiment,
for example where sequence and alignment files are stored, and what aligner and
reference genome was used to generate the alignments.
Now that the alignments have been generated, further analyses can be performed.
A quality control report is saved to the "extdata/qc_report.pdf"
file using the
qQCReport
function.
qQCReport(proj, "extdata/qc_report.pdf")
## collecting quality control data
## creating QC plots
The number of alignments per promoter region is quantified using qCount
. Genomic
coordinates for promoter regions are imported from a gtf file (annotFile
) into
the GRanges
-object with the name promReg
:
library(rtracklayer)
library(GenomicFeatures)
annotFile <- "extdata/hg19sub_annotation.gtf"
txStart <- import.gff(annotFile, format = "gtf", feature.type = "start_codon")
promReg <- promoters(txStart, upstream = 500, downstream = 500)
names(promReg) <- mcols(promReg)$transcript_name
promCounts <- qCount(proj, query = promReg)
## counting alignments...done
promCounts
## width Sample1 Sample2
## TNFRSF18-003 1000 20 4
## TNFRSF18-002 1000 20 4
## TNFRSF18-001 1000 20 4
## TNFRSF4-001 1000 5 2
## SDF4-007 1000 8 2
## SDF4-001 1000 8 2
## SDF4-002 1000 8 2
## SDF4-201 1000 8 2
## B3GALT6-001 1000 25 274
## RPS7-001 1000 121 731
## RPS7-008 1000 121 731
## RPS7-009 1000 121 731
## RPS7-005 1000 121 731
## C3orf10-201 1000 176 496
## C3orf10-001 1000 176 496
## AC034193.1-201 1000 5 2
## VHL-001 1000 61 336
## VHL-002 1000 61 336
## VHL-201 1000 61 336
The following scheme shows the major components of QuasR and their relationships:
QuasR works with data (sequences and alignments, reference genome, etc.) that are stored as files on your storage (the gray cylinder on the lower left of Figure above, see section 4.1 for details on storage locations). QuasR does not need a database management system, or these files to be named and organized according to a specific scheme.
In order to keep track of directory paths during an analysis, QuasR
makes use of a qProject
object that is returned by the qAlign
function, which
at the minimum requires two inputs: the name of a samples text file (see section
5.1 for details), and the reference genome for the alignments
(see section 5.3).
The qProject
object is the main argument passed to subsequent functions such as
qQCReport
and qCount
. The qProject
object contains all necessary information
on the current project and eliminates the need to repeatedly enter the same information.
All functions that work on qProject
objects can be recognized by their names starting
with the letter q.
Read quantification (apart from quantification of methylation which has its own
function qMeth
) is done using the qCount
function: It counts the alignments in
regions of interest (e.g. promoters, genes, exons, etc.) and produces a count table
(regions in rows, samples in columns) for further visualization and analysis. The
count table can also be used as input to a statistical analysis using packages such
as edgeR (Robinson, McCarthy, and Smyth 2010), DESeq (Anders and Huber 2010), DESeq2 (Love, Huber, and Anders 2014),
TCC (Sun et al. 2013), DEXSeq (Anders, Reyes, and Huber 2012) or baySeq (Hardcastle and Kelly 2010).
In summary, a typical QuasR analysis consists of the following steps (some of them are optional):
preprocessReads
(optional): Remove adapters from start or end of reads, filter
out reads of low quality, short length or low complexity (section 5.4).qAlign
: Create qProject
object and specify project parameters. Also download
BSgenome package, create aligner indices and align reads if not already existing
(section 7.2).qQCReport
(optional): Create quality control report with plots on sequence qualities
and alignment statistics (section 7.4).qExportWig
(optional): Export genomic alignments as wiggle tracks for genome
browser visualization (section 7.6).qCount
: Quantify alignments in regions of interest (section 7.7).Recurrent example tasks that may be part of any typical analysis are described in section 5. Example workflows for specific experiment types (ChIP-seq, RNA-seq and Bis-seq) are described in section 6.
Apart from qExportWig
and qQCReport
, which generate wig files and pdf reports,
qAlign
is the only function in QuasR that stores files on the disk
(see section 7.2 for details). All files generated by qAlign
are listed here
by type, together with their default location and how locations can be changed.
tempdir()
): Temporary files include reference genomes
in fasta
format, decompressed input sequence files, and temporary alignments in
text format, and can require a large amount of disk space. By default, these files
will be written to the temporary directory of the R process (as reported by
tempdir()
). If using clObj
for parallel processing, this may be the tempdir()
from the cluster node(s). An alternative location can be set using the TMPDIR
environment variable (see ?tempdir
).bam
format) (default: same directory as the input sequence files):
Alignments against reference genome and auxiliary targets are stored in bam
format
in the same directory that also contains the input sequence file (listed in sampleFile
).
Please note that if the input sequence file corresponds to a symbolic link, QuasR
will follow the link and use the directory of the original file instead. An alternative
directory can be specified with the alignmentsDir
argument from qAlign
, which
will store all bam
files in that directory even if the input sequence files are located
in different directories.genome
and snpFile
arguments):
Many alignment tools including bowtie
require an index of the reference sequence
to perform alignments. If necessary, qAlign
will build this index automatically
and store it in a default location that depends on the genome
argument:
BSgenome
: If genome
is the name of a BSgenome package
(such as "BSgenome.Hsapiens.UCSC.hg19"
), the index will be stored as a
new R package in the default library path (as reported by .libPaths()[1]
,
see ?install.packages
for details), or alternatively in the library
specified by the lib.loc
argument of qAlign
. The name of this index
package will be the name of the original BSgenome package
with a suffix for the index type, for example "BSgenome.Hsapiens.UCSC.hg19.Rbowtie"
.fasta
: If genome
refers to a reference genome file in fasta
format,
the index will be stored in a subdirectory at the same location. Similarly,
the indices for files listed in auxiliaryFile
are store at the location
of these files. For example, the Rbowtie
index for the genome at
"./genome/mm9.fa"
is stored in "./genome/mm9.fa.Rbowtie"
.snpFile
(e.g. "./mySNPs.tab"
)
are injected into the genome
(e.g. "BSgenome.Mmusculus.UCSC.mm9"
)
to create two variant genomes to be indexed. These indices are saved at the
location of the snpFile
in a directory named after snpFile
, genome
and the index type (e.g. "./mySNPs.tab.BSgenome.Mmusculus.UCSC.mm9.A.fa.Rbowtie"
).The sample file is a tab-delimited text file with two or three columns. The first row contains the column names: For a single read experiment, these are ‘FileName’ and ‘SampleName’; for a paired-end experiment, these are ‘FileName1’, ‘FileName2’ and ‘SampleName’. If the first row does not contain the correctly spelled column names, QuasR will not accept the samples file. Subsequent rows contain the input sequence files.
Here are examples of such sample files for a single read experiment:
FileName SampleName chip_1_1.fq.bz2 Sample1 chip_2_1.fq.bz2 Sample2
and for a paired-end experiment:
FileName1 FileName2 SampleName rna_1_1.fq.bz2 rna_1_2.fq.bz2 Sample1 rna_2_1.fq.bz2 rna_2_2.fq.bz2 Sample2
These example files are also contained in the QuasR package and may be used as templates. The path of the files can be determined using:
sampleFile1 <- system.file(package="QuasR", "extdata",
"samples_chip_single.txt")
sampleFile2 <- system.file(package="QuasR", "extdata",
"samples_rna_paired.txt")
The columns FileName for single-read, or FileName1 and FileName2 for paired-end
experiments contain paths and names to files containing the sequence data. The paths
can be absolute or relative to the location of the sample file. This allows combining
files from different directories in a single analysis. For each input sequence file,
qAlign
will create one alignment file and by default store it in the same directory
as the sequence file. Already existing alignment files with identical parameters will
not be re-created, so that it is easy to reuse the same sequence files in multiple
projects without unnecessarily copying sequence files or recreating alignments.
The SampleName column contains sample names for each sequence file. The same name
can be used on several lines to indicate multiple sequence files that belong to the
same sample (qCount
can use this information to automatically combine counts for
one sample from multiple files).
Three file formats are supported for input files (but cannot be mixed within a single sample file):
bam
files. This makes it possible to use alignment tools that are not available
within QuasR, but making use of this option comes with a risk and
should only be used by experienced users. For example, it cannot be guaranteed
any more that certain assumptions made by qCount
are fulfilled by the external
aligner (see below). When using external bam
files, we recommend to use files which contain only one alignment per read. This
may also include multi-hit reads, for which one of the alignments is randomly
selected. This allows QuasR to count the total number of reads by
counting the total number of alignments. Furthermore, if the bam
files also
contain the unmapped reads, QuasR will be able to calculate the
fraction of mapped reads. For bisulfite samples we require ungapped alignments
stored in unpaired or paired ff orientation (even if the input reads are fr).
For allele-specific bam
files, QuasR requires an additional tag
for each alignment called XV
of type A
(printable character) with the possible
values R
(Reference), U
(Unknown) and A
(Alternative).fasta and fastq files can be compressed with gzip, bzip2 or xz (file extensions ‘.gz’, ‘.bz2’ or ‘xz’, respectively) and will be automatically decompressed when necessary.
bam
files after performing alignmentsOnce alignments have been created, most analyses will only require the bam
files
and will not access the original raw sequence files anymore. However, re-creating
a qProject
object by a later identical call to qAlign
will still need access to
the raw sequences to verify consistency between raw data and alignments. It may be
desirable to remove this dependency, for example to archive or move away the raw
sequence files and to reclaim used disk space.
This can be achieved using the following procedure involving two sequential calls
to qAlign
. First, qAlign
is called with the orignial sample file (sampleFile1
)
that lists the raw sequence files, and subsequently with a second sample file
(sampleFile2
) that lists the bam
files generated in the first call. Such a
second sample file can be easily generated given the qProject
object (proj1
)
returned by the first call:
sampleFile1 <- "samples_fastq.txt"
sampleFile2 <- "samples_bam.txt"
proj1 <- qAlign(sampleFile1, genomeFile)
write.table(alignments(proj1)$genome, sampleFile2, sep="\t", row.names=FALSE)
proj2 <- qAlign(sampleFile2, genomeFile)
The analysis can now be exclusively based on the bam
files using sampleFile2
and proj2
.
The sample file implicitly defines the type of samples contained in the project:
single read or paired-end read, sequences with or without qualities.
This type will have a profound impact on the downstream analysis. For example,
it controls whether alignments will be performed in single or paired-end mode,
either with or without base qualities. That will also determine availability of
certain options for quality control and quantification in qQCReport
and qCount
.
For consistency, it is therefore required that all samples within a project have
the same type; it is not possible to mix both single and paired-end read samples,
or fasta and fastq files in a single project (sample file). If necessary,
it may be possible to analyse different types of files in separate QuasR
projects and combine the derived results at the end.
By default QuasR aligns reads only to the reference genome. However,
it may be interesting to align non-matching reads to further targets, for example
to identify contamination from vectors or a different species, or in order to
quantify spike-in material not contained in the reference genome. In QuasR,
such supplementary reference files are called auxiliary references and can be
specified to qAlign
using the auxiliaryFile
argument (see section 7.2
for details). The format of the auxiliary file is similar to the one of the sample
file described in section 5.1: It contains two columns with
column names ‘FileName’ and ‘AuxName’ in the first row. Additional rows contain
names and files of one or several auxiliary references in fasta
format.
An example auxiliary file looks like this:
FileName AuxName NC_001422.1.fa phiX174
and is available from your QuasR installation at
auxFile <- system.file(package = "QuasR", "extdata", "auxiliaries.txt")
Sequence reads are primarily aligned against the reference genome (see section 5.3.1 on how to choose a suitable reference assembly). If necessary, QuasR will create an aligner index for the genome. The reference genome can be provided in one of two different formats:
available.genomes()
## [1] "BSgenome.Alyrata.JGI.v1"
## [2] "BSgenome.Amellifera.BeeBase.assembly4"
## [3] "BSgenome.Amellifera.NCBI.AmelHAv3.1"
## [4] "BSgenome.Amellifera.UCSC.apiMel2"
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [6] "BSgenome.Aofficinalis.NCBI.V1"
## [7] "BSgenome.Athaliana.TAIR.04232008"
## [8] "BSgenome.Athaliana.TAIR.TAIR9"
## [9] "BSgenome.Btaurus.UCSC.bosTau3"
## [10] "BSgenome.Btaurus.UCSC.bosTau3.masked"
## [11] "BSgenome.Btaurus.UCSC.bosTau4"
## [12] "BSgenome.Btaurus.UCSC.bosTau4.masked"
## [13] "BSgenome.Btaurus.UCSC.bosTau6"
## [14] "BSgenome.Btaurus.UCSC.bosTau6.masked"
## [15] "BSgenome.Btaurus.UCSC.bosTau8"
## [16] "BSgenome.Btaurus.UCSC.bosTau9"
## [17] "BSgenome.Btaurus.UCSC.bosTau9.masked"
## [18] "BSgenome.Carietinum.NCBI.v1"
## [19] "BSgenome.Celegans.UCSC.ce10"
## [20] "BSgenome.Celegans.UCSC.ce11"
## [21] "BSgenome.Celegans.UCSC.ce2"
## [22] "BSgenome.Celegans.UCSC.ce6"
## [23] "BSgenome.Cfamiliaris.UCSC.canFam2"
## [24] "BSgenome.Cfamiliaris.UCSC.canFam2.masked"
## [25] "BSgenome.Cfamiliaris.UCSC.canFam3"
## [26] "BSgenome.Cfamiliaris.UCSC.canFam3.masked"
## [27] "BSgenome.Cjacchus.UCSC.calJac3"
## [28] "BSgenome.Cjacchus.UCSC.calJac4"
## [29] "BSgenome.CneoformansVarGrubiiKN99.NCBI.ASM221672v1"
## [30] "BSgenome.Creinhardtii.JGI.v5.6"
## [31] "BSgenome.Dmelanogaster.UCSC.dm2"
## [32] "BSgenome.Dmelanogaster.UCSC.dm2.masked"
## [33] "BSgenome.Dmelanogaster.UCSC.dm3"
## [34] "BSgenome.Dmelanogaster.UCSC.dm3.masked"
## [35] "BSgenome.Dmelanogaster.UCSC.dm6"
## [36] "BSgenome.Drerio.UCSC.danRer10"
## [37] "BSgenome.Drerio.UCSC.danRer11"
## [38] "BSgenome.Drerio.UCSC.danRer5"
## [39] "BSgenome.Drerio.UCSC.danRer5.masked"
## [40] "BSgenome.Drerio.UCSC.danRer6"
## [41] "BSgenome.Drerio.UCSC.danRer6.masked"
## [42] "BSgenome.Drerio.UCSC.danRer7"
## [43] "BSgenome.Drerio.UCSC.danRer7.masked"
## [44] "BSgenome.Dvirilis.Ensembl.dvircaf1"
## [45] "BSgenome.Ecoli.NCBI.20080805"
## [46] "BSgenome.Gaculeatus.UCSC.gasAcu1"
## [47] "BSgenome.Gaculeatus.UCSC.gasAcu1.masked"
## [48] "BSgenome.Ggallus.UCSC.galGal3"
## [49] "BSgenome.Ggallus.UCSC.galGal3.masked"
## [50] "BSgenome.Ggallus.UCSC.galGal4"
## [51] "BSgenome.Ggallus.UCSC.galGal4.masked"
## [52] "BSgenome.Ggallus.UCSC.galGal5"
## [53] "BSgenome.Ggallus.UCSC.galGal6"
## [54] "BSgenome.Gmax.NCBI.Gmv40"
## [55] "BSgenome.Hsapiens.1000genomes.hs37d5"
## [56] "BSgenome.Hsapiens.NCBI.GRCh38"
## [57] "BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0"
## [58] "BSgenome.Hsapiens.UCSC.hg17"
## [59] "BSgenome.Hsapiens.UCSC.hg17.masked"
## [60] "BSgenome.Hsapiens.UCSC.hg18"
## [61] "BSgenome.Hsapiens.UCSC.hg18.masked"
## [62] "BSgenome.Hsapiens.UCSC.hg19"
## [63] "BSgenome.Hsapiens.UCSC.hg19.masked"
## [64] "BSgenome.Hsapiens.UCSC.hg38"
## [65] "BSgenome.Hsapiens.UCSC.hg38.dbSNP151.major"
## [66] "BSgenome.Hsapiens.UCSC.hg38.dbSNP151.minor"
## [67] "BSgenome.Hsapiens.UCSC.hg38.masked"
## [68] "BSgenome.Hsapiens.UCSC.hs1"
## [69] "BSgenome.Mdomestica.UCSC.monDom5"
## [70] "BSgenome.Mfascicularis.NCBI.5.0"
## [71] "BSgenome.Mfascicularis.NCBI.6.0"
## [72] "BSgenome.Mfuro.UCSC.musFur1"
## [73] "BSgenome.Mmulatta.UCSC.rheMac10"
## [74] "BSgenome.Mmulatta.UCSC.rheMac2"
## [75] "BSgenome.Mmulatta.UCSC.rheMac2.masked"
## [76] "BSgenome.Mmulatta.UCSC.rheMac3"
## [77] "BSgenome.Mmulatta.UCSC.rheMac3.masked"
## [78] "BSgenome.Mmulatta.UCSC.rheMac8"
## [79] "BSgenome.Mmusculus.UCSC.mm10"
## [80] "BSgenome.Mmusculus.UCSC.mm10.masked"
## [81] "BSgenome.Mmusculus.UCSC.mm39"
## [82] "BSgenome.Mmusculus.UCSC.mm8"
## [83] "BSgenome.Mmusculus.UCSC.mm8.masked"
## [84] "BSgenome.Mmusculus.UCSC.mm9"
## [85] "BSgenome.Mmusculus.UCSC.mm9.masked"
## [86] "BSgenome.Osativa.MSU.MSU7"
## [87] "BSgenome.Ppaniscus.UCSC.panPan1"
## [88] "BSgenome.Ppaniscus.UCSC.panPan2"
## [89] "BSgenome.Ptroglodytes.UCSC.panTro2"
## [90] "BSgenome.Ptroglodytes.UCSC.panTro2.masked"
## [91] "BSgenome.Ptroglodytes.UCSC.panTro3"
## [92] "BSgenome.Ptroglodytes.UCSC.panTro3.masked"
## [93] "BSgenome.Ptroglodytes.UCSC.panTro5"
## [94] "BSgenome.Ptroglodytes.UCSC.panTro6"
## [95] "BSgenome.Rnorvegicus.UCSC.rn4"
## [96] "BSgenome.Rnorvegicus.UCSC.rn4.masked"
## [97] "BSgenome.Rnorvegicus.UCSC.rn5"
## [98] "BSgenome.Rnorvegicus.UCSC.rn5.masked"
## [99] "BSgenome.Rnorvegicus.UCSC.rn6"
## [100] "BSgenome.Rnorvegicus.UCSC.rn7"
## [101] "BSgenome.Scerevisiae.UCSC.sacCer1"
## [102] "BSgenome.Scerevisiae.UCSC.sacCer2"
## [103] "BSgenome.Scerevisiae.UCSC.sacCer3"
## [104] "BSgenome.Sscrofa.UCSC.susScr11"
## [105] "BSgenome.Sscrofa.UCSC.susScr3"
## [106] "BSgenome.Sscrofa.UCSC.susScr3.masked"
## [107] "BSgenome.Tgondii.ToxoDB.7.0"
## [108] "BSgenome.Tguttata.UCSC.taeGut1"
## [109] "BSgenome.Tguttata.UCSC.taeGut1.masked"
## [110] "BSgenome.Tguttata.UCSC.taeGut2"
## [111] "BSgenome.Vvinifera.URGI.IGGP12Xv0"
## [112] "BSgenome.Vvinifera.URGI.IGGP12Xv2"
## [113] "BSgenome.Vvinifera.URGI.IGGP8X"
genomeName <- "BSgenome.Hsapiens.UCSC.hg19"
In this example, the BSgenome package "BSgenome.Hsapiens.UCSC.hg19"
refers to
an unmasked genome; alignment index and alignments will be performed on the full
unmasked genome sequence (recommended). If using a masked genome (e.g. "BSgenome.Hsapiens.UCSC.hg19.masked"
),
masked regions will be replaced with "N"
bases, and this hard-masked version of
the genome will be used for creating the alignment index and further alignments.
Please also see section 5.3.1 for potential issues with redundant
sequences contained in the reference genome, e.g. in BSgenome.Hsapiens.UCSC.hg19
or BSgenome.Hsapiens.UCSC.hg38.
fasta
format:genomeFile <- system.file(package="QuasR", "extdata", "hg19sub.fa")
For some organisms, several versions of the genome assembly exist. These differ for example in whether or not they include alternative variants for sequences that are variable within the species. This may lead to redundant sequences in the assembly, and thus reads mapping to such sequences being wrongly classified as “multi-mapping”, and comparing data aligned to different assembly version may give rise to incorrect results. A nice summary of this issue is provided in this blog post from Heng Li.
The BSgenome packages BSgenome.Hsapiens.UCSC.hg19 (versions newer than 1.4.0 from Bioconductor 3.10) and BSgenome.Hsapiens.UCSC.hg38 (all versions) do contain such redundant sequences and are therefore not ideal references for alignment of human data. Specific “analysis set” or “primary assembly” versions of the assembly should be used instead (see the before-mentioned blog post for details).
When using a BSgenome reference, QuasR will check
in the qAlign
function whether the chromosome names and lengths contained
in the header of any pre-existing bam files are identical to the ones provided by
the genome and warn if this is not the case.
The preprocessReads
function can be used to prepare the input sequence files
prior to alignment. The function takes one or several sequence files (or pairs
of files for a paired-end experiment) in fasta
or fastq
format as input and
produces the same number of output files with the processed reads.
In the following example, we truncate the reads by removing the three bases from
the 3’-end (the right side), remove the adapter sequence AAAAAAAAAA
from the
5’-end (the left side) and filter out reads that, after truncation and adapter
removal, are shorter than 14 bases or contain more than 2 N
bases:
td <- tempdir()
infiles <- system.file(package = "QuasR", "extdata",
c("rna_1_1.fq.bz2","rna_2_1.fq.bz2"))
outfiles <- file.path(td, basename(infiles))
res <- preprocessReads(filename = infiles,
outputFilename = outfiles,
truncateEndBases = 3,
Lpattern = "AAAAAAAAAA",
minLength = 14,
nBases = 2)
## filtering /tmp/RtmpIlvpBW/Rinst2330407d029a5/QuasR/extdata/rna_1_1.fq.bz2
## filtering /tmp/RtmpIlvpBW/Rinst2330407d029a5/QuasR/extdata/rna_2_1.fq.bz2
res
## rna_1_1.fq.bz2 rna_2_1.fq.bz2
## totalSequences 3002 3000
## matchTo5pAdapter 466 463
## matchTo3pAdapter 0 0
## tooShort 107 91
## tooManyN 0 0
## lowComplexity 0 0
## totalPassed 2895 2909
unlink(outfiles)
preprocessReads
returns a matrix with a summary of the pre-processing. The matrix
contains one column per (pair of) input sequence files, and contains the total
number of reads (totalSequences
), the number of reads that matched to the five
prime or three prime adapters (matchTo5pAdapter
and matchTo3pAdapter
), the
number of reads that were too short (tooShort
), contained too many non-base
characters (tooManyN
) or were of low sequence complexity (lowComplexity
,
deactivated by default). Finally, the number of reads that passed the filtering
steps is reported in the last row (totalPassed
).
In the example below we process paired-end reads, removing all pairs with one or
several N
bases. Even if only one sequence in a pair fulfills the filtering
criteria, both reads in the pair are removed, thereby preserving the matching
order of the sequences in the two files:
td <- tempdir()
infiles1 <- system.file(package = "QuasR", "extdata", "rna_1_1.fq.bz2")
infiles2 <- system.file(package = "QuasR", "extdata", "rna_1_2.fq.bz2")
outfiles1 <- file.path(td, basename(infiles1))
outfiles2 <- file.path(td, basename(infiles2))
res <- preprocessReads(filename = infiles1,
filenameMate = infiles2,
outputFilename = outfiles1,
outputFilenameMate = outfiles2,
nBases = 0)
## filtering /tmp/RtmpIlvpBW/Rinst2330407d029a5/QuasR/extdata/rna_1_1.fq.bz2 and
## /tmp/RtmpIlvpBW/Rinst2330407d029a5/QuasR/extdata/rna_1_2.fq.bz2
res
## rna_1_1.fq.bz2:rna_1_2.fq.bz2
## totalSequences 3002
## matchTo5pAdapter NA
## matchTo3pAdapter NA
## tooShort 0
## tooManyN 3
## lowComplexity 0
## totalPassed 2999
More details on the preprocessReads
function can be found in the function
documentation (see ?preprocessReads
) or in the section 7.1.
Here we show an exemplary single-end ChIP-seq workflow using a small number of
reads from a histone 3 lysine 4 trimethyl (H3K4me3) ChIP-seq experiment. This
histone modification is known to locate to genomic regions with a high density
of CpG dinucleotides (so called CpG islands); about 60% of mammalian genes have
such a CpG island close to their transcript start site. All necessary files are
included in the QuasR package, and we start the example workflow
by copying those files into the current working directly, into a subfolder called "extdata"
:
file.copy(system.file(package = "QuasR", "extdata"), ".", recursive = TRUE)
## [1] TRUE
qAlign
functionWe assume that the sequence reads have already been pre-processed as described
in section 5.4. Also, a sample file (section 5.1)
that lists all sequence files to be analyzed has been prepared. A fasta
file
with the reference genome sequence(s) is also available (section 5.3),
as well as an auxiliary file for alignment of reads that failed to match the reference
genome (section 5.2).
By default, newly generated bam
files will be stored at the location of the input
sequence files, which should be writable and have sufficient capacity (an alternative
location can be specified using the alignmentsDir
argument). Make also sure that
you have sufficient temporary disk space for intermediate files in tempdir()
(see section 7.2). We start by aligning the reads using qAlign
:
sampleFile <- "extdata/samples_chip_single.txt"
auxFile <- "extdata/auxiliaries.txt"
genomeFile <- "extdata/hg19sub.fa"
proj1 <- qAlign(sampleFile, genome = genomeFile, auxiliaryFile = auxFile)
## alignment files missing - need to:
## create 2 auxiliary alignment(s)
## Creating an Rbowtie index for /tmp/RtmpIlvpBW/Rbuild2330404e7ceba2/QuasR/vignettes/extdata/NC_001422.1.fa
## Finished creating index
## Testing the compute nodes...OK
## Loading QuasR on the compute nodes...preparing to run on 1 nodes...done
## Available cores:
## nodeNames
## nebbiolo2
## 1
## Performing auxiliary alignments for 2 samples. See progress in the log file:
## /tmp/RtmpIlvpBW/Rbuild2330404e7ceba2/QuasR/vignettes/QuasR_log_2345ca767b7b98.txt
## Auxiliary alignments have been created successfully
proj1
## Project: qProject
## Options : maxHits : 1
## paired : no
## splicedAlignment: FALSE
## bisulfite : no
## snpFile : none
## geneAnnotation : none
## Aligner : Rbowtie v1.45.0 (parameters: -m 1 --best --strata)
## Genome : /tmp/RtmpIlvpBW/Rbuild2330404e7ceba2/QuasR/vig.../hg19sub.fa (file)
##
## Reads : 2 files, 2 samples (fastq format):
## 1. chip_1_1.fq.bz2 Sample1 (phred33)
## 2. chip_2_1.fq.bz2 Sample2 (phred33)
##
## Genome alignments: directory: same as reads
## 1. chip_1_1_2345ca234088fb.bam
## 2. chip_2_1_2345ca7fd8dfff.bam
##
## Aux. alignments: 1 file, directory: same as reads
## a. /tmp/RtmpIlvpBW/Rbuild2330404e7ceba2/QuasR/vign.../NC_001422.1.fa phiX174
## 1. chip_1_1_2345ca6003128b.bam
## 2. chip_2_1_2345ca3ce24306.bam
qAlign
will build alignment indices if they do not yet exist (by default, if the
genome and auxiliary sequences are given in the form of fasta
files, they will
be stored in the same folder). The qProject
object (proj1
) returned by qAlign
now contains all information about the ChIP-seq experiment: the (optional) project name,
the project options, aligner package, reference genome, and at the bottom the sequence
and alignment files. For each input sequence file, there will be one bam
file with
alignments against the reference genome, and one for each auxiliary target sequence
with alignments of reads without genome hits. Our auxFile
contains a single auxiliary
target sequence, so we expect two bam
files per input sequence file:
list.files("extdata", pattern = ".bam$")
## [1] "chip_1_1_2345ca234088fb.bam" "chip_1_1_2345ca6003128b.bam"
## [3] "chip_2_1_2345ca3ce24306.bam" "chip_2_1_2345ca7fd8dfff.bam"
## [5] "phiX_paired_withSecondary.bam"
The bam
file names consist of the base name of the sequence file with an added
random string. The random suffix makes sure that newly generated alignment files
do not overwrite existing ones, for example of the same reads aligned against an
alternative reference genome. Each alignment file is accompanied by two additional
files with suffixes .bai
and .txt
:
list.files("extdata", pattern = "^chip_1_1_")[1:3]
## [1] "chip_1_1_2345ca234088fb.bam" "chip_1_1_2345ca234088fb.bam.bai"
## [3] "chip_1_1_2345ca234088fb.bam.txt"
The .bai
file is the bam
index used for fast access by genomic coordinate.
The .txt
file contains all the parameters used to generate the corresponding bam
file. Before new alignments are generated, qAlign
will look for available .txt
files in default locations (the directory containing the input sequence file, or
the value of alignmentsDir
), and read their contents to determine if a compatible
bam
file already exists. A compatible bam
file is one with the same reads and
genome, aligned using the same aligner and identical alignment parameters. If a
compatible bam
file is not found, or the .txt
file is missing, qAlign
will
generate a new bam
file. It is therefore recommended not to delete the .txt
file - without it, the corresponding bam
file will become unusable for QuasR.
QuasR can produce a quality control report in the form of a series
of diagnostic plots with details on sequences and alignments (see QuasR scheme figure above).
The plots are generated by calling the qQCReport
function with the qProject
object as argument. qQCReport
uses ShortRead (Morgan et al. 2009) internally
to obtain some of the quality metrics, and some of the plots are inspired by the
FastQC quality control tool by Simon Andrews (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).
The plots will be stored into a multipage PDF document defined by the pdfFilename
argument, or else shown as individual plot windows on the current graphics device.
In order to keep the running time reasonably short, some quality metrics are obtained
from a random sub-sample of the sequences or alignments.
## collecting quality control data
## creating QC plots
qQCReport(proj1, pdfFilename = "extdata/qc_report.pdf")
## collecting quality control data
## creating QC plots
Currently available plots are described in section 7.4 and following.
The alignmentStats
gets the number of (un-)mapped reads for each sequence file
in a qProject
object, by reading the bam
file indices, and returns them as a
data.frame
. The function also works for arguments of type character
with one
or several bam
file names (for details see section 7.5).
alignmentStats(proj1)
## seqlength mapped unmapped
## Sample1:genome 95000 2339 258
## Sample2:genome 95000 3609 505
## Sample1:phiX174 5386 251 7
## Sample2:phiX174 5386 493 12
For visualization in a genome browser, alignment coverage along the genome can be
exported to a (compressed) wig file using the qExportWig
function. The created
fixedStep wig file (see (http://genome.ucsc.edu/goldenPath/help/wiggle.html) for
details on the wig format) will contain one track per sample in the qProject
object. The resolution is defined using the binsize
argument, and if scaling
is set to TRUE
, read counts per bin are scaled by the total number of aligned
reads in each sample to improve comparability:
qExportWig(proj1, binsize = 100L, scaling = TRUE, collapseBySample = TRUE)
## collecting mapping statistics for scaling...done
## start creating wig files...
## Sample1.wig.gz (Sample1)
## Sample2.wig.gz (Sample2)
## done
qCount
Alignments are quantified using qCount
, for example using a GRanges
object as
a query. In our H3K4me3 ChIP-seq example, we expect the reads to occur around the
transcript start site of genes. We can therefore construct suitable query regions
using genomic intervals around the start sites of known genes. In the code below,
this is achieved with help from the txdbmaker package to first
create a TxDb
object from a .gtf
file with gene annotation. With the promoters
function from the GenomicFeatures package, we can then create the
GRanges
object with regions to be quantified. Finally, because most genes
consist of multiple overlapping transcripts, we select the first transcript for
each gene:
library(txdbmaker)
annotFile <- "extdata/hg19sub_annotation.gtf"
chrLen <- scanFaIndex(genomeFile)
chrominfo <- data.frame(chrom = as.character(seqnames(chrLen)),
length = width(chrLen),
is_circular = rep(FALSE, length(chrLen)))
txdb <- makeTxDbFromGFF(file = annotFile, format = "gtf",
chrominfo = chrominfo,
dataSource = "Ensembl",
organism = "Homo sapiens")
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
promReg <- promoters(txdb, upstream = 1000, downstream = 500,
columns = c("gene_id","tx_id"))
gnId <- vapply(mcols(promReg)$gene_id,
FUN = paste, FUN.VALUE = "",
collapse = ",")
promRegSel <- promReg[ match(unique(gnId), gnId) ]
names(promRegSel) <- unique(gnId)
promRegSel
## GRanges object with 12 ranges and 2 metadata columns:
## seqnames ranges strand | gene_id tx_id
## <Rle> <IRanges> <Rle> | <CharacterList> <integer>
## ENSG00000176022 chr1 31629-33128 + | ENSG00000176022 1
## ENSG00000186891 chr1 6452-7951 - | ENSG00000186891 2
## ENSG00000186827 chr1 14013-15512 - | ENSG00000186827 6
## ENSG00000078808 chr1 31882-33381 - | ENSG00000078808 9
## ENSG00000171863 chr2 1795-3294 + | ENSG00000171863 17
## ... ... ... ... . ... ...
## ENSG00000254999 chr3 1276-2775 + | ENSG00000254999 28
## ENSG00000238642 chr3 19069-20568 + | ENSG00000238642 30
## ENSG00000134086 chr3 26692-28191 + | ENSG00000134086 31
## ENSG00000238345 chr3 26834-28333 + | ENSG00000238345 32
## ENSG00000134075 chr3 13102-14601 - | ENSG00000134075 36
## -------
## seqinfo: 3 sequences from an unspecified genome
Using promRegSel
object as query, we can now count the alignment per sample in
each of the promoter windows.
cnt <- qCount(proj1, promRegSel)
## counting alignments...done
cnt
## width Sample1 Sample2
## ENSG00000176022 1500 157 701
## ENSG00000186891 1500 22 5
## ENSG00000186827 1500 10 3
## ENSG00000078808 1500 73 558
## ENSG00000171863 1500 94 339
## ENSG00000252531 1500 59 9
## ENSG00000247886 1500 172 971
## ENSG00000254999 1500 137 389
## ENSG00000238642 1500 8 3
## ENSG00000134086 1500 9 18
## ENSG00000238345 1500 13 25
## ENSG00000134075 1500 7 3
The counts returned by qCount
are the raw number of alignments per sample and
region, without any normalization for the query region length, or the total number
of aligned reads in a sample. As expected, we can find H3K4me3 signal at promoters
of a subset of the genes with CpG island promoters, which we can visualize with help
of the Gviz package:
gr1 <- import("Sample1.wig.gz")
## Warning in asMethod(object): NAs introduced by coercion
gr2 <- import("Sample2.wig.gz")
## Warning in asMethod(object): NAs introduced by coercion
library(Gviz)
axisTrack <- GenomeAxisTrack()
dTrack1 <- DataTrack(range = gr1, name = "Sample 1", type = "h")
dTrack2 <- DataTrack(range = gr2, name = "Sample 2", type = "h")
txTrack <- GeneRegionTrack(txdb, name = "Transcripts", showId = TRUE)
plotTracks(list(axisTrack, dTrack1, dTrack2, txTrack),
chromosome = "chr3", extend.left = 1000)