The gDNAx
package provides functionality to diagnose the presence of genomic DNA (gDNA) contamination in RNA-seq data sets, and filter out reads of potential gDNA origin.
gDNAx 1.4.0
RNA sequencing (RNA-seq) libraries may contain genomic DNA (gDNA) contamination due to an absent or inefficient gDNA digestion step (with DNase) during RNA extraction or library preparation. In fact, some protocols do not include a DNase treatment step, or they include it as optional.
While gDNA contamination is not a major issue in libraries built with poly(A) selected RNA molecules, it can remarkably affect gene expression quantification from libraries of total RNA. When present, gDNA contamination can lead to a misleading attribution of expression to unannotated regions of the genome. For this reason, it is important to check the levels of gDNA contamination during quality control before performing further analyses, specially when total RNA has been sequenced.
Here we illustrate the use of the gDNAx package for producing different diagnostics and how do they reveal different gDNA contamination levels. We use a subset of the data in (Li et al. 2022), which consists of 9 paired-end samples of total RNA-seq with increasing levels of gDNA contamination: 0% (no contamination), 1% and 10%, with 3 replicates each. The data is available through the Bioconductor experiment data package gDNAinRNAseqData, which allows one to download 9 BAM files, containing about 100,000 alignments, sampled uniformly at random from the complete BAM files.
library(gDNAinRNAseqData)
# Retrieve BAM files
bamfiles <- LiYu22subsetBAMfiles()
bamfiles
[1] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpxfsbKo/s32gDNA0.bam"
[2] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpxfsbKo/s33gDNA0.bam"
[3] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpxfsbKo/s34gDNA0.bam"
[4] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpxfsbKo/s26gDNA1.bam"
[5] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpxfsbKo/s27gDNA1.bam"
[6] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpxfsbKo/s28gDNA1.bam"
[7] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpxfsbKo/s23gDNA10.bam"
[8] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpxfsbKo/s24gDNA10.bam"
[9] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpxfsbKo/s25gDNA10.bam"
# Retrieve information on the gDNA concentrations of each BAM file
pdat <- LiYu22phenoData(bamfiles)
pdat
gDNA
s32gDNA0 0
s33gDNA0 0
s34gDNA0 0
s26gDNA1 1
s27gDNA1 1
s28gDNA1 1
s23gDNA10 10
s24gDNA10 10
s25gDNA10 10
Diagnosing the presence of gDNA contamination requires using an annotation
of genes and transcripts. The gDNAx
package expects that we provide such an annotation using a so-called TxDb
package, either as a TxDb
object, created once such a package is loaded into
the R session, or by specifying the name of the package. The Bioconductor
website
provides a number of TxDb
packages, but if the we do not find the one we are
looking for, we can build a TxDb
object using the function makeTxDbFromGFF()
on a given GFF or
GTF file, or any of the
other makeTxDbFrom*()
functions, available in the
GenomicFeatures package.
Here we load the TxDb
package corresponding to the GENCODE annotation provided
by the UCSC Genome Browser.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: knownGene
# UCSC Track: GENCODE V46
# Resource URL: https://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 278220
# Db created by: txdbmaker package from Bioconductor
# Creation time: 2024-09-24 16:26:51 +0000 (Tue, 24 Sep 2024)
# txdbmaker version at creation time: 1.1.1
# RSQLite version at creation time: 2.3.7
# DBSCHEMAVERSION: 1.2
We can calculate diagnostics for gDNA contamination using the function
gDNAdx()
as follows.
library(gDNAx)
gdnax <- gDNAdx(bamfiles, txdb)
class(gdnax)
[1] "gDNAx"
attr(,"package")
[1] "gDNAx"
gdnax
gDNAx object
# BAM files (9): s32gDNA0.bam, ..., s25gDNA10.bam
# Library layout: paired-end, 9 (2x50nt)
# Library protocol: unstranded (9 out of 9)
# Sequences: only standard chromosomes
# Annotation pkg: TxDb.Hsapiens.UCSC.hg38.knownGene
# Alignments employed: first 100000
The previous call will show progress through its calculations unless we set
the argument verbose=FALSE
, and return an object of class gDNAx
once it has
finished. We have let the gDNAdx()
function figure out the library layout
and protocol, but if we already knew those parameters from the data, we could
set them through the arguments singleEnd
and strandMode
and speed up
calculations. Another way to speed up calculations, which may be advantageous
specially when analysing a high number of BAM files, is to use the BPPARAM
argument to set a number of parallel threads of execution; see the help page
of gDNAdx()
for full details on how to specify non-default values to all
these parameters.
Calling the plot()
function with the resulting object gDNAx
object as the
first argument will plot several diagnostics. Here below, we also use a
parameter called group
to automatically color samples, in this case, by the
gDNA contamination levels included in the experimental design of the data; see
(Li et al. 2022) for full details on it.
par(mar=c(4, 5, 2, 1))
plot(gdnax, group=pdat$gDNA, pch=19)