1 What is genomic DNA contamination in a RNA-seq experiment

RNA sequencing (RNA-seq) libraries may contain genomic DNA (gDNA) contamination due to an absent or inefficient gDNA digestion step (with DNase) during RNA extraction or library preparation. In fact, some protocols do not include a DNase treatment step, or they include it as optional.

While gDNA contamination is not a major issue in libraries built with poly(A) selected RNA molecules, it can remarkably affect gene expression quantification from libraries of total RNA. When present, gDNA contamination can lead to a misleading attribution of expression to unannotated regions of the genome. For this reason, it is important to check the levels of gDNA contamination during quality control before performing further analyses, specially when total RNA has been sequenced.

2 Diagnose gDNA contamination

Here we illustrate the use of the gDNAx package for producing different diagnostics and how do they reveal different gDNA contamination levels. We use a subset of the data in (Li et al. 2022), which consists of 9 paired-end samples of total RNA-seq with increasing levels of gDNA contamination: 0% (no contamination), 1% and 10%, with 3 replicates each. The data is available through the Bioconductor experiment data package gDNAinRNAseqData, which allows one to download 9 BAM files, containing about 100,000 alignments, sampled uniformly at random from the complete BAM files.

library(gDNAinRNAseqData)

# Retrieve BAM files
bamfiles <- LiYu22subsetBAMfiles()
bamfiles
[1] "/tmp/Rtmpjq5k3T/s32gDNA0.bam"  "/tmp/Rtmpjq5k3T/s33gDNA0.bam" 
[3] "/tmp/Rtmpjq5k3T/s34gDNA0.bam"  "/tmp/Rtmpjq5k3T/s26gDNA1.bam" 
[5] "/tmp/Rtmpjq5k3T/s27gDNA1.bam"  "/tmp/Rtmpjq5k3T/s28gDNA1.bam" 
[7] "/tmp/Rtmpjq5k3T/s23gDNA10.bam" "/tmp/Rtmpjq5k3T/s24gDNA10.bam"
[9] "/tmp/Rtmpjq5k3T/s25gDNA10.bam"

# Retrieve information on the gDNA concentrations of each BAM file
pdat <- LiYu22phenoData(bamfiles)
pdat
          gDNA
s32gDNA0     0
s33gDNA0     0
s34gDNA0     0
s26gDNA1     1
s27gDNA1     1
s28gDNA1     1
s23gDNA10   10
s24gDNA10   10
s25gDNA10   10

Diagnosing the presence of gDNA contamination requires using an annotation of genes and transcripts. The gDNAx package expects that we provide such an annotation using a so-called TxDb package, either as a TxDb object, created once such a package is loaded into the R session, or by specifying the name of the package. The Bioconductor website provides a number of TxDb packages, but if the we do not find the one we are looking for, we can build a TxDb object using the function makeTxDbFromGFF() on a given GFF or GTF file, or any of the other makeTxDbFrom*() functions, available in the GenomicFeatures package. Here we load the TxDb package corresponding to the GENCODE annotation provided by the UCSC Genome Browser.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: knownGene
# UCSC Track: GENCODE V47
# Resource URL: https://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 412034
# Db created by: txdbmaker package from Bioconductor
# Creation time: 2025-03-02 02:45:03 +0000 (Sun, 02 Mar 2025)
# txdbmaker version at creation time: 1.3.1
# RSQLite version at creation time: 2.3.9
# DBSCHEMAVERSION: 1.2

We can calculate diagnostics for gDNA contamination using the function gDNAdx() as follows.

library(gDNAx)

gdnax <- gDNAdx(bamfiles, txdb)
class(gdnax)
[1] "gDNAx"
attr(,"package")
[1] "gDNAx"
gdnax
gDNAx object
# BAM files (9): s32gDNA0.bam, ..., s25gDNA10.bam
# Library layout: paired-end, 9 (2x50nt)
# Library protocol: unstranded (9 out of 9)
# Sequences: only assembled molecules
# Annotation pkg: TxDb.Hsapiens.UCSC.hg38.knownGene
# Alignments employed: first 100000

The previous call will show progress through its calculations unless we set the argument verbose=FALSE, and return an object of class gDNAx once it has finished. We have let the gDNAdx() function figure out the library layout and protocol, but if we already knew those parameters from the data, we could set them through the arguments singleEnd and strandMode and speed up calculations. Another way to speed up calculations, which may be advantageous specially when analysing a high number of BAM files, is to use the BPPARAM argument to set a number of parallel threads of execution; see the help page of gDNAdx() for full details on how to specify non-default values to all these parameters.

Calling the plot() function with the resulting object gDNAx object as the first argument will plot several diagnostics. Here below, we also use a parameter called group to automatically color samples, in this case, by the gDNA contamination levels included in the experimental design of the data; see (Li et al. 2022) for full details on it.

par(mar=c(4, 5, 2, 1))
plot(gdnax, group=pdat$gDNA, pch=19)

Figure 1: Diagnostics
Default diagnostics obtained with the function plot() on a gDNAx object.

The previous figure contains three diagnostic plots, each one showing the following values as a function of the percentage of read alignments fully contained in intergenic regions (IGC):

Percentage of Splice-compatible junction (SCJ) alignments. These are alignments compatible with a transcript in the given annotation, for which the aligned read, or at least one of the two aligned reads in the case of a paired-end layout, spans one or more exon-exon junctions over two or more exons of that transcript.
Percentage of splice compatible exonic (SCE) alignments. These are alignments compatible with a transcript in the given annotation, but which differently to SCJ alignments, do not include an exon-exon junction in the alignment.
Percentage of intronic (INT) alignments. These are alignments fully contained in intronic regions.

These data appear to come from an unstranded library, but if they would be stranded, a fourth diagnostic plot would appear showing an estimated value of the strandedness of each sample as function of the percentage of intergenic alignments. In stranded RNA-seq data, we should expect strandedness values close to 1, which imply that most reads align to the same strand than the annotated transcripts. Lower strandedness values can be indicative of gDNA contamination because reads sequenced from DNA are expected to align in equal proportions to both strands.

Because IGC alignments mainly originate from gDNA contamination, we may expect a negative correlation between the percentage of SCJ or SCE alignments and the percentage of IGC alignments. On the other hand, INT alignments may originate either from primary unprocessed transcripts in the nucleus, or from gDNA contamination as well. Therefore, we may also expect some positive correlation between the percentages of INT and IGC alignments, as it happens in this data.

Using the function getDx() on the gDNAx object, we obtain all the values used in the diagnostics.

dx <- getDx(gdnax)
dx
                IGC      INT      SCJ      SCE      JNC  IGCFLM  SCJFLM  SCEFLM
s32gDNA0  0.7656031 11.80614 14.59663 39.42605 19.56452 173.851 157.238 157.140
s33gDNA0  0.8235200 11.85327 14.62375 39.76167 19.55383 171.714 160.298 162.138
s34gDNA0  0.8075357 12.30564 14.81652 39.63345 19.70788 172.938 153.528 160.444
s26gDNA1  1.0919921 12.32276 14.19890 38.21170 18.73132 172.197 165.081 161.547
s27gDNA1  1.0527268 12.51609 13.99513 37.66892 18.31765 193.587 162.330 162.715
s28gDNA1  1.1572841 12.57327 13.53148 37.16782 17.96053 186.751 164.255 163.823
s23gDNA10 2.8553537 13.53137 10.81237 30.68874 14.41110 201.513 162.810 158.908
s24gDNA10 3.0301500 14.09222 10.44190 29.71466 13.91748 177.065 162.112 162.237
s25gDNA10 2.7249990 13.89427 10.79410 30.40159 14.20691 173.403 157.404 160.488
           INTFLM STRAND
s32gDNA0  156.651     NA
s33gDNA0  159.331     NA
s34gDNA0  152.425     NA
s26gDNA1  160.337     NA
s27gDNA1  158.452     NA
s28gDNA1  160.210     NA
s23gDNA10 160.191     NA
s24gDNA10 157.094     NA
s25gDNA10 159.135     NA

The column JNC contains the percentage of alignments that include one or more junctions, irrespective of whether those alignments are compatible with an spliced transcript in the given annotation. The columns with the suffix FLM contain an estimation of the fragment length mean in the alignments originating in the corresponding region, and the column STRAND stores the strandedness values, which in this case are NA because this dataset is not strand-specific.

We can directly plot the estimated fragments length distributions with the function plotFrgLength().

plotFrgLength(gdnax)

Figure 2: Fragments length distributions
Density and location of the estimated fragments length distribution, by the origin of the alignments.

Another way to represent some of diagnostic measurements is to examine the origin of the alignments per sample in percentages. Fluctuations of these proportions across samples can help quantifying the amount of gDNA contamination per sample.

plotAlnOrigins(gdnax, group=pdat$gDNA)

Figure 3: Alignment origins

In the previous plot, OTH indicates the percentage of read alignments that do not fall fully within any of the other categories (SCJ, SCE, INT and IGC), these may typically correspond to read alignments overlapping two of these previous categories. If we are interested in knowing exactly which annotations of intergenic and intronic regions have been used to compute these diagnostics, we can retrieve them using the functions getIgc() and getInt() on the output gDNAx object, respectively.

igcann <- getIgc(gdnax)
igcann
GRanges object with 769973 ranges and 0 metadata columns:
           seqnames      ranges strand
              <Rle>   <IRanges>  <Rle>
       [1]     chr1     1-10000      *
       [2]     chr1 31132-31292      *
       [3]     chr1 31755-32840      *
       [4]     chr1 34109-34450      *
       [5]     chr1 37596-37732      *
       ...      ...         ...    ...
  [769969]     chrM   2746-3229      *
  [769970]     chrM   3308-4328      *
  [769971]     chrM   4401-7447      *
  [769972]     chrM  7515-16204      *
  [769973]     chrM 16250-16569      *
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome
intann <- getInt(gdnax)
intann
GRanges object with 1633196 ranges and 0 metadata columns:
            seqnames            ranges strand
               <Rle>         <IRanges>  <Rle>
        [1]     chr1       12722-12974      *
        [2]     chr1       13053-13220      *
        [3]     chr1       18062-18267      *
        [4]     chr1       18380-18496      *
        [5]     chr1       18555-18906      *
        ...      ...               ...    ...
  [1633192]     chrY 57207675-57208210      *
  [1633193]     chrY 57208313-57208518      *
  [1633194]     chrY 57213358-57213525      *
  [1633195]     chrY 57213603-57213855      *
  [1633196]     chrY 57213983-57214349      *
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome

2.1 Strandedness estimation

Since we have let the gDNAdx() function to estimate strandedness, we can examine those estimated values using the getter function strandedness() on the gDNAx object.

strandedness(gdnax)
          strandMode1 strandMode2      ambig Nalignments
s32gDNA0    0.4760574   0.4741786 0.04976403       66534
s33gDNA0    0.4761212   0.4735255 0.05035334       66649
s34gDNA0    0.4796994   0.4707103 0.04959038       66525
s26gDNA1    0.4805201   0.4714825 0.04799738       64066
s27gDNA1    0.4819913   0.4687183 0.04929042       63136
s28gDNA1    0.4759239   0.4745481 0.04952808       62510
s23gDNA10   0.4777124   0.4724775 0.04981009       50813
s24gDNA10   0.4733836   0.4775504 0.04906604       49199
s25gDNA10   0.4754988   0.4742469 0.05025435       50324

Using the function classifyStrandMode() we can obtain a classification of the most likely strand mode for each BAM file, given some default cutoff values.

classifyStrandMode(strandedness(gdnax))
 s32gDNA0  s33gDNA0  s34gDNA0  s26gDNA1  s27gDNA1  s28gDNA1 s23gDNA10 s24gDNA10 
       NA        NA        NA        NA        NA        NA        NA        NA 
s25gDNA10 
       NA

Li et al. (2022) report in their publication that “sequencing libraries were generated using a TruSeq Stranded Total RNA Library Prep Kit”. However, we can see that the proportion of alignments overlapping transcripts in the column strandMode1 is very similar to the one in the column strandMode2, which is compatible with an unstranded library and the reason why we obtain NA values in the output of classifyStrandMode(). We reach the same conclusion if we use the RSeQC tool infer_experiment.py (Wang, Wang, and Li 2012) and by visual inspection of the alignment data in the Integrative Genomics Viewer (IGV) (Robinson et al. 2011).

Following the recommendations made by Signal and Kahlke (2022), gDNAx attempts to use at least 200,000 alignments overlapping exonic regions to estimate strandedness. In the subset of data used in this vignette, the number of alignments used for that estimation is close to 60,000, which is the total number of exonic alignments present in the BAM files.

If we are only interested in the estimation of strandedness values, we can can also directly call strandedness() with a character string vector of BAM filenames and a TxDb annotation object; see the help page of strandedness().

3 Remove gDNA contamination

We can attempt removing read alignments from putative gDNA origin using the function gDNAtx(), which should be called with the gDNAx object returned by gDNAdx() and a path in the filesystem where to stored the filtered BAM files. By default, these filtered BAM files include splice-compatible read alignments (SCJ and SCE) that are found in a genomic window enriched for stranded alignments. For further fine tuning of this filtering strategy please use the function filterBAMtx().

## fbf <- filterBAMtxFlag(isSpliceCompatibleJunction=TRUE,
##                        isSpliceCompatibleExonic=TRUE)
## fstats <- filterBAMtx(gdnax, path=tmpdir, txflag=fbf)
## fstats
tmpdir <- tempdir()
fstats <- gDNAtx(gdnax, path=tmpdir)
fstats

           NALN NIGC NINT  NSCJ  NSCE  NSTW NNCH
s32gDNA0  99660   NA   NA 15134 39905 46336  340
s33gDNA0  99694   NA   NA 15170 40189 46823  306
s34gDNA0  99686   NA   NA 15352 40068 46529  314
s26gDNA1  99726   NA   NA 14743 38586 45061  274
s27gDNA1  99456   NA   NA 14466 38011 45083  544
s28gDNA1  99457   NA   NA 14023 37470 43892  543
s23gDNA10 99007   NA   NA 11115 30690 36424  993
s24gDNA10 99005   NA   NA 10746 29716 35032  995
s25gDNA10 99156   NA   NA 11103 30394 36039  844

The first column NALN corresponds to the total number of read alignments processed. Columns NIGC to NSCE contain the number of selected alignments from each corresponding origin, where NA indicates that that type of alignment was not selected for filtering. The column NSTW corresponds to selected alignments occurring in stranded windows, and therefore this number will be always equal or smaller than the number of the previous columns. The column NNCH corresponds to discarded read alignments ocurring in non-standard chromosomes.

4 Session information

sessionInfo()
R version 4.5.0 RC (2025-04-04 r88126)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] gDNAx_1.6.1                             
 [2] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0
 [3] GenomicFeatures_1.60.0                  
 [4] AnnotationDbi_1.70.0                    
 [5] Biobase_2.68.0                          
 [6] Rsamtools_2.24.0                        
 [7] Biostrings_2.76.0                       
 [8] XVector_0.48.0                          
 [9] GenomicRanges_1.60.0                    
[10] GenomeInfoDb_1.44.0                     
[11] IRanges_2.42.0                          
[12] S4Vectors_0.46.0                        
[13] BiocGenerics_0.54.0                     
[14] generics_0.1.4                          
[15] gDNAinRNAseqData_1.8.0                  
[16] knitr_1.50                              
[17] BiocStyle_2.36.0                        

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1            dplyr_1.1.4                
 [3] blob_1.2.4                  filelock_1.0.3             
 [5] bitops_1.0-9                fastmap_1.2.0              
 [7] RCurl_1.98-1.17             BiocFileCache_2.16.0       
 [9] VariantAnnotation_1.54.1    GenomicAlignments_1.44.0   
[11] XML_3.99-0.18               digest_0.6.37              
[13] mime_0.13                   lifecycle_1.0.4            
[15] KEGGREST_1.48.0             RSQLite_2.4.1              
[17] magrittr_2.0.3              compiler_4.5.0             
[19] rlang_1.1.6                 sass_0.4.10                
[21] tools_4.5.0                 plotrix_3.8-4              
[23] yaml_2.3.10                 rtracklayer_1.68.0         
[25] S4Arrays_1.8.1              bit_4.6.0                  
[27] curl_6.3.0                  DelayedArray_0.34.1        
[29] RColorBrewer_1.1-3          abind_1.4-8                
[31] BiocParallel_1.42.1         withr_3.0.2                
[33] purrr_1.0.4                 grid_4.5.0                 
[35] ExperimentHub_2.16.0        tinytex_0.57               
[37] SummarizedExperiment_1.38.1 cli_3.6.5                  
[39] rmarkdown_2.29              crayon_1.5.3               
[41] httr_1.4.7                  rjson_0.2.23               
[43] DBI_1.2.3                   cachem_1.1.0               
[45] parallel_4.5.0              BiocManager_1.30.26        
[47] restfulr_0.0.15             matrixStats_1.5.0          
[49] vctrs_0.6.5                 Matrix_1.7-3               
[51] jsonlite_2.0.0              bookdown_0.43              
[53] bit64_4.6.0-1               magick_2.8.7               
[55] GenomicFiles_1.44.1         jquerylib_0.1.4            
[57] glue_1.8.0                  codetools_0.2-20           
[59] BiocVersion_3.21.1          BiocIO_1.18.0              
[61] UCSC.utils_1.4.0            tibble_3.3.0               
[63] pillar_1.10.2               rappdirs_0.3.3             
[65] htmltools_0.5.8.1           BSgenome_1.76.0            
[67] GenomeInfoDbData_1.2.14     R6_2.6.1                   
[69] dbplyr_2.5.0                evaluate_1.0.3             
[71] lattice_0.22-7              AnnotationHub_3.16.0       
[73] png_0.1-8                   memoise_2.0.1              
[75] bslib_0.9.0                 Rcpp_1.0.14                
[77] SparseArray_1.8.0           xfun_0.52                  
[79] MatrixGenerics_1.20.0       pkgconfig_2.0.3

References

Li, Xiangnan, Peipei Zhang, Haijian Wang, and Ying Yu. 2022. “Genes Expressed at Low Levels Raise False Discovery Rates in Rna Samples Contaminated with Genomic Dna.” BMC Genomics 23 (1): 554.

Robinson, James T, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S Lander, Gad Getz, and Jill P Mesirov. 2011. “Integrative Genomics Viewer.” Nature Biotechnology 29 (1): 24–26.

Signal, Brandon, and Tim Kahlke. 2022. “How_are_we_stranded_here: Quick Determination of Rna-Seq Strandedness.” BMC Bioinformatics 23 (1): 1–9.

Wang, Liguo, Shengqin Wang, and Wei Li. 2012. “RSeQC: Quality Control of Rna-Seq Experiments.” Bioinformatics 28 (16): 2184–5.

The gDNAx package

11 June 2025

Abstract

Package