Contents

1 Introduction

rprimer provides tools for designing degenerate oligos (primers and probes) for sequence variable targets, such as RNA viruses. A multiple DNA sequence alignment is used as input data, and the outputs are presented in data frame format and as dashboard-like plots. The aim of the package is to provide a visual and efficient approach to degenerate oligo design, where sequence conservation analysis forms the basis for the design process.

In this vignette, I describe and demonstrate the use of the package by designing broadly reactive primers and probes for reverse transcription (RT) polymerase chain reaction (PCR)-based amplification and detection of hepatitis E virus (HEV), a sequence variable RNA virus and a common foodborne pathogen.

2 Installation

To install rprimer, start R (version 4.2.0 or higher), and enter the following code:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rprimer")

For best functionality, it is also recommended to install the Biostrings package [1]:

BiocManager::install("Biostrings")

3 Overview

The oligo and assay design workflow consists of five functions:

4 Shiny application

The package can be run through a Shiny application (a graphical user interface). To start the application, type runRprimerApp() from within R upon installing and attaching the package.

5 Workflow

library(rprimer)
library(Biostrings)

5.1 Collection of target sequences and multiple alignment

The first step is to identify and align target sequences of interest. It is important that the alignment does not contain any poorly aligned sequences and accurately represents the generic variation of the target population.

The file “example_alignment.txt” is provided with the package and contains an alignment of 50 HEV sequences. The file path can be retrieved by:

system.file("extdata", "example_alignment.txt", package = "rprimer")

5.2 Data import

The alignment must be imported to R using the readDNAMultipleAlignment() function from Biostrings [1]. The input file can contain from one to several thousand sequences.

Below, I show how to import the file “example_alignment.txt”:

filepath <- system.file("extdata", "example_alignment.txt", package = "rprimer")

myAlignment <- readDNAMultipleAlignment(filepath, format = "fasta")

For some applications, there is often an interest in amplifying a specific region of the genome. In such cases, the alignment can be masked so that only the desired oligo binding regions are available for the subsequent design process. This type of masking can be achieved by using colmask() from Biostrings [1], as illustrated in the example below:

## Mask everything but position 3000 to 4000 and 5000 to 6000
myMaskedAlignment <- myAlignment

colmask(myMaskedAlignment, invert = TRUE) <- c(3000:4000, 5000:6000)

It is also possible to mask specific sequences by using rowmask(). Gap-rich positions can be hidden with gapmask() [1].

5.3 Design procedure

5.3.1 Step 1: consensusProfile

Once the alignment is imported, a consensus profile can be generated. The consensus profile contains all the information needed for the subsequent design process:

  • Proportion of the letters A, C, G, T, and other
  • Proportion of gaps (-)
  • Majority consensus base
  • Identity
  • IUPAC consensus character
  • Coverage

The majority consensus base is the most frequently occurring letter among the letters A, C, G, T and -.

Identity represents the proportion of sequences, among all sequences with a DNA base (A, C, G, T) or gap (-) that has the majority consensus base.

The IUPAC consensus character includes all DNA bases that occur at a relative frequency higher than a user-specified ambiguity threshold. The threshold can range from 0 to 0.2: a value of 0 (default) will catch all the variation, but with the potential downside of generating oligos with high degeneracy. A higher value will capture most of the variation, but with the potential downside of missing less common sequence variants.

Coverage represents the proportion of sequences in the target alignment (among all sequences with a DNA base) that are covered by the IUPAC consensus character.

consensusProfile() has two arguments:

  • x: a MultipleDNAAlignment object (see Data import)
  • ambiguityThreshold: position wise “detection level” for ambiguous bases. All DNA bases that occur with a proportion higher than the specified value will be included in the IUPAC consensus character. Defaults to 0, can range from 0 to 0.2

Type the following code to generate a consensus profile from the example alignment below:

myConsensusProfile <- consensusProfile(myAlignment, ambiguityThreshold = 0.05)

Results (row 100-105):

position a c g t other gaps majority identity iupac coverage
100 0.00 1.00 0.00 0.00 0.00 0 C 1.00 C 1.00
101 1.00 0.00 0.00 0.00 0.00 0 A 1.00 A 1.00
102 0.16 0.00 0.84 0.00 0.00 0 G 0.84 R 1.00
103 0.00 0.00 1.00 0.00 0.00 0 G 1.00 G 1.00
104 0.00 0.98 0.00 0.00 0.02 0 C 1.00 C 1.00
105 0.20 0.00 0.02 0.78 0.00 0 T 0.78 W 0.98

The results can be visualized by using plotData():

plotData(myConsensusProfile)
#> Warning in ggplot2::geom_segment(color = "#93A8AC", ggplot2::aes(x = min(position), : All aesthetics have length 1, but the data has 7597 rows.
#> ℹ Please consider using `annotate()` or provide this layer with data containing
#>   a single row.