coRdon
provides tools for analysis of codone usage (CU) in various unannotated or KEGG/COG annotated DNA sequences. Funcionalities include: calculation of different CU bias statistics and CU-based predictors of gene expression, gene set enrichment analysis for annotated sequences, and several methods for visualization of CU and enrichment analysis results.
coRdon 1.25.0
Not all synonymous codons are used equally often in prokaryotic genomes - this selective preference is termed codon usage (CU) bias, and is an independent determinant of gene expression regulation at the translational level. Those synonymous codons corresponding to the most abundant tRNA species are considered optimal for translation. Codon usage bias can therfore be used to predict the relative expression levels of genes, by comparing CU bias of a gene to the CU bias of a set of genes known to be highly expressed.
This approach can be efficiently used to predict highly expressed genes in a single genome, but is especially useful at the higher level of an entire metagenome. It has been shown that, as well as being present within the genome, CU bias is shared among the microbial spieces in the same environment. By analysing CU bias of a metagenome, one can identify genes with high predicted expression across the entire microbial community, and determine which are the enriched functions within the community, i.e. its ‘functional fingerprint’.
A typical workflow for analysing CU bias includes:
Install coRdon from Bioconductor:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("coRdon")
The developmental version can be installed directly from from GitHub:
devtools::install_github("BioinfoHR/coRdon")
Sequences from human gut microbiome samples of healthy individuals and liver cirrhosis patients (Quin et al. 2014) were processed, assembled and used to predict ORFs, which were then annotated with a KO (KEGG orthology) function (Fabijanic and Vlahovicek 2016).
For each sequence, in each sample, we can calculate codon usage (CU)
bias.
In order to do this, we need to count occurrences of each codon
in each sequence. This can be done by storing sequences in each sample
as a codonTable
object.
We can use readSet()
to read (a directory containing) .fasta files and
store sequences as a DNAStringSet
object, which could then be
converted to codonTable
using the constructor method codonTable()
.
For the purpose of this vignette we will create two codonTable
objects,
HD59 and LD94, containing codon counts for metagenomic gut sample from
healthy individual, and from liver chirosis patient, respectively.
dnaLD94 <- readSet(
file="https://raw.githubusercontent.com/BioinfoHR/coRdon-examples/master/LD94.fasta"
)
LD94 <- codonTable(dnaLD94)
dnaHD59 <- readSet(
file="https://raw.githubusercontent.com/BioinfoHR/coRdon-examples/master/HD59.fasta"
)
HD59 <- codonTable(dnaHD59)
codonTable
object stores codon counts for each sequence, as well as
some additional metadata, namely sequence ID infered from
a .fasta file or DNAStringSet object, sequence length in codons and
KO / COG annotation.
Note that not all of these need be present in every codonTable
object
- only the codon counts, and inferred lengths are mandatory.
They can be accessed using codonCounts()
and getlen()
(not to be confused with length()
which returns number of sequences
in a codonTable
object):
cc <- codonCounts(HD59)
head(cc)
AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC
[1,] 5 4 10 3 1 2 4 0 0 2 1 0 0 5 2 1 2 1
[2,] 3 5 4 1 2 3 5 1 0 1 0 0 0 8 4 4 0 4
[3,] 4 3 5 5 1 5 8 0 0 1 1 0 0 13 8 4 2 4
[4,] 1 9 12 3 1 14 7 0 0 5 0 0 0 15 6 6 2 2
[5,] 5 2 3 2 0 0 1 0 0 0 0 0 0 3 1 4 0 1
[6,] 5 1 2 3 3 5 1 3 1 1 0 4 5 7 10 4 3 4
CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT
[1,] 9 2 0 2 7 0 1 3 7 0 0 12 15 2 11 4 5 5
[2,] 3 1 0 1 4 1 0 4 1 0 0 5 11 2 3 9 9 2
[3,] 16 7 1 2 12 2 1 7 0 9 0 4 14 4 7 13 8 5
[4,] 15 3 0 1 4 1 0 9 1 8 0 5 15 1 10 12 8 7
[5,] 1 1 0 1 0 0 0 1 1 0 0 1 4 0 1 3 4 1
[6,] 6 0 0 7 1 3 0 5 0 3 0 2 6 0 16 4 1 3
GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC
[1,] 2 8 6 0 1 2 2 2 2 4 2 2 0 1 0 5 0 4
[2,] 2 7 13 1 2 12 4 2 0 9 3 2 0 1 0 5 1 4
[3,] 4 10 9 2 1 17 1 3 1 8 12 2 0 6 0 3 0 10
[4,] 3 5 8 0 2 9 0 1 0 3 12 2 0 6 0 1 0 5
[5,] 1 8 2 1 0 4 0 1 0 0 2 0 0 0 0 0 0 2
[6,] 8 11 1 4 7 8 0 8 7 6 3 2 0 3 0 4 2 2
TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
[1,] 1 3 0 6 1 1 0 3 1 5
[2,] 6 0 0 4 0 0 1 2 2 2
[3,] 5 0 0 1 1 0 0 7 8 0
[4,] 3 0 0 0 0 0 0 10 3 3
[5,] 4 0 0 0 1 0 0 0 1 1
[6,] 2 0 0 1 2 1 3 3 3 2
l <- getlen(HD59)
head(l)
[1] 192 186 287 259 69 212
length(HD59)
[1] 19158
We can get IDs, KO or COG annotations in a similar way, if these are present:
ko <- getKO(HD59)
head(ko)
[1] "K01509" "K03431" "K08884" "K02313" "K01810" "K00174"
Now we can calculate CU bias for every sequence.
There are many statistics that measure codon usage, and several widely
used ones are implemented in coRdon. For example, Measure Independent
of Length and Composition (MILC) can be calculated for sequences in
the codonTable
like so:
milc <- MILC(HD59)
head(milc)
self
[1,] 0.7231937
[2,] 0.7083612
[3,] 0.7014576
[4,] 0.7119265
[5,] 0.5658703
[6,] 0.7242135
By default, MILC value for every sequence in a set is calculated with
respect to the average CU bias of the entire sample (“self”). This can
be changed by setting self = FALSE
and providing a subset of reference
genes, to which CU will be compared. This is commonly a subset of genes
known to be highly expressed, e.g. ribosomal genes. If the sequences in
the set are annotated, we can choose to calculate CU bias with respect
to codon usage of ribosomal genes (i.e. codon usage of the subset
of sequences in the sample which are annotated as ribosomal genes)
simply by setting ribosomal = TRUE
.
milc <- MILC(HD59, ribosomal = TRUE)
head(milc)
self ribosomal
[1,] 0.7231937 1.0337208
[2,] 0.7083612 0.9530175
[3,] 0.7014576 0.8365788
[4,] 0.7119265 0.7975796
[5,] 0.5658703 0.8343067
[6,] 0.7242135 0.7887504
Alternatively, instead of using ribosomal genes, we could use
a subsets
argument to provide a named list of character vectors
of annotations for different genes to be used as reference set(s).
However, if no annotation for sequences is available, one can still
calculate CU bias with respect to some chosen subset(s) of genes,
by setting subsets
to be a named list of logical vectors of sequences
to be included in this reference subset(s), or a named list with
an entirely new reference codon counts table(s).
Also of note, self
, ribosomal
and subsets
arguments
can be combined, and MILC calculated for all sequences, with respect
to each of thus specified reference sets of genes.
Another important consideration is that MILC, like most of the other
CU statistics, is still somewhat dependent on the length of sequence
for which it is calculated, and in order to obtain meaningful values,
it is recommended that sequences should be at least 80 codons long.
To that end, either soft or hard filtering step can be included
in calculation of CU statistics. Hard filtering will remove all of the
sequences from codonTable
that are shorter than some specified
threshold (length in codons, default is 80), whereas soft filtering
will produce a message if there are such sequences in the set,
but will not remove them.
milc <- MILC(HD59, filtering = "soft")
Warning in MILC(HD59, filtering = "soft"): Some sequences have below-threshold
length!
We can visually inspect distribution of sequence lengths, extracted
from codonTable
object using getlen()
:
lengths <- as.data.frame(getlen(HD59))
colnames(lengths) <- "length"
ggplot(lengths, aes(length)) +
geom_density() +
geom_vline(xintercept = 80, colour = "red") +
theme_light()
Now we calculate MILC values with respect to overall sample CU
and ribosomal genes CU, removing beforehand sequences shorter than 80 codons
(use len.threshold
argument to specify differnt value of threshold):
milc <- MILC(HD59, ribosomal = TRUE, filtering = "hard")
Other CU statistics can be calculated in the same way as MILC()
,
using one of the functions: B()
, MCB()
, ENCprime()
, ENC()
or
SCUO()
. Note however, that when calculating ENC and SCUO,
one doesn’t need to provide a subset of reference genes, because these
statistics measure distance in codon usage to uniform usage of
synonymous codons.
Some additional arguments that can be set when calculating CU statistics
are id_or_name2
, for choosing a genetic code variant (see help for
getGeneticCode()
function in Biostrings package), alt.init
for
alternative initiation codons, stop.rm
for removal of STOP codons.
Now we can visualize CU bias for every gene on the B plot.
Here, every gene is represented by a single point on the plot, its
coordinates determined by distance of genes’ CU bias to overall CU bias
(“self”, on y axis) and to CU bias of reference genes (“ribosomal”,
on x axis).
x
and y
are names of columns in a matrix given as data
argument,
they can alternatevly be numeric vectors (of the same length),
containing values of the CU statistic we wish to plot one against
the other.
library(ggplot2)
xlab <- "MILC distance from sample centroid"
ylab <- "MILC distance from ribosomal genes"
milc_HD59 <- MILC(HD59, ribosomal = TRUE)
Bplot(x = "ribosomal", y = "self", data = milc_HD59) +
labs(x = xlab, y = ylab)