if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("roastgsa")
library( roastgsa )
The R package roastgsa
contains several functions to perform
gene set analysis, for both competitive and self-contained hypothesis
testing.
It follows the work by Gordon Smyth’s group on rotation based methods
for gene set analysis [1], code available in R through functions
roast
and romer
from the limma
package [2].
They propose to reconsider the sample randomization approach based on permutations of GSEA [3] to the more general procedure scheme using rotations of the residual space of the data, which can be used even when the number of degrees of freedom left is very small.
In our understanding, the test statistics provided in romer
, which
are all functions of the moderated t-statistics ranks, are too limited. In
this R package we propose to complete the romer functionality by providing
other statistics used in the GSA context [4]. We consider the KS based test
statistics in the GSEA and GSVA [5] as well as computationally more efficient
procedures such as restandardized statistics based on summary statistics
measures. The methodology is presented and compared using both simulated
and benchmarking data in the following publication:
roastgsa: a comparison of rotation-based scores for gene set enrichment analysis (2023).
We encourage anybody wanting to use the R package to first read the associated paper.
In this R package, we have also provided several tools to summarize
and visualize the roastgsa results. This vignette will show a work-flow
for gene set analysis with roastgsa
using microarray data (available in
the GSEABenchmarkeR
R package [6]).
We consider the fourth dataset available in the GSEABenchmarkeR
R
package, which consists of a microarray study with 30 samples, 15 paired samples
corresponding to two different groups (that take values 0 and 1 respectively).
library(GSEABenchmarkeR)
geo2keggR <- loadEData("geo2kegg", nr.datasets=4)
geo2kegg <- maPreproc(list(geo2keggR[[4]]))
y <- assays(geo2kegg[[1]])$expr
covar <- colData(geo2kegg[[1]])
covar$GROUP <- as.factor(covar$GROUP)
covar$BLOCK <- as.factor(covar$BLOCK)
The objective is to understand which KEGG processes might be affected
by such two genotyping conditions. These are made available through the
EnrichmentBrowser
R package. Here we will consider gene sets of size
between 11 and 499.
#library(EnrichmentBrowser)
#kegg.hs <- getGenesets(org="hsa", db="kegg")
data(kegg.hs)
kegg.gs <- kegg.hs[which(sapply(kegg.hs,length)>10&sapply(kegg.hs,length)<500)]
There are some important elements that should be taken into
consideration before undertaking the roastgsa
. First, the expression
data should be approximately normally distributed, at least in their
univariate form. For RNA-seq data or any other form of counts data, prior
normalization, e.g., rlog
or vst
(DESeq2
), zscoreDGE
(edgeR
) or voom
(limma
), should be used. Besides, filtering for expression intensity
and variability could be considered, especially when the gene
variation across individuals is limited by the experimental coverage.
For microarray intensities, as it happens in the example of this vignette, we can perform a quantile normalization to balance out the libraries.
library(preprocessCore)
ynorm <- normalize.quantiles(y)
rownames(ynorm) <- rownames(y)
colnames(ynorm) <- colnames(y)
par(mfrow=c(1,2))
boxplot(y, las = 2)
boxplot(ynorm, las = 2)
y <- ynorm
Indicating the model to be fitted is essential for the roastgsa
implementation. This follows a similar strategy to the limma
specifications. With the attributes form
, covar
and contrast
it can be
set the linear model and the specific estimated coefficient to be used
in roastgsa
for hypothesis testing. The matrix design is found
automatically taking the form
and covar
parameters (see
below). The contrast
can be in a vector object (length equal to the
number of columns in the matrix design), an integer with the column of
the term of interest in the matrix design, as well as the name of such column.
form = as.formula("~ BLOCK + GROUP")
design <- model.matrix(form, covar)
contrast = "GROUP1"
Three parameters that define the roast hypothesis testing have to be
properly specified: -a- the number of rotations to draw the null
hypothesis (nrot
); -b- the statistic to be used (set.statistic
);
-c- the type of hypothesis (self.contained
argument), competitive
(set to FALSE) or self-contained (set to TRUE).
Regarding the test statistics, the maxmean
(directional) and the
absmean
(nondirectional) are recommended to maximise the power, with
mean.rank
being a good non-parametric alternative for a more robust
statistic if it is known that a few highly differentially expressed
genes might influence heavily the statistic calculation. For a more
“democratic” statistic, one that counterbalance both ends of
the distribution (negative and positive), we encourage using the
mean
statistic.
Below, there is the standard roastgsa
instruction (under competitive
testing) for maxmean
and mean.rank
statistics.
fit.maxmean.comp <- roastgsa(y, form = form, covar = covar,
contrast = contrast, index = kegg.gs, nrot = 500,
mccores = 1, set.statistic = "maxmean",
self.contained = FALSE, executation.info = FALSE)
f1 <- fit.maxmean.comp$res
rownames(f1) <- substr(rownames(f1),1,8)
head(f1)
## total_genes measured_genes est nes pval adj.pval
## hsa05230 70 69 0.5425709 3.879147 0.001996008 0.05112851
## hsa04115 73 73 0.8053681 3.454336 0.001996008 0.05112851
## hsa04215 32 31 0.6703507 3.378919 0.001996008 0.05112851
## hsa04520 93 92 0.6180546 3.301359 0.001996008 0.05112851
## hsa04530 169 164 0.4898137 3.255032 0.001996008 0.05112851
## hsa05412 77 77 0.4803524 3.141053 0.001996008 0.05112851
fit.meanrank <- roastgsa(y, form = form, covar = covar,
contrast = 2, index = kegg.gs, nrot = 500,
mccores = 1, set.statistic = "mean.rank",
self.contained = FALSE, executation.info = FALSE)
f2 <- fit.meanrank$res
rownames(f2) <- substr(rownames(f2),1,8)
head(f2)
## total_genes measured_genes est pval.diff adj.pval.diff
## hsa04710 34 33 2526.621 0.00998004 0.996008
## hsa04012 85 84 2213.738 0.02195609 0.996008
## hsa00220 22 21 4323.690 0.04191617 0.996008
## hsa00230 128 124 -1623.702 0.04191617 0.996008
## hsa04725 113 111 -1226.869 0.06187625 0.996008
## hsa04964 23 23 4231.543 0.06187625 0.996008
## pval.mixed adj.pval.mixed
## hsa04710 0.46706587 0.7004377
## hsa04012 0.04990020 0.5574657
## hsa00220 0.05988024 0.5574657
## hsa00230 0.31536926 0.6214081
## hsa04725 0.48103792 0.7054838
## hsa04964 0.02794411 0.5574657
Several functions to summarize or visualize the results can be applied
to objects of class roastgsa
, which are found as output of the
roastgsa
function.
The plot
function of a roastgsa
object produces a general image
of the differential expression results within any tested gene set. If
type = "stats"
, it shows the ordered moderated t-statistics in
various formats, area for up- and down- expressed genes, barcode plot
for these ordered values and density. With the argument whplot
it can be
selected the gene set of interest (either an integer with the ordered
position in the roastgsa output or the name of the gene set).
plot(fit.maxmean.comp, type ="stats", whplot = 2, gsainfo = TRUE,
cex.sub = 0.5, lwd = 2)
If the roastgsa
statistic is mean.rank
, the moderated-t statistic
centred ranks are printed instead.
plot(fit.meanrank, type ="stats", whplot = 1, gsainfo = TRUE,
cex.sub = 0.4, lwd = 2)
Even though the type = "GSEA"
option is directly interpretable for
ksmean and ksmax statistics, we find it useful for seeing the behavior
in both Kolgomorov-Smirnov type scores and simple summary statistics:
plot(fit.maxmean.comp, type = "GSEA", whplot = 2, gsainfo = TRUE,
maintitle = "", cex.sub = 0.5, statistic = "mean")
plot(fit.meanrank, type = "GSEA", whplot = 1, gsainfo = TRUE,
maintitle = "", cex.sub = 0.5, statistic = "mean")
Another graphic that is proposed in this package to visualize the genes activity
within a gene set can be obtained through the function heatmaprgsa_hm
.
The main intention is to illustrate the variation across samples for the
gene set of interest. We highly recommend the generation of this graphic,
or any other similar plot that shows sample variation for the tested gene sets.
Not only for showing which genes are activated but also as quality control to
detect samples that can be highly influential in the GSA analysis.
hm <- heatmaprgsa_hm(fit.maxmean.comp, y, intvar = "GROUP", whplot = 3,
toplot = TRUE, pathwaylevel = FALSE, mycol = c("orange","green",
"white"), sample2zero = FALSE)