Contents

1 Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("roastgsa")

2 roastgsa in RNA-seq data

This vignette explains broadly the main functions for applying roastgsa in RNA-seq data. A more exhaustive example to explore the roastgsa functionality is presented in the “roastgsa vignette (main)”. All the analyses explained in the main vignette can be reproduced for RNA-seq data, after undertaking the steps covered here in the section “Data normalization and filtering”.

library( roastgsa )

3 Data: RNA-seq experiment from GSEABenchmarkeR

We consider the first dataset available in the tcga compendium from the GSEABenchmarkeR package [1], which consists of a RNA-seq study with 19 tumor Bladder Urothelial Carcinoma samples and 19 adjacent healthy tissues.

#library(GSEABenchmarkeR)
#tcga <- loadEData("tcga", nr.datasets=1,cache = TRUE)
#ysel <- assays(tcga[[1]])$expr
#fd <-  rowData(tcga[[1]])
#pd <- colData(tcga[[1]])
data(fd.tcga)
data(pd.tcga)
data(expr.tcga)

fd <- fd.tcga
ysel <- expr.tcga
pd <- pd.tcga
N  <- ncol(ysel)

head(pd)
## DataFrame with 6 rows and 4 columns
##                                              sample     type     GROUP
##                                         <character> <factor> <numeric>
## TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-K4-A3WV-01A-11R..     BLCA         1
## TCGA-BT-A20W-01A-21R-A14Y-07 TCGA-BT-A20W-01A-21R..     BLCA         1
## TCGA-K4-A5RI-01A-11R-A28M-07 TCGA-K4-A5RI-01A-11R..     BLCA         1
## TCGA-BT-A20N-01A-11R-A14Y-07 TCGA-BT-A20N-01A-11R..     BLCA         1
## TCGA-BL-A13J-01A-11R-A277-07 TCGA-BL-A13J-01A-11R..     BLCA         1
## TCGA-BT-A20U-01A-11R-A14Y-07 TCGA-BT-A20U-01A-11R..     BLCA         1
##                                     BLOCK
##                               <character>
## TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-K4-A3WV
## TCGA-BT-A20W-01A-21R-A14Y-07 TCGA-BT-A20W
## TCGA-K4-A5RI-01A-11R-A28M-07 TCGA-K4-A5RI
## TCGA-BT-A20N-01A-11R-A14Y-07 TCGA-BT-A20N
## TCGA-BL-A13J-01A-11R-A277-07 TCGA-BL-A13J
## TCGA-BT-A20U-01A-11R-A14Y-07 TCGA-BT-A20U
cnames <- c("BLOCK","GROUP")
covar <- data.frame(pd[,cnames,drop=FALSE])
covar$GROUP <- as.factor(covar$GROUP)
colnames(covar) <- cnames
print(table(covar$GROUP))
## 
##  0  1 
## 19 19

4 Data normalization and filtering

To apply roastgsa, the expression data should be approximately normally distributed, at least in their univariate form. Depending on the user’s preferred method for differential expression analysis, counts transformation methods such as rlog or vst (DESeq2) [2], zscoreDGE (edgeR) [3] or voom (limma) [4], can be applied. In the paper we explored the type I and type II errors when applying the rlog or vst transformation followed by roastgsa, showing both good control of type I errors and decent true discovery rates. In the example presented here we transform the expression data with vst function from DESeq2 R package

library(DESeq2)
dds1 <- DESeqDataSetFromMatrix(countData=ysel,colData=pd,
    design= ~ BLOCK + GROUP)
dds1 <- estimateSizeFactors(dds1)
ynorm <- assays(vst(dds1))[[1]]
colnames(ynorm) <- rownames(covar) <- paste0("s",1:ncol(ynorm))

Another key step before using the roastgsa methods for enrichment analysis is to filter out low expressed genes, where coverage might be a limitation for detecting true differentially expressed genes. For the TCGA data considered here, the default filter employed by the authors when loading the data was to exclude genes with cpm < 2 in more than half of the samples. A short discussion about the relationship between gene coverage and statistical power for the roastgsa approach is available in our article presenting the roastgsa package.

threshLR <- 10
dim(ysel)
## [1] 3621   38
min(apply(ysel,1,mean))
## [1] 88.26316

5 Gene sets

We consider a classic repository of general biological functions for battery gene set analysis such as broad hallmarks [5]. The gene sets for human are saved within the roastgsa package and can be loaded by

data(hallmarks.hs)
head(names(hallmarks.hs))
## [1] "HALLMARK_TNFA_SIGNALING_VIA_NFKB"    "HALLMARK_HYPOXIA"                   
## [3] "HALLMARK_CHOLESTEROL_HOMEOSTASIS"    "HALLMARK_MITOTIC_SPINDLE"           
## [5] "HALLMARK_WNT_BETA_CATENIN_SIGNALING" "HALLMARK_TGF_BETA_SIGNALING"

In this case, hallmarks.hs contains gene symbols whereas the row names for ynorm are entrez identifiers. We can set the row names to symbols, which in this case presents a one-to-one relationship

rownames(ynorm) <-fd[rownames(ynorm),1]

Other gene set databases that could be applied to these data for battery testing are presented in the roastgsa vignette (gene set collections).

6 Enrichment analysis

The comparison of interest can be specified by a numeric vector with length matching the number of columns in the design.

form <- as.formula(paste0("~ ", paste0(cnames, collapse = "+")))
design <- model.matrix(form , data = covar)
terms <- colnames(design)
contrast <- rep(0, length(terms))
contrast[length(colnames(design))] <- 1

Below, there is the standard roastgsa instruction (under competitive testing) for maxmean and mean statistics.

fit.maxmean <- roastgsa(ynorm, form = form, covar = covar,
    contrast = contrast, index = hallmarks.hs, nrot = 500,
    mccores = 1, set.statistic = "maxmean",
    self.contained = FALSE, executation.info = FALSE)
f1 <- fit.maxmean$res
rownames(f1) <- gsub("HALLMARK_","",rownames(f1))
head(f1)
##                           total_genes measured_genes        est       nes
## G2M_CHECKPOINT                    200            188  1.1752646  3.865173
## E2F_TARGETS                       200            194  1.4694371  3.830430
## MYOGENESIS                        200            155 -0.8304752 -2.974351
## MYC_TARGETS_V2                     58             58  1.0304122  2.704939
## MYC_TARGETS_V1                    200            192  0.7727287  2.627226
## UNFOLDED_PROTEIN_RESPONSE         113            104  0.3749170  2.595250
##                                  pval  adj.pval
## G2M_CHECKPOINT            0.001996008 0.0499002
## E2F_TARGETS               0.005988024 0.0998004
## MYOGENESIS                0.001996008 0.0499002
## MYC_TARGETS_V2            0.025948104 0.1696607
## MYC_TARGETS_V1            0.013972056 0.1397206
## UNFOLDED_PROTEIN_RESPONSE 0.029940120 0.1696607
fit.mean <- roastgsa(ynorm, form = form, covar = covar,
    contrast = contrast, index = hallmarks.hs, nrot = 500,
    mccores = 1, set.statistic = "mean",
    self.contained = FALSE, executation.info = FALSE)
f2 <- fit.mean$res
rownames(f2) <- gsub("HALLMARK_","",rownames(f2))
head(f2)
##                           total_genes measured_genes        est       nes
## E2F_TARGETS                       200            194  1.1896796  2.994555
## G2M_CHECKPOINT                    200            188  0.9287256  2.848072
## UNFOLDED_PROTEIN_RESPONSE         113            104  0.4270303  2.588451
## MYOGENESIS                        200            155 -0.7076989 -2.362653
## DNA_REPAIR                        150            139  0.4878909  2.288121
## UV_RESPONSE_DN                    144            134 -0.5941011 -2.260208
##                                  pval  adj.pval
## E2F_TARGETS               0.001996008 0.0332668
## G2M_CHECKPOINT            0.001996008 0.0332668
## UNFOLDED_PROTEIN_RESPONSE 0.009980040 0.0998004
## MYOGENESIS                0.001996008 0.0332668
## DNA_REPAIR                0.013972056 0.0998004
## UV_RESPONSE_DN            0.009980040 0.0998004

6.1 Visualization of the results

Several graphics can be obtained to complement the table results in f1 and f2. Here we only show the heatmaps that summarize the expression patterns obtained for all tested hallmarks. Full description and usage of all graphical options available in the roastgsa package are considered in the roastgsa vignette for arrays data and the roastgsa manual

hm1 <- heatmaprgsa_hm(fit.maxmean, ynorm, intvar = "GROUP", whplot = 1:50,
        toplot = TRUE, pathwaylevel = TRUE, mycol = c("orange","green",
        "white"), sample2zero = FALSE)

hm2 <- heatmaprgsa_hm(fit.mean, ynorm, intvar = "GROUP", whplot = 1:50,
        toplot = TRUE, pathwaylevel = TRUE, mycol = c("orange","green",
        "white"), sample2zero = FALSE)