Contents

1 Where does it all come from?

Test whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed (DE) genes. See Goeman and Buehlmann, 2007 for a critical review.

Example: Transcriptomic study, in which 12,671 genes have been tested for differential expression between two sample conditions and 529 genes were found DE.

Among the DE genes, 28 are annotated to a specific functional gene set, which contains in total 170 genes. This setup corresponds to a 2x2 contingency table,

deTable <-
     matrix(c(28, 142, 501, 12000),
            nrow = 2,
            dimnames = list(c("DE", "Not.DE"),
                            c("In.gene.set", "Not.in.gene.set")))
deTable
##        In.gene.set Not.in.gene.set
## DE              28             501
## Not.DE         142           12000

where the overlap of 28 genes can be assessed based on the hypergeometric distribution. This corresponds to a one-sided version of Fisher’s exact test, yielding here a significant enrichment.

fisher.test(deTable, alternative = "greater")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  deTable
## p-value = 4.088e-10
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  3.226736      Inf
## sample estimates:
## odds ratio 
##   4.721744

This basic principle is at the foundation of major public and commercial enrichment tools such as DAVID and Pathway Studio.

Although gene set enrichment methods have been primarily developed and applied on transcriptomic data, they have recently been modified, extended and applied also in other fields of genomic and biomedical research. This includes novel approaches for functional enrichment analysis of proteomic and metabolomic data as well as genomic regions and disease phenotypes, Lavallee and Yates, 2016, Chagoyen et al., 2016, McLean et al., 2010, Ried et al., 2012.

2 Gene expression-based enrichment analysis

The first part of the workshop is largely based on the EnrichmentBrowser package, which implements an analysis pipeline for high-throughput gene expression data as measured with microarrays and RNA-seq. In a workflow-like manner, the package brings together a selection of established Bioconductor packages for gene expression data analysis. It integrates a wide range of gene set enrichment analysis methods and facilitates combination and exploration of results across methods.

library(EnrichmentBrowser)

Further information can be found in the vignette and publication.

2.1 A primer on terminology, existing methods & statistical theory

Gene sets, pathways & regulatory networks

Gene sets are simple lists of usually functionally related genes without further specification of relationships between genes.

Pathways can be interpreted as specific gene sets, typically representing a group of genes that work together in a biological process. Pathways are commonly divided in metabolic and signaling pathways. Metabolic pathways such as glycolysis represent biochemical substrate conversions by specific enzymes. Signaling pathways such as the MAPK signaling pathway describe signal transduction cascades from receptor proteins to transcription factors, resulting in activation or inhibition of specific target genes.

Gene regulatory networks describe the interplay and effects of regulatory factors (such as transcription factors and microRNAs) on the expression of their target genes.

Resources

GO and KEGG annotations are most frequently used for the enrichment analysis of functional gene sets. Despite an increasing number of gene set and pathway databases, they are typically the first choice due to their long-standing curation and availability for a wide range of species.

GO: The Gene Ontology (GO) consists of three major sub-ontologies that classify gene products according to molecular function (MF), biological process (BP) and cellular component (CC). Each ontology consists of GO terms that define MFs, BPs or CCs to which specific genes are annotated. The terms are organized in a directed acyclic graph, where edges between the terms represent relationships of different types. They relate the terms according to a parent-child scheme, i.e. parent terms denote more general entities, whereas child terms represent more specific entities.

KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided in 7 broad categories: metabolism, genetic and environmental information processing, cellular processes, organismal systems, human diseases, and drug development. Metabolism and drug development pathways differ from pathways of the other 5 categories by illustrating reactions between chemical compounds. Pathways of the other 5 categories illustrate molecular interactions between genes and gene products.

Gene set analysis vs. gene set enrichment analysis

The two predominantly used enrichment methods are:

However, the term gene set enrichment analysis nowadays subsumes a general strategy implemented by a wide range of methods Huang et al., 2009. Those methods have in common the same goal, although approach and statistical model can vary substantially Goeman and Buehlmann, 2007, Khatri et al., 2012.

To better distinguish from the specific method, some authors use the term gene set analysis to denote the general strategy. However, there is also a specific method from Efron and Tibshirani, 2007 of this name.

Underlying null: competitive vs. self-contained

Goeman and Buehlmann, 2007 classified existing enrichment methods into competitive and self-contained based on the underlying null hypothesis.

Although the authors argue that a self-contained null is closer to the actual question of interest, the vast majority of enrichment methods is competitive.

Goeman and Buehlmann further raise several critical issues concerning the 2x2 ORA:

With regard to these statistical concerns, GSEA is considered superior:

However, the simplicity and general applicability of ORA is unmet by subsequent methods improving on these issues. For instance, GSEA requires the expression data as input, which is not available for gene lists derived from other experiment types. On the other hand, the involved sample permutation procedure has been proven inaccurate and time-consuming Efron and Tibshirani, 2007, Phipson and Smyth, 2010, Larson and Owen, 2015.

Generations: ora, fcs & topology-based

Khatri et al., 2012 have taken a slightly different approach by classifying methods along the timeline of development into three generations:

  1. Generation: ORA methods based on the 2x2 contingency table test,
  2. Generation: functional class scoring (FCS) methods such as GSEA, which compute gene set (= functional class) scores by summarizing per-gene DE statistics,
  3. Generation: topology-based methods, explicitly taking into account interactions between genes as defined in signaling pathways and gene regulatory networks (Geistlinger et al., 2011 for an example).

Although topology-based (also: network-based) methods appear to be most realistic, their straightforward application can be impaired by features that are not-detectable on the transcriptional level (such as protein-protein interactions) and insufficient network knowledge Geistlinger et al., 2013, Bayerlova et al., 2015.

Given the individual benefits and limitations of existing methods, cautious interpretation of results is required to derive valid conclusions. Whereas no single method is best suited for all application scenarios, applying multiple methods can be beneficial. This has been shown to filter out spurious hits of individual methods, thereby reducing the outcome to gene sets accumulating evidence from different methods Geistlinger et al., 2016, Alhamdoosh et al., 2017.

2.2 Data types

Although RNA-seq (read count data) has become the de facto standard for transcriptomic profiling, it is important to know that many methods for differential expression and gene set enrichment analysis have been originally developed for microarray data (intensity measurements).

However, differences in data distribution assumptions (microarray: quasi-normal, RNA-seq: negative binomial) made adaptations in differential expression analysis and, to some extent, also in gene set enrichment analysis necessary.

Thus, we consider two example datasets - a microarray and a RNA-seq dataset, and discuss similarities and differences of the respective analysis steps.

For microarray data, we consider expression measurements of patients with acute lymphoblastic leukemia Chiaretti et al., 2004. A frequent chromosomal defect found among these patients is a translocation, in which parts of chromosome 9 and 22 swap places. This results in the oncogenic fusion gene BCR/ABL created by positioning the ABL1 gene on chromosome 9 to a part of the BCR gene on chromosome 22.

We load the ALL dataset

library(ALL)
data(ALL)

and select B-cell ALL patients with and without the BCR/ABL fusion, as described previously Gentleman et al., 2005.

ind.bs <- grep("^B", ALL$BT)
ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG"))
sset <- intersect(ind.bs, ind.mut)
all.eset <- ALL[, sset]

We can now access the expression values, which are intensity measurements on a log-scale for 12,625 probes (rows) across 79 patients (columns).

dim(all.eset)
## Features  Samples 
##    12625       79
exprs(all.eset)[1:4,1:4]
##              01005    01010    03002    04007
## 1000_at   7.597323 7.479445 7.567593 7.905312
## 1001_at   5.046194 4.932537 4.799294 4.844565
## 1002_f_at 3.900466 4.208155 3.886169 3.416923
## 1003_s_at 5.903856 6.169024 5.860459 5.687997

As we often have more than one probe per gene, we compute gene expression values as the average of the corresponding probe values.

all.eset <- probe.2.gene.eset(all.eset) 
head(featureNames(all.eset))
## [1] "5595" "7075" "1557" "643"  "1843" "4319"

For RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone Himes et al., 2014.

We load the airway dataset

library(airway)
data(airway)

and create an Biobase::ExpressionSet.

air.eset <- as(airway, "ExpressionSet")
annotation(air.eset) <- "hsa"

Note: we are switching here from SummarizedExperiment to ExpressionSet as functionality in the EnrichmentBrowser is explicitly designed for gene-level expression data, for which ExpressionSet can be consistently used.

For further analysis, we only keep genes that are annotated to an ENSEMBL gene ID.

air.eset <- air.eset[grep("^ENSG", rownames(air.eset)), ]
dim(air.eset)
## Features  Samples 
##    63677        8
exprs(air.eset)[1:4,1:4]
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003        679        448        873        408
## ENSG00000000005          0          0          0          0
## ENSG00000000419        467        515        621        365
## ENSG00000000457        260        211        263        164

2.3 Differential expression analysis

Normalization of high-throughput expression data is essential to make results within and between experiments comparable. Microarray (intensity measurements) and RNA-seq (read counts) data typically show distinct features that need to be normalized for. As this is beyond the scope of this workshop, we refer to limma for microarray normalization and EDASeq for RNA-seq normalization. See also EnrichmentBrowser::normalize, which wraps commonly used functionality for normalization.

The EnrichmentBrowser incorporates established functionality from the limma package for differential expression analysis. This involves the voom transformation when applied to RNA-seq data. Alternatively, differential expression analysis for RNA-seq data can also be carried out based on the negative binomial distribution with edgeR and DESeq2.

This can be performed using the function EnrichmentBrowser::de.ana and assumes some standardized variable names:

For more information on experimental design, see the limma user’s guide, chapter 9.

For the ALL dataset, the GROUP variable indicates whether the BCR-ABL gene fusion is present (1) or not (0).

pData(all.eset)$GROUP <- ifelse(all.eset$mol.biol == "BCR/ABL", 1, 0)
table(pData(all.eset)$GROUP)
## 
##  0  1 
## 42 37

For the airway dataset, it indicates whether the cell lines have been treated with dexamethasone (1) or not (0).

pData(air.eset)$GROUP <- ifelse(colData(airway)$dex == "trt", 1, 0)
table(pData(air.eset)$GROUP)
## 
## 0 1 
## 4 4

Paired samples, or in general sample batches/blocks, can be defined via a BLOCK column in the pData slot. For the airway dataset, the sample blocks correspond to the four different cell lines.

pData(air.eset)$BLOCK <- colData(airway)$cell
table(pData(air.eset)$BLOCK)
## 
## N052611 N061011 N080611  N61311 
##       2       2       2       2

For microarray data, the EnrichmentBrowser::de.ana function carries out differential expression analysis based on functionality from the limma package. Resulting log2 fold changes and t-test derived p-values for each gene are appended to the fData slot.

all.eset <- de.ana(all.eset)
head(fData(all.eset), n=4)
##               FC  ADJ.PVAL limma.STAT
## 5595  0.04296986 0.8996597  0.7346897
## 7075  0.03208350 0.9494365  0.4546963
## 1557 -0.04394014 0.8184362 -1.0658081
## 643  -0.02775435 0.9296705 -0.5674048

Nominal p-values are already corrected for multiple testing (ADJ.PVAL) using the method from Benjamini and Hochberg implemented in stats::p.adjust.

For RNA-seq data, the de.ana function can be used to carry out differential expression analysis between the two groups either based on functionality from limma (that includes the voom transformation), or alternatively, the frequently used edgeR or DESeq2 package. Here, we use the analysis based on edgeR.

air.eset <- de.ana(air.eset, de.method="edgeR")
## Excluding 50740 genes not satisfying min.cpm threshold
head(fData(air.eset), n=4)
##                          FC     ADJ.PVAL  edgeR.STAT
## ENSG00000000003 -0.40494099 9.700883e-05 18.75561361
## ENSG00000000419  0.18305320 1.356563e-01  3.48605567
## ENSG00000000457  0.01455321 9.423545e-01  0.01477962
## ENSG00000000460 -0.13467114 6.537338e-01  0.43278224

2.4 Gene sets

We are now interested in whether pre-defined sets of genes that are known to work together, e.g. as defined in the Gene Ontology or the KEGG pathway annotation, are coordinately differentially expressed. The function get.kegg.genesets downloads all KEGG pathways for a chosen organism (here: Homo sapiens) as gene sets using NCBI Entrez Gene IDs.

kegg.gs <- get.kegg.genesets("hsa")

Analogously, the function get.go.genesets retrieves GO terms of a selected ontology (here: biological process, BP) as defined in the GO.db annotation package.

go.gs <- get.go.genesets(org="hsa", onto="BP", mode="GO.db")

User-defined gene sets can be parsed from GMT file format. Here, we are using this functionality for reading a list of already downloaded KEGG gene sets for Homo sapiens containing NCBI Entrez Gene IDs.

data.dir <- system.file("extdata", package="EnrichmentBrowser")
gmt.file <- file.path(data.dir, "hsa_kegg_gs.gmt")
hsa.gs <- parse.genesets.from.GMT(gmt.file)
length(hsa.gs)
## [1] 39
hsa.gs[1:2]
## $hsa05416_Viral_myocarditis
##  [1] "100509457" "101060835" "1525"      "1604"      "1605"      "1756"      "1981"      "1982"     
##  [9] "25"        "2534"      "27"        "3105"      "3106"      "3107"      "3108"      "3109"     
## [17] "3111"      "3112"      "3113"      "3115"      "3117"      "3118"      "3119"      "3122"     
## [25] "3123"      "3125"      "3126"      "3127"      "3133"      "3134"      "3135"      "3383"     
## [33] "3683"      "3689"      "3908"      "4624"      "4625"      "54205"     "5551"      "5879"     
## [41] "5880"      "5881"      "595"       "60"        "637"       "6442"      "6443"      "6444"     
## [49] "6445"      "71"        "836"       "841"       "842"       "857"       "8672"      "940"      
## [57] "941"       "942"       "958"       "959"      
## 
## $`hsa04622_RIG-I-like_receptor_signaling_pathway`
##  [1] "10010"  "1147"   "1432"   "1540"   "1654"   "23586"  "26007"  "29110"  "338376" "340061"
## [11] "3439"   "3440"   "3441"   "3442"   "3443"   "3444"   "3445"   "3446"   "3447"   "3448"  
## [21] "3449"   "3451"   "3452"   "3456"   "3467"   "3551"   "3576"   "3592"   "3593"   "3627"  
## [31] "3661"   "3665"   "4214"   "4790"   "4792"   "4793"   "5300"   "54941"  "55593"  "5599"  
## [41] "5600"   "5601"   "5602"   "5603"   "56832"  "57506"  "5970"   "6300"   "64135"  "64343" 
## [51] "6885"   "7124"   "7186"   "7187"   "7189"   "7706"   "79132"  "79671"  "80143"  "841"   
## [61] "843"    "8517"   "8717"   "8737"   "8772"   "9140"   "9474"   "9636"   "9641"   "9755"

See also the MSigDb for additional gene set collections.

2.5 GO/KEGG overrepresentation analysis

A variety of gene set analysis methods have been proposed Khatri et al., 2012. The most basic, yet frequently used, method is the over-representation analysis (ORA) with gene sets defined according to GO or KEGG. As outlined in the first section, ORA tests the overlap between DE genes (typically DE p-value < 0.05) and genes in a gene set based on the hypergeometric distribution. Here, we choose a significance level \(\alpha = 0.2\) for demonstration.

ora.all <- sbea(method="ora", eset=all.eset, gs=hsa.gs, perm=0, alpha=0.2)
gs.ranking(ora.all)
## DataFrame with 7 rows and 4 columns
##                                                          GENE.SET  NR.GENES NR.SIG.GENES   P.VALUE
##                                                       <character> <numeric>    <numeric> <numeric>
## 1                hsa05202_Transcriptional_misregulation_in_cancer       153           17    0.0351
## 2 hsa05412_Arrhythmogenic_right_ventricular_cardiomyopathy_(ARVC)        63            8    0.0717
## 3                                                hsa05144_Malaria        45            6    0.0932
## 4                   hsa04670_Leukocyte_transendothelial_migration        94           10    0.1220
## 5                 hsa05100_Bacterial_invasion_of_epithelial_cells        64            7    0.1620
## 6                  hsa04622_RIG-I-like_receptor_signaling_pathway        54            6    0.1780
## 7                  hsa05130_Pathogenic_Escherichia_coli_infection        43            5    0.1840

Such a ranked list is the standard output of most existing enrichment tools. Using the ea.browse function creates a HTML summary from which each gene set can be inspected in more detail.

ea.browse(ora.all)

The resulting summary page includes for each significant gene set

As ORA works on the list of DE genes and not the actual expression values, it can be straightforward applied to RNA-seq data. However, as the gene sets here contain NCBI Entrez gene IDs and the airway dataset contains ENSEMBL gene ids, we first map the airway dataset to Entrez IDs.

air.eset <- map.ids(air.eset, org="hsa", from="ENSEMBL", to="ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
## Excluded 1144 genes without a corresponding to.ID
## Encountered 22 from.IDs with >1 corresponding to.ID (a single to.ID was chosen for each of them)
ora.air <- sbea(method="ora", eset=air.eset, gs=hsa.gs, perm=0)
gs.ranking(ora.air)
## DataFrame with 8 rows and 4 columns
##                                                          GENE.SET  NR.GENES NR.SIG.GENES   P.VALUE
##                                                       <character> <numeric>    <numeric> <numeric>
## 1                                    hsa05206_MicroRNAs_in_cancer       121           71  0.000423
## 2                      hsa05410_Hypertrophic_cardiomyopathy_(HCM)        53           35  0.000765
## 3 hsa05412_Arrhythmogenic_right_ventricular_cardiomyopathy_(ARVC)        48           31  0.002670
## 4                                               hsa05218_Melanoma        50           32  0.002830
## 5                                                 hsa05214_Glioma        47           30  0.004070
## 6                                 hsa04068_FoxO_signaling_pathway       106           58  0.012800
## 7                   hsa04670_Leukocyte_transendothelial_migration        62           35  0.028600
## 8                                                hsa05144_Malaria        21           14  0.029300

Note #1: Young et al., 2010, have reported biased results for ORA on RNA-seq data due to over-detection of differential expression for long and highly expressed transcripts. The goseq package and limma::goana implement possibilities to adjust ORA for gene length and abundance bias.

Note #2: Independent of the expression data type under investigation, overlap between gene sets can result in redundant findings. This is well-documented for GO (parent-child structure, Rhee et al., 2008) and KEGG (pathway overlap/crosstalk, Donato et al., 2013). The topGO package (explicitly designed for GO) and mgsa (applicable to arbitrary gene set definitions) implement modifications of ORA to account for such redundancies.

2.6 Functional class scoring & permutation testing

A major limitation of ORA is that it restricts analysis to DE genes, excluding genes not satisfying the chosen significance threshold (typically the vast majority).

This is resolved by gene set enrichment analysis (GSEA), which scores the tendency of gene set members to appear rather at the top or bottom of the ranked list of all measured genes Subramanian et al., 2005. The statistical significance of the enrichment score (ES) of a gene set is assessed via sample permutation, i.e. (1) sample labels (= group assignment) are shuffled, (2) per-gene DE statistics are recomputed, and (3) the enrichment score is recomputed. Repeating this procedure many times allows to determine the empirical distribution of the enrichment score and to compare the observed enrichment score against it. Here, we carry out GSEA with 1000 permutations.

gsea.all <- sbea(method="gsea", eset=all.eset, gs=hsa.gs, perm=1000)  
## Permutations: 1 -- 100
## Permutations: 101 -- 200
## Permutations: 201 -- 300
## Permutations: 301 -- 400
## Permutations: 401 -- 500
## Permutations: 501 -- 600
## Permutations: 601 -- 700
## Permutations: 701 -- 800
## Permutations: 801 -- 900
## Permutations: 901 -- 1000
## Processing ...
gs.ranking(gsea.all)
## DataFrame with 19 rows and 4 columns
##                                                            GENE.SET        ES       NES   P.VALUE
##                                                         <character> <numeric> <numeric> <numeric>
## 1                  hsa05202_Transcriptional_misregulation_in_cancer    -0.530     -1.72   0.00000
## 2                                        hsa04520_Adherens_junction    -0.488     -1.73   0.00000
## 3   hsa05412_Arrhythmogenic_right_ventricular_cardiomyopathy_(ARVC)    -0.511     -1.87   0.00000
## 4                                      hsa05206_MicroRNAs_in_cancer    -0.427     -1.56   0.00201
## 5                                        hsa05416_Viral_myocarditis    -0.630     -1.88   0.00201
## ...                                                             ...       ...       ...       ...
## 15                                     hsa00790_Folate_biosynthesis    -0.734     -1.66    0.0204
## 16                                           hsa05134_Legionellosis    -0.599     -1.56    0.0271
## 17                   hsa05130_Pathogenic_Escherichia_coli_infection    -0.486     -1.51    0.0295
## 18                                               hsa04210_Apoptosis    -0.424     -1.43    0.0428
## 19                                             hsa05131_Shigellosis    -0.479     -1.52    0.0477

As GSEA’s permutation procedure involves re-computation of per-gene DE statistics, adaptations are necessary for RNA-seq. The EnrichmentBrowser implements an accordingly adapted version of GSEA, which allows incorporation of limma/voom, edgeR, or DESeq2 for repeated DE re-computation within GSEA. However, this is computationally intensive (for limma/voom the least, for DESeq2 the most). Note the relatively long running times for only 100 permutations having used edgeR for DE analysis.

gsea.air <- sbea(method="gsea", eset=air.eset, gs=hsa.gs, perm=100)  

While it might be in some cases necessary to apply permutation-based GSEA for RNA-seq data, there are also alternatives avoiding permutation. Among them is ROtAtion gene Set Testing (ROAST), which uses rotation instead of permutation Wu et al., 2010.

roast.air <- sbea(method="roast", eset=air.eset, gs=hsa.gs)
gs.ranking(roast.air)  
## DataFrame with 28 rows and 4 columns
##                                            GENE.SET  NR.GENES       DIR   P.VALUE
##                                         <character> <numeric> <numeric> <numeric>
## 1   hsa05100_Bacterial_invasion_of_epithelial_cells        60         1     0.001
## 2                        hsa05416_Viral_myocarditis        33         1     0.001
## 3               hsa03420_Nucleotide_excision_repair        42        -1     0.001
## 4                          hsa03030_DNA_replication        33        -1     0.001
## 5                        hsa04520_Adherens_junction        63         1     0.002
## ...                                             ...       ...       ...       ...
## 24     hsa04621_NOD-like_receptor_signaling_pathway        40        -1     0.032
## 25          hsa04514_Cell_adhesion_molecules_(CAMs)        60        -1     0.036
## 26              hsa04350_TGF-beta_signaling_pathway        63         1     0.037
## 27                 hsa00561_Glycerolipid_metabolism        39         1     0.040
## 28                  hsa04068_FoxO_signaling_pathway       106         1     0.049

A selection of additional methods is also available:

sbea.methods()
##  [1] "ora"        "safe"       "gsea"       "samgs"      "ebm"        "mgsa"       "gsa"       
##  [8] "padog"      "globaltest" "roast"      "camera"     "gsva"

2.7 Network-based enrichment analysis

Having found gene sets that show enrichment for differential expression, we are now interested whether these findings can be supported by known regulatory interactions.

For example, we want to know whether transcription factors and their target genes are expressed in accordance to the connecting regulations (activation/inhibition). Such information is usually given in a gene regulatory network derived from specific experiments or compiled from the literature (Geistlinger et al., 2013 for an example).

There are well-studied processes and organisms for which comprehensive and well-annotated regulatory networks are available, e.g. the RegulonDB for E. coli and Yeastract for S. cerevisiae.

However, there are also cases where such a network is missing or at least incomplete. A basic workaround is to compile a network from regulations in the KEGG database.

We can download all KEGG pathways of a specified organism (here: Homo sapiens) via

pwys <- download.kegg.pathways("hsa")

For demonstration purposes, we use a selection of already downloaded human KEGG pathways.

pwys <- file.path(data.dir, "hsa_kegg_pwys.zip")
hsa.grn <- compile.grn.from.kegg(pwys)
head(hsa.grn)
##      FROM   TO     TYPE
## [1,] "3569" "3570" "+" 
## [2,] "3458" "3459" "+" 
## [3,] "3458" "3460" "+" 
## [4,] "1950" "1956" "+" 
## [5,] "1950" "2064" "+" 
## [6,] "1950" "3480" "+"

Signaling pathway impact analysis (SPIA) is a network-based enrichment analysis method, which is explicitly designed for KEGG signaling pathways Tarca et al., 2009. The method evaluates whether expression changes are propagated across the pathway topology in combination with ORA.

spia.all <- nbea(method="spia", eset=all.eset, gs=hsa.gs, grn=hsa.grn, alpha=0.2)
gs.ranking(spia.all)

More generally applicable is gene graph enrichment analysis (GGEA), which evaluates consistency of interactions in a given gene regulatory network with the observed expression data Geistlinger et al., 2011.

ggea.all <- nbea(method="ggea", eset=all.eset, gs=hsa.gs, grn=hsa.grn)
gs.ranking(ggea.all)
## DataFrame with 4 rows and 5 columns
##                           GENE.SET   NR.RELS RAW.SCORE NORM.SCORE   P.VALUE
##                        <character> <numeric> <numeric>  <numeric> <numeric>
## 1       hsa05416_Viral_myocarditis         7      3.30      0.471   0.00699
## 2 hsa04390_Hippo_signaling_pathway        56     20.60      0.368   0.01800
## 3       hsa04520_Adherens_junction         9      3.88      0.431   0.02300
## 4               hsa04210_Apoptosis        19      7.45      0.392   0.02600

A selection of additional network-based methods is also available:

nbea.methods()
##  [1] "ggea"        "spia"        "nea"         "pathnet"     "netgsa"      "degraph"    
##  [7] "topologygsa" "ganpa"       "cepa"        "lego"

Note #1: As network-based enrichment methods typically do not involve sample permutation but rather network permutation, thus avoiding DE re-computation, they can likewise be applied to RNA-seq data.

Note #2: Given the various enrichment methods with individual benefits and limitations, combining multiple methods can be beneficial, e.g. combined application of a set-based and a network-based method. This has been shown to filter out spurious hits of individual methods and to reduce the outcome to gene sets accumulating evidence from different methods Geistlinger et al., 2016, Alhamdoosh et al., 2017.

The function comb.ea.results implements the straightforward combination of results, thereby facilitating seamless comparison of results across methods. For demonstration, we use the ORA and GSEA results for the ALL dataset from the previous section:

res.list <- list(ora.all, gsea.all)
comb.res <- comb.ea.results(res.list)
gs.ranking(comb.res)
## DataFrame with 20 rows and 6 columns
##                                                            GENE.SET  ORA.RANK GSEA.RANK MEAN.RANK
##                                                         <character> <numeric> <numeric> <numeric>
## 1                  hsa05202_Transcriptional_misregulation_in_cancer       2.6       7.7       5.1
## 2   hsa05412_Arrhythmogenic_right_ventricular_cardiomyopathy_(ARVC)       5.1       7.7       6.4
## 3                     hsa04670_Leukocyte_transendothelial_migration      10.3      17.9      14.1
## 4                                        hsa04520_Adherens_junction      20.5       7.7      14.1
## 5                                      hsa05206_MicroRNAs_in_cancer      23.1      12.8      17.9
## ...                                                             ...       ...       ...       ...
## 16                       hsa05410_Hypertrophic_cardiomyopathy_(HCM)      41.0      51.3      46.2
## 17                                     hsa00790_Folate_biosynthesis      53.8      38.5      46.2
## 18                     hsa04621_NOD-like_receptor_signaling_pathway      61.5      33.3      47.4
## 19                              hsa04350_TGF-beta_signaling_pathway      38.5      59.0      48.7
## 20                         hsa05150_Staphylococcus_aureus_infection      79.5      25.6      52.6
##      ORA.PVAL GSEA.PVAL
##     <numeric> <numeric>
## 1      0.0351   0.00000
## 2      0.0717   0.00000
## 3      0.1220   0.00765
## 4      0.2010   0.00000
## 5      0.2250   0.00201
## ...       ...       ...
## 16      0.403    0.0565
## 17      0.556    0.0204
## 18      0.631    0.0198
## 19      0.390    0.0982
## 20      0.851    0.0120

3 Genomic region enrichment analysis

Microarrays and next-generation sequencing are also widely applied for large-scale detection of variable and regulatory genomic regions, e.g. single nucleotide polymorphisms, copy number variations, and transcription factor binding sites.

Such experimentally-derived genomic region sets are raising similar questions regarding functional enrichment as in gene expression data analysis.

Of particular interest is thereby whether experimentally-derived regions overlap more (enrichment) or less (depletion) than expected by chance with regions representing known functional features such as genes or promoters.

The regioneR package implements a general framework for testing overlaps of genomic regions based on permutation sampling. This allows to repeatedly sample random regions from the genome, matching size and chromosomal distribution of the region set under study. By recomputing the overlap with the functional features in each permutation, statistical significance of the observed overlap can be assessed.

library(regioneR)

To demonstrate the basic functionality of the package, we consider the overlap of gene promoter regions and CpG islands in the human genome. We expect to find an enrichment as promoter regions are known to be GC-rich. Hence, is the overlap between CpG islands and promoters greater than expected by chance?

We use the collection of CpG islands described in Wu et al., 2010 and restrict them to the set of canonical chromosomes 1-23, X, and Y.

cpgHMM <- toGRanges("http://www.haowulab.org/software/makeCGI/model-based-cpg-islands-hg19.txt")
cpgHMM <- filterChromosomes(cpgHMM, chr.type="canonical")
cpgHMM <- sort(cpgHMM)
cpgHMM
## GRanges object with 63705 ranges and 5 metadata columns:
##           seqnames               ranges strand |    length  CpGcount GCcontent     pctGC    obsExp
##              <Rle>            <IRanges>  <Rle> | <integer> <integer> <integer> <numeric> <numeric>
##       [1]     chr1     [ 10497,  11241]      * |       745       110       549     0.737     1.106
##       [2]     chr1     [ 28705,  29791]      * |      1087       115       792     0.729     0.818
##       [3]     chr1     [135086, 135805]      * |       720        42       484     0.672     0.548
##       [4]     chr1     [136164, 137362]      * |      1199        71       832     0.694     0.524
##       [5]     chr1     [137665, 138121]      * |       457        22       301     0.659     0.475
##       ...      ...                  ...    ... .       ...       ...       ...       ...       ...
##   [63701]     chrY [59213702, 59214290]      * |       589        43       366     0.621     0.765
##   [63702]     chrY [59240512, 59241057]      * |       546        40       369     0.676     0.643
##   [63703]     chrY [59348047, 59348370]      * |       324        17       193     0.596     0.593
##   [63704]     chrY [59349137, 59349565]      * |       429        31       276     0.643       0.7
##   [63705]     chrY [59361489, 59362401]      * |       913       128       650     0.712     1.108
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Analogously, we load promoter regions in the hg19 human genome assembly as available from UCSC:

promoters <- toGRanges("http://gattaca.imppc.org/regioner/data/UCSC.promoters.hg19.bed")
promoters <- filterChromosomes(promoters, chr.type="canonical")
promoters <- sort(promoters)
promoters
## GRanges object with 49049 ranges and 3 metadata columns:
##           seqnames               ranges strand |       V4       V5       V6
##              <Rle>            <IRanges>  <Rle> | <factor> <factor> <factor>
##       [1]     chr1       [ 9873, 12073]      * |      TSS        .        +
##       [2]     chr1       [16565, 18765]      * |      TSS        .        -
##       [3]     chr1       [17551, 19751]      * |      TSS        .        -
##       [4]     chr1       [17861, 20061]      * |      TSS        .        -
##       [5]     chr1       [19559, 21759]      * |      TSS        .        -
##       ...      ...                  ...    ... .      ...      ...      ...
##   [49045]     chrY [59211948, 59214148]      * |      TSS        .        +
##   [49046]     chrY [59328251, 59330451]      * |      TSS        .        +
##   [49047]     chrY [59350972, 59353172]      * |      TSS        .        +
##   [49048]     chrY [59352984, 59355184]      * |      TSS        .        +
##   [49049]     chrY [59360654, 59362854]      * |      TSS        .        -
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

To speed up the example, we restrict analysis to chromosomes 21 and 22. Note that this is done for demonstration only. To make an accurate claim, the complete region set should be used (which, however, runs considerably longer).

cpg <- cpgHMM[seqnames(cpgHMM) %in% c("chr21", "chr22")]
prom <- promoters[seqnames(promoters) %in% c("chr21", "chr22")]

Now, we are applying an overlap permutation test with 100 permutations (ntimes=100), while maintaining chromosomal distribution of the CpG island region set (per.chromosome=TRUE). Furthermore, we use the option count.once=TRUE to count an overlapping CpG island only once, even if it overlaps with 2 or more promoters. This takes about 2 minutes on a standard laptop.

pt <- overlapPermTest(cpg, prom, genome="hg19", ntimes=100, per.chromosome=TRUE, count.once=TRUE)
pt
## $numOverlaps
## P-value: 0.0099009900990099
## Z-score: 66.4363
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 719
## Evaluation function: numOverlaps
## Randomization function: randomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
summary(pt[[1]]$permuted)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    99.0   112.8   120.0   119.2   126.0   141.0

The resulting permutation p-value indicates a significant enrichment. Out of the 2859 CpG islands, 719 overlap with at least one promoter. In contrast, when repeatedly drawing random regions matching the CpG islands in size and chromosomal distribution, the mean number of overlapping regions across permutations was 119.2 \(\pm\) 9.

Note #1: The function regioneR::permTest allows to incorporate user-defined functions for randomizing regions and evaluating additional measures of overlap such as total genomic size in bp.

Note #2: The LOLA package implements a genomic region ORA, which assesses genomic region overlap based on the hypergeometric distribution using a library of pre-defined functional region sets.

4 Multi-omics enrichment analysis

Multi-omics experiments are increasingly commonplace in biomedical research as e.g. apparent in recent large-scale projects such as ENCODE and TCGA. Such experiments are composed of multiple complementary data types for a set of samples, thereby adding layers of complexity to experimental design, data integration, and analysis. In Bioconductor, the MultiAssayExperiment package provides data structures and methods for representing, manipulating, and integrating multi-assay genomic experiments. See the MultiAssayExperiment workshop for an introduction.

library(MultiAssayExperiment)

Here, we consider an example dataset featuring ovarian cancer samples from TCGA.

data.dir <- system.file("extdata", package="enrichOmics")
mae.file <-  file.path(data.dir, "tcga_ov_mae.rds")
mae <- readr::read_rds(mae.file)
mae
## A MultiAssayExperiment object of 3 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 3: 
##  [1] RNASeq2GeneNorm: matrix with 54 rows and 228 columns 
##  [2] RPPAArray: ExpressionSet with 54 rows and 228 columns 
##  [3] gistica: SummarizedExperiment with 54 rows and 228 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

The dataset contains data for RNA-seq, reverse phase protein array, and gistic copy number variation. Sample data is available on BRCA mutation and expression-based subtype classification.

head(colData(mae))
## DataFrame with 6 rows and 2 columns
##              BRCA_Mutation        Subtype
##                   <factor>       <factor>
## TCGA-04-1348          None Immunoreactive
## TCGA-04-1357       Present Immunoreactive
## TCGA-04-1362          None Differentiated
## TCGA-04-1514          None  Proliferative
## TCGA-04-1519          None  Proliferative
## TCGA-09-1662          None Differentiated

We first make sure that rows (genes/proteins) and columns (patients) are matching between the three assays:

intersectRows(mae)
## A MultiAssayExperiment object of 3 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 3: 
##  [1] RNASeq2GeneNorm: matrix with 54 rows and 228 columns 
##  [2] RPPAArray: ExpressionSet with 54 rows and 228 columns 
##  [3] gistica: SummarizedExperiment with 54 rows and 228 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices
intersectColumns(mae)
## A MultiAssayExperiment object of 3 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 3: 
##  [1] RNASeq2GeneNorm: matrix with 54 rows and 228 columns 
##  [2] RPPAArray: ExpressionSet with 54 rows and 228 columns 
##  [3] gistica: SummarizedExperiment with 54 rows and 228 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

Despite the availability of multi-omics data resources, bioinformatic and statistical methods for the analysis of such experiments are still nascent. This also holds for multi-omics enrichment analysis, where ongoing developments are anticipated to produce novel methods for this purpose. Here, we consider the mogsa package, which uses multivariate extensions of principal component analysis (PCA) to project the data onto a lower dimensional space so that relationships between datasets can be identified.

library(mogsa)
assayMats <- assays(mae)
annotSup <- prepSupMoa(as.list(assayMats), geneSets = hsa.gs)
ov.moa <- moa(assayMats, proc.row = "center_ssq1", w.data = "inertia")
smoa <- sup.moa(ov.moa, sup=annotSup, nf=5)
score.moa <- slot(smoa, "score")
rownames(score.moa) <- substring(rownames(score.moa),1,8)
colnames(score.moa) <- substring(colnames(score.moa),1,12)
score.moa[,1:5]
##           TCGA-04-1348 TCGA-04-1357 TCGA-04-1362 TCGA-04-1514  TCGA-04-1519
## hsa04520 -0.0318637847 -0.003108687  0.008484968  0.046242903  0.0044851459
## hsa05206 -0.0551838953  0.006262606  0.036212258  0.079098096  0.0336993934
## hsa04550 -0.0002012615 -0.006471282  0.033188501  0.015290983 -0.0217330746
## hsa04390  0.0083469335  0.000568580  0.050975826 -0.017561918 -0.0513168755
## hsa04150 -0.0129897021  0.002724539  0.020539524  0.009876181 -0.0007548642
## hsa04910  0.0023102971 -0.019664139  0.031444559  0.021833525 -0.0249869330
## hsa04068 -0.0578305556  0.017333485  0.012303989  0.079185760  0.0503775243
## hsa04066  0.0176969572 -0.007076473 -0.024284570 -0.027197559  0.0027100939

This produces individual enrichment scores for each sample. We use the limma for comparison of sample groups, analogous to differential expression analysis. Here, we define the groups according to presence of the BRCA mutation.

library(limma)
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
group <- colData(mae)[,"BRCA_Mutation"]
design <- model.matrix(~group)
fit <- lmFit(score.moa, design)
fit <- eBayes(fit)
topTable(fit, number=nrow(score.moa), coef="groupPresent")
##                 logFC       AveExpr          t    P.Value adj.P.Val         B
## hsa04066 -0.006924686 -1.800346e-18 -2.1044897 0.03642246 0.2710558 -5.679644
## hsa04150 -0.006891859 -2.287286e-18 -1.8352316 0.06776395 0.2710558 -6.202719
## hsa04550  0.002766459  4.400054e-18  1.0136580 0.31181214 0.6482234 -7.365323
## hsa05206 -0.007800228 -7.011174e-18 -0.9246359 0.35612536 0.6482234 -7.451499
## hsa04520  0.004091628 -1.496484e-18  0.8340134 0.40513961 0.6482234 -7.531151
## hsa04910 -0.001914035  3.896470e-18 -0.6683906 0.50455516 0.6727402 -7.655623
## hsa04390  0.002120871  6.715396e-18  0.5047460 0.61422099 0.7019668 -7.751747
## hsa04068  0.002661311 -6.406303e-18  0.3695316 0.71207185 0.7120719 -7.810978
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] limma_3.33.5                mogsa_1.11.0                MultiAssayExperiment_1.3.20
##  [4] regioneR_1.9.1              BSgenome_1.45.1             rtracklayer_1.37.2         
##  [7] Biostrings_2.45.3           XVector_0.17.0              memoise_1.1.0              
## [10] airway_0.111.0              SummarizedExperiment_1.7.5  DelayedArray_0.3.16        
## [13] matrixStats_0.52.2          GenomicRanges_1.29.10       GenomeInfoDb_1.13.4        
## [16] hgu95av2.db_3.2.3           ALL_1.19.0                  EnrichmentBrowser_2.7.0    
## [19] pathview_1.17.2             org.Hs.eg.db_3.4.1          GSEABase_1.39.0            
## [22] graph_1.55.0                annotate_1.55.0             XML_3.98-1.9               
## [25] AnnotationDbi_1.39.1        IRanges_2.11.10             S4Vectors_0.15.5           
## [28] Biobase_2.37.2              BiocGenerics_0.23.0         BiocStyle_2.5.8            
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.0               GOstats_2.43.0                Hmisc_4.0-3                  
##   [4] AnnotationHub_2.9.5           plyr_1.8.4                    lazyeval_0.2.0               
##   [7] shinydashboard_0.6.1          splines_3.4.1                 BiocParallel_1.11.4          
##  [10] ggplot2_2.2.1                 digest_0.6.12                 BiocInstaller_1.27.2         
##  [13] ensembldb_2.1.10              htmltools_0.3.6               svd_0.4                      
##  [16] GO.db_3.4.1                   gdata_2.18.0                  magrittr_1.5                 
##  [19] checkmate_1.8.3               cluster_2.0.6                 readr_1.1.1                  
##  [22] R.utils_2.5.0                 ggbio_1.25.3                  prettyunits_1.0.2            
##  [25] colorspace_1.3-2              rappdirs_0.3.1                blob_1.1.0                   
##  [28] RCurl_1.95-4.8                genefilter_1.59.0             survival_2.41-3              
##  [31] VariantAnnotation_1.23.5      gtable_0.2.0                  zlibbioc_1.23.0              
##  [34] graphite_1.23.2               Rgraphviz_2.21.0              SparseM_1.77                 
##  [37] scales_0.4.1                  DBI_0.7                       GGally_1.3.1                 
##  [40] edgeR_3.19.3                  Rcpp_0.12.12                  xtable_1.8-2                 
##  [43] progress_1.1.2                htmlTable_1.9                 foreign_0.8-69               
##  [46] bit_1.1-12                    OrganismDbi_1.19.0            Formula_1.2-2                
##  [49] AnnotationForge_1.19.4        htmlwidgets_0.9               httr_1.2.1                   
##  [52] gplots_3.0.1                  RColorBrewer_1.1-2            acepack_1.4.1                
##  [55] pkgconfig_2.0.1               reshape_0.8.6                 R.methodsS3_1.7.1            
##  [58] nnet_7.3-12                   locfit_1.5-9.1                rlang_0.1.1                  
##  [61] reshape2_1.4.2                munsell_0.4.3                 tools_3.4.1                  
##  [64] RSQLite_2.0                   evaluate_0.10.1               stringr_1.2.0                
##  [67] yaml_2.1.14                   knitr_1.16                    bit64_0.9-7                  
##  [70] caTools_1.17.1                KEGGREST_1.17.0               AnnotationFilter_1.1.3       
##  [73] RBGL_1.53.0                   mime_0.5                      R.oo_1.21.0                  
##  [76] KEGGgraph_1.37.1              biomaRt_2.33.3                compiler_3.4.1               
##  [79] curl_2.8.1                    png_0.1-7                     interactiveDisplayBase_1.15.0
##  [82] PFAM.db_3.4.1                 tibble_1.3.3                  geneplotter_1.55.0           
##  [85] stringi_1.1.5                 GenomicFeatures_1.29.8        lattice_0.20-35              
##  [88] ProtGenerics_1.9.0            Matrix_1.2-10                 corpcor_1.6.9                
##  [91] data.table_1.10.4             bitops_1.0-6                  httpuv_1.3.5                 
##  [94] R6_2.2.2                      latticeExtra_0.6-28           hwriter_1.3.2                
##  [97] KernSmooth_2.23-15            gridExtra_2.2.1               codetools_0.2-15             
## [100] dichromat_2.0-0               gtools_3.5.0                  assertthat_0.2.0             
## [103] DESeq2_1.17.10                Category_2.43.1               rprojroot_1.2                
## [106] safe_3.17.0                   ReportingTools_2.17.0         GenomicAlignments_1.13.4     
## [109] Rsamtools_1.29.0              GenomeInfoDbData_0.99.1       hms_0.3                      
## [112] grid_3.4.1                    rpart_4.1-11                  tidyr_0.6.3                  
## [115] rmarkdown_1.6                 biovizBase_1.25.1             shiny_1.0.3                  
## [118] base64enc_0.1-3