InterMine constitutes a powerful open source data warehouse system which integrates diverse biological data sets and provides a growing number of analysis tools, including enrichment analysis widgets (Smith et al. 2012; Kalderimis et al. 2014).
Specifically, the gene set enrichment tool looks for annotations to a set of genes that occur more than would be expected by chance, given a background population of annotations. The hypergeometric distribution is used to calculate a P-value for each annotation and a choice of correction methods for multiple testing (Bonferonni, Holm-Bonferonni and Benjamini Hochberg) are available (Smith et al. 2012; Kalderimis et al. 2014).
InterMine provides Gene Ontology enrichment statistics as well as enrichment statistics for other annotation types including protein domains, pathways, human diseases, mammalian phenotypes and publications. The default background population is all genes in the genome with the specified annotation type. However, the background population can be changed by specifying another list. More information can be found in the online documentation.
Performing enrichment analysis with InterMineR is preceded by two steps:
Retrieve the list of bioentities of interest (Genes, Proteins, SNPs, etc.) for which the analysis will be performed.
Get the enrichment widget name which indicates the annotation types that you want to investigate for enrichment (e.g. Gene Ontology Terms, protein domains, KEGG and Reactome Pathways, human diseases, mammalian phenotypes and publications).
In this example, enrichment analysis is performed using a list of genes which are associated with all forms of Diabetes according to OMIM. All query results and calculations are correct according to HumanMine release 5.1 (November 2018). Genomic coordinates and p-values are likely to change between database releases but the methods will remain the same.
The PL_DiabetesGenes dataset is included in the InterMineR package and can also be found online in HumanMine.
Retrieve PL_DiabetesGenes data and get gene identifiers as HumanrMine Gene.symbol and Gene.primaryIdentifier values (ENTREZ IDs).
# get HumanMine instance
im.human = initInterMine(listMines()["HumanMine"])
# retrieve data model for HumanMine
# hsa_model = getModel(im.human) # temporarily removed
# all targets of HumanMine enrichment widgets possess InterMine ids
# subset(hsa_model, type %in% c("Gene", "Protein", "SNP") & child_name == "id") # temporarily removed
# load human genes which are associated with Diabetes (retrieved from HumanMine)
data("PL_DiabetesGenes")
head(PL_DiabetesGenes, 3)
## Gene.symbol Gene.name Gene.primaryIdentifier
## 1 ABCC8 ATP binding cassette subfamily C member 8 6833
## 2 ACE angiotensin I converting enzyme 1636
## 3 AKT2 AKT serine/threonine kinase 2 208
## Gene.secondaryIdentifier Gene.length Gene.organism.name
## 1 ENSG00000006071 83961 Homo sapiens
## 2 ENSG00000159640 21320 Homo sapiens
## 3 ENSG00000105221 55215 Homo sapiens
# get Gene.symbol
hsa_gene_names = as.character(PL_DiabetesGenes$Gene.symbol)
head(hsa_gene_names, 3)
## [1] "ABCC8" "ACE" "AKT2"
# get Gene.primaryIdentifiers (ENTREZ Gene identifiers)
hsa_gene_entrez = as.character(PL_DiabetesGenes$Gene.primaryIdentifier)
head(hsa_gene_entrez, 3)
## [1] "6833" "1636" "208"
After obtaining the Gene.id values of interest, we must determine the type of annotation for the enrichment analysis.
To retrieve all available widgets for a Mine we can use the getWidgets
function.
## startClassDisplay
## 1 primaryIdentifier
## 2 <NA>
## 3 primaryIdentifier
## 4 primaryIdentifier
## 5 <NA>
## 6 primaryIdentifier
## 7 primaryIdentifier
## 8 primaryIdentifier
## 9 primaryIdentifier
## 10 <NA>
## enrichIdentifier
## 1 GWASResults.study.publication.pubMedId
## 2 <NA>
## 3 goAnnotation.ontologyTerm.parents.identifier
## 4 publications.pubMedId
## 5 <NA>
## 6 pathways.identifier
## 7 GWASResults.study.publication.pubMedId
## 8 proteins.proteinDomainRegions.proteinDomain.primaryIdentifier
## 9 proteinDomainRegions.proteinDomain.primaryIdentifier
## 10 <NA>
## name
## 1 snp_gwas_study_enrichment
## 2 chromosome_distribution_for_snp
## 3 go_enrichment_for_gene
## 4 publication_enrichment
## 5 chromosome_distribution_for_gene
## 6 pathway_enrichment
## 7 snp_publication_enrichment
## 8 prot_dom_enrichment_for_gene
## 9 prot_dom_enrichment_for_protein
## 10 interactions
## description
## 1 GWAS studies enriched for SNPs in this list.
## 2 Actual: number of items in this list found on each chromosome. Expected: given the total number of items on the chromosome and the number of items in this list, the number of items expected to be found on each chromosome.
## 3 GO terms enriched for items in this list.
## 4 Publications enriched for genes in this list.
## 5 Actual: number of items in this list found on each chromosome. Expected: given the total number of items on the chromosome and the number of items in this list, the number of items expected to be found on each chromosome.
## 6 Pathways enriched for genes in this list - data from KEGG and Reactome
## 7 Publications enriched for SNPs in this list.
## 8 Protein Domains enriched for items in this list.
## 9 Protein Domains enriched for items in this list.
## 10 Genes (from the list or not) that interact with genes in this list. Counts may include the same interaction more than once if observed in multiple experiments.
## enrich startClass
## 1 GWASResults.study.name SNP
## 2 <NA> SNP
## 3 goAnnotation.ontologyTerm.parents.name Gene
## 4 publications.title Gene
## 5 <NA> Gene
## 6 pathways.name Gene
## 7 GWASResults.study.publication.title SNP
## 8 proteins.proteinDomainRegions.proteinDomain.name Gene
## 9 proteinDomainRegions.proteinDomain.name Protein
## 10 <NA> <NA>
## title targets widgetType chartType
## 1 GWAS study Enrichment for SNPs SNP enrichment <NA>
## 2 Chromosome Distribution SNP chart ColumnChart
## 3 Gene Ontology Enrichment Gene enrichment <NA>
## 4 Publication Enrichment Gene enrichment <NA>
## 5 Chromosome Distribution Gene chart ColumnChart
## 6 Pathway Enrichment Gene enrichment <NA>
## 7 Publication Enrichment for SNPs SNP enrichment <NA>
## 8 Protein Domain Enrichment Gene enrichment <NA>
## 9 Protein Domain Enrichment Protein enrichment <NA>
## 10 Interactions Gene table <NA>
## filters labels
## 1 <NA> <NA>
## 2 organism.name=[list] Chromosome & Count
## 3 biological_process,cellular_component,molecular_function <NA>
## 4 <NA> <NA>
## 5 organism.name=[list] Chromosome & Count
## 6 All,KEGG pathways data set,Reactome data set <NA>
## 7 <NA> <NA>
## 8 <NA> <NA>
## 9 <NA> <NA>
## 10 <NA> <NA>
HumanMine provides enrichment for genes, proteins and SNPs, but here we are interested only in the gene-related, enrichment widgets.
# get enrichment, gene-related widgets for HumanMine
subset(human.widgets, widgetType == 'enrichment' & targets == "Gene")
## startClassDisplay
## 3 primaryIdentifier
## 4 primaryIdentifier
## 6 primaryIdentifier
## 8 primaryIdentifier
## enrichIdentifier
## 3 goAnnotation.ontologyTerm.parents.identifier
## 4 publications.pubMedId
## 6 pathways.identifier
## 8 proteins.proteinDomainRegions.proteinDomain.primaryIdentifier
## name
## 3 go_enrichment_for_gene
## 4 publication_enrichment
## 6 pathway_enrichment
## 8 prot_dom_enrichment_for_gene
## description
## 3 GO terms enriched for items in this list.
## 4 Publications enriched for genes in this list.
## 6 Pathways enriched for genes in this list - data from KEGG and Reactome
## 8 Protein Domains enriched for items in this list.
## enrich startClass
## 3 goAnnotation.ontologyTerm.parents.name Gene
## 4 publications.title Gene
## 6 pathways.name Gene
## 8 proteins.proteinDomainRegions.proteinDomain.name Gene
## title targets widgetType chartType
## 3 Gene Ontology Enrichment Gene enrichment <NA>
## 4 Publication Enrichment Gene enrichment <NA>
## 6 Pathway Enrichment Gene enrichment <NA>
## 8 Protein Domain Enrichment Gene enrichment <NA>
## filters labels
## 3 biological_process,cellular_component,molecular_function <NA>
## 4 <NA> <NA>
## 6 All,KEGG pathways data set,Reactome data set <NA>
## 8 <NA> <NA>
The column names provides the character strings that can be passed as arguments to the doEnrichment
function and thus define the type of enrichment analysis.
We will perform Gene Ontology Enrichment analysis for the genes in our list using the Gene.id values and the ‘go_enrichment_for_gene’ widget.
# Assign directly the doEnrichment.string from getGeneIds function to the ids argument of doEnrichment to perform the analysis
# Perform enrichment analysis
GO_enrichResult = doEnrichment(
im = im.human,
ids = hsa_gene_entrez,
widget = "go_enrichment_for_gene"
# organism = "Homo sapiens" # optional if results from more than one organism are retrieved
)
As mentioned above ‘PL_DiabetesGenes’ constitutes a genelist which already exists in HumanMine. Therefore, the enrichment analysis could also be performed by passing the name of the list to the genelist argument, with the same results:
# Perform enrichment analysis with genelist name instead of ids
GO_enrichResult = doEnrichment(
im = im.human,
genelist = "PL_DiabetesGenes",
# ids = hsa_gene_entrez,
widget = "go_enrichment_for_gene"
)
doEnrichment
function returns a list which contains:
## identifier description pValue
## 1 GO:0046879 hormone secretion 6.204441832880645E-19
## 2 GO:0030072 peptide hormone secretion 6.476612587650905E-19
## 3 GO:0009914 hormone transport 6.723020438657542E-19
## 4 GO:0030073 insulin secretion 1.5680383071753694E-16
## 5 GO:0010817 regulation of hormone levels 1.5506760945034293E-15
## 6 GO:0090276 regulation of peptide hormone secretion 3.5511625693881988E-15
## count populationAnnotationCount
## 1 22 314
## 2 21 259
## 3 22 321
## 4 18 213
## 5 23 538
## 6 17 213
## [1] 622 5
## [1] 16493
## [1] 4
## $mine
## HumanMine
## "https://www.humanmine.org/humanmine"
##
## $token
## [1] ""
## ids
## "1131430,1093405,1188128,1229795,1216372,1177683,1022676,1177065,1178932,1026605,1038683,1084950,1191771,1106734,1206901,1227196,1201352,1099872,1165469,1190571,1138950,1031624,1274298,1143616,1128168,1222664,1263823,1253457,1212332,1227784,1047623,1214714,1270030,1039222,1245987,1155021,1247856,1188250,1258926,1056496,1110142,1226582,1092036,1256785,1214871,1261136,1117996,1087538,1231385,1235292,1224965,1016209,1144121,1194921,1138149,1176009,1217635,1195369,1184491,1106856,1155143,1081673,1042632,1144093,1096577"
## widget
## "go_enrichment_for_gene"
## maxp
## "0.05"
## correction
## "Benjamini Hochberg"
Some of the enrichment widgets contain filters which, when applied, limit the annotation types of the enrichment analysis.
In our example, if we want to retrieve only the enriched GO terms in our list of genes that are related to molecular function, we will use the appropriate filter:
# get available filter values for Gene Ontology Enrichment widget
as.character(subset(human.widgets, name == "go_enrichment_for_gene")$filters)
## [1] "biological_process,cellular_component,molecular_function"
# Perform enrichment analysis for GO molecular function terms
GO_MF_enrichResult = doEnrichment(
im = im.human,
ids = hsa_gene_entrez,
widget = "go_enrichment_for_gene",
filter = "molecular_function")
head(GO_MF_enrichResult$data)
## identifier description
## 1 GO:0005158 insulin receptor binding
## 2 GO:0005102 signaling receptor binding
## 3 GO:0140297 DNA-binding transcription factor binding
## 4 GO:0005159 insulin-like growth factor receptor binding
## 5 GO:0001067 regulatory region nucleic acid binding
## 6 GO:0000976 transcription regulatory region sequence-specific DNA binding
## pValue count populationAnnotationCount
## 1 0.002461313324665432 4 22
## 2 0.004171469750781318 19 1531
## 3 0.03683404640731029 8 346
## 4 0.0373066722185923 3 16
## 5 0.039088326209195855 13 1013
## 6 0.046163610927798915 12 897
## [1] 7 5
To reduce the probability of false positive errors, three different multiple correction algorithms can be applied to the results of the enrichment analysis:
The application of one of these algorithms changes the p-values and determines the number of the results which will be returned from the analysis:
# Perform enrichment analysis for Protein Domains in list of genes
PD_FDR_enrichResult = doEnrichment(
im = im.human,
ids = hsa_gene_entrez,
widget = "prot_dom_enrichment_for_gene",
correction = "Benjamini Hochberg"
)
head(PD_FDR_enrichResult$data)
## NULL
# Perform enrichment analysis for Protein Domains in list of genes
# but without a correction algoritm this time
PD_None_enrichResult = doEnrichment(
im = im.human,
ids = hsa_gene_entrez,
widget = "prot_dom_enrichment_for_gene",
correction = "None"
)
head(PD_None_enrichResult$data)
## identifier description
## 1 IPR006897 Hepatocyte nuclear factor 1, beta isoform, C-terminal
## 2 IPR023219 Hepatocyte nuclear factor 1, N-terminal domain superfamily
## 3 IPR039066 Hepatocyte nuclear factor 1
## 4 IPR006899 Hepatocyte nuclear factor 1, N-terminal
## 5 IPR039011 Insulin receptor substrate
## 6 IPR006020 PTB/PI domain
## pValue count populationAnnotationCount
## 1 1.0862293177715881E-5 2 2
## 2 1.0862293177715881E-5 2 2
## 3 1.0862293177715881E-5 2 2
## 4 3.251701806989728E-5 2 3
## 5 3.251701806989728E-5 2 3
## 6 3.161463481151051E-4 3 40
In order to visualize the InterMineR enrichment analysis results, we will use the function convertToGeneAnswers
. This function was created to facilitate the visualization of doEnrichment
function results by converting them into a GeneAnswers-class
object.
This way we can utilize the functions of the package GeneAnswers to visualize the results of the enrichment analysis and the relations between the features (e.g. genes) and/or the annoatation categories (e.g. GO terms) (Feng et al. 2010, 2012; Huang et al. 2014).
# load GeneAnswers package
library(GeneAnswers)
# convert InterMineR Gene Ontology Enrichment analysis results to GeneAnswers object
geneanswer_object = convertToGeneAnswers(
# assign with doEnrichment result:
enrichmentResult = GO_enrichResult,
# assign with list of genes:
geneInput = data.frame(GeneID = as.character(hsa_gene_entrez),
stringsAsFactors = FALSE),
# assign with the type of gene identifier
# in our example we use Gene.primaryIdentifier values (ENTREZ IDs):
geneInputType = "Gene.primaryIdentifier",
# assign with Bioconductor annotation package:
annLib = 'org.Hs.eg.db',
# assign with annotation category type
# in our example we use Gene Ontology (GO) terms:
categoryType = "GO"
#optional define manually if 'enrichIdentifier' is missing from getWidgets:
#enrichCategoryChildName = "Gene.goAnnotation.ontologyTerm.parents.identifier"
)
class(geneanswer_object)
## [1] "GeneAnswers"
## attr(,"package")
## [1] "GeneAnswers"
## This GeneAnswers instance was build from GO based on hyperG test.
## Statistical information of 622 categories with p value less than 0.05 are reported. Other categories are considered as nonsignificant.
## There are 622 categories related to the given 68 genes
##
## Summary of GeneAnswers instance information:
##
## Slot: geneInput
## GeneID
## 1 6833
## 2 1636
## 3 208
## 4 26060
## 5 359
## 6 551
## ......
##
## Slot: testType
## [1] "hyperG"
##
## Slot: pvalueT
## [1] 0.05
##
## Slot: genesInCategory
## $`GO:0046879`
## [1] "6833" "640" "11132" "2645" "3077" "6927" "6928" "3172"
## [9] "3557" "3569" "3630" "3667" "8660" "3710" "3767" "4544"
## [17] "4760" "3651" "56729" "6514" "169026" "6934"
##
## $`GO:0030072`
## [1] "6833" "640" "11132" "2645" "3077" "6927" "6928" "3172"
## [9] "3557" "3569" "3630" "3667" "8660" "3710" "3767" "4544"
## [17] "4760" "3651" "6514" "169026" "6934"
##
## $`GO:0009914`
## [1] "6833" "640" "11132" "2645" "3077" "6927" "6928" "3172"
## [9] "3557" "3569" "3630" "3667" "8660" "3710" "3767" "4544"
## [17] "4760" "3651" "56729" "6514" "169026" "6934"
##
## $`GO:0030073`
## [1] "6833" "640" "11132" "2645" "6927" "6928" "3172" "3557"
## [9] "3667" "8660" "3710" "3767" "4544" "4760" "3651" "6514"
## [17] "169026" "6934"
##
## $`GO:0010817`
## [1] "6833" "1636" "640" "11132" "2645" "3077" "6927" "6928"
## [9] "3172" "3557" "3569" "3630" "3667" "8660" "3710" "3767"
## [17] "4544" "4760" "3651" "56729" "6514" "169026" "6934"
##
## $`GO:0090276`
## [1] "6833" "640" "11132" "2645" "3077" "3172" "3630" "3667"
## [9] "8660" "3710" "3767" "4544" "4760" "3651" "6514" "169026"
## [17] "6934"
##
## ......
##
## Slot: geneExprProfile
## NULL
##
## Slot: annLib
## [1] "org.Hs.eg.db"
##
## Slot: categoryType
## [1] "GO"
##
## Slot: enrichmentInfo
## genes in Category percent in the observed List percent in the genome
## GO:0046879 22 0.3235294 0.01903838
## GO:0030072 21 0.3088235 0.01570363
## GO:0009914 22 0.3235294 0.01946280
## GO:0030073 18 0.2647059 0.01291457
## GO:0010817 23 0.3382353 0.03261990
## GO:0090276 17 0.2500000 0.01291457
## fold of overrepresents odds ratio p value fdr p value
## GO:0046879 16.99354 24.64262 6.204442e-19 6.204442e-19
## GO:0030072 19.66574 28.00575 6.476613e-19 6.476613e-19
## GO:0009914 16.62296 24.09481 6.723020e-19 6.723020e-19
## GO:0030073 20.49669 27.51549 1.568038e-16 1.568038e-16
## GO:0010817 10.36899 15.15758 1.550676e-15 1.550676e-15
## GO:0090276 19.35798 25.47731 3.551163e-15 3.551163e-15
## ......
GeneAnswers is package designed for the enrichment analysis and visualization of gene lists. It is worth mentioning that the InterMineR filters for the widgets of Gene Ontology and Pathway Enrichment:
as.character(
subset(human.widgets,
title %in% c("Gene Ontology Enrichment",
"Pathway Enrichment"))$filters
)
## [1] "biological_process,cellular_component,molecular_function"
## [2] "All,KEGG pathways data set,Reactome data set"
match the following available values:
InterMineR widget name | InterMineR filter | convertToGeneAnswers categoryType |
---|---|---|
go_enrichment_for_gene | biological_process | GO.BP |
go_enrichment_for_gene | cellular_component | GO.CC |
go_enrichment_for_gene | molecular_function | GO.MF |
pathway_enrichment | KEGG pathways data set | KEGG |
pathway_enrichment | Reactome data set | REACTOME.PATH |
for the categoryType argument of the geneAnswerBuilder
function, and can be assigned accordingly in the categoryType argument of the convertToGeneAnswers
, therby facilitating the conversion of InterMineR Gene Ontology and Pathway Enrichment analysis results to GeneAnswers-class
objects.
# convert to GeneAnswers results for GO terms associated with molecular function
geneanswer_MF_object = convertToGeneAnswers(
enrichmentResult = GO_MF_enrichResult,
geneInput = data.frame(GeneID = as.character(hsa_gene_entrez),
stringsAsFactors = FALSE),
geneInputType = "Gene.primaryIdentifier",
annLib = 'org.Hs.eg.db',
categoryType = "GO.MF"
#enrichCategoryChildName = "Gene.goAnnotation.ontologyTerm.parents.identifier"
)
class(geneanswer_MF_object)
## [1] "GeneAnswers"
## attr(,"package")
## [1] "GeneAnswers"
## This GeneAnswers instance was build from GO.MF based on hyperG test.
## Statistical information of 7 categories with p value less than 0.05 are reported. Other categories are considered as nonsignificant.
## There are 7 categories related to the given 68 genes
##
## Summary of GeneAnswers instance information:
##
## Slot: geneInput
## GeneID
## 1 6833
## 2 1636
## 3 208
## 4 26060
## 5 359
## 6 551
## ......
##
## Slot: testType
## [1] "hyperG"
##
## Slot: pvalueT
## [1] 0.05
##
## Slot: genesInCategory
## $`GO:0005158`
## [1] "3630" "3667" "8660" "5770"
##
## $`GO:0005102`
## [1] "1636" "551" "640" "2056" "3077" "3082" "3159" "3172" "3557"
## [10] "3569" "3630" "3643" "3667" "8660" "5468" "5770" "56729" "6934"
## [19] "7422"
##
## $`GO:0140297`
## [1] "50943" "3159" "3172" "4760" "5468" "6925" "6934" "7466"
##
## $`GO:0005159`
## [1] "3630" "3643" "3667"
##
## $`GO:0001067`
## [1] "50943" "169792" "3159" "6927" "6928" "3172" "8462" "4760"
## [9] "3651" "5325" "5468" "6925" "6934"
##
## $`GO:0000976`
## [1] "50943" "169792" "3159" "6927" "6928" "3172" "4760" "3651"
## [9] "5325" "5468" "6925" "6934"
##
## ......
##
## Slot: geneExprProfile
## NULL
##
## Slot: annLib
## [1] "org.Hs.eg.db"
##
## Slot: categoryType
## [1] "GO.MF"
##
## Slot: enrichmentInfo
## genes in Category percent in the observed List percent in the genome
## GO:0005158 4 0.05882353 0.0013426095
## GO:0005102 19 0.27941176 0.0934334188
## GO:0140297 8 0.11764706 0.0211155865
## GO:0005159 3 0.04411765 0.0009764433
## GO:0001067 13 0.19117647 0.0618210668
## GO:0000976 12 0.17647059 0.0547418528
## fold of overrepresents odds ratio p value fdr p value
## GO:0005158 43.812834 46.488636 0.002461313 0.002461313
## GO:0005102 2.990491 3.762314 0.004171470 0.004171470
## GO:0140297 5.571574 6.181118 0.036834046 0.036834046
## GO:0005159 45.181985 47.221154 0.037306672 0.037306672
## GO:0001067 3.092416 3.586987 0.039088326 0.039088326
## GO:0000976 3.223687 3.700191 0.046163611 0.046163611
## ......
Briefly, in the following sections we present how to use several functions of the GeneAnswers package to visualize the results of the Gene Ontology Enrichment analysis on the PL_DiabetesGenes
gene list.
For more information about the usage of GeneAnswers package, the user should look at the respective documentation.
# Make GeneAnswers Instance readable
geneanswer_object_readable = geneAnswersReadable(geneanswer_object)
# GeneAnswers pieChart
geneAnswersChartPlots(geneanswer_object_readable,
chartType='pieChart',
sortBy = 'correctedPvalue',
top = 3)
Figure 1: Pie chart of the top three GO terms with the smallest FDR-corrected p-values.
# GeneAnswers barplot
geneAnswersChartPlots(geneanswer_object_readable,
chartType='barPlot',
sortBy = 'correctedPvalue',
top = 5)