Gene lists derived from the results of genomic analyses are rich in biological information. For instance, differentially expressed genes (DEGs) from a microarray or RNA-Seq analysis are related functionally in terms of their response to a treatment or condition. Gene lists can vary in size, up to several thousand genes, depending on the robustness of the perturbations or how widely different the conditions are biologically. Having a way to associate biological relatedness between hundreds and thousands of genes systematically is impractical by manually curating the annotation and function of each gene. Over-representation analysis (ORA) of genes was developed to identify biological themes. Given a Gene Ontology (GO) and an annotation of genes that indicate the categories each one fits into, significance of the over-representation of the genes within the ontological categories is determined by a Fisher’s exact test or modeling according to a hypergeometric distribution. Comparing a small number of enriched biological categories for a few samples is manageable using Venn diagrams or other means for assessing overlaps. However, with hundreds of enriched categories and many samples, the comparisons are laborious. Furthermore, if there are enriched categories that are shared between samples, trying to represent a common theme across them is highly subjective. goSTAG uses GO subtrees to tag and annotate genes within a set. goSTAG visualizes the similarities between the over-representation of DEGs by clustering the p-values from the enrichment statistical tests and labels clusters with the GO term that has the most paths to the root within the subtree generated from all the GO terms in the cluster.
The goSTAG package is available through Bioconductor and you can install it by running:
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("goSTAG")
Once the library is installed, you can load it by running:
This section describes how to prepare your data to perform a goSTAG analysis.
The key input data that goSTAG requires is a set of gene lists. These gene lists can be derived from any type of gene-based analysis, such as differential expression analysis or ChIP-seq peak targets. They can be generated within R using another package, or using an external tool and then loaded into R.
Once you have generated your gene lists and loaded them into R, you must convert them into the format that goSTAG requires, which is a list of vectors. Each element of the list corresponds to a gene list, and the vectors contain the genes in the gene list.
We provide an example set of gene lists in the goSTAG package.
data( goSTAG_example_gene_lists ) head( lapply( goSTAG_example_gene_lists, head ) )
## $Day1 ##  "CNN2" "GHDC" "GALNT10" "LIPT1" "TPP1" "TBPL2" ## ## $Day2 ##  "GPR37L1" "TNFSF13B" "OLAH" "WWC1" "NYNRIN" "MRGPRX4" ## ## $Day3 ##  "MAD2L1BP" "MRPL47" "SLIRP" "MEF2BNB-MEF2B" ##  "TMEM245" "UBA6"
For convienience, we have provided a function for loading external gene lists.
The first option is to put your gene lists into a single GMT file. This GMT file is a tab-delimited file containing one gene list per line. The first entry in the line is the name of the gene list, the second entry in the line is a description of the gene list (which goSTAG ignores), and the rest of the entries are the genes. For more information on the GMT format, see here.
Once you have created your GMT file, you can load it using the
gene_lists <- loadGeneLists( "gene_lists.gmt" )
The second option is to have your gene lists in separate tab-delimited text files within a directory.
You can load the gene lists within this directory using the
loadGeneLists function with the
type argument set to
gene_lists <- loadGeneLists( "/gene/lists/directory", type = "DIR" )
By default, the
loadGeneLists function expects the files in this directory to have no header and to have the genes in the first column. You can modify this behavior using the
header argument and the
In this example, the files contain a header that needs to be ignored, and the genes are in the 7th column.
gene_lists <- loadGeneLists( "/gene/lists/directory", type = "DIR", header = TRUE, col = 7 )
The second input data required by goSTAG is a set of GO terms and the genes associated with those GO terms. The GO terms should all belong to the same GO domain (either biological process, cellular component, or molecular function). The expected format is a list of vectors, where each list corresponds to a GO term, and the vectors contain the genes associated with the GO terms. The names of the lists should be the GO IDs of the GO terms. The list must also have an additional element named “ALL”, which is a vector that contains all annotated genes.
The goSTAG package provides the
loadGOTerms function for loading the GO terms and gene associations from BioMart using the
go_terms <- loadGOTerms() head( lapply( go_terms, head ) )
## $`GO:0006810` ##  "SLC25A26" "GJA3" "GJB2" "HERC1" "SLC35A2" "BAG6" ## ## $`GO:0048011` ##  "BTC" "FGF7" "FURIN" "APH1B" "NDN" "AGO2" ## ## $`GO:0007173` ##  "BTC" "FGF7" "AGO2" "FGF9" "MAPKAP1" "FGF2" ## ## $`GO:0008543` ##  "BTC" "FGF7" "AGO2" "FGF9" "MAPKAP1" "FGF2" ## ## $`GO:0048015` ##  "BTC" "FGF7" "UBE2C" "AGO2" "FGF9" "MAPKAP1" ## ## $`GO:0008284` ##  "BTC" "FGF7" "EDN3" "RPS4X" "AVP" "FGF9"
head( go_terms[["ALL"]] )
##  "SLC25A26" "BTC" "GAST" "HAP1" "JUP" "FKBP10"
Since loading the GO terms from BioMart can take several minutes, the default behavior of the
loadGOTerms function is to use a previously generated version that has been archived in the goSTAG package, which is much faster.
You can load the most recent version of the GO terms and gene associations directly from BioMart by setting the
use_archived argument to
go_terms <- loadGOTerms( use_archived = FALSE )
By default, the
loadGOTerms function will associate genes based on the annotation for human. It will also load GO terms for the biological process domain and remove any GO terms that has fewer than 5 genes associated with it. To modify this behavior, use the
min_num_genes arguments. Currently, the supported options for the
species argument are “human”, “mouse”, or “rat”. Acceptable options for the
domain argument are “BP” (biological process), “CC” (cellular component), or “MF” (molecular function).
This example will load GO terms for mouse, using the molecular function domain, and remove any GO terms with fewer than 10 genes.
go_terms_mouse <- loadGOTerms( species = "mouse", domain = "MF", min_num_genes = 10 )
Once your gene lists and GO terms are loaded, you are ready to perform a goSTAG analysis. This section will detail the four steps in the goSTAG analysis.
The first step is to create a matrix of GO enrichment scores. The goSTAG package calculates an enrichment score for each gene list and GO term using the hypergeometric test. It will remove any GO term that has no samples with significant enrichment. The enrichment score is calculated as the -log10 p-value of the hypergeometric test. For tests without significant enrichment, the enrichment score is set to 0.
To generate an enrichment matrix from your gene lists and GO terms, use the
enrichment_matrix <- performGOEnrichment( goSTAG_example_gene_lists, go_terms ) head(enrichment_matrix)
## Day1 Day2 Day3 ## GO:0007165 0.000000 0.000000 1.343007 ## GO:0007369 0.000000 0.000000 1.389698 ## GO:0042127 0.000000 0.000000 1.729184 ## GO:0007249 1.862509 1.766861 1.654304 ## GO:0008380 1.345807 0.000000 1.306134 ## GO:0016049 0.000000 1.436532 0.000000
This function produces a matrix of enrichment scores, where columns are gene lists and rows are GO terms.
By default, this function will remove any GO term where none of the samples have a p-value of less than 0.05. You can specify whether to use a nominal p-value or an adjusted p-value using the
filter_method argument, and you can change the significance threshold using the
significance_threshold argument. This function uses the
p.adjust method to adjust the p-value, and you can specify the method used to adjust the p-values with the
p.adjust_method argument. See the
method argument for the
p.adjust function for available options.
This example generates the enrichment matrix and uses a Benjamini-Hochberg FDR of 0.3 to filter the GO terms.
enrichment_matrix_FDR <- performGOEnrichment( goSTAG_example_gene_lists, go_terms, filter_method = "p.adjust", significance_threshold = 0.3, p.adjust_method = "BH" )
After generating the enrichment matrix, the next step is to cluster the GO terms. Before clustering the GO terms into groups, you must first perform hierarchical clustering.
You can perform the hierarchical clustering using the
hclust_results <- performHierarchicalClustering( enrichment_matrix ) hclust_results
## ## Call: ## hclust(d = as.dist(1 - abs(cor(t(final_matrix)))), method = clustering_method) ## ## Cluster method : average ## Number of objects: 155
This function uses the
hclust function and returns an object of class
hclust containing the hierarchical clustering tree.
By default, this function uses one minus the absolute value of the correlation to measure distance. You can change this behavior using the
distance_method argument. Specifying anything other than
correlation will cause this function to use the
dist function to measure distance. The method specified in the
distance_method argument is the method that the
dist function will use. See the
method argument for the
dist function for available options.
This function uses the
hclust function to perform the hierarchical clustering, with a default agglomeration method of
average. You can change this by specifying the
This example performs hierarchical clustering on the GO terms using Euclidean distance and complete agglomeration.
hclust_results_euclidean <- performHierarchicalClustering( enrichment_matrix, distance_method = "euclidean", clustering_method = "complete" )
You can also perform hierarchical clustering on the samples by setting the
feature argument to
sample_hclust_results <- performHierarchicalClustering( enrichment_matrix, feature = "col" )
After the hierarchical clustering is complete, you can group the GO terms into clusters using the
clusters <- groupClusters( hclust_results_euclidean ) lapply( head( clusters ), head )
## $Cluster1 ##  "GO:0007249" ## ## $Cluster2 ##  "GO:0043928" ## ## $Cluster3 ##  "GO:0043537" "GO:0000288" ## ## $Cluster4 ##  "GO:0006417" ## ## $Cluster5 ##  "GO:0045760" "GO:0090316" ## ## $Cluster6 ##  "GO:0002224"
This function produces a list of vectors, where each list corresponds to a cluster, and the vectors are the GO terms in the clusters.
By default, this function will group GO terms whose distance to each other is less than 0.2. This threshold can be adjusted by changing the
distance_threshold argument. A larger distance threshold will produce fewer clusters with more GO terms, whereas a smaller one will produce more clusters with fewer GO terms.
The default distance threshold of 0.2 produces a moderate number of clusters.
length( clusters )
##  35
Increasing the distance threshold to 0.5 produces fewer clusters, but with more GO terms on average.
clusters_larger_threshold <- groupClusters( hclust_results_euclidean, distance_threshold = 0.5 ) length( clusters_larger_threshold )
##  21
Whereas decreasing the distance threshold to 0.05 produces more clusters, but with fewer GO terms on average.
clusters_smaller_threshold <- groupClusters( hclust_results_euclidean, distance_threshold = 0.05 ) length( clusters_smaller_threshold )
##  60
The final step is to annotate each of the clusters using the GO term within each cluster’s subtree that has the most paths back to the root.
You can annotate the clusters using this method with the
cluster_labels <- annotateClusters( clusters ) head( cluster_labels )
## GO:0007249 ## "I-kappaB kinase/NF-kappaB signaling" ## GO:0043928 ## "exonucleolytic catabolism of deadenylated mRNA" ## GO:0000288 ## "nuclear-transcribed mRNA catabolic process, deadenylation-dependent decay" ## GO:0006417 ## "regulation of translation" ## GO:0090316 ## "positive regulation of intracellular protein transport" ## GO:0002224 ## "toll-like receptor signaling pathway"
This returns a vector containing the descriptions of the GO terms tagged to each cluster.
After performing a goSTAG analysis, you can plot a heatmap of the results using the
plotHeatmap( enrichment_matrix, hclust_results_euclidean, clusters, cluster_labels )
You can add the hierarchical clustering of the samples to the plot using the
plotHeatmap( enrichment_matrix, hclust_results_euclidean, clusters, cluster_labels, sample_hclust_results = sample_hclust_results )
By default, this function will only label clusters with 10 or more GO terms. You can change this by modifying the
This example has the minimum number of GO terms decreased to 5, which increases the number of clusters that are labeled.
plotHeatmap( enrichment_matrix, hclust_results_euclidean, clusters, cluster_labels, min_num_terms = 5 )