Proteins at the cell surface connect intracellular and extracellular signaling networks and largely determine a cell’s capacity to communicate and interact with its environment.
Importantly, variations in transcriptomic profiles are often observed between healthy and diseased cells, presenting distinct sets of cell-surface proteins. Indeed, cell surface proteins i) may act as biomarkers for the detection of diseased cells in tissues or body fluids and ii) are the most prevalent target of human drugs: 66% of approved human drugs listed in the DrugBank database target a cell-surface protein. The investigation of the cell surfaceome therefore could provide new possibilities for diagnosis, prognosis, treatment development, and therapy response evaluation.
When a study aims to find new biomarkers, the small number of samples often limits the ability to obtain reliable results. However, as sequencing costs continue to decrease, several follow-up experiments will likely be conducted to re-address the same biological question, suggesting a need for methods able to jointly analyze data from multiple studies.
SurfR provides a solution to these issues, generating a list of ranked surface protein-coding differentially-expressed genes starting from the raw count matrix of your own RNA-seq experiment or from bulk transcriptomic data automatically retrieved from public databases such as GEO and TCGA. GEO queries are based on the ArchS4 pipeline. TCGA repository is interrogated through TCGAbiolinks.
SurfR also offers the opportunity to increase the sample size of a cohort by integrating related datasets, therefore enhancing the power to detect differentially expressed genes of interest. Meta-analysis can be performed through metaRNASeq, taking into account inter-study variability that may arise from technical differences among studies (e.g., sample preparation, library protocols, batch effects) as well as additional biological variability.
Gene ontology (GO) and pathway annotation can also be performed within SurfR to gain further insights about surface protein candidates.
Finally, SurfR includes functions to visualize DEG and enrichment results, aiding in the interpretation of findings.
Install the package from Bioconductor or GitHub, ensuring correct Bioconductor dependencies.
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager", repos = "https://cran.r-project.org")
When the package is available on Bioconductor, use
BiocManager::install("SurfR")
Use the pre-release or devel version from GitHub using devtools with
# install.packages("devtools")
devtools::install_github("auroramaurizio/SurfR", build_vignettes = TRUE)
The basic idea behind SurfR has been to create a complete framework to detect surface protein coding genes within your data, or within public datasets. This framework facilitates the simultaneous analysis and comparison of multiple studies, easily revealing functional consensus and differences among distinct conditions. To begin, let’s ask SurfR to detect surface protein coding genes among a vector of input genes. Gene ID can be provided as gene_name
, ensembl
, entrez
or uniProt_name
.
The protein classification is based on a recently developed surfaceome predictor, called SURFY (Bausch-Fluck et al. 2018), based on machine learning.
library(SurfR)
#>
GeneNames <- c("CIITA", "EPCAM", "CD24", "CDCP1", "LYVE1")
SurfaceProteins_df <- Gene2SProtein(GeneNames,
input_type = "gene_name",
output_tsv = FALSE,
Surfy_version = "new")
#> 4 out of 5 genes have a matching surface protein
#The output dataframe contains several information retrieved from Surfy.
colnames(SurfaceProteins_df)
#> [1] "UniProt.name"
#> [2] "UniProt.accession"
#> [3] "UniProt.description"
#> [4] "UniProt.gene"
#> [5] "Surfaceome.Label"
#> [6] "Surfaceome.Label.Source"
#> [7] "Comment"
#> [8] "length"
#> [9] "TM.domains"
#> [10] "signalpeptide"
#> [11] "topology"
#> [12] "topology.source"
#> [13] "MachineLearning.trainingset"
#> [14] "MachineLearning.score"
#> [15] "MachineLearning.FPR.class.(1=1%,.2=5%,.3=15%)"
#> [16] "Ensembl.gene"
#> [17] "Ensembl.protein"
#> [18] "CD.number"
#> [19] "Membranome.Almen.main-class"
#> [20] "Membranome.Almen.sub-class"
#> [21] "nxst.motifs"
#> [22] "noncyt..nxst.count"
#> [23] "peps.with.accessible.noncyt..nxst"
#> [24] "noncyt..Trp.count"
#> [25] "peps.with.accessible.noncyt..Trp"
#> [26] "noncyt..Tyr.count"
#> [27] "peps.with.accessible.noncyt..Tyr"
#> [28] "glycomineN.sites"
#> [29] "glycomineO.sites"
#> [30] "glycomineC.sites"
#> [31] "CSPA.category"
#> [32] "CSPA.peptide.count"
#> [33] "CSPA.peptides"
#> [34] "CSPA.N115.sites"
#> [35] "CSPA.id"
#> [36] "UniProt.subcellular"
#> [37] "UniProt.keywords"
#> [38] "UniProt.uniref"
#> [39] "COMPARTMENTS.link"
#> [40] "COMPARTMENTS.benchmark.pos"
#> [41] "COMPARTMENTS.benchmark.neg"
#> [42] "HPA.antibody"
#> [43] "DrugBank.approved.drug.IDs"
#> [44] "GeneID"
#> [45] "entrezID"
#These are the 5 surface protein coding genes detected by SurfR.
SurfaceProteins_df$UniProt.gene
#> [1] "EPCAM" "CD24" "CDCP1" "LYVE1"
Although SurfR provides many functions to retrieve public data you can always start from your own dataset.
Here we are going to simulate a small bulkRNA dataset with the R package SPsimSeq (Assefa, Vandesompele, and Thas 2020)
starting from a subset of Zhang RNA-seq data (Zhang et al. 2015),
adding 20% of Differentially Expressed genes (pDE = 0.2
).
You can than decide to stick to it or combine it with other datasets (public or private).
library(SPsimSeq)
data("zhang.data.sub")
zhang.counts <- zhang.data.sub$counts
MYCN.status <- zhang.data.sub$MYCN.status
# Simulation of bulk RNA data
sim.data.bulk <- SPsimSeq(n.sim = 1,
s.data = zhang.counts,
group = MYCN.status,
n.genes = 1000,
batch.config = 1,
group.config = c(0.5, 0.5),
tot.samples = ncol(zhang.counts),
pDE = 0.2,
lfc.thrld = 0.5,
result.format = "list",
return.details = TRUE)
sim.data.bulk1 <- sim.data.bulk$sim.data.list[[1]]
countMatrix <- sim.data.bulk1$counts
row.names(countMatrix) <- row.names(zhang.counts)
metadata <- sim.data.bulk1$colData
metadata$Group <- as.factor(metadata$Group)
A fundamental task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. For this task we rely on the package DESeq2 (Love, Huber, and Anders 2014), starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.
We use the built-in SurfR DGE function, to perform the differential expression analysis of the simulated dataset.
Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting those candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).
library(SurfR)
df_zhang <- DGE(expression = countMatrix,
metadata = metadata,
Nreplica = 50,
design = "~Group",
condition = "Group",
alpha = 0.05,
TEST = "1", CTRL = "0",
output_tsv = FALSE)
head(df_zhang)
Once DEGS have been detected, we may want to isolate Surface protein-coding genes.
The protein classification is based on a recently developed surfaceome predictor, called SURFY, based on machine learning.
We use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP).
# remove NA values
df_zhang <- df_zhang[!is.na(df_zhang$padj), ]
# all fdr
fdr_GeneID <- df_zhang[df_zhang$padj < 0.05, "GeneID"]
SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")
# upregulated fdr
fdrUP_GeneID <- df_zhang[df_zhang$padj < 0.05 & df_zhang$log2FoldChange > 0,
"GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
# dowregulated fdr
fdrDW_GeneID <- df_zhang[df_zhang$padj < 0.05 & df_zhang$log2FoldChange < 0,
"GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")
GEO (Edgar 2002) is a public functional genomics data repository containing high throughput gene expression data and hybridization arrays. We provide a handy interface to download experiments and curated gene expression profiles.
Here we are going to reanalyze the Cloughesy et al. (Cloughesy et al. 2019) public dataset of recurrent glioblastoma patients undergoing two different treatments: neoadjuvant pembrolizumab or adjuvant pembrolizumab..
This study is available under the GEO accession series GSE121810.
The metadata is downloaded with SurfR built-in function GEOmetadata. Note that this study has been sequenced with only one sequencing platform. If this is not the case, you have to download separately all the metadata specifying the GPL series numbers, and then merge them.
The count matrix is downloaded from ArchS4 (Lachmann et al. 2018) with the SurfR built-in function DownloadArchS4, to ensure handling not normalize data.
library(SurfR)
library(stringr)
# Download metadata from GEO
mGSE121810 <- GEOmetadata(GSE = "GSE121810")
# create new metadata column in order to remove unwanted special characters
unwanted_character <- " "
fx <- function(x) {
str_split(string = x, pattern = unwanted_character)[[1]][1]
}
mGSE121810$condition <- sapply(mGSE121810$therapy, fx)
mGSE121810$condition <- as.factor(mGSE121810$condition)
# Preview metadata
head(mGSE121810)
# only select 3 samples per condition to save time
na_samples <- c("GSM3447013", "GSM3447018", "GSM3447019")
a_samples <- c("GSM3447023", "GSM3447024", "GSM3447026")
mGSE121810 <- mGSE121810[c(na_samples, a_samples), ]
# Download count matrix from ArchS4
cGSE121810 <- DownloadArchS4(mGSE121810$GSM,
species = "human",
print_tsv = FALSE,
filename = NULL)
# Preview count matrix
head(cGSE121810[, ])
A fundamental objective in the analysis of RNA-seq counts data is the detection of differentially expressed genes. For this task we rely on the package DESeq2, starting from count data. Count data reports for each sample the number of sequence fragments that have been assigned to each gene.
Here, we use the built-in SurfR DGE function, to perform the differential expression analysis of the GSE121810 dataset.
Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting the best candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).
# Perform DGE
df_GEO <- DGE(expression = cGSE121810,
metadata = mGSE121810,
Nreplica = 3,
design = "~condition",
condition = "condition",
alpha = 0.05,
TEST = "neoadjuvant", CTRL = "adjuvant",
output_tsv = FALSE)
# remove NA values
df_GEO <- df_GEO[!is.na(df_GEO$padj), ]
head(df_GEO)
Once DEGS have been detected, we may want to isolate Surface protein-coding genes.
The protein classification is based on a recently developed surfaceome predictor, called SURFY, based on machine learning.
We use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP).
# Detect SP amoung differentially expressed genes
fdr_GeneID <- df_GEO[df_GEO$padj < 0.1, "GeneID"]
SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")
fdrUP_GeneID <- df_GEO[df_GEO$padj < 0.1 & df_GEO$log2FoldChange > 0, "GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
fdrDW_GeneID <- df_GEO[df_GEO$padj < 0.1 & df_GEO$log2FoldChange < 0, "GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")
TCGA (The Cancer Genome Atlas Research Network et al. 2013) contains data for thousands of tumor samples across more than 20 types of cancer. Navigating through all of the files manually is impossible. Therefore we provide a function based on TCGAbiolinks that automates and streamlines the retrieval of public TCGA transcriptomics data. Note that to use this function you need to install the developmental version of TCGAbiolinks (Mounir et al. 2019) (Colaprico et al. 2016).
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
Here we reanalyze the TCGA-THYM dataset, since it is one of the smallest TCGA datasets, including normal solid tissue samples.
The TCGA count matrix and the metadata are downloaded with SurfR built-in function TCGA_download. The shortLetterCode column of the metadata allows us to distinguish Primary Solid Tumor (TP) and normal (NT) samples.
A fundamental task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. For this task we rely on the package DESeq2, starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.
We use the built-in SurfR DGE function to perform the differential expression analysis of the TCGA-THYM dataset.
Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting those candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).
df_TCGA <- DGE(expression = cTCGA.THYM,
metadata = mTCGA.THYM,
Nreplica = 2,
design = "~shortLetterCode",
condition = "shortLetterCode",
alpha = 0.05,
TEST = "TP", CTRL = "NT",
output_tsv = FALSE)
head(df_TCGA)
Once DEGs have been detected, we may want to isolate Surface protein-coding genes.
The protein classification takes advantage of a recently developed surfaceome predictor, called SURFY, based on machine learning.
We use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP).
# remove NA values
df_TCGA <- df_TCGA[!is.na(df_TCGA$padj), ]
fdr_GeneID <- df_TCGA[df_TCGA$padj < 0.05,
"GeneID"]
SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")
fdrUP_GeneID <- df_TCGA[df_TCGA$padj < 0.05 & df_TCGA$log2FoldChange > 0,
"GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
fdrDW_GeneID <- df_TCGA[df_TCGA$padj < 0.05 & df_TCGA$log2FoldChange < 0,
"GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")
Analyzing data arising from several experiments studying the same question is a way to obtain more robust results, increasing the detection power of differentially expressed genes.
In SurfR we provide a set of functions based on MetaRNASeq (Rau, Marot, and Jaffrézic 2014) package to combine data from multiple RNAseq experiments.
Let’s suppose we want to integrate Breast cancer data from GEO and TCGA. Breast Cancer datasets are downloaded from TCGA with the SurfR built-in function TCGA_download.
cTCGA.BRCA <- TCGA.BRCA[[1]]
mTCGA.BRCA <- TCGA.BRCA[[2]]
table(mTCGA.BRCA$shortLetterCode)
#>
#> NT TP
#> 3 3
In GEO, we want to analyze Varley data (Varley et al. 2014), which includes samples of ER+ breast cancer, Triple Negative Breast cancer, adjacent tissues, and normal breast. These datasets can be retrieved from the GEO series GSE58135, using the SurfR built-in functions GEOmetadata and DownloadArchS4.
mGSE58135 <- GEOmetadata("GSE58135")
mGSE58135 <- mGSE58135[mGSE58135$tissue != "Breast Cancer Cell Line", ]
mGSE58135$condition <- "NT"
mGSE58135$condition[mGSE58135$tissue %in% c("ER+ Breast Cancer Primary Tumor",
"Triple Negative Breast Cancer Primary Tumor")] <- "TP"
# only select 3 samples per condition to save time
TP_samples <- c("GSM1401694", "GSM1401717", "GSM1401729")
NT_samples <- c("GSM1401799", "GSM1401813", "GSM1401814")
mGSE58135 <- mGSE58135[c(TP_samples, NT_samples), ]
cGSE58135 <- DownloadArchS4(mGSE58135$GSM, species = "human")
cGSE58135 <- cGSE58135[, row.names(mGSE58135)]
table(mGSE58135$condition)
#>
#> NT TP
#> 3 3
The first step in the analysis is to detect differentially expressed genes for each count data, separately. For this task we rely on the package DESeq2, starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.
We use the built-in SurfR DGE function to perform the differential expression analysis of the TCGA-BRCA dataset and GSE58135.
Good differentially expressed surface proteins should be strongly expressed in one condition and almost not expressed in the other. The output dataframe of the DGE function contains information on the average expression in the two groups (see Mean_CPM_T and Mean_CPM_C columns), to help in the detection of the best candidates.
# TCGA DGE
df.TCGA <- DGE(expression = cTCGA.BRCA,
metadata = mTCGA.BRCA,
Nreplica = 3,
design = "~shortLetterCode",
condition = "shortLetterCode",
alpha = 0.05,
TEST = "TP", CTRL = "NT",
output_tsv = FALSE)
#> Warning in DESeqDataSet(se, design = design, ignoreRank): 1233 duplicate
#> rownames were renamed by adding numbers
#> Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
#> design formula are characters, converting to factors
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
head(df.TCGA)
#> log2 fold change (MLE): shortLetterCode TP vs NT
#> Wald test p-value: shortLetterCode TP vs NT
#> DataFrame with 6 rows and 8 columns
#> GeneID Mean_CPM_T Mean_CPM_C log2FoldChange lfcSE stat
#> <character> <numeric> <numeric> <numeric> <numeric> <numeric>
#> CST1 CST1 19.0532 0.0000000 12.33847 1.414031 8.72574
#> MMP13 MMP13 98.5584 0.0923984 10.44607 1.064577 9.81242
#> MMP1 MMP1 94.4173 0.2706227 8.69994 1.749042 4.97412
#> COL10A1 COL10A1 212.4102 1.2454422 7.69794 0.668788 11.51027
#> MMP11 MMP11 723.8036 5.3934170 7.43622 0.534409 13.91485
#> MMP10 MMP10 48.1166 0.5060880 6.92586 0.850766 8.14073
#> pvalue padj
#> <numeric> <numeric>
#> CST1 2.64455e-18 6.03219e-16
#> MMP13 9.95552e-23 3.43042e-20
#> MMP1 6.55456e-07 2.49767e-05
#> COL10A1 1.17109e-30 9.98197e-28
#> MMP11 5.14677e-44 1.66704e-40
#> MMP10 3.92895e-16 7.48581e-14
# GSE58135 DGE
df.GSE58135 <- DGE(expression = cGSE58135,
metadata = mGSE58135,
Nreplica = 3,
design = "~condition",
condition = "condition",
alpha = 0.05,
TEST = "TP", CTRL = "NT",
output_tsv = FALSE)
#> Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
#> design formula are characters, converting to factors
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
head(df.GSE58135)
#> log2 fold change (MLE): condition TP vs NT
#> Wald test p-value: condition TP vs NT
#> DataFrame with 6 rows and 8 columns
#> GeneID Mean_CPM_T Mean_CPM_C log2FoldChange lfcSE
#> <character> <numeric> <numeric> <numeric> <numeric>
#> AL513523.1 AL513523.1 14.42919 0.0000000 12.8883 1.31949
#> AL513523.2 AL513523.2 14.42919 0.0000000 12.8883 1.31949
#> FAM204BP FAM204BP 39.67196 0.0136062 11.7927 1.44085
#> HSFX2 HSFX2 3.32139 0.0000000 10.8999 1.46267
#> CTD-2369P2.12 CTD-2369P2.12 32.45917 0.0224913 10.7918 1.09479
#> RP11-325P15.2 RP11-325P15.2 2.16334 0.0000000 10.2662 1.39009
#> stat pvalue padj
#> <numeric> <numeric> <numeric>
#> AL513523.1 9.76762 1.55049e-22 5.46822e-21
#> AL513523.2 9.76762 1.55049e-22 5.46822e-21
#> FAM204BP 8.18454 2.73347e-16 4.00536e-15
#> HSFX2 7.45206 9.18934e-14 9.53677e-13
#> CTD-2369P2.12 9.85743 6.36555e-23 2.38817e-21
#> RP11-325P15.2 7.38527 1.52150e-13 1.53376e-12
Here we provide a function based on metaRNASeq bioconductor package to implement two p-value combination techniques (inverse normal and Fisher methods).
The meta-analysis is performed by the SurfR built-in function metaRNAseq, which requires as input:
ind_deg
);fishercomb
)
or the inverse normal combination technique (invnorm
);BHth
);nrep
).The function automatically produces and saves as .pdf histograms of raw p-values for each of the individual differential analyses performed using the independent filtering from DESeq2 package. You can also examine the p-value distribution after p.value combination.
L_fishercomb <- metaRNAseq(ind_deg = list(TCGA.BRCA = df.TCGA, GEO.GSE58135 = df.GSE58135),
test_statistic = "fishercomb",
BHth = 0.05,
adjpval.t = 0.05)
L_invnorm <- metaRNAseq(ind_deg = list(TCGA.BRCA = df.TCGA, GEO.GSE58135 = df.GSE58135),
test_statistic = "invnorm",
BHth = 0.05,
adjpval.t = 0.05,
nrep = c(102, 56))
Finally, we can summarize the results of the meta-analysis in a data.frame highlighting the statistical information for the common genes to all methods using the built-in SurfR function combine_fisher_invnorm and use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP) among those.
Genes displaying contradictory differential expression in separate studies can be identified in the column signFC
= 0 and removed from the list of differentially expressed genes via meta-analysis.
metacomb <- combine_fisher_invnorm(ind_deg = list(TCGA.BRCA = df.TCGA,
GEO.GSE58135 = df.GSE58135),
invnorm = L_invnorm,
fishercomb = L_fishercomb,
adjpval = 0.05)
metacomb_GeneID <- metacomb[metacomb$signFC != 0, "GeneID"]
SP <- Gene2SProtein(genes = metacomb_GeneID, input_type = "gene_name")
#> 866 out of 7394 genes have a matching surface protein
metacombUP_GeneID <- metacomb[metacomb$signFC == 1, "GeneID"]
SPup <- Gene2SProtein(genes = metacombUP_GeneID, input_type = "gene_name")
#> 266 out of 3749 genes have a matching surface protein
metacombDW_GeneID <- metacomb[metacomb$signFC == -1, "GeneID"]
SPdw <- Gene2SProtein(genes = metacombDW_GeneID, input_type = "gene_name")
#> 600 out of 3645 genes have a matching surface protein
After identifying the subset of genes enriched in our specific condition of interest, a range of analyses becomes useful to move beyond a mere gene list.
A general enrichment analysis allows to gain further insights about upregulated or downregulated DEGs.
To do so, we use the SurfR built-in function Enrichment, based on the enrichR cran package (Kuleshov et al. 2016).
You have the option to indicate the specific database you wish to utilize among the available in enrichR.
The enrichR function listEnrichrDbs()
allows you to navigate the options.
Sporadically, network connectivity issues may arise with EnrichR server.
If it happens, please, retry to run the function after a few minutes.
library(enrichR)
dfList <- list(GEO = df.GSE58135,
TCGA = df.TCGA)
# Enrichment analysis
Enrich <- Enrichment(dfList,
enrich.databases = c("GO_Biological_Process_2021",
"KEGG_2021_Human"),
p_adj = 0.05, logFC = 1)
head(Enrich$GEO$fdr_up$GO_Biological_Process_2021)
SurfR implements several visualization methods to help interpret enrichment results obtained through EnrichR using ggplot2, with the built-in function Enrichment_barplot.
It depicts gene count ratio and enrichment scores (- Log10 adjusted p values) as bar height and color. Users can specify the number of terms (most significant) to display.
library(ggplot2)
# barplot of the top 5 upregulated pathways
Enrichment_barplot(Enrich$GEO,
enrich.databases <- c("GO_Biological_Process_2021",
"KEGG_2021_Human"),
p_adj = 0.05,
num_term = 5,
cond = "UP")
# barplot of the top 5 downregulated pathways
Enrichment_barplot(Enrich$GEO,
enrich.databases <- c("GO_Biological_Process_2021",
"KEGG_2021_Human"),
p_adj = 0.05,
num_term = 5,
cond = "DOWN")
Moreover, we can annotate our list of genes with cross-database identifiers and descriptions (Entrezid, Uniprot, KEGG, etc.), taking advantage of one of the 35 gene-set libraries present in the Enrichr database, using the SurfR built-in function Annotate_SPID.
annotated <- Annotate_SPID(df.GSE58135, "WikiPathway_2021_Human")
head(annotated, 10)
Utilizing the SurfR function Splot you can create barplots to visualize the annotation classes reported in the dataframe produced by the Gene2SProtein function. The default grouping is the Membranome Almen classification.
# upregulated genes in GEO dataset
fdrUP_GeneID <- df.GSE58135[df.GSE58135$padj < 0.05 & df.GSE58135$log2FoldChange > 0, "GeneID"]
# corresponding Surface Proteins
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
#> 392 out of 5167 genes have a matching surface protein
# Barplot of Almen classification
Splot(SPup,
group.by = "Membranome.Almen.main-class",
main = "Almen class")
#> Warning in Splot(SPup, group.by = "Membranome.Almen.main-class", main = "Almen
#> class"): NA value in your dataframe
You can compare the list of resulting surface proteins from up to 7 different studies, using a venn diagram, with the built-in SurfR function SVenn.
S_list <- list(SP_1 = c("EPCAM", "CD24", "DLK1", "CDCP1", "LYVE1"),
SP_2 = c("DLK1", "EPCAM", "EGFR", "UPK1A", "UPK2"))
SVenn(S_list,
cols.use = c("green", "blue"),
opacity = 0.5,
output_intersectionFile = FALSE)