spongeEffects.Rmd

Hoffmann M - Boniolo F

2/10/2022

In recent years, competing endogenous RNA (ceRNA) networks have established themselves as promising tools to identify biomarkers and inspired the formulation of new hypotheses regarding the post-transcriptional regulatory role of microRNAs (miRNAs). While different studies have proven the potential of these regulatory networks, current methods only allow the study of their global characteristics. Here, we introduce spongEffects, a novel method that infers subnetworks from pre-calculated ceRNA networks and calculates patient/sample-specific scores related to their regulatory activity. Notably, these module scores can be inferred from gene expression data alone and can thus be applied to cohorts where miRNA expression information is lacking. In this vignette, we show how spongEffects can be used for ceRNA module identification and enrichment. We do so by showcasing its use in a reduced breast cancer dataset (TCGA-BRCA). Moreover, we illustrate how the identified modules can be exploited for further downstream subtype classification tasks and identification of biologically meaningfull ceRNA modules.

spongEffects has been developed as an add-on to the SPONGE package (List et al. 2019, https://www.bioconductor.org/packages/release/bioc/html/SPONGE.html). SPONGE allows the inference of robust ceRNA networks from gene and miRNA expression data.

Further details demonstrating the implementation and use of spongEffects for stratification and biomarker identification are available in the related manuscript (currently available as preprint at: https://www.biorxiv.org/content/10.1101/2022.03.29.486212v1).

Loading dependencies for spongEffects

We start with loading the package and its dependencies. spongEffects core functions allow for the registration of a parallel backend (if desired).

library(SPONGE)
library(doParallel)
library(foreach)
library(dplyr)

# Register your backend here
num.of.cores <- 4
cl <- makeCluster(num.of.cores) 
registerDoParallel(cl)

Formats of the inputs necessary for spongEffect

spongEffects comes with a very small example datasetd useful for illustrating the functionalities of the package. The data were originally part of the TCGA-BRCA cohort. We downsized it to reduce the computational time required by the vignette.

We provide two gene and miRNA expression datasets together with related metadata, to simulate an optimal scenario where a train set and a test set are available. We also provide a small ceRNA network, gene-miRNA candidates, ceRNA network centrality measures (the ceRNA network and the centrality measures were created using the SPONGE vignette). While these are not required to run spongEffects (only gene expression and a pre-computed ceRNA network are), we want to showcase how to conduct a full downstream analysis of the ceRNA modules.

spongEffects requires a gene x sample expression matrix as input, with gene names as rownames. The type of gene identifier is up to the user but must be consistent between gene expression and ceRNA network. In this vignette we use Ensembl gene IDs.

The example datasets can be accessed once the SPONGE package is loaded:

Gene expression train set:

	TCGA-A1-A0SK	TCGA-A1-A0SN	TCGA-A2-A0CK	TCGA-A2-A0CM	TCGA-A2-A0EO	TCGA-A2-A0EP	TCGA-A2-A0EV	TCGA-A2-A0SY
ENSG00000000005	-6.4423792	-2.0500792	5.1654208	-6.4423792	1.7425208	5.6808208	3.5814208	3.8919208
ENSG00000000419	0.6104042	1.2722042	-0.6313958	0.7686042	-0.3658958	-0.5337958	0.0015042	-0.6387958
ENSG00000000457	-1.0013698	0.2439302	-0.3476698	-0.5785698	0.1670302	0.7566302	-0.0322698	-0.3200698
ENSG00000001460	0.3662063	-0.5251937	-0.0523937	0.9402063	0.1126063	-0.9400937	0.4092063	-0.2140937
ENSG00000001497	1.1365396	-0.6962604	-0.0888604	0.6670396	-0.1332604	-0.5305604	0.1172396	-0.3317604

Gene expression tes set:

	TCGA-4H-AAAK	TCGA-A2-A0EM	TCGA-A2-A0SW	TCGA-A2-A0T0	TCGA-A2-A0T1	TCGA-A2-A0YK	TCGA-A2-A1FZ	TCGA-A2-A25F
ENSG00000000005	-1.3183	-2.4659	0.0580	-5.5735	-1.5105	0.0580	3.1653	2.5363
ENSG00000000419	5.4256	5.1700	5.6268	5.9170	5.3413	5.4006	5.5242	5.3016
ENSG00000000457	3.2602	4.0705	3.0252	3.0002	3.3003	4.1498	4.4700	2.5972
ENSG00000001460	2.6371	2.9013	1.8643	1.7995	2.4934	2.7379	3.5596	1.6558
ENSG00000001497	5.2906	5.0631	4.2669	5.4706	4.9084	4.4588	4.9533	4.9924

miRNA expression train set:

knitr::kable(train_cancer_mir_expr[1:5,1:8])

	TCGA-A1-A0SK	TCGA-A1-A0SN	TCGA-A2-A0CK	TCGA-A2-A0CM	TCGA-A2-A0EO	TCGA-A2-A0EP	TCGA-A2-A0EV	TCGA-A2-A0SY
MIMAT0000062	-0.9498974	-0.9260029	0.7448973	-0.6404741	-0.0926522	0.2434683	-0.2048338	-0.1754074
MIMAT0000063	0.0924827	-0.8716071	1.0787497	-0.9931536	0.5592782	0.2478954	-0.1611533	1.3213929
MIMAT0000064	-1.1953576	-0.4072241	0.8010479	-0.6455455	0.8571077	1.0259656	0.2368933	0.3306993
MIMAT0000065	1.6249731	-0.6466453	-0.3711567	1.5371999	0.0170191	-0.2394728	-0.2496916	-1.1082111
MIMAT0000066	0.0112750	-1.9443273	-0.8010375	0.2618751	0.0433049	-0.0584883	-0.0338679	-0.8529546

miRNA expression test set:

knitr::kable(test_cancer_mir_expr[1:5,1:8])

	TCGA-4H-AAAK	TCGA-A2-A0EM	TCGA-A2-A0SW	TCGA-A2-A0T0	TCGA-A2-A0T1	TCGA-A2-A0YK	TCGA-A2-A1FZ	TCGA-A2-A25F
MIMAT0000062	15.413594	15.089313	13.463049	14.208041	14.158120	14.547002	14.740135	14.263688
MIMAT0000063	14.436443	15.224913	13.411917	14.858243	14.656125	14.900911	15.101607	15.460251
MIMAT0000064	11.683468	11.384820	9.037705	9.583915	11.285468	11.883935	11.727708	11.827215
MIMAT0000065	7.293012	8.175797	7.560531	8.262925	7.747067	7.019939	7.922954	7.505337
MIMAT0000066	10.727595	10.369630	9.575009	9.448063	9.866870	8.955930	10.011460	9.916859

ceRNA network We use a ceRNA network computed with SPONGE (List et al. 2019) for the TCGA-BRCA dataset. The ceRNA network was downsized for this vignette.

ceRNA networks and related descriptive statistics computed for 22 TCGA datasets can be downloaded from SPONGEdb (Hoffmann et al, 2021): https://exbio.wzw.tum.de/sponge/home

knitr::kable(train_ceRNA_interactions[1:5,1:8])

geneA	geneB	df	cor	pcor	mscor	p.val	p.adj
ENSG00000000005	ENSG00000028528	1	0.2341518	0.2032828	0.0308691	0.19	0.4810039
ENSG00000000005	ENSG00000059758	1	0.2479621	0.1969036	0.0510585	0.10	0.4575007
ENSG00000000005	ENSG00000080493	1	0.2184322	0.2182531	0.0001792	0.54	0.6252361
ENSG00000000005	ENSG00000092096	1	0.1792994	0.1426007	0.0366987	0.16	0.4575007
ENSG00000000005	ENSG00000107560	1	0.1762069	0.1765033	-0.0002963	0.55	0.6314070

ceRNA networks downloaded from spongeDB come with centrality measures and other information specific to the downloaded network. Centrality measures are going to be used to define sponge modules.

knitr::kable(train_network_centralities[1:5,1:5])

gene	degree	eigenvector	betweenness	page_rank
ENSG00000005073	2	0.0027093	235.0000	0.0031156
ENSG00000008853	1	0.0000000	0.0000	0.0038610
ENSG00000017427	3	0.0329487	196.5353	0.0030298
ENSG00000030419	2	0.0940164	0.0000	0.0021653
ENSG00000032219	10	0.6420192	1558.6639	0.0076750

Once calculated, spongEffects scores can be used for further downstream machine learning tasks. In this vignette, we will show how to use the to classifiy samples to different breast cancer subtypes (LuminalA, LuminalB, HER2+, Normal-like, and Basal). The user can use any metadata available for their own data.

Metadata train set

knitr::kable(train_cancer_metadata[1:5,1:8])

	sampleID	SUBTYPE	CANCER_TYPE_ACRONYM	OTHER_PATIENT_ID	AGE	SEX	AJCC_PATHOLOGIC_TUMOR_STAGE	AJCC_STAGING_EDITION
108	TCGA-A2-A4S3	BRCA_LumB	BRCA	418C11E8-2670-48D5-BBF5-95B46BFF1201	59	Female	STAGE IIB	7TH
125	TCGA-A7-A26I	BRCA_Basal	BRCA	b2ecbc0f-2c30-4200-8d5e-7b95424bcadb	65	Female	STAGE IIA	7TH
134	TCGA-A7-A5ZW	BRCA_LumA	BRCA	523E24A2-51B9-4658-BE2F-42E5FCCEBB17	47	Female	STAGE IIA	7TH
135	TCGA-A7-A5ZX	BRCA_LumA	BRCA	BA80DB4E-D899-4DA4-AE49-5263D98E1530	48	Female	STAGE IIIC	7TH
136	TCGA-A7-A6VV	BRCA_Basal	BRCA	57AF5C72-0D60-4A6B-B1B4-EC6DAB90F80F	51	Female	STAGE IIA	7TH

Metadata test set

knitr::kable(test_cancer_metadata[1:5,1:8])

	sampleID	SUBTYPE	CANCER_TYPE_ACRONYM	OTHER_PATIENT_ID	AGE	SEX	AJCC_PATHOLOGIC_TUMOR_STAGE	AJCC_STAGING_EDITION
128	TCGA-A7-A3RF	BRCA_LumA	BRCA	7451AF3C-ACC0-4D79-8429-6B8BE96911D8	79	Female	STAGE IIA	7TH
143	TCGA-A8-A06Q	BRCA_LumB	BRCA	281f70c5-876f-44b3-84fb-f2302f85e74c	63	Female	STAGE IIIA	5TH
161	TCGA-A8-A07W	BRCA_LumB	BRCA	01263518-5f7c-49dc-8d7e-84b0c03a6a63	76	Female	STAGE IV	6TH
167	TCGA-A8-A08A	BRCA_LumA	BRCA	7fe4670d-4626-459d-869b-ab15f478c6a7	89	Female	STAGE I	5TH
202	TCGA-A8-A0A9	BRCA_LumA	BRCA	7b605fb6-d077-401e-b870-b87ddf505f82	80	Female	STAGE IIA	6TH

Filter ceRNA network and add weighted centrality measures information Computationally inferred ceRNA networks often identify spurious associations between RNAs. Therefore, it is good practice to filter out ceRNA networks for weakly associated edges. More at List et al. 2019 and related vignette. In this example, we filter the ceRNA network for size effects (i.e. mscor) and statistical significance. Next, we want to identify the most important nodes in the network as seed genes for ceRNA modules. Various network centrality measures have been proposed in the literature to assess node importance. In the spongEffects paper and in this example, we used a weighted centrality, i.e. we consider the sum of weights of edges attached to a node rather than just the node degree. Node centralities can be computed via the sponge_node_centralities() function. The thresholds used here are purely indicative and are likely to be modified when real data are used. If you want to use the centralities of the unfiltered network, you can add your centralities to the Node_Centrality parameter of the filter_ceRNA_network() function. The centralities will be filtered and returned accordingly. If you want the centralities newly caculated based on the filtered network, you need to use the sponge_node_centralities() function on the filtered ceRNA network returned from the filter_ceRNA_network() function.

filtered_network_centralities=filter_ceRNA_network(sponge_effects = train_ceRNA_interactions, Node_Centrality = train_network_centralities,add_weighted_centrality=T, mscor.threshold = 0.01, padj.threshold = 0.1)

Discover modules Different classes of RNAs can be investigated as central nodes for the SPONGE modules. The user can identify their own class of interest, by modifying the filtering function or by using their own gene names. The data.frame loaded with the package contains gene ensemble IDs for protein coding, long non-coding, and circular RNAs as downloaded from biomaRt (version 2.50.3, see https://www.bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html). Finally, it is possible to determine the type of degree centrality used to rank the importance of potential central nodes. Possibilities are: degree, eigenvector, betweenness, page_rank, or Weighted_Degree

Different number of central nodes can be selected via the cutoff parameter.

RNAs <- c("lncRNA","protein_coding")
RNAs.ofInterest <- ensembl.df %>% dplyr::filter(gene_biotype %in% RNAs) %>%
  dplyr::select(ensembl_gene_id)

central_gene_modules<-get_central_modules(central_nodes = RNAs.ofInterest$ensembl_gene_id,node_centrality = filtered_network_centralities$Node_Centrality,ceRNA_class = RNAs, centrality_measure = "Weighted_Degree", cutoff = 10)

Define SPONGE modules We define SPONGE modules based on a first-neighbout approach, taking into account all classes of potential RNAs around the central nodes identified above. We give the possibility to the user to consider the central node as part of the module or not (remove.central).

Sponge.modules <- define_modules(network = filtered_network_centralities$Sponge.filtered, central.modules = central_gene_modules, remove.central = F, set.parallel = F)
# Module size distribution
Size.modules <- sapply(Sponge.modules, length)

Calculate Enrichment Scores (i.e., spongEffects scores) we implemented three different single sample gene set enrichment approaches to score the modules per sample, Overall Enrichment (OE), single-sample Gene Set Enrichment Analysis (ssGSEA), and Gene Set Variation Analysis (GSVA). The choice of the optimal method may vary by the data set but when we compared the three algorithms in the spongEffects paper we did not observe large differences.

Here, we calculate the spongEffects scores both for the train and test dataset. In a real word scenarios, these are expected to be generated separately. SPONGE modules inferred on one dataset are expected to generalize well to new cohorts, as described in the original publication.

For this example, we run the OE algorithm

train.modules <- enrichment_modules(Expr.matrix = train_cancer_gene_expr, modules = Sponge.modules, bin.size = 10, min.size = 1, max.size = 2000, min.expr = 1, method = "OE", cores=1)

## [1] "Calculating modules with bin size: 10, min size: 1, max size:2000"

test.modules <-  enrichment_modules(Expr.matrix = test_cancer_gene_expr, modules = Sponge.modules, bin.size = 10, min.size = 1, max.size = 2000, min.expr = 1, method = "OE", cores=1)

## [1] "Calculating modules with bin size: 10, min size: 1, max size:2000"

Machine learning spongEffects scores have a tabular format (module x patient) and can thus be used for downstream analysis tasks as, for example, classification or regression. As a typical application case, we show how to calibrate a subtype classification model. In particular, we train a random forest via k-fold repeated cross validation. The resulting object contains the trained model and a confusion matrix evaluated on the train set.

N.B. In order to use a validate a model calibrated with the caret package (as done here), it is necessary that the test set contains the same input features (i.e., SPONGE modules here) as the train set.

# We find modules that were identified both in the train and test and use those as input features for the model
common_modules = intersect(rownames(train.modules), rownames(test.modules))
train.modules = train.modules[common_modules, ]
test.modules = test.modules[common_modules, ]

trained.model = calibrate_model(Input = train.modules, modules_metadata = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Metric = "Exact_match", n_folds = 2, repetitions = 1)

trained.model[["ConfusionMatrix_training"]]

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    BRCA_Basal BRCA_Her2 BRCA_LumA BRCA_LumB BRCA_Normal
##   BRCA_Basal           7         0         4         5           1
##   BRCA_Her2            0         1         5         3           0
##   BRCA_LumA            5         6        31        13           2
##   BRCA_LumB            4         3         5         1           0
##   BRCA_Normal          0         0         0         0           0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4167          
##                  95% CI : (0.3168, 0.5218)
##     No Information Rate : 0.4688          
##     P-Value [Acc > NIR] : 0.8699          
##                                           
##                   Kappa : 0.1044          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: BRCA_Basal Class: BRCA_Her2 Class: BRCA_LumA
## Sensitivity                    0.43750          0.10000           0.6889
## Specificity                    0.87500          0.90698           0.4902
## Pos Pred Value                 0.41176          0.11111           0.5439
## Neg Pred Value                 0.88608          0.89655           0.6410
## Prevalence                     0.16667          0.10417           0.4688
## Detection Rate                 0.07292          0.01042           0.3229
## Detection Prevalence           0.17708          0.09375           0.5938
## Balanced Accuracy              0.65625          0.50349           0.5895
##                      Class: BRCA_LumB Class: BRCA_Normal
## Sensitivity                   0.04545            0.00000
## Specificity                   0.83784            1.00000
## Pos Pred Value                0.07692                NaN
## Neg Pred Value                0.74699            0.96875
## Prevalence                    0.22917            0.03125
## Detection Rate                0.01042            0.00000
## Detection Prevalence          0.13542            0.00000
## Balanced Accuracy             0.44165            0.50000

As in a typical scenario, the model performances can be evaluated on the test set to test for generalization performances of the model.

Input.test <- t(test.modules) %>% scale(center = T, scale = T)
Prediction.model <- predict(trained.model$Model, Input.test)

# We compute the confusion metrix on the test set
ConfusionMatrix_testing <- caret::confusionMatrix(as.factor(Prediction.model), as.factor(test_cancer_metadata$SUBTYPE))
trained.model$ConfusionMatrix_testing<-ConfusionMatrix_testing
ConfusionMatrix_testing

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    BRCA_Basal BRCA_Her2 BRCA_LumA BRCA_LumB BRCA_Normal
##   BRCA_Basal           0         0         6         1           0
##   BRCA_Her2            2         0         3         0           0
##   BRCA_LumA           14         4        29        11           2
##   BRCA_LumB            5         2        13         2           1
##   BRCA_Normal          0         0         0         0           0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3263          
##                  95% CI : (0.2336, 0.4302)
##     No Information Rate : 0.5368          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.1123         
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: BRCA_Basal Class: BRCA_Her2 Class: BRCA_LumA
## Sensitivity                    0.00000          0.00000           0.5686
## Specificity                    0.90541          0.94382           0.2955
## Pos Pred Value                 0.00000          0.00000           0.4833
## Neg Pred Value                 0.76136          0.93333           0.3714
## Prevalence                     0.22105          0.06316           0.5368
## Detection Rate                 0.00000          0.00000           0.3053
## Detection Prevalence           0.07368          0.05263           0.6316
## Balanced Accuracy              0.45270          0.47191           0.4320
##                      Class: BRCA_LumB Class: BRCA_Normal
## Sensitivity                   0.14286            0.00000
## Specificity                   0.74074            1.00000
## Pos Pred Value                0.08696                NaN
## Neg Pred Value                0.83333            0.96842
## Prevalence                    0.14737            0.03158
## Detection Rate                0.02105            0.00000
## Detection Prevalence          0.24211            0.00000
## Balanced Accuracy             0.44180            0.50000

1. Define random modules and calculate spongEffects scores In general, it is good idea to check the performance of a model calibrated on SPONGE modules against the one of a model calibrated on randomly defined modules. We offer a function that calculates random modules and related enrichment scores. The resulting modules have the same size distribution of the original ones.

# Define random modules
Random.modules <- Random_spongEffects(sponge_modules = Sponge.modules,
                                      gene_expr = train_cancer_gene_expr, min.size = 1,bin.size = 10, max.size = 200,
                                      min.expression=1, replace = F,method = "OE",cores = 1)

## [1] "Calculating modules with bin size: 10, min size: 1, max size:200"

# We can now use the randomly defined modules to calculate their enrichment in the test set
Random.modules.test <- enrichment_modules(Expr.matrix = test_cancer_gene_expr, modules = Random.modules$Random_Modules, bin.size = 10, min.size = 1, max.size = 2000, min.expr = 1, method = "OE", cores=1)

## [1] "Calculating modules with bin size: 10, min size: 1, max size:2000"

Train classification model on randomly defined modules

# We find random modules that were identified both in the train and test and use those as input features for the model
common_modules_random = intersect(rownames(Random.modules$Enrichment_Random_Modules), rownames(Random.modules.test))
Random.modules.train = Random.modules$Enrichment_Random_Modules[common_modules_random, ]
Random.modules.test = Random.modules.test[common_modules_random, ]

Random.model = calibrate_model(Input = Random.modules.train, modules_metadata = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Metric = "Exact_match", n_folds = 2, repetitions = 1)

Random.model[["ConfusionMatrix_training"]]

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    BRCA_Basal BRCA_Her2 BRCA_LumA BRCA_LumB BRCA_Normal
##   BRCA_Basal           5         3         1         5           0
##   BRCA_Her2            1         0         0         3           0
##   BRCA_LumA            6         4        37         6           3
##   BRCA_LumB            4         3         6         8           0
##   BRCA_Normal          0         0         1         0           0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5208          
##                  95% CI : (0.4164, 0.6239)
##     No Information Rate : 0.4688          
##     P-Value [Acc > NIR] : 0.1786          
##                                           
##                   Kappa : 0.2599          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: BRCA_Basal Class: BRCA_Her2 Class: BRCA_LumA
## Sensitivity                    0.31250          0.00000           0.8222
## Specificity                    0.88750          0.95349           0.6275
## Pos Pred Value                 0.35714          0.00000           0.6607
## Neg Pred Value                 0.86585          0.89130           0.8000
## Prevalence                     0.16667          0.10417           0.4688
## Detection Rate                 0.05208          0.00000           0.3854
## Detection Prevalence           0.14583          0.04167           0.5833
## Balanced Accuracy              0.60000          0.47674           0.7248
##                      Class: BRCA_LumB Class: BRCA_Normal
## Sensitivity                   0.36364            0.00000
## Specificity                   0.82432            0.98925
## Pos Pred Value                0.38095            0.00000
## Neg Pred Value                0.81333            0.96842
## Prevalence                    0.22917            0.03125
## Detection Rate                0.08333            0.00000
## Detection Prevalence          0.21875            0.01042
## Balanced Accuracy             0.59398            0.49462

Validate classification model of randomly defined modules on the test set

Input.test <- t(Random.modules.test) %>% scale(center = T, scale = T)
Input.test<-Input.test[ , apply(Input.test, 2, function(x) !any(is.na(x)))]

Prediction.model <- predict(Random.model$Model, Input.test)

# We compute the confusion metrix on the test set
ConfusionMatrix_testing_random <- caret::confusionMatrix(as.factor(Prediction.model), as.factor(test_cancer_metadata$SUBTYPE))
Random.model$ConfusionMatrix_testing_random<-ConfusionMatrix_testing_random
ConfusionMatrix_testing_random

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    BRCA_Basal BRCA_Her2 BRCA_LumA BRCA_LumB BRCA_Normal
##   BRCA_Basal           1         2        12         2           0
##   BRCA_Her2            1         0         0         0           0
##   BRCA_LumA           14         3        21         8           2
##   BRCA_LumB            5         1        18         4           1
##   BRCA_Normal          0         0         0         0           0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.2737          
##                  95% CI : (0.1872, 0.3748)
##     No Information Rate : 0.5368          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.1286         
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: BRCA_Basal Class: BRCA_Her2 Class: BRCA_LumA
## Sensitivity                    0.04762          0.00000           0.4118
## Specificity                    0.78378          0.98876           0.3864
## Pos Pred Value                 0.05882          0.00000           0.4375
## Neg Pred Value                 0.74359          0.93617           0.3617
## Prevalence                     0.22105          0.06316           0.5368
## Detection Rate                 0.01053          0.00000           0.2211
## Detection Prevalence           0.17895          0.01053           0.5053
## Balanced Accuracy              0.41570          0.49438           0.3991
##                      Class: BRCA_LumB Class: BRCA_Normal
## Sensitivity                   0.28571            0.00000
## Specificity                   0.69136            1.00000
## Pos Pred Value                0.13793                NaN
## Neg Pred Value                0.84848            0.96842
## Prevalence                    0.14737            0.03158
## Detection Rate                0.04211            0.00000
## Detection Prevalence          0.30526            0.00000
## Balanced Accuracy             0.48854            0.50000

Train model on central genes’expression Another way to evaluate the perfomance of the model trained on spongEffects scores is to compare its performances against a model trained on the central genes alone. The idea is to investigate if the module activity that reflects the contribution of miRNA regulation offers additional insights to the expression level of the central genes alone. Thus, as a baseline mode we use only the central genes part of the modules we used to calibrate the model in step G.

Once again, we need to verify that both train and test sets contain all the central genes of interest before model calibration.

Input.centralgenes.train <- train_cancer_gene_expr[rownames(train_cancer_gene_expr) %in% names(Sponge.modules), ]
Input.centralgenes.test <- test_cancer_gene_expr[rownames(test_cancer_gene_expr) %in% names(Sponge.modules), ]

common_modules = intersect(rownames(Input.centralgenes.train), rownames(Input.centralgenes.test))
Input.centralgenes.train = Input.centralgenes.train[common_modules, ]
Input.centralgenes.test = Input.centralgenes.test[common_modules, ]

# Calibrate model
CentralGenes.model = calibrate_model(Input = Input.centralgenes.train, modules_metadata = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Metric = "Exact_match", n_folds = 1, repetitions = 1)

# Validate on test set
Input.centralgenes.test <- t(Input.centralgenes.test) %>% scale(center = T, scale = T)
CentralGenes.prediction <- predict(CentralGenes.model$Model, Input.centralgenes.test)

# We compute the confusion metrix on the test set
ConfusionMatrix_testing <- caret::confusionMatrix(as.factor(CentralGenes.prediction), as.factor(test_cancer_metadata$SUBTYPE))
CentralGenes.model$ConfusionMatrix_testing<-ConfusionMatrix_testing
ConfusionMatrix_testing

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    BRCA_Basal BRCA_Her2 BRCA_LumA BRCA_LumB BRCA_Normal
##   BRCA_Basal           1         2        11         2           1
##   BRCA_Her2            0         0         1         0           0
##   BRCA_LumA           19         4        32        12           2
##   BRCA_LumB            1         0         7         0           0
##   BRCA_Normal          0         0         0         0           0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.3474         
##                  95% CI : (0.2526, 0.452)
##     No Information Rate : 0.5368         
##     P-Value [Acc > NIR] : 0.9999         
##                                          
##                   Kappa : -0.1707        
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: BRCA_Basal Class: BRCA_Her2 Class: BRCA_LumA
## Sensitivity                    0.04762          0.00000           0.6275
## Specificity                    0.78378          0.98876           0.1591
## Pos Pred Value                 0.05882          0.00000           0.4638
## Neg Pred Value                 0.74359          0.93617           0.2692
## Prevalence                     0.22105          0.06316           0.5368
## Detection Rate                 0.01053          0.00000           0.3368
## Detection Prevalence           0.17895          0.01053           0.7263
## Balanced Accuracy              0.41570          0.49438           0.3933
##                      Class: BRCA_LumB Class: BRCA_Normal
## Sensitivity                   0.00000            0.00000
## Specificity                   0.90123            1.00000
## Pos Pred Value                0.00000                NaN
## Neg Pred Value                0.83908            0.96842
## Prevalence                    0.14737            0.03158
## Detection Rate                0.00000            0.00000
## Detection Prevalence          0.08421            0.00000
## Balanced Accuracy             0.45062            0.50000

It is possible to compare the performances of the different models

plot_accuracy_sensitivity_specificity(trained_model=trained.model,central_genes_model=NA,
                                      random_model= Random.model,
                                      training_dataset_name="TCGA",testing_dataset_name="TCGA",
                                      subtypes=as.factor(test_cancer_metadata$SUBTYPE))

Interpretation of the results We offer here a few ways to visualize and interpret the SPONGE modules obtained via spongEffects. First, we visualize modules driving subtype prediction. These could be of interest for further validation, to identify biomarkers for the disease of interest.

lollipop_plot=plot_top_modules(trained_model=trained.model, k_modules_red = 2, k_modules = 4)
lollipop_plot

Second, we can visualize the distribution of the spongEffects scores in the different groups of interest. spongEffects scores should follow a normal distribution. Divergences from it may highlight the presence of subsets of samples with characteristics different than the ones of the class they belong to.

We show here the distribution of the spongEffects scores for the train set, dividev by the 5 breast cancer subtypes

density_plot_train=plot_density_scores(trained_model=trained.model,spongEffects = train.modules, meta_data = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID")
density_plot_train

Module driving prediction can also be visualized with an heatmap. At the moment, the user can choose one layer of annotation.

heatmap.train = plot_heatmaps(trained_model = trained.model,spongEffects = train.modules,
               meta_data = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Modules_to_Plot = 2,
              show.rownames = F, show.colnames = F)
heatmap.train

heatmap.test = plot_heatmaps(trained_model = trained.model,spongEffects = test.modules,
               meta_data = test_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Modules_to_Plot = 2,
              show.rownames = F, show.colnames = F)
heatmap.test

spongEffects can be interpreted as the effect of ceRNA-ceRNA regulation and miRNA-ceRNA regulation (see publication for more details). If miRNA data are available, it is interesting to identify miRNAs that are involved in the regulation of modules driving prediction. We offer a way to visualize these with an heatmap.

plot_involved_miRNAs_to_modules(sponge_modules=Sponge.modules,
                                trained_model=trained.model,
                                gene_mirna_candidates= train_genes_miRNA_candidates,
                                k_modules = 2,
                                filter_miRNAs = 0.0,
                                bioMart_gene_symbol_columns = "hgnc_symbol",
                                bioMart_gene_ensembl = "hsapiens_gene_ensembl")

## [1] "ENSG00000072364"
## [1] "ENSG00000198108"

stop the cluster

#stop your backend parallelisation if registered
stopCluster(cl)