1 Abstract

Emerging infectious diseases, including zoonoses, pose a significant threat to public health and the global economy, as exemplified by the COVID-19 pandemic caused by the zoonotic severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Understanding the protein-protein interactions (PPIs) between host and viral proteins is crucial for identifying targets for antiviral therapies and for comprehending the mechanisms underlying pathogen replication and immune evasion. Experimental techniques such as yeast two-hybrid screening and affinity purification-mass spectrometry have provided valuable insights into host-virus interactomes, but they are limited by experimental noise and cost, resulting in incomplete interaction maps. Computational models based on machine learning have been developed to predict host-virus PPIs from sequence-derived features. Although successful, these models often overlook the semantic information embedded in protein sequences and depend on the choice of an effective encoding scheme. Here, we introduce DeProViR, a deep learning (DL) framework that predicts interactions between viruses and human hosts using only primary amino acid sequences. DeProViR employs a Siamese-like neural network architecture, combining convolutional and bidirectional long short-term memory (Bi-LSTM) networks to capture local and global contextual information. It represents amino acid sequences with GloVe embeddings, allowing semantic associations between residues to be integrated. The framework thereby addresses limitations of existing models, such as the need for feature engineering and the dependence on the encoding scheme. DeProViR offers an accurate and efficient approach to predicting host-virus interactions and can contribute to the development of antiviral therapies and the understanding of infectious diseases.

2 Proposed Framework

The DeProViR framework is composed of a two-step automated computational workflow: (1) learning sequence representations of host and viral proteins and (2) inferring host-virus PPIs through a hybrid deep learning architecture. More specifically, in the first step, host and virus protein sequences are separately encoded into sequences of tokens via a tokenizer and padded with a pad token to a common length of 1000. A 100-dimensional embedding matrix E is then generated by applying the unsupervised GloVe embedding model to the host or viral profile representation, learning an implicit, low-dimensional vector space from the corpus of tokens. Next, the embedding layer is fed with sequences of integers, i.e., amino acid token indexes, which are mapped to the corresponding pre-trained vectors in the GloVe embedding matrix E, turning the tokens into a dense real-valued 3D matrix M. In the second step, DeProViR uses a Siamese-like neural network architecture composed of two identical sub-networks sharing the same configuration and weights. Each sub-network combines convolutional and recurrent (Bi-LSTM) layers to accurately capture the local and global contextual relatedness of amino acids. The encoding step is illustrated by the sketch below.
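
To make this step concrete, the following minimal R sketch (ours, not the package's internal code; the toy sequence is arbitrary) tokenizes a protein at the amino acid level with keras and pads it to length 1000:

library(keras)
# character-level tokenization: one token per amino acid residue
aa_seq <- "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQA"  # arbitrary toy sequence
tok <- text_tokenizer(char_level = TRUE)
tok <- fit_text_tokenizer(tok, aa_seq)
idx <- texts_to_sequences(tok, aa_seq)         # residues -> integer indexes
padded <- pad_sequences(idx, maxlen = 1000, padding = "post")
dim(padded)                                    # 1 x 1000 integer matrix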

To achieve the best-performing DL architecture, we fine-tuned the hyper-parameters of each block on the validation set by random search, using auROC as the performance metric. The number of epochs was determined through an early-stopping strategy on the validation set, with a patience threshold of 3. The optimized DL architecture achieved an auROC of 0.96 under 5-fold cross-validation and 0.90 on the test set. This architecture includes 32 filters (1-D kernel of size 16) in the first CNN layer, which generates a feature map from the input layer (i.e., the embedding matrix M) through a convolution operation followed by a non-linear ReLU activation. The hidden features produced by the first convolutional layer are transformed in the same way by a second CNN layer with 64 filters (1-D kernel of size 7). After the convolutional layers, a k-max pooling layer performs max pooling with k set to 30. The flattened pooling output is then fed into a bidirectional LSTM with 64 hidden units, which connects to a fully connected dense layer of 8 neurons and finally to the output layer, whose sigmoid activation yields the predicted probability score. A hedged sketch of this architecture follows.
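
For orientation, the following R/keras sketch assembles the architecture described above under our own assumptions; it is not the package's internal model builder. In particular, standard max pooling stands in for k-max pooling, the vocabulary size of 21 (20 amino acids plus a pad token) and the ReLU activation of the 8-unit dense layer are our choices, and loading the GloVe weights into the embedding layer is omitted:

library(keras)
build_tower <- function() {
  # shared sub-network: embedding -> two CNN layers -> pooling -> Bi-LSTM
  keras_model_sequential() %>%
    layer_embedding(input_dim = 21, output_dim = 100,
                    input_length = 1000) %>%  # GloVe weights would be set here
    layer_conv_1d(filters = 32, kernel_size = 16, activation = "relu") %>%
    layer_conv_1d(filters = 64, kernel_size = 7, activation = "relu") %>%
    layer_max_pooling_1d(pool_size = 30) %>%  # stand-in for k-max pooling
    bidirectional(layer_lstm(units = 64))
}
tower    <- build_tower()                     # one set of weights, used twice
in_host  <- layer_input(shape = c(1000))
in_virus <- layer_input(shape = c(1000))
out <- layer_concatenate(list(tower(in_host), tower(in_virus))) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
model <- keras_model(list(in_host, in_virus), out)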

3 Vignette Overview

The modular structure of this package gives users the flexibility either to train on their own training set or to load the fine-tuned pre-trained model constructed previously (see the previous section). This dual capability lets researchers tailor model development to their specific needs and preferences.

In the first approach, users train a model on their own data, tailored to their specific needs, and subsequently apply it to make predictions on uncharted interactions. This capability is particularly valuable for diverse tasks such as predicting interactions between hosts and bacterial pathogens, drug-target interactions, or general protein-protein interactions.

Alternatively, the second approach streamlines the process by letting users leverage a fine-tuned pre-trained model. This model has been trained on a comprehensive dataset, as detailed in the accompanying paper, achieving an auROC > 0.90 both in cross-validation and on an external test set. In this scenario, users simply load the pre-trained model and initiate predictions without any additional training. Because the time-consuming training phase is bypassed, users can swiftly obtain predictions and insights, making this a time-efficient option.

It is important to note that for the second approach, a random search strategy was employed to tune all hyperparameters of the pre-trained model, ensuring the best-performing model for the given training set. However, if you intend to alter the training input, we strongly recommend that you carefully re-tune the hyperparameters using tfruns to achieve optimal results, for example as sketched below.
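
A minimal sketch of such a search follows; the script name train.R and the flag values are hypothetical placeholders for a user-supplied training script that reads them via tfruns::flags():

library(tfruns)
# randomly sample half of the grid over two hyper-parameters
runs <- tuning_run("train.R",                 # hypothetical user script
                   sample = 0.5,
                   flags = list(
                     filters_layer1CNN = c(16, 32, 64),
                     layer_lstm        = c(32, 64, 128)))
# inspect the completed runs; metric columns depend on the compiled metrics
head(runs)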

4 First Approach

The modelTraining function included in this package allows users to supply their own training dataset. It begins by converting protein sequences into amino acid tokens, mapping each token to a positive integer. Next, it represents each amino acid token using pre-trained co-occurrence embedding vectors obtained from GloVe. An embedding layer then converts each sequence of amino acid token indices into dense vectors based on the GloVe token vectors. Finally, it trains a Siamese-like neural network using a k-fold cross-validation strategy. Please ensure that a newly imported training set adheres to the format of the sample training set stored in the inst/extdata/training_Set directory of the DeProViR package; the snippet below shows one way to inspect it.
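
To verify that format, you can list and peek at the bundled sample; reading with data.table's fread is an assumption here, mirroring its use later in this vignette:

library(data.table)
# locate the bundled sample training set and inspect its layout
train_dir <- system.file("extdata", "training_Set", package = "DeProViR")
list.files(train_dir)
str(fread(file.path(train_dir, list.files(train_dir)[1])))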

The arguments of modelTraining mirror the architecture and training settings described above (filter counts, kernel sizes, pooling size, LSTM units, and so on). To run modelTraining, we can use the following commands:

options(timeout=240)
library(tensorflow)
library(data.table)
library(DeProViR)

tensorflow::set_random_seed(101)
model_training <- modelTraining(
    url_path = "https://nlp.stanford.edu/data",
    training_dir = system.file("extdata", "training_Set",
                               package = "DeProViR"),
    input_dim = 20,
    output_dim = 100,
    filters_layer1CNN = 32,
    kernel_size_layer1CNN = 16,
    filters_layer2CNN = 64,
    kernel_size_layer2CNN = 7,
    pool_size = 30,
    layer_lstm = 64,
    units = 8,
    metrics = "AUC",
    cv_fold = 2,
    epochs = 5,  # kept small for the sake of this example
    batch_size = 128,
    plots = FALSE,
    tpath = tempdir(),
    save_model_weights = FALSE,
    filepath = tempdir())
## .Epoch 1/5
## 2/2 - 7s - loss: 0.7638 - auc: 0.5040 - 7s/epoch - 4s/step
## Epoch 2/5
## 2/2 - 1s - loss: 0.5202 - auc: 0.4928 - 800ms/epoch - 400ms/step
## Epoch 3/5
## 2/2 - 0s - loss: 0.3505 - auc: 0.5016 - 283ms/epoch - 142ms/step
## Epoch 4/5
## 2/2 - 0s - loss: 0.2659 - auc: 0.5567 - 257ms/epoch - 128ms/step
## Epoch 5/5
## 2/2 - 0s - loss: 0.2532 - auc: 0.5473 - 254ms/epoch - 127ms/step
## 8/8 - 2s - 2s/epoch - 198ms/step
## .Epoch 1/5
## 2/2 - 0s - loss: 0.3229 - auc: 0.5350 - 277ms/epoch - 139ms/step
## Epoch 2/5
## 2/2 - 0s - loss: 0.3245 - auc: 0.5466 - 305ms/epoch - 152ms/step
## Epoch 3/5
## 2/2 - 0s - loss: 0.3135 - auc: 0.5637 - 289ms/epoch - 145ms/step
## Epoch 4/5
## 2/2 - 0s - loss: 0.2951 - auc: 0.6422 - 253ms/epoch - 127ms/step
## Epoch 5/5
## 2/2 - 0s - loss: 0.2917 - auc: 0.5937 - 280ms/epoch - 140ms/step
## 8/8 - 0s - 124ms/epoch - 15ms/step

When the plots argument is set to TRUE, the modelTraining function generates a single PDF file containing three figures that summarize the performance of the DL model under k-fold cross-validation.

5 Second Approach

In this context, users can employ the loadPreTrainedModel function to load the fine-tuned pre-trained model for predictive purposes.

options(timeout=240)
library(tensorflow)
library(data.table)
library(DeProViR)
pre_trainedmodel <- loadPreTrainedModel()

6 Viral-Host Interactions Prediction

Trained models can subsequently be leveraged to generate predictions on unlabeled data, i.e., interactions that have yet to be identified, by executing the following commands:

# load the demo test set (unknown interactions)
testing_set <- fread(
    system.file("extdata", "test_Set", "test_set_unknownInteraction.csv",
                package = "DeProViR"))
scoredPPIs <- predInteractions(
    url_path = "https://nlp.stanford.edu/data",
    testing_set,
    trainedModel = pre_trainedmodel)
## GLoVe importing is done ....
## Viral Embedding is done ....
## Host Embedding is done ....
## 2/2 - 1s - 1s/epoch - 738ms/step
scoredPPIs
##            [,1]
##  [1,] 0.4949012
##  [2,] 0.4721132
##  [3,] 0.4730626
##  [4,] 0.4998964
##  [5,] 0.4996737
##  [6,] 0.4995671
##  [7,] 0.4982419
##  [8,] 0.4990018
##  [9,] 0.4995191
## [10,] 0.4972966
## [11,] 0.4978293
## [12,] 0.4883155
## [13,] 0.4718277
## [14,] 0.4833253
## [15,] 0.4825312
## [16,] 0.4897426
## [17,] 0.4875355
## [18,] 0.4681536
## [19,] 0.4722295
## [20,] 0.4624399
## [21,] 0.4911220
## [22,] 0.4806543
## [23,] 0.4981351
## [24,] 0.5000000
## [25,] 0.4948011
## [26,] 0.5000000
## [27,] 0.5000000
## [28,] 0.4473523
## [29,] 0.4981150
## [30,] 0.4997106
## [31,] 0.4740297
## [32,] 0.4879003
## [33,] 0.4422222
## [34,] 0.4996724
## [35,] 0.4658140
## [36,] 0.4712954
## [37,] 0.5000000
## [38,] 0.4993962
## [39,] 0.4794826
## [40,] 0.4722496
## [41,] 0.4624018
## [42,] 0.5000000
## [43,] 0.4839961
## [44,] 0.4887513
## [45,] 0.4747218
## [46,] 0.4933570
## [47,] 0.4991108
## [48,] 0.4683652
## [49,] 0.4513777
# or using the newly trained model 
predInteractions(url_path = "https://nlp.stanford.edu/data",
                 testing_set,
                 trainedModel = model_training)
## 2/2 - 0s - 53ms/epoch - 27ms/step
##             [,1]
##  [1,] 0.06847472
##  [2,] 0.06261349
##  [3,] 0.14807273
##  [4,] 0.11379582
##  [5,] 0.11342905
##  [6,] 0.11324910
##  [7,] 0.07552297
##  [8,] 0.10592774
##  [9,] 0.18093717
## [10,] 0.07737362
## [11,] 0.07673151
## [12,] 0.10696561
## [13,] 0.05968594
## [14,] 0.14501905
## [15,] 0.08853142
## [16,] 0.06537560
## [17,] 0.05571234
## [18,] 0.04516679
## [19,] 0.07321151
## [20,] 0.12023668
## [21,] 0.14446163
## [22,] 0.10691193
## [23,] 0.14698668
## [24,] 0.11394577
## [25,] 0.10463829
## [26,] 0.11332257
## [27,] 0.14500308
## [28,] 0.05191290
## [29,] 0.10444717
## [30,] 0.10586728
## [31,] 0.14742784
## [32,] 0.06016451
## [33,] 0.16597618
## [34,] 0.09705145
## [35,] 0.12315301
## [36,] 0.09256612
## [37,] 0.13355102
## [38,] 0.08848444
## [39,] 0.09008365
## [40,] 0.04317314
## [41,] 0.04434558
## [42,] 0.13375083
## [43,] 0.06927688
## [44,] 0.14544113
## [45,] 0.08168349
## [46,] 0.10931861
## [47,] 0.06693837
## [48,] 0.06493673
## [49,] 0.05134948
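
The returned object is a one-column matrix of probability scores, row-aligned with testing_set, so a natural follow-up is to rank the candidate pairs by score. A minimal sketch (the score column is ours, not part of the package output):

# attach predicted scores to the candidate pairs and rank them
testing_set[, score := as.numeric(scoredPPIs)]
head(testing_set[order(-score)])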

7 Session information

sessionInfo()
## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] DeProViR_1.1.0    keras_2.15.0      data.table_1.15.4 tensorflow_2.16.0
## [5] knitr_1.46        BiocStyle_2.33.0 
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.2            pROC_1.18.5          rlang_1.1.3         
##   [4] magrittr_2.0.3       compiler_4.4.0       RSQLite_2.3.6       
##   [7] png_0.1-8            vctrs_0.6.5          reshape2_1.4.4      
##  [10] stringr_1.5.1        crayon_1.5.2         pkgconfig_2.0.3     
##  [13] fastmap_1.1.1        dbplyr_2.5.0         PRROC_1.3.1         
##  [16] utf8_1.2.4           rmarkdown_2.26       prodlim_2023.08.28  
##  [19] tzdb_0.4.0           purrr_1.0.2          bit_4.0.5           
##  [22] xfun_0.43            cachem_1.0.8         jsonlite_1.8.8      
##  [25] recipes_1.0.10       blob_1.2.4           parallel_4.4.0      
##  [28] R6_2.5.1             bslib_0.7.0          stringi_1.8.3       
##  [31] reticulate_1.36.1    parallelly_1.37.1    rpart_4.1.23        
##  [34] lubridate_1.9.3      jquerylib_0.1.4      Rcpp_1.0.12         
##  [37] bookdown_0.39        iterators_1.0.14     future.apply_1.11.2 
##  [40] base64enc_0.1-3      readr_2.1.5          Matrix_1.7-0        
##  [43] splines_4.4.0        nnet_7.3-19          timechange_0.3.0    
##  [46] tidyselect_1.2.1     yaml_2.3.8           timeDate_4032.109   
##  [49] codetools_0.2-20     curl_5.2.1           listenv_0.9.1       
##  [52] lattice_0.22-6       tibble_3.2.1         plyr_1.8.9          
##  [55] withr_3.0.0          evaluate_0.23        archive_1.1.8       
##  [58] future_1.33.2        survival_3.6-4       BiocFileCache_2.13.0
##  [61] pillar_1.9.0         BiocManager_1.30.22  filelock_1.0.3      
##  [64] whisker_0.4.1        foreach_1.5.2        stats4_4.4.0        
##  [67] generics_0.1.3       vroom_1.6.5          hms_1.1.3           
##  [70] ggplot2_3.5.1        munsell_0.5.1        scales_1.3.0        
##  [73] globals_0.16.3       class_7.3-22         glue_1.7.0          
##  [76] tools_4.4.0          ModelMetrics_1.2.2.2 gower_1.0.1         
##  [79] fmsb_0.7.6           grid_4.4.0           ipred_0.9-14        
##  [82] colorspace_2.1-0     nlme_3.1-164         cli_3.6.2           
##  [85] tfruns_1.5.3         fansi_1.0.6          lava_1.8.0          
##  [88] dplyr_1.1.4          gtable_0.3.5         zeallot_0.1.0       
##  [91] sass_0.4.9           digest_0.6.35        caret_6.0-94        
##  [94] memoise_2.0.1        htmltools_0.5.8.1    lifecycle_1.0.4     
##  [97] hardhat_1.3.1        httr_1.4.7           bit64_4.0.5         
## [100] MASS_7.3-60.2