1 Introduction

Johnson et al. (2023) published substrate affinities for 303 human serine/threonine-specific kinases in the form of position-specific weight matrices (PWMs). The JohnsonKinaseData package provides access to these PWMs, together with basic functionality to match user-provided phosphosites against all kinase PWMs. The aim is to give the user a simple way of predicting kinase-substrate relationships based on PWM-phosphosite matching. These predictions can serve to infer kinase activity from differential phospho-proteomic data.

2 Installation

The JohnsonKinaseData package can be installed using the following code:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ExperimentHub")
BiocManager::install("JohnsonKinaseData")

3 Loading kinase PWMs

The kinase PWMs can be accessed with the getKinasePWM() function. It returns a list with 303 human serine/threonine specific PWMs.

library(JohnsonKinaseData)
pwms <- getKinasePWM()
#> see ?JohnsonKinaseData and browseVignettes('JohnsonKinaseData') for documentation
#> loading from cache

head(names(pwms))
#> [1] "AAK1"   "ACVR2A" "ACVR2B" "AKT1"   "AKT2"   "AKT3"

Each PWM is a numeric matrix with amino acids as rows and positions as columns. Matrix elements are log2-odds scores measuring differential affinity relative to a random frequency of amino acids (Johnson et al. 2023).

pwms[["PLK2"]]
#>             -5           -4          -3         -2           -1           0
#> A -0.036821844 -0.277009455 -0.83856373 -0.4463446 -0.186229068          NA
#> C  0.009633819 -0.034899138 -0.24690897  0.4799548 -0.467333943          NA
#> D  0.549718451  0.795766948  0.82130204  1.6459783  1.329410671          NA
#> E  0.614756952  1.127897364  2.86862751  1.2354207  0.689388627          NA
#> F  0.449006639  0.078199920 -0.41273103 -0.9773836 -0.602963759          NA
#> G  0.326652391 -0.151522275 -0.77793738 -0.6106535 -0.767584829          NA
#> H  0.148478616 -0.172018427 -0.67807191 -0.3219281  0.214995135          NA
#> I -0.311864412 -0.172018427 -1.65154094 -0.8406292 -0.519941731          NA
#> K -0.469329925 -0.647467443 -1.77349147 -1.7345631 -0.656307931          NA
#> L -0.245197993  0.144568518 -0.71785677  0.3032255 -0.511690664          NA
#> M -0.248793390 -0.206894852 -0.38948891  0.3123167 -0.194955239          NA
#> N -0.065823218  0.002018361 -0.54077824  0.9076598  0.307545102          NA
#> P -0.066578437 -0.108114249 -1.05139915 -0.4418303  0.542703792          NA
#> Q -0.530739153 -0.241782116 -0.48096139 -0.1800049 -0.264477823          NA
#> R -0.528032212 -0.715485867 -1.58640592 -1.1059389 -0.339345148          NA
#> S -0.065823218 -0.172018427 -0.77793738 -0.4463446 -0.194955239  0.00000000
#> T -0.065823218 -0.172018427 -0.77793738 -0.4463446 -0.194955239 -0.09585422
#> V -0.401253684 -0.367545642 -1.89324968 -1.3562361 -0.152804813          NA
#> W -0.034160317 -0.140189435 -1.05799229 -1.1256358 -1.093879047          NA
#> Y  0.083383588 -0.242293983 -1.12217724 -0.5640514 -0.004045212          NA
#> s  0.059632160  0.750692249  0.06873959  0.1075540  0.101650076          NA
#> t  0.059632160  0.750692249  0.06873959  0.1075540  0.101650076          NA
#> y  0.707878133  0.679784089  0.26351522 -0.1321035  2.184534212          NA
#>              1            2           3           4
#> A -0.812485602 -0.109981413 -0.53574997 -0.33515312
#> C -0.310253562  0.145612247  0.00000000  0.04362448
#> D -0.942307133  1.124791311  1.17957474  0.98389654
#> E -0.201410261  1.154194325  1.37389873  1.13638828
#> F  1.906390375 -0.122334266 -0.21541226 -0.12610808
#> G -0.918660373 -0.888701547 -0.30329392 -0.24827921
#> H -0.671163536 -0.002165667 -0.13020754 -0.01785518
#> I  0.374065718 -0.042308229 -0.25963366 -0.03785821
#> K -1.145924538 -2.141143704 -1.48196851 -1.17755536
#> L  0.032665112 -0.500013836 -0.19379970 -0.02664588
#> M  0.833902077  0.008200014 -0.23463499 -0.20273795
#> N -0.818579360 -0.015082595  0.07710624 -0.20706138
#> P -2.650181828 -0.911044318 -0.71667083  0.10218779
#> Q  0.266756562 -0.411003598 -0.01873185 -0.18852897
#> R -0.532824877 -1.190338611 -1.33715648 -1.18082233
#> S -0.532824877 -0.109981413 -0.21541226 -0.12610808
#> T -0.532824877 -0.109981413 -0.21541226 -0.12610808
#> V -0.008682243 -0.249993850 -0.38571419 -0.85152138
#> W -0.550465037  0.385154897  0.11769504  0.30836088
#> Y  0.360757558  0.526569660  0.07546417 -0.04751733
#> s  0.412402175  1.196984664  1.25574242  1.70655265
#> t  0.412402175  1.196984664  1.25574242  1.70655265
#> y  0.490467444  3.461305904  1.53012070  1.85199884

Besides the 20 standard amino acids, phosphorylated serine, threonine and tyrosine residues (rows s, t and y) are also included. These phosphorylated residues are distinct from the central phospho-acceptor (serine/threonine at position 0) and can have a strong impact on the affinity of a given kinase-substrate pair (phospho-priming).

The central phospho-acceptor site is located at position 0 and only measures the favorability of serine over threonine. The user can exclude this favorability measure by setting the parameter includeSTfavorability to FALSE, in which case the central position doesn’t contribute to the PWM score.

pwms2 <- getKinasePWM(includeSTfavorability=FALSE)
#> see ?JohnsonKinaseData and browseVignettes('JohnsonKinaseData') for documentation
#> loading from cache

4 Processing user-provided phosphosites

Phosphorylated peptides are often represented in one of two formats: either the phosphorylated residue is followed by an asterisk, as in SAGLLS*DEDC, or it is written as a lower-case letter, as in SAGLLsDEDC. To unify the phosphosite representation for PWM matching, JohnsonKinaseData provides the function processPhosphopeptides(). It takes a character vector of phosphopeptides, aligns each of them to its central phospho-acceptor position, and pads and/or truncates the surrounding residues so that the processed site consists of 5 upstream residues, the central acceptor and 4 downstream residues. The central phospho-acceptor position is defined as the left-closest phosphorylated position relative to the midpoint of the peptide, where the midpoint is given by floor(nchar(sites)/2)+1. If no phosphorylated residue is recognized, the midpoint itself is used as the alignment position.

ppeps <- c("SAGLLS*DEDC", "GDtND", "EKGDSN__", "HKRNyGsDER", "PEKS*GyNV")

sites <- processPhosphopeptides(ppeps)
#> Warning in processPhosphopeptides(ppeps): No S/T at central phospho-acceptor
#> position.

sites
#> # A tibble: 5 × 3
#>   sites       processed  acceptor
#>   <chr>       <chr>      <chr>   
#> 1 SAGLLS*DEDC SAGLLSDEDC S       
#> 2 GDtND       ___GDTND__ T       
#> 3 EKGDSN__    _EKGDSN___ S       
#> 4 HKRNyGsDER  _HKRNYGsDE Y       
#> 5 PEKS*GyNV   __PEKSGyNV S
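
The alignment rule described above can be sketched in base R as follows. This is an illustrative reimplementation, not the package's code: align_site() is a hypothetical helper, and two details are assumptions chosen to reproduce the five examples above, namely that the midpoint is computed after the asterisk notation has been converted to lower case, and that ties between equally distant phosphorylated residues are resolved toward the left.

```r
# Hypothetical helper (not part of JohnsonKinaseData): align one peptide to
# its central phospho-acceptor and return a 10-residue site (5 + 1 + 4).
align_site <- function(peptide, upstream = 5L, downstream = 4L) {
  chars <- strsplit(peptide, "")[[1]]
  # Convert asterisk notation ("S*") to lower-case notation ("s")
  star <- which(chars == "*")
  chars[star - 1] <- tolower(chars[star - 1])
  chars <- chars[chars != "*"]

  # Midpoint of the peptide, as defined in the text
  mid <- floor(length(chars) / 2) + 1
  # Candidate acceptors: phosphorylated residues in lower case
  phos <- which(chars %in% c("s", "t", "y"))
  center <- if (length(phos) == 0) {
    mid                                    # fallback: align to the midpoint
  } else {
    phos[order(abs(phos - mid), phos)][1]  # closest; ties go left (assumption)
  }

  # Pad with "_" on both sides, then cut the window around the acceptor
  padded <- c(rep("_", upstream), chars, rep("_", downstream))
  window <- padded[center:(center + upstream + downstream)]
  # The central acceptor is written in upper case
  window[upstream + 1] <- toupper(window[upstream + 1])
  paste(window, collapse = "")
}

ppeps <- c("SAGLLS*DEDC", "GDtND", "EKGDSN__", "HKRNyGsDER", "PEKS*GyNV")
vapply(ppeps, align_site, character(1), USE.NAMES = FALSE)
#> [1] "SAGLLSDEDC" "___GDTND__" "_EKGDSN___" "_HKRNYGsDE" "__PEKSGyNV"
```

For real analyses processPhosphopeptides() should be used; the sketch only makes the padding and truncation arithmetic explicit.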

If a peptide contains several phosphorylated residues, the option onlyCentralAcceptor controls how the acceptor position is selected. Setting onlyCentralAcceptor=FALSE returns all possible aligned phosphosites for a given input peptide. Note that in this case the output is not parallel to the input.

sites <- processPhosphopeptides(ppeps, onlyCentralAcceptor=FALSE)
#> Warning in processPhosphopeptides(ppeps, onlyCentralAcceptor = FALSE): No S/T
#> at central phospho-acceptor position.

sites
#> # A tibble: 7 × 3
#>   sites       processed  acceptor
#>   <chr>       <chr>      <chr>   
#> 1 SAGLLS*DEDC SAGLLSDEDC S       
#> 2 GDtND       ___GDTND__ T       
#> 3 EKGDSN__    _EKGDSN___ S       
#> 4 HKRNyGsDER  _HKRNYGsDE Y       
#> 5 HKRNyGsDER  KRNyGSDER_ S       
#> 6 PEKS*GyNV   __PEKSGyNV S       
#> 7 PEKS*GyNV   PEKsGYNV__ Y
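
Because the output is no longer parallel to the input, it can help to record which input peptide each row came from. The following is a minimal sketch using base R only; the data frame res is a hand-typed copy of the tibble printed above, and the column name input_index is a hypothetical choice. Note that match() returns the first matching position, so this approach assumes the input peptides are unique.

```r
ppeps <- c("SAGLLS*DEDC", "GDtND", "EKGDSN__", "HKRNyGsDER", "PEKS*GyNV")

# Hand-typed copy of the output shown above (for illustration only)
res <- data.frame(
  sites     = c("SAGLLS*DEDC", "GDtND", "EKGDSN__", "HKRNyGsDER",
                "HKRNyGsDER", "PEKS*GyNV", "PEKS*GyNV"),
  processed = c("SAGLLSDEDC", "___GDTND__", "_EKGDSN___", "_HKRNYGsDE",
                "KRNyGSDER_", "__PEKSGyNV", "PEKsGYNV__"),
  acceptor  = c("S", "T", "S", "Y", "S", "S", "Y")
)

# Position of each row's peptide in the original input vector
res$input_index <- match(res$sites, ppeps)
res$input_index
#> [1] 1 2 3 4 4 5 5

# Group the aligned sites by input peptide
by_input <- split(res$processed, res$input_index)
lengths(by_input)
#> 1 2 3 4 5 
#> 1 1 1 2 2
```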

5 Scoring of user-provided phosphosites

Once peptides are processed to sites, the function scorePhosphosites() can be used to create a matrix of kinase-substrate match scores.

selected <- sites |> 
  dplyr::filter(acceptor %in% c('S','T')) |> 
  dplyr::pull(processed)

scores <- scorePhosphosites(pwms, selected)

dim(scores)
#> [1]   5 303

scores[,1:5]
#>                 AAK1     ACVR2A      ACVR2B       AKT1       AKT2
#> SAGLLSDEDC -6.794078 -0.1666423  0.30390179 -5.8821117 -4.7783302
#> ___GDTND__ -4.803921 -1.0410203 -0.56120674 -2.8360934 -2.5125933
#> _EKGDSN___ -8.274386 -1.5402977 -0.92960511 -0.6188352 -0.8554523
#> KRNyGSDER_ -6.290564 -1.9202469 -1.38766899 -3.0601553 -1.7486155
#> __PEKSGyNV  1.695554 -0.1171313  0.06161951 -4.7296786 -3.6486856
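
To make the meaning of a match score concrete, the sketch below scores a site against a toy PWM by summing one log2-odds entry per position. It rests on assumptions that are not confirmed by the text above: that the score is the plain sum of the matching matrix entries, and that padding characters and NA entries contribute nothing. The matrix toy_pwm contains made-up values and score_site() is a hypothetical helper, not the implementation of scorePhosphosites().

```r
# Toy PWM with made-up log2-odds values: 3 residues x 3 positions.
# matrix() fills column by column, so each line below is one position.
toy_pwm <- matrix(
  c(1.0, -0.5,  0.2,   # position -1: D, S, T
     NA,  0.0, -0.1,   # position  0: only S/T are defined
    0.3,  0.8, -1.2),  # position  1: D, S, T
  nrow = 3,
  dimnames = list(c("D", "S", "T"), c("-1", "0", "1"))
)

# Hypothetical helper: sum the matrix entries selected by the site
score_site <- function(pwm, site) {
  residues <- strsplit(site, "")[[1]]
  stopifnot(length(residues) == ncol(pwm))
  # Row index per position; characters absent from the PWM (e.g. "_") give NA
  idx <- match(residues, rownames(pwm))
  values <- pwm[cbind(idx, seq_along(residues))]
  # Assumption: missing entries are skipped rather than penalized
  sum(values, na.rm = TRUE)
}

score_site(toy_pwm, "DSD")  # 1.0 + 0.0 + 0.3
#> [1] 1.3
score_site(toy_pwm, "_TT")  # skipped + (-0.1) + (-1.2)
#> [1] -1.3
```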

The PWM scoring can be parallelized by supplying a BiocParallelParam object to BPPARAM=.

scores <- scorePhosphosites(pwms, selected, BPPARAM=BiocParallel::SerialParam())

By default, the resulting score is the log2-odds score of the PWM. Alternatively, setting scoreType="percentile" returns the percentile rank of the log2-odds score. The rank is calculated for each PWM against a background score distribution, which is derived by matching that PWM to the 85,603 unique phosphosites published in Johnson et al. (2023).

scores <- scorePhosphosites(pwms, selected, scoreType="percentile")
#> see ?JohnsonKinaseData and browseVignettes('JohnsonKinaseData') for documentation
#> loading from cache

scores[,1:5]
#>                 AAK1   ACVR2A   ACVR2B     AKT1     AKT2
#> SAGLLSDEDC 22.375586 79.73910 83.79933 14.73447 14.59609
#> ___GDTND__ 53.371824 67.48779 74.89617 56.34769 53.31220
#> _EKGDSN___  7.927565 57.36739 69.80942 79.14942 74.56646
#> KRNyGSDER_ 29.304770 48.35330 61.93582 53.01150 64.98986
#> __PEKSGyNV 98.620247 80.26811 81.54857 28.17005 32.26440
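
The idea of a percentile rank can be illustrated with a small, self-contained sketch. The background vector below is made up and stands in for the scores of one PWM against the background phosphosites; percentile_rank() is a hypothetical helper. Whether the package counts ties with "less than or equal" as done here, or uses a different convention, is an assumption.

```r
# Made-up background scores for a single PWM (stand-in for the real
# background distribution, which contains one score per background site)
background <- c(-8, -6, -5, -3, -2, -1, 0, 1, 2, 4)

# Hypothetical helper: percentage of background scores <= the query score
percentile_rank <- function(score, background) {
  100 * stats::ecdf(background)(score)
}

percentile_rank(c(-2.5, 4, -10), background)
#> [1]  40 100   0
```

A percentile rank close to 100 means that the site scores higher than almost all background sites for that kinase, which makes ranks comparable across PWMs with different score ranges.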

Quantifying PWM matches by percentile rank was first described by Yaffe et al. (2001). It is also the matching score underlying the kinase activity predictions published by Johnson et al. (2023).

Note that these percentile ranks do not account for phospho-priming, because non-central phosphorylated residues were missing from the background sites published in Johnson et al. (2023). In other words, the score distributions derived from the background sites do not reflect the impact of phospho-priming.

6 Session info

sessionInfo()
#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] JohnsonKinaseData_1.1.0 BiocStyle_2.33.0       
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.45.0         xfun_0.43               bslib_0.7.0            
#>  [4] Biobase_2.65.0          vctrs_0.6.5             tools_4.4.0            
#>  [7] generics_0.1.3          stats4_4.4.0            curl_5.2.1             
#> [10] parallel_4.4.0          tibble_3.2.1            fansi_1.0.6            
#> [13] AnnotationDbi_1.67.0    RSQLite_2.3.6           blob_1.2.4             
#> [16] pkgconfig_2.0.3         checkmate_2.3.1         dbplyr_2.5.0           
#> [19] S4Vectors_0.43.0        lifecycle_1.0.4         GenomeInfoDbData_1.2.12
#> [22] stringr_1.5.1           compiler_4.4.0          Biostrings_2.73.0      
#> [25] codetools_0.2-20        GenomeInfoDb_1.41.0     htmltools_0.5.8.1      
#> [28] sass_0.4.9              yaml_2.3.8              tidyr_1.3.1            
#> [31] pillar_1.9.0            crayon_1.5.2            jquerylib_0.1.4        
#> [34] BiocParallel_1.39.0     cachem_1.0.8            mime_0.12              
#> [37] ExperimentHub_2.13.0    AnnotationHub_3.13.0    tidyselect_1.2.1       
#> [40] digest_0.6.35           stringi_1.8.3           purrr_1.0.2            
#> [43] dplyr_1.1.4             bookdown_0.39           BiocVersion_3.20.0     
#> [46] fastmap_1.1.1           cli_3.6.2               magrittr_2.0.3         
#> [49] utf8_1.2.4              withr_3.0.0             backports_1.4.1        
#> [52] filelock_1.0.3          UCSC.utils_1.1.0        rappdirs_0.3.3         
#> [55] bit64_4.0.5             rmarkdown_2.26          XVector_0.45.0         
#> [58] httr_1.4.7              bit_4.0.5               png_0.1-8              
#> [61] memoise_2.0.1           evaluate_0.23           knitr_1.46             
#> [64] IRanges_2.39.0          BiocFileCache_2.13.0    rlang_1.1.3            
#> [67] glue_1.7.0              DBI_1.2.2               BiocManager_1.30.22    
#> [70] BiocGenerics_0.51.0     jsonlite_1.8.8          R6_2.5.1               
#> [73] zlibbioc_1.51.0

References

Johnson, Jared L., Tomer M. Yaron, Emily M. Huntsman, Alexander Kerelsky, Junho Song, Amit Regev, Ting-Yu Lin, et al. 2023. “An Atlas of Substrate Specificities for the Human Serine/Threonine Kinome.” Nature 613 (7945): 759–66. https://doi.org/10.1038/s41586-022-05575-3.

Yaffe, Michael B., German G. Leparc, Jack Lai, Toshiyuki Obata, Stefano Volinia, and Lewis C. Cantley. 2001. “A Motif-Based Profile Scanning Approach for Genome-Wide Prediction of Signaling Pathways.” Nature Biotechnology 19 (4): 348–53. https://doi.org/10.1038/86737.