1 Installation

Install the package using Bioconductor. Start R and enter:

if(!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
BiocManager::install("ProteinGymR")

2 Setup

Now, load the package and dependencies used in the vignette.

library(ProteinGymR)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(ComplexHeatmap)
library(AnnotationHub)

3 Introduction

Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins to address our most pressing challenges in climate, agriculture and healthcare. Despite an increase in machine learning-based protein modeling methods, assessing the effectiveness of these models is problematic due to the use of distinct, often contrived, experimental datasets and variable performance across different protein families.

ProteinGym v1.1 is a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design, curated by Notin et al. (2023). It encompasses a broad collection of over 250 standardized deep mutational scanning (DMS) assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. Furthermore, ProteinGym reports the performance of a diverse set of over 60 high-performing models from various subfields (e.g., mutation effects, inverse folding) in a unified benchmark.

ProteinGym v1.1 datasets are openly available as a community resource both on Zenodo and the official ProteinGym website.

4 Data

The ProteinGymR package provides the following analysis-ready datasets from ProteinGym v1.1:

  1. DMS assay scores from 217 assays measuring the impact of all possible amino acid substitutions across 186 proteins. The data is provided with dms_substitutions().

  2. AlphaMissense pathogenicity scores for ~1.6 M substitutions in the ProteinGym DMS data. The data is provided with am_scores().

  3. Five model performance metrics (“AUC”, “MCC”, “NDCG”, “Spearman”, “Top_recall”) for 62 models across 217 assays calculated on DMS substitutions in a zero-shot setting. The data is provided with zeroshot_DMS_metrics().

  4. Reference file containing metadata associated with the 217 DMS assays.

5 Explore and visualize data

This vignette explores and visualizes the first dataset of DMS scores.

Deep mutational scanning is an experimental technique that provides comprehensive data on the functional effects of all possible single mutations in a protein (Fowler & Fields 2014). For each position in a protein, the amino acid residue is mutated and the fitness effects are recorded. While most mutations tend to be deleterious, some can enhance protein activity. In addition to analyzing single mutations, this method can also examine the effects of multiple mutations, yielding insights into protein structure and function. Overall, DMS scores provide a detailed map of how changes in a protein’s sequence affect its function, offering valuable yet complex insights for researchers studying protein biology.
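
To make the scale of a single-substitution scan concrete, the following minimal sketch enumerates every single amino acid substitution of a short peptide in base R. The peptide "MKNP" and the helper name enumerate_single_mutants() are made up for illustration and are not part of ProteinGymR. Mutants are written in the same reference-position-alternate notation used in the DMS tables.

```r
# The 20 standard amino acids, one letter each
amino_acids <- unlist(strsplit("ACDEFGHIKLMNPQRSTVWY", ""))

# Hypothetical helper: list all single substitutions of a sequence
# in <ref><position><alt> notation (e.g. "M1A")
enumerate_single_mutants <- function(sequence) {
    residues <- unlist(strsplit(sequence, ""))
    mutants <- lapply(seq_along(residues), function(pos) {
        ref <- residues[pos]
        alt <- setdiff(amino_acids, ref)   # the 19 other residues
        paste0(ref, pos, alt)
    })
    unlist(mutants)
}

toy_mutants <- enumerate_single_mutants("MKNP")
length(toy_mutants)   # 4 positions x 19 substitutions = 76
head(toy_mutants, 3)
```

A protein of length L therefore has 19 x L possible single substitutions, which is why one assay can contain thousands of rows.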

5.1 Load and explore the DMS data from ExperimentHub

Datasets in ProteinGymR can be easily loaded with built-in functions.

dms_data <- dms_substitutions()

View the DMS study names for the first 6 assays.

head(names(dms_data))
#> [1] "A0A140D2T1_ZIKV_Sourisseau_2019"  "A0A192B1T2_9HIV1_Haddox_2018"    
#> [3] "A0A1I9GEU1_NEIME_Kennouche_2019"  "A0A247D711_LISMN_Stadelmann_2021"
#> [5] "A0A2Z5U3Z0_9INFA_Doud_2016"       "A0A2Z5U3Z0_9INFA_Wu_2014"
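
Because the assays are stored as a named list, ordinary string matching on the names is one way to pick out assays of interest. The sketch below copies the six names printed above into a plain character vector so that it runs without downloading the data. The choice of the "9INFA" token as a search pattern is only an example.

```r
# The six assay names shown above, as a stand-in for names(dms_data)
assay_names <- c(
    "A0A140D2T1_ZIKV_Sourisseau_2019", "A0A192B1T2_9HIV1_Haddox_2018",
    "A0A1I9GEU1_NEIME_Kennouche_2019", "A0A247D711_LISMN_Stadelmann_2021",
    "A0A2Z5U3Z0_9INFA_Doud_2016", "A0A2Z5U3Z0_9INFA_Wu_2014"
)

# Keep assays whose name contains the "_9INFA_" token
flu_assays <- grep("_9INFA_", assay_names, value = TRUE, fixed = TRUE)
flu_assays

# Count assays per leading identifier (text before the first underscore)
assays_per_protein <- table(sub("_.*$", "", assay_names))
assays_per_protein
```

With the real data loaded, the same pattern should apply to names(dms_data), and the matching names can then be used to index the list, for example dms_data[flu_assays].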

View an example of one DMS assay.

head(dms_data[[1]])
#>   UniProt_id                          DMS_id mutant
#> 1 A0A140D2T1 A0A140D2T1_ZIKV_Sourisseau_2019  I291A
#> 2 A0A140D2T1 A0A140D2T1_ZIKV_Sourisseau_2019  I291Y
#> 3 A0A140D2T1 A0A140D2T1_ZIKV_Sourisseau_2019  I291W
#> 4 A0A140D2T1 A0A140D2T1_ZIKV_Sourisseau_2019  I291V
#> 5 A0A140D2T1 A0A140D2T1_ZIKV_Sourisseau_2019  I291T
#> 6 A0A140D2T1 A0A140D2T1_ZIKV_Sourisseau_2019  I291S
#>   mutated_sequence
#> 1 MKNPKKKSGGFRIVNMLKRGVARVNPLGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKERKRRGADTSIGIIGLLLTTAMAAEITRRGSAYYMYLDRSDAGKAISFATTLGVNKCHVQIMDLGHMCDATMSYECPMLDEGVEPDDVDCWCNTTSTWVVYGTCHHKKGEARRSRRAVTLPSHSTRKLQTRSQTWLESREYTKHLIKVENWIFRNPGFALVAVAIAWLLGSSTSQKVIYLVMILLIAPAYSARCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEAYLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFTCSKKMTGKSIQPENLEYRIMLSVHGSQHSGMIVNDTGYETDENRAKVEVTPNSPRAEATLGGFGSLGLDCEPRTGLDFSDLYYLTMNNKHWLVHKEWFHDIPLPWHAGADTGTPHWNNKEALVEFKDAHAKRQTVVVLGSQEGAVHTALAGALEAEMDGAKGKLFSGHLKCRLKMDKLRLKGVSYSLCTAAFTFTKVPAETLHGTVTVEVQYAGTDGPCKIPVQMAVDMQTLTPVGRLITANPVITESTENSKMMLELDPPFGDSYIVIGVGDKKITHHWHRSGSTIGKAFEATVRGAKRMAVLGDTAWDFGSVGGVFNSLGKGIHQIFGAAFKSLFGGMSWFSQILIGTLLVWLGLNTKNGSISLTCLALGGVMIFLSTAVSADVGCSVDFSKKETRCGTGVFIYNDVEAWRDRYKYHPDSPRRLAAAVKQAWEEGICGISSVSRMENIMWKSVEGELNAILEENGVQLTVVVGSVKNPMWRGPQRLPVPVNELPHGWKAWGKSYFVRAAKTNNSFVVDGDTLKECPLEHRAWNSFLVEDHGFGVFHTSVWLKVREDYSLECDPAVIGTAVKGREAAHSDLGYWIESEKNDTWRLKRAHLIEMKTCEWPKSHTLWTDGVEESDLIIPKSLAGPLSHHNTREGYRTQVKGPWHSEELEIRFEECPGTKVYVEETCGTRGPSLRSTTASGRVIEEWCCRECTMPPLSFRAKDGCWYGMEIRPRKEPESNLVRSMVTAGSTDHMDHFSLGVLVILLMVQEGLKKRMTTKIIMSTSMAVLVVMILGGFSMSDLAKLVILMGATFAEMNTGGDVAHLALVAAFKVRPALLVSFIFRANWTPRESMLLALASCLLQTAISALEGDLMVLINGFALAWLAIRAMAVPRTDNIALPILAALTPLARGTLLVAWRAGLATCGGIMLLSLKGKGSVKKNLPFVMALGLTAVRVVDPINVVGLLLLTRSGKRSWPPSEVLTAVGLICALAGGFAKADIEMAGPMAAVGLLIVSYVVSGKSVDMYIERAGDITWEKDAEVTGNSPRLDVALDESGDFSLVEEDGPPMREIILKVVLMAICGMNPIAIPFAAGAWYVYVKTGKRSGALWDVPAPKEVKKGETTDGVYRVMTRRLLGSTQVGVGVMQEGVFHTMWHVTKGAALRSGEGRLDPYWGDVKQDLVSYCGPWKLDAAWDGLSEVQLLAVPPGERARNIQTLPGIFKTKDGDIGAVALDYPAGTSGSPILDKCGRVIGLYGNGVVIKNGSYVSAITQGKREEETPVECFEPSMLKKKQLTVLDLHPGAGKTRRVLPEIVREAIKKRLRTVILAPTRVVAAEMEEALRGLPVRYMTTAVNVTHSGTEIVDLMCHATFTSRLLQPIRVPNYNLYIMDEAHFTDPSSIAARGYISTRVEMGEAAAIFMTATPPGTRDAFPDSNSPIMDTEVEVPERAWSSGFDWVTDHSGKTVWFVPSVRNGNEIAACLTKAGKRVIQLSRKTFETEFQKTKNQEWDFVITTDISEMGANFKADRVIDSRRCLKPVILDGERVILAGPMPVTHASAAQRRGRIGRNPNKPGDEYMYGGGCAETDEGHAHWLEARML
LDNIYLQDGLIASLYRPEADKVAAIEGEFKLRTEQRKTFVELMKRGDLPVWLAYQVASAGITYTDRRWCFDGTTNNTIMEDSVPAEVWTKYGEKRVLKPRWMDARVCSDHAALKSFKEFAAGKRGAALGVMEALGTLPGHMTERFQEAIDNLAVLMRAETGSRPYKAAAAQLPETLETIMLLGLLGTVSLGIFFVLMRNKGIGKMGFGMVTLGASAWLMWLSEIEPARIACVLIVVFLLLVVLIPEPEKQRSPQDNQMAIIIMVAVGLLGLITANELGWLERTKNDIAHLMGRREEGATMGFSMDIDLRPASAWAIYAALTTLITPAVQHAVTTSYNNYSLMAMATQAGVLFGMGKGMPFYAWDLGVPLLMMGCYSQLTPLTLIVAIILLVAHYMYLIPGLQAAAARAAQKRTAAGIMKNPVVDGIVVTDIDTMTIDPQVEKKMGQVLLIAVAISSAVLLRTAWGWGEAGALITAATSTLWEGSPNKYWNSSTATSLCNIFRGSYLAGASLIYTVTRNAGLVKRRGGGTGETLGEKWKARLNQMSALEFYSYKKSGITEVCREEARRALKDGVATGGHAVSRGSAKLRWLVERGYLQPYGKVVDLGCGRGGWSYYAATIRKVQEVRGYTKGGPGHEEPMLVQSYGWNIVRLKSGVDVFHMAAEPCDTLLCDIGESSSSPEVEETRTLRVLSMVGDWLEKRPGAFCIKVLCPYTSTMMETMERLQRRHGGGLVRVPLSRNSTHEMYWVSGAKSNIIKSVSTTSQLLLGRMDGPRRPVKYEEDVNLGSGTRAVASCAEAPNMKIIGRRIERIRNEHAETWFLDENHPYRTWAYHGSYEAPTQGSASSLVNGVVRLLSKPWDVVTGVTGIAMTDTTPYGQQRVFKEKVDTRVPDPQEGTRQVMNIVSSWLWKELGKRKRPRVCTKEEFINKVRSNAALGAIFEEEKEWKTAVEAVNDPRFWALVDREREHHLRGECHSCVYNMMGKREKKQGEFGKAKGSRAIWYMWLGARFLEFEALGFLNEDHWMGRENSGGGVEGLGLQRLGYILEEMNRAPGGKMYADDTAGWDTRISKFDLENEALITNQMEEGHRTLALAVIKYTYQNKVVKVLRPAEGGKTVMDIISRQDQRGSGQVVTYALNTFTNLVVQLIRNMEAEEVLEMQDLWLLRKPEKVTRWLQSNGWDRLKRMAVSGDDCVVKPIDDRFAHALRFLNDMGKVRKDTQEWKPSTGWSNWEEVPFCSHHFNKLYLKDGRSIVVPCRHQDELIGRARVSPGAGWSIRETACLAKSYAQMWQLLYFHRRDLRLMANAICSAVPVDWVPTGRTTWSIHGKGEWMTTEDMLMVWNRVWIEENDHMEDKTPVTKWTDIPYLGKREDLWCGSLIGHRPRTTWAENIKDTVNMVRRIIGDEEKYMDYLSTQVRYLGEEGSTPGVL
#> 2 MKNPKKKSGGFRIVNMLKRGVARVNPLGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKERKRRGADTSIGIIGLLLTTAMAAEITRRGSAYYMYLDRSDAGKAISFATTLGVNKCHVQIMDLGHMCDATMSYECPMLDEGVEPDDVDCWCNTTSTWVVYGTCHHKKGEARRSRRAVTLPSHSTRKLQTRSQTWLESREYTKHLIKVENWIFRNPGFALVAVAIAWLLGSSTSQKVIYLVMILLIAPAYSYRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEAYLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFTCSKKMTGKSIQPENLEYRIMLSVHGSQHSGMIVNDTGYETDENRAKVEVTPNSPRAEATLGGFGSLGLDCEPRTGLDFSDLYYLTMNNKHWLVHKEWFHDIPLPWHAGADTGTPHWNNKEALVEFKDAHAKRQTVVVLGSQEGAVHTALAGALEAEMDGAKGKLFSGHLKCRLKMDKLRLKGVSYSLCTAAFTFTKVPAETLHGTVTVEVQYAGTDGPCKIPVQMAVDMQTLTPVGRLITANPVITESTENSKMMLELDPPFGDSYIVIGVGDKKITHHWHRSGSTIGKAFEATVRGAKRMAVLGDTAWDFGSVGGVFNSLGKGIHQIFGAAFKSLFGGMSWFSQILIGTLLVWLGLNTKNGSISLTCLALGGVMIFLSTAVSADVGCSVDFSKKETRCGTGVFIYNDVEAWRDRYKYHPDSPRRLAAAVKQAWEEGICGISSVSRMENIMWKSVEGELNAILEENGVQLTVVVGSVKNPMWRGPQRLPVPVNELPHGWKAWGKSYFVRAAKTNNSFVVDGDTLKECPLEHRAWNSFLVEDHGFGVFHTSVWLKVREDYSLECDPAVIGTAVKGREAAHSDLGYWIESEKNDTWRLKRAHLIEMKTCEWPKSHTLWTDGVEESDLIIPKSLAGPLSHHNTREGYRTQVKGPWHSEELEIRFEECPGTKVYVEETCGTRGPSLRSTTASGRVIEEWCCRECTMPPLSFRAKDGCWYGMEIRPRKEPESNLVRSMVTAGSTDHMDHFSLGVLVILLMVQEGLKKRMTTKIIMSTSMAVLVVMILGGFSMSDLAKLVILMGATFAEMNTGGDVAHLALVAAFKVRPALLVSFIFRANWTPRESMLLALASCLLQTAISALEGDLMVLINGFALAWLAIRAMAVPRTDNIALPILAALTPLARGTLLVAWRAGLATCGGIMLLSLKGKGSVKKNLPFVMALGLTAVRVVDPINVVGLLLLTRSGKRSWPPSEVLTAVGLICALAGGFAKADIEMAGPMAAVGLLIVSYVVSGKSVDMYIERAGDITWEKDAEVTGNSPRLDVALDESGDFSLVEEDGPPMREIILKVVLMAICGMNPIAIPFAAGAWYVYVKTGKRSGALWDVPAPKEVKKGETTDGVYRVMTRRLLGSTQVGVGVMQEGVFHTMWHVTKGAALRSGEGRLDPYWGDVKQDLVSYCGPWKLDAAWDGLSEVQLLAVPPGERARNIQTLPGIFKTKDGDIGAVALDYPAGTSGSPILDKCGRVIGLYGNGVVIKNGSYVSAITQGKREEETPVECFEPSMLKKKQLTVLDLHPGAGKTRRVLPEIVREAIKKRLRTVILAPTRVVAAEMEEALRGLPVRYMTTAVNVTHSGTEIVDLMCHATFTSRLLQPIRVPNYNLYIMDEAHFTDPSSIAARGYISTRVEMGEAAAIFMTATPPGTRDAFPDSNSPIMDTEVEVPERAWSSGFDWVTDHSGKTVWFVPSVRNGNEIAACLTKAGKRVIQLSRKTFETEFQKTKNQEWDFVITTDISEMGANFKADRVIDSRRCLKPVILDGERVILAGPMPVTHASAAQRRGRIGRNPNKPGDEYMYGGGCAETDEGHAHWLEARML
LDNIYLQDGLIASLYRPEADKVAAIEGEFKLRTEQRKTFVELMKRGDLPVWLAYQVASAGITYTDRRWCFDGTTNNTIMEDSVPAEVWTKYGEKRVLKPRWMDARVCSDHAALKSFKEFAAGKRGAALGVMEALGTLPGHMTERFQEAIDNLAVLMRAETGSRPYKAAAAQLPETLETIMLLGLLGTVSLGIFFVLMRNKGIGKMGFGMVTLGASAWLMWLSEIEPARIACVLIVVFLLLVVLIPEPEKQRSPQDNQMAIIIMVAVGLLGLITANELGWLERTKNDIAHLMGRREEGATMGFSMDIDLRPASAWAIYAALTTLITPAVQHAVTTSYNNYSLMAMATQAGVLFGMGKGMPFYAWDLGVPLLMMGCYSQLTPLTLIVAIILLVAHYMYLIPGLQAAAARAAQKRTAAGIMKNPVVDGIVVTDIDTMTIDPQVEKKMGQVLLIAVAISSAVLLRTAWGWGEAGALITAATSTLWEGSPNKYWNSSTATSLCNIFRGSYLAGASLIYTVTRNAGLVKRRGGGTGETLGEKWKARLNQMSALEFYSYKKSGITEVCREEARRALKDGVATGGHAVSRGSAKLRWLVERGYLQPYGKVVDLGCGRGGWSYYAATIRKVQEVRGYTKGGPGHEEPMLVQSYGWNIVRLKSGVDVFHMAAEPCDTLLCDIGESSSSPEVEETRTLRVLSMVGDWLEKRPGAFCIKVLCPYTSTMMETMERLQRRHGGGLVRVPLSRNSTHEMYWVSGAKSNIIKSVSTTSQLLLGRMDGPRRPVKYEEDVNLGSGTRAVASCAEAPNMKIIGRRIERIRNEHAETWFLDENHPYRTWAYHGSYEAPTQGSASSLVNGVVRLLSKPWDVVTGVTGIAMTDTTPYGQQRVFKEKVDTRVPDPQEGTRQVMNIVSSWLWKELGKRKRPRVCTKEEFINKVRSNAALGAIFEEEKEWKTAVEAVNDPRFWALVDREREHHLRGECHSCVYNMMGKREKKQGEFGKAKGSRAIWYMWLGARFLEFEALGFLNEDHWMGRENSGGGVEGLGLQRLGYILEEMNRAPGGKMYADDTAGWDTRISKFDLENEALITNQMEEGHRTLALAVIKYTYQNKVVKVLRPAEGGKTVMDIISRQDQRGSGQVVTYALNTFTNLVVQLIRNMEAEEVLEMQDLWLLRKPEKVTRWLQSNGWDRLKRMAVSGDDCVVKPIDDRFAHALRFLNDMGKVRKDTQEWKPSTGWSNWEEVPFCSHHFNKLYLKDGRSIVVPCRHQDELIGRARVSPGAGWSIRETACLAKSYAQMWQLLYFHRRDLRLMANAICSAVPVDWVPTGRTTWSIHGKGEWMTTEDMLMVWNRVWIEENDHMEDKTPVTKWTDIPYLGKREDLWCGSLIGHRPRTTWAENIKDTVNMVRRIIGDEEKYMDYLSTQVRYLGEEGSTPGVL
#> 3 MKNPKKKSGGFRIVNMLKRGVARVNPLGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKERKRRGADTSIGIIGLLLTTAMAAEITRRGSAYYMYLDRSDAGKAISFATTLGVNKCHVQIMDLGHMCDATMSYECPMLDEGVEPDDVDCWCNTTSTWVVYGTCHHKKGEARRSRRAVTLPSHSTRKLQTRSQTWLESREYTKHLIKVENWIFRNPGFALVAVAIAWLLGSSTSQKVIYLVMILLIAPAYSWRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEAYLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFTCSKKMTGKSIQPENLEYRIMLSVHGSQHSGMIVNDTGYETDENRAKVEVTPNSPRAEATLGGFGSLGLDCEPRTGLDFSDLYYLTMNNKHWLVHKEWFHDIPLPWHAGADTGTPHWNNKEALVEFKDAHAKRQTVVVLGSQEGAVHTALAGALEAEMDGAKGKLFSGHLKCRLKMDKLRLKGVSYSLCTAAFTFTKVPAETLHGTVTVEVQYAGTDGPCKIPVQMAVDMQTLTPVGRLITANPVITESTENSKMMLELDPPFGDSYIVIGVGDKKITHHWHRSGSTIGKAFEATVRGAKRMAVLGDTAWDFGSVGGVFNSLGKGIHQIFGAAFKSLFGGMSWFSQILIGTLLVWLGLNTKNGSISLTCLALGGVMIFLSTAVSADVGCSVDFSKKETRCGTGVFIYNDVEAWRDRYKYHPDSPRRLAAAVKQAWEEGICGISSVSRMENIMWKSVEGELNAILEENGVQLTVVVGSVKNPMWRGPQRLPVPVNELPHGWKAWGKSYFVRAAKTNNSFVVDGDTLKECPLEHRAWNSFLVEDHGFGVFHTSVWLKVREDYSLECDPAVIGTAVKGREAAHSDLGYWIESEKNDTWRLKRAHLIEMKTCEWPKSHTLWTDGVEESDLIIPKSLAGPLSHHNTREGYRTQVKGPWHSEELEIRFEECPGTKVYVEETCGTRGPSLRSTTASGRVIEEWCCRECTMPPLSFRAKDGCWYGMEIRPRKEPESNLVRSMVTAGSTDHMDHFSLGVLVILLMVQEGLKKRMTTKIIMSTSMAVLVVMILGGFSMSDLAKLVILMGATFAEMNTGGDVAHLALVAAFKVRPALLVSFIFRANWTPRESMLLALASCLLQTAISALEGDLMVLINGFALAWLAIRAMAVPRTDNIALPILAALTPLARGTLLVAWRAGLATCGGIMLLSLKGKGSVKKNLPFVMALGLTAVRVVDPINVVGLLLLTRSGKRSWPPSEVLTAVGLICALAGGFAKADIEMAGPMAAVGLLIVSYVVSGKSVDMYIERAGDITWEKDAEVTGNSPRLDVALDESGDFSLVEEDGPPMREIILKVVLMAICGMNPIAIPFAAGAWYVYVKTGKRSGALWDVPAPKEVKKGETTDGVYRVMTRRLLGSTQVGVGVMQEGVFHTMWHVTKGAALRSGEGRLDPYWGDVKQDLVSYCGPWKLDAAWDGLSEVQLLAVPPGERARNIQTLPGIFKTKDGDIGAVALDYPAGTSGSPILDKCGRVIGLYGNGVVIKNGSYVSAITQGKREEETPVECFEPSMLKKKQLTVLDLHPGAGKTRRVLPEIVREAIKKRLRTVILAPTRVVAAEMEEALRGLPVRYMTTAVNVTHSGTEIVDLMCHATFTSRLLQPIRVPNYNLYIMDEAHFTDPSSIAARGYISTRVEMGEAAAIFMTATPPGTRDAFPDSNSPIMDTEVEVPERAWSSGFDWVTDHSGKTVWFVPSVRNGNEIAACLTKAGKRVIQLSRKTFETEFQKTKNQEWDFVITTDISEMGANFKADRVIDSRRCLKPVILDGERVILAGPMPVTHASAAQRRGRIGRNPNKPGDEYMYGGGCAETDEGHAHWLEARML
LDNIYLQDGLIASLYRPEADKVAAIEGEFKLRTEQRKTFVELMKRGDLPVWLAYQVASAGITYTDRRWCFDGTTNNTIMEDSVPAEVWTKYGEKRVLKPRWMDARVCSDHAALKSFKEFAAGKRGAALGVMEALGTLPGHMTERFQEAIDNLAVLMRAETGSRPYKAAAAQLPETLETIMLLGLLGTVSLGIFFVLMRNKGIGKMGFGMVTLGASAWLMWLSEIEPARIACVLIVVFLLLVVLIPEPEKQRSPQDNQMAIIIMVAVGLLGLITANELGWLERTKNDIAHLMGRREEGATMGFSMDIDLRPASAWAIYAALTTLITPAVQHAVTTSYNNYSLMAMATQAGVLFGMGKGMPFYAWDLGVPLLMMGCYSQLTPLTLIVAIILLVAHYMYLIPGLQAAAARAAQKRTAAGIMKNPVVDGIVVTDIDTMTIDPQVEKKMGQVLLIAVAISSAVLLRTAWGWGEAGALITAATSTLWEGSPNKYWNSSTATSLCNIFRGSYLAGASLIYTVTRNAGLVKRRGGGTGETLGEKWKARLNQMSALEFYSYKKSGITEVCREEARRALKDGVATGGHAVSRGSAKLRWLVERGYLQPYGKVVDLGCGRGGWSYYAATIRKVQEVRGYTKGGPGHEEPMLVQSYGWNIVRLKSGVDVFHMAAEPCDTLLCDIGESSSSPEVEETRTLRVLSMVGDWLEKRPGAFCIKVLCPYTSTMMETMERLQRRHGGGLVRVPLSRNSTHEMYWVSGAKSNIIKSVSTTSQLLLGRMDGPRRPVKYEEDVNLGSGTRAVASCAEAPNMKIIGRRIERIRNEHAETWFLDENHPYRTWAYHGSYEAPTQGSASSLVNGVVRLLSKPWDVVTGVTGIAMTDTTPYGQQRVFKEKVDTRVPDPQEGTRQVMNIVSSWLWKELGKRKRPRVCTKEEFINKVRSNAALGAIFEEEKEWKTAVEAVNDPRFWALVDREREHHLRGECHSCVYNMMGKREKKQGEFGKAKGSRAIWYMWLGARFLEFEALGFLNEDHWMGRENSGGGVEGLGLQRLGYILEEMNRAPGGKMYADDTAGWDTRISKFDLENEALITNQMEEGHRTLALAVIKYTYQNKVVKVLRPAEGGKTVMDIISRQDQRGSGQVVTYALNTFTNLVVQLIRNMEAEEVLEMQDLWLLRKPEKVTRWLQSNGWDRLKRMAVSGDDCVVKPIDDRFAHALRFLNDMGKVRKDTQEWKPSTGWSNWEEVPFCSHHFNKLYLKDGRSIVVPCRHQDELIGRARVSPGAGWSIRETACLAKSYAQMWQLLYFHRRDLRLMANAICSAVPVDWVPTGRTTWSIHGKGEWMTTEDMLMVWNRVWIEENDHMEDKTPVTKWTDIPYLGKREDLWCGSLIGHRPRTTWAENIKDTVNMVRRIIGDEEKYMDYLSTQVRYLGEEGSTPGVL
#> 4 MKNPKKKSGGFRIVNMLKRGVARVNPLGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKERKRRGADTSIGIIGLLLTTAMAAEITRRGSAYYMYLDRSDAGKAISFATTLGVNKCHVQIMDLGHMCDATMSYECPMLDEGVEPDDVDCWCNTTSTWVVYGTCHHKKGEARRSRRAVTLPSHSTRKLQTRSQTWLESREYTKHLIKVENWIFRNPGFALVAVAIAWLLGSSTSQKVIYLVMILLIAPAYSVRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEAYLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFTCSKKMTGKSIQPENLEYRIMLSVHGSQHSGMIVNDTGYETDENRAKVEVTPNSPRAEATLGGFGSLGLDCEPRTGLDFSDLYYLTMNNKHWLVHKEWFHDIPLPWHAGADTGTPHWNNKEALVEFKDAHAKRQTVVVLGSQEGAVHTALAGALEAEMDGAKGKLFSGHLKCRLKMDKLRLKGVSYSLCTAAFTFTKVPAETLHGTVTVEVQYAGTDGPCKIPVQMAVDMQTLTPVGRLITANPVITESTENSKMMLELDPPFGDSYIVIGVGDKKITHHWHRSGSTIGKAFEATVRGAKRMAVLGDTAWDFGSVGGVFNSLGKGIHQIFGAAFKSLFGGMSWFSQILIGTLLVWLGLNTKNGSISLTCLALGGVMIFLSTAVSADVGCSVDFSKKETRCGTGVFIYNDVEAWRDRYKYHPDSPRRLAAAVKQAWEEGICGISSVSRMENIMWKSVEGELNAILEENGVQLTVVVGSVKNPMWRGPQRLPVPVNELPHGWKAWGKSYFVRAAKTNNSFVVDGDTLKECPLEHRAWNSFLVEDHGFGVFHTSVWLKVREDYSLECDPAVIGTAVKGREAAHSDLGYWIESEKNDTWRLKRAHLIEMKTCEWPKSHTLWTDGVEESDLIIPKSLAGPLSHHNTREGYRTQVKGPWHSEELEIRFEECPGTKVYVEETCGTRGPSLRSTTASGRVIEEWCCRECTMPPLSFRAKDGCWYGMEIRPRKEPESNLVRSMVTAGSTDHMDHFSLGVLVILLMVQEGLKKRMTTKIIMSTSMAVLVVMILGGFSMSDLAKLVILMGATFAEMNTGGDVAHLALVAAFKVRPALLVSFIFRANWTPRESMLLALASCLLQTAISALEGDLMVLINGFALAWLAIRAMAVPRTDNIALPILAALTPLARGTLLVAWRAGLATCGGIMLLSLKGKGSVKKNLPFVMALGLTAVRVVDPINVVGLLLLTRSGKRSWPPSEVLTAVGLICALAGGFAKADIEMAGPMAAVGLLIVSYVVSGKSVDMYIERAGDITWEKDAEVTGNSPRLDVALDESGDFSLVEEDGPPMREIILKVVLMAICGMNPIAIPFAAGAWYVYVKTGKRSGALWDVPAPKEVKKGETTDGVYRVMTRRLLGSTQVGVGVMQEGVFHTMWHVTKGAALRSGEGRLDPYWGDVKQDLVSYCGPWKLDAAWDGLSEVQLLAVPPGERARNIQTLPGIFKTKDGDIGAVALDYPAGTSGSPILDKCGRVIGLYGNGVVIKNGSYVSAITQGKREEETPVECFEPSMLKKKQLTVLDLHPGAGKTRRVLPEIVREAIKKRLRTVILAPTRVVAAEMEEALRGLPVRYMTTAVNVTHSGTEIVDLMCHATFTSRLLQPIRVPNYNLYIMDEAHFTDPSSIAARGYISTRVEMGEAAAIFMTATPPGTRDAFPDSNSPIMDTEVEVPERAWSSGFDWVTDHSGKTVWFVPSVRNGNEIAACLTKAGKRVIQLSRKTFETEFQKTKNQEWDFVITTDISEMGANFKADRVIDSRRCLKPVILDGERVILAGPMPVTHASAAQRRGRIGRNPNKPGDEYMYGGGCAETDEGHAHWLEARML
LDNIYLQDGLIASLYRPEADKVAAIEGEFKLRTEQRKTFVELMKRGDLPVWLAYQVASAGITYTDRRWCFDGTTNNTIMEDSVPAEVWTKYGEKRVLKPRWMDARVCSDHAALKSFKEFAAGKRGAALGVMEALGTLPGHMTERFQEAIDNLAVLMRAETGSRPYKAAAAQLPETLETIMLLGLLGTVSLGIFFVLMRNKGIGKMGFGMVTLGASAWLMWLSEIEPARIACVLIVVFLLLVVLIPEPEKQRSPQDNQMAIIIMVAVGLLGLITANELGWLERTKNDIAHLMGRREEGATMGFSMDIDLRPASAWAIYAALTTLITPAVQHAVTTSYNNYSLMAMATQAGVLFGMGKGMPFYAWDLGVPLLMMGCYSQLTPLTLIVAIILLVAHYMYLIPGLQAAAARAAQKRTAAGIMKNPVVDGIVVTDIDTMTIDPQVEKKMGQVLLIAVAISSAVLLRTAWGWGEAGALITAATSTLWEGSPNKYWNSSTATSLCNIFRGSYLAGASLIYTVTRNAGLVKRRGGGTGETLGEKWKARLNQMSALEFYSYKKSGITEVCREEARRALKDGVATGGHAVSRGSAKLRWLVERGYLQPYGKVVDLGCGRGGWSYYAATIRKVQEVRGYTKGGPGHEEPMLVQSYGWNIVRLKSGVDVFHMAAEPCDTLLCDIGESSSSPEVEETRTLRVLSMVGDWLEKRPGAFCIKVLCPYTSTMMETMERLQRRHGGGLVRVPLSRNSTHEMYWVSGAKSNIIKSVSTTSQLLLGRMDGPRRPVKYEEDVNLGSGTRAVASCAEAPNMKIIGRRIERIRNEHAETWFLDENHPYRTWAYHGSYEAPTQGSASSLVNGVVRLLSKPWDVVTGVTGIAMTDTTPYGQQRVFKEKVDTRVPDPQEGTRQVMNIVSSWLWKELGKRKRPRVCTKEEFINKVRSNAALGAIFEEEKEWKTAVEAVNDPRFWALVDREREHHLRGECHSCVYNMMGKREKKQGEFGKAKGSRAIWYMWLGARFLEFEALGFLNEDHWMGRENSGGGVEGLGLQRLGYILEEMNRAPGGKMYADDTAGWDTRISKFDLENEALITNQMEEGHRTLALAVIKYTYQNKVVKVLRPAEGGKTVMDIISRQDQRGSGQVVTYALNTFTNLVVQLIRNMEAEEVLEMQDLWLLRKPEKVTRWLQSNGWDRLKRMAVSGDDCVVKPIDDRFAHALRFLNDMGKVRKDTQEWKPSTGWSNWEEVPFCSHHFNKLYLKDGRSIVVPCRHQDELIGRARVSPGAGWSIRETACLAKSYAQMWQLLYFHRRDLRLMANAICSAVPVDWVPTGRTTWSIHGKGEWMTTEDMLMVWNRVWIEENDHMEDKTPVTKWTDIPYLGKREDLWCGSLIGHRPRTTWAENIKDTVNMVRRIIGDEEKYMDYLSTQVRYLGEEGSTPGVL
#> 5 MKNPKKKSGGFRIVNMLKRGVARVNPLGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKERKRRGADTSIGIIGLLLTTAMAAEITRRGSAYYMYLDRSDAGKAISFATTLGVNKCHVQIMDLGHMCDATMSYECPMLDEGVEPDDVDCWCNTTSTWVVYGTCHHKKGEARRSRRAVTLPSHSTRKLQTRSQTWLESREYTKHLIKVENWIFRNPGFALVAVAIAWLLGSSTSQKVIYLVMILLIAPAYSTRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEAYLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFTCSKKMTGKSIQPENLEYRIMLSVHGSQHSGMIVNDTGYETDENRAKVEVTPNSPRAEATLGGFGSLGLDCEPRTGLDFSDLYYLTMNNKHWLVHKEWFHDIPLPWHAGADTGTPHWNNKEALVEFKDAHAKRQTVVVLGSQEGAVHTALAGALEAEMDGAKGKLFSGHLKCRLKMDKLRLKGVSYSLCTAAFTFTKVPAETLHGTVTVEVQYAGTDGPCKIPVQMAVDMQTLTPVGRLITANPVITESTENSKMMLELDPPFGDSYIVIGVGDKKITHHWHRSGSTIGKAFEATVRGAKRMAVLGDTAWDFGSVGGVFNSLGKGIHQIFGAAFKSLFGGMSWFSQILIGTLLVWLGLNTKNGSISLTCLALGGVMIFLSTAVSADVGCSVDFSKKETRCGTGVFIYNDVEAWRDRYKYHPDSPRRLAAAVKQAWEEGICGISSVSRMENIMWKSVEGELNAILEENGVQLTVVVGSVKNPMWRGPQRLPVPVNELPHGWKAWGKSYFVRAAKTNNSFVVDGDTLKECPLEHRAWNSFLVEDHGFGVFHTSVWLKVREDYSLECDPAVIGTAVKGREAAHSDLGYWIESEKNDTWRLKRAHLIEMKTCEWPKSHTLWTDGVEESDLIIPKSLAGPLSHHNTREGYRTQVKGPWHSEELEIRFEECPGTKVYVEETCGTRGPSLRSTTASGRVIEEWCCRECTMPPLSFRAKDGCWYGMEIRPRKEPESNLVRSMVTAGSTDHMDHFSLGVLVILLMVQEGLKKRMTTKIIMSTSMAVLVVMILGGFSMSDLAKLVILMGATFAEMNTGGDVAHLALVAAFKVRPALLVSFIFRANWTPRESMLLALASCLLQTAISALEGDLMVLINGFALAWLAIRAMAVPRTDNIALPILAALTPLARGTLLVAWRAGLATCGGIMLLSLKGKGSVKKNLPFVMALGLTAVRVVDPINVVGLLLLTRSGKRSWPPSEVLTAVGLICALAGGFAKADIEMAGPMAAVGLLIVSYVVSGKSVDMYIERAGDITWEKDAEVTGNSPRLDVALDESGDFSLVEEDGPPMREIILKVVLMAICGMNPIAIPFAAGAWYVYVKTGKRSGALWDVPAPKEVKKGETTDGVYRVMTRRLLGSTQVGVGVMQEGVFHTMWHVTKGAALRSGEGRLDPYWGDVKQDLVSYCGPWKLDAAWDGLSEVQLLAVPPGERARNIQTLPGIFKTKDGDIGAVALDYPAGTSGSPILDKCGRVIGLYGNGVVIKNGSYVSAITQGKREEETPVECFEPSMLKKKQLTVLDLHPGAGKTRRVLPEIVREAIKKRLRTVILAPTRVVAAEMEEALRGLPVRYMTTAVNVTHSGTEIVDLMCHATFTSRLLQPIRVPNYNLYIMDEAHFTDPSSIAARGYISTRVEMGEAAAIFMTATPPGTRDAFPDSNSPIMDTEVEVPERAWSSGFDWVTDHSGKTVWFVPSVRNGNEIAACLTKAGKRVIQLSRKTFETEFQKTKNQEWDFVITTDISEMGANFKADRVIDSRRCLKPVILDGERVILAGPMPVTHASAAQRRGRIGRNPNKPGDEYMYGGGCAETDEGHAHWLEARML
LDNIYLQDGLIASLYRPEADKVAAIEGEFKLRTEQRKTFVELMKRGDLPVWLAYQVASAGITYTDRRWCFDGTTNNTIMEDSVPAEVWTKYGEKRVLKPRWMDARVCSDHAALKSFKEFAAGKRGAALGVMEALGTLPGHMTERFQEAIDNLAVLMRAETGSRPYKAAAAQLPETLETIMLLGLLGTVSLGIFFVLMRNKGIGKMGFGMVTLGASAWLMWLSEIEPARIACVLIVVFLLLVVLIPEPEKQRSPQDNQMAIIIMVAVGLLGLITANELGWLERTKNDIAHLMGRREEGATMGFSMDIDLRPASAWAIYAALTTLITPAVQHAVTTSYNNYSLMAMATQAGVLFGMGKGMPFYAWDLGVPLLMMGCYSQLTPLTLIVAIILLVAHYMYLIPGLQAAAARAAQKRTAAGIMKNPVVDGIVVTDIDTMTIDPQVEKKMGQVLLIAVAISSAVLLRTAWGWGEAGALITAATSTLWEGSPNKYWNSSTATSLCNIFRGSYLAGASLIYTVTRNAGLVKRRGGGTGETLGEKWKARLNQMSALEFYSYKKSGITEVCREEARRALKDGVATGGHAVSRGSAKLRWLVERGYLQPYGKVVDLGCGRGGWSYYAATIRKVQEVRGYTKGGPGHEEPMLVQSYGWNIVRLKSGVDVFHMAAEPCDTLLCDIGESSSSPEVEETRTLRVLSMVGDWLEKRPGAFCIKVLCPYTSTMMETMERLQRRHGGGLVRVPLSRNSTHEMYWVSGAKSNIIKSVSTTSQLLLGRMDGPRRPVKYEEDVNLGSGTRAVASCAEAPNMKIIGRRIERIRNEHAETWFLDENHPYRTWAYHGSYEAPTQGSASSLVNGVVRLLSKPWDVVTGVTGIAMTDTTPYGQQRVFKEKVDTRVPDPQEGTRQVMNIVSSWLWKELGKRKRPRVCTKEEFINKVRSNAALGAIFEEEKEWKTAVEAVNDPRFWALVDREREHHLRGECHSCVYNMMGKREKKQGEFGKAKGSRAIWYMWLGARFLEFEALGFLNEDHWMGRENSGGGVEGLGLQRLGYILEEMNRAPGGKMYADDTAGWDTRISKFDLENEALITNQMEEGHRTLALAVIKYTYQNKVVKVLRPAEGGKTVMDIISRQDQRGSGQVVTYALNTFTNLVVQLIRNMEAEEVLEMQDLWLLRKPEKVTRWLQSNGWDRLKRMAVSGDDCVVKPIDDRFAHALRFLNDMGKVRKDTQEWKPSTGWSNWEEVPFCSHHFNKLYLKDGRSIVVPCRHQDELIGRARVSPGAGWSIRETACLAKSYAQMWQLLYFHRRDLRLMANAICSAVPVDWVPTGRTTWSIHGKGEWMTTEDMLMVWNRVWIEENDHMEDKTPVTKWTDIPYLGKREDLWCGSLIGHRPRTTWAENIKDTVNMVRRIIGDEEKYMDYLSTQVRYLGEEGSTPGVL
#> 6 MKNPKKKSGGFRIVNMLKRGVARVNPLGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKERKRRGADTSIGIIGLLLTTAMAAEITRRGSAYYMYLDRSDAGKAISFATTLGVNKCHVQIMDLGHMCDATMSYECPMLDEGVEPDDVDCWCNTTSTWVVYGTCHHKKGEARRSRRAVTLPSHSTRKLQTRSQTWLESREYTKHLIKVENWIFRNPGFALVAVAIAWLLGSSTSQKVIYLVMILLIAPAYSSRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEAYLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFTCSKKMTGKSIQPENLEYRIMLSVHGSQHSGMIVNDTGYETDENRAKVEVTPNSPRAEATLGGFGSLGLDCEPRTGLDFSDLYYLTMNNKHWLVHKEWFHDIPLPWHAGADTGTPHWNNKEALVEFKDAHAKRQTVVVLGSQEGAVHTALAGALEAEMDGAKGKLFSGHLKCRLKMDKLRLKGVSYSLCTAAFTFTKVPAETLHGTVTVEVQYAGTDGPCKIPVQMAVDMQTLTPVGRLITANPVITESTENSKMMLELDPPFGDSYIVIGVGDKKITHHWHRSGSTIGKAFEATVRGAKRMAVLGDTAWDFGSVGGVFNSLGKGIHQIFGAAFKSLFGGMSWFSQILIGTLLVWLGLNTKNGSISLTCLALGGVMIFLSTAVSADVGCSVDFSKKETRCGTGVFIYNDVEAWRDRYKYHPDSPRRLAAAVKQAWEEGICGISSVSRMENIMWKSVEGELNAILEENGVQLTVVVGSVKNPMWRGPQRLPVPVNELPHGWKAWGKSYFVRAAKTNNSFVVDGDTLKECPLEHRAWNSFLVEDHGFGVFHTSVWLKVREDYSLECDPAVIGTAVKGREAAHSDLGYWIESEKNDTWRLKRAHLIEMKTCEWPKSHTLWTDGVEESDLIIPKSLAGPLSHHNTREGYRTQVKGPWHSEELEIRFEECPGTKVYVEETCGTRGPSLRSTTASGRVIEEWCCRECTMPPLSFRAKDGCWYGMEIRPRKEPESNLVRSMVTAGSTDHMDHFSLGVLVILLMVQEGLKKRMTTKIIMSTSMAVLVVMILGGFSMSDLAKLVILMGATFAEMNTGGDVAHLALVAAFKVRPALLVSFIFRANWTPRESMLLALASCLLQTAISALEGDLMVLINGFALAWLAIRAMAVPRTDNIALPILAALTPLARGTLLVAWRAGLATCGGIMLLSLKGKGSVKKNLPFVMALGLTAVRVVDPINVVGLLLLTRSGKRSWPPSEVLTAVGLICALAGGFAKADIEMAGPMAAVGLLIVSYVVSGKSVDMYIERAGDITWEKDAEVTGNSPRLDVALDESGDFSLVEEDGPPMREIILKVVLMAICGMNPIAIPFAAGAWYVYVKTGKRSGALWDVPAPKEVKKGETTDGVYRVMTRRLLGSTQVGVGVMQEGVFHTMWHVTKGAALRSGEGRLDPYWGDVKQDLVSYCGPWKLDAAWDGLSEVQLLAVPPGERARNIQTLPGIFKTKDGDIGAVALDYPAGTSGSPILDKCGRVIGLYGNGVVIKNGSYVSAITQGKREEETPVECFEPSMLKKKQLTVLDLHPGAGKTRRVLPEIVREAIKKRLRTVILAPTRVVAAEMEEALRGLPVRYMTTAVNVTHSGTEIVDLMCHATFTSRLLQPIRVPNYNLYIMDEAHFTDPSSIAARGYISTRVEMGEAAAIFMTATPPGTRDAFPDSNSPIMDTEVEVPERAWSSGFDWVTDHSGKTVWFVPSVRNGNEIAACLTKAGKRVIQLSRKTFETEFQKTKNQEWDFVITTDISEMGANFKADRVIDSRRCLKPVILDGERVILAGPMPVTHASAAQRRGRIGRNPNKPGDEYMYGGGCAETDEGHAHWLEARML
LDNIYLQDGLIASLYRPEADKVAAIEGEFKLRTEQRKTFVELMKRGDLPVWLAYQVASAGITYTDRRWCFDGTTNNTIMEDSVPAEVWTKYGEKRVLKPRWMDARVCSDHAALKSFKEFAAGKRGAALGVMEALGTLPGHMTERFQEAIDNLAVLMRAETGSRPYKAAAAQLPETLETIMLLGLLGTVSLGIFFVLMRNKGIGKMGFGMVTLGASAWLMWLSEIEPARIACVLIVVFLLLVVLIPEPEKQRSPQDNQMAIIIMVAVGLLGLITANELGWLERTKNDIAHLMGRREEGATMGFSMDIDLRPASAWAIYAALTTLITPAVQHAVTTSYNNYSLMAMATQAGVLFGMGKGMPFYAWDLGVPLLMMGCYSQLTPLTLIVAIILLVAHYMYLIPGLQAAAARAAQKRTAAGIMKNPVVDGIVVTDIDTMTIDPQVEKKMGQVLLIAVAISSAVLLRTAWGWGEAGALITAATSTLWEGSPNKYWNSSTATSLCNIFRGSYLAGASLIYTVTRNAGLVKRRGGGTGETLGEKWKARLNQMSALEFYSYKKSGITEVCREEARRALKDGVATGGHAVSRGSAKLRWLVERGYLQPYGKVVDLGCGRGGWSYYAATIRKVQEVRGYTKGGPGHEEPMLVQSYGWNIVRLKSGVDVFHMAAEPCDTLLCDIGESSSSPEVEETRTLRVLSMVGDWLEKRPGAFCIKVLCPYTSTMMETMERLQRRHGGGLVRVPLSRNSTHEMYWVSGAKSNIIKSVSTTSQLLLGRMDGPRRPVKYEEDVNLGSGTRAVASCAEAPNMKIIGRRIERIRNEHAETWFLDENHPYRTWAYHGSYEAPTQGSASSLVNGVVRLLSKPWDVVTGVTGIAMTDTTPYGQQRVFKEKVDTRVPDPQEGTRQVMNIVSSWLWKELGKRKRPRVCTKEEFINKVRSNAALGAIFEEEKEWKTAVEAVNDPRFWALVDREREHHLRGECHSCVYNMMGKREKKQGEFGKAKGSRAIWYMWLGARFLEFEALGFLNEDHWMGRENSGGGVEGLGLQRLGYILEEMNRAPGGKMYADDTAGWDTRISKFDLENEALITNQMEEGHRTLALAVIKYTYQNKVVKVLRPAEGGKTVMDIISRQDQRGSGQVVTYALNTFTNLVVQLIRNMEAEEVLEMQDLWLLRKPEKVTRWLQSNGWDRLKRMAVSGDDCVVKPIDDRFAHALRFLNDMGKVRKDTQEWKPSTGWSNWEEVPFCSHHFNKLYLKDGRSIVVPCRHQDELIGRARVSPGAGWSIRETACLAKSYAQMWQLLYFHRRDLRLMANAICSAVPVDWVPTGRTTWSIHGKGEWMTTEDMLMVWNRVWIEENDHMEDKTPVTKWTDIPYLGKREDLWCGSLIGHRPRTTWAENIKDTVNMVRRIIGDEEKYMDYLSTQVRYLGEEGSTPGVL
#>    DMS_score DMS_score_bin
#> 1 0.03026869             0
#> 2 0.04860416             1
#> 3 0.09364165             1
#> 4 0.62674654             1
#> 5 1.76206629             1
#> 6 0.01723538             0

For each DMS assay, the columns show the UniProt protein identifier, the DMS experiment assay identifier, the mutant at a given protein position, the mutated protein sequence, the recorded DMS score, and a binary DMS score bin categorizing whether the mutation has an effect on fitness (1) or not (0). For more details, access the function documentation with ?dms_substitutions() and see the reference publication by Notin et al. (2023).
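
As a quick sanity check on how the continuous score and the binary bin relate, the sketch below rebuilds the six rows printed above as a small data frame (the long identifier and sequence columns are omitted) and summarizes the score by bin. The object name toy_assay is an illustrative stand-in and not part of the package.

```r
# The six rows shown above, without the identifier and sequence columns
toy_assay <- data.frame(
    mutant = c("I291A", "I291Y", "I291W", "I291V", "I291T", "I291S"),
    DMS_score = c(0.03026869, 0.04860416, 0.09364165,
                  0.62674654, 1.76206629, 0.01723538),
    DMS_score_bin = c(0, 1, 1, 1, 1, 0)
)

# Number of variants in each bin
bin_counts <- table(toy_assay$DMS_score_bin)
bin_counts

# Range of DMS scores within each bin
score_ranges <- tapply(toy_assay$DMS_score, toy_assay$DMS_score_bin, range)
score_ranges
```

In these six rows, the two variants in bin 0 have the lowest scores, which is what one would expect if the bin is derived by thresholding the score. Six rows are not enough to confirm this for a whole assay, so the same summary is worth repeating on the full table.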

To access the metadata associated with each DMS assay, we can load the reference table. Do this by querying all datasets on ExperimentHub affiliated with “ProteinGymR”.

eh <- ExperimentHub::ExperimentHub()
AnnotationHub::query(eh, "ProteinGymR")
#> ExperimentHub with 4 records
#> # snapshotDate(): 2024-10-24
#> # $dataprovider: Marks Lab at Harvard Medical School, Cheng et al. 2023
#> # $species: NA
#> # $rdataclass: List, Data.Frame
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["EH9554"]]' 
#> 
#>            title                                                             
#>   EH9554 | AlphaMissense pathogenicity scores for variants in ProteinGym     
#>   EH9555 | ProteinGym deep mutational scanning (DMS) assays for substitutions
#>   EH9593 | ProteinGym zero-shot DMS substitution benchmarks                  
#>   EH9607 | ProteinGym metadata for 217 DMS substitution assays

dms_metadata <- eh[["EH9607"]]
names(dms_metadata)
#>  [1] "DMS_id"                      "DMS_filename"               
#>  [3] "UniProt_ID"                  "taxon"                      
#>  [5] "source_organism"             "target_seq"                 
#>  [7] "seq_len"                     "includes_multiple_mutants"  
#>  [9] "DMS_total_number_mutants"    "DMS_number_single_mutants"  
#> [11] "DMS_number_multiple_mutants" "DMS_binarization_cutoff"    
#> [13] "DMS_binarization_method"     "first_author"               
#> [15] "title"                       "year"                       
#> [17] "jo"                          "region_mutated"             
#> [19] "molecule_name"               "selection_assay"            
#> [21] "selection_type"              "MSA_filename"               
#> [23] "MSA_start"                   "MSA_end"                    
#> [25] "MSA_len"                     "MSA_bitscore"               
#> [27] "MSA_theta"                   "MSA_num_seqs"               
#> [29] "MSA_perc_cov"                "MSA_num_cov"                
#> [31] "MSA_N_eff"                   "MSA_Neff_L"                 
#> [33] "MSA_Neff_L_category"         "MSA_num_significant"        
#> [35] "MSA_num_significant_L"       "raw_DMS_filename"           
#> [37] "raw_DMS_phenotype_name"      "raw_DMS_directionality"     
#> [39] "raw_DMS_mutant_column"       "weight_file_name"           
#> [41] "pdb_file"                    "pdb_range"                  
#> [43] "ProteinGym_version"          "raw_mut_offset"             
#> [45] "coarse_selection_type"

There are 45 columns representing metadata for the DMS assays. For more information about these columns, see the ProteinGym publication.

5.2 Visualization of DMS data with ComplexHeatmap

Explore an assay and create a heatmap of the DMS scores.

ACE2 <- dms_data[["ACE2_HUMAN_Chan_2020"]]

We want to grab the reference amino acid, protein position, and mutant residue from the “mutant” column of the dataset.

ACE2 <- 
    ACE2 |>
    dplyr::mutate(
        ## Columns can be referenced by name inside mutate()
        ref = str_sub(mutant, 1, 1),
        pos = as.integer(
            gsub(".*?([0-9]+).*", "\\1", mutant)
            ),
        alt = str_sub(mutant, -1)
    )

ACE2 <- ACE2 |> select("ref", "pos", "alt", "DMS_score")

head(ACE2)
#>   ref pos alt  DMS_score
#> 1   A  25   C -0.8386652
#> 2   A  25   D -1.8664560
#> 3   A  25   E -1.9210106
#> 4   A  25   F  1.2238317
#> 5   A  25   G -0.3568774
#> 6   A  25   H -2.6157298
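
The extraction above assumes every entry in the “mutant” column is a single substitution. A stricter variant, sketched below in base R, matches the whole string against a reference-position-alternate pattern and returns NA for anything that does not fit, rather than silently taking the first run of digits. The helper name parse_mutant() and the colon-separated multi-mutant string in the example are assumptions made for illustration.

```r
# Hypothetical helper: split strings such as "A25C" into ref / pos / alt.
# Entries that are not a single substitution are returned as NA.
parse_mutant <- function(mutant) {
    pattern <- "^([A-Z])([0-9]+)([A-Z])$"
    ok <- grepl(pattern, mutant)

    ref <- rep(NA_character_, length(mutant))
    alt <- rep(NA_character_, length(mutant))
    pos <- rep(NA_integer_, length(mutant))

    # Only convert the entries that matched, so no coercion warnings arise
    ref[ok] <- sub(pattern, "\\1", mutant[ok])
    pos[ok] <- as.integer(sub(pattern, "\\2", mutant[ok]))
    alt[ok] <- sub(pattern, "\\3", mutant[ok])

    data.frame(mutant, ref, pos, alt, stringsAsFactors = FALSE)
}

parsed <- parse_mutant(c("A25C", "I291A", "A25C:K26E"))
parsed
```

Rows with NA in the result can then be inspected or dropped before reshaping, which keeps multi-mutant entries from being mistaken for single substitutions.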

## Reshape the data to wide format
ACE2_wide <- ACE2 |> 
    select(-ref) |> 
    pivot_wider(names_from = alt, values_from = DMS_score) |> 
    arrange(pos)

## Subset to positions up to 100
ACE2_wide <- ACE2_wide |> 
    filter(pos <= 100)

head(ACE2_wide)
#> # A tibble: 6 × 21
#>     pos      C      D      E      F       G      H      I      K      L      M
#>   <int>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1    19 -1.17  -1.35  -1.44   1.13  -1.36   -0.150  0.771 -0.592 -0.495 -0.548
#> 2    20 -0.685 -0.748 -0.246 -1.60  -0.290  -1.14  -1.15  -0.921 -1.01  -0.978
#> 3    21 -1.23  -1.12  -1.15  -1.29   0.0681 -1.27  NA     -1.63  -1.36  -1.88 
#> 4    22 -0.894  0.367 NA     -0.582 -0.536  -0.922 -1.02  -0.847 -0.207 -1.10 
#> 5    23  0.872  0.300 NA      1.87   0.595   0.226 -0.500  0.648  0.248  0.848
#> 6    24 -1.76  -2.34  -0.892 -2.37  -1.78   -1.26  -2.19  -1.59  -1.31  -1.20 
#> # ℹ 10 more variables: N <dbl>, P <dbl>, Q <dbl>, R <dbl>, S <dbl>, T <dbl>,
#> #   V <dbl>, W <dbl>, Y <dbl>, A <dbl>

## Convert to matrix
pos <- ACE2_wide$pos
alt <- colnames(ACE2_wide)
alt <- alt[-c(1)]

heatmap_matrix <- ACE2_wide |> 
    select(2:length(ACE2_wide)) |> 
    as.matrix()

## Set amino acid position as rownames of matrix
rownames(heatmap_matrix) <- pos

## Transpose so position is x-axis
heatmap_matrix <- t(heatmap_matrix)

## Reorder rows based on physicochemical properties
physicochem_order <- "DEKRHNQSTPGAVILMCFYW"
physicochem_order <- unlist(strsplit(physicochem_order, split = ""))

reordered_matrix <- heatmap_matrix[match(physicochem_order, 
    rownames(heatmap_matrix)), ]

## Create the heatmap
ComplexHeatmap::Heatmap(reordered_matrix,
        name = "DMS Score",
        cluster_rows = FALSE,
        cluster_columns = FALSE,
        show_row_names = TRUE,
        show_column_names = TRUE)

The heatmap shows the DMS score at each position along the given protein (x-axis) where a residue was mutated (alternate amino acid on y-axis). For this demonstration, we subset to the first 100 positions and grouped the amino acids by their physicochemical properties (DE,KRH,NQ,ST,PGAVIL,MC,FYW). See here for more information. As a note, not all positions along the protein sequence may be subjected to mutation for every DMS assay. This results from the specific research objectives, prioritization choices of the investigators, or technical constraints inherent to the experimental design.

A low DMS score indicates low fitness, while a higher DMS score indicates high fitness.

Based on the “ACE2_HUMAN_Chan_2020” assay, we can see that at positions 90 and 92, fitness remained high across amino acid changes, possibly suggesting a benign region of the protein. However, several mutations at position 48 resulted in low fitness. This could represent an important region for protein function where any perturbation would likely be deleterious.
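
Reading tolerant and intolerant positions off a heatmap by eye can be complemented with a per-position summary. The sketch below uses a small made-up matrix with the same layout as the heatmap input (alternate residues in rows, positions in columns). All values and position labels are invented for illustration and are not taken from the ACE2 assay.

```r
# Toy stand-in for the heatmap matrix: rows are alternate residues,
# columns are protein positions. All values are made up.
toy_matrix <- matrix(
    c( 0.5,  0.8, -1.9,
       0.7,   NA, -2.2,
       0.6,  0.9, -0.1),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("D", "E", "K"), c("1", "2", "3"))
)

# Summarize each position (column), ignoring unmeasured substitutions
position_summary <- data.frame(
    pos = as.integer(colnames(toy_matrix)),
    mean_score = colMeans(toy_matrix, na.rm = TRUE),
    min_score = apply(toy_matrix, 2, min, na.rm = TRUE),
    n_measured = colSums(!is.na(toy_matrix))
)

# Positions with the lowest mean score come first
ranked_positions <- position_summary[order(position_summary$mean_score), ]
ranked_positions
```

Applied to the real matrix, a table like this gives a ranked list of candidate sensitive positions. The n_measured column is worth keeping in view, since a mean over very few substitutions is less informative.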

6 Benchmarking across models

We will now use the built-in function benchmark_models() to compare performance across several variant effect prediction models calculated on the 217 DMS assays in the zero-shot setting. This function takes one of the five available metrics and compares up to 5 of the 62 available models.

In the zero-shot setting, experimental phenotypical measurements from a given assay are predicted without having access to any ground-truth labels at training time. Robust zero-shot performance is particularly informative when labels are subject to several biases or scarcely available (e.g., labels for rare genetic pathologies).

Model performance was evaluated across 5 metrics:

  1. Spearman’s rank correlation coefficient (primary metric)
  2. Area Under the ROC Curve (AUC)
  3. Matthews Correlation Coefficient (MCC) for bimodal DMS measurements
  4. Normalized Discounted Cumulative Gains (NDCG) for identifying the most functional protein variants
  5. Top K Recall (top 10% of DMS values)

To avoid placing too much weight on properties with many assays (e.g., thermostability), these metrics were first calculated within groups of assays that measure similar functions. The final value of the metric is then the average of these averages, giving each functional group equal weight. The final values are referred to as the ‘corrected average’.
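
The two-step averaging described above can be sketched in base R. The data frame below is entirely hypothetical (assay names, group labels, and values are made up), and the sketch only illustrates the idea of averaging group means. It is not the code ProteinGym used to produce its reported values.

```r
## Hypothetical per-assay metric values (made up for illustration)
toy_scores <- data.frame(
    assay = paste0("assay_", 1:6),
    function_group = c("Stability", "Stability", "Stability", "Stability",
        "Binding", "Activity"),
    spearman = c(0.60, 0.60, 0.60, 0.60, 0.20, 0.80)
)

## Naive mean over assays: dominated by the four stability assays
naive_mean <- mean(toy_scores$spearman)

## Step 1: average within each functional group
group_means <- tapply(toy_scores$spearman, toy_scores$function_group, mean)

## Step 2: average the group means so each group has equal weight
corrected_average <- mean(group_means)

c(naive = naive_mean, corrected = corrected_average)
```

In this toy case the naive mean is pulled toward the over-represented group, while the corrected average weights the three groups equally.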

Due to the often non-linear relationship between protein function and organism fitness (Boucher et al., 2016), the Spearman’s rank correlation coefficient is the most generally appropriate metric for model performance on experimental measurements. However, in situations where DMS measurements exhibit a bimodal profile, rank correlations may not be the optimal choice. Therefore, additional metrics are also provided, such as the Area Under the ROC Curve (AUC) and the Matthews Correlation Coefficient (MCC), which compare model scores with binarized experimental measurements. Furthermore, for certain goals (e.g., optimizing functional properties of designed proteins), it is more important that a model is able to correctly identify the most functional protein variants, rather than properly capture the overall distribution of all assayed variants. Thus, we also calculate the Normalized Discounted Cumulative Gains (NDCG), which up-weights a model if it gives its highest scores to sequences with the highest DMS value. Finally, we also calculate Top K Recall, where we select K to be the top 10% of DMS values.
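
To make three of these metrics concrete, the sketch below computes them on simulated data with base R. It is a simplified illustration under stated assumptions: the scores are simulated rather than taken from ProteinGym, the NDCG gains are min-max scaled DMS scores, and the exact definitions used in the ProteinGym benchmark may differ in detail. AUC and MCC are omitted because they additionally require a binarization threshold.

```r
## Simulated data (not from ProteinGym): 200 variants whose model score
## is a noisy, non-linear function of the DMS score
set.seed(1)
dms_score <- rnorm(200)
model_score <- exp(dms_score) + rnorm(200, sd = 0.5)

## Spearman's rank correlation between model and experiment
spearman <- cor(dms_score, model_score, method = "spearman")

## A strictly monotone, non-linear transform preserves ranks,
## so its Spearman correlation with the original scores is 1
monotone_spearman <- cor(dms_score, exp(dms_score), method = "spearman")

## Top K recall with K set to the top 10% of variants: the fraction of
## the truly best variants that the model also places in its top 10%
k <- ceiling(0.10 * length(dms_score))
top_true <- order(dms_score, decreasing = TRUE)[seq_len(k)]
top_pred <- order(model_score, decreasing = TRUE)[seq_len(k)]
top_k_recall <- length(intersect(top_true, top_pred)) / k

## NDCG: gains are min-max scaled DMS scores (an assumption of this sketch),
## discounted by log2 of the rank assigned by the model
gain <- (dms_score - min(dms_score)) / (max(dms_score) - min(dms_score))
discount <- 1 / log2(seq_along(gain) + 1)
dcg <- sum(gain[order(model_score, decreasing = TRUE)] * discount)
ideal_dcg <- sum(sort(gain, decreasing = TRUE) * discount)
ndcg <- dcg / ideal_dcg

c(spearman = spearman, top_k_recall = top_k_recall, ndcg = ndcg)
```

Because the ideal ordering maximizes the discounted sum, NDCG is bounded above by 1 and reaches it only when the model ranks variants in the same order as the DMS scores.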

To view all available models, use the function: available_models()

available_models()
#>  [1] "Site_Independent"           "EVmutation"                
#>  [3] "DeepSequence_single"        "DeepSequence_ensemble"     
#>  [5] "EVE_single"                 "EVE_ensemble"              
#>  [7] "Unirep"                     "Unirep_evotuned"           
#>  [9] "MSA_Transformer_single"     "MSA_Transformer_ensemble"  
#> [11] "ESM_1b"                     "ESM_1v_single"             
#> [13] "ESM_1v_ensemble"            "ESM2_8M"                   
#> [15] "ESM2_35M"                   "ESM2_150M"                 
#> [17] "ESM2_650M"                  "ESM2_3B"                   
#> [19] "ESM2_15B"                   "Wavenet"                   
#> [21] "RITA_S"                     "RITA_M"                    
#> [23] "RITA_L"                     "RITA_XL"                   
#> [25] "Progen2_S"                  "Progen2_M"                 
#> [27] "Progen2_Base"               "Progen2_L"                 
#> [29] "Progen2_XL"                 "GEMME"                     
#> [31] "VESPA"                      "VESPAl"                    
#> [33] "ProtGPT2"                   "Tranception_S_no_retrieval"
#> [35] "Tranception_M_no_retrieval" "Tranception_L_no_retrieval"
#> [37] "Tranception_S"              "Tranception_M"             
#> [39] "Tranception_L"              "TranceptEVE_S"             
#> [41] "TranceptEVE_M"              "TranceptEVE_L"             
#> [43] "CARP_38M"                   "CARP_600K"                 
#> [45] "CARP_640M"                  "CARP_76M"                  
#> [47] "MIF"                        "MIF_ST"                    
#> [49] "ESM_IF1"                    "ProteinMPNN"               
#> [51] "ProtSSN_k_10_h_512"         "ProtSSN_k_10_h_768"        
#> [53] "ProtSSN_k_10_h_1280"        "ProtSSN_k_20_h_512"        
#> [55] "ProtSSN_k_20_h_768"         "ProtSSN_k_20_h_1280"       
#> [57] "ProtSSN_k_30_h_512"         "ProtSSN_k_30_h_768"        
#> [59] "ProtSSN_k_30_h_1280"        "ProtSSN_ensemble"          
#> [61] "SaProt_650M"                "SaProt_35M"
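
Model names are case-sensitive strings, so a typo is easy to make when selecting models. The helper below is a hypothetical convenience function, not part of ProteinGymR, that checks a selection against a vector of valid names and the limit of 5 models mentioned earlier. Here the valid names are a handful copied from the listing above. In practice the second argument would presumably be the result of available_models().

```r
## Hypothetical helper (not part of ProteinGymR): validate a model
## selection against known names and a maximum number of models
check_model_selection <- function(models, available, max_models = 5) {
    unknown <- setdiff(models, available)
    if (length(unknown) > 0) {
        stop("Unknown model(s): ", paste(unknown, collapse = ", "))
    }
    if (length(models) > max_models) {
        stop("Select at most ", max_models, " models")
    }
    invisible(models)
}

## A few names copied from the listing above, used as a stand-in
known_models <- c("GEMME", "CARP_600K", "ESM_1b", "EVmutation", "ProtGPT2")

## Passes silently and returns the selection invisibly
check_model_selection(c("GEMME", "ESM_1b"), known_models)
```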

Plot the AUC metric for 5 models.

benchmark_models(metric = "AUC", 
    models = c("GEMME", "CARP_600K", "ESM_1b", "EVmutation", "ProtGPT2"))

Based on the AUC metric, GEMME performed the best of the 5 selected models. If the metric argument is not specified, the Spearman correlation is used by default. For more information about the models and metrics, see the function documentation with ?benchmark_models.

7 References

Notin, P., Kollasch, A., Ritter, D., van Niekerk, L., Paul, S., Spinner, H., Rollins, N., Shaw, A., Orenbuch, R., Weitzman, R., Frazer, J., Dias, M., Franceschi, D., Gal, Y., & Marks, D. (2023). ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems (Vol. 36, pp. 64331-64379). Curran Associates, Inc.

Fowler, D., Fields, S. Deep mutational scanning: a new style of protein science. Nat Methods 11, 801–807 (2014). doi: 10.1038/nmeth.3027.

Boucher JI, Bolon DN, Tawfik DS. Quantifying and understanding the fitness effects of protein mutations: Laboratory versus nature. Protein Sci. 2016 Jul; 25(7):1219-26. doi: 10.1002/pro.2928.

8 Session Info

sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] AnnotationHub_3.15.0  BiocFileCache_2.15.0  dbplyr_2.5.0         
#>  [4] BiocGenerics_0.53.1   generics_0.1.3        ComplexHeatmap_2.23.0
#>  [7] ggplot2_3.5.1         stringr_1.5.1         dplyr_1.1.4          
#> [10] tidyr_1.3.1           ProteinGymR_1.1.0     BiocStyle_2.35.0     
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.3               rlang_1.1.4             magrittr_2.0.3         
#>  [4] clue_0.3-65             GetoptLong_1.0.5        matrixStats_1.4.1      
#>  [7] compiler_4.5.0          RSQLite_2.3.7           png_0.1-8              
#> [10] vctrs_0.6.5             pkgconfig_2.0.3         shape_1.4.6.1          
#> [13] crayon_1.5.3            fastmap_1.2.0           magick_2.8.5           
#> [16] XVector_0.47.0          labeling_0.4.3          utf8_1.2.4             
#> [19] rmarkdown_2.29          UCSC.utils_1.3.0        purrr_1.0.2            
#> [22] bit_4.5.0               xfun_0.49               zlibbioc_1.53.0        
#> [25] cachem_1.1.0            queryup_1.0.5           GenomeInfoDb_1.43.0    
#> [28] jsonlite_1.8.9          gghalves_0.1.4          blob_1.2.4             
#> [31] highr_0.11              parallel_4.5.0          spdl_0.0.5             
#> [34] cluster_2.1.6           R6_2.5.1                bslib_0.8.0            
#> [37] stringi_1.8.4           RColorBrewer_1.1-3      jquerylib_0.1.4        
#> [40] Rcpp_1.0.13-1           bookdown_0.41           iterators_1.0.14       
#> [43] knitr_1.48              IRanges_2.41.0          tidyselect_1.2.1       
#> [46] yaml_2.3.10             doParallel_1.0.17       codetools_0.2-20       
#> [49] curl_5.2.3              tibble_3.2.1            Biobase_2.67.0         
#> [52] withr_3.0.2             KEGGREST_1.47.0         evaluate_1.0.1         
#> [55] ggdist_3.3.2            circlize_0.4.16         ExperimentHub_2.15.0   
#> [58] Biostrings_2.75.0       pillar_1.9.0            BiocManager_1.30.25    
#> [61] filelock_1.0.3          foreach_1.5.2           stats4_4.5.0           
#> [64] distributional_0.5.0    BiocVersion_3.21.1      S4Vectors_0.45.0       
#> [67] munsell_0.5.1           scales_1.3.0            glue_1.8.0             
#> [70] tools_4.5.0             forcats_1.0.0           Cairo_1.6-2            
#> [73] AnnotationDbi_1.69.0    colorspace_2.1-1        GenomeInfoDbData_1.2.13
#> [76] RcppSpdlog_0.0.18       cli_3.6.3               rappdirs_0.3.3         
#> [79] fansi_1.0.6             gtable_0.3.6            sass_0.4.9             
#> [82] digest_0.6.37           farver_2.1.2            rjson_0.2.23           
#> [85] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4        
#> [88] httr_1.4.7              GlobalOptions_0.1.2     mime_0.12              
#> [91] bit64_4.5.2