1 Introduction

The PhyloProfileData package contains two experimental datasets that illustrate how to run and analyse phylogenetic profiles with the PhyloProfile package (Tran et al. 2018).

library(ExperimentHub)
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
eh <- ExperimentHub()
myData <- query(eh, "PhyloProfileData")
myData
## ExperimentHub with 6 records
## # snapshotDate(): 2023-10-24
## # $dataprovider: Applied Bioinformatics Dept., Goethe University Frankfurt
## # $species: NA
## # $rdataclass: data.frame, AAStringSet
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH2544"]]' 
## 
##            title                                                              
##   EH2544 | Phylogenetic profiles of human AMPK-TOR pathway                    
##   EH2545 | FASTA sequences for proteins in the phylogenetic profiles of hum...
##   EH2546 | Domain annotations for proteins in the phylogenetic profiles of ...
##   EH2547 | Phylogenetic profiles of BUSCO arthropoda proteins                 
##   EH2548 | FASTA sequences for proteins in the phylogenetic profiles of BUS...
##   EH2549 | Domain annotations for proteins in the phylogenetic profiles of ...
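
The six records form two parallel triples: a phylogenetic profile, a FASTA file and a domain annotation table per dataset. As a minimal sketch, the record IDs can be grouped per dataset and fetched in one call. The names `recordIds`, `loadRecords` and `mockHub` are illustrative and not part of ExperimentHub or PhyloProfileData; the helper only assumes that its first argument supports `[[` with a record ID, as `myData` does above.

```r
# Record IDs as listed in the query output above, grouped by dataset.
recordIds <- list(
    ampkTor = c(profile = "EH2544", fasta = "EH2545", domain = "EH2546"),
    arthropoda = c(profile = "EH2547", fasta = "EH2548", domain = "EH2549")
)

# Hypothetical helper: fetch every record of one dataset from a hub-like
# object, i.e. anything that supports `hub[[id]]`.
loadRecords <- function(hub, ids) {
    lapply(ids, function(id) hub[[id]])
}

# With a live hub this would be:
#   ampkTor <- loadRecords(myData, recordIds$ampkTor)
# Offline stand-in (a plain named list) to show the shape of the result:
mockHub <- list(EH2544 = "profile", EH2545 = "fasta", EH2546 = "domain")
ampkTor <- loadRecords(mockHub, recordIds$ampkTor)
names(ampkTor)
```

Keeping the IDs in a named vector means the returned list is addressed by role (`ampkTor$profile`) rather than by accession number.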

2 Phylogenetic profiles of AMPK-TOR pathway

The phylogenetic profiles of 147 human proteins in the AMPK-TOR pathway across 83 species from the three domains of life were taken from Roustan et al. (2016).

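This dataset includes 3 files: the phylogenetic profiles (EH2544), the FASTA sequences of the proteins in those profiles (EH2545), and their domain annotations (EH2546):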

ampkTorPhyloProfile <- myData[["EH2544"]]
head(ampkTorPhyloProfile)
##       geneID     ncbiID                                orthoID     FAS_F
## 1 ampk_ACACA ncbi284812     ampk_ACACA|SCHPO@284812@1|P78820|1 0.9884601
## 2 ampk_ACACA ncbi665079     ampk_ACACA|SCLS1@665079@1|A7EM01|1 0.9905497
## 3 ampk_ACACA  ncbi35128      ampk_ACACA|THAPS@35128@1|B5YMF5|0 0.9058650
## 4 ampk_ACACA  ncbi35128      ampk_ACACA|THAPS@35128@1|B8BVD1|1 0.9794378
## 5 ampk_ACACA   ncbi7070       ampk_ACACA|TRICA@7070@1|D2A5X8|1 0.9813494
## 6 ampk_ACACA ncbi237631 ampk_ACACA|USTMA@237631@1|A0A0D1DYD5|1 0.9770244
##       FAS_B
## 1 0.9907436
## 2 0.9906191
## 3 0.8169658
## 4 0.9359992
## 5 0.9843459
## 6 0.9456425
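
The `orthoID` values shown above pack several fields into one string, separated by `|` and `@`. The following sketch, based only on the rows printed above, splits them apart and checks that the taxon ID embedded in `orthoID` agrees with `ncbiID`. The column names `species`, `taxonId` and `proteinId` are labels chosen here for illustration; the meaning of the remaining fields is not stated in this document, so they are left unparsed.

```r
# First rows of the profile, copied from the output above.
profile <- data.frame(
    geneID = "ampk_ACACA",
    ncbiID = c("ncbi284812", "ncbi665079", "ncbi35128"),
    orthoID = c(
        "ampk_ACACA|SCHPO@284812@1|P78820|1",
        "ampk_ACACA|SCLS1@665079@1|A7EM01|1",
        "ampk_ACACA|THAPS@35128@1|B5YMF5|0"
    ),
    FAS_F = c(0.9884601, 0.9905497, 0.9058650),
    FAS_B = c(0.9907436, 0.9906191, 0.8169658),
    stringsAsFactors = FALSE
)

# Split each orthoID on the literal "|" separator.
parts <- strsplit(profile$orthoID, "|", fixed = TRUE)

# The second field looks like "SCHPO@284812@1"; split it on "@".
speciesField <- vapply(parts, function(p) p[2], character(1))
speciesParts <- strsplit(speciesField, "@", fixed = TRUE)

parsed <- data.frame(
    species = vapply(speciesParts, function(p) p[1], character(1)),
    taxonId = vapply(speciesParts, function(p) p[2], character(1)),
    proteinId = vapply(parts, function(p) p[3], character(1)),
    stringsAsFactors = FALSE
)

# The embedded taxon ID should equal ncbiID without its "ncbi" prefix.
parsed$matchesNcbiID <- parsed$taxonId == sub("^ncbi", "", profile$ncbiID)
parsed
```
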
ampkTorFasta <- myData[["EH2545"]]
head(ampkTorFasta)
## AAStringSet object of length 6:
##     width seq                                               names               
## [1]   297 VGYPVMLKASWGGGGKGIRKVSS...ALRDCVTVRGEIRTTTDYVLDLL ampk_ACACA|CHLRE@...
## [2]  2156 MLRTVKEYVAAYEGKRVIKRLLL...FATLLTYLDRQRIVRRGWFCFDS ampk_ACACA|MONBE@...
## [3]  2326 MPGHSTTGAAGETTPDTQDMVAQ...HLLSDKDREEAVAALRRGSIFHK ampk_ACACA|PHYRM@...
## [4]  2282 MIEINEYIKKLGGDKNIEKILIA...LLPFISTQQKEFLFESLKKDLNK ampk_ACACA|DICDI@...
## [5]  2168 MKAMQETSSPVGFRYDSMEQLCS...NPAIAKAAKVALDSSACAHSTAE ampk_ACACA|LEIMA@...
## [6]  3367 MINFFLSLLLFVLFFENLVVSIK...KIFKMLSQEQRTEFLNKINSYEN ampk_ACACA|PLAF7@...
ampkTorDomain <- myData[["EH2546"]]
head(ampkTorDomain)
##                                        seedID                          orthoID
## 1 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 2 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 3 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 4 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 5 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 6 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
##               feature start  end
## 1    pfam_CPSase_L_D2   260  462
## 2   pfam_ATPgrasp_Ter   121  352
## 3   pfam_ATPgrasp_Ter   367  469
## 4 pfam_Carboxyl_trans  1644 2198
## 5 smart_Biotin_carb_C   496  603
## 6  pfam_Biotin_carb_N   111  231
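
Each row of the domain table describes one annotated feature with its `start` and `end` position. As a small sketch on the rows printed above, the features can be ordered along the protein and their spans computed. The span formula assumes that both coordinates are 1-based and inclusive, which this document does not state; the column name `span` is introduced here for illustration.

```r
# Feature columns of the six rows shown above.
domains <- data.frame(
    feature = c("pfam_CPSase_L_D2", "pfam_ATPgrasp_Ter", "pfam_ATPgrasp_Ter",
                "pfam_Carboxyl_trans", "smart_Biotin_carb_C",
                "pfam_Biotin_carb_N"),
    start = c(260, 121, 367, 1644, 496, 111),
    end = c(462, 352, 469, 2198, 603, 231),
    stringsAsFactors = FALSE
)

# Assumption: coordinates are 1-based and inclusive, hence the "+ 1".
domains$span <- domains$end - domains$start + 1

# Order the features by their start position along the protein.
architecture <- domains[order(domains$start), ]
architecture
```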

3 Phylogenetic profiles of BUSCO Arthropoda proteins

One fundamental step in establishing phylogenetic profiles is searching for orthologs of the query proteins in the taxa of interest. HaMStR-oneseq, an extended version of HaMStR (Ebersberger et al. 2009), has been shown to be a promising approach for sensitively predicting orthologs even in taxa that are distantly related to the query species, which is required for phylogenetic profiling across a broad range of taxa from all domains of the tree of life. One main parameter of HaMStR-oneseq is the core ortholog group, the starting point for the orthology search. To set up a reliable core ortholog set that can be used for further phylogenetic profiling studies, we made use of the well-known BUSCO datasets (Simão et al. 2015). Here we present the phylogenetic profiles of 1011 ortholog groups across 88 species, which were calculated from the BUSCO arthropoda dataset downloaded from https://busco.ezlab.org/datasets/arthropoda_odb9.tar.gz in Jan. 2018. The 88 species comprise 10 arthropod species (Ladona fulva, Agrilus planipennis, Polypedilum vanderplanki, Daphnia magna, Harpegnathos saltator, Zootermopsis nevadensis, Halyomorpha halys, Heliconius melpomene, Stegodyphus mimosarum, Drosophila willistoni) downloaded from OrthoDB version 10 (https://www.orthodb.org) and 78 species from the Quest for Orthologs dataset (Altenhoff et al. 2016).

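This dataset includes 3 files: the phylogenetic profiles (EH2547), the FASTA sequences of the proteins in those profiles (EH2548), and their domain annotations (EH2549):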

arthropodaPhyloProfile <- myData[["EH2547"]]
head(arthropodaPhyloProfile)
##        geneID     ncbiID                                       orthoID
## 1 97421at6656   ncbi9598             97421at6656|PANTR@9598@1|H2QTF9|1
## 2 97421at6656 ncbi321614           97421at6656|PHANO@321614@1|Q0U682|1
## 3 97421at6656   ncbi3218             97421at6656|PHYPA@3218@1|A9TGR3|1
## 4 97421at6656 ncbi319348 97421at6656|POLVAN@319348@0|319348_0:000e70|1
## 5 97421at6656 ncbi208964           97421at6656|PSEAE@208964@1|Q9HXF1|1
## 6 97421at6656  ncbi10116              97421at6656|RAT@10116@1|D3ZAT9|1
##       FAS_F     FAS_B
## 1 0.6872810 0.9654661
## 2 0.7087412 0.9798884
## 3 0.7544057 0.8727715
## 4 0.8062524 0.9610529
## 5 0.7979757 0.9498075
## 6 0.7340443 0.9492033
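
The profile is stored in long format, one row per gene and taxon. A common next step is to turn it into a presence/absence matrix, optionally after filtering on one of the score columns. The sketch below does this in base R on the rows printed above. The cutoff `minFas` is a hypothetical value chosen only to make the filter visible; it is not a recommendation from the package authors.

```r
# Rows of the arthropoda profile, copied from the output above.
profile <- data.frame(
    geneID = "97421at6656",
    ncbiID = c("ncbi9598", "ncbi321614", "ncbi3218",
               "ncbi319348", "ncbi208964", "ncbi10116"),
    FAS_F = c(0.6872810, 0.7087412, 0.7544057,
              0.8062524, 0.7979757, 0.7340443),
    FAS_B = c(0.9654661, 0.9798884, 0.8727715,
              0.9610529, 0.9498075, 0.9492033),
    stringsAsFactors = FALSE
)

# Hypothetical cutoff on the FAS_F column.
minFas <- 0.75
kept <- profile[profile$FAS_F >= minFas, ]

# Cross-tabulate genes against taxa; a count above zero means that at
# least one ortholog passed the filter for that gene/taxon pair.
counts <- unclass(table(kept$geneID, kept$ncbiID))
presence <- counts > 0
presence
```

On the full data frame the same two lines yield a genes-by-taxa matrix; here it has a single row because all printed rows belong to one ortholog group.
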
arthropodaFasta <- myData[["EH2548"]]
head(arthropodaFasta)
## AAStringSet object of length 6:
##     width seq                                               names               
## [1]   484 MATSGAFAGGSPGRGFAPRGRAE...GISKLHQQLLYVDRLMLQLRDYA 42842at6656|MONBE...
## [2]   535 MSTRKQYACDLACRLVQDQYGDA...RVNEVMETSLAHLDQMIAVFNDF 42842at6656|CHLRE...
## [3]   607 MLCCLFGVQIKCALLKLLQHNVL...RSLDRLDRAIIHLDGMLMLYRDF 42842at6656|PHYRM...
## [4]   487 MHFSGFKSVVLSCVEEYFDTTAV...INDQIDLIEPIYIKLVETAMLLF 42842at6656|GIAIC...
## [5]   666 MYEQKVAIDIVKESFGDDVTKVF...RITQTLLTVILNLDNDLLHLYSF 42842at6656|DICDI...
## [6]   579 MNKARGTEVAGFITDAAHIRAAL...KGLDRLDFACLQLDETLMVLKDF 42842at6656|THAPS...
arthropodaDomain <- myData[["EH2549"]]
head(arthropodaDomain)
##                                                       seedID
## 1 136365at6656#136365at6656|AGRPL@224129@0|224129_0:000004|1
## 2 136365at6656#136365at6656|AGRPL@224129@0|224129_0:000004|1
## 3            136365at6656#136365at6656|ANOGA@7165@1|Q7QC64|1
## 4            136365at6656#136365at6656|ANOGA@7165@1|Q7QC64|1
## 5            136365at6656#136365at6656|ANOGA@7165@1|Q7QC64|1
## 6          136365at6656#136365at6656|AQUAE@224324@1|O67650|1
##                                         orthoID length
## 1 136365at6656|AGRPL@224129@0|224129_0:000004|1    142
## 2              136365at6656|DROME@7227@1|Q86BM8    138
## 3            136365at6656|ANOGA@7165@1|Q7QC64|1    142
## 4            136365at6656|ANOGA@7165@1|Q7QC64|1    142
## 5              136365at6656|DROME@7227@1|Q86BM8    138
## 6          136365at6656|AQUAE@224324@1|O67650|1     98
##                      feature start end weight path
## 1         pfam_Ribosomal_L27    26 106     NA    Y
## 2         pfam_Ribosomal_L27    22 104      1    Y
## 3         pfam_Ribosomal_L27    26 106     NA    Y
## 4 seg_low complexity regions    37  46     NA    Y
## 5         pfam_Ribosomal_L27    22 104      1    Y
## 6         pfam_Ribosomal_L27     2  80     NA    Y
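
In this table the `seedID` column joins two identifiers with `#`, the `feature` column carries a source prefix such as `pfam_` or `seg_`, and `weight` is `NA` on some rows. The following sketch, using three of the rows printed above, separates those pieces mechanically. The derived column names (`groupID`, `pairedID`, `isPairedProtein`, `source`, `hasWeight`) are illustrative only, and the sketch does not interpret what the weight or the `#` pairing means.

```r
# Three rows of the domain table, copied from the output above.
domains <- data.frame(
    seedID = c(
        "136365at6656#136365at6656|AGRPL@224129@0|224129_0:000004|1",
        "136365at6656#136365at6656|AGRPL@224129@0|224129_0:000004|1",
        "136365at6656#136365at6656|ANOGA@7165@1|Q7QC64|1"
    ),
    orthoID = c(
        "136365at6656|AGRPL@224129@0|224129_0:000004|1",
        "136365at6656|DROME@7227@1|Q86BM8",
        "136365at6656|ANOGA@7165@1|Q7QC64|1"
    ),
    feature = "pfam_Ribosomal_L27",
    weight = c(NA, 1, NA),
    stringsAsFactors = FALSE
)

# Split seedID on the literal "#" into its two halves.
halves <- strsplit(domains$seedID, "#", fixed = TRUE)
domains$groupID <- vapply(halves, function(p) p[1], character(1))
domains$pairedID <- vapply(halves, function(p) p[2], character(1))

# Does the annotated protein (orthoID) equal the ID after the "#"?
domains$isPairedProtein <- domains$orthoID == domains$pairedID

# Source prefix of the feature: everything before the first underscore.
domains$source <- sub("_.*$", "", domains$feature)

# Make missing weights explicit instead of comparing against NA.
domains$hasWeight <- !is.na(domains$weight)

domains[, c("groupID", "isPairedProtein", "source", "hasWeight")]
```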

4 References

  1. Armenteros, JJA. et al. (2019) SignalP 5.0 improves signal peptide predictions using deep neural networks. Nature Biotechnology, 37, 420–423.
  2. Altenhoff, AM. et al. (2016) Standardized benchmarking in the quest for orthologs. Nature Methods, 13, 425–430.
  3. Ebersberger, I. et al. (2009) HaMStR: profile hidden Markov model based search for orthologs in ESTs. BMC Evol Biol., 9, 157.
  4. Finn, RD. et al. (2014) Pfam: The protein families database. Nucleic Acids Res., 42, D222–D230.
  5. Koestler, T. et al. (2010) FACT: functional annotation transfer between proteins with similar feature architectures. BMC Bioinformatics, 11, 417.
  6. Kriventseva, EK. et al. (2018) OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Res., 47(D1), D807–D811.
  7. Krogh, A. et al. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol., 305(3), 567–580.
  8. Letunic, I. et al. (2012) SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res., 40, D302–D305.
  9. Lupas, A. et al. (1991) Predicting Coiled Coils from Protein Sequences. Science, 252, 1162–1164.
  10. Promponas, VJ. et al. (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics, 16(10), 915–922.
  11. Roustan, V. et al. (2016) An evolutionary perspective of AMPK–TOR signaling in the three domains of life. Journal of Experimental Botany, 67(13), 3897–3907.
  12. Simão, F. et al. (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31(19), 3210–3212.
  13. Tran, NV. et al. (2018) PhyloProfile: dynamic visualization and exploration of multi-layered phylogenetic profiles. Bioinformatics, 34(17), 3041–3043.
  14. Wootton, J. and Federhen, S. (1996) Analysis of compositionally biased regions in sequence databases. Methods in Enzymol., 266, 554–571.