Contents

1 Introduction

SynExtend is a package of tools for working with objects of class Synteny built from the package DECIPHER’s FindSynteny() function.

Synteny maps provide a powerful tool for quantifying and visualizing where pairs of genomes share order. Typically these maps are built from predictions of orthologous pairs, where groups of pairs that provide contiguous and sequential blocks in their respective genomes are deemed a ‘syntenic block’. That designation of synteny can then used to further interrogate the predicted orthologs themselves, or query topics like genomic rearrangements or ancestor genome reconstruction.

FindSynteny takes a different approach, finding exactly matched shared k-mers and determining where shared k-mers, or blocks of proximate shared k-mers are significant. Combining the information generated by FindSynteny with locations of genomic features allows us to simply mark where features are linked by syntenic k-mers. These linked features represent potential orthologous pairs, and can be easily evaluated on the basis of the k-mers that they share, or alignment.

2 Package Structure

Currently SynExtend contains one set of functions for performig orthology predictions, as well as a rearrangement estimation function that is currently under construction.

2.1 Installation

  1. Install the latest version of R using CRAN.
  2. Install SynExtend in R by running the following commands:
if (!requireNamespace("BiocManager",
                      quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("SynExtend")

3 Usage

Using the FindSynteny function in DECIPHER builds an object of class Synteny. In this tutorial, a prebuilt DECIPHER database is used. For database construction see ?Seqs2DB in DECIPHER. This example starts with a database containing three endosymbiont genomes that were chosen to keep package data to a minimum.

library(SynExtend)
## Loading required package: DECIPHER
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## 
## Attaching package: 'SynExtend'
## The following object is masked from 'package:stats':
## 
##     dendrapply
DBPATH <- system.file("extdata",
                      "Endosymbionts.sqlite",
                      package = "SynExtend")

Syn <- FindSynteny(dbFile = DBPATH)
## ================================================================================
## 
## Time difference of 11.13 secs

Synteny maps represent where genomes share order. Simply printing a synteny object to the console displays a gross level view of the data inside. Objects of class Synteny can also be plotted to provide clear visual representations of the data inside. The genomes used in this example are distantly related and fairly dissimilar.

Syn
##           1         2         3
## 1     1 seq 4.7% hits 8.8% hits
## 2 48 blocks    2 seqs  13% hits
## 3 37 blocks 52 blocks     1 seq
pairs(Syn)

Data present inside objects of class Synteny can also be accessed relatively easily. The object itself is functionally a matrix of lists, with data describing exactly matched k-mers present in the upper triangle, and data describing blocks of chained k-mers in the lower triangle. For more information see ?FindSynteny in the package DECIPHER.

print(head(Syn[[1, 2]]))
##      index1 index2 strand width start1  start2 frame1 frame2
## [1,]      1      1      0   225 429274 1502877      1      3
## [2,]      1      1      0   345 429532 1503129      1      3
## [3,]      1      1      0   951 429889 1503483      1      3
## [4,]      1      1      0   423 208659  424439      3      2
## [5,]      1      1      0   771 209097  424871      3      2
## [6,]      1      1      0   348 209880  425651      3      2
print(head(Syn[[2, 1]]))
##      index1 index2 strand score start1  start2   end1    end2 first_hit
## [1,]      1      1      0   891 429274 1502877 430839 1504433         1
## [2,]      1      1      0   680 208659  424439 210227  425998         4
## [3,]      1      1      0   598 514717  824540 515655  825478         7
## [4,]      1      1      0   471 519354 1811748 520670 1813070         8
## [5,]      1      1      0   466 521704  815578 522945  816867        12
## [6,]      1      1      0   374  49749 1357669  50564 1358484        16
##      last_hit
## [1,]        3
## [2,]        6
## [3,]        7
## [4,]       11
## [5,]       15
## [6,]       16

The above printed objects show the data for the comparison between the first and second genome in our database.

To take advantage of these synteny maps, we can then overlay the gene calls for each genome present on top of our map.

Next, GFF annotations for the associated genomes are parsed to provide gene calls in a use-able format. GFFs are not the only possible source of appropriate gene calls, but they are the source that was used during package construction and testing. Parsed GFFs can be constructed with gffToDataFrame, for full functionality, or GFFs can be imported via rtracklater::import() for limited functionality. GeneCalls for both the PairSummaries and NucleotideOverlap functions must be named list, and those names must match dimnames(Syn)[[1]].

# generating genecalls with local data:
GC <- gffToDataFrame(GFF = system.file("extdata",
                                       "GCF_021065005.1_ASM2106500v1_genomic.gff.gz",
                                       package = "SynExtend"),
                     Verbose = TRUE)
## ================================================================================
## Time difference of 5.651052 secs
# in an effort to be space conscious, not all original gffs are kept within this package
GeneCalls <- get(data("Endosymbionts_GeneCalls", package = "SynExtend"))

SynExtend’s gffToDataFrame function will directly import gff files into a usable format, and includes other extracted information.

print(head(GeneCalls[[1]]))
## DataFrame with 6 rows and 11 columns
##       Index    Strand     Start      Stop        Type                 ID
##   <integer> <integer> <integer> <integer> <character>        <character>
## 1         1         0         1      1362        gene gene-Gromo_RS00005
## 2         1         0      1652      2740        gene gene-Gromo_RS00010
## 3         1         0      2745      3905        gene gene-Gromo_RS00015
## 4         1         0      4016      6400        gene gene-Gromo_RS00020
## 5         1         0      6496      8598        gene gene-Gromo_RS00025
## 6         1         0      8722      9822        gene gene-Gromo_RS00030
##           Range                Product    Coding Translation_Table
##   <IRangesList>            <character> <logical>       <character>
## 1        1-1362 chromosomal replicat..      TRUE                11
## 2     1652-2740 DNA polymerase III s..      TRUE                11
## 3     2745-3905      AAA family ATPase      TRUE                11
## 4     4016-6400 DNA topoisomerase (A..      TRUE                11
## 5     6496-8598   hypothetical protein      TRUE                11
## 6     8722-9822        MFS transporter      TRUE                11
##          Contig
##     <character>
## 1 NZ_CP069247.1
## 2 NZ_CP069247.1
## 3 NZ_CP069247.1
## 4 NZ_CP069247.1
## 5 NZ_CP069247.1
## 6 NZ_CP069247.1

Raw GFF imports are also acceptable, but prevent alignments in amino acid space with PairSummaries().

X01 <- rtracklayer::import(system.file("extdata",
                                       "GCA_000875775.1_ASM87577v1_genomic.gff.gz",
                                       package = "SynExtend"))
class(X01)
print(X01)

SynExtend’s primary functions provide a way to identify where pairs of genes are explicitly linked by syntenic hits, and then summarize those links. The first step is just identifying those links.

Links <- NucleotideOverlap(SyntenyObject = Syn,
                           GeneCalls = GeneCalls,
                           Verbose = TRUE)
## 
## Reconciling genecalls.
## ================================================================================
## Finding connected features.
## ================================================================================
## Time difference of 0.1929841 secs

The Links object generated by NucleotideOverlap is a raw representation of positions on the synteny map where shared k-mers link genes between paired genomes. As such, it is analagous in shape to objects of class Synteny. This raw object is unlikely to be useful to most users, but has been left exposed to ensure that this data remains accessible should a user desire to have access to it.

class(Links)
## [1] "LinkedPairs"
print(Links)
##           1         2        3
## 1  1 Contig  41 Pairs 34 Pairs
## 2  99 Kmers 2 Contigs 48 Pairs
## 3 129 Kmers 121 Kmers 1 Contig

This raw data can be processed to provide a straightforward summary of predicted pairs.

LinkedPairs1 <- PairSummaries(SyntenyLinks = Links,
                              DBPATH = DBPATH,
                              PIDs = FALSE,
                              Verbose = TRUE)
## 
## Preparing overhead data.
## Overhead complete.
## Collecting pairs.
## ================================================================================
## Time difference of 1.608255 secs

The object LinkedPairs1 is a data.frame where each row is populated by information about a predicted orthologous pair. By default PairSummaries uses a simple model to determine whether the k-mers that link a pair of genes are likely to provide an erroneous link. When set to Model = "Global", is is simply a prediction of whether the involved nucleotides are likely to describe a pair of genomic features whose alignment would result in a PID that falls within a random distribution. This model is effective if somewhat permissive, but is significantly faster than performing many pairwise alignments.

print(head(LinkedPairs1))
##       p1       p2 ExactMatch Consensus TotalKmers MaxKmer p1FeatureLength
## 1  1_1_4 2_1_1146       1071 0.9926940          6     234            2385
## 2 1_1_11 2_1_1708        351 0.9793716          2     240            1161
## 3 1_1_23  2_1_156        588 0.9960786          1     588            1068
## 4 1_1_39 2_1_1491        336 0.9242528          1     336            1476
## 5 1_1_40 2_1_1492        816 0.9932482          1     816            2385
## 6 1_1_43  2_1_751        270 0.9956627          1     270            1275
##   p2FeatureLength Adjacent    TetDist PIDType PredictedPID
## 1            2484        0 0.03564127      AA    0.8052259
## 2            1071        0 0.05678748      AA    0.8158521
## 3            1053        0 0.05201223      AA    0.9922821
## 4            1278        1 0.04403144      AA    0.7227297
## 5            2448        1 0.04585398      AA    0.8955756
## 6            1245        0 0.04466403      AA    0.8482546

PairSummaries includes arguments that allow for aligning all pairs that are predicted, via PIDs = TRUE, while IgnoreDefaultStringSet = FALSE indicates that alignments should be performed in nucleotide or amino acid space as is appropriate for the linked sequences. Setting IgnoreDefaultStringSet = TRUE will force all alignments into nucleotide space.

As of SynExtend v 1.3.13, the functions ExtractBy and DisjointSet have been added to provide users with direct tools to work with PairSummaries objects.

SingleLinkageClusters <- DisjointSet(Pairs = LinkedPairs1,
                                     Verbose = TRUE)
## 
## Assigning initial root:
## ================================================================================
## 
## Time difference of 0.001984596 secs
## 
## Assigning final root:
## ================================================================================
## 
## Time difference of 0.001736879 secs
## 
## Assigning single linkage clusters.
## Assignments complete.
## 
## Time difference of 0.006320953 secs
# extract the first 10 clusters
Sets <- ExtractBy(x = LinkedPairs1,
                  y = DBPATH,
                  z = SingleLinkageClusters[1:10],
                  Verbose = TRUE)
## 
## Extracting Sequences:
## ================================================================================
## 
## Arranging Sequences:
## ================================================================================
## 
## Time difference of 0.2042353 secs
head(Sets)
## [[1]]
## DNAStringSet object of length 3:
##     width seq                                               names               
## [1]  2385 ATGGATTCAGACGATATTAAGAA...ATGTGAGAAATTTAGATGTATAA 1_1_4
## [2]  2484 ATGACTACAATGACCAAAGAAGA...CGGTTAAGAATTTGGATATTTAA 2_1_1146
## [3]  2385 ATGATTAATTATGATTCTTCGAA...CAAATTTTGACTTAGATTTTTAA 3_1_28
## 
## [[2]]
## DNAStringSet object of length 3:
##     width seq                                               names               
## [1]  1161 GTGGATTTTGATAGATTCGATAT...GTTGGATTTCAAATTCTTGGTGA 1_1_11
## [2]  1071 ATGAAAAAAACGGTTGTTGTGGG...TGATCGAGCCAGAAGCCCCCTAA 2_1_1708
## [3]  1083 ATGTTTAGTATTTTTTATTATAT...AAATATTTAAAATAATAAAATAA 3_1_197
## 
## [[3]]
## DNAStringSet object of length 2:
##     width seq                                               names               
## [1]  1068 TTGAATAAGAATCAAAATGATGA...AAGATGTACTTGAAGATGTATAA 1_1_23
## [2]  1053 ATGGCTCAACAAAATGATGCAGG...TAAAAGAAGAGGCCTTGGTTTAA 2_1_156
## 
## [[4]]
## DNAStringSet object of length 6:
##     width seq                                               names               
## [1]  1476 ATGTTAAATATGAAAATTGCTGT...GACTTGGATGGTTCGATGAGTAG 1_1_39
## [2]  2385 ATGTGCGCAAGAGACTTGGATGG...GCTTTAAATTGGAGAAGATATAA 1_1_40
## [3]  1278 ATGAACAAAGATCTAGTCGCAAT...AAATTCGCAAAAAAAGAGGTTAA 2_1_1491
## [4]  2448 TTGGCCAAGAATTTAAAAATAAA...CCTATTTAGACCAAGAGCTATAA 2_1_1492
## [5]  1500 ATGTCGTATAAAGAAAAAAAAAT...TTAAATCATATATTCAAGAATGA 3_1_100
## [6]  1029 ATGAATAAAAAAATACCGTTAAA...ATATAAATGAATATTTAAAATAA 3_1_101
## 
## [[5]]
## DNAStringSet object of length 4:
##     width seq                                               names               
## [1]  1275 ATGTCCGATGTATTGCTATCAAT...GTATGCGTGGAGAAATAACTTGA 1_1_43
## [2]  1245 ATGAACCAGAAAAAAAAATTTAA...GTCTCTTGGGAAGGATTGTATGA 2_1_751
## [3]  1347 ATGCGTAAACAACAAGTAAAAAC...GTCTTAAGAGTTTTAGCCATTAA 2_1_769
## [4]  1317 ATGAAACAAGTCTATATAAAAAC...TATATGGTAAAATTTTTTTTTAA 3_1_165
## 
## [[6]]
## DNAStringSet object of length 3:
##     width seq                                               names               
## [1]  1083 ATGAGTAAACTTGAATTTGAAAA...AAATGTCTGCAATAGATGAATAA 1_1_91
## [2]  1083 ATGCAAGAGCAGATAGGTAAGCT...TTGTTGGAAATCAATTTTCATGA 2_1_795
## [3]  1050 ATGGATTTAAAAATTTCATCTCA...TATTAGATCTTTTTTGTAAGTGA 3_1_95

Session Info:

sessionInfo()
## R Under development (unstable) (2024-01-16 r85808)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] SynExtend_1.15.2    DECIPHER_2.31.3     Biostrings_2.71.2  
## [4] GenomeInfoDb_1.39.7 XVector_0.43.1      IRanges_2.37.1     
## [7] S4Vectors_0.41.3    BiocGenerics_0.49.1 BiocStyle_2.31.0   
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.0.5               jsonlite_1.8.8          compiler_4.4.0         
##  [4] BiocManager_1.30.22     highr_0.10              crayon_1.5.2           
##  [7] Rcpp_1.0.12             blob_1.2.4              magick_2.8.3           
## [10] parallel_4.4.0          jquerylib_0.1.4         yaml_2.3.8             
## [13] fastmap_1.1.1           R6_2.5.1                knitr_1.45             
## [16] bookdown_0.37           GenomeInfoDbData_1.2.11 DBI_1.2.2              
## [19] bslib_0.6.1             rlang_1.1.3             cachem_1.0.8           
## [22] xfun_0.42               sass_0.4.8              bit64_4.0.5            
## [25] RSQLite_2.3.5           memoise_2.0.1           cli_3.6.2              
## [28] magrittr_2.0.3          zlibbioc_1.49.0         digest_0.6.34          
## [31] lifecycle_1.0.4         vctrs_0.6.5             evaluate_0.23          
## [34] rmarkdown_2.25          tools_4.4.0             pkgconfig_2.0.3        
## [37] htmltools_0.5.7