BiocNeighbors 1.21.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 5655 9916 4555 3712 2675 9291 6037 6688 9474 750
## [2,] 8350 5888 3830 1454 5089 120 6322 147 1312 3886
## [3,] 7696 4626 2812 8812 2878 3097 3047 2380 7698 2138
## [4,] 9906 2287 7837 1995 8123 3453 3068 7430 4105 1891
## [5,] 5568 4989 910 7800 6352 6174 1011 4946 9970 5361
## [6,] 4891 1632 1424 256 1039 5672 4383 4469 6530 4143
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9037439 0.9392462 0.9887332 1.0066178 1.0107894 1.0232581 1.0383744
## [2,] 0.6992599 1.0185457 1.0731937 1.0732443 1.1061745 1.1223277 1.1243474
## [3,] 0.7929197 0.7929567 0.8181315 0.8381198 0.9228263 0.9257279 0.9440160
## [4,] 0.7267991 0.8266458 0.8371309 0.8453515 0.9267266 0.9294744 0.9409822
## [5,] 0.8623570 0.8835005 0.9797719 1.0147164 1.0199069 1.0209705 1.0225179
## [6,] 0.9316973 0.9694417 1.0084006 1.0130004 1.0197872 1.0205606 1.0280993
## [,8] [,9] [,10]
## [1,] 1.0431373 1.0459933 1.0463119
## [2,] 1.1425595 1.1468573 1.1497592
## [3,] 0.9635164 0.9746322 0.9819571
## [4,] 0.9449583 0.9553596 0.9636202
## [5,] 1.0348885 1.0564480 1.0837294
## [6,] 1.0460012 1.0558894 1.0642107
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 6493 8867 1024 2998 381
## [2,] 9393 2339 7358 5196 1545
## [3,] 2362 7780 6281 1795 4394
## [4,] 4229 952 4429 3041 616
## [5,] 5584 8957 2743 5041 6433
## [6,] 3217 7707 7083 8820 3256
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8128672 0.8578821 0.8705066 0.9702455 0.9925223
## [2,] 0.8412158 0.8695172 0.8758276 0.9138339 0.9140742
## [3,] 0.9183341 0.9336022 0.9697371 0.9736194 0.9827428
## [4,] 0.8936045 1.0226010 1.0341974 1.0551001 1.0600582
## [5,] 0.8838007 0.9886907 0.9888728 1.0567856 1.0998604
## [6,] 0.7503693 0.9141562 0.9623226 0.9693999 1.1943313
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpVrDY2V/filec27bc35ba0f89.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R Under development (unstable) (2023-11-11 r85510)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.21.2 knitr_1.45 BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.2 xfun_0.41
## [4] jsonlite_1.8.8 S4Vectors_0.41.2 htmltools_0.5.7
## [7] stats4_4.4.0 sass_0.4.8 rmarkdown_2.25
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.37 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-19 Rcpp_1.0.11 BiocParallel_1.37.0
## [22] lattice_0.22-5 digest_0.6.33 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.6.1 Matrix_1.6-4
## [28] tools_4.4.0 BiocGenerics_0.49.1 cachem_1.0.8