Organization of files on a local machine can be cumbersome. This is especially true for local copies of remote resources that may periodically require a new download to have the most updated information available. BiocFileCache is designed to help manage local and remote resource files stored locally. It provides a convenient location to organize files and once added to the cache management, the package provides functions to determine if remote resources are out of date and require a new download.
BiocFileCache is a Bioconductor package and can be installed through
BiocManager::install().
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager")
BiocManager::install("BiocFileCache", dependencies=TRUE)
After the package is installed, it can be loaded into the R workspace with
library(BiocFileCache)
The first step in using BiocFileCache to manage files is to create a cache
object specifying a location. We will create a temporary directory for use with
the examples in this vignette. If a path is not specified upon creation, the
default location is a directory ~/.BiocFileCache in the typical user cache
directory as defined by tools::R_user_dir("", which="cache").
path <- tempfile()
bfc <- BiocFileCache(path, ask = FALSE)
If the path location exists and has previously been used to store files, the existing cache will be loaded along with any files saved to it. If the path location does not exist, the user will be prompted to create the new directory. If the session is not interactive (so the user cannot be prompted), or the user declines to create the directory, a temporary directory will be used.
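The reloading behavior described above can be sketched as follows. This is a minimal illustration, assuming a throwaway directory created with tempfile(); the names cache_dir, first, second, and "demo-object" are hypothetical and chosen only for this example.

```r
library(BiocFileCache)

# Hypothetical throwaway location; any writable directory would do.
cache_dir <- tempfile("bfc-reopen-")

# The first call creates the cache; ask = FALSE skips the interactive prompt.
first <- BiocFileCache(cache_dir, ask = FALSE)
saveRDS(1:3, bfcnew(first, "demo-object", ext = ".rds"))

# Constructing a cache on the same path loads the existing cache,
# including the resource tracked above.
second <- BiocFileCache(cache_dir, ask = FALSE)
length(second)
bfcinfo(second)$rname
```

Because the cache state lives on disk rather than in the R object, the second object sees the entry added through the first.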
Some utility functions to examine the cache are:
bfccache(bfc)
length(bfc)
show(bfc)
bfcinfo(bfc)
bfccache() will show the cache path. NOTE: Because we are using temporary
directories, your path location will be different from the one shown.
bfccache(bfc)
## [1] "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9"
length(bfc)
## [1] 0
length() on a BiocFileCache will show the number of files currently being
tracked by the BiocFileCache. For more detailed information on what is stored
in the BiocFileCache object, there is a show method that displays the object
class, cache path, and number of items currently being tracked.
bfc
## class: BiocFileCache
## bfccache: /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9
## bfccount: 0
## For more information see: bfcinfo() or bfcquery()
bfcinfo() will list a table of the BiocFileCache resource files being tracked
in the cache. It returns a tibble that can be manipulated with dplyr methods.
bfcinfo(bfc)
## # A tibble: 0 × 10
## # ℹ 10 variables: rid <chr>, rname <chr>, create_time <dbl>, access_time <dbl>,
## # rpath <chr>, rtype <chr>, fpath <chr>, last_modified_time <dbl>,
## # etag <chr>, expires <dbl>
The table of resource files includes the following information:
rid: resource id. Autogenerated. This is a unique identifier automatically generated when a resource is added to the cache.
rname: resource name. This is given by the user when a resource is added to the cache. It does not have to be unique and can be updated at any time. We recommend descriptive keywords and identifiers.
create_time: The date and time a resource is added to the cache.
access_time: The date and time a resource is utilized within the cache. The access time is updated when the resource is updated or downloaded.
rpath: resource path. This is the path to the local file.
rtype: resource type. Either “local” or “web”, indicating if the resource has a remote origin.
fpath: If rtype is “web”, this is the link to the remote resource. It will be utilized to download the remote data.
last_modified_time: For a remote resource, the last_modified (if available) information for the local copy of the data. This information is checked against the remote resource to determine if the local copy is stale and needs to be updated. If it is not available or your resource is not a remote resource, the last modified time will be marked as NA.
etag: For a remote resource, the etag (if available) information for the local copy of the data. This information is checked against the remote resource to determine if the local copy is stale and needs to be updated. If it is not available or your resource is not a remote resource, the etag will be marked as NA.
expires: For a remote resource, the expires (if available) information for the local copy of the data. This information is checked against Sys.time to determine if the local copy needs to be updated. If it is not available or your resource is not a remote resource, expires will be marked as NA.
Now that we have created the cache object and location, let’s explore adding files that the cache will manage!
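As a small illustration of the columns described above, the following sketch adds one local file to a throwaway cache and inspects a few fields. The object names (bfc_demo, local_file, info) and the rname are hypothetical, and converting to a plain data.frame is just one convenient way to pick columns.

```r
library(BiocFileCache)

# Hypothetical throwaway cache for this illustration.
bfc_demo <- BiocFileCache(tempfile("bfc-info-"), ask = FALSE)

# A small local file to track.
local_file <- tempfile(fileext = ".txt")
writeLines("example", local_file)
bfcadd(bfc_demo, "example text file", local_file)

info <- bfcinfo(bfc_demo)

# Pick out the identifying columns.
as.data.frame(info)[, c("rid", "rname", "rtype")]

# A resource without a remote origin has no etag, so it is NA.
is.na(info$etag)
```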
Now that a BiocFileCache object and cache location have been created, files can
be added to the cache for tracking. There are two functions to add a resource to
the cache:
bfcnew()
bfcadd()
The difference between the two: bfcnew() creates an entry for a resource and
returns a file path to save to. As there are many types of data that can be
saved in many different ways, bfcnew() allows you to save any R data object in
the appropriate manner and still be able to track the saved file. bfcadd()
should be used when a file already exists or a remote resource is being
accessed.
bfcnew() takes the BiocFileCache object and a user-specified rname and returns
a path location to save data to. Optionally, you can add the file extension if
you know the type of file that will be saved:
savepath <- bfcnew(bfc, "NewResource", ext=".RData")
savepath
## BFC1
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c5041493e_3a0e4c5041493e.RData"
## now we can use that path in any save function
m = matrix(1:12, nrow=3)
save(m, file=savepath)
## and that file will be tracked in the cache
bfcinfo(bfc)
## # A tibble: 1 × 10
## rid rname create_time access_time rpath rtype fpath last_modified_time etag
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 BFC1 NewR… 2024-05-01… 2024-05-01… /tmp… rela… 3a0e… NA <NA>
## # ℹ 1 more variable: expires <dbl>
bfcadd() is for existing files or remote resources. The user will still specify
an rname of their choosing but must also specify a path to a local file or web
resource as fpath. If no fpath is given, the default is to assume the rname is
also the path location. If the fpath is a local file, the action argument
determines how it is handled: copy copies the existing file into the cache
directory, move moves the existing file into the cache directory, and asis
leaves the file wherever it is on the local system while still tracking it
through the cache object. copy and move will rename the file to the generated
cache file path. If the fpath is a remote source, a download will be attempted;
if it succeeds, the file is saved in the cache location and tracked in the
cache object, and the original source is recorded in the cache information as
fpath. If the user does not want the remote resource to be downloaded
initially, the argument download=FALSE may be used to add the resource to the
cache but delay the download. Relative path locations may also be used,
specified with rtype = "relative". This will store a relative location for the
file within the cache; only the actions copy and move are available for
relative paths.
First let’s use local files:
fl1 <- tempfile(); file.create(fl1)
## [1] TRUE
add2 <- bfcadd(bfc, "Test_addCopy", fl1) # copy
# returns filepath being tracked in cache
add2
## BFC2
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c4ed30f80_file3a0e4c1e74ac0f"
# the name is the unique rid in the cache
rid2 <- names(add2)
fl2 <- tempfile(); file.create(fl2)
## [1] TRUE
add3 <- bfcadd(bfc, "Test2_addMove", fl2, action="move") # move
rid3 <- names(add3)
fl3 <- tempfile(); file.create(fl3)
## [1] TRUE
add4 <- bfcadd(bfc, "Test3_addAsis", fl3, rtype="local",
action="asis") # reference
rid4 <- names(add4)
file.exists(fl1) # TRUE - copied from original location
## [1] TRUE
file.exists(fl2) # FALSE - moved from original location
## [1] FALSE
file.exists(fl3) # TRUE - left asis, original location tracked
## [1] TRUE
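The rtype = "relative" option described earlier is not exercised by the calls above, so here is a minimal sketch of it. It assumes a separate throwaway cache; the names bfc_rel, src, and rel_path and the rname are hypothetical.

```r
library(BiocFileCache)

# Hypothetical throwaway cache for this illustration.
bfc_rel <- BiocFileCache(tempfile("bfc-relative-"), ask = FALSE)

# A small local file to add.
src <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), src, row.names = FALSE)

# Store the resource relative to the cache directory; per the text,
# only the actions "copy" and "move" are available for relative paths.
rel_path <- bfcadd(bfc_rel, "relative example", src,
                   rtype = "relative", action = "copy")

# The returned path points at the copy inside the cache,
# and the recorded resource type is "relative".
file.exists(rel_path)
bfcinfo(bfc_rel)$rtype
```

Storing a relative location is what allows the cache directory to be moved or shared without the recorded paths going stale.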
Now let’s add some examples with remote sources:
url <- "http://httpbin.org/get"
add5 <- bfcadd(bfc, "TestWeb", fpath=url)
rid5 <- names(add5)
url2<- "https://en.wikipedia.org/wiki/Bioconductor"
add6 <- bfcadd(bfc, "TestWeb", fpath=url2)
rid6 <- names(add6)
# add a remote resource but don't initially download
add7 <- bfcadd(bfc, "TestNoDweb", fpath=url2, download=FALSE)
rid7 <- names(add7)
# let's look at our BiocFileCache object now
bfc
## class: BiocFileCache
## bfccache: /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9
## bfccount: 7
## For more information see: bfcinfo() or bfcquery()
bfcinfo(bfc)
## # A tibble: 7 × 10
## rid rname create_time access_time rpath rtype fpath last_modified_time etag
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BFC1 NewR… 2024-05-01… 2024-05-01… /tmp… rela… 3a0e… <NA> <NA>
## 2 BFC2 Test… 2024-05-01… 2024-05-01… /tmp… rela… /tmp… <NA> <NA>
## 3 BFC3 Test… 2024-05-01… 2024-05-01… /tmp… rela… /tmp… <NA> <NA>
## 4 BFC4 Test… 2024-05-01… 2024-05-01… /tmp… local /tmp… <NA> <NA>
## 5 BFC5 Test… 2024-05-01… 2024-05-01… /tmp… web http… <NA> <NA>
## 6 BFC6 Test… 2024-05-01… 2024-05-01… /tmp… web http… 2024-05-01 18:43:… <NA>
## 7 BFC7 Test… 2024-05-01… 2024-05-01… /tmp… web http… <NA> <NA>
## # ℹ 1 more variable: expires <chr>
Now that we are tracking resources, let’s explore accessing their information!
By default, files will have a unique identifier added to the start of the
original file name (identifier_originalName) when added to the cache, to allow
for multiple versions of the same file name. This default behavior can be
overridden with the fname argument of bfcadd() or bfcnew(). fname takes one of
two options: unique or exact. The unique option is the default and adds a
unique identifier to the original file name. The exact option does not add a
unique identifier, so the file is stored under exactly the original file name.
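A minimal sketch of the two fname options follows. It assumes a version of BiocFileCache recent enough to provide the fname argument described above; the cache object, file name, and rnames are hypothetical.

```r
library(BiocFileCache)

# Hypothetical throwaway cache for this illustration.
bfc_fn <- BiocFileCache(tempfile("bfc-fname-"), ask = FALSE)

# A local file with a recognizable name.
src <- file.path(tempdir(), "counts_table.txt")
writeLines("gene\tcount", src)

# Default behavior: identifier_originalName
unique_path <- bfcadd(bfc_fn, "counts (unique)", src)

# fname = "exact": keep the original file name unchanged
exact_path <- bfcadd(bfc_fn, "counts (exact)", src, fname = "exact")

basename(unique_path)
basename(exact_path)
```

With fname = "exact", two resources with the same original file name cannot both live in the cache directory, which is the trade-off for the more readable name.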
Before we explore individual resources, we introduce a helper function. Most of
the functions provided require the unique rid(s) assigned to a resource.
bfcadd() and bfcnew() return the path as a named character vector whose name is
the rid. However, you may want to access a resource that you added some time
ago.
bfcquery()
bfcquery() will take a keyword and search across the rname, rpath, and fpath
for any matching entries. The columns that are searched can be controlled with
the field argument.
bfcquery(bfc, "Web")
## # A tibble: 2 × 10
## rid rname create_time access_time rpath rtype fpath last_modified_time etag
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BFC5 Test… 2024-05-01… 2024-05-01… /tmp… web http… <NA> <NA>
## 2 BFC6 Test… 2024-05-01… 2024-05-01… /tmp… web http… 2024-05-01 18:43:… <NA>
## # ℹ 1 more variable: expires <chr>
bfcquery(bfc, "copy")
## # A tibble: 0 × 10
## # ℹ 10 variables: rid <chr>, rname <chr>, create_time <dbl>, access_time <dbl>,
## # rpath <chr>, rtype <chr>, fpath <chr>, last_modified_time <dbl>,
## # etag <chr>, expires <dbl>
q1 <- bfcquery(bfc, "wiki")
q1
## # A tibble: 2 × 10
## rid rname create_time access_time rpath rtype fpath last_modified_time etag
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BFC6 Test… 2024-05-01… 2024-05-01… /tmp… web http… 2024-05-01 18:43:… <NA>
## 2 BFC7 Test… 2024-05-01… 2024-05-01… /tmp… web http… <NA> <NA>
## # ℹ 1 more variable: expires <chr>
class(q1)
## [1] "tbl_bfc" "tbl_bfc" "tbl_df" "tbl" "data.frame"
As you can see above, bfcquery() returns a table (here of class tbl_bfc, which
extends tbl_df) that can be investigated further with methods for these
classes, such as those in the dplyr package. The rid can be seen in the first
column of the table and used in other functions. To get a quick count of how
many objects in the cache matched the query, use bfccount().
bfccount(q1)
## [1] 2
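The field argument mentioned earlier restricts which columns are searched. The following sketch uses a separate throwaway cache with two hypothetical resources (the names bfc_q, f1, f2 and the rnames are illustrative only).

```r
library(BiocFileCache)

# Hypothetical throwaway cache with two local resources.
bfc_q <- BiocFileCache(tempfile("bfc-query-"), ask = FALSE)
f1 <- tempfile(); writeLines("a", f1)
f2 <- tempfile(); writeLines("b", f2)
bfcadd(bfc_q, "alpha_counts", f1)
bfcadd(bfc_q, "beta_counts", f2)

# Search only the rname column rather than rname, rpath, and fpath.
alpha_hits <- bfcquery(bfc_q, "alpha", field = "rname")
count_hits <- bfcquery(bfc_q, "counts", field = "rname")

bfccount(alpha_hits)
bfccount(count_hits)
```

Restricting the search to rname avoids accidental matches against temporary file names in rpath or fpath.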
[
[ allows for subsetting of the BiocFileCache object. The output will be a
BiocFileCacheReadOnly object. Users will still be able to query, remove (from
the subset object only), and access resources of the subset; however, the
resources cannot be updated.
bfcsubWeb = bfc[paste0("BFC", 5:6)]
bfcsubWeb
## class: BiocFileCacheReadOnly
## bfccache: /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9
## bfccount: 2
## For more information see: bfcinfo() or bfcquery()
bfcinfo(bfcsubWeb)
## # A tibble: 2 × 10
## rid rname create_time access_time rpath rtype fpath last_modified_time etag
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BFC5 Test… 2024-05-01… 2024-05-01… /tmp… web http… <NA> <NA>
## 2 BFC6 Test… 2024-05-01… 2024-05-01… /tmp… web http… 2024-05-01 18:43:… <NA>
## # ℹ 1 more variable: expires <chr>
There are three methods for retrieving the BiocFileCache resource path
location:
[[
bfcpath()
bfcrpath()
[[ will access the rpath saved in the BiocFileCache. Retrieving this location
returns the path to the local version of the resource, allowing the user to
then use this path in whichever load/read method is most appropriate for the
resource. bfcpath() and bfcrpath() both return a named character vector giving
the local file that can be used for retrieval. bfcpath() requires rids, while
bfcrpath() can use rids or rnames (but not both). bfcrpath() can also be used
to add a resource to the cache when rnames are specified: if an element of
rnames is not found, it will try to add it to the cache with bfcadd().
bfc[["BFC2"]]
## BFC2
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c4ed30f80_file3a0e4c1e74ac0f"
bfcpath(bfc, "BFC2")
## BFC2
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c4ed30f80_file3a0e4c1e74ac0f"
bfcpath(bfc, "BFC5")
## BFC5
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c166f0a3c_get"
bfcrpath(bfc, rids="BFC5")
## BFC5
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c166f0a3c_get"
bfcrpath(bfc)
## BFC1
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c5041493e_3a0e4c5041493e.RData"
## BFC2
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c4ed30f80_file3a0e4c1e74ac0f"
## BFC3
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c38a2be62_file3a0e4c7bc03ed6"
## BFC4
## "/tmp/Rtmp58Dbhm/file3a0e4c4fd04af6"
## BFC5
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c166f0a3c_get"
## BFC6
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c7fcf11_Bioconductor"
## BFC7
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c72d7e34e_Bioconductor"
bfcrpath(bfc, c("http://httpbin.org/get","Test3_addAsis"))
## adding rname 'http://httpbin.org/get'
## BFC4
## "/tmp/Rtmp58Dbhm/file3a0e4c4fd04af6"
## BFC8
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c79a84c39_get"
Managing remote resources locally involves knowing when to update the local copy of the data.
bfcneedsupdate()
bfcneedsupdate() is a method that compares the etag and last_modified time of
the local copy of the data with the etag and last_modified time of the remote
resource, and also checks an expires time. The cache saves this information
when the web resource is initially added. The expires time is checked against
the current Sys.time to see if the local resource has expired. If it has, the
resource is deemed to need an update; if the expires time is unavailable or has
not passed, the etag and last_modified_time are checked. The etag information
is used definitively if it is available; if it is not available, the
last_modified time is checked. If the resource does not have a last_modified
tag either, the result is undetermined (NA). If the resource has not been
downloaded yet, the result is TRUE.
Note: This function does not automatically download the remote source if it
is out of date. Please see bfcdownload().
bfcneedsupdate(bfc, "BFC5")
## BFC5
## NA
bfcneedsupdate(bfc, "BFC6")
## BFC6
## TRUE
bfcneedsupdate(bfc)
## BFC5 BFC6 BFC7 BFC8
## NA TRUE TRUE NA
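The order of checks described for bfcneedsupdate() can be sketched in base R. This is not the package's implementation; needs_update_sketch() and all of its arguments are hypothetical and only mirror the decision order stated in the text.

```r
# Hypothetical helper: NOT the package's code, only the decision order
# described above. All arguments are assumed inputs for illustration.
needs_update_sketch <- function(downloaded,
                                expires = NA,
                                local_etag = NA, remote_etag = NA,
                                local_modified = NA, remote_modified = NA,
                                now = Sys.time()) {
    # A resource that was never downloaded always needs an update.
    if (!downloaded) return(TRUE)
    # An expires time in the past means the local copy has expired.
    if (!is.na(expires) && expires < now) return(TRUE)
    # The etag is used definitively when it is available.
    if (!is.na(local_etag) && !is.na(remote_etag))
        return(!identical(local_etag, remote_etag))
    # Otherwise fall back to the last_modified times.
    if (!is.na(local_modified) && !is.na(remote_modified))
        return(remote_modified > local_modified)
    # Nothing to compare: undetermined.
    NA
}

never_downloaded <- needs_update_sketch(downloaded = FALSE)
same_etag <- needs_update_sketch(TRUE, local_etag = "abc", remote_etag = "abc")
newer_remote <- needs_update_sketch(
    TRUE,
    local_modified = as.POSIXct("2024-01-01", tz = "UTC"),
    remote_modified = as.POSIXct("2024-02-01", tz = "UTC"))
no_information <- needs_update_sketch(TRUE)
```

The NA case is why the examples above return NA for some resources: the server supplied neither an etag nor a last_modified time, so the caller has to decide how to treat it.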
Just as you can access the rpath, the local resource path can be set with
[[<-
The file must exist in order to be replaced in the BiocFileCache. If the user
wishes to rename a file, they must first make a copy of (or touch) the file.
fileBeingReplaced <- bfc[[rid3]]
fileBeingReplaced
## BFC3
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c38a2be62_file3a0e4c7bc03ed6"
# fl3 was created when we were adding resources
fl3
## [1] "/tmp/Rtmp58Dbhm/file3a0e4c4fd04af6"
bfc[[rid3]]<-fl3
## Warning in `[[<-`(`*tmp*`, rid3, value = "/tmp/Rtmp58Dbhm/file3a0e4c4fd04af6"):
## updating rpath, changing rtype to 'local'
bfc[[rid3]]
## BFC3
## "/tmp/Rtmp58Dbhm/file3a0e4c4fd04af6"
The user may also wish to change the rname or fpath associated with a resource
in addition to the rpath. This can be done with
bfcupdate()
Again, if changing the rpath, the file must exist. If an fpath is being
updated, the data will be downloaded and the user will be prompted to overwrite
the current file specified in rpath. If the user does not want to be prompted
about overwriting files, ask=FALSE may be used.
bfcinfo(bfc, "BFC1")
## # A tibble: 1 × 10
## rid rname create_time access_time rpath rtype fpath last_modified_time etag
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 BFC1 NewR… 2024-05-01… 2024-05-01… /tmp… rela… 3a0e… NA <NA>
## # ℹ 1 more variable: expires <dbl>
bfcupdate(bfc, "BFC1", rname="FirstEntry")
bfcinfo(bfc, "BFC1")
## # A tibble: 1 × 10
## rid rname create_time access_time rpath rtype fpath last_modified_time etag
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 BFC1 Firs… 2024-05-01… 2024-05-01… /tmp… rela… 3a0e… NA <NA>
## # ℹ 1 more variable: expires <dbl>
Now let’s update a web resource
suppressPackageStartupMessages({
library(dplyr)
})
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
## # A tibble: 1 × 3
## rid rpath fpath
## <chr> <chr> <chr>
## 1 BFC6 /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c7fcf11_Bioconductor https://en…
bfcupdate(bfc, "BFC6", fpath=url, rname="Duplicate", ask=FALSE)
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
## # A tibble: 1 × 3
## rid rpath fpath
## <chr> <chr> <chr>
## 1 BFC6 /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c7fcf11_Bioconductor http://htt…
Lastly, remote resources may require an update if the data is out of date (see
bfcneedsupdate()). The bfcdownload() function will attempt to download from the
original resource saved in the cache as fpath and overwrite the out-of-date
file at rpath.
bfcdownload()
The following confirms that a resource needs updating and then performs the
update:
rid <- "BFC5"
test <- !identical(bfcneedsupdate(bfc, rid), FALSE) # 'TRUE' or 'NA'
if (test)
bfcdownload(bfc, rid, ask=FALSE)
## BFC5
## "/tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c166f0a3c_get"
The following functions are provided for metadata:
bfcmeta()<-
bfcmeta()
bfcmetalist()
bfcmetaremove()
Additional metadata can be added as data.frames that become tables in the SQL
database. The data.frame must contain a column rid that matches the rid column
in the cache. Any metadata added will then be displayed when accessing the
cache. Metadata is added with bfcmeta()<-. A table name must be provided as an
argument. Users can add multiple metadata tables as long as the names are
unique. Tables may be appended to or overwritten using the additional arguments
append=TRUE or overwrite=TRUE.
names(bfcinfo(bfc))
## [1] "rid" "rname" "create_time"
## [4] "access_time" "rpath" "rtype"
## [7] "fpath" "last_modified_time" "etag"
## [10] "expires"
meta <- as.data.frame(list(rid=bfcrid(bfc)[1:3], idx=1:3))
bfcmeta(bfc, name="resourceData") <- meta
names(bfcinfo(bfc))
## [1] "rid" "rname" "create_time"
## [4] "access_time" "rpath" "rtype"
## [7] "fpath" "last_modified_time" "etag"
## [10] "expires" "idx"
The metadata tables that exist can be listed with bfcmetalist() and retrieved
with bfcmeta().
bfcmetalist(bfc)
## [1] "resourceData"
bfcmeta(bfc, name="resourceData")
## rid idx
## 1 BFC1 1
## 2 BFC2 2
## 3 BFC3 3
Lastly, metadata can be removed with bfcmetaremove().
bfcmetaremove(bfc, name="resourceData")
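The overwrite=TRUE argument described earlier can be sketched as follows. This assumes a separate throwaway cache; the table name "notes", the column note, and the object names are hypothetical.

```r
library(BiocFileCache)

# Hypothetical throwaway cache with a single tracked resource.
bfc_meta <- BiocFileCache(tempfile("bfc-meta-"), ask = FALSE)
saveRDS(1:3, bfcnew(bfc_meta, "demo", ext = ".rds"))
rid <- bfcrid(bfc_meta)

# Add a metadata table named "notes".
bfcmeta(bfc_meta, name = "notes") <-
    data.frame(rid = rid, note = "draft", stringsAsFactors = FALSE)

# Replace the existing table of the same name instead of failing.
bfcmeta(bfc_meta, name = "notes", overwrite = TRUE) <-
    data.frame(rid = rid, note = "final", stringsAsFactors = FALSE)

notes <- bfcmeta(bfc_meta, name = "notes")
notes$note
```

Using append=TRUE instead would add rows to the existing table rather than replacing it.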
Note: Most functions have a quick form in which, if you don’t specify a
BiocFileCache object, they operate on the default cache BiocFileCache(). This
option is not available for bfcmeta()<-. This function must always be given a
BiocFileCache object, by first defining a variable and then passing that
variable into the function.
Example of ERROR:
bfcmeta(name="resourceData") <- meta
Error in bfcmeta(name = "resourceData") <- meta :
target of assignment expands to non-language object
Correct implementation:
bfc <- BiocFileCache()
bfcmeta(bfc, name="resourceData") <- meta
All other functions have a default: if the BiocFileCache object is missing,
they will operate on the default cache BiocFileCache().
Now that we have added resources, it is also possible to remove a resource.
bfcremove()
When you remove a resource from the cache, its local file is also deleted, but
only if the file is stored in the cache directory given by bfccache(bfc). If
the resource is a path to a file somewhere else on the user's system, the entry
is removed from the BiocFileCache object but the file is not deleted.
# let's remind ourselves of our object
bfc
## class: BiocFileCache
## bfccache: /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9
## bfccount: 8
## For more information see: bfcinfo() or bfcquery()
bfcremove(bfc, "BFC6")
bfcremove(bfc, "BFC1")
# let's look at our BiocFileCache object now
bfc
## class: BiocFileCache
## bfccache: /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9
## bfccount: 6
## For more information see: bfcinfo() or bfcquery()
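The difference between removing a file stored inside the cache directory and one tracked elsewhere can be sketched as follows. The cache and file names here (bfc_rm, inside, outside) are hypothetical and separate from the vignette's bfc object.

```r
library(BiocFileCache)

# Hypothetical throwaway cache for this illustration.
bfc_rm <- BiocFileCache(tempfile("bfc-remove-"), ask = FALSE)

inside <- tempfile(); writeLines("copied", inside)
outside <- tempfile(); writeLines("tracked in place", outside)

# One resource is copied into the cache directory, the other is left asis.
cached_copy <- bfcadd(bfc_rm, "copied into cache", inside, action = "copy")
tracked_asis <- bfcadd(bfc_rm, "tracked in place", outside,
                       rtype = "local", action = "asis")

bfcremove(bfc_rm, names(cached_copy))
bfcremove(bfc_rm, names(tracked_asis))

# Per the text: the copy inside the cache is deleted, while the file
# tracked outside the cache directory is left on disk.
file.exists(cached_copy)
file.exists(outside)
length(bfc_rm)
```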
There is another helper function that may be of use:
bfcsync()
This function will check for two things:
Entries in the cache whose local file (rpath) cannot be found (this would occur if bfcnew() is used and the path was not used to save an object)
Files in the cache directory (bfccache(bfc)) that are not being tracked by the BiocFileCache object
# create a new entry that hasn't been used
path <- bfcnew(bfc, "UseMe")
rmMe <- names(path)
# We also have a file not being tracked because we updated rpath
bfcsync(bfc)
## entries without corresponding files: 'BFC7' 'BFC9'
## files without cache entries
## /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/3a0e4c38a2be62_file3a0e4c7bc03ed6
## /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/add_or_return_rname.LOCK
##
## [1] FALSE
# you can suppress the messages and just have a TRUE/FALSE
bfcsync(bfc, FALSE)
## [1] FALSE
#
# Let's do some cleaning to have a synced object
#
bfcremove(bfc, rmMe)
unlink(fileBeingReplaced)
bfcsync(bfc)
## entries without corresponding files: 'BFC7'
## files without cache entries
## /tmp/Rtmp58Dbhm/file3a0e4c3ee3fdd9/add_or_return_rname.LOCK
##
## [1] FALSE
There is a helper function to export a BiocFileCache and associated files as a tar or zip archive as well as the appropriate import function.
exportbfc()
importbfc()
The exportbfc() function takes a BiocFileCache object or subsetted object and
creates a tar or zip archive that can then be shared with collaborators on
different computer systems. The user can choose where the archive is created
with outputFile; the default is the name BiocFileCacheExport.tar in the current
working directory. By default a tar archive is created, but the user can create
a zip archive instead using the argument outputMethod="zip". Any additional
arguments to utils::zip or utils::tar may also be used.
The following are some example calls:
# export entire biocfilecache
exportbfc(bfc)
# export the first 4 entries of biocfilecache
# as a compressed tar
exportbfc(bfc, rids=paste0("BFC", 1:4),
outputFile="BiocFileCacheExport.tar.gz", compression="gzip")
# export the subsetted object of web resources as zip
sub1 <- bfc[bfcrid(bfcquery(bfc, "web", field='rtype'))]
exportbfc(sub1, outputFile = "BiocFileCacheExportWeb.zip",
outputMethod="zip")
Once extracted on a user's system, the archive provides a fully functional copy
of the sent cache. The archive can be extracted manually and the path used in
the constructor BiocFileCache(), or, for convenience, the function importbfc()
may be used. importbfc() takes a path to the appropriate tar or zip file, the
argument archiveMethod indicating whether untar or unzip should be used (the
default is untar), a path to where the archive should be extracted as exdir,
and any additional arguments to the utils::untar and utils::unzip methods. The
function will extract the files and load the associated BiocFileCache object
into the R session.
The following are example calls to load the above example exported objects:
bfc <- importbfc("BiocFileCacheExport.tar")
bfc2 <- importbfc("BiocFileCacheExport.tar.gz", compression="gzip")
bfc3 <- importbfc("BiocFileCacheExportWeb.zip", archiveMethod="unzip")
The following helper function converts existing data to a BiocFileCache:
makeBiocFileCacheFromDataFrame
This function may take a while to run if there are many resources; however, if the BiocFileCache is stored in a permanent location, it will only need to be run once.
makeBiocFileCacheFromDataFrame takes an existing data.frame and creates a
BiocFileCache object. The cache location can be specified by the cache
argument. The cache must not already exist, and the user will be prompted to
create the location. If the user answers ‘N’, the cache will be created in a
temporary directory and this function will have to be run again in a new R
session. The original data.frame must contain the required BiocFileCache
columns rtype, rpath, and fpath as described in section 1.2 “Creating /
Loading the Cache”. The optional columns rname, last_modified_time, etag, and
expires may also be specified in the original data.frame; they are not required
and will be populated with defaults if missing. For resources with
rtype="local", actionLocal controls whether the local copy of the file is
copied or moved to the cache location, or left asis on the local system. A
local copy of the file must exist if the resource is identified as
rtype="local". For resources with rtype="web", actionWeb controls whether the
local copy of the remote file is copied or moved to the cache location. It is a
requirement of BiocFileCache that all remote resources download their local
copy to the cache location. A local copy of the file does not have to exist and
can be downloaded into the cache at a later time. Any additional columns of the
original data.frame, besides the required or optional BiocFileCache columns,
are separated and added to the BiocFileCache as a metadata table with the name
given as metadataName. See section 1.6 on “Adding Metadata”.
The following is an example data.frame with the minimal columns ‘rtype’,
‘rpath’, and ‘fpath’ and one additional column, ‘keywords’, that will become
metadata. The ‘rpath’ can be NA as these are remote resources (rtype='web')
that have not been downloaded yet.
tbl <- data.frame(rtype=c("web","web"),
rpath=c(NA_character_,NA_character_),
fpath=c("http://httpbin.org/get",
"https://en.wikipedia.org/wiki/Bioconductor"),
keywords = c("httpbin", "wiki"), stringsAsFactors=FALSE)
tbl
## rtype rpath fpath keywords
## 1 web <NA> http://httpbin.org/get httpbin
## 2 web <NA> https://en.wikipedia.org/wiki/Bioconductor wiki
newbfc <- makeBiocFileCacheFromDataFrame(tbl,
cache=file.path(tempdir(),"BFC"),
actionWeb="copy",
actionLocal="copy",
metadataName="resourceMetadata")
Finally, there are two functions involved with cleaning or deleting the cache:
cleanbfc()
removebfc()
cleanbfc() will evaluate the resources in the BiocFileCache object and
determine which, if any, have not been created, redownloaded, or updated in a
specified number of days. If ask=TRUE, the user is asked, for each entry above
that threshold, whether it should be removed from the cache object and the file
deleted (the file is only deleted if it is in the bfccache(bfc) location). If
ask=FALSE, it does not ask about each file and automatically removes the entry
and deletes the file. The default number of days is 120. If a resource has not
needed any updates, this function could give a false positive. It also does not
take into account how many times the resource was loaded by retrieving the path
(i.e., via [[, bfcpath, bfcrpath), so it may not be an accurate indication of
how often the resource is used. Please use this function with caution.
cleanbfc(bfc)
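A non-interactive call with a custom threshold might look like the following sketch. It assumes the days and ask arguments described above; the 30-day threshold, the cache object bfc_clean, and the rname are hypothetical.

```r
library(BiocFileCache)

# Hypothetical throwaway cache with one freshly created resource.
bfc_clean <- BiocFileCache(tempfile("bfc-clean-"), ask = FALSE)
saveRDS(1:3, bfcnew(bfc_clean, "recent object", ext = ".rds"))

# Remove, without prompting, anything not touched in the last 30 days.
# The resource above was just created, so it is below the threshold.
cleanbfc(bfc_clean, days = 30, ask = FALSE)

length(bfc_clean)
```

Because ask=FALSE deletes without confirmation, it is safest to try a call like this on a copy of the cache first.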
removebfc() will remove the BiocFileCache completely from the system. Any files
saved in the bfccache(bfc) directory will also be deleted.
removebfc(bfc)
Note: Use with caution!
One use for BiocFileCache is to save local copies of remote resources. The benefits of this approach include reproducibility, faster access, and access (once cached) without need for an internet connection. An example is an Ensembl GTF file (also available via AnnotationHub):
## paste to avoid long line in vignette
url <- paste(
"ftp://ftp.ensembl.org/pub/release-71/gtf",
"homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz",
sep="/")
For a system-wide cache, simply load the BiocFileCache package and ask for the
local resource path (rpath) of the resource.
library(BiocFileCache)
bfc <- BiocFileCache()
path <- bfcrpath(bfc, url)
Use the path returned by bfcrpath() as usual, e.g.,
gtf <- rtracklayer::import.gff(path)
A more compact use, the first or any time, is
gtf <- rtracklayer::import.gff(bfcrpath(BiocFileCache(), url))
Ensembl releases do not change with time, so there is no need to check whether the cached resource needs to be updated.
One might use BiocFileCache to cache results from experimental analysis. The
rname field provides an opportunity to record descriptive metadata to help
manage collections of resources, without relying on cryptic file naming
conventions.
Here we create or use a local file cache in the directory in which we are doing our analysis.
library(BiocFileCache)
bfc <- BiocFileCache("~/my-experiment/results")
We perform our analysis…
suppressPackageStartupMessages({
library(DESeq2)
library(airway)
})
data(airway)
dds <- DESeqDataSet(airway, design = ~ cell + dex)
result <- DESeq(dds)
…and then save our result in a location provided by BiocFileCache.
saveRDS(result, bfcnew(bfc, "airway / DESeq standard analysis"))
Retrieve the result at a later date
result <- readRDS(bfcrpath(bfc, "airway / DESeq standard analysis"))
One might imagine the following workflow:
suppressPackageStartupMessages({
library(BiocFileCache)
library(rtracklayer)
library(dplyr) # provides %>%, filter() and collect() used below
})
# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)
# the web resource of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"
# check if url is being tracked
res <- bfcquery(bfc, url)
if (bfccount(res) == 0L) {
# if it is not in cache, add
ans <- bfcadd(bfc, rname="ensembl, homo sapien", fpath=url)
} else {
# if it is in cache, get path to load
rid = res %>% filter(fpath == url) %>% collect(Inf) %>% `[[`("rid")
ans <- bfcrpath(bfc, rid)
# check to see if the resource needs to be updated
check <- bfcneedsupdate(bfc, rid)
# check can be NA if it cannot be determined, choose how to handle
if (is.na(check)) check <- TRUE
if (check){
ans <- bfcdownload(bfc, rid)
}
}
# ans is the path of the file to load
ans
# we know because we search for the url that the file is a .gtf.gz,
# if we searched on other terms we can use 'bfcpath' to see the
# original fpath to know the appropriate load/read/import method
bfcpath(bfc, names(ans))
temp = GTFFile(ans)
info = import(temp)
#
# A simpler way to test whether something is in the cache,
# and start tracking it if not, is to use `bfcrpath`
#
suppressPackageStartupMessages({
library(BiocFileCache)
library(rtracklayer)
})
# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path, ask=FALSE)
# the web resources of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"
url2 <- "ftp://ftp.ensembl.org/pub/release-71/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_5.0.71.gtf.gz"
# if not in cache will download and create new entry
pathsToLoad <- bfcrpath(bfc, c(url, url2))
## adding rname 'ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz'
## adding rname 'ftp://ftp.ensembl.org/pub/release-71/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_5.0.71.gtf.gz'
pathsToLoad
## BFC1
## "/tmp/Rtmp58Dbhm/tempCacheDir/3a0e4c35f10021_Homo_sapiens.GRCh37.71.gtf.gz"
## BFC2
## "/tmp/Rtmp58Dbhm/tempCacheDir/3a0e4c1035e7fa_Rattus_norvegicus.Rnor_5.0.71.gtf.gz"
# now load files as you see fit
info = import(GTFFile(pathsToLoad[1]))
class(info)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
summary(info)
## [1] "GRanges object with 2253155 ranges and 12 metadata columns"
#
# One could also imagine the following:
#
library(BiocFileCache)
# load the cache
bfc <- BiocFileCache()
#
# Do some work!
#
# add a location in the cache
filepath <- bfcnew(bfc, "R workspace")
save(list = ls(), file=filepath)
# now the R workspace is being tracked in the cache
A package may desire to use BiocFileCache to manage remote data. The following is example code providing some best practice guidelines.
Presumably, the cache will be accessed from a variety of places within the
package code, examples, and vignette, so it is desirable to have a wrapper
around the BiocFileCache constructor. The following is a suggested example for
a package called MyNewPackage:
.get_cache <-
    function()
{
    cache <- tools::R_user_dir("MyNewPackage", which="cache")
    BiocFileCache::BiocFileCache(cache)
}
Essentially this will create a unique cache for the package. If run interactively, the user will be given the option to permanently create the package cache; otherwise a temporary directory will be used.
Managing remote resources then involves a function that queries the cache to see whether the resource has been added. If it has not, the function adds it to the cache; if it has, the function checks whether the file needs to be updated.
download_data_file <-
    function( verbose = FALSE )
{
    fileURL <- "http://a_path_to/someremotefile.tsv.gz"
    bfc <- .get_cache()
    rid <- bfcquery(bfc, "geneFileV2", "rname")$rid
    if (!length(rid)) {
        if( verbose )
            message( "Downloading GENE file" )
        rid <- names(bfcadd(bfc, "geneFileV2", fileURL ))
    }
    if (!isFALSE(bfcneedsupdate(bfc, rid)))
        bfcdownload(bfc, rid)
    bfcrpath(bfc, rids = rid)
}
A case has been identified where it may be desired to do some
processing of web-based resources before saving the resource in the
cache. This can be done through specific options of the bfcadd() and
bfcdownload() functions:
1. bfcadd() using the download=FALSE argument.
2. bfcdownload() using the FUN argument.
The FUN argument is the name of a function to be applied before
saving the downloaded file into the cache. The default is file.rename,
which simply moves the downloaded file into the cache. A
user-supplied function must take ONLY two arguments. When invoked, the
arguments will be:
1. character(1) A temporary file containing the resource as
retrieved from the web.
2. character(1) The BiocFileCache location where the processed file
should be saved.
The function should return TRUE on success or a character(1)
description of the failure on error. As an example:
url <- "http://bioconductor.org/packages/stats/bioc/BiocFileCache/BiocFileCache_stats.tab"
headFile <-                         # how to process file before caching
    function(from, to)
{
    dat <- readLines(from)
    writeLines(head(dat), to)
    TRUE
}
rid <- bfcquery(bfc, url, "fpath")$rid
if (!length(rid)) # not in cache, add but do not download
rid <- names(bfcadd(bfc, url, download = FALSE))
update <- bfcneedsupdate(bfc, rid) # TRUE if newly added or stale
if (!isFALSE(update)) # download & process
bfcdownload(bfc, rid, ask = FALSE, FUN = headFile)
## Warning in readLines(from): incomplete final line found on
## '/tmp/Rtmp58Dbhm/tempCacheDir/file3a0e4c1b5f52'
## BFC3
## "/tmp/Rtmp58Dbhm/tempCacheDir/3a0e4c3d2ca96c_BiocFileCache_stats.tab"
rpath <- bfcrpath(bfc, rids=rid) # path to processed result
readLines(rpath) # read processed result
## [1] "Year\tMonth\tNb_of_distinct_IPs\tNb_of_downloads"
## [2] "2024\tJan\t25221\t52704"
## [3] "2024\tFeb\t23033\t49696"
## [4] "2024\tMar\t28218\t78379"
## [5] "2024\tApr\t35615\t80557"
## [6] "2024\tMay\t0\t0"
Note: By default bfcadd uses the web file name as the name of the saved local
file. If the processing step involves saving the data in a different format, use
the bfcadd argument ext to assign an extension that identifies the type of file
that was saved.
For example:
url <- "http://httpbin.org/get"
bfcadd(bfc, "myfile", url, download=FALSE)
# would save a file `<uniqueid>_get` in the cache
bfcadd(bfc, "myfile", url, download=FALSE, ext=".Rdata")
# would save a file `<uniqueid>_get.Rdata` in the cache
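As a hedged illustration of a processing function that changes the file format, the sketch below converts a tab-delimited download into a serialized R object. The name tsvToRds and the choice of RDS output are assumptions made for this example and are not part of BiocFileCache. The function follows the two-argument contract described above and returns TRUE on success or a character(1) message on failure. It is exercised here on temporary files so that it runs without a cache or a network connection.

```r
# Hypothetical FUN for bfcdownload(): parse a tab-delimited file and
# store the parsed table as RDS at the cache location.
tsvToRds <- function(from, to)
{
    tryCatch({
        dat <- read.delim(from, stringsAsFactors = FALSE)
        saveRDS(dat, to)
        TRUE
    }, error = function(e) {
        # character(1) description of the failure
        conditionMessage(e)
    })
}

# Stand-alone check using temporary files in place of a real download
from <- tempfile(fileext = ".tab")
to <- tempfile(fileext = ".rds")
writeLines(c("Year\tMonth", "2024\tJan"), from)
tsvToRds(from, to)
readRDS(to)
```

With a cache, such a function would be passed as bfcdownload(bfc, rid, FUN = tsvToRds) after adding the resource with download=FALSE and a matching ext value such as ".rds".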
BiocFileCache uses the CRAN package httr functions HEAD and GET for accessing
web resources. This can be problematic if operating behind a proxy. The easiest
solution is to pass the proxy information to httr::set_config.
proxy <- httr::use_proxy("http://my_user:my_password@myproxy:8080")
## or
proxy <- httr::use_proxy(Sys.getenv('http_proxy'))
httr::set_config(proxy)
A cache may need to be shared across multiple users on a system, which
typically leads to permission errors. To allow access for multiple users,
create a group that both the users and the cache belong to. The permissions of
potentially two files need to be altered, depending on what you would like
individuals to be able to accomplish with the cache. A read-only cache requires
manually changing the BiocFileCache.sqlite.LOCK file so that the group
permissions are g+rw. To allow users to download files to the shared cache,
both the BiocFileCache.sqlite.LOCK file and the BiocFileCache.sqlite file need
group permissions g+rw. Please consult the documentation for your system on how
to create a user group. To find the location of the cache in order to change
the group and file permissions, you may run the following in R if you used the
default location: tools::R_user_dir("BiocFileCache", which="cache") or, if you
created a unique location, something like the following:
bfc = BiocFileCache(cache="someUniquelocation"); bfccache(bfc). For quick
reference, on Linux you will use chown currentuser:newgroup to change the group
and chmod to change the file permissions: chmod 660 or chmod g+rw should set
the correct permissions.
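The permission change described above can also be made from within R. The following is a minimal sketch, assuming a Unix-alike system and that group ownership has already been assigned outside R with chown. The helper name shareCacheFiles is hypothetical, and the demonstration runs on a mock directory containing empty files rather than on a real cache.

```r
# Hypothetical helper: give the group read/write access to the cache
# bookkeeping files. Group ownership itself must be set outside R.
shareCacheFiles <- function(cache, writable = TRUE)
{
    files <- file.path(cache, "BiocFileCache.sqlite.LOCK")
    if (writable)
        files <- c(files, file.path(cache, "BiocFileCache.sqlite"))
    files <- files[file.exists(files)]
    # 660 is owner and group read/write, no access for others
    ok <- Sys.chmod(files, mode = "0660", use_umask = FALSE)
    setNames(ok, basename(files))
}

# Demonstration on a mock cache directory (Unix-alike systems)
cache <- tempfile()
dir.create(cache)
file.create(file.path(cache, c("BiocFileCache.sqlite",
                               "BiocFileCache.sqlite.LOCK")))
shareCacheFiles(cache)
file.mode(list.files(cache, full.names = TRUE))
```

Passing writable = FALSE touches only the lock file, which corresponds to the read-only case described above.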
Two issues have been commonly reported regarding the lock file.
The first is a permission ERROR regarding group and public access. See the
previous Group Cache Access section.
The second is an issue with filelock on particular systems. Particular partitions and non-standard file systems may not support filelock. The solution is to create the cache on a different part of the system. The easiest way to define a new cache location is by using environment variables.
In R:
Sys.setenv(BFC_CACHE=<new cache location>)
Alternatively, you can set the environment variable globally to avoid having to set it in each R session. Please search for specific instructions on setting environment variables globally for your particular operating system.
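One way to make the setting persistent for R sessions only is an entry in a user-level .Renviron file. The sketch below is illustrative: the directory name is hypothetical, and it writes to a temporary file instead of a real ~/.Renviron, then loads that file with the base R function readRenviron().

```r
# Hypothetical new cache location on a file system that supports locking
newCache <- file.path(tempdir(), "my_bfc_cache")
dir.create(newCache, recursive = TRUE, showWarnings = FALSE)

# Write a 'NAME=value' line; a real setup would append to ~/.Renviron
renviron <- tempfile(fileext = ".Renviron")
writeLines(paste0("BFC_CACHE=", newCache), renviron)

# Load the file into the current session and confirm the value
readRenviron(renviron)
Sys.getenv("BFC_CACHE")
```

R reads a user-level .Renviron at startup, so an entry placed there applies to every new session without affecting other programs.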
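Other common packages that use filelock and have specific environment variables to control the cache location are:
AnnotationHub: ANNOTATION_HUB_CACHE
ExperimentHub: EXPERIMENT_HUB_CACHE
biomaRt: BIOMART_CACHE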
It is our hope that this package allows for easier management of local and remote resources.
As of BiocFileCache version > 1.15.1, the default caching location has
changed. The default cache is now controlled by the function tools::R_user_dir
instead of rappdirs::user_cache_dir. Users who have utilized the default
BiocFileCache location and want to continue using the existing cache must move
the cache and its files to the new default location; otherwise they must delete
the old cache and redownload any previous files.
The following steps can be used to move the files to the new location:
Determine the old location by running the following in R
rappdirs::user_cache_dir(appname="BiocFileCache")
Determine the new location by running the following in R
tools::R_user_dir("BiocFileCache", which="cache")
Move the files to the new location. You can do this manually or with the following steps in R. Remember that if you have a lot of cached files, this may take a while, and you will need permissions on all the files in order to move them.
# make sure you have permissions on the cache/files
# use at own risk
moveFiles <- function(package) {
    olddir <- path.expand(rappdirs::user_cache_dir(appname=package))
    newdir <- tools::R_user_dir(package, which="cache")
    dir.create(path=newdir, recursive=TRUE)
    files <- list.files(olddir, full.names=TRUE)
    moveres <- vapply(files,
        FUN=function(fl) {
            filename <- basename(fl)
            newname <- file.path(newdir, filename)
            file.rename(fl, newname)
        },
        FUN.VALUE=logical(1))
    # remove the old directory only if every file was moved
    if (all(moveres)) unlink(olddir, recursive=TRUE)
}
package <- "BiocFileCache"
moveFiles(package)
Users may always specify a unique caching location by providing the cache
argument to the BiocFileCache constructor; however, users must then specify
this location every time, as it will not be recognized by default in
subsequent runs.
Alternatively, the default caching location may also be controlled by a
user-wide or system-wide environment variable. Users may set the environment
variable BFC_CACHE to the old location to continue using it as the default
location.
Lastly, if a user does not care about the already existing default cache, the old location may be deleted to move forward with the new default location. This option should be used with caution. Once deleted, old cached resources will no longer be available and will have to be re-downloaded.
One can do this manually by navigating to the location indicated in the ERROR
message as Problematic cache: and deleting the folder and all its content.
The following can be done to delete through R code:
CAUTION: This will remove the old cache and all downloaded resources.
library(BiocFileCache)
package = "BiocFileCache"
BFC_CACHE = rappdirs::user_cache_dir(appname=package)
Sys.setenv(BFC_CACHE = BFC_CACHE)
bfc = BiocFileCache(BFC_CACHE)
## CAUTION: This removes the cache and all downloaded resources
removebfc(bfc, ask=FALSE)
## create new empty cache in new default location
bfc = BiocFileCache(ask=FALSE)
sessionInfo()
## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] rtracklayer_1.65.0 GenomicRanges_1.57.0 GenomeInfoDb_1.41.0
## [4] IRanges_2.39.0 S4Vectors_0.43.0 BiocGenerics_0.51.0
## [7] dplyr_1.1.4 BiocFileCache_2.13.0 dbplyr_2.5.0
## [10] BiocStyle_2.33.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.35.0 rjson_0.2.21
## [3] xfun_0.43 bslib_0.7.0
## [5] lattice_0.22-6 Biobase_2.65.0
## [7] vctrs_0.6.5 tools_4.4.0
## [9] bitops_1.0-7 generics_0.1.3
## [11] parallel_4.4.0 curl_5.2.1
## [13] tibble_3.2.1 fansi_1.0.6
## [15] RSQLite_2.3.6 blob_1.2.4
## [17] pkgconfig_2.0.3 Matrix_1.7-0
## [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.12
## [21] compiler_4.4.0 Rsamtools_2.21.0
## [23] Biostrings_2.73.0 codetools_0.2-20
## [25] htmltools_0.5.8.1 sass_0.4.9
## [27] RCurl_1.98-1.14 yaml_2.3.8
## [29] pillar_1.9.0 crayon_1.5.2
## [31] jquerylib_0.1.4 BiocParallel_1.39.0
## [33] DelayedArray_0.31.0 cachem_1.0.8
## [35] abind_1.4-5 tidyselect_1.2.1
## [37] digest_0.6.35 purrr_1.0.2
## [39] restfulr_0.0.15 bookdown_0.39
## [41] grid_4.4.0 fastmap_1.1.1
## [43] SparseArray_1.5.0 cli_3.6.2
## [45] magrittr_2.0.3 S4Arrays_1.5.0
## [47] XML_3.99-0.16.1 utf8_1.2.4
## [49] withr_3.0.0 filelock_1.0.3
## [51] UCSC.utils_1.1.0 bit64_4.0.5
## [53] rmarkdown_2.26 XVector_0.45.0
## [55] httr_1.4.7 matrixStats_1.3.0
## [57] bit_4.0.5 memoise_2.0.1
## [59] evaluate_0.23 knitr_1.46
## [61] BiocIO_1.15.0 rlang_1.1.3
## [63] glue_1.7.0 DBI_1.2.2
## [65] BiocManager_1.30.22 jsonlite_1.8.8
## [67] R6_2.5.1 MatrixGenerics_1.17.0
## [69] GenomicAlignments_1.41.0 zlibbioc_1.51.0