date: “last edit: 9/21/2022”

Load packages into R session. It will automatically load the package of dplyr and dbplyr.

library(SQLDataFrame)
library(DBI)

SQLDataFrame slots

dbfile <- system.file("extdata/test.db", package = "SQLDataFrame")
conn <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile)
dbListTables(conn)
#> [1] "colData" "r1_ridx" "state"
obj <- SQLDataFrame(
    conn = conn, dbtable = "state", dbkey = "state")
obj
#> SQLDataFrame with 50 rows and 4 columns
#>          state |           division        region population        size
#>    <character> |        <character>   <character>  <numeric> <character>
#>        Alabama | East South Central         South       3615      medium
#>         Alaska |            Pacific          West        365       small
#>        Arizona |           Mountain          West       2280      medium
#>       Arkansas | West South Central         South       2110      medium
#>     California |            Pacific          West      21198       large
#>            ... .                ...           ...        ...         ...
#>       Virginia |     South Atlantic         South       4981      medium
#>     Washington |            Pacific          West       3559      medium
#>  West Virginia |     South Atlantic         South       1799      medium
#>      Wisconsin | East North Central North Central       4589      medium
#>        Wyoming |           Mountain          West        376       small

Slot accessors

To make the SQLDataFrame object as light and compact as possible, there are only 5 slots contained in the object: tblData, dbkey, dbnrows, dbconcatKey, indexes. Metadata information could be returned through these 5 slots using slot accessors or other utility functions.

slotNames(obj)
#> [1] "dbkey"           "dbnrows"         "tblData"         "indexes"        
#> [5] "dbconcatKey"     "elementType"     "elementMetadata" "metadata"
dbtable(obj)
#> [1] "state"
dbkey(obj)
#> [1] "state"
dbcon(obj)
#> <SQLiteConnection>
#>   Path: /tmp/RtmpFamXjq/Rinst3abae72fbbf0d1/SQLDataFrame/extdata/test.db
#>   Extensions: TRUE

tblData slot

The tblData slot saves the dbplyr::tbl_dbi version of the database table, which is a light-weight representation of the database table in R. Of note is that this lazy tbl only contains unique rows. The rows could also be sorted by the dbkey(obj) if the SQLDataFrame object was generated from union or rbind. So when the saveSQLDataFrame() function was called, a database table will be written into a physical disk space containing only the unique records.

Accessor function is made avaible for this slot:

tblData(obj)
#> # Source:   table<state> [?? x 5]
#> # Database: sqlite 3.45.0 [/tmp/RtmpFamXjq/Rinst3abae72fbbf0d1/SQLDataFrame/extdata/test.db]
#>    division           region    state       population size  
#>    <chr>              <chr>     <chr>            <dbl> <chr> 
#>  1 East South Central South     Alabama           3615 medium
#>  2 Pacific            West      Alaska             365 small 
#>  3 Mountain           West      Arizona           2280 medium
#>  4 West South Central South     Arkansas          2110 medium
#>  5 Pacific            West      California       21198 large 
#>  6 Mountain           West      Colorado          2541 medium
#>  7 New England        Northeast Connecticut       3100 medium
#>  8 South Atlantic     South     Delaware           579 small 
#>  9 South Atlantic     South     Florida           8277 large 
#> 10 South Atlantic     South     Georgia           4931 medium
#> # ℹ more rows

dbnrows and dbconcatKey

The dbnrows slot saves the number of rows corresponding to the tblData, and dbconcatKey saves the realized (concatenated if multiple) key columns corresponding to the tblData. Accessor functions are also available for these 2 slots:

dbnrows(obj)  ## equivalent to nrow(obj)
#> [1] 50
dbconcatKey(obj)
#>  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
#>  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
#>  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
#> [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
#> [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
#> [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
#> [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
#> [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
#> [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
#> [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
#> [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
#> [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
#> [49] "Wisconsin"      "Wyoming"

indexes slot

The indexes slots is an unnamed list saving the row and column indexes respectively corresponding to the tblData slot, so that the SQLDataFrame could possibly have duplicate rows or only a subset of data records from the tblData, while the tblData slot doesn’t need to be changed. To be consistent, the slots of dbnrows and dbconcatKey will also remain unchanged.

obj@indexes
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
obj_sub <- obj[sample(5, 3, replace = TRUE), 2:3]
obj_sub
#> SQLDataFrame with 3 rows and 2 columns
#>        state |      region population
#>  <character> | <character>  <numeric>
#>     Arkansas |       South       2110
#>      Alabama |       South       3615
#>       Alaska |        West        365
obj_sub@indexes
#> [[1]]
#> [1] 4 1 2
#> 
#> [[2]]
#> [1] 2 3
identical(tblData(obj), tblData(obj_sub))
#> [1] TRUE
identical(dbnrows(obj), dbnrows(obj_sub))
#> [1] TRUE
nrow(obj)
#> [1] 50
nrow(obj_sub)
#> [1] 3

With a filter or select function (which is similar to [i, ] subsetting), only the indexes slot will be updated for the row or column index pointing to the tblData.

obj_filter <- obj %>% filter(division == "South Atlantic" & size == "medium")
obj_filter@indexes
#> [[1]]
#> [1] 10 20 40 46 48
#> 
#> [[2]]
#> NULL
identical(tblData(obj), tblData(obj_filter))
#> [1] TRUE

obj_select <- obj %>% select(division, size)
obj_select@indexes
#> [[1]]
#> NULL
#> 
#> [[2]]
#> [1] 1 4
identical(tblData(obj), tblData(obj_select))
#> [1] TRUE

SQLDataFrame methods

ROWNAMES

The ROWNAMES,SQLDataFrame method was defined to return the (concatenated if multiple) key column(s) value, so that the row subsetting with character vector works for the SQLDataFrame objects.

rnms <- ROWNAMES(obj)
obj[sample(rnms, 3), ]
#> SQLDataFrame with 3 rows and 4 columns
#>          state |           division        region population        size
#>    <character> |        <character>   <character>  <numeric> <character>
#>  West Virginia |     South Atlantic         South       1799      medium
#>         Kansas | West North Central North Central       2280      medium
#>         Nevada |           Mountain          West        590       small

For SQLDataFrame object with composite keys:

obj1 <- SQLDataFrame(conn = conn, dbtable = "state",
                     dbkey = c("region", "population"))
ROWNAMES(obj1[1:10,])
#>  [1] "South:3615.0"     "West:365.0"       "West:2280.0"      "South:2110.0"    
#>  [5] "West:21198.0"     "West:2541.0"      "Northeast:3100.0" "South:579.0"     
#>  [9] "South:8277.0"     "South:4931.0"
obj1[c("South:3615.0", "West:365.0"), ]
#> SQLDataFrame with 2 rows and 3 columns
#>       region population |           division       state        size
#>  <character>  <numeric> |        <character> <character> <character>
#>        South       3615 | East South Central     Alabama      medium
#>         West        365 |            Pacific      Alaska       small

SessionInfo()

sessionInfo()
#> R version 4.3.2 Patched (2023-11-13 r85521)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] DBI_1.2.1           SQLDataFrame_1.16.1 S4Vectors_0.40.2   
#> [4] BiocGenerics_0.48.1 dbplyr_2.4.0        dplyr_1.1.4        
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5        jsonlite_1.8.8   compiler_4.3.2   tidyselect_1.2.0
#>  [5] blob_1.2.4       jquerylib_0.1.4  yaml_2.3.8       fastmap_1.1.1   
#>  [9] R6_2.5.1         generics_0.1.3   knitr_1.45       tibble_3.2.1    
#> [13] bslib_0.6.1      pillar_1.9.0     rlang_1.1.3      utf8_1.2.4      
#> [17] cachem_1.0.8     xfun_0.42        sass_0.4.8       lazyeval_0.2.2  
#> [21] bit64_4.0.5      RSQLite_2.3.5    memoise_2.0.1    cli_3.6.2       
#> [25] withr_3.0.0      magrittr_2.0.3   digest_0.6.34    lifecycle_1.0.4 
#> [29] vctrs_0.6.5      evaluate_0.23    glue_1.7.0       fansi_1.0.6     
#> [33] purrr_1.0.2      rmarkdown_2.25   tools_4.3.2      pkgconfig_2.0.3 
#> [37] htmltools_0.5.7

Compatibility with GenomicRanges and SummarizedExperiment


library(SQLDataFrame)
library(DBI)
dbfile <- system.file("extdata/test.db", package = "SQLDataFrame")
conn <- DBI::dbConnect(DBI::dbDriver("SQLite"), dbname = dbfile)
obj <- SQLDataFrame(conn = conn, dbtable = "state", dbkey = "state")
dim(obj)
#> [1] 50  4
is(obj, "DataFrame")
#> [1] TRUE

library(GenomicRanges)
#> Loading required package: IRanges
#> 
#> Attaching package: 'IRanges'
#> The following objects are masked from 'package:dplyr':
#> 
#>     collapse, desc, slice
#> Loading required package: GenomeInfoDb
gr <- GRanges("chr1", IRanges(1:50, 60))
mcols(gr) <- obj

## assay <- array(1:10, dim=c(50, 2))
## cdata <- obj[1:2, ]
## SE(assays = assya, rowRanges = gr, colData = cdata) ## see inst/extdata/sqldf_udpate.Rmd for FIXME tag!!!