1 Introduction

The TENxPBMCData package provides an R / Bioconductor resource for representing and manipulating nine single-cell RNA-seq (scRNA-seq) data sets and one CITE-seq data set on peripheral blood mononuclear cells (PBMC) generated by 10X Genomics:

  1. pbmc68k
  2. frozen_pbmc_donor_a
  3. frozen_pbmc_donor_b
  4. frozen_pbmc_donor_c
  5. pbmc33k
  6. pbmc3k
  7. pbmc6k
  8. pbmc4k
  9. pbmc8k
  10. pbmc5k-CITEseq

The number in the dataset title is roughly the number of cells in the experiment.

This package makes extensive use of the HDF5Array package to avoid loading the entire data set into memory. The counts are instead stored on disk as an HDF5 file, and subsets of the data are loaded into memory upon request.

Note: The purpose of this package is to provide testing and example data for Bioconductor packages. We have done no processing of the “filtered” 10X scRNA-seq or CITE-seq data; it is delivered as is.

2 Workflow

2.1 Loading the data

We use the TENxPBMCData function to download the relevant files from Bioconductor’s ExperimentHub web resource. These include the HDF5 file containing the counts, as well as the metadata on the rows (genes) and columns (cells). The output is a single SingleCellExperiment object from the SingleCellExperiment package. This class extends SummarizedExperiment with a number of features specific to single-cell data.

library(TENxPBMCData)
tenx_pbmc4k <- TENxPBMCData(dataset = "pbmc4k")
tenx_pbmc4k
## class: SingleCellExperiment 
## dim: 33694 4340 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Note: The pbmc68k dataset may be of particular interest to some users because of its size.

The first call to TENxPBMCData() may take some time because it downloads some moderately large files. The files are then cached locally, so subsequent calls in the same or new sessions are fast. Use the dataset argument to select which dataset to download; the accepted values are visible in the function signature:

args(TENxPBMCData)
## function (dataset = c("pbmc4k", "pbmc68k", "frozen_pbmc_donor_a", 
##     "frozen_pbmc_donor_b", "frozen_pbmc_donor_c", "pbmc33k", 
##     "pbmc3k", "pbmc6k", "pbmc8k", "pbmc5k-CITEseq"), as.sparse = TRUE) 
## NULL
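
The vector default in the signature above follows a common R idiom in which the first element is used when the argument is omitted. The sketch below illustrates that idiom with a hypothetical helper, pick_dataset(), which is not part of TENxPBMCData. It assumes the choice is resolved with match.arg(), which the signature suggests but which is not confirmed here.

```r
# Hypothetical helper (not part of TENxPBMCData) illustrating how a vector
# default is typically resolved to a single choice with match.arg().
pick_dataset <- function(dataset = c("pbmc4k", "pbmc68k", "pbmc3k")) {
  match.arg(dataset)
}

default_choice <- pick_dataset()            # argument omitted: first element
explicit_choice <- pick_dataset("pbmc68k")  # valid name: returned as is

# An unknown name raises an error rather than silently falling back;
# the error is caught here and turned into NA for illustration.
bad_choice <- tryCatch(pick_dataset("pbmc99k"),
                       error = function(e) NA_character_)
```

Under this assumption, a misspelled dataset name fails early instead of triggering the download of an unintended dataset.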

The count matrix itself is represented as a DelayedMatrix from the DelayedArray package. This wraps the underlying HDF5 file in a container that can be manipulated in R. Each count represents the number of unique molecular identifiers (UMIs) assigned to a particular gene in a particular cell.

counts(tenx_pbmc4k)
## <33694 x 4340> sparse DelayedMatrix object of type "integer":
##                    [,1]    [,2]    [,3]    [,4] ... [,4337] [,4338] [,4339]
## ENSG00000243485       0       0       0       0   .       0       0       0
## ENSG00000237613       0       0       0       0   .       0       0       0
## ENSG00000186092       0       0       0       0   .       0       0       0
## ENSG00000238009       0       0       0       0   .       0       0       0
## ENSG00000239945       0       0       0       0   .       0       0       0
##             ...       .       .       .       .   .       .       .       .
## ENSG00000277856       0       0       0       0   .       0       0       0
## ENSG00000275063       0       0       0       0   .       0       0       0
## ENSG00000271254       0       0       0       0   .       0       0       0
## ENSG00000277475       0       0       0       0   .       0       0       0
## ENSG00000268674       0       0       0       0   .       0       0       0
##                 [,4340]
## ENSG00000243485       0
## ENSG00000237613       0
## ENSG00000186092       0
## ENSG00000238009       0
## ENSG00000239945       0
##             ...       .
## ENSG00000277856       0
## ENSG00000275063       0
## ENSG00000271254       0
## ENSG00000277475       0
## ENSG00000268674       0
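
To illustrate what "delayed" means in practice, the following minimal sketch wraps a small simulated in-memory matrix in a DelayedArray, so that it runs without downloading anything. The object names (toy, toy_delayed, sub_delayed, sub_dense) are made up for this example; the same operations are expected to apply to counts(tenx_pbmc4k), where the data would be read from the HDF5 file instead.

```r
library(DelayedArray)

# Small in-memory stand-in for the on-disk counts (simulated values).
set.seed(1)
toy <- matrix(rpois(60, lambda = 1), nrow = 10,
              dimnames = list(paste0("gene", 1:10), NULL))
toy_delayed <- DelayedArray(toy)

# Subsetting and arithmetic are recorded on the object, not executed yet.
sub_delayed <- log2(toy_delayed[1:3, 1:4] + 1)
class(sub_delayed)

# as.matrix() realizes the delayed operations as an ordinary matrix.
sub_dense <- as.matrix(sub_delayed)
```

Realizing only a small subset in this way keeps memory use proportional to the subset rather than to the full matrix.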

2.2 Exploring the data

To quickly explore the data set, we compute some summary statistics on the count matrix. We first set the DelayedArray block size to indicate that up to 1 GB of memory may be used when loading data from disk.

options(DelayedArray.block.size=1e9)

We are interested in the library sizes, colSums(counts(tenx_pbmc4k)); the number of genes expressed per cell, colSums(counts(tenx_pbmc4k) != 0); and the average expression of each gene across cells, rowMeans(counts(tenx_pbmc4k)). A naive implementation might be

lib.sizes <- colSums(counts(tenx_pbmc4k))
n.exprs <- colSums(counts(tenx_pbmc4k) != 0L)
ave.exprs <- rowMeans(counts(tenx_pbmc4k))
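
The naive version traverses the count matrix three times. The sketch below shows the idea behind block processing using base R only: columns are visited in chunks, and all three statistics are accumulated in a single pass. The helper chunked_col_stats() and its chunk_size parameter are hypothetical and operate on a simulated matrix; this illustrates the principle and is not how DelayedArray is implemented.

```r
# Simulated counts standing in for a matrix too large to load at once.
set.seed(42)
toy_counts <- matrix(rpois(2000, lambda = 0.5), nrow = 20, ncol = 100)

# Hypothetical helper: visit the columns in chunks so that only
# `chunk_size` columns need to be held in memory at any time.
chunked_col_stats <- function(x, chunk_size = 25L) {
  starts <- seq(1L, ncol(x), by = chunk_size)
  lib_sizes <- numeric(ncol(x))
  n_exprs <- numeric(ncol(x))
  row_totals <- numeric(nrow(x))
  for (start in starts) {
    idx <- start:min(start + chunk_size - 1L, ncol(x))
    block <- x[, idx, drop = FALSE]  # for on-disk data, this is the read
    lib_sizes[idx] <- colSums(block)
    n_exprs[idx] <- colSums(block != 0L)
    row_totals <- row_totals + rowSums(block)
  }
  list(lib.sizes = lib_sizes,
       n.exprs = n_exprs,
       ave.exprs = row_totals / ncol(x))
}

stats <- chunked_col_stats(toy_counts)
```

A larger chunk means fewer reads but more memory per read, which is the trade-off that the block size option controls.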

More advanced analysis procedures are implemented in various Bioconductor packages; see the SingleCell biocViews term for more details.

2.3 Saving computations

Saving the tenx_pbmc4k object in a standard manner, e.g.,

destination <- tempfile()
saveRDS(tenx_pbmc4k, file = destination)

saves the row-, column-, and meta-data as an R object, and remembers the location and subset of the HDF5 file from which the object is derived. The object can be read into a new R session with readRDS(destination), provided the HDF5 file remains in its original location.
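
When the object needs to be moved or shared, one possible alternative is saveHDF5SummarizedExperiment() from the HDF5Array package, which writes the assay data and the object into a single directory. The sketch below uses a small simulated object (toy_sce, a made-up stand-in for tenx_pbmc4k) so that it runs without any download; whether this approach suits the full datasets is not addressed here.

```r
library(SingleCellExperiment)
library(HDF5Array)

# Toy object with simulated counts; stands in for tenx_pbmc4k.
set.seed(1)
toy_sce <- SingleCellExperiment(
  assays = list(counts = matrix(rpois(40, lambda = 1), nrow = 8))
)

# Write the assay data and the object itself into one new directory.
out_dir <- tempfile("toy_sce_")
saveHDF5SummarizedExperiment(toy_sce, dir = out_dir)

# The directory can be moved as a unit and reloaded later.
reloaded <- loadHDF5SummarizedExperiment(out_dir)
```

The reloaded object keeps its assay on disk inside out_dir, so it no longer depends on the location of the original HDF5 file.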

2.4 CITE-seq datasets

For CITE-seq datasets, both the transcriptomics data and the antibody capture data are available from a single SingleCellExperiment object. While the transcriptomics data can be accessed directly as described above, the antibody capture data should be accessed with the altExp function. Again, the resulting count matrix is represented as a DelayedMatrix.

tenx_pbmc5k_CITEseq <- TENxPBMCData(dataset = "pbmc5k-CITEseq")

counts(altExp(tenx_pbmc5k_CITEseq))
## <32 x 5247> sparse DelayedMatrix object of type "integer":
##           [,1]    [,2]    [,3]    [,4] ... [,5244] [,5245] [,5246] [,5247]
##    CD3      25     959     942     802   .     402     401       6    1773
##    CD4     164     720    1647    1666   .    1417       1      46    1903
##   CD8a      16       8      21       5   .       8     222       3       9
##  CD11b    3011      12      11      11   .      15       7    1027       9
##   CD14     696      12      13       9   .       9      17     382       8
##    ...       .       .       .       .   .       .       .       .       .
## HLA-DR     573      15      11      19   .       6      40     184      32
##  TIGIT      10       3       3       3   .       2      15       1      12
##   IgG1       4       4       2       4   .       1       0       2       4
##  IgG2a       1       3       0       6   .       4       0       4       2
##  IgG2b       6       2       4       8   .       0       0       2       5

3 Session information

sessionInfo()
## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] TENxPBMCData_1.23.0         HDF5Array_1.33.0           
##  [3] rhdf5_2.49.0                DelayedArray_0.31.0        
##  [5] SparseArray_1.5.0           S4Arrays_1.5.0             
##  [7] abind_1.4-5                 Matrix_1.7-0               
##  [9] SingleCellExperiment_1.27.0 SummarizedExperiment_1.35.0
## [11] Biobase_2.65.0              GenomicRanges_1.57.0       
## [13] GenomeInfoDb_1.41.0         IRanges_2.39.0             
## [15] S4Vectors_0.43.0            BiocGenerics_0.51.0        
## [17] MatrixGenerics_1.17.0       matrixStats_1.3.0          
## [19] knitr_1.46                  BiocStyle_2.33.0           
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.45.0         xfun_0.43               bslib_0.7.0            
##  [4] lattice_0.22-6          rhdf5filters_1.17.0     vctrs_0.6.5            
##  [7] tools_4.4.0             generics_0.1.3          curl_5.2.1             
## [10] AnnotationDbi_1.67.0    tibble_3.2.1            fansi_1.0.6            
## [13] RSQLite_2.3.6           blob_1.2.4              pkgconfig_2.0.3        
## [16] dbplyr_2.5.0            lifecycle_1.0.4         GenomeInfoDbData_1.2.12
## [19] compiler_4.4.0          Biostrings_2.73.0       htmltools_0.5.8.1      
## [22] sass_0.4.9              yaml_2.3.8              pillar_1.9.0           
## [25] crayon_1.5.2            jquerylib_0.1.4         cachem_1.0.8           
## [28] mime_0.12               ExperimentHub_2.13.0    AnnotationHub_3.13.0   
## [31] tidyselect_1.2.1        digest_0.6.35           purrr_1.0.2            
## [34] dplyr_1.1.4             bookdown_0.39           BiocVersion_3.20.0     
## [37] fastmap_1.1.1           grid_4.4.0              cli_3.6.2              
## [40] magrittr_2.0.3          utf8_1.2.4              withr_3.0.0            
## [43] rappdirs_0.3.3          filelock_1.0.3          UCSC.utils_1.1.0       
## [46] bit64_4.0.5             rmarkdown_2.26          XVector_0.45.0         
## [49] httr_1.4.7              bit_4.0.5               png_0.1-8              
## [52] memoise_2.0.1           evaluate_0.23           BiocFileCache_2.13.0   
## [55] rlang_1.1.3             glue_1.7.0              DBI_1.2.2              
## [58] BiocManager_1.30.22     jsonlite_1.8.8          R6_2.5.1               
## [61] Rhdf5lib_1.27.0         zlibbioc_1.51.0