Contents

1 Introduction

The barbieQ package provides a series of robust statistical tools for analysing barcode count data generated from cell clonal tracking (lineage tracing) experiments.

In these experiments, an initial cell and its offspring collectively form a clone (or lineage). A unique DNA barcode, incorporated into the genome of an initial cell, is inherited by all its progeny within the clone. This one-to-one mapping of barcodes to clones enables tracking of clonal behaviours. By quantifying barcode counts, researchers can measure the abundance of individual clones under various experimental conditions or perturbations.

While existing tools for barcode count data analysis primarily rely on qualitative interpretation through visualizations, they often lack robust methods to model the sources of barcode variability. [barcodetrackR, CellDestiny, genBaRcode]

To address this gap, this R software package, barbieQ, provides advanced statistical methods to model barcode variability. The package supports preprocessing, visualization, and statistical testing to identify barcodes with significant differences in proportions or occurrences across experimental conditions. Key functionalities include initializing data structures, filtering barcodes, and applying regression models to test for significant clonal changes.

The main functions include:

2 Intallation

# You can install the stable version of barbieQ like so:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("barbieQ")

## Alternatively, you can install the development version of barbieQ from GitHub
# devtools::install_github("Oshlack/barbieQ")

3 Load Dependecy

suppressPackageStartupMessages({
  library(barbieQ)
  library(magrittr)
  library(tidyr)
  library(dplyr)
  library(ggplot2)
  library(circlize)
  library(logistf)
  library(igraph)
  library(data.table)
  library(ComplexHeatmap)
  library(limma)
  library(SummarizedExperiment)
  library(S4Vectors)
})
set.seed(2025)

4 Load Package Data

A subset of data from a study on monkey HSPC cell expansion using barcoding technique.1 NK clonal expansion(http://dx.doi.org/10.1126/sciimmunol.aat9781)), barcodetrackRData^ Barcode counts within different samples were used to interpret the patterns of HSPC differentiation.

It is a SummarizedExperiment object created using function barbieQ::createBarbie, containing a barcode count matrix with 16,603 rows and 62 columns, and a data frame of sample metadata.

data(monkeyHSPC, package = "barbieQ")

5 Example

5.1 Create barbieQ Object

Please start with creating a barbieQ structure by taking the barcode count matrix as input, using function createBarbieQ().

By creating a barbieQ object, you’ve automatically processed a series of data transformation and saved transformed data in the barbieQ object, for the convenience of the following analyses.

## Passing `object`, `sampleMetadata` and `factorColors` for optional
exampleBBQ <- createBarbieQ(
  object = SummarizedExperiment::assay(monkeyHSPC), 
  sampleMetadata = SummarizedExperiment::colData(monkeyHSPC)$sampleMetadata
)
#> continuing with missing `factorColors`.

5.2 Subset Dataset

Here we subset the object by selecting samples of certain stage of collection time.

Define “early”, “mid”, and “late” stages based on Months, and clean up Celltype.

updateSampleMetadata <- exampleBBQ$sampleMetadata %>% 
  as.data.frame() %>% 
  select(Celltype, Months) %>% 
  mutate(Phase = ifelse(Months < 6, "early", ifelse(Months >=55, "late", "mid"))) %>% 
  mutate(Celltype = gsub("(Gr).*", "\\1", Celltype))

SummarizedExperiment::colData(exampleBBQ)$sampleMetadata <- S4Vectors::DataFrame(updateSampleMetadata)

exampleBBQ$sampleMetadata
#> DataFrame with 62 rows and 3 columns
#>                                                  Celltype    Months       Phase
#>                                               <character> <numeric> <character>
#> ZG66_6.5m_T                                             T       6.5         mid
#> ZG66_12m_T                                              T      12.0         mid
#> ZG66_17m_T                                              T      17.0         mid
#> ZG66_27m_T                                              T      27.0         mid
#> ZG66_36m_T                                              T      36.0         mid
#> ...                                                   ...       ...         ...
#> ZG66_58m_NK_NKG2Ap_CD16p_KIR3DL01n NK_NKG2Ap_CD16p_KIR3..        58        late
#> ZG66_58m_NK_NKG2Ap_CD16p_KIR3DL01p NK_NKG2Ap_CD16p_KIR3..        58        late
#> ZG66_68m_NK_NKG2Ap_CD16p                  NK_NKG2Ap_CD16p        68        late
#> ZG66_68m_NK_NKG2Ap_CD16p_KIR3DL01n NK_NKG2Ap_CD16p_KIR3..        68        late
#> ZG66_68m_NK_NKG2Ap_CD16p_KIR3DL01p NK_NKG2Ap_CD16p_KIR3..        68        late

Subset dataset and keep samples of “mid” stage only.

flag_sample <- exampleBBQ$sampleMetadata$Phase == "mid"
exampleBBQ <- exampleBBQ[, flag_sample]
exampleBBQ$sampleMetadata
#> DataFrame with 42 rows and 3 columns
#>                                 Celltype    Months       Phase
#>                              <character> <numeric> <character>
#> ZG66_6.5m_T                            T       6.5         mid
#> ZG66_12m_T                             T      12.0         mid
#> ZG66_17m_T                             T      17.0         mid
#> ZG66_27m_T                             T      27.0         mid
#> ZG66_36m_T                             T      36.0         mid
#> ...                                  ...       ...         ...
#> ZG66_22m_Gr                           Gr      22.0         mid
#> ZG66_24m_Gr                           Gr      24.0         mid
#> ZG66_36m_Gr_2                         Gr      36.0         mid
#> ZG66_14.5m_NK_CD56n_CD16p NK_CD56n_CD16p      14.5         mid
#> ZG66_14.5m_NK_CD56p_CD16n NK_CD56p_CD16n      14.5         mid

5.3 Tag Top Contributing Barcodes

A filtering step is recommended to filter out barcodes that consistently show low counts across the dataset. The retained barcodes are considered with essential contribution, called “top barcodes”.

  • By applying the function tagTopBarcodes() to the barbieQ object, you’re determining which barcodes are “top barcodes” while tagging them in the object.

  • In this example dataset, we are interested in the differences on barcode outcomes between cell types, so we will group samples by cell types. We set up the nSampleThreshold as the minimum group size, which is 6.

## Check out minimum group size.
table(exampleBBQ$sampleMetadata$Celltype)
#> 
#>              B             Gr NK_CD56n_CD16p NK_CD56p_CD16n              T 
#>              6             12              7              7             10
## Tag top Barcodes.
exampleBBQ <- tagTopBarcodes(barbieQ = exampleBBQ, nSampleThreshold = 6)
  • Once “top barcodes” are determined and tagged, it’s good to check their contributions before cutting off the “bottom barcodes” that are considered as non-essential contributors.

    • By applying the function plotBarcodePareto() to the barbieQ object, you are visualizing the contribution of each barcode colour coded by “top” or “bottom”. (Here, the contribution refers to the average proportion of individual barcodes across samples in the dataset.)

    • By applying the function plotBarcodeSankey() to the barbieQ object, you are visualizing the collective contribution of the “top” or “bottom” barcode groups.

## visualize contribution of top vs. bottom barcodes
plotBarcodePareto(barbieQ = exampleBBQ) |> plot()
#> Warning: Removed 10 rows containing missing values or values outside the scale range
#> (`geom_bar()`).

## visualize collective contribution of top vs. bottom barcodes
plotBarcodeSankey(barbieQ = exampleBBQ) |> plot()

  • Once you are happy with the classification of “top” and “bottom” barcodes, you can filter out the “bottom” barcodes by subsetting the SE object based on the tagged array.
flag_barcode <- SummarizedExperiment::rowData(exampleBBQ)$isTopBarcode$isTop
exampleBBQ <- exampleBBQ[flag_barcode,]

5.4 Visualize Sample Correlation

To generally understand the sample similarity, you can visualize sample pair-wise correlations in a check board style by applying the function plotSamplePairCorrelation() to the barbieQ() object.

## visualize sample pair wise correlation
plotSamplePairCorrelation(barbieQ = exampleBBQ) |> plot()
#> setting Celltype as the primary factor in `sampleMetadata`.
#> displaying pearson correlation coefficient between samples on Barcode log2 CPM+1.

5.5 Test Barcodes and Identify Significant Changes

Based on the understanding of sample conditions that likely to be the source of variability in barcode outcomes, you can robustly test the significance of the barcode changes between the sample conditions, by applying the function testBarcodeSignif() to the barbieQ object. The testing results will be saved in the object, and can be further visualized using functions: plotBarcodeMA(), plotSignifBarcodeHeatmap(), plotSignifBarcodeProportion(), and etc.

  • By specifying the parameter method by diffProp (defaulted), you are testing individual barcodes’ differential proportion between conditions.

  • By specifying the parameter method by diffOcc, you are testing individual barcodes’ differential occurrence between conditions.

## test Barcode differential proportion between sample groups
exampleBBQ <- testBarcodeSignif(
  barbieQ = exampleBBQ,
  contrastFormula = "(CelltypeNK_CD56n_CD16p) - (CelltypeB+CelltypeGr+CelltypeT+CelltypeNK_CD56p_CD16n)/4",
  method = "diffProp"
)
#> setting Celltype as the primary factor in `sampleMetadata`.
#> removing factors with only one level from sampleMetadata: NA
#> no block specified, so there are no duplicate measurements.
plotBarcodeMA(exampleBBQ) |> plot()

plotSignifBarcodeHeatmap(exampleBBQ) |> plot()
#> setting testingBarcode as the primary factor in `sampleMetadata`.

plotSignifBarcodeProportion(exampleBBQ) |> plot()

6 References

Appendix

We are currently writing a paper to introduce the methods and approaches implemented in this barbieQ package and will update with a citation once available.

A SessionInfo

sessionInfo()
#> R Under development (unstable) (2025-03-13 r87965)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    grid      stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [3] GenomicRanges_1.59.1        GenomeInfoDb_1.43.4        
#>  [5] IRanges_2.41.3              S4Vectors_0.45.4           
#>  [7] BiocGenerics_0.53.6         generics_0.1.3             
#>  [9] MatrixGenerics_1.19.1       matrixStats_1.5.0          
#> [11] limma_3.63.12               ComplexHeatmap_2.23.1      
#> [13] data.table_1.17.0           igraph_2.1.4               
#> [15] logistf_1.26.0              circlize_0.4.16            
#> [17] ggplot2_3.5.1               dplyr_1.1.4                
#> [19] tidyr_1.3.1                 magrittr_2.0.3             
#> [21] barbieQ_0.99.1              BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rdpack_2.6.3            sandwich_3.1-1          rlang_1.1.5            
#>  [4] multcomp_1.4-28         clue_0.3-66             GetoptLong_1.0.5       
#>  [7] compiler_4.6.0          mgcv_1.9-3              png_0.1-8              
#> [10] vctrs_0.6.5             pkgconfig_2.0.3         shape_1.4.6.1          
#> [13] crayon_1.5.3            fastmap_1.2.0           magick_2.8.6           
#> [16] backports_1.5.0         XVector_0.47.2          labeling_0.4.3         
#> [19] rmarkdown_2.29          UCSC.utils_1.3.1        nloptr_2.2.1           
#> [22] tinytex_0.56            purrr_1.0.4             xfun_0.52              
#> [25] glmnet_4.1-8            jomo_2.7-6              cachem_1.1.0           
#> [28] jsonlite_2.0.0          DelayedArray_0.33.6     pan_1.9                
#> [31] broom_1.0.8             parallel_4.6.0          cluster_2.1.8.1        
#> [34] R6_2.6.1                bslib_0.9.0             RColorBrewer_1.1-3     
#> [37] boot_1.3-31             rpart_4.1.24            estimability_1.5.1     
#> [40] jquerylib_0.1.4         Rcpp_1.0.14             bookdown_0.42          
#> [43] iterators_1.0.14        knitr_1.50              zoo_1.8-13             
#> [46] Matrix_1.7-3            splines_4.6.0           nnet_7.3-20            
#> [49] tidyselect_1.2.1        abind_1.4-8             yaml_2.3.10            
#> [52] doParallel_1.0.17       codetools_0.2-20        lattice_0.22-7         
#> [55] tibble_3.2.1            withr_3.0.2             coda_0.19-4.1          
#> [58] evaluate_1.0.3          survival_3.8-3          pillar_1.10.2          
#> [61] BiocManager_1.30.25     mice_3.17.0             foreach_1.5.2          
#> [64] reformulas_0.4.0        munsell_0.5.1           scales_1.3.0           
#> [67] minqa_1.2.8             xtable_1.8-4            glue_1.8.0             
#> [70] emmeans_1.11.0          tools_4.6.0             lme4_1.1-37            
#> [73] mvtnorm_1.3-3           Cairo_1.6-2             rbibutils_2.3          
#> [76] colorspace_2.1-1        nlme_3.1-168            formula.tools_1.7.1    
#> [79] GenomeInfoDbData_1.2.14 cli_3.6.4               S4Arrays_1.7.3         
#> [82] gtable_0.3.6            sass_0.4.9              digest_0.6.37          
#> [85] operator.tools_1.6.3    TH.data_1.1-3           SparseArray_1.7.7      
#> [88] farver_2.1.2            rjson_0.2.23            htmltools_0.5.8.1      
#> [91] lifecycle_1.0.4         httr_1.4.7              GlobalOptions_0.1.2    
#> [94] mitml_0.4-5             statmod_1.5.0           MASS_7.3-65