1 Introduction

The majority of statistical strategies used in differential expression analysis of high-dimensional biological data (e.g., gene expression, CyTOF) are based on linear models or (analysis of variance). Linear models are favored in many biological problems mainly due to their ease of use and interpretability. However, some biological problems (e.g., gene expression, CyTOF) often require nonlinear models as linear models might be insufficient to capture the relationship among the co-expression features (e.g., relationships among signaling markers within (and across) cell subpopulations in CyTOF data). Kernel-based approaches extend the class of linear models to nonlinear models in a computationally tractable manner. Additionally, kernel-based approaches assume a more general relationship based on a Hilbert space of functions spanned by a certain kernel function instead of assuming a linear functional relationship.

In this vignette, we introduce a differential expression analysis of high-dimensional biological data using kernel based score test, referred to as kernel-based differential analysis (cytoKernel), which can identify differentially expressed features. The cytoKernel R-package contains the CytoK() function, which computes the adjusted p value for each feature based on two groups. Further, the package calculates the shrunken effective size and its corresponding effective size standard deviation (sd) using the Adaptive Shrinkage (ash) procedure (Stephens (2017)). We demonstrate with an example data set that the Differential expression analysis using Kernel-based Score test (cytoKernel) procedure can be adapted to flow and mass cytometry experiments along with other biological experiments (e.g., gene expression, RNASeq).

Consider a matrix of order \(p \times n\) with elements \(x_{ij}\), \(i=1,\dots,p\) and \(j=1,\dots,n\), where \(n\) is the number of samples (Group1+Group2 combined) and \(p\) is the number of features.

We observe the data \(\{x_{ij},y_j\}\), where \(x_{ij}\) is the median intensity for sample \(j\) and feature \(i\). \(y_j\) (binary response) is the group (or condition) label for sample \(j\) (\(y_j=0\) (Group1) and \(y_j=1\) (Group2).

For feature \(i\), \(i=1,\dots,p\), we define a simple logistic (semi-parametric) model,

\[ logit[P(y_j=1)] = \beta_0+ f(x_{ij}), \]

where \(f(.)\) is a centered smooth function in a Reproducible Kernel Hilbert Space (RKHS) spanned by \(k\).

If \(H_0:~f(.)=0\), then feature expression value \(x_{ij}\) is not related to the group labels \(y_j\) for feature \(i\) i.e., feature \(i\) is differentially expressed.

Let \(K\) be a \(n \times n\) Gram matrix with \(K_{st}=K_{\rho}(x_{sj},~x_{tj})\). Here, \(k_{\rho}(.,.)\) is the reproducing kernel of RKHS which contains \(f(.)\) and \(\rho\) is an unknown kernel parameter.

Let \(\mathbf{y}\) be a \(n \times 1\) vector of \(0\) and \(1\). The score test statistic under null hypothesis (\(H_0:~f(.)=0\)) in the logistic model defined above is, \[ S(\rho) = \frac{Q(\rho)- \mu_{Q}}{\sigma_{Q}}, \]

where \(Q(\rho)=(\mathbf{y}-\mathbf{\hat{\mu_0}})^{\mathbf{T}}\mathbf{K}(\mathbf{y}-\mathbf{\hat{\mu_0}})\), \(\mathbf{\hat{\mu_0}}={logit}^{-1}\hat{\beta_0}\) and \(\hat{\beta_0}\) is the estimate of \(\beta_0\) under null model.

More details about the estimation of \(\mu_{Q}\), \(\sigma_{Q}\) and choices of \(\rho\) in (Zhan, Patterson, and Ghosh (2015), Liu, Ghosh, and Lin (2008), Davies (1987).

\(Q(\rho)\) can be rewritten as, \[ Q(\rho) = \sum_s {\sum_t {{k(x_{is},x_{it})}(y_{s}-\hat{\mu_0})(y_{t}-\hat{\mu_0})}}, \] which is the component-wise product of matrices \(\mathbf{K}\) and \((\mathbf{y}-\mathbf{\hat{\mu_0}})(\mathbf{y}-\mathbf{\hat{\mu_0}})^{\mathbf{T}}\).

We use a Gaussian distance based kernel: \[ k(x_{is},x_{it})= exp\left\{-\frac{(x_{is}-x_{it})^2}{\rho}\right\}. \]

\(S(\rho)\) has a Normal distribution for each value of \(\rho\) (Davies (1987)).

The data structure is shown in Figure 1 below.
Feature-Sample data matrix

Figure 1: Feature-Sample data matrix

2 Getting Started

Load the package in R

library(cytoKernel) 

3 cytoHD Data

The cytoKernel package contains a pre-processed median marker expressions data SummarizedExperiment assay object of 126 cluster-marker combinations (features) measured in 8 subjects (4 measured before and 4 upon BCR/FcR-XL stimulation (BCRXL) with B cell receptor/Fc receptor crosslinking for 30’, resulting in a total of 8 samples) from (Bodenmiller et al. (2012)) that was also used in the CITRUS paper (Bruggner et al. (2014)) and CATALYST (Crowell et al. (2020)). In this vignette, we only used a subset of the original raw cytometry data downloaded from the Bioconductor data package HDCytoData (Weber and Soneson (2019)) using the command (Bodenmiller_BCR_XL_flowSet()).

3.1 cytoHDBMW data pre-processing

The cytoHDBMW data in the cytoKernel package was pre-processed using the CATALYST Bioconductor package (Crowell et al. (2020)). The data pre-processing include \(4\) steps and they are as follows: 1. Creating a SingleCellExperiment Object: the flowSet data object along with the metadata are converted into a SingleCellExperiment object using the CATALYST R/Bioconductor package. 2. Clustering: We apply Louvain algorithm using the R package igraph (Csardi, Nepusz, and others (2006)) to cluster the expression values by the state markers (surface markers) (Traag, Waltman, and Van Eck (2019)). 3. Median: Medians are calculated within a cluster for every signaling marker and subject using the scuttle Bioconductor package (McCarthy et al. (2017)). 4. Aggregating and converting the data: We convert the aggregated data into a SummarizedExperiment.

data("cytoHDBMW")
cytoHDBMW
## class: SummarizedExperiment 
## dim: 126 8 
## metadata(0):
## assays(1): exprs
## rownames(126): pNFkB pp38 ... pLat pS6
## rowData names(1): cluster
## colnames(8): Ref1 Ref2 ... BCRXL3 BCRXL4
## colData names(4): sample_id condition patient_id ids

4 Using the CytoK() function

4.1 Input for CytoK()

The CytoK() function must have two object as input: 1. object: a data frame or a matrix or a SummarizedExperiment object with abundance measurements of metabolites (features) on the rows and samples (samples) as the columns. CytoK() accepts objects which are a data frame or matrix with observations (e.g. cluster-marker combinations) on the rows and samples as the columns. 2. group_factor: a binary categorical response variable that represents the group condition for each sample. For example if the samples represent two different groups or conditions (e.g., before stimulation and after stimulation), provide CytoK() with a phenotype representing which columns in the object are different groups. 3. lowerRho: optional a positive value that represents the lower bound of the kernel parameter. Default is 2. 4. upperRho: optional a positive value that represents the upper bound of the kernel parameter. Default is 12. 5. gridRho: optional a positive value that represents the number of grid points in the interval of upper and bound of the kernel parameter. Default is 4. 6. alpha: optional level of significance to control the False Discovery rate (FDR). Default is \(0.05\) (i.e., \(\alpha=0.05\)). 7. featureVars: optional Vector of the columns which identify features. If a SummarizedExperiment is used for data, row variables will be used. Default is NULL.

4.2 Running CytoK()

4.2.1 cytoHDBMW SummarizedExperiment example - Identifying differentially expressed features

We apply the CytoK procedure to identify the differentially expressed cluster-marker combinations in the cytoHDBMW data. To run the CytoK() function, we only input the data object and group factor. We obtain 3 outputs after running the CytoK() function. They are shown below:

library(cytoKernel)
CytoK_output<- CytoK(cytoHDBMW,group_factor = rep(c(0, 1),
               c(4, 4)),lowerRho=2,upperRho=12,gridRho=4,
               alpha = 0.05,featureVars = NULL)
CytoK_output
## CytoK: Differential expression using
##            kernel-based score test
## CytoKFeatures (length = 126 ): 
##        cluster EffectSize EffectSizeSD pvalue  padj
## pNFkB        1      2.024        0.673  0.002 0.007
## pp38         1      0.036        0.165  0.003 0.011
## pStat5       1      0.393        0.167  0.577 0.765
## ...
## CytoKFeaturesOrdered (length = 126 ): 
##        cluster EffectSize EffectSizeSD pvalue padj
## pBtk.1       2      0.433        0.241      0    0
## pS6.8        9      3.360        0.671      0    0
## pp38.7       8      0.037        0.069      0    0
## ...
## Head of the data.frame containing shrunken effect sizes, shrunken ##effect size sd's, p values and adjusted p values
head(CytoKFeatures(CytoK_output))
##        cluster EffectSize EffectSizeSD      pvalue        padj
## pNFkB        1 2.02367428   0.67290649 0.001970156 0.007301166
## pp38         1 0.03647958   0.16532500 0.003251778 0.011381222
## pStat5       1 0.39345165   0.16748947 0.576504359 0.765271619
## pAkt         1 2.09276549   0.70372729 0.644688311 0.765271619
## pStat1       1 0.13722890   0.27677527 0.552315231 0.765271619
## pSHP2        1 0.44989429   0.08264906 0.681940272 0.765271619
## Head of the data.frame containing shrunken effect sizes, shrunken ##effect size sd's, p values and adjusted p values ordered by ##adjusted p values from low to high
head(CytoKFeaturesOrdered(CytoK_output))
##          cluster  EffectSize EffectSizeSD       pvalue         padj
## pBtk.1         2 0.433279948   0.24130035 7.831438e-07 4.933806e-05
## pS6.8          9 3.359722803   0.67069336 5.840003e-07 4.933806e-05
## pp38.7         8 0.036670302   0.06854772 1.669379e-06 7.011390e-05
## pp38.6         7 0.003349933   0.02985019 2.938640e-06 9.256716e-05
## pNFkB.2        3 2.399243818   0.54714335 4.314569e-06 1.079152e-04
## pPlcg2.8       9 0.539173837   0.12741839 5.138821e-06 1.079152e-04
## Percent of differentially expressed features
CytoKDEfeatures(CytoK_output)
## [1] 34.12698

4.3 Filtering the data by differentially expressed features

## Filtering the data by reproducible features
CytoKDEData_HD<- CytoKDEData(CytoK_output, by = "features")
CytoKDEData_HD
## class: SummarizedExperiment 
## dim: 43 8 
## metadata(1): nonDEfeatures
## assays(1): exprs
## rownames(43): pNFkB pp38 ... pErk pS6
## rowData names(5): cluster EffectSize EffectSizeSD pvalue padj
## colnames(8): Ref1 Ref2 ... BCRXL3 BCRXL4
## colData names(4): sample_id condition patient_id ids

4.4 Heatmap of the expression matrix

This heatmap illustrates the expression profiles with differentially expressed features (cluster-marker combinations) on the rows and patients on the columns from the kernel-based score test implemented using the CytoK function. The differentially expressed data (matrix) can be extracted using the (CytoKDEData) function (see above). The generic heatmap of the differentially expressed expression matrix can be plotted using the plotCytoK() function. Any specific meta information can also be added using the ComplexHeatmap package. An illustration is shown below where the cluster ids are separately added onto the heatmap generated by the plotCytoK() function.

heatmap1<- plotCytoK(CytoK_output,
group_factor = rep(c(0, 1), c(4, 4)),topK=10,
featureVars = NULL)
featureOrderedExtracted<- CytoKFeaturesOrdered(CytoK_output)
rowmeta_cluster<- featureOrderedExtracted$cluster
topK<- 10
rowmeta_clusterTopK<- rowmeta_cluster[seq_len(topK)]
library(ComplexHeatmap)
heatmap2<- Heatmap(rowmeta_clusterTopK,
             name = "cluster")
heatmap2+heatmap1
Differentially expressed (top 10) cluster-marker data using cytoKernel

Figure 2: Differentially expressed (top 10) cluster-marker data using cytoKernel

5 Session Info

sessionInfo() 
## R Under development (unstable) (2023-10-22 r85388)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ComplexHeatmap_2.19.0 S4Vectors_0.41.0      cytoKernel_1.9.0     
## [4] knitr_1.44            BiocStyle_2.31.0     
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.33.0 circlize_0.4.15            
##  [3] shape_1.4.6                 rjson_0.2.21               
##  [5] xfun_0.40                   bslib_0.5.1                
##  [7] GlobalOptions_0.1.2         Biobase_2.63.0             
##  [9] lattice_0.22-5              Cairo_1.6-1                
## [11] vctrs_0.6.4                 tools_4.4.0                
## [13] bitops_1.0-7                generics_0.1.3             
## [15] stats4_4.4.0                parallel_4.4.0             
## [17] tibble_3.2.1                fansi_1.0.5                
## [19] cluster_2.1.4               pkgconfig_2.0.3            
## [21] Matrix_1.6-1.1              data.table_1.14.8          
## [23] SQUAREM_2021.1              RColorBrewer_1.1-3         
## [25] lifecycle_1.0.3             GenomeInfoDbData_1.2.11    
## [27] truncnorm_1.0-9             compiler_4.4.0             
## [29] codetools_0.2-19            clue_0.3-65                
## [31] GenomeInfoDb_1.39.0         htmltools_0.5.6.1          
## [33] sass_0.4.7                  RCurl_1.98-1.12            
## [35] yaml_2.3.7                  pillar_1.9.0               
## [37] crayon_1.5.2                jquerylib_0.1.4            
## [39] BiocParallel_1.37.0         cachem_1.0.8               
## [41] DelayedArray_0.29.0         magick_2.8.1               
## [43] iterators_1.0.14            abind_1.4-5                
## [45] foreach_1.5.2               tidyselect_1.2.0           
## [47] digest_0.6.33               dplyr_1.1.3                
## [49] bookdown_0.36               ashr_2.2-63                
## [51] fastmap_1.1.1               colorspace_2.1-0           
## [53] cli_3.6.1                   invgamma_1.1               
## [55] SparseArray_1.3.0           magrittr_2.0.3             
## [57] S4Arrays_1.3.0              utf8_1.2.4                 
## [59] withr_2.5.1                 rmarkdown_2.25             
## [61] XVector_0.43.0              matrixStats_1.0.0          
## [63] png_0.1-8                   GetoptLong_1.0.5           
## [65] evaluate_0.22               GenomicRanges_1.55.0       
## [67] IRanges_2.37.0              doParallel_1.0.17          
## [69] irlba_2.3.5.1               rlang_1.1.1                
## [71] Rcpp_1.0.11                 mixsqp_0.3-48              
## [73] glue_1.6.2                  BiocManager_1.30.22        
## [75] BiocGenerics_0.49.0         jsonlite_1.8.7             
## [77] R6_2.5.1                    MatrixGenerics_1.15.0      
## [79] zlibbioc_1.49.0

References

Bodenmiller, Bernd, Eli R Zunder, Rachel Finck, Tiffany J Chen, Erica S Savig, Robert V Bruggner, Erin F Simonds, et al. 2012. “Multiplexed Mass Cytometry Profiling of Cellular States Perturbed by Small-Molecule Regulators.” Nature Biotechnology 30 (9): 858–67.

Bruggner, Robert V, Bernd Bodenmiller, David L Dill, Robert J Tibshirani, and Garry P Nolan. 2014. “Automated Identification of Stratifying Signatures in Cellular Subpopulations.” Proceedings of the National Academy of Sciences 111 (26): E2770–E2777.

Crowell, Helena L, Stéphane Chevrier, Andrea Jacobs, Sujana Sivapatham, Bernd Bodenmiller, Mark D Robinson, Tumor Profiler Consortium, and others. 2020. “An R-Based Reproducible and User-Friendly Preprocessing Pipeline for Cytof Data.” F1000Research 9 (1263): 1263.

Csardi, Gabor, Tamas Nepusz, and others. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal, Complex Systems 1695 (5): 1–9.

Davies, Robert B. 1987. “Hypothesis Testing When a Nuisance Parameter Is Present Only Under the Alternative.” Biometrika 74 (1): 33–43.

Liu, Dawei, Debashis Ghosh, and Xihong Lin. 2008. “Estimation and Testing for the Effect of a Genetic Pathway on a Disease Outcome Using Logistic Kernel Machine Regression via Logistic Mixed Models.” BMC Bioinformatics 9 (1): 1–11.

McCarthy, Davis J, Kieran R Campbell, Aaron TL Lun, and Quin F Wills. 2017. “Scater: Pre-Processing, Quality Control, Normalization and Visualization of Single-Cell Rna-Seq Data in R.” Bioinformatics 33 (8): 1179–86.

Stephens, Matthew. 2017. “False Discovery Rates: A New Deal.” Biostatistics 18 (2): 275–94.

Traag, Vincent A, Ludo Waltman, and Nees Jan Van Eck. 2019. “From Louvain to Leiden: Guaranteeing Well-Connected Communities.” Scientific Reports 9 (1): 1–12.

Weber, Lukas M, and Charlotte Soneson. 2019. “HDCytoData: Collection of High-Dimensional Cytometry Benchmark Datasets in Bioconductor Object Formats.” F1000Research 8.

Zhan, Xiang, Andrew D Patterson, and Debashis Ghosh. 2015. “Kernel Approaches for Differential Expression Analysis of Mass Spectrometry-Based Metabolomics Data.” BMC Bioinformatics 16 (1): 1–13.