1 Introduction

This vignette demonstrates the use plotCellTypeMDS(), plotCellTypePCA(), boxplotPCA() and calculateDiscriminantSpace(), which are designed to help assess and compare cell types between query and reference datasets using Multidimensional Scaling (MDS) and Principal Component Analysis (PCA), and Fisher Discriminant Analysis (FDA) respectively.

This vignette also demonstrates how to visualize gene expression data from single-cell RNA sequencing (scRNA-seq) experiments using two functions: plotGeneExpressionDimred() and plotMarkerExpression(). These functions allow researchers to explore gene expression patterns across various dimensions and compare expression distributions between different datasets or cell types.

Lastly this vignette will demonstrate how to visualize quality control (QC) scores and annotation scores, potentially identifying relationship between QC statistics (e.g., total library size or percentage of mitochondrial genes) and cell type annotation scores.

2 Preliminaries

In the context of the scDiagnostics package, this vignette illustrates how to evaluate cell type annotations using two distinct datasets:

  • reference_data: This dataset consists of meticulously curated cell type annotations assigned by experts, providing a robust benchmark for assessing the quality of annotations. It serves as a standard reference to evaluate and detect anomalies or inconsistencies within the cell type annotations.

  • query_data: This dataset includes annotations both from expert assessments and those generated by the SingleR package. By comparing these annotations, you can identify discrepancies between manual and automated results, thereby pinpointing potential inconsistencies or areas that may need further scrutiny.

  • qc_data: This dataset includes data from haematopoietic stem and progenitor cells, including QC metrics like library size and mitochondrial content, SingleR-predicted cell types and annotation scores indicating prediction confidence.

# Load library
library(scDiagnostics)

# Load datasets
data("reference_data")
data("query_data")
data("qc_data")

# Set seed for reproducibility
set.seed(0)

3 Visualization of Query vs. Reference Dataset

3.1 Plot Reference and Query Cell Types Using MDS

The plotCellTypeMDS() function creates a scatter plot using Multidimensional Scaling (MDS) to visualize the similarity between cell types in query and reference datasets. This function generates a MDS plot that colors cells according to their types based on a dissimilarity matrix calculated from a specified gene set.

# Generate the MDS scatter plot with cell type coloring
plotCellTypeMDS(query_data = query_data, 
                reference_data = reference_data, 
                cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
                query_cell_type_col = "SingleR_annotation", 
                ref_cell_type_col = "expert_annotation")

3.2 Plot Principal Components for Different Cell Types

The plotCellTypePCA() function projects the query dataset onto the PCA space of the reference dataset. It then plots specified principal components for different cell types, allowing comparison of PCA results between query and reference datasets.

# Generate the MDS scatter plot with cell type coloring
plotCellTypePCA(query_data = query_data, 
                reference_data = reference_data,
                cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
                query_cell_type_col = "SingleR_annotation", 
                ref_cell_type_col = "expert_annotation", 
                pc_subset = 1:3)

Here, plotCellTypePCA() projects the query data into the PCA space of the reference data and plots the specified principal components. The pc_subset argument allows you to select which principal components to display.

3.3 Plot Principal Components as Boxplots

The boxplotPCA() function provides a boxplot visualization of principal components for different cell types across query and reference datasets. This function helps in comparing the distributions of PCA scores for specified cell types and datasets.

# Plot the PCA data as boxplots
boxplotPCA(query_data = query_data, 
           reference_data = reference_data,
           cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
           query_cell_type_col = "SingleR_annotation", 
           ref_cell_type_col = "expert_annotation", 
           pc_subset = 1:3)

In this example, boxplotPCA() generates boxplots for the specified principal components, showing the distribution of PCA scores for different cell types and datasets.

3.4 Project Query Data onto Discriminant Space of Reference Data

The calculateDiscriminantSpace() function projects query single-cell RNA-seq data onto a discriminant space defined by reference data. This process involves using the reference data to identify key variables, compute discriminant vectors, and project both the reference and query data onto this space. The function then assesses the similarity between the projections of the query data and the reference data using cosine similarity and Mahalanobis distance.

3.5 Function Details

The function performs the following steps for each pairwise combination of cell types:

  1. Identify Important Variables: Determine the most significant variables that distinguish the two cell types from the reference data.
  2. Compute Covariance Matrices: Estimate the covariance matrices for each cell type using the Ledoit-Wolf shrinkage method.
  3. Construct Scatter Matrices: Create within-class and between-class scatter matrices.
  4. Solve Eigenvalue Problem: Solve the generalized eigenvalue problem to obtain discriminant vectors.
  5. Project Data: Project both reference and query data onto the discriminant space.
  6. Assess Similarity: Measure the similarity of the query data projections to the reference projections using cosine similarity and Mahalanobis distance.

3.5.1 Example Application

First, applying the function to the SingleR annotation.

# Compute important variables for all pairwise cell comparisons
disc_output <- calculateDiscriminantSpace(
  reference_data = reference_data,
  query_data = query_data, 
  query_cell_type_col = "SingleR_annotation", 
  ref_cell_type_col = "expert_annotation")

# Generate scatter and boxplot
plot(disc_output, plot_type = "scatterplot")

3.5.2 Using Mahalanobis Distance for Anomaly Detection in Single-Cell RNA-Seq Data

We aim to evaluate whether a new batch of query cells fits well within established reference cell types or if there are cells exhibiting unusual characteristics. We will use the Mahalanobis distance calculated by the calculateDiscriminantSpace() function to achieve this.

# Perform discriminant analysis and projection
discriminant_results <- calculateDiscriminantSpace(
  reference_data = reference_data,
  query_data = query_data,
  ref_cell_type_col = "expert_annotation",
  query_cell_type_col = "SingleR_annotation",
  calculate_metrics = TRUE,  # Compute Mahalanobis distance and cosine similarity
  alpha = 0.01  # Significance level for Mahalanobis distance cutoff
)

You can extract and analyze the Mahalanobis distances of query cells relative to reference cell types. High Mahalanobis distances may indicate potential anomalies or novel cell states.

# Extract Mahalanobis distances
mahalanobis_distances <- discriminant_results$`CD4-CD8`$query_mahalanobis_dist

# Identify anomalies based on a threshold
threshold <- discriminant_results$`CD4-CD8`$mahalanobis_crit  # Use the computed cutoff value
anomalies <- mahalanobis_distances[mahalanobis_distances > threshold]

You can now create visualizations to illustrate the distribution of Mahalanobis distances and highlight potential anomalies.

# Load ggplot2 for visualization
library(ggplot2)

# Convert distances to a data frame
distances_df <- data.frame(Distance = mahalanobis_distances,
                           Anomaly = ifelse(mahalanobis_distances > threshold, "Anomaly", "Normal"))

# Plot histogram of Mahalanobis distances
ggplot(distances_df, aes(x = Distance, fill = Anomaly)) +
  geom_histogram(binwidth = 0.5, position = "identity", alpha = 0.7) +
  labs(title = "Histogram of Mahalanobis Distances",
       x = "Mahalanobis Distance",
       y = "Frequency") +
  theme_bw()

  • Low Mahalanobis Distances: These query cells align well with the reference cell types, suggesting they are similar to the established cell types.

  • High Mahalanobis Distances: Cells with high distances may represent outliers or novel, previously uncharacterized cell types. Further investigation may be needed to understand these anomalies.

3.6 Project Data onto Sliced Inverse Regression (SIR) Space of Reference Data

The calculateSIRSpace() function leverages Sliced Inverse Regression (SIR) to analyze high-dimensional cell type data by focusing on the directions that best separate different cell types and their subtypes. The function projects data onto a lower-dimensional SIR space, aiming to highlight the most informative features for distinguishing between cell types.

The process starts by handling each cell type as a distinct slice of the data, allowing for a more detailed analysis. When multiple_cond_means is set to TRUE, the function refines this by computing conditional means for each cell type, using clustering to capture finer distinctions within each type. It then performs Singular Value Decomposition (SVD) on these refined means to uncover the principal directions that most effectively differentiate the cell types.

Ultimately, the function provides projections that reveal how cell types and their subtypes relate to each other in a reduced space, facilitating a deeper understanding of cellular variability and distinctions in complex datasets.

# Calculate the SIR space projections
sir_output <- calculateSIRSpace(
  reference_data = reference_data,
  query_data = query_data,
  query_cell_type_col = "SingleR_annotation",
  ref_cell_type_col = "expert_annotation",
  multiple_cond_means = TRUE
)

# Plot the SIR space projections
plot(sir_output, 
     plot_type = "boxplot", 
     sir_subset = 1:3)

In this example, we compute the SIR space projections using the calculateSIRSpace() function, and then we visualize the projections with a boxplot to analyze the distribution of the data in the SIR space. This example helps you understand how different cell types and their subtypes are projected into a common space, facilitating their comparison and interpretation.

The plot below visualizes the contribution of different markers to the variance along the SIR dimensions, helping identify which markers are most influential in differentiating e.g. between B cells (including plasma cells) from other cell types.

# Plot the top contributing markers
plot(sir_output, 
     plot_type = "varplot", 
     sir_subset = 1:3,
     n_top_vars = 10)

4 Visualization of Marker Expressions

4.1 Visualizing Gene Expression in Reduced Dimensions

The plotGeneExpressionDimred() function plots the expression of a specific gene on a dimensional reduction plot. This function is useful for exploring how gene expression varies across cells in the context of reduced dimensions, such as those derived from PCA, t-SNE, or UMAP. Each point on the plot represents a single cell, color-coded according to the expression level of the specified gene.

Let’s visualize the expression of the gene VPREB3 on a PCA plot:

# Visualize VPREB3 gene on a PCA plot
plotGeneExpressionDimred(se_object = query_data, 
                         method = "PCA", 
                         pc_subset = 1:3, 
                         feature = "VPREB3")

In this example, we visualize the expression of the VPREB3 gene on a PCA plot derived from the query_data dataset. The method argument specifies the dimensional reduction technique, which in this case is “PCA.” The pc_subset argument allows us to focus on specific principal components (PCs), here selecting the first five. The feature argument is the name of the gene we want to visualize.

When executed, this function generates a scatter plot with cells colored by their expression levels of VPREB3. The resulting plot can help identify clusters of cells with similar expression patterns or highlight how expression varies across the dataset.

You can also visualize gene expression using other dimensional reduction methods, such as t-SNE or UMAP.

4.2 Plotting Gene Expression Distribution

The plotMarkerExpression() function compares the distribution of a specific gene’s expression across an overall dataset and within specific cell types. It generates density plots to visualize these distributions, providing a graphical means of assessing the similarity between reference and query datasets. The function also applies a z-rank normalization to make expression values comparable between datasets.

plotMarkerExpression(
  reference_data = reference_data, 
  query_data = query_data, 
  ref_cell_type_col = "expert_annotation", 
  query_cell_type_col = "SingleR_annotation", 
  gene_name = "VPREB3", 
  cell_type = "B_and_plasma"
)

In this example, the function compares the expression distribution of the VPREB3 gene between the reference and query datasets. The ref_cell_type_col and query_cell_type_col arguments specify the columns in colData that identify the cell types in the reference and query datasets, respectively. The cell_type argument specifies the cell type of interest, and gene_name is the gene whose expression distribution we wish to compare.

When executed, this function generates two density plots: one showing the overall gene expression distribution across all cells and another focusing on the specified cell type. These plots help identify whether the gene’s expression levels are similar between the reference and query datasets, providing insight into dataset alignment.

5 Visualization of QC and Annotation Scores

5.1 Scatter Plot: QC Stats vs Cell Type Annotation Scores

The scatter plot visualizes the relationship between QC statistics (e.g., total library size or percentage of mitochondrial genes) and cell type annotation scores. This plot helps in understanding how QC metrics influence or correlate with the predicted cell types.

# Generate scatter plot
p1 <- plotQCvsAnnotation(se_object = qc_data,
                         cell_type_col = "SingleR_annotation",
                         qc_col = "total",
                         score_col = "annotation_scores")
p1 + ggplot2::xlab("Library Size")

A scatter plot can reveal patterns such as whether cells with higher library sizes or mitochondrial content tend to be associated with specific annotations. For instance, cells with unusually high mitochondrial content might be identified as low-quality or stressed, potentially affecting their annotations.

5.2 Histograms: QC Stats and Annotation Scores Visualization

Histograms provide a distribution view of QC metrics and annotation scores. They help in evaluating the range, central tendency, and spread of these variables across cells.

# Generate histograms
histQCvsAnnotation(se_object = query_data, 
                   cell_type_col = "SingleR_annotation", 
                   qc_col = "percent_mito", 
                   score_col = "annotation_scores")

Histograms are useful for assessing the overall distribution of QC metrics and annotation scores. For example, if the majority of cells have high percent_mito, it might indicate that many cells are stressed or dying, which could impact the quality of the data.

6 Visualization of Gene Sets or Pathway Scores on Dimensional Reduction Plots

Dimensional reduction plots (PCA, t-SNE, UMAP) are used to visualize the relationships between cells in reduced dimensions. Overlaying gene set scores on these plots provides insights into how specific gene activities are distributed across cell clusters.

# Plot gene set scores on PCA
plotGeneSetScores(se_object = query_data, 
                  method = "PCA", 
                  score_col = "gene_set_scores", 
                  pc_subset = 1:3)

By visualizing gene set scores on PCA or UMAP plots, one can identify clusters of cells with high or low gene set activities. This can help in understanding the biological relevance of different gene sets or pathways in various cell states or types.


7 R Session Info

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /media/volume/teran2_disk/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] scater_1.33.4               ggplot2_3.5.1              
 [3] scran_1.33.2                scuttle_1.15.4             
 [5] SingleCellExperiment_1.27.2 SummarizedExperiment_1.35.5
 [7] Biobase_2.65.1              GenomicRanges_1.57.2       
 [9] GenomeInfoDb_1.41.2         IRanges_2.39.2             
[11] S4Vectors_0.43.2            BiocGenerics_0.51.3        
[13] MatrixGenerics_1.17.1       matrixStats_1.4.1          
[15] scDiagnostics_0.99.11       BiocStyle_2.33.1           

loaded via a namespace (and not attached):
 [1] DBI_1.2.3               gridExtra_2.3           cramer_0.9-4           
 [4] rlang_1.1.4             magrittr_2.0.3          ggridges_0.5.6         
 [7] compiler_4.4.1          systemfonts_1.1.0       vctrs_0.6.5            
[10] pkgconfig_2.0.3         crayon_1.5.3            fastmap_1.2.0          
[13] magick_2.8.5            XVector_0.45.0          labeling_0.4.3         
[16] utf8_1.2.4              rmarkdown_2.28          UCSC.utils_1.1.0       
[19] ggbeeswarm_0.7.2        ragg_1.3.3              tinytex_0.53           
[22] xfun_0.48               bluster_1.15.1          zlibbioc_1.51.2        
[25] cachem_1.1.0            beachmat_2.21.8         jsonlite_1.8.9         
[28] DelayedArray_0.31.14    BiocParallel_1.39.0     irlba_2.3.5.1          
[31] parallel_4.4.1          cluster_2.1.6           biglm_0.9-3            
[34] R6_2.5.1                bslib_0.8.0             ranger_0.16.0          
[37] limma_3.61.12           boot_1.3-31             jquerylib_0.1.4        
[40] Rcpp_1.0.13             bookdown_0.41           knitr_1.48             
[43] Matrix_1.7-1            igraph_2.1.1            tidyselect_1.2.1       
[46] abind_1.4-8             yaml_2.3.10             viridis_0.6.5          
[49] codetools_0.2-20        lattice_0.22-6          tibble_3.2.1           
[52] withr_3.0.1             evaluate_1.0.1          pillar_1.9.0           
[55] BiocManager_1.30.25     generics_0.1.3          munsell_0.5.1          
[58] scales_1.3.0            glue_1.8.0              metapod_1.13.0         
[61] tools_4.4.1             speedglm_0.3-5          BiocNeighbors_1.99.3   
[64] data.table_1.16.2       ScaledMatrix_1.13.0     locfit_1.5-9.10        
[67] grid_4.4.1              edgeR_4.3.21            colorspace_2.1-1       
[70] patchwork_1.3.0         GenomeInfoDbData_1.2.13 beeswarm_0.4.0         
[73] BiocSingular_1.21.4     vipor_0.4.7             cli_3.6.3              
[76] rsvd_1.0.5              textshaping_0.4.0       fansi_1.0.6            
[79] S4Arrays_1.5.11         viridisLite_0.4.2       dplyr_1.1.4            
[82] gtable_0.3.6            isotree_0.6.1-1         sass_0.4.9             
[85] digest_0.6.37           SparseArray_1.5.45      ggrepel_0.9.6          
[88] dqrng_0.4.1             farver_2.1.2            htmltools_0.5.8.1      
[91] lifecycle_1.0.4         httr_1.4.7              statmod_1.5.0          
[94] transport_0.15-4        MASS_7.3-61