CDI 1.4.0

This tutorial shows how to apply CDI to select optimal clustering labels among different candidate cell labels for UMI counts of single-cell RNA-sequencing dataset. Sections 1-3 are for datasets from one batch; Section 4 is for datasets from multiple batches. Datasets used in this tutorial were simulated for illustration purpose.

CDI can be installed from Bioconductor:

```
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CDI")
```

The latest version can be installed from Github:

```
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes")
remotes::install_github("jichunxie/CDI", build_vignettes = TRUE)
```

CDI provides simulated toy single-cell RNA-seq UMI count matrices for illustration purpose. To use other datasets, please change the code in the following two blocks. Filtering of cells/genes can be applied before feature gene selection; normalization/imputation is not applicable here, as CDI models the raw UMI counts.

**Inputs**:

**one_batch_matrix**: a single-cell RNA-seq UMI count matrix from one batch (each row represents a gene and each column represent a cell). To apply CDI to datasets from multiple batches, please check section 4.**one_batch_matrix_label_df**: the label sets of the one_batch_matrix, where each row represents a cell, and each column represents a set of cell labels generated by existing clustering methods such as K-Means.**one_batch_matrix_celltype**: a vector of characters indicating the benchmark cell types of cells in one_batch_matrix (not necessary except for evaluation purpose).

```
data(one_batch_matrix, package = "CDI")
dim(one_batch_matrix)
```

`## [1] 3000 2000`

```
data(one_batch_matrix_celltype, package = "CDI")
table(one_batch_matrix_celltype)
```

```
## one_batch_matrix_celltype
## type1 type2 type3 type4 type5
## 400 400 400 400 400
```

Here we provide 12 label sets generated from PCA\(+\)KMeans and Seurat v3.1.5, where the number of clusters range from 2 to 7.

For better visualization and comparison, the column names of this data frame indicate the clustering method and the number of clusters. For example, “KMeans_k5” refers to the set of cell labels generated by KMeans with 5 clusters.

```
data(one_batch_matrix_label_df, package = "CDI")
knitr::kable(head(one_batch_matrix_label_df[,c("KMeans_k2", "KMeans_k4", "Seurat_k2", "Seurat_k3")], 3))
```

KMeans_k2 | KMeans_k4 | Seurat_k2 | Seurat_k3 |
---|---|---|---|

2 | 3 | 1 | 2 |

2 | 3 | 1 | 2 |

2 | 3 | 1 | 2 |

`feature_gene_selection`

Feature genes are those genes that express differently across cell types. Several methods are available for feature gene selection. Here we propose a new method called working dispersion score (WDS). We select genes with highest WDS as the feature genes.

```
feature_gene_indx <- feature_gene_selection(
X = one_batch_matrix,
batch_label = NULL,
method = "wds",
nfeature = 500)
sub_one_batch_matrix <- one_batch_matrix[feature_gene_indx,]
```

`calculate_CDI`

First, calculate the size factor of each gene with `size_factor`

. The output of `size_factor`

will be one of the inputs of `calculate_CDI`

.

`one_batch_matrix_size_factor <- size_factor(X = one_batch_matrix)`

Second, calculate CDI for each candidate set of cell labels:

```
start_time <- Sys.time()
CDI_return1 <- calculate_CDI(
X = sub_one_batch_matrix,
cand_lab_df = one_batch_matrix_label_df,
batch_label = NULL,
cell_size_factor = one_batch_matrix_size_factor)
end_time <- Sys.time()
print(difftime(end_time, start_time))
```

`## Time difference of 55.42861 secs`

`knitr::kable(CDI_return1)`

Label_name | Cluster_method | CDI_AIC | CDI_BIC | neg_llk_val | N_cluster | |
---|---|---|---|---|---|---|

KMeans_k2 | KMeans_k2 | KMeans | 1135287 | 1146489 | 565643.4 | 2 |

KMeans_k3 | KMeans_k3 | KMeans | 1120096 | 1136899 | 557047.9 | 3 |

KMeans_k4 | KMeans_k4 | KMeans | 1112561 | 1134964 | 552280.4 | 4 |

KMeans_k5 | KMeans_k5 | KMeans | 1101970 | 1129974 | 545984.9 | 5 |

KMeans_k6 | KMeans_k6 | KMeans | 1103150 | 1136756 | 545575.1 | 6 |

KMeans_k7 | KMeans_k7 | KMeans | 1105022 | 1144229 | 545511.2 | 7 |

Seurat_k2 | Seurat_k2 | Seurat | 1129063 | 1140264 | 562531.3 | 2 |

Seurat_k3 | Seurat_k3 | Seurat | 1120002 | 1136805 | 557001.2 | 3 |

Seurat_k4 | Seurat_k4 | Seurat | 1109115 | 1131518 | 550557.3 | 4 |

Seurat_k5 | Seurat_k5 | Seurat | 1102321 | 1130325 | 546160.4 | 5 |

Seurat_k6 | Seurat_k6 | Seurat | 1103152 | 1136751 | 545575.9 | 6 |

Seurat_k7 | Seurat_k7 | Seurat | 1104099 | 1143292 | 545049.6 | 7 |

CDI counts the number of unique characters in each label set as the “N_cluster”. If the column names of label set data frame are provided with the format "[ClusteringMethod]_k[NumberOfClusters]" such as “KMeans_K5, `calculate_CDI`

will extract the”[ClusteringMethod]" as the Cluster_method. The clustering method can also be provided in the argument `clustering_method`

for each label set.

The outputs of `calculate_CDI`

include CDI-AIC and CDI-BIC. CDI calculates AIC and BIC of cell-type-specific gene-specific NB model for UMI counts, where the cell types are based on each candidate label set, and only the selected subset of genes are considered. Whether to use CDI-AIC or CDI-BIC depend on the goals. We suggest using CDI-BIC to select optimal main cell types and using CDI-AIC to select optimal subtypes, because BIC puts more penalty on the complexity of models (number of clusters).

**Output visualization**

The outputs of CDI are demonstrated with a lineplot. The x-axis is for the number of clusters. The y-axis is for the CDI values. Different colors represent different clustering methods: The orange line represents label sets from Seurat; the blue line represents label sets from K-Means. The red triangle represents the optimal clustering result corresponding to the smallest CDI value.

The cell population in this simulated dataset doesn’t have a hierarchical structure. We use CDI-BIC to select the optimal set of cell labels. The label set “KMeans_k5” has the smallest CDI-BIC and is selected as the optimal.

`CDI_lineplot(cdi_dataframe = CDI_return1, cdi_type = "CDI_BIC")`