1 Introduction

1.1 About scTGIF

Here, we explain the concept of scTGIF. A fundamental difficulty in the analysis of single-cell RNA-Seq (scRNA-Seq) data is that the cell type of each cell is not known a priori.

Therefore, at the start of the analysis of a scRNA-Seq dataset, each cell is “not colored” (unannotated) (Figure 1). There are several approaches that help users infer the cell types, such as (1) known marker gene expression, (2) BLAST-like gene expression comparison with a reference DB, and (3) differentially expressed genes (DEGs) and over-representation analysis (ORA) (scRNA-tools).

The first approach is probably the most popular, but it relies on expert knowledge of the cell types and is not always generally applicable. The second approach is easy and scalable, but it is limited when the cell type is unknown or has not yet been measured by other research organizations. The third approach can perhaps be used in any situation, but it is ambiguous and time-consuming: it depends on the cluster labels, whereas the true cluster structure is not known, and a DEG method has to be run for each cluster, while recent scRNA-Seq datasets contain tens to hundreds of cell types. In addition, a scRNA-Seq dataset can contain low-quality cells and artifacts (e.g. doublets) that are hard to distinguish from real cells. Therefore, in actual data analysis, a laborious trial-and-error cycle accompanied by changes to the cellular labels is unavoidable (Figure 1).

scTGIF was developed to reduce this trial-and-error cycle; the tool directly connects the unannotated cells with the related gene functions. Because it does not use a reference DB, a marker gene list, or cluster labels, it can be used in any situation without expert knowledge and is not influenced by changes to the cellular labels.

Figure 1: Concept of scTGIF

In scTGIF, three inputs are required: the gene expression matrix, the 2D coordinates of the cells (e.g. t-SNE, UMAP), and gene sets from MSigDB. First, the 2D coordinate space is segmented into a 50-by-50 grid, and gene expression is summarized at the grid level (X1). Next, the correspondence between genes and their related gene functions is summarized as a gene-by-function matrix (X2). Only the genes common to X1 and X2 are used. By performing the joint non-negative matrix factorization (jNMF) algorithm, which is implemented in nnTensor, the latent variables (W) shared by the two matrices are estimated.
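The grid-level summarization step can be illustrated with a minimal sketch. This is not the scTGIF implementation (scTGIF is an R package); the function name grid_summarize, the equal-width binning, and the choice of summing expression within each grid are assumptions made only for illustration.

```python
from collections import defaultdict


def grid_summarize(coords, expression, n_bins=50):
    """Summarize gene expression per grid cell (hypothetical helper).

    coords:     list of (x, y) 2D coordinates, one per cell
    expression: expression[g][c] = expression of gene g in cell c
    Returns a dict mapping (ix, iy) -> per-gene summed expression,
    containing only the grids that hold at least one cell.
    """
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def bin_index(v, lo, hi):
        # Equal-width bins over [lo, hi]; the maximum falls in the last bin.
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)

    n_genes = len(expression)
    grid = defaultdict(lambda: [0.0] * n_genes)
    for c, (x, y) in enumerate(coords):
        key = (bin_index(x, x_min, x_max), bin_index(y, y_min, y_max))
        for g in range(n_genes):
            # Summation is an assumption; a mean would be equally plausible.
            grid[key][g] += expression[g][c]
    return dict(grid)


# Toy data: three cells, two genes.
coords = [(0.0, 0.0), (0.1, 0.1), (10.0, 10.0)]
expression = [[1.0, 2.0, 5.0], [0.0, 3.0, 4.0]]
x1 = grid_summarize(coords, expression)
```

In this toy example the first two cells fall into the same grid, so their expression values are pooled, while the third cell occupies the opposite corner. Stacking the per-grid vectors as columns would give a gene-by-grid matrix playing the role of X1.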

Figure 2: Joint NMF

With this algorithm, each grid set is paired with its corresponding gene functions. The lower-dimension (D)-by-grid matrix H1 works as a set of attention maps that help users focus on particular grids, and the D-by-function matrix H2 shows the gene functions enriched in those grids.
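The factorization can be written compactly as follows. This is a sketch under stated assumptions: the matrix shapes follow the text above (X1 is gene-by-grid, X2 is gene-by-function, W is gene-by-D), while the squared Frobenius loss is chosen here only for illustration and is not necessarily the objective that nnTensor minimizes by default.

```latex
X_1 \approx W H_1, \qquad X_2 \approx W H_2,
\qquad W \ge 0,\; H_1 \ge 0,\; H_2 \ge 0

\min_{W,\,H_1,\,H_2 \,\ge\, 0}\;
  \lVert X_1 - W H_1 \rVert_F^2 + \lVert X_2 - W H_2 \rVert_F^2
```

Because W is shared by both terms, the d-th row of H1 (a pattern over grids) and the d-th row of H2 (a pattern over gene functions) are tied to the same gene-level factor, which is what pairs a grid set with its gene functions.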