- 1 Introduction
- 2 Citation
- 3 Data Simulation
- 4 PsiNorm data normalization
- 5 Data Normalization with PsiNorm
- 6 Supervised approach: Silhouette index
- 7 Correlation of PC1 and PC2 with sequencing depth
- 8 Using PsiNorm in `scone()`
- 9 Using PsiNorm with Seurat
- 10 Using PsiNorm with HDF5 files
- 11 Session Information

```
library(SingleCellExperiment)
library(splatter)
library(scater)
library(cluster)
library(scone)
```

PsiNorm is a scalable between-sample normalization for single-cell RNA-seq count data based on the power-law Pareto type I distribution. It can be shown that the Pareto shape parameter is inversely proportional to the sequencing depth; it is sample specific, and its estimate can be obtained for each cell independently. PsiNorm computes the shape parameter for each cellular sample and then uses it as a multiplicative size factor to normalize the data. The goal of the transformation is to align the gene expression distributions across cells, especially for highly expressed genes. Note that, like other global scaling methods, PsiNorm does not remove batch effects, which can be handled by downstream tools.
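The idea of a per-cell shape parameter used as a size factor can be illustrated with a minimal base-R sketch. It relies on the standard maximum-likelihood estimate of the Pareto type I shape, n / sum(log(x / x_min)), and it assumes a pseudocount of 1 and the per-cell minimum as the scale parameter. The helper `pareto_shape()` and the toy matrix are hypothetical and are not the code used inside `scone`, which may differ in its details.

```
# Hypothetical helper (not the scone implementation): maximum-likelihood
# estimate of the Pareto type I shape parameter, one value per column (cell).
pareto_shape <- function(counts) {
  x <- counts + 1                                  # assumed pseudocount, so every value is >= 1
  x_min <- apply(x, 2, min)                        # per-cell scale parameter (the minimum)
  log_ratio <- sweep(log(x), 2, log(x_min), "-")   # log(x / x_min) for every gene
  nrow(x) / colSums(log_ratio)                     # n / sum(log(x / x_min))
}

# Toy genes-by-cells matrix: the second cell has four times the counts of the first.
toy <- cbind(cell_shallow = rep(0:9, 5),
             cell_deep    = 4 * rep(0:9, 5))

alpha <- pareto_shape(toy)

# Use the shape parameter as a multiplicative size factor.
normalized <- sweep(toy, 2, alpha, "*")
```

In this toy example the deeper cell receives the smaller shape parameter, so multiplying by it shrinks the deeper cell relative to the shallower one.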

To evaluate the ability of PsiNorm to remove technical bias and reveal the true cell similarity structure, we use both an unsupervised and a supervised approach.
We first simulate a scRNA-seq experiment with four known clusters using the *splatter* Bioconductor package. Then, in the unsupervised approach, we i) reduce dimensionality using PCA, ii) identify clusters using the *clara* partitional method, and iii) compute the Adjusted Rand Index (ARI) to compare the known and the estimated partitions.
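The three unsupervised steps can be sketched on toy data as follows. This is only an illustration under stated assumptions: the data are simulated Gaussian clusters rather than scRNA-seq counts, and `adjusted_rand()` is a hypothetical base-R helper written here to avoid an extra dependency; it is not necessarily how the ARI is computed elsewhere in this document.

```
library(cluster)

# Hypothetical helper: Adjusted Rand Index from the contingency table
# of two partitions of the same observations.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  choose2 <- function(n) n * (n - 1) / 2
  sum_cells <- sum(choose2(tab))
  sum_rows  <- sum(choose2(rowSums(tab)))
  sum_cols  <- sum(choose2(colSums(tab)))
  expected  <- sum_rows * sum_cols / choose2(sum(tab))
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

# Toy data: 200 observations, 5 features, four well-separated groups.
set.seed(42)
truth <- rep(1:4, each = 50)
toy <- matrix(rnorm(200 * 5), ncol = 5) + c(0, 20, 40, 60)[truth]

pca <- prcomp(toy, scale. = TRUE)$x[, 1:2]   # i) reduce dimensionality
estimated <- clara(pca, k = 4)$clustering    # ii) cluster with clara
ari <- adjusted_rand(estimated, truth)       # iii) compare the partitions
```

An ARI close to 1 indicates that the estimated partition agrees with the known one; values near 0 indicate agreement no better than chance.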

In the supervised approach, we i) reduce dimensionality using PCA, and ii) compute the silhouette index of the known partition in the reduced-dimensional space.
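The supervised evaluation can be sketched in the same spirit. The example below assumes toy Gaussian clusters in place of scRNA-seq data and uses `silhouette()` from the *cluster* package with the known labels; the object names are illustrative.

```
library(cluster)

# Toy data: 200 observations, 5 features, four well-separated groups.
set.seed(42)
truth <- rep(1:4, each = 50)
toy <- matrix(rnorm(200 * 5), ncol = 5) + c(0, 20, 40, 60)[truth]

# i) reduce dimensionality with PCA
pca <- prcomp(toy, scale. = TRUE)$x[, 1:2]

# ii) silhouette widths of the known partition in the reduced space
sil <- silhouette(truth, dist(pca))
mean_sil <- mean(sil[, "sil_width"])
```

The average silhouette width ranges from -1 to 1; higher values mean that the known groups are more compact and better separated in the reduced space.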

If you use `PsiNorm` in publications, please cite the following article:

Borella, M., Martello, G., Risso, D., & Romualdi, C. (2021). PsiNorm: a scalable normalization for single-cell RNA-seq data. bioRxiv. https://doi.org/10.1101/2021.04.07.438822.

We simulate a matrix of counts with 2000 cellular samples and 10000 genes with splatter.

```
set.seed(1234)
params <- newSplatParams()
N <- 2000
# Simulate one batch of N cells split evenly into four groups
sce <- splatSimulateGroups(params, batchCells = N, lib.loc = 12,
                           group.prob = rep(0.25, 4),
                           de.prob = 0.2, de.facLoc = 0.06,
                           verbose = FALSE)
```

`sce` is a SingleCellExperiment object with a single batch and four different cellular groups.

To visualize the data, we use the first two principal components estimated from the log-transformed raw count matrix.

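```
set.seed(1234)
# Store log(1 + count) of the raw counts as a new assay
assay(sce, "lograwcounts") <- log1p(counts(sce))
# PCA on the log raw counts, keeping the first two components
sce <- runPCA(sce, exprs_values = "lograwcounts", scale = TRUE, ncomponents = 2)
plotPCA(sce, colour_by = "Group")
```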