1 Introduction

This guide provides an overview of the PIUMA1 PIUMA is the italian word for feather package, a comprehensive R package for performing Topological Data Analysis on high-dimensional datasets, such as -omics data.

1.1 Motivation

Phenotyping is a process of characterizing and classifying individuals based on observable traits or phenotypic characteristics. In the context of medicine and biology, phenotyping involves the systematic analysis and measurement of various physical, physiological, and behavioral features of individuals, such as height, weight, blood pressure, biochemical markers, imaging data, and more. Phenotyping plays a crucial role in precision medicine as it provides essential data for understanding individual health characteristics and disease manifestations, by combining data from different sources to gain comprehensive insights into an individual’s health status and disease risk. This integrated approach allows for more accurate disease diagnosis, prognosis, and treatment selection. The same considerations could be also be extended in omics research, in which the expression values of thousands of genes and proteins, or the incidence of somatic and germline polymorphic variants are usually assessed to link molecular activities with the onset or the progression of diseases. In this field, phenotyping is needed to identify patterns and associations between phenotypic traits and a huge amount of available features. These analyses can uncover novel disease subtypes, identify predictive markers, and facilitate the development of personalized treatment strategies. In this context, the application of unsupervised learning methodologies could help the identification of specific phenotypes in huge heterogeneous cohorts, such as clinical or -omics data. Among them, the Topological Data Analysis (TDA) is a rapidly growing field that combines concepts from algebraic topology and computational geometry to analyze and extract meaningful information from complex and high-dimensional data sets (Carlsson 2009). Moreover, TDA is a robust and effective methodology that preserves the intrinsic characteristics of data and the mutual relationships among observations, by presenting complex data in a graph-based representation. Indeed, building topological models as networks, TDA allows complex diseases to be inspected in a continuous space, where subjects can ‘fluctuate’ over the graph, sharing, at the same time, more than one adjacent node of the network (Dagliati et al. 2020). Overall, TDA offers a powerful set of tools to capture the underlying topological features of data, revealing essential patterns and relationships that might be hidden from traditional statistical techniques (Casaclang-Verzosa et al. 2019).

2 Installation

PIUMA can be installed by:

if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("PIUMA")

3 Tutorial

3.1 The testing dataset

We tested PIUMA on a subset of the single-cell RNA Sequencing dataset (GSE:GSE193346 generated and published by Feng et al. (2022) to demonstrate that distinct transcriptional profiles are present in specific cell types of each heart chambers, which were attributed to have roles in cardiac development (Feng et al. 2022). In this tutorial, our aim will be to exploit PIUMA for identifying sub-population of vascular endothelial cells, which can be associated with specific heart developmental stages. The original dataset consisted of three layers of heterogeneity: cell type, stage and zone (i.e., heart chamber). Our test dataset was obtained by subsetting vascular endothelial cells (cell type) by Seurat object, extracting raw counts and metadata. Thus, we filtered low expressed genes and normalized data by DaMiRseq :

#############################################
############# NOT TO EXECUTE ################
########## please skip this chunk ###########
#############################################


dataset_seu <- readRDS("./GSE193346_CD1_seurat_object.rds")

# subset vascular endothelial cells
vascularEC_seuobj <- subset(x = dataset_seu,
                            subset = markFinal == "vascular_ec")
df_data_counts <- vascularEC_seuobj@assays$RNA@counts
df_cl <- as.data.frame(df_data_counts)
meta_cl <- vascularEC_seuobj@meta.data[, c(10,13,14,15)]
meta_cl[sapply(meta_cl, is.character)] <- lapply(meta_cl[sapply(meta_cl,
                                                                is.character)],
                                                 as.factor)

## Filtering and normalization
colnames(meta_cl)[4] <- "class"
SE <- DaMiR.makeSE(df_cl, meta_cl)
data_norm <- DaMiR.normalization(SE,
                                 type = "vst",
                                 minCounts = 3,
                                 fSample = 0.4,
                                 hyper = "no")
vascEC_norm <- round(t(assay(data_norm)), 2)
vascEC_meta <- meta_cl[, c(3,4), drop=FALSE]
df_TDA <- cbind(vascEC_meta, vascEC_norm)

At the end, the dataset was composed of 1180 cells (observations) and 838 expressed genes (features). Moreover, 2 additional features are present in the metadata: ‘stage’ and ‘zone’. The first one describes the stage of heart development, while the second one refers to the heart chamber.

Users can directly import the testing dataset by:

library(PIUMA)
#> Loading required package: ggplot2
library(ggplot2)
data(vascEC_norm)
data(vascEC_meta)

df_TDA <- cbind(vascEC_meta, vascEC_norm)

dim(df_TDA)
#> [1] 1180  840
head(df_TDA[1:5, 1:7])
#>                       stage zone Rpl7  Dst Cox5b Eif5b Rpl31
#> AACCAACGTGGTACAG-a5k1 E15.5   RV 5.08 2.95  2.51  2.20  3.42
#> AACCATGAGGAAGTCC-a5k1    P3   LV 4.84 2.40  3.40  1.68  3.40
#> AACGAAAAGACCATAA-a5k1    P0   RV 4.42 2.43  2.61  2.33  2.61
#> AAGCGAGGTAGGAGGG-a5k1    P0   LV 3.74 2.52  2.52  2.01  2.52
#> AAGCGTTGTCTCGGAC-a5k1 E16.5   LV 4.99 3.31  2.33  2.33  3.62

3.2 The TDA object

The PIUMA package comes with a dedicated data structure to easily store the information gathered from all the steps performed by a Topological Data Analysis. As shown in the following cartoon, this object, called TDAobj, is an S4 class containing 9 slots:

  • orig_data: data.frame with the original data (without outcomes)
  • scaled_data: data.frame with re-scaled data (without outcomes)
  • outcomeFact: data.frame with the original outcomes
  • outcome: data.frame with original outcomes converted as numeric
  • comp: data.frame containing the components of projected data
  • dist_mat: data.frame containing the computed distance matrix
  • dfMapper: data.frame containing the nodes, with their elements,
  • jacc: matrix of Jaccard indexes between each pair of dfMapper nodes
  • node_data_mat: data.frame with the node size and the average value