**If you use the data from this package in published research,
please cite:**

Tianle Ma, Aidong Zhang, Integrate Multi-omic Data Using Affinity Network Fusion (ANF) for Cancer Patient Clustering, https://arxiv.org/abs/1708.07136

There are three R objects included in this package:
`Wall`

, `project_ids`

and
`survival.plot`

:

inst/scripts/make-data.R explains in detail how `Wall`

,
`project_ids`

and `survival.plot`

was generated
from original data downloaded from GDC data portal.

In short, `Wall`

contains a complex list of precomputed
patient affinity (similarity) matrices. In fact, `Wall`

is a
list (five cancer types) of list (six feature normalization types:
`raw.all`

, `raw.sel`

, `log.all`

,
`log.sel`

, `vst.sel`

, `normalized`

) of
list (three feature spaces or views: `fpkm`

,
`mirna`

, and `methy450`

) of pre-computed patient
affinity matrices. (So in total, there are 90 matrices.) The rownames of
each matrix are case IDs (i.e., patient IDs), and the column names of
each matrix are the aliquot IDs (i.e., TCGA barcode, which contains the
case ID as prefix). Based on these aliquot IDs, users can download the
original data about these patients from https://portal.gdc.cancer.gov/repository. The file UUIDs
of 10328 files used for deriving `Wall`

are also provided in
inst/extdata/fileUUIDs.csv

`project_ids`

is a named character vector that maps
case_id to TCGA project_id. Because each project_id corresponds to one
disease type, `project_ids`

contains information about
patient disease type information. Since our goal is to cluster cancer
patients into disease types, `project_ids`

is used for
evaluating clustering results, such as calculating Normalized Mutual
Information (NMI) and Adjusted Rand Index (ARI).

`surv.plot`

is a data.frame containing patient survival
data for survival analysis downloaded from https://portal.gdc.cancer.gov/exploration?searchTableTab=genes
(overall survival plot data), providing an “indirect” way to evaluate
clustering results.

See paper https://arxiv.org/abs/1708.07136 for more explanation.

`## Loading required package: BiocGenerics`

`## Loading required package: generics`

```
##
## Attaching package: 'generics'
```

```
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
```

```
##
## Attaching package: 'BiocGenerics'
```

```
## The following objects are masked from 'package:generics':
##
## intersect, setdiff, setequal, union
```

```
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
```

```
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff,
## setequal, table, tapply, union, unique, unsplit, which.max,
## which.min
```

`## Loading required package: AnnotationHub`

`## Loading required package: BiocFileCache`

`## Loading required package: dbplyr`

`## see ?HarmonizedTCGAData and browseVignettes('HarmonizedTCGAData') for documentation`

`## loading from cache`

```
## see ?HarmonizedTCGAData and browseVignettes('HarmonizedTCGAData') for documentation
## loading from cache
```

```
## see ?HarmonizedTCGAData and browseVignettes('HarmonizedTCGAData') for documentation
## loading from cache
```

Comprehensive molecular profiling data of dozens of cancer types have been made available by TCGA. We selected cancers from five primary sites (in the following we refer to the names of primary sites as cancer types). Each of these five cancer types has at least two known disease types. For other cancer types, we do not know the groundtruth disease types. That’s why they were not included in the package.

```
## [1] "adrenal_gland" "lung" "uterus" "kidney"
## [5] "colorectal"
```

We are trying to cluster patients into groups and identify cancer subtypes (i.e., disease types) using multi-omic data. Here we include three types of data: gene expression, miRNA expression and DNA methylation beta values. We selected 2582 patients who has all these data available and included them in this package.

`names(Wall[[1]][[1]]) #Note: "fpkm" refers to gene expression measurement, which can be HTSeq-Counts, transformed HTSeq-Counts (log2 transformation or variance-stabilizing transformation), and FPKM values. Sorry for the confusing name.`

`## [1] "fpkm" "mirnas" "methy450"`

For raw counts data (measuring gene expression and miRNA expression), we can perform feature selection (e.g., differential expression analysis) and feature transformation (e.g., log2 transformation and variance-stabilizing transformation). We have included six feature types in this package:

`raw.all`

: Raw counts of all genes or miRNAs

`raw.sel`

: Raw counts of selected (differentially
expressed) genes or miRNAs (Differential expression analysis was
performed using DESeq2)

`log.all`

: Log transformation of raw counts of all genes
or miRNAs

`log.sel`

: Log transformation of raw counts of selected
(differentially expressed) genes or miRNAs

`vst.sel`

: Variance stabilizing transformation of raw
counts of selected genes or miRNAs

`normalized`

: FPKM values of all genes or normalized
counts for all miRNAs

```
## [1] "raw.all" "raw.sel" "log.all" "log.sel" "vst.sel"
## [6] "normalized"
```

So for each of the five cancer types, we have the above six feature
types (`raw.all`

, `raw.sel`

, `log.all`

,
`log.sel`

, `vst.sel`

and `normalized`

).
For each feature type, we have three “views”: gene expression (named
`fpkm`

, i.e., `names(Wall[[1]][[1]])[1]`

), miRNA
expression (`mirnas`

), and DNA methylation beta values
(`methy450`

). In total, there are 90 matrices contained in
`Wall`

. (For DNA methylation, we directly used beta values
without feature transformation. So the six `methy450`

matrices are the same. Thus we actually only have 65 unique matrices in
`Wall`

.)

We can perform spectral clustering on a patient affinity matrix. Take adrend_gland cancer for example. We can cluseter patients using affinity matrix derived from log2 transformation of raw counts of differentially expressed genes.

```
library(ANF)
affinity.mat <- Wall[["adrenal_gland"]][["log.sel"]][["fpkm"]]
labels <- spectral_clustering(affinity.mat, k = 2)
```

Since we know true disease types, which correspond to project ids in
`project_ids`

, we can calculate NMI and ARI.

```
true.disease.types <- as.factor(project_ids[rownames(affinity.mat)])
print(table(labels, true.disease.types))
```

```
## true.disease.types
## labels TCGA-ACC TCGA-PCPG
## 1 0 176
## 2 76 1
```

```
nmi <- igraph::compare(true.disease.types, labels, method = "nmi")
adjusted.rand = igraph::compare(true.disease.types, labels, method = "adjusted.rand")
# we can also calculate p-value using `surv.plot` data
surv.plot <- surv.plot[rownames(affinity.mat), ]
f <- survival::Surv(surv.plot$time, !surv.plot$censored)
fit <- survival::survdiff(f ~ labels)
pval <- stats::pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
print(paste("NMI =", nmi, ", ARI =", adjusted.rand, ", p-val =", pval))
```

`## [1] "NMI = 0.962881312440823 , ARI = 0.983811442595234 , p-val = 1.46669868477277e-07"`

In ANF package, We have provided a function `eval_clu`

that streamlines the above process from spectral clustering to
calculating NMI, ARI and p-value. Here is an example of how to use
`eval_clu`

:

`## nmi = 0.962881312440823`

`## adjusted.rand = 0.983811442595234`

`## survdiff p value = 1.46669868477277e-07`

```
## labels
## true_class 1 2
## TCGA-ACC 0 76
## TCGA-PCPG 176 1
```

```
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
## [75] 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
```

For adrenal_gland cancer, we only misclassify one out of 253 patients using this affinity matrix. That is a pretty good result (In fact, this the best result we can achieve. Users can try using other matrices and compare the results). However, for many cases, using a single affinity matrix does a “terrible” job in clustering patients into correct disease types. Take uterus cancer for example (the NMI is near 0).

`## nmi = 0.0563331568619408`

`## adjusted.rand = -0.0550627074255129`

`## survdiff p value = NA`

```
## labels
## true_class 1 2
## TCGA-UCEC 153 268
## TCGA-UCS 3 51
```

Instead of using one affinity matrix, we can “fuse” multiple affinity matrices using ANF package, and then perform spectral clustering on the fused affinity matrix.

Let’s take uterus cancer for example.

```
# fuse three matrices: "fpkm" (gene expression), "mirnas" (miRNA expression) and "methy450" (DNA methylation)
fused.mat <- ANF(Wall = Wall$uterus$raw.all)
# Spectral clustering on fused patient affinity matrix
labels <- spectral_clustering(A = fused.mat, k = 2)
# Or we can directly evaluate clustering results using function `eval_clu`, which calls `spectral_clustering` and calculate NMI and ARI (and p-value if patient survival data is available. `surv.plot` does not contain information for uterus cancer patients)
res <- eval_clu(true_class = project_ids[rownames(fused.mat)], w = fused.mat)
```

`## nmi = 0.48526760996706`

`## adjusted.rand = 0.684204025562809`

`## survdiff p value = NA`

```
## labels
## true_class 1 2
## TCGA-UCEC 410 11
## TCGA-UCS 14 40
```

As we can see, spectral clustering on the fused affinity matrix significantly improves the results for uterus cancer. This demonstrate the power of ANF. The paper https://arxiv.org/abs/1708.07136 have provided more results.