If many genes are perturbed in a population of cells, this can lead to diseases like cancer. The perturbations can happen in different ways, e.g. via mutations, copy number abberations or methylation. However, not all perturbations are observed in all samples.
Nested Effects Model-based perturbation inference (NEM\(\pi\)) uses observed perturbation profiles and gene expression data to infer unobserved perturbations and augment observed ones. The causal network of the perturbed genes (P-genes) is modelled as an adjacency matrix \(\phi\) and the genes with observed gene expression (E-genes) are modelled with the attachment \(\theta\) with \(\theta_{ij}=1\), if E-gene \(j\) is attached to S-gene \(i\). If E-gene \(j\) is attached to P-gene \(i\), \(j\) shows an effect for a perturbation of P-gene \(i\). Hence, \(\phi\theta\) predicts gene expression profiles, which can be compared to the real data. NEM\(\pi\) iteratively infers a network \(\phi\) based on gene expression profiles and a perturbation profile, and the perturbation profile based on a network \(\phi\).
Use devtools to install the latest version from github or use the BiocManahger to install the package from bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("nempi")
Load the package with the library function.
library(nempi)
We look at a small example for which we first simulate data and then infer the unobserved perturbations. We draw a random network for the perturbed genes (P-genes). Then we simulate gene expression for each sample given the subset of P-genes that has been perturbed in the sample. E.g., if P-gene A is upstream of P-gene B and P-gene A has been perturbed, all E-genes attached to B also show an effect.
library(mnem)
seed <- 8675309
Pgenes <- 10
Egenes <- 5
samples <- 100
edgeprob <- 0.5
uninform <- floor((Pgenes * Egenes) * 0.1)
Nems <- mw <- 1
noise <- 1
multi <- c(0.2, 0.1)
set.seed(seed)
simmini <- simData(Sgenes = Pgenes, Egenes = Egenes, Nems = Nems, mw = mw, nCells = samples,
uninform = uninform, multi = multi, badCells = floor(samples * 0.1), edgeprob = edgeprob)
data <- simmini$data
ones <- which(data == 1)
zeros <- which(data == 0)
data[ones] <- rnorm(length(ones), 1, noise)
data[zeros] <- rnorm(length(zeros), -1, noise)
epiNEM::HeatmapOP(data, col = "RdBu", cexRow = 0.75, cexCol = 0.75, bordercol = "transparent",
xrot = 0, dendrogram = "both")
The typical data input for NEM\(\pi\) consists of a data matrix with samples as columns and E-genes as rows. The columns are either labeled by their perturbed gene(s) or unlabeled (default: ""). After the data simulation all samples are labeled. We unlabel \(50\%\) of the sample and pretend we do not know, which P-gene has been perturbed.
lost <- sample(1:ncol(data), floor(ncol(data) * 0.5))
colnames(data)[lost] <- ""
We use NEM\(\pi\) and other methods to infer the perturbations. We use the area under the precision-recall curve as the two measures of accuracy. We also plot the NEM\(\pi\) result.
res <- nempi(data)
fit <- pifit(res, simmini, data)
print(fit$auc)
## [1] 0.7788576
ressvm <- classpi(data)
fit <- pifit(ressvm, simmini, data, propagate = FALSE)
print(fit$auc)
## [1] 0.7090377
resnn <- classpi(data, method = "nnet")
fit <- pifit(resnn, simmini, data, propagate = FALSE)
print(fit$auc)
## [1] 0.6273228
resrf <- classpi(data, method = "randomForest")
fit <- pifit(resrf, simmini, data, propagate = FALSE)
print(fit$auc)
## [1] 0.6987766
col <- rgb(seq(0, 1, length.out = 10), seq(1, 0, length.out = 10), seq(1, 0, length.out = 10))
plot(res, heatlist = list(col = "RdBu"), barlist = list(col = col))
Compared to support vector machines (svm), neural nets (nnet) and random forest (rf) class prediction, NEM\(\pi\) achieves a higher accuracy.
Note that NEM\(\pi\) is in general more powerful, if the P-genes are connected in a denser network. The other methods perform equally well for sparse or even disconnected network \(\phi\). However, they usually profit from combinatorial perturbations.
Alternatively, we can also provide NEM\(\pi\) with a probabilistic perturbation matrix \(\Gamma\) as a prior. In the rows are the potentially perturbed genes and in the columns are the samples. The entries are between \(0\) and \(1\), with the sum of all entries of a sample summing to \(1\).
Gamma <- matrix(0, Pgenes, ncol(data))
rownames(Gamma) <- seq_len(Pgenes)
colnames(Gamma) <- colnames(data)
for (i in seq_len(Pgenes)) {
Gamma[i, grep(paste0("^", i, "_|_", i, "$|_", i, "_|^", i, "$"), colnames(data))] <- 1
}
Gamma <- apply(Gamma, 2, function(x) return(x/sum(x)))
Gamma[is.na(Gamma)] <- 0
epiNEM::HeatmapOP(Gamma, col = "RdBu", cexRow = 0.75, cexCol = 0.75, bordercol = "transparent",
xrot = 0, dendrogram = "both")
colnames(data) <- sample(seq_len(Pgenes), ncol(data), replace = TRUE)
res <- nempi(data, Gamma = Gamma)
fit <- pifit(res, simmini, data)
print(fit$auc)
## [1] 0.7788576
The final perturbation matrix \(\Omega\) over all samples is slightly different from the matrix \(\Gamma\) we used as input. \(\Gamma\) only denotes the most upstream P-gene perturbed. E.g., if A is upstream of B, than \(\Gamma\) only denotes a perturbation of A, even though B is also perturbed in every samples in which A is perturbed. We can compute it by the matrix multiplication \(\Omega = \phi^T \times \Gamma\).
Omega <- t(mnem::transitive.closure(res$res$adj)) %*% res$Gamma
epiNEM::HeatmapOP(Omega, col = "RdBu", cexRow = 0.75, cexCol = 0.75, bordercol = "transparent",
xrot = 0, dendrogram = "both")