DEP 1.29.0
Proteomics data suffer from a high rate of missing values, which need to be accounted for. Different methods have been applied to deal with this issue, including multiple imputation methods (see for example Lazar et al). The different options to deal with missing values in DEP are described in this vignette.
To exemplify the missing value handling, we work with a simulated dataset. This is very useful, because we know the ground truth of our dataset, meaning we know which proteins are belonging to the background (null distribution) and which proteins are differentially expressed between the control and sample conditions.
We generate a dataset with 3300 proteins, of which 300 proteins are differentially expressed. The variables used to generate the data are depicted below.
# Variables
replicates = 3
bg_proteins = 3000
DE_proteins = 300
log2_mean_bg = 27
log2_sd_bg = 2
log2_mean_DE_control = 25
log2_mean_DE_treatment = 30
log2_sd_DE = 2
# Loading DEP and packages required for data handling
library("DEP")
library("dplyr")
library("tidyr")
library("purrr")
library("ggplot2")
library("SummarizedExperiment")
# Background data (sampled from null distribution)
sim_null <- data_frame(
name = paste0("bg_", rep(1:bg_proteins, rep((2*replicates), bg_proteins))),
ID = rep(1:bg_proteins, rep((2*replicates), bg_proteins)),
var = rep(c("control_1", "control_2", "control_3",
"treatment_1", "treatment_2", "treatment_3"), bg_proteins),
val = 2^rnorm(2*replicates*bg_proteins, mean = log2_mean_bg, sd = log2_sd_bg))
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Data for DE proteins (sampled from alternative distribution)
sim_diff <- rbind(
data_frame(
name = paste0("DE_", rep(1:DE_proteins, rep(replicates, DE_proteins))),
ID = rep((bg_proteins+1):(bg_proteins+DE_proteins),
rep(replicates, DE_proteins)),
var = rep(c("control_1", "control_2", "control_3"), DE_proteins),
val = 2^rnorm(replicates*DE_proteins,
mean = log2_mean_DE_control, sd = log2_sd_DE)),
data_frame(
name = paste0("DE_", rep(1:DE_proteins, rep(replicates, DE_proteins))),
ID = rep((bg_proteins+1):(bg_proteins+DE_proteins),
rep(replicates, DE_proteins)),
var = rep(c("treatment_1", "treatment_2", "treatment_3"), DE_proteins),
val = 2^rnorm(replicates*DE_proteins,
mean = log2_mean_DE_treatment, sd = log2_sd_DE)))
# Combine null and DE data
sim <- rbind(sim_null, sim_diff) %>%
spread(var, val) %>%
arrange(ID)
# Generate experimental design
experimental_design <- data_frame(
label = colnames(sim)[!colnames(sim) %in% c("name", "ID")],
condition = c(rep("control", replicates),
rep("treatment", replicates)),
replicate = rep(1:replicates, 2))
Data can be missing at random (MAR) or missing not at random (MNAR). MAR means that values are randomly missing from all samples. In the case of MNAR, values are missing in specific samples and/or for specific proteins. For example, certain proteins might not be quantified in specific conditions, because they are below the detection limit in these specific samples.
To mimick these two types of missing values, we introduce missing values randomly over all data points (MAR) and we introduce missing values in the control samples of 100 differentially expressed proteins (MNAR). The variables used to introduce missing values are depicted below.
# Variables
MAR_fraction = 0.05
MNAR_proteins = 100
# Generate a MAR matrix
MAR_matrix <- matrix(
data = sample(c(TRUE, FALSE),
size = 2*replicates*(bg_proteins+DE_proteins),
replace = TRUE,
prob = c(MAR_fraction, 1-MAR_fraction)),
nrow = bg_proteins+DE_proteins,
ncol = 2*replicates)
# Introduce missing values at random (MAR)
controls <- grep("control", colnames(sim))
treatments <- grep("treatment", colnames(sim))
sim[, c(controls, treatments)][MAR_matrix] <- 0
sim$MAR <- apply(MAR_matrix, 1, any)
# Introduce missing values not at random (MNAR)
DE_protein_IDs <- grep("DE", sim$name)
sim[DE_protein_IDs[1:MNAR_proteins], controls] <- 0
sim$MNAR <- FALSE
sim$MNAR[DE_protein_IDs[1:MNAR_proteins]] <- TRUE
The data is stored in a SummarizedExperiment, as described in the ‘Introduction to DEP’ vignette.
# Generate a SummarizedExperiment object
sim_unique_names <- make_unique(sim, "name", "ID", delim = ";")
se <- make_se(sim_unique_names, c(controls, treatments), experimental_design)
A first consideration with missing values is whether or not to filter out proteins with too many missing values.
The number of proteins quantified over the samples can be visualized to investigate the extend of missing values.
# Plot a barplot of the protein quantification overlap between samples
plot_frequency(se)
Many proteins are quantified in all six samples and only a small subset of proteins were detected in less than half of the samples.
We can choose to not filter out any proteins at all, filter for only the proteins without missing values, filter for proteins with a certain fraction of quantified samples, and for proteins that are quantified in all replicates of at least one condition.
# No filtering
no_filter <- se
# Filter for proteins that are quantified in all replicates of at least one condition
condition_filter <- filter_proteins(se, "condition", thr = 0)
# Filter for proteins that have no missing values
complete_cases <- filter_proteins(se, "complete")
# Filter for proteins that are quantified in at least 2/3 of the samples.
frac_filtered <- filter_proteins(se, "fraction", min = 0.66)
To check the consequences of filtering, we calculate the number of background and DE proteins left in the dataset.
# Function to extract number of proteins
number_prots <- function(se) {
names <- rownames(get(se))
data_frame(Dataset = se,
bg_proteins = sum(grepl("bg", names)),
DE_proteins = sum(grepl("DE", names)))
}
# Number of bg and DE proteins still included
objects <- c("no_filter",
"condition_filter",
"complete_cases",
"frac_filtered")
map_df(objects, number_prots)
## # A tibble: 4 × 3
## Dataset bg_proteins DE_proteins
## <chr> <int> <int>
## 1 no_filter 3000 300
## 2 condition_filter 2937 267
## 3 complete_cases 2147 141
## 4 frac_filtered 2997 200
For our simulated dataset, filtering for complete cases drastically reduces the number of proteins, only keeping half of the DE proteins. Filtering protein to be quantified in at least 2/3 of the samples, keeps many more DE proteins and filters out all proteins with values MNAR. Filtering for proteins quantified in all replicates of at least one condition only filters out a minimal amount of DE proteins.
The data is scaled and variance stabilized using vsn.
# Scale and variance stabilize
no_filter <- normalize_vsn(se)
condition_filter <- normalize_vsn(condition_filter)
complete_cases <- normalize_vsn(complete_cases)
frac_filtered <- normalize_vsn(frac_filtered)
# Mean versus Sd plot
meanSdPlot(no_filter)
A second important consideration with missing values is whether or not to impute the missing values.
MAR and MNAR (see Introduce missing values) require different imputation methods. See the MSnbase vignette and more specifically the impute function descriptions for detailed information.
To explore the pattern of missing values in the data, a heatmap can be plotted indicating whether values are missing (0) or not (1). Only proteins with at least one missing value are visualized.
# Plot a heatmap of proteins with missing values
plot_missval(no_filter)
The missing values seem to be randomly distributed across the samples (MAR). However, we do note a block of values that are missing in all control samples (bottom left side of the heatmap). These proteins might have missing values not at random (MNAR).
To check whether missing values are biased to lower intense proteins, the densities and cumulative fractions are plotted for proteins with and without missing values.
# Plot intensity distributions and cumulative fraction of proteins
# with and without missing values
plot_detect(no_filter)
In our example data, there is no clear difference between the two distributions.
DEP borrows the imputation functions from MSnbase. See the MSnbase vignette and more specifically the impute function description for more information on the imputation methods.
# All possible imputation methods are printed in an error, if an invalid function name is given.
impute(no_filter, fun = "")
## Error in match.arg(fun): 'arg' should be one of "bpca", "knn", "QRILC", "MLE", "MinDet", "MinProb", "man", "min", "zero", "mixed", "nbavg"
Some examples of imputation methods.
# No imputation
no_imputation <- no_filter
# Impute missing data using random draws from a
# Gaussian distribution centered around a minimal value (for MNAR)
MinProb_imputation <- impute(no_filter, fun = "MinProb", q = 0.01)
# Impute missing data using random draws from a
# manually defined left-shifted Gaussian distribution (for MNAR)
manual_imputation <- impute(no_filter, fun = "man", shift = 1.8, scale = 0.3)
# Impute missing data using the k-nearest neighbour approach (for MAR)
knn_imputation <- impute(no_filter, fun = "knn", rowmax = 0.9)
The effect of data imputation on the distributions can be visualized.
# Plot intensity distributions before and after imputation
plot_imputation(no_filter, MinProb_imputation,
manual_imputation, knn_imputation)