squallms
is a Bioconductor package designed to easily label and remove low-quality chromatographic features from LC/GC-MS analysis. It does this by calculating a few metrics of peak quality from the raw MS data, providing methods for grouping similar peaks together and labeling them, and editing XCMS objects so only the high-quality peaks remain.
squallms
can be installed from Bioconductor (from release 3.19 onwards):
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("squallms")
squallms
makes extensive use of the tidyverse for data shaping and organization as well as the RaMS
package for MS data extraction. It also uses the shiny
and plotly
packages for interactive data visualization.
library(squallms)
Mass-spectrometry coupled to chromatographic separation is a powerful technique for the discovery and quantification of molecules. Recent developments in automated peakpicking software have made it possible to query an entire dataset for molecular features that appear in the data as “peaks”, but these algorithms continue to have unacceptably high false positive rates that require manual review of their performance. Thus, a method for streamlining peak quality annotation in a robust and efficient way is very much in demand. This software package attempts to bridge this gap with a set of tools for the calculation of peak quality metrics and the rapid and interactive labeling of peak lists.
This package leans heavily on existing resources. Input files are required to be mzML/mzXML, mandating the use of Proteowizard’s msconvert
or other software to convert proprietary formats into these open-source versions. This package also performs no peakpicking, retention time correction, or correspondence of its own, requiring the use of XCMS, MzMine, MSDIAL, or other software for the extraction of a preliminary peak list. However, the package has been designed to augment these existing resources (in particular XCMS) by accepting multiple input formats for peak lists that are shared across various systems.
At its core, this package takes in a peak list (a data frame containing columns for sample id and feature id) and returns one with only high-quality peaks. It does this by calculating a few metrics of peak quality from the raw MS data, providing methods for rapid feature viewing and labeling, and constructing a logistic model to label any remaining features. These steps are reviewed in detail with associated code entries in the rest of this vignette below.
This demo uses data peakpicked with XCMS but it’s worth noting that any peakpicking software could be used to produce a peak list as long as it can be turned into the proper input format (see “Calculating metrics of peak quality” below). The data files used here for demonstrationg purposes are those provided with the RaMS package.
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(xcms)
library(RaMS)
})
mzML_files <- system.file("extdata", package = "RaMS") %>%
list.files(full.names = TRUE, pattern = "[A-F].mzML")
register(BPPARAM = SerialParam())
cwp <- CentWaveParam(snthresh = 0, extendLengthMSW = TRUE, integrate = 2)
obp <- ObiwarpParam(binSize = 0.1, response = 1, distFun = "cor_opt")
pdp <- PeakDensityParam(
sampleGroups = 1:3, bw = 12, minFraction = 0,
binSize = 0.001, minSamples = 0
)
xcms_major_version <- as.numeric(substr(as.character(packageVersion("xcms")), 1, 1))
if (xcms_major_version >= 4) {
library(MSnbase)
msexp <- readMSData(mzML_files, msLevel. = 1, mode = "onDisk")
fpp <- FillChromPeaksParam(ppm = 5)
} else {
library(MsExperiment)
msexp <- readMsExperiment(mzML_files, msLevel. = 1, mode = "onDisk")
fpp <- ChromPeakAreaParam()
}
xcms_filled <- msexp %>%
findChromPeaks(cwp) %>%
adjustRtime(obp) %>%
groupChromPeaks(pdp) %>%
fillChromPeaks(fpp)
This XCMS object is useful for documenting processing steps and interfacing with R-based pipelines but can be challenging to inspect and interact with. squallms
provides a function that turns this S4 object into a flat file containing the feature and peak information.
feat_peak_info <- makeXcmsObjFlat(xcms_filled) %>%
select(feature, starts_with("mz"), starts_with("rt"), filename, filepath)
feat_peak_info %>%
head() %>%
mutate(filepath = paste0(substr(filepath, 1, 13), "~")) %>%
knitr::kable(caption = "Output from makeXcmsObjFlat.")
feature | mz | mzmin | mzmax | rt | rtmin | rtmax | filename | filepath |
---|---|---|---|---|---|---|---|---|
FT001 | 90.05549 | 90.05541 | 90.05556 | 11.11593 | 10.91385 | 11.41188 | LB12HL_AB.mzML.gz | /home/biocbui~ |
FT001 | 90.05549 | 90.05544 | 90.05552 | 11.08192 | 10.87952 | 11.41167 | LB12HL_CD.mzML.gz | /home/biocbui~ |
FT001 | 90.05549 | 90.05540 | 90.05553 | 11.05225 | 10.80270 | 11.39430 | LB12HL_EF.mzML.gz | /home/biocbui~ |
FT002 | 90.05543 | 90.05523 | 90.05553 | 11.81833 | 11.53837 | 11.98883 | LB12HL_CD.mzML.gz | /home/biocbui~ |
FT002 | 90.05546 | 90.05536 | 90.05551 | 11.78373 | 11.51833 | 11.98535 | LB12HL_EF.mzML.gz | /home/biocbui~ |
FT002 | 90.05547 | 90.05518 | 90.05563 | 11.44277 | 11.36553 | 11.70645 | LB12HL_AB.mzML.gz | /home/biocbui~ |
Although comprehensive, the only information needed is actually the grouping info, the peak bounding boxes, and the filepath (feature, mz*, rt*, and filepath) because we’ll recalculate the necessary metrics from the raw peak data pulled in from the mz(X)ML files. Similarly, other peakpicking algorithms tend to produce data in a similar format and can interface with squallms
here.
We also can operate over only a subset of the data, which is quite useful for very large datasets. Quality control files or pooled samples with a similar expected quality can be used in lieu of the entire dataset which may not fit into computer memory. The chunk of code below shows a possible way to do this using the dplyr
package’s filter
command to keep only two of the files, dropping a third from the quality calculations.
feat_peak_info_subset <- feat_peak_info %>%
filter(grepl("AB|CD", filename))
Note, however, that the retention time bounds must be uncorrected. makeXcmsObjFlat
does this automatically for XCMS objects, but if corrected retention times are provided then the data extracted from the mz(X)ML files will be incorrect.
The first step is to obtain some kind of information about the chromatographic features that will allow us to distinguish between good and bad. The most intuitive metrics (in my opinion) are similarity to a bell curve and the signal-to-noise ratio. Although many implementations of these exist, the ones I’ve found most useful are described in Kumler et al. (2023). The function for this in squallms
is extractChromMetrics
, as shown below. I read in the raw MS data with RaMS first because this saves a step and can be reused for quality checks.
msdata <- grabMSdata(unique(feat_peak_info$filepath), verbosity = 0)
shape_metrics <- extractChromMetrics(feat_peak_info, ms1_data = msdata$MS1)
knitr::kable(head(shape_metrics), caption = "Format of the output from extractChromMetrics")
feature | med_mz | med_rt | med_cor | med_snr |
---|---|---|---|---|
FT001 | 90.05549 | 11.105383 | 0.9187332 | 14.425791 |
FT002 | 90.05546 | 11.756300 | 0.8074746 | 3.870094 |
FT003 | 93.07429 | 11.144350 | 0.9552928 | 17.061047 |
FT004 | 104.07101 | 10.645483 | 0.8738360 | 20.505134 |
FT005 | 104.07100 | 8.301808 | 0.9474947 | 18.936980 |
FT006 | 104.07098 | 7.373142 | 0.7835736 | 5.283807 |
shape_metrics %>%
arrange(desc(med_snr)) %>%
ggplot() +
geom_point(aes(x = med_rt, y = med_mz, color = med_cor, size = med_snr), alpha = 0.5) +
theme_bw()
The output from extractChromMetrics
is a data frame with peak shape (med_cor) and signal-to-noise (med_snr) calculations as estimated for each feature in the dataset. Unfortunately, many of these features are very low-quality, often a result of noise rather than biological signal, and should not be included in the downstream analysis. We view a random subset of the features that XCMS produced in the below code chunk:
# Set seed for reproducibility, ensuring that slice_sample always returns the
# same features for discussion below.
set.seed(123)
some_random_feats <- shape_metrics %>%
slice_sample(n = 8)
some_random_feats %>%
mutate(mzmin = med_mz - med_mz * 5 / 1e6) %>%
mutate(mzmax = med_mz + med_mz * 5 / 1e6) %>%
mutate(rtmin = med_rt - 1) %>%
mutate(rtmax = med_rt + 1) %>%
left_join(msdata$MS1, join_by(
between(y$rt, x$rtmin, x$rtmax),
between(y$mz, x$mzmin, x$mzmax)
)) %>%
qplotMS1data(color_col = "med_cor") +
geom_vline(aes(xintercept = med_rt), color = "red") +
geom_text(aes(x = Inf, y = Inf, label = paste0("SNR: ", round(med_snr))),
data = some_random_feats, hjust = 1, vjust = 1, color = "black"
) +
facet_wrap(~feature, scales = "free", nrow = 3) +
scale_color_continuous(limits = c(0, 1), name = "Peak shape metric") +
scale_y_continuous(expand = expansion(c(0, 0.25))) +
theme(legend.position.inside = c(0.82, 0.12), legend.position = "inside") +
guides(color = guide_colorbar(direction = "horizontal", title.position = "top"))