Contents

squallms is a Bioconductor package designed to easily label and remove low-quality chromatographic features from LC/GC-MS analysis. It does this by calculating a few metrics of peak quality from the raw MS data, providing methods for grouping similar peaks together and labeling them, and editing XCMS objects so only the high-quality peaks remain.

1 Setup

1.1 Installation

squallms can be installed from Bioconductor (from release 3.19 onwards):

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("squallms")

squallms makes extensive use of the tidyverse for data shaping and organization as well as the RaMS package for MS data extraction. It also uses the shiny and plotly packages for interactive data visualization.

library(squallms)

2 Introduction

Mass-spectrometry coupled to chromatographic separation is a powerful technique for the discovery and quantification of molecules. Recent developments in automated peakpicking software have made it possible to query an entire dataset for molecular features that appear in the data as “peaks”, but these algorithms continue to have unacceptably high false positive rates that require manual review of their performance. Thus, a method for streamlining peak quality annotation in a robust and efficient way is very much in demand. This software package attempts to bridge this gap with a set of tools for the calculation of peak quality metrics and the rapid and interactive labeling of peak lists.

This package leans heavily on existing resources. Input files are required to be mzML/mzXML, mandating the use of Proteowizard’s msconvert or other software to convert proprietary formats into these open-source versions. This package also performs no peakpicking, retention time correction, or correspondence of its own, requiring the use of XCMS, MzMine, MSDIAL, or other software for the extraction of a preliminary peak list. However, the package has been designed to augment these existing resources (in particular XCMS) by accepting multiple input formats for peak lists that are shared across various systems.

At its core, this package takes in a peak list (a data frame containing columns for sample id and feature id) and returns one with only high-quality peaks. It does this by calculating a few metrics of peak quality from the raw MS data, providing methods for rapid feature viewing and labeling, and constructing a logistic model to label any remaining features. These steps are reviewed in detail with associated code entries in the rest of this vignette below.

2.1 Peakpicking with XCMS

This demo uses data peakpicked with XCMS but it’s worth noting that any peakpicking software could be used to produce a peak list as long as it can be turned into the proper input format (see “Calculating metrics of peak quality” below). The data files used here for demonstrationg purposes are those provided with the RaMS package.

suppressPackageStartupMessages({
    library(dplyr)
    library(tidyr)
    library(tibble)
    library(ggplot2)
    library(xcms)
    library(RaMS)
})
mzML_files <- system.file("extdata", package = "RaMS") %>%
    list.files(full.names = TRUE, pattern = "[A-F].mzML")

register(BPPARAM = SerialParam())
cwp <- CentWaveParam(snthresh = 0, extendLengthMSW = TRUE, integrate = 2)
obp <- ObiwarpParam(binSize = 0.1, response = 1, distFun = "cor_opt")
pdp <- PeakDensityParam(
    sampleGroups = 1:3, bw = 12, minFraction = 0,
    binSize = 0.001, minSamples = 0
)


xcms_major_version <- as.numeric(substr(as.character(packageVersion("xcms")), 1, 1))
if (xcms_major_version >= 4) {
    library(MSnbase)
    msexp <- readMSData(mzML_files, msLevel. = 1, mode = "onDisk")
    fpp <- FillChromPeaksParam(ppm = 5)
} else {
    library(MsExperiment)
    msexp <- readMsExperiment(mzML_files, msLevel. = 1, mode = "onDisk")
    fpp <- ChromPeakAreaParam()
}

xcms_filled <- msexp %>%
    findChromPeaks(cwp) %>%
    adjustRtime(obp) %>%
    groupChromPeaks(pdp) %>%
    fillChromPeaks(fpp)

This XCMS object is useful for documenting processing steps and interfacing with R-based pipelines but can be challenging to inspect and interact with. squallms provides a function that turns this S4 object into a flat file containing the feature and peak information.

feat_peak_info <- makeXcmsObjFlat(xcms_filled) %>%
    select(feature, starts_with("mz"), starts_with("rt"), filename, filepath)
feat_peak_info %>%
    head() %>%
    mutate(filepath = paste0(substr(filepath, 1, 13), "~")) %>%
    knitr::kable(caption = "Output from makeXcmsObjFlat.")

Table 1: Output from makeXcmsObjFlat.
feature mz mzmin mzmax rt rtmin rtmax filename filepath
FT001 90.05549 90.05541 90.05556 11.11593 10.91385 11.41188 LB12HL_AB.mzML.gz /home/biocbui~
FT001 90.05549 90.05544 90.05552 11.08192 10.87952 11.41167 LB12HL_CD.mzML.gz /home/biocbui~
FT001 90.05549 90.05540 90.05553 11.05225 10.80270 11.39430 LB12HL_EF.mzML.gz /home/biocbui~
FT002 90.05543 90.05523 90.05553 11.81833 11.53837 11.98883 LB12HL_CD.mzML.gz /home/biocbui~
FT002 90.05546 90.05536 90.05551 11.78373 11.51833 11.98535 LB12HL_EF.mzML.gz /home/biocbui~
FT002 90.05547 90.05518 90.05563 11.44277 11.36553 11.70645 LB12HL_AB.mzML.gz /home/biocbui~

Although comprehensive, the only information needed is actually the grouping info, the peak bounding boxes, and the filepath (feature, mz*, rt*, and filepath) because we’ll recalculate the necessary metrics from the raw peak data pulled in from the mz(X)ML files. Similarly, other peakpicking algorithms tend to produce data in a similar format and can interface with squallms here.

We also can operate over only a subset of the data, which is quite useful for very large datasets. Quality control files or pooled samples with a similar expected quality can be used in lieu of the entire dataset which may not fit into computer memory. The chunk of code below shows a possible way to do this using the dplyr package’s filter command to keep only two of the files, dropping a third from the quality calculations.

feat_peak_info_subset <- feat_peak_info %>%
    filter(grepl("AB|CD", filename))

Note, however, that the retention time bounds must be uncorrected. makeXcmsObjFlat does this automatically for XCMS objects, but if corrected retention times are provided then the data extracted from the mz(X)ML files will be incorrect.

3 Calculating metrics of peak quality (extractChromMetrics)

The first step is to obtain some kind of information about the chromatographic features that will allow us to distinguish between good and bad. The most intuitive metrics (in my opinion) are similarity to a bell curve and the signal-to-noise ratio. Although many implementations of these exist, the ones I’ve found most useful are described in Kumler et al. (2023). The function for this in squallms is extractChromMetrics, as shown below. I read in the raw MS data with RaMS first because this saves a step and can be reused for quality checks.

msdata <- grabMSdata(unique(feat_peak_info$filepath), verbosity = 0)
shape_metrics <- extractChromMetrics(feat_peak_info, ms1_data = msdata$MS1)
knitr::kable(head(shape_metrics), caption = "Format of the output from extractChromMetrics")

Table 2: Format of the output from extractChromMetrics
feature med_mz med_rt med_cor med_snr
FT001 90.05549 11.105383 0.9187332 14.425791
FT002 90.05546 11.756300 0.8074746 3.870094
FT003 93.07429 11.144350 0.9552928 17.061047
FT004 104.07101 10.645483 0.8738360 20.505134
FT005 104.07100 8.301808 0.9474947 18.936980
FT006 104.07098 7.373142 0.7835736 5.283807
shape_metrics %>%
    arrange(desc(med_snr)) %>%
    ggplot() +
    geom_point(aes(x = med_rt, y = med_mz, color = med_cor, size = med_snr), alpha = 0.5) +
    theme_bw()
Mass by time diagram of all features returned by XCMS. Points have been colored by their calculated peak shape metric and are sized by their calculated signal-to-noise ratio. Large, pale points tend to be higher quality than small dark ones.

Figure 1: Mass by time diagram of all features returned by XCMS
Points have been colored by their calculated peak shape metric and are sized by their calculated signal-to-noise ratio. Large, pale points tend to be higher quality than small dark ones.

The output from extractChromMetrics is a data frame with peak shape (med_cor) and signal-to-noise (med_snr) calculations as estimated for each feature in the dataset. Unfortunately, many of these features are very low-quality, often a result of noise rather than biological signal, and should not be included in the downstream analysis. We view a random subset of the features that XCMS produced in the below code chunk:

# Set seed for reproducibility, ensuring that slice_sample always returns the
# same features for discussion below.
set.seed(123)

some_random_feats <- shape_metrics %>%
    slice_sample(n = 8)
some_random_feats %>%
    mutate(mzmin = med_mz - med_mz * 5 / 1e6) %>%
    mutate(mzmax = med_mz + med_mz * 5 / 1e6) %>%
    mutate(rtmin = med_rt - 1) %>%
    mutate(rtmax = med_rt + 1) %>%
    left_join(msdata$MS1, join_by(
        between(y$rt, x$rtmin, x$rtmax),
        between(y$mz, x$mzmin, x$mzmax)
    )) %>%
    qplotMS1data(color_col = "med_cor") +
    geom_vline(aes(xintercept = med_rt), color = "red") +
    geom_text(aes(x = Inf, y = Inf, label = paste0("SNR: ", round(med_snr))),
        data = some_random_feats, hjust = 1, vjust = 1, color = "black"
    ) +
    facet_wrap(~feature, scales = "free", nrow = 3) +
    scale_color_continuous(limits = c(0, 1), name = "Peak shape metric") +
    scale_y_continuous(expand = expansion(c(0, 0.25))) +
    theme(legend.position.inside = c(0.82, 0.12), legend.position = "inside") +
    guides(color = guide_colorbar(direction = "horizontal", title.position = "top"))