Package: MsFeatures
Authors: Johannes Rainer [aut, cre] (https://orcid.org/0000-0002-6977-7147)
Last modified: 2024-04-23 17:11:02.257505
Compiled: Wed May 1 18:20:14 2024

1 Introduction

Electrospray ionization (ESI) is commonly used in mass spectrometry (MS)-based metabolomics to generate ions from the compounds to enable their detection by the MS instrument. Ionization can generate different ions (adducts) of the same original compound which are then reported as separate MS features with different mass-to-charge ratios (m/z). To reduce data set complexity (and to aid subsequent annotation steps) it is advisable to group features which supposedly represent signal from the same original compound into a single entity.

The MsFeatures package provides key concepts and functions for this feature grouping. Methods are implemented for base R objects as well as for Bioconductor’s SummarizedExperiment class. See also the description of the general grouping concept on the package webpage for more information. Additional grouping methodology is expected to be implemented in other R packages for data objects with additional LC-MS related information, such as the XCMSnExp object in the xcms package. The implementation for the SummarizedExperiment provided in this package can be used as a reference for these additional methodology.

After definition of the feature groups, the QFeatures package could be used to aggregate their abundances into a single signal.

2 Installation

The package can be installed with the BiocManager package. To install BiocManager use install.packages("BiocManager") and, after that, BiocManager::install("MsFeatures") to install this package.

3 Mass Spectrometry Feature Grouping

Features from the same originating compound inherit its characteristics including its retention time (for LC or GC-MS experiments) and abundance/intensity. For the latter it is expected that features from the same compound have the same pattern of feature values/abundances across samples.

The MsFeatures package defines the groupFeatures method to perform MS feature grouping based on the provided input data and a parameter object which selects and defines the feature grouping algorithm. This algorithm is supposed to assign individual features to a (single) feature group. Currently two feature grouping approaches are implemented:

  • SimilarRtimeParam: group features based on similar retention times.
  • AbundanceSimilarityParam: group features based on similar feature values/abundances across samples.

Additional algorithms, e.g. by considering also differences in features’ m/z values matching expected ions/adducts or isotopes, may be implemented in future in this or other packages.

In this document we demonstrate the feature grouping functionality on a simple toy data set used also in the xcms package with the raw data being provided in the faahKO data package. This data set consists of samples from 4 mice with knock-out of the fatty acid amide hydrolase (FAAH) and 4 wild type mice. Pre-processing of this data set is described in detail in the xcms vignette of the xcms package. Below we load all required packages and the result from this pre-processing which is provided as a SummarizedExperiment within this package and can be loaded with data(se).

library(MsFeatures)
library(SummarizedExperiment)

data("se")

Before performing the feature grouping we inspect the result object. Feature properties and definitions can be accessed with rowData, the feature abundances with assay.

rowData(se)
## DataFrame with 225 rows and 11 columns
##           mzmed     mzmin     mzmax     rtmed     rtmin     rtmax    npeaks
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT001     200.1     200.1     200.1   2901.63   2880.73   2922.53         2
## FT002     205.0     205.0     205.0   2789.39   2782.30   2795.36         8
## FT003     206.0     206.0     206.0   2788.73   2780.73   2792.86         7
## FT004     207.1     207.1     207.1   2718.12   2713.21   2726.70         7
## FT005     219.1     219.1     219.1   2518.82   2517.40   2520.81         3
## ...         ...       ...       ...       ...       ...       ...       ...
## FT221    591.30     591.3     591.3   3005.03   2992.87   3006.05         5
## FT222    592.15     592.1     592.3   3022.11   2981.91   3107.59         6
## FT223    594.20     594.2     594.2   3418.16   3359.10   3427.90         3
## FT224    595.25     595.2     595.3   3010.15   2992.87   3013.77         6
## FT225    596.20     596.2     596.2   2997.91   2992.87   3002.95         2
##              KO        WT            peakidx  ms_level
##       <numeric> <numeric>             <list> <integer>
## FT001         2         0  287, 679,1577,...         1
## FT002         4         4     47,272,542,...         1
## FT003         3         4     32,259,663,...         1
## FT004         4         3     19,249,525,...         1
## FT005         1         2  639, 788,1376,...         1
## ...         ...       ...                ...       ...
## FT221         2         3    349,684,880,...         1
## FT222         1         3     86,861,862,...         1
## FT223         1         2  604, 985,1543,...         1
## FT224         2         3     67,353,876,...         1
## FT225         0         2  866,1447,1643,...         1
head(assay(se))
##        ko15.CDF   ko16.CDF   ko21.CDF  ko22.CDF  wt15.CDF  wt16.CDF  wt21.CDF
## FT001  159738.1  506848.88  113441.08  169955.6  216096.6  145509.7  230477.9
## FT002 1924712.0 1757150.96 1383416.72 1180288.2 2129885.1 1634342.0 1623589.2
## FT003  213659.3  289500.67  162897.19  178285.7  253825.6  241844.4  240606.0
## FT004  349011.5  451863.66  343897.76  208002.8  364609.8  360908.9  223322.5
## FT005  135978.5   25524.79   71530.84  107348.5  223951.8  134398.9  190203.8
## FT006  286221.4  289908.23  164008.97  149097.6  255697.7  311296.8  366441.5
##         wt22.CDF
## FT001  140551.30
## FT002 1354004.93
## FT003  185399.47
## FT004  221937.53
## FT005   84772.92
## FT006  271128.02

Columns "mzmed" and "rtmed" in the object’s rowData provide the m/z and retention time which characterizes each feature. In total 225 features are available in the present data set, with many of them most likely representing signal from different ions of the same compound. We aim to identify these based on the following assumptions of the LC-MS data:

  • Features (ions) of the same compound should have similar retention time.
  • The abundance of features (ions) of the same compound should have a similar pattern across samples, i.e. if a compound is highly concentrated in one sample and low in another, all ions from it also should follow the same pattern.

As detailed in the general grouping concept, the feature grouping implemented in MsFeatures is by default intended to be used as a stepwise approach in which each groupFeatures call further sub-groups (and thus refines) previously defined feature groups. This enables to either use a single algorithm for the feature grouping or to build a feature grouping pipeline by combining different algorithms. In our example we perform first a initial grouping of features based on similar retention time and subsequently further refine these feature groups by requiring also similarity of feature values across samples.

Note that it would also be possible to perform the grouping only on a subset of features instead of the full data set. An example is provided in the last section of this vignette.

3.1 Grouping of features by similar retention time

The most intuitive and simple way to group LC-MS features is based on their retention times: ionization of the compounds happens after the LC and thus all ions from the same compound should have the same retention time. The plot below shows the retention times (and m/z) of all features from the present data set.

plot(rowData(se)$rtmed, rowData(se)$mzmed,
     xlab = "retention time", ylab = "m/z", main = "features",
     col = "#00000060")
grid()
Plot of retention times and m/z for all features in the data set.

Figure 1: Plot of retention times and m/z for all features in the data set

As we can see there are several features with a similar retention time, especially for lower retention times. By using groupFeatures with the SimilarRtimeParam we can next group features if their difference in retention time is below a certain threshold. This approach will however not only group features representing ions from the same compound together, but also features from different, but co-eluting compounds (i.e. different compounds with the same retention time). Thus feature groups defined by this algorithm should be further refined based on another feature property to reduce false positives.

For the present example, we group features with a maximal difference in retention time of 10 seconds into a feature group. We also have to specify the column in the object’s rowData which contains the retention times for the features.

se <- groupFeatures(se, param = SimilarRtimeParam(10), rtime = "rtmed")

The groupFeatures call on the SummarizedExperiment added the results of the grouping into a new column called "feature_group" in the object’s rowData. This column can also be directly accessed with the featureGroups function. Below we print the number of features for each feature grouped defined by the SimilarRtimeParam approach.

table(featureGroups(se))
## 
## FG.001 FG.002 FG.003 FG.004 FG.005 FG.006 FG.007 FG.008 FG.009 FG.010 FG.011 
##      3      3      3      3      2      4      5      6      4      2      5 
## FG.012 FG.013 FG.014 FG.015 FG.016 FG.017 FG.018 FG.019 FG.020 FG.021 FG.022 
##      3      4      3      5      3      3      5      3      3      3      3 
## FG.023 FG.024 FG.025 FG.026 FG.027 FG.028 FG.029 FG.030 FG.031 FG.032 FG.033 
##      3      3      6      3      3      3      3      2      3      3      4 
## FG.034 FG.035 FG.036 FG.037 FG.038 FG.039 FG.040 FG.041 FG.042 FG.043 FG.044 
##      3      2      2      3      2      2      4      2      2      2      3 
## FG.045 FG.046 FG.047 FG.048 FG.049 FG.050 FG.051 FG.052 FG.053 FG.054 FG.055 
##      4      2      3      3      3      2      2      3      4      2      3 
## FG.056 FG.057 FG.058 FG.059 FG.060 FG.061 FG.062 FG.063 FG.064 FG.065 FG.066 
##      2      2      2      3      2      3      2      2      2      3      2 
## FG.067 FG.068 FG.069 FG.070 FG.071 FG.072 FG.073 FG.074 FG.075 FG.076 FG.077 
##      2      3      2      2      2      3      2      2      1      1      1 
## FG.078 FG.079 FG.080 FG.081 FG.082 FG.083 FG.084 
##      1      1      1      1      1      1      1

We also calculate the mean retention time for all the feature groups and order them increasingly.

split(rowData(se)$rtmed, featureGroups(se)) |>
vapply(FUN = mean, numeric(1)) |>
sort()
##   FG.005   FG.012   FG.062   FG.076   FG.069   FG.007   FG.018   FG.027 
## 2506.301 2511.821 2519.745 2595.503 2623.786 2684.671 2686.835 2688.495 
##   FG.064   FG.057   FG.066   FG.074   FG.003   FG.023   FG.004   FG.001 
## 2694.089 2718.799 2731.648 2751.041 2787.291 2788.022 2788.625 2788.949 
##   FG.011   FG.048   FG.014   FG.075   FG.084   FG.080   FG.060   FG.021 
## 2790.232 2793.065 2799.149 2832.560 2860.496 2871.735 2881.899 2899.457 
##   FG.063   FG.033   FG.071   FG.079   FG.050   FG.051   FG.039   FG.010 
## 2915.107 2924.922 2936.602 2953.627 2998.303 3005.543 3009.864 3011.159 
##   FG.040   FG.019   FG.006   FG.067   FG.036   FG.078   FG.037   FG.013 
## 3015.635 3022.568 3027.254 3038.254 3044.682 3057.242 3075.361 3079.912 
##   FG.017   FG.083   FG.044   FG.073   FG.070   FG.008   FG.068   FG.082 
## 3088.353 3114.721 3128.047 3142.710 3161.133 3170.219 3184.355 3203.298 
##   FG.029   FG.009   FG.055   FG.002   FG.042   FG.016   FG.035   FG.054 
## 3217.009 3226.662 3242.698 3258.465 3261.208 3263.141 3265.660 3284.004 
##   FG.043   FG.038   FG.052   FG.077   FG.056   FG.028   FG.061   FG.022 
## 3289.081 3295.654 3300.775 3311.799 3321.354 3324.717 3335.200 3353.939 
##   FG.031   FG.081   FG.041   FG.026   FG.015   FG.030   FG.047   FG.034 
## 3358.597 3372.908 3383.014 3395.552 3405.585 3407.192 3410.212 3417.383 
##   FG.024   FG.049   FG.059   FG.058   FG.045   FG.046   FG.065   FG.053 
## 3422.675 3428.323 3435.389 3442.423 3447.411 3457.582 3465.580 3472.813 
##   FG.032   FG.020   FG.025   FG.072 
## 3478.609 3483.910 3497.730 3510.664

Note that the differences in retention times between the feature groups can be smaller than the used cut-off (10 seconds in our case). If we were not happy with this feature grouping and would like to repeat it we would need to drop the "feature_group" column in the object’s rowData with rowData(se)$feature_group <- NULL and repeat the feature grouping with different settings. This is required, because by default groupFeatures will refine previous feature grouping results but not overwrite them.

As stated above, this initial grouping on retention times put features from the same, but also from different co-eluting compounds into the same feature group. We thus next refine the feature groups requiring also feature abundances across samples to be correlated.

3.2 Grouping of features by abundance correlation across samples

Features representing ions of the same compound are expected to have correlated feature values (intensities, abundances) across samples. groupFeatures with AbundanceSimilarityParam allows to group features with similar abundance patterns. This approach performs a pairwise similarity calculation and puts features with a similarity >= threshold into the same feature group. By calling this function on the previous result object the initial feature groups will be refined, by eventually splitting them based on the (missing) correlation of feature abundances.

We below evaluate the correlation between individual features indicating also the previously defined feature groups.

library(pheatmap)
fvals <- log2(assay(se))

cormat <- cor(t(fvals), use = "pairwise.complete.obs")
ann <- data.frame(fgroup = featureGroups(se))
rownames(ann) <- rownames(cormat)

res <- pheatmap(cormat, annotation_row = ann, cluster_rows = TRUE,
                cluster_cols = TRUE)