xcms 4.5.0
Package: xcms
Authors: Johannes Rainer
Modified: 2024-10-23 19:24:55.947541
Compiled: Tue Oct 29 20:08:16 2024
The xcms package provides the functionality to perform the preprocessing of LC-MS, GC-MS or LC-MS/MS data in which raw signals from mzML, mzXML or CDF files are processed into feature abundances. This preprocessing includes chromatographic peak detection, sample alignment and correspondence analysis.
The first version of the package was published in 2006 [1] and has since been updated and modernized in several rounds to better integrate it with other R-based packages for the analysis of untargeted metabolomics data. This includes version 3 of xcms, which used the MSnbase package for MS data representation [2]. The most recent update (xcms version 4) additionally enables preprocessing of MS data represented by the modern MsExperiment and Spectra packages. This provides even better integration with the RforMassSpectrometry R package ecosystem and simplifies, for example, compound annotation [3].
This document describes data import, exploration and preprocessing of a simple test LC-MS data set with xcms version >= 4. The same functions can be applied to the older MSnbase-based workflows (xcms version 3). Additional documents and tutorials covering other topics of untargeted metabolomics analysis are listed at the end of this document. There is also an xcms tutorial available with more examples and details.
xcms supports analysis of any LC-MS(/MS) data that can be imported with the Spectra package. Such data will typically be provided in (AIA/ANDI) NetCDF, mzXML or mzML format but can, through dedicated extensions to the Spectra package, also be imported from other sources, for example directly from raw data files in manufacturer-specific formats.
For demonstration purposes we analyze in this document a small subset of the data from [4], in which the metabolic consequences of the knock-out of the fatty acid amide hydrolase (FAAH) gene in mice were investigated. The raw data files (in NetCDF format) are provided through the faahKO data package. The data set consists of samples from the spinal cords of 6 knock-out and 6 wild-type mice. Each file contains data in centroid mode acquired in positive ion polarity from 200-600 m/z and 2500-4500 seconds. To speed up processing of this vignette we restrict the analysis to only 8 files.
Below we load all required packages, locate the raw CDF files within the
faahKO package and build a phenodata data.frame describing the experimental
setup. Generally, such data frames should contain all relevant experimental
variables and sample descriptions (including the names of the raw data files).
They are typically imported into R with the read.table() function (if the file
is in CSV or tab-delimited text format) or with functions from the readxl R
package (if it is in Excel format).
library(xcms)
library(faahKO)
library(RColorBrewer)
library(pander)
library(pheatmap)
library(MsExperiment)
## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
            recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
                                   replacement = "", fixed = TRUE),
                 sample_group = c(rep("KO", 4), rep("WT", 4)),
                 stringsAsFactors = FALSE)
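The phenodata table in this vignette is built directly in code. In a real project it would more commonly be read from a file, as described above. The following is a minimal sketch of that route using base R only. The file content, the temporary file and the column name raw_file are hypothetical and only serve to make the example runnable on its own.

```r
## Write a small, hypothetical tab-delimited sample table to a temporary
## file. In practice this file would be created outside of R.
pd_file <- tempfile(fileext = ".txt")
writeLines(c("sample_name\tsample_group\traw_file",
             "ko15\tKO\tko15.CDF",
             "wt15\tWT\twt15.CDF"),
           pd_file)

## Import the table; header = TRUE uses the first line as column names and
## sep = "\t" declares the file to be tab-delimited.
pd_from_file <- read.table(pd_file, header = TRUE, sep = "\t",
                           stringsAsFactors = FALSE)
str(pd_from_file)
```

For a comma-separated file, sep = "," (or the read.csv() wrapper) would be used instead.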
We next load our data using the readMsExperiment() function from the
MsExperiment package.
faahko <- readMsExperiment(spectraFiles = cdfs, sampleData = pd)
faahko
## Object of class MsExperiment
## Spectra: MS1 (10224)
## Experiment data: 8 sample(s)
## Sample data links:
## - spectra: 8 sample(s) to 10224 element(s).
The MS spectra data from our experiment is now available as a Spectra object
within faahko. Note that this MsExperiment container could, in addition to
spectra data, also contain other types of data or references to other files.
See the vignette of the MsExperiment package for more details. Also, when
loading data from mzML, mzXML or CDF files, by default only general spectra
data is loaded into memory while the actual peaks data, i.e. the m/z and
intensity values, are retrieved on-the-fly from the raw files only when needed
(this is similar to the MSnbase on-disk mode described in [2]). This keeps the
memory footprint low and hence allows analyzing also large experiments without
the need for high performance computing environments. Note that alternative
backends (and hence data representations) could be used for the Spectra object
within faahko, possibly with an even lower memory footprint or higher
performance. See the package vignette of the Spectra package or the
SpectraTutorials tutorial for more details on Spectra backends and how to
change between them.
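To illustrate what changing a backend can look like, the sketch below loads a single faahKO file with the on-disk MsBackendMzR backend and then copies it into the in-memory MsBackendMemory backend using setBackend() from the Spectra package. It deliberately works on a standalone Spectra object (the names fl, sps and sps_mem are chosen for this example) rather than on faahko, and it is one possible approach, not a recommendation for this data set.

```r
library(Spectra)
library(faahKO)

## Load a single faahKO file with the on-disk MsBackendMzR backend
## (this requires the mzR package to be installed).
fl <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
          recursive = TRUE)[1]
sps <- Spectra(fl, source = MsBackendMzR())

## Copy all data, including the peaks data, into memory.
sps_mem <- setBackend(sps, MsBackendMemory())

## Compare the in-memory size of the two representations.
print(object.size(sps), units = "MB")
print(object.size(sps_mem), units = "MB")
```

The in-memory copy avoids repeated file access at the cost of a larger object, so which backend is preferable depends on the size of the experiment and the available memory.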
The MsExperiment object is a simple and flexible container for MS
experiments. The raw MS data is stored as a Spectra object that can be
accessed through the spectra() function.
spectra(faahko)
## MSn data (Spectra) with 10224 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 2501.38 1
## 2 1 2502.94 2
## 3 1 2504.51 3
## 4 1 2506.07 4
## 5 1 2507.64 5
## ... ... ... ...
## 10220 1 4493.56 1274
## 10221 1 4495.13 1275
## 10222 1 4496.69 1276
## 10223 1 4498.26 1277
## 10224 1 4499.82 1278
## ... 33 more variables/columns.
##
## file(s):
## ko15.CDF
## ko16.CDF
## ko21.CDF
## ... 5 more files
All spectra are organized sequentially (i.e., not by file), but the fromFile()
function returns for each spectrum the index of the data file it belongs to.
Below we simply count the number of spectra per file.
table(fromFile(faahko))
##
## 1 2 3 4 5 6 7 8
## 1278 1278 1278 1278 1278 1278 1278 1278
Information on samples can be retrieved through the sampleData() function.
sampleData(faahko)
## DataFrame with 8 rows and 3 columns
## sample_name sample_group spectraOrigin
## <character> <character> <character>
## 1 ko15 KO /home/bioc...
## 2 ko16 KO /home/bioc...
## 3 ko21 KO /home/bioc...
## 4 ko22 KO /home/bioc...
## 5 wt15 WT /home/bioc...
## 6 wt16 WT /home/bioc...
## 7 wt21 WT /home/bioc...
## 8 wt22 WT /home/bioc...
Each row in this DataFrame represents one sample (input file). Using [ it is
possible to subset a MsExperiment object by sample. Below we subset faahko to
the 3rd sample (file) and access its spectra and sample data.
faahko_3 <- faahko[3]
spectra(faahko_3)
## MSn data (Spectra) with 1278 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 2501.38 1
## 2 1 2502.94 2
## 3 1 2504.51 3
## 4 1 2506.07 4
## 5 1 2507.64 5
## ... ... ... ...
## 1274 1 4493.56 1274
## 1275 1 4495.13 1275
## 1276 1 4496.69 1276
## 1277 1 4498.26 1277
## 1278 1 4499.82 1278
## ... 33 more variables/columns.
##
## file(s):
## ko21.CDF
sampleData(faahko_3)
## DataFrame with 1 row and 3 columns
## sample_name sample_group spectraOrigin
## <character> <character> <character>
## 1 ko21 KO /home/bioc...
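Subsetting by position can be combined with the sample annotations, for example to select all samples of one group. The sketch below shows one way to do this. So that it runs on its own, it rebuilds a small MsExperiment from four faahKO files (the object names mse, mse_ko and ko_idx are chosen for this example) and derives the group from the file name, which is an assumption about the faahKO file naming (ko*.CDF and wt*.CDF).

```r
library(MsExperiment)
library(faahKO)

## Rebuild a small experiment (2 KO and 2 WT files) for this example.
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
            recursive = TRUE)[c(1, 2, 7, 8)]
pd <- data.frame(sample_name = sub(".CDF", "", basename(cdfs), fixed = TRUE),
                 ## Assumption: the first two letters of the file name
                 ## encode the sample group ("ko" or "wt").
                 sample_group = toupper(substr(basename(cdfs), 1, 2)))
mse <- readMsExperiment(spectraFiles = cdfs, sampleData = pd)

## Determine the indices of the KO samples from the sample annotations
## and subset the experiment to these samples.
ko_idx <- which(sampleData(mse)$sample_group == "KO")
mse_ko <- mse[ko_idx]
sampleData(mse_ko)$sample_name
```

Selecting by annotation rather than by a fixed index keeps the code valid if the order or number of input files changes.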
As a first evaluation of the data we plot below the base peak chromatogram (BPC)
for each file in our experiment. We use the chromatogram() method and set
aggregationFun to "max" to return for each spectrum the maximal intensity
and hence create the BPC from the raw data. To create a total ion chromatogram
we could set aggregationFun to "sum".
## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(faahko, aggregationFun = "max")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")
## Plot all chromatograms.
plot(bpis, col = group_colors[sampleData(faahko)$sample_group])
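For comparison, a total ion chromatogram (TIC) can be extracted in the same way with aggregationFun = "sum". The sketch below does this for two files only, to keep it fast and runnable on its own. The object names mse, tic and bpc are chosen for this example and the two files are an arbitrary selection.

```r
library(xcms)
library(MsExperiment)
library(faahKO)

## Build a small experiment from two faahKO files.
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
            recursive = TRUE)[c(1, 7)]
pd <- data.frame(sample_name = sub(".CDF", "", basename(cdfs), fixed = TRUE))
mse <- readMsExperiment(spectraFiles = cdfs, sampleData = pd)

## Sum all intensities per spectrum for the TIC and take the maximum
## intensity per spectrum for the BPC.
tic <- chromatogram(mse, aggregationFun = "sum")
bpc <- chromatogram(mse, aggregationFun = "max")

## For each spectrum the summed intensity is at least as large as the
## maximal intensity; check this for the first sample.
all(intensity(tic[1, 1]) >= intensity(bpc[1, 1]), na.rm = TRUE)

## Plot the TICs of both files.
plot(tic)
```

The result contains one row (here the full m/z and retention time range) and one column per sample, so tic[1, 1] extracts the chromatogram of the first file.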