Package: MsExperiment
Authors: Laurent Gatto [aut, cre] (https://orcid.org/0000-0002-1520-2268),
Johannes Rainer [aut] (https://orcid.org/0000-0002-6977-7147),
Sebastian Gibb [aut] (https://orcid.org/0000-0001-7406-4443)
Last modified: 2024-04-23 17:21:26.036832
Compiled: Wed May 1 18:20:15 2024
The goal of the MsExperiment package is to provide a container
for all data related to a mass spectrometry (MS) experiment. Also other
Bioconductor packages allow to represent MS experiment data (such as the
MSnbase package). The MsExperiment
however aims at being very
light-weight and flexible to accommodate all possible types of MS experiments
(proteomics, metabolomics, …) and all types of MS data representations
(chromatographic and spectral data, quantified features etc). In addition, it
allows to bundle additional files and data, such as annotations, within the
object.
In this vignette, we will describe how to create a MsExperiment
object and
populate it with various types of data.
library("MsExperiment")
We will also use the Spectra package to import MS data and thus load it here too.
library("Spectra")
The package can be installed with the BiocManager package. To
install BiocManager
use install.packages("BiocManager")
and, after that,
BiocManager::install("MsExperiment")
to install MsExperiment
which will install the package including all required dependencies.
We will use a small subset of the PXD022816 project (Morgenstern et al. (2020)). The acquisitions correspond to a Pierce Thermo HeLa digestion standard, diluted to 50ng/uL with 97:3 + 0.1% formic acid, and acquired on a QExactive instrument.
Below, we use the rpx package to access the project
from the PRIDE repository, and download files of interest. Note that
these will automatically be cached in the rpx
packages’ cache
directory.
library("rpx")
px <- PXDataset("PXD022816")
## Loading PXD022816 from cache.
px
## Project PXD022816 with 32 files
##
## Resource ID BFC324 in cache in /home/biocbuild/.cache/R/rpx.
## [1] 'QEP2LC6_HeLa_50ng_251120_01-calib.mzID.gz' ... [32] 'checksum.txt'
## Use 'pxfiles(.)' to see all files.
pxfiles(px)
## Project PXD022816 files (32):
## [local] QEP2LC6_HeLa_50ng_251120_01-calib.mzID.gz
## [local] QEP2LC6_HeLa_50ng_251120_01-calib.mzML
## [remote] QEP2LC6_HeLa_50ng_251120_01.raw
## [local] QEP2LC6_HeLa_50ng_251120_02-calib.mzID.gz
## [local] QEP2LC6_HeLa_50ng_251120_02-calib.mzML
## [remote] QEP2LC6_HeLa_50ng_251120_02.raw
## [remote] QEP2LC6_HeLa_50ng_251120_03-calib.mzID.gz
## [remote] QEP2LC6_HeLa_50ng_251120_03-calib.mzML
## [remote] QEP2LC6_HeLa_50ng_251120_03.raw
## [remote] QEP2LC6_HeLa_50ng_251120_04-calib.mzID.gz
## ...
The project provides the vendor raw files, the converted mzML files as well as the identification mzid files. Let’s download fractions 1 and 2 of the mzML files.
If you run these commands interactively and it’s the first time you
use pxget()
, you will be asked to create the rpx
cache directory -
you can safelfy answer yes. The files will then be downloaded. Next
time you want to get the same files, they will be loaded automatically
from cache.
(i <- grep(".+0[12].+mzML$", pxfiles(px), value = TRUE))
## Project PXD022816 files (32):
## [local] QEP2LC6_HeLa_50ng_251120_01-calib.mzID.gz
## [local] QEP2LC6_HeLa_50ng_251120_01-calib.mzML
## [remote] QEP2LC6_HeLa_50ng_251120_01.raw
## [local] QEP2LC6_HeLa_50ng_251120_02-calib.mzID.gz
## [local] QEP2LC6_HeLa_50ng_251120_02-calib.mzML
## [remote] QEP2LC6_HeLa_50ng_251120_02.raw
## [remote] QEP2LC6_HeLa_50ng_251120_03-calib.mzID.gz
## [remote] QEP2LC6_HeLa_50ng_251120_03-calib.mzML
## [remote] QEP2LC6_HeLa_50ng_251120_03.raw
## [remote] QEP2LC6_HeLa_50ng_251120_04-calib.mzID.gz
## ...
## [1] "QEP2LC6_HeLa_50ng_251120_01-calib.mzML"
## [2] "QEP2LC6_HeLa_50ng_251120_02-calib.mzML"
fls <- pxget(px, i)
## Loading QEP2LC6_HeLa_50ng_251120_01-calib.mzML from cache.
## Loading QEP2LC6_HeLa_50ng_251120_02-calib.mzML from cache.
fls
## [1] "/home/biocbuild/.cache/R/rpx/2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzML"
## [2] "/home/biocbuild/.cache/R/rpx/26932395605ab_QEP2LC6_HeLa_50ng_251120_02-calib.mzML"
Let’s start by creating an empty MsExperiment
object that we will
populate with different pieces of data as we proceed with the analysis
of our data.
msexp <- MsExperiment()
msexp
## Empty object of class MsExperiment
Let’s now start with our MS experiment management by saving the
relevant files in a dedicated MsExperimentFiles
object. In addition
to the mzML files, let’s also assume we have the human proteomics
fasta file ready. Later, when loading the raw data into R, we will
refer directly to the files in this MsExperimentFiles
object.
msfls <- MsExperimentFiles(mzmls = fls,
fasta = "homo_sapiens.fasta")
msfls
## MsExperimentFiles of length 2
## [["mzmls"]] 2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzML ...
## [["fasta"]] homo_sapiens.fasta
Let’s add these files to the main experiment management object:
experimentFiles(msexp) <- msfls
msexp
## Object of class MsExperiment
## Files: mzmls, fasta
The sampleData
slot is used to describe the overall experimental design of the
experiment. It can be used to specify the samples of the experiment and to
relate them to the files that are part of the experiment. There can be a
one-to-one link between a sample and a file, such as for example in label-free
approaches, or one-to-many, in labelled multiplexed approaches.
Here, we create a simple data frame with sample annotations that include the original file names and the respective fractions.
sampleData(msexp) <- DataFrame(
mzmls = basename(experimentFiles(msexp)[["mzmls"]]),
fractions = 1:2)
sampleData(msexp)
## DataFrame with 2 rows and 2 columns
## mzmls fractions
## <character> <integer>
## 1 2693231cf1... 1
## 2 2693239560... 2
We can now create a Spectra
object containing the raw data stored in
the mzML files. If you are not familiar with the Spectra
object,
please refer to the package
vignettes.
sp <- Spectra(experimentFiles(msexp)[["mzmls"]])
sp
## MSn data (Spectra) with 58907 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.177987 1
## 2 1 0.599870 2
## 3 1 0.978849 3
## 4 1 1.363217 4
## 5 1 1.742965 5
## ... ... ... ...
## 58903 2 4198.59 29328
## 58904 1 4198.74 29329
## 58905 1 4199.11 29330
## 58906 1 4199.49 29331
## 58907 1 4199.87 29332
## ... 33 more variables/columns.
##
## file(s):
## 2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
## 26932395605ab_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
We can now add this object to the main experiment management object:
spectra(msexp) <- sp
msexp
## Object of class MsExperiment
## Files: mzmls, fasta
## Spectra: MS1 (12983) MS2 (45924)
## Experiment data: 2 sample(s)
Let’s now assume we want to search the spectra in our mzML files
against the homo_sapiens.fasta
file. To do so, we would like to use
a search engine such as MSGF+, that is run using the command line and
generates mzid files.
The command to run MSGF+ would look like this (see the manual page for details):
java -jar /path/to/MSGFPlus.jar \
-s input.mzML \
-o output.mzid
-d proteins.fasta \
-t 20ppm \ ## precursor mass tolerance
-tda 1 \ ## search decoy database
-m 0 \ ## fragmentation method as written in the spectrum or CID if no info
-int 1 ## Orbitrap/FTICR/Lumos
We can easily build such a command for each of our input file:
mzids <- sub("mzML", "mzid", basename(experimentFiles(msexp)[["mzmls"]]))
paste0("java -jar /path/to/MSGFPlus.jar",
" -s ", experimentFiles(msexp)[["mzmls"]],
" -o ", mzids,
" -d ", experimentFiles(msexp)[["fasta"]],
" -t 20ppm",
" -m 0",
" int 1")
## [1] "java -jar /path/to/MSGFPlus.jar -s /home/biocbuild/.cache/R/rpx/2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzML -o 2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzid -d homo_sapiens.fasta -t 20ppm -m 0 int 1"
## [2] "java -jar /path/to/MSGFPlus.jar -s /home/biocbuild/.cache/R/rpx/26932395605ab_QEP2LC6_HeLa_50ng_251120_02-calib.mzML -o 26932395605ab_QEP2LC6_HeLa_50ng_251120_02-calib.mzid -d homo_sapiens.fasta -t 20ppm -m 0 int 1"
Here, for the sake of time and portability, we will not actually run MSGF+, but a simple shell script that will generate mzid files in a temporary R directory.
(output <- file.path(tempdir(), mzids))
## [1] "/tmp/Rtmpx26g10/2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzid"
## [2] "/tmp/Rtmpx26g10/26932395605ab_QEP2LC6_HeLa_50ng_251120_02-calib.mzid"
cmd <- paste("touch", output)
cmd
## [1] "touch /tmp/Rtmpx26g10/2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzid"
## [2] "touch /tmp/Rtmpx26g10/26932395605ab_QEP2LC6_HeLa_50ng_251120_02-calib.mzid"
The cmd
variable holds the two commands to be run on the command
line that will generate the new files. We can run each of these
commands with the system()
function.
sapply(cmd, system)
## touch /tmp/Rtmpx26g10/2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzid
## 0
## touch /tmp/Rtmpx26g10/26932395605ab_QEP2LC6_HeLa_50ng_251120_02-calib.mzid
## 0
Below, we add the names of the newly created files to our experiment:
experimentFiles(msexp)[["mzids"]] <- mzids
experimentFiles(msexp)
## MsExperimentFiles of length 3
## [["mzmls"]] 2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzML ...
## [["fasta"]] homo_sapiens.fasta
## [["mzids"]] 2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzid ...
msexp
## Object of class MsExperiment
## Files: mzmls, fasta, mzids
## Spectra: MS1 (12983) MS2 (45924)
## Experiment data: 2 sample(s)
We can also decide to store the commands that were used to generate
the mzid files in the experiment’s metadata slot. Here, we use the
convention to name that metadata item "mzmls_to_mzids"
to document
to input and output of these commands.
metadata(msexp)[["mzmls_to_mzids"]] <- cmd
metadata(msexp)
## $mzmls_to_mzids
## [1] "touch /tmp/Rtmpx26g10/2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzid"
## [2] "touch /tmp/Rtmpx26g10/26932395605ab_QEP2LC6_HeLa_50ng_251120_02-calib.mzid"
Finally, the existMsExperimentFiles()
can be used at any time to
check which of files that are associated with an experiment actually
exist:
existMsExperimentFiles(msexp)
## mzmls: 2 out of 2 exist(s)
## fasta: 0 out of 1 exist(s)
## mzids: 0 out of 2 exist(s)
The MsExperiment
object has been used to store files and data
pertaining to a mass spectrometry experiment. It is now possible to
save that object and reload it later to recover all data and metadata.
See also section Using MsExperiment
with MsBackendSql
below for an
alternative to load/restore defined MS experiments from an SQL database.
saveRDS(msexp, "msexp.rds")
rm(list = ls())
msexp <- readRDS("msexp.rds")
msexp
## Object of class MsExperiment
## Files: mzmls, fasta, mzids
## Spectra: MS1 (12983) MS2 (45924)
## Experiment data: 2 sample(s)
experimentFiles(msexp)
## MsExperimentFiles of length 3
## [["mzmls"]] 2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzML ...
## [["fasta"]] homo_sapiens.fasta
## [["mzids"]] 2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzid ...
We can access the raw data as long as the mzML files that were used to
generate it still exist in their original location, which is the case
here as they were saved in the rpx
cache directory.
sp <- spectra(msexp)
sp
## MSn data (Spectra) with 58907 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.177987 1
## 2 1 0.599870 2
## 3 1 0.978849 3
## 4 1 1.363217 4
## 5 1 1.742965 5
## ... ... ... ...
## 58903 2 4198.59 29328
## 58904 1 4198.74 29329
## 58905 1 4199.11 29330
## 58906 1 4199.49 29331
## 58907 1 4199.87 29332
## ... 33 more variables/columns.
##
## file(s):
## 2693231cf13b6f_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
## 26932395605ab_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
plotSpectra(sp[1000])
For some experiments and data analyses an explicit link between data, data files and respective samples is required. Such links enable an easy (and error-free) subset or re-ordering of a whole experiment by sample and would also simplify coloring and labeling of the data depending on the sample or of its variables or conditions.
Below we generate an MsExperiment
object for a simple experiment consisting of
a single sample measured in two different injections to the same LC-MS setup.
lmse <- MsExperiment()
sd <- DataFrame(sample_id = c("QC1", "QC2"),
sample_name = c("QC Pool", "QC Pool"),
injection_idx = c(1, 3))
sampleData(lmse) <- sd
We next add mzML files to the experiment for the sample that was measured. These
are available within the msdata
R package. We add also an additional
annotation file "internal_standards.txt"
to the experiment, which could be
e.g. a file with m/z and retention times of internal standards added to the
sample (note that such files don’t necessarily have to exist).
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
basename(fls)
## [1] "20171016_POOL_POS_1_105-134.mzML" "20171016_POOL_POS_3_105-134.mzML"
experimentFiles(lmse) <- MsExperimentFiles(
mzML_files = fls,
annotations = "internal_standards.txt")
Next we load the MS data from the mzML files as a Spectra
object and add them
to the experiment (see the vignette of the Spectra for
details on import and representation of MS data).
sps <- Spectra(fls, backend = MsBackendMzR())
spectra(lmse) <- sps
lmse
## Object of class MsExperiment
## Files: mzML_files, annotations
## Spectra: MS1 (1862)
## Experiment data: 2 sample(s)
At this stage we have thus sample annotations and MS data in our object, but no
explicit relationships between them. Without such linking between files and
samples any subsetting would only subset the sampleData
but not any of the
potentially associated files. We next use the linkSpectraData
function
to establish and define such relationships. First we link the experimental files
to the samples: we want to link the first mzML file in the element called
"mzML_file"
in the object’s experimentFiles
to the first row in sampleData
and the second file to the second row.
lmse <- linkSampleData(lmse, with = "experimentFiles.mzML_file",
sampleIndex = c(1, 2), withIndex = c(1, 2))
To define the link we have thus to specify with which element within our
MsExperiment
we want to link samples. This can be done with the parameter
with
that takes a single character
representing the name (address) of the
data element. The name is a combination of the name of the slot within the
MsExperiment
and the name of the element (or column) within that slot
separated by a "."
. Using with = "experimentFiles.mzML_file"
means we want
to link samples to values within the "mzML_file"
element of the object’s
experimentFiles
slot - in other words, we want to link samples to values in
experimentFiles(lmse)$mzML_file
. The indices of the rows (samples) in
sampleData
and the indices of the values in with
to which we want to link
the samples can be defined with sampleIndex
and withIndex
. In the example
above we used sampleIndex = c(1, 2)
and withIndex = c(1, 2)
, thus, we want
to link the first row in sampleData
to the first value in with
and the
second row to the second value. See also the section Linking sample data to
other experimental data in the documentation of MsExperiment
for more
information and details.
What happened internally by the call above is illustrated in the figure
below. The link is represented as a two-column integer matrix
with the indices
of the linked sample in the first and the indices of the associated elements in
the second columns (this matrix is essentially a
cbind(sampleIndex, withIndex)
).