1 The scp data framework

Our data structure is relying on two curated data classes: QFeatures (Gatto (2020)) and SingleCellExperiment ((???)). QFeatures is dedicated to the manipulation and processing of MS-based quantitative data. It explicitly records the successive steps to allow users to navigate up and down the different MS levels. SingleCellExperiment is another class designed as an efficient data container that serves as an interface to state-of-the-art methods and algorithms for single-cell data. Our framework combines the two classes to inherit from their respective advantages.

Because mass spectrometry (MS)-based single-cell proteomics (SCP) only captures the proteome of between one and a few tens of single-cells in a single run, the data is usually acquired across many MS batches. Therefore, the data for each run should conceptually be stored in its own container, that we here call an assay. The expected input for working with the scp package is quantification data of peptide to spectrum matches (PSM). These data can then be processed to reconstruct peptide and protein data. The links between related features across different assays are stored to facilitate manipulation and visualization of of PSM, peptide and protein data. This is conceptually shown below.

The `scp` framework relies on `SingleCellExperiment` and `QFeatures` objects

Figure 1: The scp framework relies on SingleCellExperiment and QFeatures objects

There are two input tables required for starting an analysis with scp:

  1. The input table
  2. The sample table

2 Input table

The input table is generated after the identification and quantification of the MS spectra by a pre-processing software such as MaxQuant, ProteomeDiscoverer or MSFragger (the list of available software is actually much longer). We will here use as an example a data table that has been generated by MaxQuant. The table is available from the scp package and is called mqScpData (for MaxQuant generated SCP data).

library(scp)
data("mqScpData")
dim(mqScpData)
#> [1] 1361  149

In this toy example, there are 1361 rows corresponding to features (quantified PSMs) and 149 columns corresponding to different data fields recorded by MaxQuant during the processing of the MS spectra. There are three types of columns:

  • Feature quantification: 1 to n (depending on technology)
  • Feature annotations: e.g. peptide sequence, ion charge, protein name
  • Acquisition annotations: e.g. file name