1 The scp package

The scp package is used to process and analyse mass spectrometry (MS)-based single-cell proteomics (SCP) data. The functions rely on a specific data structure that wraps QFeatures objects (Gatto (2020)) around SingleCellExperiment objects. This data structure can be seen as a set of Matryoshka dolls, where the SingleCellExperiment objects are small dolls contained in the bigger QFeatures doll.

The SingleCellExperiment class provides a dedicated framework for single-cell data. It serves as an interface to many cutting-edge methods for processing, visualizing and analysing single-cell data. More information about the SingleCellExperiment class and associated methods can be found in the OSCA book.

The QFeatures class is a data framework dedicated to manipulating and processing MS-based quantitative data. It preserves the relationship between the different levels of information: peptide to spectrum match (PSM) data, peptide data and protein data. The QFeatures package also provides an interface to many utility functions that streamline the processing of MS data. More information about MS data analysis tools can be found in the RforMassSpectrometry project.
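The nesting of the two classes can be illustrated with toy data. The following is a minimal sketch rather than the way scp builds its objects: the matrices, the assay names (run1, run2) and the feature names are invented for illustration, and it assumes the SingleCellExperiment and QFeatures packages are installed.

```r
library("SingleCellExperiment")
library("QFeatures")

## Toy quantification matrices: rows are features (PSMs), columns are samples.
## All names and values are invented for illustration.
m1 <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,
             dimnames = list(paste0("PSM", 1:3), c("run1_ch1", "run1_ch2")))
m2 <- matrix(c(1, 2, 3, 4), nrow = 2,
             dimnames = list(paste0("PSM", 4:5), c("run2_ch1", "run2_ch2")))

## The small dolls: one SingleCellExperiment per MS run
sce1 <- SingleCellExperiment(assays = list(m1))
sce2 <- SingleCellExperiment(assays = list(m2))

## The big doll: a QFeatures object that holds both
toy <- QFeatures(list(run1 = sce1, run2 = sce2))

names(toy)              # names of the contained assays
is(toy[["run1"]], "SingleCellExperiment")  # each assay is still an SCE
```

Extracting an assay with the double-bracket operator returns the inner SingleCellExperiment, so methods written for that class can be applied to it directly.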

Figure: scp relies on SingleCellExperiment and QFeatures objects.

Before running the vignette we need to load the scp package.

library("scp")

We also load ggplot2, magrittr and dplyr for convenient data manipulation and plotting.

library("ggplot2")
library("magrittr")
library("dplyr")

2 Before you start

This vignette will guide you through some common steps of mass spectrometry-based single-cell proteomics (SCP) data analysis. SCP is an emerging field and further research is required to develop a principled analysis workflow. Therefore, we do not guarantee that the steps presented here are the best steps for this type of data analysis. This vignette performs the steps that were described in the SCoPE2 landmark paper (Specht et al. (2021)) and that were reproduced in another work using the scp package. The replication on the full SCoPE2 dataset using scp is available in a dedicated vignette. We hope to convince the reader that, although the workflow is probably not optimal, scp has the full potential to perform standardized and principled data analysis. All functions presented here are comprehensively documented, highly modular, and can easily be extended with new algorithms. Suggestions, feature requests and bug reports are warmly welcome. Feel free to open an issue in the GitHub repository.

This workflow can be applied to any MS-based SCP data. The minimal requirement to follow this workflow is that the data should contain the following information:

  • Raw.file: field in both the feature data and the sample data that gives the names of MS acquisition runs or files.
  • Channel: field in the sample data that refers to columns in the quantification data and so links each sample to its MS channel (more details in another vignette).
  • SampleType: field in the sample data that provides the type of sample that is acquired (carrier, reference, single-cell,…). Only needed for multiplexing experiments.
  • Potential.contaminant: field in the feature data that marks contaminant peptides.
  • Reverse: field in the feature data that marks reverse peptides.
  • PIF: field in the feature data that provides spectral purity.
  • PEP or dart_PEP: field in the feature data that provides peptide posterior error probabilities.
  • Modified.sequence: field in the feature data that provides the peptide identifiers.
  • Leading.razor.protein: field in the feature data that provides the protein identifiers.
  • At least one field in the feature data that contains quantification values. In this case, there are 16 quantification columns named as Reporter.intensity. followed by an index (1 to 16).

Each required field is described in more detail in the corresponding section. The field names can be changed to more meaningful ones or adapted to match other output tables.
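Before starting, it can help to check that a table contains the fields listed above. The following is a minimal sketch in base R: the helper missingFields and the toy table are hypothetical and are not part of scp.

```r
## Hypothetical helper: return the required fields absent from a table
missingFields <- function(tab, required) {
    setdiff(required, colnames(tab))
}

## Feature data fields taken from the list above
featureFields <- c("Raw.file", "Potential.contaminant", "Reverse", "PIF",
                   "Modified.sequence", "Leading.razor.protein")

## Toy feature table that lacks the PIF column (invented values)
toyFeatures <- data.frame(
    Raw.file = "runA",
    Potential.contaminant = "",
    Reverse = "",
    Modified.sequence = "_PEPTIDEK_",
    Leading.razor.protein = "P00001",
    PEP = 0.01,
    Reporter.intensity.1 = 1000
)

## Which required fields are missing? Here: "PIF"
missingFields(toyFeatures, featureFields)

## Either PEP or dart_PEP must be present
hasPEP <- any(c("PEP", "dart_PEP") %in% colnames(toyFeatures))

## At least one quantification column must be present
hasQuant <- any(grepl("^Reporter\\.intensity\\.", colnames(toyFeatures)))
```

If the fields in your table have different names, adapt the character vectors accordingly.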

3 Read in SCP data

The first step is to read in the PSM quantification table generated by, for example, MaxQuant (Tyanova, Temu, and Cox (2016)). We created a small example data set by subsetting the MaxQuant evidence.txt table provided in the SCoPE2 landmark paper (Specht et al. (2021)). The mqScpData table is a typical example of what you would get after reading in a CSV file using read.csv or read.table. See ?mqScpData for more information about the table content.

data("mqScpData")
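For your own data, a comparable table is obtained by reading the search engine output from disk. The following is a minimal sketch in base R: a small temporary file with invented rows stands in for a real evidence.txt, and in practice you would pass the path to your own file instead.

```r
## Write a tiny tab-delimited file that stands in for a MaxQuant evidence.txt
## (the rows and values are invented for illustration)
toyFile <- tempfile(fileext = ".txt")
writeLines(c("Raw.file\tModified.sequence\tReporter.intensity.1",
             "runA\t_PEPTIDEK_\t1200.5",
             "runB\t_ANOTHERK_\t830.0"),
           toyFile)

## read.delim() is a wrapper around read.table() with defaults suited to
## tab-separated files that have a header line
evidence <- read.delim(toyFile, stringsAsFactors = FALSE)

## One row per PSM, one column per field
str(evidence)
```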

We also provide an example of a sample annotation table that provides useful information about the samples that are present in the example data. See ?sampleAnnotation for more information about the table content.

data("sampleAnnotation")

Note that the SampleType field of the example sample data takes six values, which correspond to five kinds of channels that can be found in a TMT-based SCP data set:

table(sampleAnnotation$SampleType)
#> 
#>      Blank    Carrier Macrophage   Monocyte  Reference     Unused 
#>         19          3         20          5          3         14
  • The carrier channels (Carrier) contain 200 cell equivalents and are meant to boost the peptide identification rate.
  • The normalization channels (Reference) contain 5 cell equivalents and are used to partially correct for between-run variation.
  • The unused channels (Unused) are channels that are left empty due to isotopic cross-contamination by the carrier channel.
  • The negative control channels (Blank) contain samples that do not contain any cell but are processed as single-cell samples.
  • The single-cell sample channels contain the single-cell samples of interest, that are macrophage (Macrophage) or monocyte (Monocyte).
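The sample annotation can also be summarised per acquisition run. The following is a short sketch that assumes the scp package is installed and uses only the Raw.file and SampleType fields described earlier; the variable isSingleCell is introduced here for illustration.

```r
library("scp")
data("sampleAnnotation")

## Number of channels of each sample type within each MS run
table(sampleAnnotation$Raw.file, sampleAnnotation$SampleType)

## Flag the channels that hold the single cells of interest
isSingleCell <- sampleAnnotation$SampleType %in% c("Macrophage", "Monocyte")
sum(isSingleCell)
```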

Using readSCP, we combine both tables in a QFeatures object formatted as described above.

scp <- readSCP(featureData = mqScpData,
               colData = sampleAnnotation,
               channelCol = "Channel",
               batchCol = "Raw.file",
               removeEmptyCols = TRUE)
#> Loading data as a 'SingleCellExperiment' object
#> Splitting data based on 'Raw.file'
#> Formatting sample metadata (colData)
#> Formatting data as a 'QFeatures' object
scp
#> An instance of class QFeatures containing 4 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 395 rows and 11 columns 
#>  [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 487 rows and 11 columns 
#>  [3] 190321S_LCA10_X_FP97_blank_01: SingleCellExperiment with 109 rows and 11 columns 
#>  [4] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 370 rows and 16 columns

Note that the first three assays contain 11 columns, which correspond to the TMT-11 labels, and that the last assay contains 16 columns, which correspond to the TMT-16 labels.
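The object can be explored with accessors from QFeatures and SummarizedExperiment. The following is a minimal sketch that repeats the readSCP call so that it runs on its own. It assumes the argument names of the scp version used in this vignette, which may differ in other releases, and the selected columns are those listed in the requirements section.

```r
library("scp")
data("mqScpData")
data("sampleAnnotation")

## Same call as in the text; argument names may differ between scp versions
scp <- readSCP(featureData = mqScpData,
               colData = sampleAnnotation,
               channelCol = "Channel",
               batchCol = "Raw.file",
               removeEmptyCols = TRUE)

## Number of rows and columns of every assay
dims(scp)

## Sample annotations are stored once, in the outer QFeatures object
head(colData(scp)[, c("Raw.file", "SampleType")])

## Extract one assay (a SingleCellExperiment) by name
sce <- scp[["190222S_LCA9_X_FP94BM"]]

## Feature annotations and quantification values of that assay
head(rowData(sce)[, c("Modified.sequence", "Leading.razor.protein")])
assay(sce)[1:3, 1:3]
```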

Important: More details about the usage of readSCP and how to read your own data set are provided in the Load data using readSCP vignette.

Another way to get an overview of the scp object is to plot the QFeatures object. This will create a graph where each node is an assay and links between assays are denoted as edges.

plot(scp)