This vignette provides general background on machine learning (ML) methods and concepts, and their application to the analysis of spatial proteomics data in the *pRoloc* package. See the `pRoloc-tutorial` vignette for details about the package itself.

pRoloc 1.45.1

For a general practical introduction to *pRoloc*, readers
are referred to the tutorial, available using
`vignette("pRoloc-tutorial", package = "pRoloc")`. The following
document provides an overview of the algorithms available in the
package. The respective sections describe unsupervised machine learning
(USML), supervised machine learning (SML), semi-supervised machine
learning (SSML) as implemented in the novelty detection algorithm, and
transfer learning.

We provide 144 test data sets in the
*pRolocdata* package that can be readily used with
*pRoloc*. The data sets can be listed with the *pRolocdata* function
and loaded with the *data* function. Each data set, including its
origin, is individually documented.

The data sets are distributed as *MSnSet* instances. Briefly,
these are dedicated containers for quantitation data as well as
feature and sample meta-data. More details about *MSnSet*s are
available in the *pRoloc* tutorial and in the
*MSnbase* package, which defines the class.

```
library("pRolocdata")
data(tan2009r1)
tan2009r1
```

```
## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: X114 X115 X116 X117
## varLabels: Fractions
## varMetadata: labelDescription
## featureData
## featureNames: P20353 P53501 ... P07909 (888 total)
## fvarLabels: FBgn Protein.ID ... markers.tl (16 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 19317464
## Annotation:
## - - - Processing information - - -
## Added markers from 'mrk' marker vector. Thu Jul 16 22:53:44 2015
## MSnbase version: 1.17.12
```

While our primary biological domain is quantitative proteomics, with
special emphasis on spatial proteomics, the underlying class
infrastructure on which *pRoloc* relies, implemented in the
Bioconductor *MSnbase* package, enables conversion from and to
transcriptomics data, in particular microarray data available as
*ExpressionSet* objects, using the *as* coercion
methods (see the *MSnSet* section in the
`MSnbase-development` vignette). As a result, it is
straightforward to apply the methods summarised here and detailed in
the other *pRoloc* vignettes to these other data structures.

Unsupervised machine learning refers to clustering, i.e. finding structure in a quantitative, generally multi-dimensional data set of unlabelled data.

Currently, unsupervised clustering facilities are available through
the *plot2D* function and the *MLInterfaces*
package (Carey et al., n.d.). The former takes an *MSnSet*
instance and represents the data on a scatter plot along the first two
principal components. Arbitrary feature meta-data can be represented
using different colours and point characters. The reader is referred
to the manual page available through *?plot2D* for more
details and examples.
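As an illustration of the scatter plot described above, the following sketch calls *plot2D* on the *tan2009r1* data. It assumes that *pRoloc* and *pRolocdata* are installed and that the feature variable used for colouring is named `markers`; the legend position and size are arbitrary choices.

```r
library("pRoloc")
library("pRolocdata")
data(tan2009r1)

# Scatter plot along the first two principal components,
# coloured according to the "markers" feature variable (assumed name)
pcs <- plot2D(tan2009r1, fcol = "markers")
addLegend(tan2009r1, fcol = "markers", where = "bottomleft", cex = 0.7)

# plot2D invisibly returns the plotted coordinates
head(pcs)
```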

*pRoloc* also implements an *MLearn* method for
*MSnSet* instances, allowing the relevant *MLInterfaces*
infrastructure to be used with the organelle proteomics framework. Although
*MLearn* provides a common interface to unsupervised and numerous supervised
algorithms, we refer to the *pRoloc* tutorial for its application
to several clustering algorithms.

**Note** Current development efforts in terms of clustering are
described on the *Clustering infrastructure* wiki page
(https://github.com/lgatto/pRoloc/wiki/Clustering-infrastructure)
and will be incorporated in a future version of the package.

Supervised machine learning refers to a broad family of classification algorithms. The algorithms learn from a modest set of labelled data points called the training data. Each training example consists of a pair of inputs: the actual data, generally represented as a vector of numbers, and a class label, representing membership to exactly one of multiple possible classes. When there are only two possible classes, one refers to binary classification. The training data is used to construct a model that can be used to classify new, unlabelled examples. The model takes the numeric vectors of the unlabelled data points and returns, for each of these inputs, the corresponding mapped class.

**k-nearest neighbour (KNN)** Function *knn* from package
*class*. For each row of the test set, the *k* nearest
(in Euclidean distance) training set vectors are found, and the
classification is decided by majority vote among these *k* neighbours, with
ties broken at random. If there are ties for the *k*th nearest
vector, all candidates are included in the vote. This is a simple
algorithm that is often used as a baseline classifier.
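To make the train/predict pattern concrete, here is a minimal sketch that calls *knn* from *class* directly. It uses the built-in `iris` data as a stand-in for a labelled marker set, not spatial proteomics data; the split size and `k = 5` are arbitrary assumptions.

```r
library(class)  # recommended package shipped with standard R installations

set.seed(1)
# iris stands in for a marker set: 4 numeric columns and one class label
train_idx <- sample(nrow(iris), 100)
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Majority vote among the k = 5 nearest training vectors;
# prob = TRUE also records the proportion of votes for the winning class
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5, prob = TRUE)

accuracy <- mean(pred == test_y)      # proportion of correct predictions
vote_share <- attr(pred, "prob")      # one value per test row
table(predicted = pred, observed = test_y)
```

The vote proportions give a rough indication of how clear-cut each assignment is, which is useful when deciding which predictions to trust.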

**Partial least square DA (PLS-DA)** Function *plsda* from package
. Partial least square discriminant analysis is used to
fit a standard PLS model for classification.

**Support vector machine (SVM)** A support vector machine constructs a
hyperplane (or a set of hyperplanes for multi-class problems), which
is then used for classification. The best separation is defined as
the hyperplane that has the largest distance (the margin) to the
nearest data points of any class, which also reduces the
generalisation error of the classifier. To permit linear separation of the
classes, the data is transformed using a *kernel function* into a
high-dimensional space.

*pRoloc* makes use of the functions *svm* from
package and *ksvm* from .

**Artificial neural network (ANN)** Function *nnet* from package
. Fits a single-hidden-layer neural network, possibly
with skip-layer connections.

**Naive Bayes (NB)** Function *naiveBayes* from package
. Naive Bayes classifier that computes the conditional
a posteriori probabilities of a categorical class variable given
independent predictor variables using the Bayes rule. Assumes
independence of the predictor variables, and Gaussian distribution
(given the target class) of metric predictors.

**Random Forest (RF)** Function *randomForest* from package
.

**Chi-square (\(\chi^2\))** The canonical protein correlation profile
(PCP) method assigns a new feature to a class based on the squared
differences between its profile and that of a labelled marker. In
(Andersen et al. 2003), \(\chi^2\) is defined as \(\chi^{2} = \frac{\sum_{i=1}^{n} (x_i - m_i)^{2}}{n}\), whereas (Wiese et al. 2007) divide
each squared difference by the value of the reference feature
in each fraction, i.e. \(\chi^{2} = \sum_{i=1}^{n}\frac{(x_i - m_i)^{2}}{m_i}\), where \(x_i\) is the normalised intensity of feature *x* in
fraction *i*, \(m_i\) is the normalised intensity of marker *m* in
fraction *i* and *n* is the number of fractions available. We will use
the former definition.
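The two definitions can be written directly in base R. This is a sketch for illustration only, not the *pRoloc* implementation; the function names and the two four-fraction profiles are hypothetical.

```r
# Andersen et al. (2003): sum of squared differences divided by
# the number of fractions n
chi2_andersen <- function(x, m) sum((x - m)^2) / length(x)

# Wiese et al. (2007): each squared difference is divided by the
# marker intensity in that fraction
chi2_wiese <- function(x, m) sum((x - m)^2 / m)

m <- c(0.1, 0.2, 0.3, 0.4)  # hypothetical normalised marker profile
x <- c(0.2, 0.2, 0.2, 0.4)  # hypothetical normalised feature profile

chi2_andersen(x, m)  # 0.02 / 4 = 0.005
chi2_wiese(x, m)     # 0.01 / 0.1 + 0.01 / 0.3, approximately 0.1333
```

Because the second definition weights each fraction by the marker intensity, differences in low-intensity fractions contribute more to the score than under the first definition.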

**PerTurbo** From (Courty, Burger, and Laurent 2011): PerTurbo, an original, non-parametric
and efficient classification method is presented here. In our
framework, the manifold of each class is characterised by its
Laplace-Beltrami operator, which is evaluated with classical methods
involving the graph Laplacian. The classification criterion is
established thanks to a measure of the magnitude of the spectrum
perturbation of this operator. The first experiments show good
performances against classical algorithms of the
state-of-the-art. Moreover, from this measure is derived an efficient
policy to design sampling queries in a context of active
learning. Performances collected over toy examples and real world
datasets assess the qualities of this strategy.

The PerTurbo implementation comes from the *pRoloc*
package.

When applying any of the above classification
algorithms, it is essential to set the algorithm parameters wisely, as these can have
an important effect on the classification. Such parameters are, for
example, the width *sigma* of the Radial Basis Function (Gaussian
kernel) \(\exp(-\sigma \| x - x' \|^2 )\) and the *cost* (slack)
parameter (controlling the tolerance to mis-classification) of our SVM
classifier. The number of neighbours *k* of the KNN classifier is
equally important, as will be discussed in this section.
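To give a feel for the role of *sigma*, the sketch below evaluates the Gaussian kernel for one pair of profiles over a range of values. The function name `rbf_kernel`, the two profiles and the grid of *sigma* values are hypothetical and chosen for illustration only.

```r
# Gaussian (RBF) kernel: exp(-sigma * ||x - y||^2)
rbf_kernel <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))

x <- c(0.1, 0.4, 0.5)  # hypothetical profiles
y <- c(0.2, 0.3, 0.5)

# Kernel value for the same pair of points under increasing sigma
sigmas <- c(0.01, 0.1, 1, 10, 100)
similarity <- sapply(sigmas, function(s) rbf_kernel(x, y, s))
data.frame(sigma = sigmas, similarity = similarity)
```

For a fixed pair of distinct points the kernel value decreases as *sigma* grows, so large values make each training point influence only its immediate neighbourhood, while small values produce smoother decision boundaries.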

The next figure illustrates the effect of different
choices of \(k\) using organelle proteomics data from
(Dunkley et al. 2006) (*dunkley2006* from
*pRolocdata*). As highlighted in the squared region, we can
see that using a low \(k\) (*k = 1* on the left) will result in very
specific classification boundaries that precisely follow the contour
of our marker set, as opposed to a higher number of neighbours (*k = 8*
on the right). While one could be tempted to believe that
*optimised* classification boundaries are preferable, it is
essential to remember that these boundaries are specific to the marker
set used to construct them, and there is no reason to
expect these regions to faithfully separate any new data points, in
particular proteins that we wish to classify to an organelle. In other
words, the highly specific *k = 1* classification boundaries are
*over-fitted* to the marker set and lack
generalisation to new instances. We will demonstrate this using
simulated data taken from (James et al. 2013) and show what detrimental
effect *over-fitting* has on new data.
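Independently of the James et al. data, the gap between performance on the training set and on new data can be sketched with two overlapping Gaussian classes. Everything below is an assumption made for illustration: the simulated data, the class centres, the sample sizes and the helper names `simulate` and `accuracy`.

```r
library(class)

set.seed(42)
# Two overlapping Gaussian classes in two dimensions (simulated data)
simulate <- function(n) {
  labels <- factor(rep(c("A", "B"), each = n))
  centres <- ifelse(labels == "A", 0, 1.5)
  data.frame(x1 = rnorm(2 * n, mean = centres),
             x2 = rnorm(2 * n, mean = centres),
             class = labels)
}
train <- simulate(100)  # plays the role of the marker set
test  <- simulate(100)  # plays the role of new, unseen data

# Proportion of correct predictions on newdata for a given k,
# always using the same training set
accuracy <- function(newdata, k) {
  pred <- knn(train[, 1:2], newdata[, 1:2], cl = train$class, k = k)
  mean(pred == newdata$class)
}

# Rows: evaluation on training and test data; columns: k = 1 and k = 8
res <- sapply(c(k1 = 1, k8 = 8), function(k)
  c(train = accuracy(train, k), test = accuracy(test, k)))
res
```

With *k = 1*, every training point is its own nearest neighbour, so the training accuracy is perfect by construction and says nothing about generalisation; the accuracy on the independent test set is the figure to compare across values of *k*.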