This vignette illustrates how to read and input your own data to the
SIAMCAT package. We will cover reading in text files from the disk,
formatting them and using them to create an object of
siamcat-class is the centerpiece of the package. All of the input data
and result are stored inside of it. The structure of the object is described
below in the siamcat-class object section.
Generally, there are three types of input for
The features should be a
data.frame, or an
organized as follows:
features (in rows) x samples (in columns).
Please note that
SIAMCATis supposed to work with relative abundances. Other types of data (e.g. counts) will also work, but not all functions of the package will result in meaningful outputs.
An example of a typical feature file is attached to the
containing data from a publication investigating the microbiome in colorectal
cancer (CRC) patients and controls (the study can be found here:
Zeller et al). The metagenomics
data were processed with the MOCAT pipeline, returning
taxonomic profiles on the species levels (
library(SIAMCAT) fn.in.feat <- system.file( "extdata", "feat_crc_zeller_msb_mocat_specI.tsv", package = "SIAMCAT" )
One way to load such data into
R could be the use of
(Beware of the defaults in R! They are not always useful…)
feat <- read.table(fn.in.feat, sep='\t', header=TRUE, quote='', stringsAsFactors = FALSE, check.names = FALSE) # look at some features feat[110:114, 1:2]
## CCIS27304052ST-3-0 CCIS15794887ST-4-0 ## Bacteroides caccae [h:1096] 1.557937e-03 1.761949e-03 ## Bacteroides eggerthii [h:1097] 2.734527e-05 4.146882e-05 ## Bacteroides stercoris [h:1098] 1.173786e-03 2.475838e-03 ## Bacteroides clarus [h:1099] 4.830533e-04 4.589747e-06 ## Methanohalophilus mahii [h:11] 0.000000e+00 0.000000e+00
The metadata should be either a matrix or a
samples (in rows) x metadata (in columns):
rownames of the metadata should match the
colnames of the feature
Again, an example of such a file is attached to the
SIAMCAT package, taken
from the same study:
fn.in.meta <- system.file( "extdata", "num_metadata_crc_zeller_msb_mocat_specI.tsv", package = "SIAMCAT" )
Also here, the
read.table can be used to load the data into
meta <- read.table(fn.in.meta, sep='\t', header=TRUE, quote='', stringsAsFactors = FALSE, check.names = FALSE) head(meta)
## age gender bmi diagnosis localization crc_stage fobt ## CCIS27304052ST-3-0 52 1 20 0 NA 0 0 ## CCIS15794887ST-4-0 37 1 18 0 NA 0 0 ## CCIS74726977ST-3-0 66 2 24 1 NA 0 0 ## CCIS16561622ST-4-0 54 2 26 0 NA 0 0 ## CCIS79210440ST-3-0 65 2 30 0 NA 0 1 ## CCIS82507866ST-3-0 57 2 24 0 NA 0 0 ## wif_test ## CCIS27304052ST-3-0 0 ## CCIS15794887ST-4-0 0 ## CCIS74726977ST-3-0 NA ## CCIS16561622ST-4-0 0 ## CCIS79210440ST-3-0 0 ## CCIS82507866ST-3-0 0
Finally, the label can come in different three different flavours:
Named vector: A named vector containing information about cases and
controls. The names of the vector should match the
rownames of the metadata
colnames of the feature data.
The label can contain either the information about cases and controls either
Metadata column: You can provide the name of a column in the metadata for the creation of the label. See below for an example.
SIAMCAT has a function called
read.label, which will
create a label object from a label file. The file should be organized as
#BINARY:1=[label for cases];-1=[label for controls]
1for each case and
-1for each control.
An example file is attached to the package again, if you want to have a look at it.
For our example dataset, we can create the label object out of the metadata
label <- create.label(meta=meta, label="diagnosis", case = 1, control=0)
When we later plot the results, it might be nicer to have names for the
different groups stored in the label object (instead of
0). We can
also supply them to the
label <- create.label(meta=meta, label="diagnosis", case = 1, control=0, p.lab = 'cancer', n.lab = 'healthy')
## Label used as case: ## 1 ## Label used as control: ## 0
## + finished create.label.from.metadata in 0 s
## healthy cancer ## -1 1
If you have no label information for your dataset, you can still create a
SIAMCATobject from your features alone. The
SIAMCATobject without label information will contain a
TESTlabel that can be used for making holdout predictions. Other functions, e.g. model training, will not work on such an object.
LEfSe is a tool for identification of associations between micriobial features and up to two metadata. LEfSe uses LDA (linear discriminant analysis).
LEfSe input file is a
.tsv file. The first few rows contain the metadata. The
following row contains sample names and the rest of the rows are occupied by
features. The first column holds the row names:
An example of such a file is attached to the
fn.in.lefse<- system.file( "extdata", "LEfSe_crc_zeller_msb_mocat_specI.tsv", package = "SIAMCAT" )
SIAMCAT has a dedicated function to read LEfSe format files. The
function will read in the input file and extract metadata and features:
meta.and.features <- read.lefse(fn.in.lefse, rows.meta = 1:6, row.samples = 7) meta <- meta.and.features$meta feat <- meta.and.features$feat
We can then create a label object from one of the columns of the meta object and
label <- create.label(meta=meta, label="label", case = "cancer")
## Label used as case: ## cancer ## Label used as control: ## healthy
## + finished create.label.from.metadata in 0.002 s
metagenomeSeq is an R package to determine differentially abundant features between multiple samples.
There are two ways to input data into metagenomeSeq:
SIAMCATjust like described in SIAMCAT input with
fn.in.feat <- system.file( "extdata", "CHK_NAME.otus.count.csv", package = "metagenomeSeq" ) feat <- read.table(fn.in.feat, sep='\t', header=TRUE, quote='', row.names = 1, stringsAsFactors = FALSE, check.names = FALSE )
BIOMformat file, that can be used in
SIAMCATas described in the following section
The BIOM format files can be added to
phyloseq. First the file
should be imported using the
import_biom. Then a
phyloseq object can be imported as a
siamcat object as descibed in the
siamcat object extends on the
phyloseq object. Therefore, creating
siamcat object from a
phyloseq object is really straightforward. This
can be done with the
siamcat constructor function. First, however, we need
to create a label object:
data("GlobalPatterns") ## phyloseq example data label <- create.label(meta=sample_data(GlobalPatterns), label = "SampleType", case = c("Freshwater", "Freshwater (creek)", "Ocean"))
## Label used as case: ## Freshwater,Freshwater (creek),Ocean ## Label used as control: ## rest
## + finished create.label.from.metadata in 0.006 s
# run the constructor function siamcat <- siamcat(phyloseq=GlobalPatterns, label=label)
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 26 sample(s).
## +++ checking sample number per class
## Data set has a limited number of training examples: ## rest 18 ## Case 8 ## Note that a dataset this small/skewed is not necessarily suitable for analysis in this pipeline.
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.151 s
siamcat-class is the centerpiece of the package. All of the is stored
inside of the object:
In the figure above, rectangles depict slots of the object and the class of
the object stored in the slot is given in the ovals. There are two
obligatory slots -phyloseq (containing the metadata as
the original features as
otu_table) and label - marked with thick borders.
siamcat object is constructed using the
siamcat() function. There are
two ways to initialize it:
Features: You can provide a feature
otu_table to the function (together with label and metadata information):
siamcat <- siamcat(feat=feat, label=label, meta=meta)
phyloseq: The alternative is to create a
siamcat object directly out
siamcat <- siamcat(phyloseq=phyloseq, label=label)
Please note that you have to provide either
phyloseq and that
you cannot provide both.
In order to explain the
siamcat object better we will show how each of the
slots is filled.
The phyloseq and label slots are obligatory.
phyloseq, which is described in the help file of the
phyloseqclass. Help can be accessed by typing into R console:
help('otu_table-class')- stores the original feature table. For
SIAMCAT, this slot can be accessed by
help('label-class')- that are automatically generated by the
phyloseq, label and orig_feat are filled when the
siamcat object is
first created with the constructor function.
Other slots are filled during the run of the