This vignette provides instructions on how to load, convert, and save DNA
methylation (DNAm) array datasets using the
HDF5Array R packages.
These tasks show how to make and work with
SummarizedExperiment objects, which
are used by DNAm analysis packages such as
ChAMP. More in-depth
discussion of DNAm data types and storage formats can be found in the
recountmethylation User’s Guide.
For demonstration and development purposes, we can load the example
data provided in the
minfiData package. This load example data generated for
the HM450k array platform.
rg.hm450k <- get(data("RGsetEx"))
minfiDataEPIC package for similar small example datasets generated
from the EPIC array platform.
We can convert between HM450K and EPIC array platforms using
rg.epic <- convertArray(rg.hm450k, "IlluminaHumanMethylationEPIC")
This makes a new digital array object that mimics data generated from the EPIC array, which can be convenient for harmonizing samples across platforms or passing data to functions written for a particular platform.
To convert between and
RGChannelSet and other classes, we need to call the
preprocessRaw() (or some other preprocessing function which returns
mapToGenome(), which will map a
object to the genome coordinates. The latter is useful for genome or annotation-based queries and it may be required by certain normalization and analysis functions.
We can convert the object
rg to a
ms.hm450k <- preprocessRaw(rg) # make MethylSet ms.hm450k <- mapToGenome(ms.hm450k) # make GenomicMethylSet
We can also make new
SummarizedExperiment objects manually by specifying the
different fields for assays, metadata, experiment details, etc. This is useful
when, for instance, we only have a matrix of DNAm signals but we need to pass
SummarizedExperiment-type object to an analysis function.
We can make a non-normalized
GenomicRatioSet from the object
gr.hm450k <- GenomicMethylSet(gr = granges(ms.hm450k), Meth = getMeth(ms.hm450k), Unmeth = getUnmeth(ms.hm450k), annotation = annotation(ms.hm450k))
For details about similar constructor functions for different
classes, you can consult the function documentation using
We can convert a standard matrix-backed
SummarizedExperiment object, such as
shown above, into a
DelayedArray-backed object by first saving with
HDF5Array package. This will recast and store the new object in a new directory.
saveHDF5SummarizedExperiment(gr.hm450k, dir = "gr_h5se_new")
We load the new
DelayedArray-backed data with
and this should realize a subset of the data in memory.
gr.h5se <- loadHDF5SummarizedExperiment(dir = "gr_h5se_new")
Now the table returned from running
The compiled DNAm array data in
recountmethylation covers three formats (
gr) and 2 storage formats (HDF5 and
DelayedArray). In general, R
users will want to use the
DelayedArray-backed object formats. These appear as
h5se in their name containing a large assays file and a smaller
metadata file. Users of Python and other programming languages besides R will
likely prefer to use the HDF5 database files, which have
h5 in their names.
We can extract flat tables of the assays or sample metadata from either matrix-backed
SummarizedExperiment objects. The specific functions
will depend on the specific data format, but in general you can think of the
assays data such as the Red channel signal or Beta-values matrix as the main
dataset of CpG probes (rows) and samples (columns).
For this main DNAm signals dataset, the
rowData and annotations obtainable from
getAnnotation contain the probe-level metadata including manifest-based annotations
and genome locations. By contrast, the
pData matrix contains the
sample-level metadata, which may include information such as demographic information,
tissue type, disease condition, and more.
We can extract the individual flat files and save these as R binary files as follows:
# get flat files m.beta <- getBeta(gr.h5se) anno <- as.data.frame(getAnnotation(gr.h5se)) coldata <- as.data.frame(colData(gr.h5se)) # save flat files save(m.beta, file = "mbeta_new.rda") save(anno, file = "anno_new.rda") save(coldata, file = "coldata_new.rda")
To instead write flat files to new tables such as
.csv files, we can use one of
write.table(m.beta, file = "mbeta_new.txt") write.csv(m.beta, file = "mbeta_new.csv") data.table::fwrite(m.beta, file = "mbeta_new.txt")
It will generally take longer to write to a new flat table (e.g. file with
.txt extension) than to a binary file (e.g. file with
.rda extension), and
the time difference will increase with the size of the dataset being saved. Functions
fwrite from the
data.table package work similar to the base
R functions including
write.csv, but they can be many times faster.
They are recommended when working with larger datasets, such as is likely encountered
when working with the
recountmethylation data compilations.
SummarizedExperiment objects, such as
GenomicRatioSets, can all be saved like standard R objects.
save(rg, file = "rg_new.rda") save(gm, file = "gm_new.rda") save(gr, file = "gr_new.rda")
When working with full-sized compilation files, you may find the dataset directory
h5se of interest already exists and you simply want to update it. These
update operations could include subsetting the samples or probes in the compilation,
or adding new metadata columns. In these cases, repeatedly saving the
DelayedArray-backed datasets with
saveHDF5SummarizedExperiment() is very
time-consuming and not necessary. Instead, use
rg.h5se <- rg.h5se[seq(1000),] # subset the h5se object quickResaveHDF5SummarizedExperiment(rg.h5se) # rapidly update stored file
In general, you will only need to use
saveHDF5SummarizedExperiment() when saving
a brand new
We have seen how to load, convert, and save DNAm array datasets using functions
HDF5Array, with runnable examples using an example
For more in-depth discussion of the compilations, data classes, and storage formats, see: