MutationalPatterns 3.14.0
Mutational processes leave characteristic footprints in genomic DNA. This package provides a comprehensive set of flexible functions that allows researchers to easily evaluate and visualize a multitude of mutational patterns in base substitution catalogues of e.g. healthy samples, tumour samples, or DNA-repair deficient cells. This is the second major version of the package. Many new functions have been added and functions from the previous version have been enhanced. The package covers a wide range of patterns including: mutational signatures, transcriptional and replicative strand bias, lesion segregation, genomic distribution and association with genomic features, which are collectively meaningful for studying the activity of mutational processes. The package works with single nucleotide variants (SNVs), insertions and deletions (Indels), double base substitutions (DBSs) and larger multi base substitutions (MBSs). The package provides functionalities for both extracting mutational signatures de novo and determining the contribution of previously identified mutational signatures on a single sample level. MutationalPatterns integrates with common R genomic analysis workflows and allows easy association with (publicly available) annotation data.
Background on the biological relevance of the different mutational patterns, a practical illustration of the package functionalities, comparison with similar tools and software packages and an elaborate discussion, are described in the new MutationalPatterns article, which is in currently being written. The old article can be found here.
This vignette shows some common ways in which the functions in this package can
be used. It is however not exhaustive and won’t show every argument of every
function. You can view the documentation of a function by adding a ?
in front
of it. Like: ?plot_spectrum
. The describes the functions and all their
arguments in more detail. It also contains more examples of how the functions in
this package can be used.
First you need to install the package from Bioconductor
.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MutationalPatterns")
You also need to load the package.
This needs to be repeated every time you restart R
.
library(MutationalPatterns)
To perform the mutational pattern analyses, you need to load one or multiple VCF files with variant calls and the corresponding reference genome.
You can list available genomes using BSgenome:
library(BSgenome)
head(available.genomes())
## [1] "BSgenome.Alyrata.JGI.v1" "BSgenome.Amellifera.BeeBase.assembly4"
## [3] "BSgenome.Amellifera.NCBI.AmelHAv3.1" "BSgenome.Amellifera.UCSC.apiMel2"
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked" "BSgenome.Aofficinalis.NCBI.V1"
Make sure to install the reference genome that matches your VCFs.
For the example data this is BSgenome.Hsapiens.UCSC.hg19
.
Now you can load your reference genome:
ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)
We provided two example data sets with this package. One consists of a subset of somatic SNV catalogues of 9 normal human adult stem cells from 3 different tissues (Blokzijl et al. 2016), and the other contains somatic indels and DBSs from 3 healthy human hematopoietic stem cells (Osorio et al. 2018). The MBS data you will find in the latter dataset was manually included by us to demonstrate some features of this package.
This is how you can locate the VCF files of the example data from the first set.
These will be used for the SNV examples:
vcf_files <- list.files(system.file("extdata", package = "MutationalPatterns"),
pattern = "sample.vcf", full.names = TRUE
)
You also need to define corresponding sample names for the VCF files:
sample_names <- c(
"colon1", "colon2", "colon3",
"intestine1", "intestine2", "intestine3",
"liver1", "liver2", "liver3"
)
Now you can load the VCF files into a GRangesList
:
grl <- read_vcfs_as_granges(vcf_files, sample_names, ref_genome)
Here we define relevant metadata on the samples, such as tissue type. This will be useful later.
tissue <- c(rep("colon", 3), rep("intestine", 3), rep("liver", 3))
We will now locate the VCF files of the example data from the second set. These will be used for the indels, DBS and MBS examples.
blood_vcf_fnames <- list.files(
system.file("extdata", package = "MutationalPatterns"),
pattern = "blood.*vcf", full.names = TRUE)
Set their sample names.
blood_sample_names <- c("blood1", "blood2", "blood3")
Read in the data, without filtering for any mutation type using the type="all"
argument.
(By default only SNVs are loaded for backwards compatibility.)
blood_grl <- read_vcfs_as_granges(blood_vcf_fnames, blood_sample_names,
ref_genome, type = "all")
You can now retrieve different types of mutations from the GrangesList
.
snv_grl <- get_mut_type(blood_grl, type = "snv")
## Any neighbouring SNVs will be merged into DBS/MBS variants.
## Set the 'predefined_dbs_mbs' to 'TRUE' if you don't want this.
indel_grl <- get_mut_type(blood_grl, type = "indel")
dbs_grl <- get_mut_type(blood_grl, type = "dbs")
## Any neighbouring SNVs will be merged into DBS/MBS variants.
## Set the 'predefined_dbs_mbs' to 'TRUE' if you don't want this.
mbs_grl <- get_mut_type(blood_grl, type = "mbs")
## Any neighbouring SNVs will be merged into DBS/MBS variants.
## Set the 'predefined_dbs_mbs' to 'TRUE' if you don't want this.
It’s also possible to directly select for a specific mutation type when reading in the data. This can be a convenient shortcut, when you are only interested in a single type of mutation.
indel_grl <- read_vcfs_as_granges(blood_vcf_fnames, blood_sample_names,
ref_genome, type = "indel")
By default the package assumes that DBS and MBS variants are present in your
vcfs as separate neighbouring SNVs. MutationalPatterns merges these to get DBS
and MBS variants. If DBS and MBS variants have already been defined in your vcf
or if you don’t want any variants to be merged, then you can use the
predefined_dbs_mbs
argument, when using read_vcfs_as_granges
or
get_mut_type
.
(In this example the result will be empty, because the DBS variants were not predefined)
predefined_dbs_grl <- read_vcfs_as_granges(blood_vcf_fnames, blood_sample_names,
ref_genome, type = "dbs",
predefined_dbs_mbs = TRUE)
You can retrieve base substitution types from the VCF GRanges object as “REF>ALT”
using mutations_from_vcf
:
muts <- mutations_from_vcf(grl[[1]])
head(muts, 12)
## [1] "A>C" "A>G" "C>T" "A>G" "G>T" "T>A" "T>C" "G>A" "G>A" "C>A" "G>A" "G>T"
You can retrieve the base substitutions from the VCF GRanges object and convert
them to the 6 types of base substitution types that are distinguished by
convention: C>A, C>G, C>T, T>A, T>C, T>G. For example, when the reference
allele is G and the alternative allele is T (G>T), mut_type
returns the G:C>T:A mutation as a C>A mutation:
types <- mut_type(grl[[1]])
head(types, 12)
## [1] "T>G" "T>C" "C>T" "T>C" "C>A" "T>A" "T>C" "C>T" "C>T" "C>A" "C>T" "C>A"
To retrieve the sequence context (one base upstream and one base downstream) of
the base substitutions in the VCF object from the reference genome, you can use
the mut_context
function:
context <- mut_context(grl[[1]], ref_genome)
head(context, 12)
## chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr2 chr2 chr2
## "CAG" "AAC" "ACA" "AAG" "TGA" "GTT" "ATT" "CGC" "AGC" "ACA" "CGT" "GGA"
Withtype_context
, you can retrieve the types and contexts
for all positions in the VCF GRanges object. For the base substitutions that are
converted to the conventional base substitution types, the reverse complement of
the sequence context is returned.
type_context <- type_context(grl[[1]], ref_genome)
lapply(type_context, head, 12)
## $types
## [1] "T>G" "T>C" "C>T" "T>C" "C>A" "T>A" "T>C" "C>T" "C>T" "C>A" "C>T" "C>A"
##
## $context
## chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr2 chr2 chr2
## "CTG" "GTT" "ACA" "CTT" "TCA" "GTT" "ATT" "GCG" "GCT" "ACA" "ACG" "TCC"
With mut_type_occurrences
, you can count mutation type
occurrences for all VCF objects in the GRangesList
. For
C>T mutations, a distinction is made between C>T at CpG sites and other
sites, as deamination of methylated cytosine at CpG sites is a common mutational
process. For this reason, the reference genome is needed for this functionality.
type_occurrences <- mut_type_occurrences(grl, ref_genome)
type_occurrences
## C>A C>G C>T T>A T>C T>G C>T at CpG C>T other
## colon1 28 5 109 12 30 12 59 50
## colon2 77 29 345 36 90 21 209 136
## colon3 79 19 243 25 61 23 165 78
## intestine1 19 8 74 19 26 4 33 41
## intestine2 118 49 423 57 126 27 258 165
## intestine3 54 27 298 32 67 22 192 106
## liver1 43 22 94 30 77 34 18 76
## liver2 144 93 274 103 209 73 48 226
## liver3 39 28 61 15 32 23 7 54
A mutation spectrum shows the relative contribution of each mutation type in
the base substitution catalogs. The plot_spectrum
function plots
the mean relative contribution of each of the 6 base substitution types over
all samples. Error bars indicate the 95% confidence interval over all samples.
The total number of mutations is indicated.
p1 <- plot_spectrum(type_occurrences)
You can also plot the mutation spectrum with distinction between C>T at CpG sites and other sites:
p2 <- plot_spectrum(type_occurrences, CT = TRUE)
Other options include plotting the spectrum with the individual samples as points and removing the legend to save space:
p3 <- plot_spectrum(type_occurrences, CT = TRUE,
indv_points = TRUE, legend = FALSE)
Here we use the gridExtra package to combine the created plots, so you can see them next to each other.
library("gridExtra")
grid.arrange(p1, p2, p3, ncol = 3, widths = c(3, 3, 1.75))
It’s also possible to create a facet per sample group, e.g. plot the spectrum for each tissue separately:
p4 <- plot_spectrum(type_occurrences, by = tissue, CT = TRUE, legend = TRUE)
Or you could use the standard deviation instead of a 95% confidence interval:
p5 <- plot_spectrum(type_occurrences, CT = TRUE,
legend = TRUE, error_bars = "stdev")
grid.arrange(p4, p5, ncol = 2, widths = c(4, 2.3))
First you should make a 96 trinucleotide mutation count matrix. (In contrast to previous versions this also works for single samples.)
mut_mat <- mut_matrix(vcf_list = grl, ref_genome = ref_genome)
head(mut_mat)
## colon1 colon2 colon3 intestine1 intestine2 intestine3 liver1 liver2 liver3
## A[C>A]A 3 10 10 5 19 6 8 10 3
## A[C>A]C 0 3 3 1 8 4 1 8 2
## A[C>A]G 2 3 3 1 4 0 1 6 2
## A[C>A]T 0 2 9 0 9 2 2 12 2
## C[C>A]A 1 8 5 0 8 7 2 15 3
## C[C>A]C 2 5 3 1 3 2 1 15 2
Next, you can use this matrix to plot the 96 profile of samples. In this example we do this for 2 samples:
plot_96_profile(mut_mat[, c(1, 7)])
It’s also possible to look at larger mutational contexts. However, this is only useful if you have a large number of mutations.
mut_mat_ext_context <- mut_matrix(grl, ref_genome, extension = 2)
head(mut_mat_ext_context)
## colon1 colon2 colon3 intestine1 intestine2 intestine3 liver1 liver2 liver3
## AA[C>A]AA 0 4 1 2 9 4 0 4 0
## AA[C>A]AC 0 0 2 0 1 0 1 0 0
## AA[C>A]AG 0 0 1 0 2 0 1 0 0
## AA[C>A]AT 0 0 1 1 2 0 0 0 0
## AA[C>A]CA 0 0 0 0 0 0 0 2 1
## AA[C>A]CC 0 1 0 0 2 2 0 1 0
The extension
argument also works for the mut_context
and type_context
functions.
You can visualize this matrix with a heatmap.
plot_profile_heatmap(mut_mat_ext_context, by = tissue)