To evaluate the aneuploidy and prevalence of clonal or quasiclonal tumors, we developed a novel tool to characterize the mosaic tumor genome on the basis of one major assumption: the genome of a heterogeneous multi-cell tumor biopsy can be sliced into a chain of segments that are characterized by homogeneous somatic copy number alternations (SCNAs) and B allele frequencies (BAFs). The model, termed BubbleTree, utilizes both SCNAs and the interconnected BAFs as markers of tumor clones to extract tumor clonality estimates. BubbleTree is an intuitive and powerful approach to jointly identify ASCN, tumor purity and (sub)clonality, which aims to improve upon current methods to characterize the tumor karyotypes and ultimately better inform cancer diagnosis, prognosis and treatment decisions.
To perform a BubbleTree analysis, data pertaining to the position and B allele frequency of heterozygous snps in the tumor sample and segmented copy number information including the position, number of markers/segment and log2 copy number ratio between tumor and normal samples must first be obtained.
BubbleTree was developed using both whole exome sequencing (WES) and whole genome sequening (WGS) NGS data from paired tumor/normal biopsies, but this model should also be applicable to array comparative genomic hybridization (aCGH) and single nucleotide polymorphism (SNP) array data.
There are many methods for generating and processing sequencing data in preparation for BubbleTree analysis. In the following section we provide example workflows starting from WES NGS which can be adapted as needed to alternate inputs.
The primary BubbleTree requirement for sequence variant information is a GRanges object containing the B alelle frequencies and genomic positions of variants known to be heterozygous in the paired normal sample.
Mapped reads from tumor and normal tissue can be processed with mutation caller software such as VarScan or MUTECT. In this example, we use a hypothetical vcf file from VarScan output which contains mutation calls from both normal and tumor samples.
Assume that you have loaded the data snp.dat like this:
head(snp.dat)
CHROM POS ID REF ALT QUAL FILTER LT.rna.dp LN.rna.dp ON.rna.dp OT.rna.dp BT.wes.dp LT.wes.dp LN.wes.dp ON.wes.dp
1 chr1 54757 rs202000650 T G . PASS NA NA NA NA NA NA NA NA
2 chr1 564636 . C T . PASS NA NA NA NA NA NA NA NA
3 chr1 564862 rs1988726 T C . PASS NA NA NA NA NA NA NA NA
4 chr1 564868 rs1832728 T C . PASS NA NA NA NA NA NA NA NA
5 chr1 565454 rs7349151 T C . PASS NA NA NA NA NA NA NA NA
6 chr1 565464 rs6594030 T C . PASS NA NA NA NA NA NA NA NA
OT.wes.dp LT.wgs.dp LN.wgs.dp ON.wgs.dp OT.wgs.dp LT.rna.freq LN.rna.freq ON.rna.freq OT.rna.freq BT.wes.freq LT.wes.freq
1 NA 25 24 27 19 NA NA NA NA NA NA
2 NA 21 NA NA 14 NA NA NA NA NA NA
3 NA 10 15 55 13 NA NA NA NA NA NA
4 NA 10 12 60 14 NA NA NA NA NA NA
5 NA 21 14 26 24 NA NA NA NA NA NA
6 NA 25 16 33 29 NA NA NA NA NA NA
LN.wes.freq ON.wes.freq OT.wes.freq LT.wgs.freq LN.wgs.freq ON.wgs.freq OT.wgs.freq
1 NA NA NA 0.2400 0.1667 0.2222 0.3684
2 NA NA NA 0.0000 NA NA 0.1429
3 NA NA NA 0.4000 0.5333 0.9091 0.7692
4 NA NA NA 0.5000 0.6667 0.9333 0.7857
5 NA NA NA 0.1429 0.3571 0.6538 0.6250
6 NA NA NA 0.2000 0.3750 0.7273 0.5862
Identify the germline heterozygous loci:
is.hetero <- function(x, a=0.3, b=0.7) {
(x - a) * (b - x) >= 0
}
wgs.snp.ss <- subset(snp.dat, ! CHROM %in% c("chrX", "chrY") &
LN.wgs.dp >= 15 &
ON.wgs.dp >=15 &
is.hetero(LN.wgs.freq, 0.4, 0.6) &
is.hetero(ON.wgs.freq, 0.4, 0.6))
Then convert to the GRanges object:
library(GenomicRanges)
wgs.hetero.grs <- list()
wgs.hetero.grs$lung <- with(wgs.snp.ss, GRanges(CHROM, IRanges(POS, POS), mcols=cbind(LT.wgs.dp, LT.wgs.freq)))
wgs.hetero.grs$ovary <- with(wgs.snp.ss, GRanges(CHROM, IRanges(POS, POS), mcols=cbind(OT.wgs.dp, OT.wgs.freq)))
names(elementMetadata(wgs.hetero.grs$lung)) <- names(elementMetadata(wgs.hetero.grs$ovary)) <- c("dp", "freq")
The B-allele frequency data is extracted using the Bioconductor package VariantAnnotation and converted from string to numeric format.
The object seg is the segment call output from DNAcopy and min.num here specifies the minimum segment size to keep
library(GenomicRanges)
min.num <- 10
cnv.gr <- with(subset(seg$output, num.mark >= min.num & ! chrom %in% c("chrX", "chrY")) , GRanges(chrom, IRanges(loc.start, loc.end), mcols=cbind(num.mark, seg.mean)))
Then merge the SNP and CNV GRanges objects.
Example data in the desired format is provided as part of this package as GRanges objects and can be loaded as shown below. To utilize this vignette, you must first load BubbleTree below. You don’t need to use “suppressMessages”.
suppressMessages(
library(BubbleTree)
)
allCall.lst is pre-calculated CNV data. allRBD.lst is simply the RBD data from below.
load(system.file("data", "allCall.lst.RData", package="BubbleTree"))
head(allCall.lst[[1]]@rbd)
## seqnames start end width strand seg.id num.mark lrr
## 1 chr10 93890 38769716 38675827 * 806 31699 0.1413
## 2 chr10 38877329 135523936 96646608 * 808 74425 0.1415
## 3 chr11 133952 134946370 134812419 * 812 102934 0.1412
## 4 chr12 60000 133841793 133781794 * 813 103392 0.1413
## 5 chr13 19020000 115109861 96089862 * 814 76080 0.1419
## 6 chr14 20191636 107288640 87097005 * 823 68709 0.1425
## kurtosis hds hds.sd het.cnt seg.size
## 1 -0.093044958 0.01851852 0.06051370 10400 1.487216
## 2 -0.048701274 0.01851852 0.06046402 25414 3.491784
## 3 -0.018840021 0.01851852 0.06052843 36183 4.829335
## 4 -0.007149860 0.01851852 0.06096056 36798 4.850823
## 5 -0.022536940 0.01851852 0.06057517 27278 3.569431
## 6 -0.004935614 0.01851852 0.06166893 25001 3.223607
However, if you wish to create your own RBD object from your input, you would use the code below. There is test data available in this package that is used for demonstration purposes.
# load sample files
load(system.file("data", "cnv.gr.rda", package="BubbleTree"))
load(system.file("data", "snp.gr.rda", package="BubbleTree"))
# load annotations
load(system.file("data", "centromere.dat.rda", package="BubbleTree"))
load(system.file("data", "cyto.gr.rda", package="BubbleTree"))
load(system.file("data", "cancer.genes.minus2.rda", package="BubbleTree"))
load(system.file("data", "vol.genes.rda", package="BubbleTree"))
load(system.file("data", "gene.uni.clean.gr.rda", package="BubbleTree"))
# initialize RBD object
r <- new("RBD", unimodal.kurtosis=-0.1)
# create new RBD object with GenomicRanges objects for SNPs and CNVs
rbd <- makeRBD(r, snp.gr, cnv.gr)
head(rbd)
## seqnames start end width strand seg.id num.mark lrr kurtosis
## 1 chr1 65625 2066855 2001231 * 1 548 0.1997 0.1830119
## 2 chr1 2075796 38489397 36413602 * 2 5284 -0.4146 -1.9020390
## 3 chr1 38511244 39761601 1250358 * 3 72 -0.0511 -2.0000000
## 4 chr1 39763396 39982177 218782 * 4 112 0.0401 -0.9912620
## 5 chr1 39988109 43905367 3917259 * 5 601 0.0372 -1.4809979
## 6 chr1 43905709 44128685 222977 * 6 53 0.2822 -2.0000000
## hds hds.sd het.cnt seg.size
## 1 0.01220 0.05007095 71 0.24809178
## 2 0.35600 0.06296830 501 2.39218420
## 3 0.15555 0.03471894 2 0.03259600
## 4 0.12060 0.06761821 4 0.05070489
## 5 0.14560 0.06186714 47 0.27208605
## 6 0.07280 0.05176022 2 0.02399428
# create a new prediction
btreepredictor <- new("BTreePredictor", rbd=rbd, max.ploidy=6, prev.grid=seq(0.2,1, by=0.01))
pred <- btpredict(btreepredictor)
# create rbd plot
btreeplotter <- new("BTreePlotter", max.ploidy=5, max.size=10)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the BubbleTree package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
btree <- drawBTree(btreeplotter, pred@rbd)
## Warning in drawBTree(btreeplotter, pred@rbd): More ploidy might be suggested: 1.6, 1.6, 2.2, 1.7, 1.9, 1.5, 1.7, 2.2, 1.9, 2.1, 1.7
print(btree)