multistateQTL 1.0.0
multistateQTL
is a Bioconductor package for applying basic statistical tests (e.g., feature-wise FDR correction, calculating pairwise sharing), summarizing, and visualizing QTL summary statistics from multiple states (e.g., tissues, celltypes, environmental conditions). It works on the QTLExperiment
(QTLE
) object class, where rows represent features (e.g., genes, transcripts, genomic regions), columns represent states, and assays are the various summary statistics. It also provides wrapper implementations of a number of multi-test correction methods (e.g., mashr, meta-soft, etc), which result in a set of multi-test corrected summary statistics.
QTLExperiment and multistateQTL can be installed from GitHub:
if (!require("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install(c("QTLExperiment", "multistateQTL"), version="devel")
They are also available on GitHub:
devtools::install_git("https://github.com/dunstone-a/QTLExperiment", build_vignettes=TRUE)
devtools::install_git(https://github.com/dunstone-a/multistateQTL", build_vignettes=TRUE)
library(QTLExperiment)
library(multistateQTL)
Provided with real QTL summary statistics as either a QTLE
object or a named list with betas and error (and optionally pval or lfsr), key parameters are estimated that are used to simulate realistic multi-state QTL data. We demonstrate the parameter estimation function on publicly available summary statistics from GTEx (v8). Note that this data only contains tests that were called as significant by GTEx and vroom only loads the first chunk of results as it does not read .gz compressed objects well. This truncated dataset is used in the vignette for convenience, however, to estimate the default parameters in qtleParams()
we downloaded QTL summary statistics for all associations tested for the 10 GTEx tissues with the largest sample sizes from Google Cloud). To speed up calculations we filtered this to include only associations on chromosome 1 and considered significant tests with pval < 0.05 and null tests with a pval > 0.1.
See QTLExperiment for more info on the sumstats2qtle
function and for other approaches for reading in QTL summary statistics.
The parameters estimated here include:
rnorm
to sample an effect size for each QTL for each state. The variance parameter for rnorm
is user defined (default = 0.1).input_path <- system.file("extdata", package="multistateQTL")
state <- c("lung", "thyroid", "spleen", "blood")
input <- data.frame(
state=state,
path=paste0(input_path, "/GTEx_tx_", state, ".tsv"))
gtex <- sumstats2qtle(
input,
feature_id="molecular_trait_id",
variant_id="rsid",
betas="beta",
errors="se",
pvalues="pvalue",
verbose=TRUE)
gtex
## class: QTLExperiment
## dim: 1163 4
## metadata(0):
## assays(3): betas errors pvalues
## rownames(1163): ENST00000428771|rs554008981 ENST00000477976|rs554008981
## ... ENST00000445118|rs368254419 ENST00000483767|rs368254419
## rowData names(2): variant_id feature_id
## colnames(4): lung thyroid spleen blood
## colData names(1): state_id
head(betas(gtex))
## lung thyroid spleen blood
## ENST00000428771|rs554008981 -0.1733690 NA 0.134913 NA
## ENST00000477976|rs554008981 0.1616170 0.3173110 NA NA
## ENST00000483767|rs554008981 -0.4161480 -0.0483018 NA -0.204647
## ENST00000623070|rs554008981 -0.1137930 NA NA NA
## ENST00000669922|rs554008981 -0.1921680 -0.1067540 0.724622 -0.117424
## ENST00000428771|rs201055865 -0.0630909 NA NA NA
Estimating parameters:
params <- qtleEstimate(gtex, threshSig=0.05, threshNull=0.5)
params
## $cv.sig.shape
## [1] 7.070576
##
## $cv.sig.rate
## [1] 7.717956
##
## $cv.null.shape
## [1] 1.903578
##
## $cv.null.rate
## [1] 0.8979204
##
## $betas.sig.shape
## [1] 3.470299
##
## $betas.sig.rate
## [1] 16.27252
##
## $betas.null.shape
## [1] 1.360211
##
## $betas.null.rate
## [1] 11.03417
Looking at the distributions defined by these estimated parameters, the simulated effect sizes for significant QTL will tend to be larger, while the simulated coefficient of variation values will be smaller than for the non-significant QTL.
plotSimulationParams(params=params)
The default parameters available through qtleParams()
were estimated from the GTEx v8 tissue-level eQTL summary statistics from chromosome 1 using the 10 tissues with the largest sample sizes. From these data, significant QTL parameters were estimated from tests in the lowest p-value quantile, while null parameters were estimated from tests in the highest p-value quantile. Data for tests on chromosome 1 were included in all four tissues (n=32613).
The simulation tool allows for the simulation of four types of associations: (1) Global, where the simulated effect size is approximately equal across all states; (2) Unique, where the association is only significant in one state; (3) Multi-state, where the association is significant in a subset of states (i.e., state-groups), and (4) Null, where the association has no significant effects in any state. First each test is randomly assigned as one of the above types according to the proportions specified by the user. For multi-state QTL, each state is assigned to a state-group, either randomly or according to user defined groups, then each multi-state QTL is assigned randomly to one of the state-groups. For unique QTL, the QTL is randomly assigned to a single state.
Simulated mean effect sizes for all non-null QTL are sampled from gamma(beta.sig.shape, beta.sig.rate) and are randomly assigned a positive or negative effect direction. Then for each state where that QTL is significant, an effect size is sampled from N(mean effect size, σ), where σ is user defined (default=0.1). Effect sizes for null QTL are sampled from gamma(beta.null.shape, beta.null.rate) and are randomly assigned a positive or negative effect direction. Standard errors for each QTL for each state are simulated by sampling from gamma(cv.sig.shape, cv.sig.rate) or gamma(cv.null.shape, cv.null.rate) for significant and null QTL, respectively, and multiplying the sampled cv by the absolute value of the simulated beta for that QTL in that state.
Here is an example of a simple simulation with half of the simulated QTL tests having globally significant effects. This example uses the default parameters.
sim <- qtleSimulate(nTests=1000, nStates=6, global=0.5)
sim
## class: QTLExperiment
## dim: 1000 6
## metadata(0):
## assays(3): betas errors lfsrs
## rownames(1000): F0867|v86973 F0107|v67504 ... F0874|v42722 F0190|v78203
## rowData names(11): feature_id variant_id ... S5 S6
## colnames(6): S1 S2 ... S5 S6
## colData names(1): state_id
head(rowData(sim))
## DataFrame with 6 rows and 11 columns
## feature_id variant_id QTL id mean_beta
## <character> <character> <character> <character> <numeric>
## F0867|v86973 F0867 v86973 global F0867|v86973 0.274222
## F0107|v67504 F0107 v67504 global F0107|v67504 0.740930
## F0239|v64205 F0239 v64205 global F0239|v64205 0.211987
## F0531|v90951 F0531 v90951 global F0531|v90951 -0.650529
## F0950|v93596 F0950 v93596 null F0950|v93596 0.000000
## F0851|v25798 F0851 v25798 null F0851|v25798 0.000000
## S1 S2 S3 S4 S5 S6
## <logical> <logical> <logical> <logical> <logical> <logical>
## F0867|v86973 TRUE TRUE TRUE TRUE TRUE TRUE
## F0107|v67504 TRUE TRUE TRUE TRUE TRUE TRUE
## F0239|v64205 TRUE TRUE TRUE TRUE TRUE TRUE
## F0531|v90951 TRUE TRUE TRUE TRUE TRUE TRUE
## F0950|v93596 FALSE FALSE FALSE FALSE FALSE FALSE
## F0851|v25798 FALSE FALSE FALSE FALSE FALSE FALSE
We can also generate more complex simulations, for example this simulation has 20% global, 40% multi-state, 20% unique, and 20% null QTL effects, where multi-state effects are assigned to one of two state-groups.
sim <- qtleSimulate(
nStates=10, nFeatures=100, nTests=1000,
global=0.2, multi=0.4, unique=0.2, k=2)
Here is a snapshot of the simulation key for QTL simulated as unique to a single state:
head(rowData(subset(sim, QTL == "unique")))
## DataFrame with 6 rows and 16 columns
## feature_id variant_id QTL id mean_beta S01
## <character> <character> <character> <character> <numeric> <logical>
## F064|v35885 F064 v35885 unique F064|v35885 0.409911 FALSE
## F004|v26038 F004 v26038 unique F004|v26038 0.460528 FALSE
## F088|v38945 F088 v38945 unique F088|v38945 1.279785 FALSE
## F072|v77053 F072 v77053 unique F072|v77053 0.577497 FALSE
## F020|v13801 F020 v13801 unique F020|v13801 0.428142 FALSE
## F031|v81523 F031 v81523 unique F031|v81523 -0.443593 FALSE
## S02 S03 S04 S05 S06 S07
## <logical> <logical> <logical> <logical> <logical> <logical>
## F064|v35885 FALSE FALSE FALSE FALSE FALSE FALSE
## F004|v26038 FALSE FALSE FALSE FALSE FALSE FALSE
## F088|v38945 FALSE FALSE FALSE FALSE FALSE FALSE
## F072|v77053 FALSE FALSE FALSE FALSE FALSE FALSE
## F020|v13801 FALSE FALSE FALSE FALSE FALSE FALSE
## F031|v81523 FALSE FALSE FALSE FALSE FALSE FALSE
## S08 S09 S10 multistateGroup
## <logical> <logical> <logical> <character>
## F064|v35885 TRUE FALSE FALSE NA
## F004|v26038 FALSE TRUE FALSE NA
## F088|v38945 FALSE TRUE FALSE NA
## F072|v77053 FALSE FALSE TRUE NA
## F020|v13801 FALSE FALSE TRUE NA
## F031|v81523 FALSE TRUE FALSE NA
Here is a snapshot of the simulation key for QTL simulated as multi-state:
head(rowData(subset(sim, QTL == "multistate")))
## DataFrame with 6 rows and 16 columns
## feature_id variant_id QTL id mean_beta S01
## <character> <character> <character> <character> <numeric> <logical>
## F023|v5205 F023 v5205 multistate F023|v5205 -0.567860 TRUE
## F061|v76102 F061 v76102 multistate F061|v76102 0.589580 FALSE
## F072|v45821 F072 v45821 multistate F072|v45821 0.528045 FALSE
## F066|v36503 F066 v36503 multistate F066|v36503 -0.462518 FALSE
## F016|v385 F016 v385 multistate F016|v385 -0.575820 FALSE
## F077|v62943 F077 v62943 multistate F077|v62943 -0.702591 TRUE
## S02 S03 S04 S05 S06 S07
## <logical> <logical> <logical> <logical> <logical> <logical>
## F023|v5205 FALSE TRUE FALSE TRUE TRUE FALSE
## F061|v76102 TRUE FALSE TRUE FALSE FALSE TRUE
## F072|v45821 TRUE FALSE TRUE FALSE FALSE TRUE
## F066|v36503 TRUE FALSE TRUE FALSE FALSE TRUE
## F016|v385 TRUE FALSE TRUE FALSE FALSE TRUE
## F077|v62943 FALSE TRUE FALSE TRUE TRUE FALSE
## S08 S09 S10 multistateGroup
## <logical> <logical> <logical> <character>
## F023|v5205 TRUE TRUE TRUE Group1
## F061|v76102 FALSE FALSE FALSE Group2
## F072|v45821 FALSE FALSE FALSE Group2
## F066|v36503 FALSE FALSE FALSE Group2
## F016|v385 FALSE FALSE FALSE Group2
## F077|v62943 TRUE TRUE TRUE Group1
message("Number of QTL specific to each state-group:")
table(rowData(subset(sim, QTL == "multistate"))$multistateGroup)
##
## Group1 Group2
## 180 220
The multistateQTL toolkit provides two functions to help deal with missing data, getComplete
and replaceNAs
. The getComplete
function is a smart subsetting function that remove QTL associations (rows) with more than an allowed amount of missing data. The replaceNAs
function allows for NAs in each assay to be replaced with a constant or with the row mean or row median. For example, here is a snapshot of our simulated data from above with added NAs:
na_pattern <- sample(seq(1, ncol(sim)*nrow(sim)), 1000)
sim_na <- sim
assay(sim_na, "betas")[na_pattern] <- NA
assay(sim_na, "errors")[na_pattern] <- NA
assay(sim_na, "lfsrs")[na_pattern] <- NA
message("Number of simulated tests: ", nrow(sim_na))
head(betas(sim_na))
## S01 S02 S03 S04 S05 S06
## F023|v5205 -0.6408860 NA -0.61015989 0.4746848 -0.5355503 -0.5996299
## F064|v35885 0.3058540 0.1503173 0.39147843 0.3581393 -0.2495009 0.3656625
## F061|v76102 0.6789483 0.6452821 0.01386314 0.6205773 0.1526890 0.1482970
## F072|v45821 -0.2920567 0.5642467 NA 0.5109627 -0.2517743 0.1750774
## F052|v25298 0.1206159 -0.4384336 0.63220732 0.2553040 0.1784669 NA
## F066|v36503 0.2950614 -0.5365944 0.04943200 -0.4945238 0.2825350 -0.1313272
## S07 S08 S09 S10
## F023|v5205 NA NA -0.5265607 -0.7008404
## F064|v35885 0.1797833 0.3770677 0.2747453 0.2506665
## F061|v76102 0.7101835 0.2389427 0.4925350 -0.6088679
## F072|v45821 0.5460297 0.1897046 -0.5222729 NA
## F052|v25298 0.3867841 0.3070954 NA -0.6314165
## F066|v36503 -0.5335727 0.2287540 NA -0.1775369
First we can use getComplete
to keep only the tests that have data for at least half of the states:
sim_na <- getComplete(sim_na, n=0.5, verbose=TRUE)
Then for the remaining QTL, we can fill in the missing values using the following scheme
sim_na <- replaceNAs(sim_na, verbose=TRUE)
head(betas(sim_na))
## S01 S02 S03 S04 S05 S06
## F023|v5205 -0.6408860 0.0000000 -0.61015989 0.4746848 -0.5355503 -0.5996299
## F064|v35885 0.3058540 0.1503173 0.39147843 0.3581393 -0.2495009 0.3656625
## F061|v76102 0.6789483 0.6452821 0.01386314 0.6205773 0.1526890 0.1482970
## F072|v45821 -0.2920567 0.5642467 0.00000000 0.5109627 -0.2517743 0.1750774
## F052|v25298 0.1206159 -0.4384336 0.63220732 0.2553040 0.1784669 0.0000000
## F066|v36503 0.2950614 -0.5365944 0.04943200 -0.4945238 0.2825350 -0.1313272
## S07 S08 S09 S10
## F023|v5205 0.0000000 0.0000000 -0.5265607 -0.7008404
## F064|v35885 0.1797833 0.3770677 0.2747453 0.2506665
## F061|v76102 0.7101835 0.2389427 0.4925350 -0.6088679
## F072|v45821 0.5460297 0.1897046 -0.5222729 0.0000000
## F052|v25298 0.3867841 0.3070954 0.0000000 -0.6314165
## F066|v36503 -0.5335727 0.2287540 0.0000000 -0.1775369
The multistateQTL toolkit also provides the callSignificance
function, which calls QTL tests significant in each state using either a single or two-step threshold approach. For example, we can set a single lfsr threshold of 0.1 to call significance of our simulate QTL:
sim <- callSignificance(sim, assay="lfsrs", thresh=0.05)
message("Median number of significant tests per state: ",
median(colData(sim)$nSignificant))
Because we have the simulated ground-truth, we can compare these significance calls to what was simulated using the simPerformance
function, which provides the following global (i.e. across all state) performance metrics:
sim <- callSignificance(sim, assay="lfsrs", thresh=0.001)
perf_metrics <- simPerformance(sim)
lapply(perf_metrics, FUN=function(x) {round(x, 2)})
## $accuracy
## [1] 0.41
##
## $precision
## [1] 0.41
##
## $recall
## [1] 1
##
## $f1
## [1] 0.58
##
## $cm
## called
## simulated FALSE TRUE
## FALSE 33 5847
## TRUE 15 4105
As you can see the recall of TRUE significant QTL is quite low. However if we change our significance calling approach to be more flexible.
sim <- callSignificance(
sim, mode="simple", assay="lfsrs",
thresh=0.0001, secondThresh=0.0002)
simPerformance(sim)$recall
## [1] 0.990534
The multistateQTL
package contains five functions for visualising multi-state eQTL data.
These functions are based on the ggplot2 and ComplexHeatmap R packages.
plotPairwiseSharing
: based on ComplexHeatmap::Heatmap
plotQTLClusters
: based on ComplexHeatmap::Heatmap
plotUpSet
: based on ComplexHeatmap::UpSet
plotCompareStates
: based on ggplot2
plotSimulationParams
: based on ggplot2
The functions built on ggplot2
are compatible with ggplot2
syntax such as the +
operator.
The function plotPairwiseSharing
shows the degree of pairwise sharing of significant hits for each combination of two states.
Column annotations can be added by specifying a valid column name from the colData
of the object.
In the plot below, columns are ordered by the broad cell type (multistateGroup
) of the states.
We expect states belonging to the same multi-state group to be more related and have a greater degree of sharing of significant eQTLs.
Column annotations are used to show the number of significant eQTLs for each state.
sim_sig <- getSignificant(sim)
sim_top <- getTopHits(sim_sig, assay="lfsrs", mode="state")
sim_top <- runPairwiseSharing(sim_top)
p1 <- plotPairwiseSharing(sim_top, annotate_cols=c("nSignificant", "multistateGroup"))
These plots show the set of tests that are significant, but not necessarily shared, by groups of states.
plotUpSet(sim_top, annotateColsBy=c("nSignificant", "multistateGroup"))