This vignette introduces the GRaNIE
package and explains the main features and additional methodological background. We refer also to the Workflow Vignette for further information and examples as well as the latest preprint and methodological details therein.
GRaNIE 1.9.7
Genetic variants associated with diseases often affect non-coding
regions, thus likely having a regulatory role. To understand the effects
of genetic variants in these regulatory regions, identifying genes that
are modulated by specific regulatory elements (REs) is crucial. The
effect of gene regulatory elements, such as enhancers, is often
cell-type specific, likely because the combinations of transcription
factors (TFs) that are regulating a given enhancer have cell-type
specific activity. This TF activity can be quantified with existing
tools such as diffTF
and captures differences in binding of a TF in
open chromatin regions. Collectively, this forms a enhancer-mediated
gene regulatory network (eGRN) with cell-type and data-specific TF-RE
and RE-gene links. Here, we reconstruct such a eGRN using bulk RNA-seq
and open chromatin (e.g., using ATAC-seq or ChIP-seq for open chromatin
marks) and optionally TF activity data. Our network contains different
types of links, connecting TFs to regulatory elements, the latter of
which is connected to genes in the vicinity or within the same chromatin
domain (TAD). We use a statistical framework to assign empirical FDRs
and weights to all links using various statistical approaches.
In summary, we present a framework to reconstruct predictive enhancer-mediated regulatory network models that are based on integrating of expression and chromatin accessibility/activity pattern across individuals, and provide a comprehensive resource of cell-type specific gene regulatory networks for particular cell types.
For the latest version of the paper, see the References.
GRaNIE
package and required dependency packagesGRaNIE
is available on Bioconductor. The package and installation
instructions can be found
here and are very simple.
The basic installation installs all required packages but not
necessarily those that are listed under Suggests
unless you installed
the package with BiocManager::install("GRaNIE", dependencies = TRUE)
.
However, we also provide instructions on how to install the newest version of the package outside of Bioconductor for users who do not use the newest version of Bioconductor and/or do not have the devel version of Bioconductor installed. For details, see the instructions here.
Note that from the additional packages, only some of the genome annotation packages are actually required, which of them depends on your genome assembly version (see below).
Overall, we tried to minimize the installation burden and only require packages if they are absolutely necessary for the main workflow. If you want to pre-install also all optional packages, please consider the following options:
BiocManager::install("GRaNIE", dependencies = TRUE)
which
however installs all annotation packages for all supported genomes,
which may take a longer time due to the overall download size of
multiple Gb.Genome annotation packages (only one of the four is optionally needed for some functions, see section below, which of them depends on your genome assembly version):
hg38
:
BiocManager::install(c("org.Hs.eg.db", "BSgenome.Hsapiens.UCSC.hg38", "TxDb.Hsapiens.UCSC.hg38.knownGene"))
hg19
:
BiocManager::install(c("org.Hs.eg.db", "BSgenome.Hsapiens.UCSC.hg19", "TxDb.Hsapiens.UCSC.hg19.knownGene"))
mm10
:
BiocManager::install(c("org.Mm.eg.db", "BSgenome.Mmusculus.UCSC.mm10", "TxDb.Mmusculus.UCSC.mm10.knownGene"))
mm39
:
BiocManager::install(c("org.Mm.eg.db", "BSgenome.Mmusculus.UCSC.mm39", "TxDb.Mmusculus.UCSC.mm39.knownGene"))
mm9
:
BiocManager::install(c("org.Mm.eg.db", "BSgenome.Mmusculus.UCSC.mm9", "TxDb.Mmusculus.UCSC.mm9.knownGene"))
rn6
:
BiocManager::install(c("org.Rn.eg.db", "BSgenome.Rnorvegicus.UCSC.rn6", "TxDb.Rnorvegicus.UCSC.rn6.refGene"))
rn7
:
BiocManager::install(c("org.Rn.eg.db", "BSgenome.Rnorvegicus.UCSC.rn7", "TxDb.Rnorvegicus.UCSC.rn7.refGene"))
dm6
:
BiocManager::install(c("org.Dm.eg.db", "BSgenome.Dmelanogaster.UCSC.dm6", "TxDb.Dmelanogaster.UCSC.dm6.ensGene"))
If you want to use the JASPAR database for TF and TFBS, you
need the following extra packages:
BiocManager::install(c("JASPAR2024", "TFBSTools", "motifmatchr", "rbioapi"))
For all other optional packages, you may execute this line for
installing them in one go:
BiocManager::install(c("IHW", "WGCNA", "csaw", "EDASeq", "ChIPseeker", "variancePartition", "ReactomePA", "DOSE", "clusterProfiler", "BiocFileCache", "BiocParallel", "LDlinkR"))
GRaNIE
has a number of packages that enhance the functionality for
particular cases, may be required depending on certain parameters, only
when using a particular functionality or that generally speed-up the
computation. In detail, the following packages are currently optional
and may be needed context-specifically:
Genome assembly and annotation packages (only one of the four is optionally needed, which of them depends on your genome assembly version)
all of the following packages are ONLY needed for either
additional peak annotation in combination with ChIPseeker
(see
below) or if the chosen peak normalization method is GC-based
(EDASeq_GC_peaks
, gcqn_peaks
)
org.Hs.eg.db
: Needed only for hg19
or hg38
org.Mm.eg.db
: Needed only for mm9
or mm10
or mm39
org.Rn.eg.db
: Needed only for rn6
or rn7
org.Dm.eg.db
: Needed only for dm6
BSgenome.Hsapiens.UCSC.hg19
,
TxDb.Hsapiens.UCSC.hg19.knownGene
: Needed only for hg19
BSgenome.Hsapiens.UCSC.hg38
,
TxDb.Hsapiens.UCSC.hg38.knownGene
: Needed only for hg38
BSgenome.Mmusculus.UCSC.mm39
,
TxDb.Mmusculus.UCSC.mm39.knownGene
: Needed only for mm39
BSgenome.Mmusculus.UCSC.mm10
,
TxDb.Mmusculus.UCSC.mm10.knownGene
: Needed only for mm10
BSgenome.Mmusculus.UCSC.mm9
,
TxDb.Mmusculus.UCSC.mm9.knownGene
: Needed only for mm9
BSgenome.Rnorvegicus.UCSC.rn6
,
TxDb.Rnorvegicus.UCSC.rn6.refGene
: Needed only for rn6
BSgenome.Rnorvegicus.UCSC.rn7
,
TxDb.Rnorvegicus.UCSC.rn7.refGene
: Needed only for rn7
BSgenome.Dmelanogaster.UCSC.dm6
,
TxDb.Dmelanogaster.UCSC.dm6.ensGene
: Needed only for dm6
ChIPseeker
: Not needed strictly speaking, but used only
for the function addData()
to provide additional peak
annotation, fully optional otherwise.
TF and TFBS related packages
JASPAR2024
, TFBSTools
, motifmatchr
,
rbioapi
: Needed only when source = "JASPAR"
for
addTFBS()
for using JASPAR 2024 TFs and TFBSAdditional statistical packages and for enhanced functionality
IHW
: Needed only for the function
filterGRNAndConnectGenes()
for p-value adjustment of the raw
peak-gene p-values when using the IHW
framework (parameter
peak_gene.fdr.method
)csaw
: Needed only for a cyclic LOESS normalization in
addData()
, which is the default normalization currently for
adding TF activity data via addData_TFActivity()
EDASeq
: Provides additional, GC-aware normalization
schemes for the peak data in in `addData().variancePartition
: Needed only for the function
add_featureVariation()
to quantify multiple sources of
biological and technical variation for features (TFs, peaks, and
genes)WGCNA
: Needed only when setting corMethod = "bicor"
in
multiple supported functions for enabling another type of robust
correlations as alternative to Spearman (an experimental,
largely undocumented feature so far).Enrichment-associated packages
topGO
, ReactomePA
, DOSE
,
clusterProfiler
: Needed only for all enrichment functions
for GO
, Reactome
, DO
, and KEGG
respectivelySNP-related extra packages
LDlinkR
: Needed only for addSNPData()
to add SNP LD
information.Computational efficacy
BiocFileCache
: Needed only for loadExampleObject()
, in
cases when this function is executed multiple times, caching is
faster than re-downloading the file anew every time the function
is executed.BiocParallel
: Only used but highly recommended in the
functions addConnections_peak_gene()
,
add_TF_gene_correlation()
, and overlapPeaksAndTFBS()
to
speed-up computation when requesting multiple cores.Check the workflow vignette for an example workflow with explanations, figures and results.
For user convenience, we provide the function loadExampleObject()
that
can load a supported GRN object from the GRaNIE
package from an
arbitrary URL into R. This function can be used, for example, to load an
example object that contains a small number of TFs that we specifically
prepared for package use and exploring the package.
Simply run the following command:
GRN = loadExampleObject()
You can start by simply typing GRN
(i.e., the object name) into the
console, and a summary of the GRN object is printed. The example
object is always up to date with the most recent version of the package
and everything that the package can calculate is contained in the
object. Thus, you can run any function after loading the example
object.
For more details on where the data from the example object comes from, see the workflow vignette with explanations.
In our GRaNIE
approach, we integrate multiple data modalities. Here,
we describe them in detail and their required format.
Open chromatin data may come from ATAC-seq, DNAse-seq or ChIP-seq data for particular histone modifications that associate with open chromatin such as histone acetylation (e.g., H3K27ac). They all capture open chromatin either directly or indirectly, and while we primarily tested and used ATAC-seq while developing the package, the others should also be applicable for our framework. From here on, we will refer to these regions simply as peaks.
For RNA-seq, the data represent expression counts per gene across samples.
Here is a quick graphical representation which format is required to be compatible with our framework:
peakID
while
for RNA-seq, we use EnsemblID
chr
denoting the chromosome, followed by :
, and then start
, -
,
and end
for the peak start and end, respectively. Coordinates
are assumed to be zero-based exclusive, the standard for BED
files, see
here
or here for
more information. In short, the first base in a chromosome is
numbered 0, and the last base is not included in the feature.
For example, a peak with the coordinates chr1:100-150
starts
at position 100 (or the 101th base of the chromosome, as
counting starts with 0), spans 50 bp and ends at 149 (and NOT
150).GRaNIE
pipeline can work with it. Note that only the shared samples between
both data modalities are kept, however, so make sure that the sample
names match between them and share as many samples as possible. See
the methods part for guidelines on how many samples we recommend.Note that peaks should not overlap. If they do, an informative error message is thrown and the user is requested to modify the peak input data so that no overlaps exist among all peaks. This can be done by either merging overlapping peaks or deleting those that overlap with other peaks based on other criteria such as peak signal, by keeping only the strongest peak, for example.
For guidelines on how many peaks are necessary or recommended, see the section below.
For guidelines on whether and how to prepare single-cell 10x multiome data, see here.
TF and TFBS data is mandatory as input. As of March 2023, we offer two
different ways of how to import TF and TFBS data into GRaNIE
.
First, we support the import of any TF and TFBS data irrespective of how
it was generated or what it represents, thereby offering full
flexibility for how our framework is used. The only drawback is that
preparing the TF and TFBS data is a little more involved. This was the
only and default way of how to use GRaNIE
until February 2023. In what
follows, we explain more details.
Specifically, the package requires a bed
file per TF with TF binding
sites (TFBS). TFBS can either be in-silico predicted, or experimentally
verified, as long as genome-wide TFBS can be used.
Our recommended and default TF database is HOCOMOCO v12, see below for details.
As a default, we use HOCOMOCO-derived TFBS predictions that were
predicted with PWMScan
. For further method details, see the diffTF
paper from our
lab.
We provide TFBS predictions for HOCOMOCO-based TF motifs that were used
with PWMScan
for hg19
, hg38
, mm10
and soon mm39
. As of 2024, we recommend
using HOCOMOCO v12 (Website,
Paper). Using
this database requires only one extra step of downloading the database
and preparing it for the use with GRaNIE.
We provide the following downloads that can be directly used with GRaNIE:
hg38:
PWMScan
+ HOCOMOCO v12 INVIVO collection (Download
here)PWMScan
+ HOCOMOCO v11 (Download
here)FIMO
+ HOCOMOCO v11 (Download
here)hg19:
PWMScan
+ HOCOMOCO v10 (Download
here)mm10:
PWMScan
+ HOCOMOCO v12 INVIVO collection (Download
here)PWMScan
+ HOCOMOCO v10 (Download
here)mm39:
PWMScan
+ HOCOMOCO v12 INVIVO collection (Download
here)To use them, simply download the file, extract and then reference the
folder within the motifFolder
argument in the addTFBS()
function -
see also the Workflow Vignette
for examples.
However, you may also use your own TFBS data, and we provide full flexibility in doing so. Only some manual preparation is necessary. Briefly, if you decide to use your own TFBS data, you have to prepare the following:
bed
or bed.gz
format, 6 columns{TF}{suffix}.{fileEnding}
, where {TF}
specifies the name of the TF, {suffix}
an optional and arbitrary
string (we use _TFBS
, for example), and {fileEnding}
the file
type (supported are bed
and bed.gz
).translationTable.csv
. This file must have the following
structure: 3 columns (tab-separated), called SYMBOL
, ENSEMBL
,
and HOCOID
. The first column denotes a symbol or shortcut for the
TF that is used throughout the pipeline (e.g., AHR
), the second
the ENSEMBL ID (without the dot suffix; e.g., ENSG00000106546
) and
the third column the prefix of how the file is called with the TFBS
(e.g., AHR.0.B
if the file for AHR
is called
AHR.0.B_TFBS.bed.gz
).GRaNIE
as
separate downloads. For more information, check the links above.For more methodological details, details on how to construct these
files, their exact format etc we refer to diffTF
paper and
website for details.
As of March 2023, we additionally a more user-friendly way of importing
TF and TFBS data from the various JASPAR databases that are available in
R/Bioconductor: JASPAR
2022 and 2024. While we so far tested only with
the JASPAR2022
database / R package, the other ones should also work -
if not, please let us know. Using JASPAR2024
as TF database is much
easier and also quicker from a user perspective, and none of the
additional complexities that were mentioned above for the custom import
apply here. However, we do note that HOCOMOCO v12 produces larger GRNs
from our (limited) test experience. For details, see the addTFBS()
and
overlapPeaksAndTFBS()
functions.
Note that we also implemented an option to customize which TFBS
collection to use from the JASPAR
database via the function argument
JASPAR_useSpecificTaxGroup
from addTFBS()
. For example, this feature
can be used to specify the whole TF collection
irrespective of the genome assembly the data is in - which is useful for
genomes such as rat or mouse, for which JASPAR only finds a few dozen TF
at most if the specific genome was specified otherwise.
Be aware, though, that you may want to compare your eGRN across
different TF and TFBS data - we are currently investigating the effect
of using JASPAR
motives and cannot currently comment much on it except
what we sumamrized above: HOCOMOCO v12 produces larger GRNs from our
(limited) test experience. We will, however, update the vignette here
once we have more results to share.
Providing sample metadata is optional, but highly recommended - if
available, the sample metadata is integrated into the PCA plots to
understand where the variation in the data comes from and whether any of
the metadata (e.g., age, sex, sequencing batch) is associated with the
PCs from a PC, indicating a batch effect that needs to be addressed
before running the GRaNIE
pipeline.
The integration of sample metadata can be achieved in the addData()
function (click the link for more information).
Integration of Hi-C data is optional and serves as alternative to identifying peak-gene pairs to test for correlation based on a predefined and fixed neighborhood size (see Methods).
If TAD domains are used, the neighborhood of a peak will be defined by the TAD the peak is located in, and all genes in the same TAD will be tested for correlation. If the TAD is very narrow, this consequently results in fewer genes to be tested as compared to a fixed 250 kb neighborhood (the default), for example, while the opposite is true for a large megabase-long TAD domain.
If Hi-C data are available, the pipeline expects a BED file format with
at least 3 columns: chromosome name
, start
, and end
. An ID
column is optional and assumed to be in the 4th column, all additional
columns are ignored.
For more details, see the R help (?addConnections_peak_gene
) and the
Methods.
It is now also possible to input, in addition to or as replacement of the regular peak-gene links (see here), known links that can either be Capture Hi-C links or more generally known promoter-enhancer links. If provided, the coordinates for both bait (the promoter coordinates) as well as other end (OE) coordinates (the enhancers) are overlapped with the gene promoters and peaks/enhancers as defined in the object, and the corresponding pair is added to the list of pairs to test for correlation, irrespective of their genomic distance. In principle, this even works for links that are defined on different chromosomes.
Optionally, SNP data can also be integrated into our framework via the
function addSNPData()
. The main idea is to annotate each peak with the
information whether it overlaps any of the user-provided SNPs (and how
many), and use this extra annotation later on for creating the
TF-peak-gene eGRN within filterGRNAndConnectGenes()
. For more
information, see here.
Optionally, we now also support identifying SNPS that are in LD with any
of the user-provided SNPs to expand the set of SNPs that are
peak-associated. For more details, see addSNPData()
.
SNP IDs have to be provided with their corresponding rsIDs
, and the
genomic position is retrieved automatically within addSNPData()
by
using biomaRt
.
We also plan to make it possible to manually import TF activity or any other TF-specific data that can be used for generating TF-peaks links, in addition to our default approach of using TF expression. Stay tuned!
For more details, see here.
In this section, we give methodological details and guidelines.
As everywhere in bioinformatics, proper data normalization for both RNA
and open chromatin (peaks) data is very important. Thus, the chosen
normalization method may have a large influence on the resulting eGRN,
so make sure the choice of normalization is reasonable. Data
normalization is performed when the data is imported into GRaNIE
in
the addData()
function and cannot be changed thereafter.
In this section, we give details on the supported data normalization
schemes currently available in the GRaNIE
framework.
We currently support six choices of normalization of either peak or
RNA-Seq data: limma_quantile
, DESeq2_sizeFactors
,
limma_cyclicloess
, limma_scale
, csaw_cyclicLoess_orig
, csaw_TMM
plus none
to skip normalization altogether. The default for RNA-Seq is
a quantile normalization, while for the open chromatin peak data, it is
DESeq2_sizeFactors
(i.e., a “regular” DESeq2
size factor
normalization). Importantly, some normalization methods such as
DESeq2_sizeFactors
strictly require raw data, while others, such as
limma_quantile
, do not necessarily.
For peaks, two additional GC-based normalization schemes are offered:
EDASeq_GC_peaks
and gcqn_peaks
. We refer to the R help for more
details (?addData
).
We recommend raw data as input, although it is also possible to provide
pre-normalized data as input. Pre-normalizing data beforehand may be
useful in case there are known batch effects, for example. Removing the
batch effects from the counts before running GRaNIE
therefore may be a
worthwhile strategy, a procedure that generally produces non-integer
counts. In such a case, the user can then either normalize the data with
another normalization method (one that is suitable also with non-integer
values, see above) or skip additional normalization and use the counts
as they are (choosing none
therefore for the normalization parameter).
To identify statistically significant TF-peak connections we implement a cell-type specific data-driven approach. In brief, we first calculate the Pearson correlation coefficients between the expression level of each TF and the open chromatin signal of each peak across all samples. We then use an empirical FDR procedure to identify statistically significant TF-peak connections. For this, for each TF, we split the peaks into two sets: A foreground set containing the peaks with a putative TFBS and a corresponding background set consisting of peaks without putative TFBS based on the TF-peak binding matrix calculated above. We then discretize the TF-peak correlation r into 40 bins in steps of 0.05 ranging from -1, -0.95, …, 0, …, 1 and calculate a bin-specific FDR value using two different directions. For more details on how we establish TF-peak links, please check the Supplement of the corresponding publication. See References for links.
We employ a shuffling-based approach to calculate background TF-peak links, as described in detail here.
TF-peak connections are found by correlating a suitable measure related to TFs with peak accessibility, and then identifying statistically significant links. By default, TF expression is used as TF measure, but we also implemented an additional measure for establishing these links based on a measure called TF activity (see below).
TFs and therefore also TF motifs can have very different GC preferences.
Some TFs are known to bind to very GC-rich regions such as SP1. Thus,
when calculating the TF-peak link FDRs, comparing a GC-rich or GC-poor
foreground set of peaks with the full background (that is, all other
peaks with no predicted site), the latter of which may exhibit a very
different GC distribution, is potentially biasing the FDR values. We
therefore offer an experimental feature to correct for this effect and
to use a corresponding GC-matching background. The GC correction is
enabled with useGCCorrection = TRUE
in the function
addConnections_TF_peak()
. Note that the GC adjustment will take
additional time and is currently much slower as compared to not
adjusting the background.
Importantly, we note again that this feature is experimental as of now and we are still exploring its effects and plausibility.
When the GC correction is activated, the following is performed:
percBackground_resample
is set to a value > 0 (default
is 75), the background size is fixed to this value (i.e., 75
translates to a background size of 75% of the original size),
and no iterative procedure is run to automatically determine the
background size (see b).percBackground_resample = 0
, an automatic iterative
procedure is run that determines the maximum value of
percBackground_resample
(starting from a value of 100 - the
full size of the background - down to 5 maximum in steps of 5
per iteration) so that all GC bins with a relative frequency of
at least 5% in the foreground can be matched in sufficient
numbers with the corresponding GC bin from the background. GC
bins with a relative frequency of less than 5% are ignored due
to their low relevance. If the background size as identified by
the iterative procedure selects a background that is too small
(< 1000 peaks), the selected background percentage is again
increased in steps of 5% until at least 1000 peaks can be
selected from. This is done for statistical reasons and ensures
a minimum no of points in background, even if this sacrifices
the mimicking of the distributions as described above.percBackground_resample
is set to TRUE
, then the
background distribution will mimic the foreground distribution
perfectly even for GC bins that are very lowly abundant (i.e., those
with a relative frequency of < 0.05 in the foreground). In summary,
for GC bins for which not enough peaks are available in the
background, a resampling scheme is performed that selects the
required number of peaks by sampling with replacement from the
background. If set to FALSE
, however, no resampling is performed
and the number of peaks selected from the background is the maximum
that can be selected. In this case, the relative frequencies in
foreground and background may differ. Note that when
percBackground_resample = 0
, this is only relevant for GC bins
with a relative frequency of < 0.05, while otherwise, resampling
may occur for an GC bin depending on the chosen value of
percBackground_size
Rationale for choosing an appropriate number for
percBackground_resample
Ideally, one would want to select as many peaks in the background as possible to maximize the sample size and therefore randomness for better statistical sampling. However, this often fails because particular GC bins may be very dominant in the foreground for GC-rich TFs (e.g., the GC bin 71-80% has a relative frequency of 25%) while the background contains overall only 5% peaks with this GC bin. Thus, one cannot select 25% of all background peaks (as the relative frequency of the GC bins in the foreground is to be kept in the background) of the same GC bin without resampling. The only alternative is to not use the full background size as reference, but only a particular percentage of it so that the required absolute numbers of peaks to sample from decreases and eventually can be selected for all relevant GC bins. When choosing this option, the background size can be much smaller (e.g., 15% only), therefore also choosing only a relative small absolute number of peaks from the background. However, no resampling is done here, therefore not (by chance) giving particular peaks a higher weight.
You may ask which value to choose for percBackground_size
and whether
to use GC correction at all, however? These are good questions, and
while we cannot give any guaranteed reply, we can give recommendations
and our experiences from a limited number of datasets:
GRaNIE
once with and once without
GC correction to compare the eGRNs. What we sometimes see is that
in the eGRN without a GC correction, the most abundant TFs
actually have a GC-rich motif, indicating that they may be
overrepresented in the eGRN.percBackground_size
, we
recommend leaving it at the default value of 75 first and
potentially adjusting it after looking at the diagnostic plots.
Higher values will result in (1) a higher number of GC bins that
need resampling (and therefore potentially overrepresenting some
peaks or indirectly causing a bias due to the higher percentage of
resampled peaks) while (2) increasing the background size overall,
which is better from a statistical sampling point of view. We think
Judging how well the GC correction worked
We also provide detailed QC plots that summarize both the GC background selection as well as the differences for teh TF-peak connections with and without GC correction. For more details, see here.
Reproducibility of GC correction
Lastly, we need to emphasize that the GC correction, due to the random sampling from the background, is not deterministic and may yield a slightly different number of connections each time it is run. The more bins are resampled (see discussion above), the more differences may arise
As explained above, TF-peak connections are found by correlation TF
expression with peak accessibility. In addition to expression, we
also offer to identify statistically significant TF-peak links based on
TF Activity. The concept of TF Activity is described in more detail in
our diffTF
paper. In short, we define TF motif activity, or TF
activity for short, as the effect of a TF on the state of chromatin as
measured by chromatin accessibility or active chromatin marks (i.e.,
ATAC-seq, DNase sequencing [DNase-seq], or histone H3 lysine 27
acetylation [H3K27ac] ChIP-seq). A TF Activity score is therefore
needed for each TF and each sample.
TF Activity information can either be calculated within the GRaNIE
framework using a simplified and empirical
approach) or it can be
calculated outside of our framework using designated methods and then
imported into our framework. We
now describe these two choices in more detail.
In our GRaNIE
approach, we empirically estimate TF Activity for each
TF with the following approach:
By default, we currently offer the four different types of normalizing the raw data for calculating TF Activity. Options 2 to 4 are described in more detail in the section Data normalization above, while option 1 is currently only available for TF Activity and therefore explained below (this may change in the future):
normOffsets
function of the csaw
package in R as opposed to using one size factor for each sample
only as with the regular size factor normalization in DeSeq
. For
each sample, a LOWESS (Locally Weighted Scatterplot Smoothing)
curve is fitted to the log-counts against the log-average count. The
fitted value for each bin pair is used as the generalized linear
model offset for that sample. The use of the average count provides
more stability than the average log-count when low counts are
present. For more details, see the csaw
package in R
and the
normOffsets
methods therein.DeSeq
Soon, it will also be possible to import TF Activity data into our framework as opposed to calculating it using the procedure as described above. This feature is currently in development and will be available soon.
Once TF Activity data is available, finding TF-peak links and assessing their significance is then done in complete analogy as for the TF expression data - just the input data is different (TF Activity as opposed to TF expression). The so-called connection type - expression or TF Activity, is stored in the GRN object and output tables and therefore allows to tailor and filter the resulting network accordingly. All output PDFs also contain the information whether a TF-peak link has been established via the TF expression or TF Activity.
In the GRaNIE
framework, we construct peak-gene pairs completely
independently and separately from the TF-peak links. For this, the
function addConnections_peak_gene()
is used. In general, we offer two
options to decide which peak-gene pairs to test for correlation:
promoterRange
for details) to select which peak-gene pairs to
test. Thus, for a particular peak, all genes within 250 kb either
up- or downstream of it would be selected and tested for
correlation.knownLinks_useExclusively
that was provided as part of the
function call. If multiple promoter-enhancer peaks overlap with the
provided user-defined links, all unique combinations are taken and
added to the list of peak-gene pairs to test. By default, only the
TSS of genes is checked for overlap, while this can be changed with
the parameter overlapTypeGene
in analogy to how it works
everywhere else.For all approaches, the user can select between two options of how to
exactly calculate the distance between a peak and a gene. That is, which
part of the gene is taken as reference point: the 5’ end (TSS, the
default) or the full gene including all exons and introns. For more
information see the R help (?addConnections_peak_gene
and the
parameter overlapTypeGene
in particular)
Importantly, as for TF-peaks, we also employ a shuffling-based approach to calculate background peak-gene links, as described in detail here.
As GRaNIE
models TF-peak and peak-gene links independently from one
another, the package function filterGRNAndConnectGenes()
can be used
to combine them to a tripartite TF-peak-gene combination. In addition,
filtering both TF-peaks and peak-gene links according to different
criteria is important and can be user-adjusted flexibly. Importantly,
for statistical reasons, multiple testing correction of the peak-gene
raw p-values is only done after connecting them to TFs and filtering
them according to user-defined choices. We offer a wide range of
different adjustment methods, including Independent hypothesis
weighting (IHW), a multiple testing procedure that increases power
compared to the method of Benjamini and Hochberg by assigning
data-driven weights to each hypothesis. For more details, see, for
example,
here.
When SNPs have been integrated (see here), the user can choose whether to 1. additionally add a SNP filter to keep or discard TF-peak-gene connections (e.g. by requiring a minimum number of SNPs that have to overlap a peak) 2. keep peaks (either as part of TF-peak or peak-gene connections) if they overlap with a user-defined minimum number of SNPs (e.g, at least 1).
Thus, integrating SNPs can either reduce the size of eGRN by adding another filtering step or increase it by extending the criteria for which we find a peak relevant enough: In addition to being connected to a TF, a peak may also be kept just because it overlaps with a SNP (independent of any TF connection).
After successful execution, the filtered TF-peak-gene network is stored
in GRN@connections$all.filtered
(both real and background eGRN). It
can be easily retrieved, along with additional options for which output
columns to include, using getGRNConnections()
.
After a filtered eGRN has been stored in the GRN object, the user can run all network related analysis such as enrichment or visualization. See the next sections for details.
Importantly, after successful execution of
filterGRNAndConnectGenes()
and re-population of the filtered eGRN in
GRN@connections$all.filtered[["0"]]
, the eGRN graph as produced by
build_eGRN_graph()
(GRN@graph
) is reset to avoid inconsistencies
with mismatches between the filtered eGRN, the graph representation of
it and associated enrichment results.
In the GRaNIE
framework, in addition to the real connections, we also
calculate background connections in each step based on randomized
data. Calculating a background eGRN follows exactly the same
principles and methods as the real eGRN, without exceptions. It is
only the input data that is shuffled in different ways, for both TF-peak
links as well as peak-gene links, as follows:
TF-peak links: First, the rows of the binary 0/1 TF-peak overlap table are shuffled separately for each sample (i.e., per column) to produce a truly random overlap matrix. In addition, the sample labels for the RNA counts are shuffled. Thus, subsequently, the counts that are correlated between RNA and peak data are not from the same pair of samples in fact.
peak-gene links: After creating the real and true table for the
peak-gene pairs that fulfil the user-specified requirements for
being tested for correlation (e.g., within the specified distance
from one another as defined by the parameter promoterRange
), the
peaks are shuffled. This consequently means that the peak-gene pairs
that are subsequently tested for correlation are (usually) not in
the specified proximity anymore, but instead mostly on other
chromosomes or at a very different location on the same chromosome.
Notably, this preserves the degree distribution for both peaks and
genes: that is, if a gene or peak is overrepresented, this is also
true for the shuffled version. Second, in analogy to the background
TF-peak links, the sample labels for the RNA counts are shuffled
before peak-gene correlations are calculated. The latter can be
controlled via the parameter shuffleRNACounts
in
addConnections_peak_gene()
. Empirically, we found that only
shuffling the peak-gene links is not sufficient to create a truly
random background, which is why the default behavior is to do the
shuffling in addition. Nevertheless, this can be user-controlled.
TF-peak-gene links (eGRN): Combining TF-peak and peak-gene links for the background data then again follows the exact same methods as the real eGRN. However, due to the randomized nature of both the TF-peak and peak-gene links, the background eGRN typically has very few or no statistically significant connections at all.
In summary, we want to emphasize that we currently do not simply permute the real eGRN network. Instead, we re-construct an eGRN using the very same methods but with randomized data, as outlined above. This allows us to judge and compare the connectivity for the real network as compared to a background eGRN.
In the GRN
object (GRN@connections
slot, to be specific), we label
the background data with a 1
, while the original data is labeled as
0
. For example, GRN@connections$TF_peaks[["0"]]
stores the original
connections, while GRN@connections$TF_peaks[["1"]]
stores those
arising from the background.
Relevant output files consequently contain the background
label to
designate that the corresponding background has been used to generate
the plot, while the original data is labeled as original
.
After a filtered set of TF-peak-gene links (i.e., a full eGRN) has
been built with filterGRNAndConnectGenes()
, the actual graph must be
constructed using the function build_eGRN_graph()
based on the
filtered eGRN (in GRN@connections$all.filtered[["0"]]
).
Importantly, two types of networks are constructed:
filterGRNAndConnectGenes()
.allowLoops
).For subsequent functions, either of the two networks is used, and the
user can choose which type of network to use if both are possible (such
as for visualizeGRN
).
For more information, see the R help (?build_eGRN_graph
).
After the graph has been built, three types of enrichment analysis are
available within the GRaNIE
framework:
calculateGeneralEnrichment
)calculateCommunitiesEnrichment()
)calculateTFEnrichment()
)All enrichment functions here use the TF-gene graph in the GRN
object and therefore require the function build_eGRN_graph()
to be run
beforehand. They are, as explained above, based on the filtered network
as obtained by filterGRNAndConnectGenes()
and include only the genes
from the corresponding (sub)network.
All three functions have corresponding plot functions to visualize the enrichment. For more information such as supported ontologies, see the corresponding R help for the functions, all details are explained there.
For an explanation of the output files from the plot functions, see here.
All results are stored in GRN@stats$Enrichment
. For example, the
results from the enrichment results of TF E2F7.0.B
from the example
GRN object (see function loadExampleObject()
) for the GO
biological
process (BP) enrichment can be retrieved from
GRN@stats$Enrichment$byTF$E2F7.0.B$GO_BP$results
, while the parameters
that were used to run the enrichment are correspondingly stored in
GRN@stats$Enrichment$byTF$E2F7.0.B$GO_BP$parameters
.
We also provide the functions visualizeGRN()
to visualize a filtered
eGRN network as well as filterConnectionsForPlotting()
to filter a
eGRN just for visualization purposes. They are both very easy to
invoke, but provide many options to customize the output and the way the
graph is drawn. We recommend to explore the options in the R help
(?visualizeGRN
and filterConnectionsForPlotting()
). Note that we use
the provided network visualization methods from the igraph
package for
visualizeGRN()
.
Visualizing the whole network usually works well for small to medium-sized networks that contain up to a few hundred edges, but may fail for larger eGRNs. Drawing thousands of nodes and edges at once without losing readability is almost impossible. However, here are a few tips on what you can do if your eGRN cannot be drawn or it looks not nice enough:
maxEdgesToPlot
to say 1000 or
even more, and check whether the eGRN can be drawn. It may also
help to increase the dimensions of the PDF (if plotAsPDF = TRUE
)
by increasing pdf_width
and pdf_height
.layout
parameter.
Try all of the supported layouts (see ?visualizeGRN
for a list),
maybe one looks good enough.If none of the plotting-related parameters improve the visualization or
if you want to visually explore the network for particular features of
the network (such as specific TFs, peaks, or genes), you can use
filterConnectionsForPlotting()
to filter a eGRN just for
visualization purposes. This reduces the number of nodes and edges to
plot and gives the user ultimate flexibility of what to visualize. For
example, you can filter the network to just visualize the part of the
network that is connected to a specific set of TFs (i.e, their
regulons).
Alternatively, you can also re-filter the network again more stringently
using filterGRNAndConnectGenes()
, but then all subsequent analyses
have to be rerun as well (e.g., build_eGRN_graph()
and all enrichment
functions). This may reduce the multiple testing burden for peak-gene
p-values and recover connections that may not have passed the filter
beforehand.
One of our latest features we added to the package is to use the
variancePartition
package to quantify and interpret the multiple
sources of biological and technical variation from the features (TFs,
peaks, and genes) in a GRN
object. This can be done via the function
add_featureVariation()
. This can be helpful to identify where exactly
the observed variation actually comes from, which we believe is useful
information when checking eGRNs and particular elements of it. Note
that this is still a work of progress, however, and we did not yet test
this feature for many datasets.
In essence, we run the main function fitExtractVarPartModel
of the
package variancePartition
. This fits a linear (mixed) model to
estimate the contribution of multiple sources of variation while
simultaneously correcting for all other variables for the features in a
GRN
object (TFs, peaks, and genes), given particular metadata. The
function reports the fraction of variance attributable to each metadata
variable. As input, the normalized count matrices are used.
The formula used for model fitting can either be provided manually or
set to auto
. For the latter case, the formula will be build
automatically based on all metadata variables as specified with the
metadata
parameter. By default, numerical variables will be modeled as
fixed effects, while variables that are defined as factors or can be
converted to factors (characters and logical variables) are modeled as
random effects as recommended by the variancePartition
package.
The user can also speciy whether to run the procedure only on the
features from the filtered set of connections (the eGRN, the result of
filterGRNAndConnectGenes()
) or on all filters (that is, for all
features irrespective of whether or not they are part of the final
eGRN).
For more information where the information is stored in the GRN
object
and which QC plots are available, see here.
Here, we describe the various output files that are produced by the pipeline. They are described in the order they are produced in the pipeline.
Our pipeline works and output a so-called GRN
object. The goal is
simple: All information is stored in it, and by keeping everything
within one object and sharing it with others, they have all the
necessary data and information to run the GRaNIE
workflow. A
consistent and simple workflow logic makes it easy and intuitive to work
with it, similar to other packages such as DESeq2
.
Technically speaking, it is an S4 object of class GRN
. As you can see
from the workflow vignette, almost all GRaNIE
functions return a GRN
object (with the notable exception of get
functions - i.e., all functions that start with get). Except
initializeGRN()
, which creates an empty GRN
object, they also all
require a GRN
object as first argument, which makes is easy and
intuitive to work with.
GRN
objects contain all data and results necessary for the various
functions the package provides, and various extractor functions allow to
extract information out of an GRN
object such as the various get
functions. In addition, printing a GRN
object results in an object
summary that is printed (try it out and just type GRN
in the console
if your GRN
object is called like this!). In the future, we aim to add
more convenience functions. If you have specific ideas, please let us
know.
The slots of a GRN
object are described in the R help, see
?GRN-class
for details. While we work on general extractor functions
for the various slots for optimal user experience, we currently suggest
to also access and explore the data directly with the @
operator until
we finalized it. For example, for a GRN
object called GRN
,
GRN@config
accesses the configuration slot that contains all
parameters and object metadata, and slotNames(GRN)
prints all
available slots of the object.
During a typical GRaNIE
analysis, almost all results are automatically
stored in GRN
object that is used as input (exceptions are indicated
in the corresponding sections) and therefore, almost all functions
return also the very same GRN
object, containing in addition the
results of the function.
The only exception is the function getGRNConnections()
that can be
used to extract the resulting eGRN network from a GRN
object and
returns a separate data frame and NOT a GRN
object. For more
information and an explanation about the various columns that are
returned from the function, see the corresponding R help (
?getGRNConnections
).
A summary of a GRN
object can be retrieved with the function
getGRNSummary()
. This returns a named list that summarizes all
relevant aspects for a GRN
object. For more details, see the R help.
All user data is stored in GRN@data
. This slot contains the following
elements:
peaks
: Here, peaks data are stored, and the list contains the
following elements:
counts
stores the data after the user-selected normalization
scheme has been performed (even if set to none
)counts_metadata
stores the peak metadata (peak location, ID,
and whether or not it is marked as filtered with the current
object settings)counts_raw
is only present when the user used
keepOriginalReadCounts = TRUE
when running the addData()
function. It stores the original, user-provided data before any
normalization scheme. For memory reasons and because the raw
counts are never directly used in the GRaNIE
workflow, they
are not stored by default (anymore).RNA
: Here, RNA data are stored, and the list contains the same
elements as the peaks data, with one additional element:
GRN@data$RNA$counts_permuted_index
: A memory-saving
representation of how to perform sample shuffling when permuting
the data, which is the basis for constructing the background
links for both TF-peak and peak-gene connections.metadata
: Data frame with the user-provided sample-specific
metadata and a few additional metadata that were added when running
addData()
.TFs
: Here, TF-specific data are stored, and the list contains the
following elements:
TF_peak_overlap
stores the binary 0/1 TF-peak overlap matrix
that denotes whether a particular TF has a putative/predicted TF
binding site in a particular peak. This information comes from
the provided TF and TFBS data from the functions addTFBS()
and
overlapPeaksAndTFBS()
in particularclassification
stores the TF classification results and is
only present when the function AR_classification_wrapper()
has
been run successfully. For more information, see
hereCurrently, the actual PCA result data are not stored in the GRN
object, but we may change this. We will update the Vignette once this is
done and mention it in the News/Changelog.
The pipeline outputs PCA plots for both peaks and RNA as well as original (i..e, the counts the user provided as input) and normalized (i.e., the counts after normalizing them if any normalization method has been provided) data. Thus, in total, 4 different PCA plots are produced, 2 per data modality (peaks and RNA) and 2 per data type (original and normalized counts).
Each PDF consists of three parts: PCA results based on the top 500, top
1000 and top 5000 features (see page headers, depending on the parameter
topn
in plotPCA_all()
). For each part, different plot types are
available and briefly explained in the following:
The TF-peak connections are stored in GRN@connections$TF_peaks
. This
list is structured as follows:
the first level constitutes, in analogy to many other places within the object, whether the real or background links are stored. For more details, see here.
inside both real and background links (i.e., within
GRN@connections$TF_peaks[["0"]]
and
GRN@connections$TF_peaks[["1"]]
), a list with two elements is
stored:
main
, containing a data frame with the actual TF-peak
connections along with additional propertiesconnectionStats
, containing a summary of the empirical FDR
procedure, stratified by TF, correlation bin, FDR
direction and TF connection type. This slots also contains
details on GC-specific information if activated (see parameter
useGCCorrection = TRUE
in addConnections_TF_peak
)The slot GRN@stats$GC
is also populated:
GRN@stats$GC$peaks
.GRN@stats$GC
are stored:
TFs_GC_correction
: A data frame that summarized the GC
correction on the TF and GC bin levelTFs_GC_correction_plots
: For each TF, the GC plots for each TF
are stored, namely the same two plots that were part of the
output plots (see below). For example, to plot the first plot of
the TF AIRE.0.C
, simply type
GRN@stats$GC$TFs_GC_correction_plots$AIRE.0.C[[1]]
.We provide multiple QC plots for the TF-peak connections as part of the output, as explained in the following.
First, we provide an overview of the total number of TF-peak connections for a range of typically used FDR values for both real and background TF-peak links, stratified by the TF-peak correlation bin. Typically, one sees three things:
In addition, we provide TF-specific diagnostic plots for all TFs that
are included in the GRaNIE
analysis. They summarize the FDR and number
of connections, stratified by the connection type (see
here for methodological details),
the FDR directionality and the TF-peak correlation bin for both real and
background links (see here for methodological
details).
The TF name is indicated in the title, and each page shows two plots. In each plot, the TF-peak FDR for each correlation bin (ranging from -1 to 1 in bins of size 0.05) is shown. The only difference between the two plots is the directionality upon which the FDR is empirically derived from: the upper plot is for the positive and the lower plot for the negative direction (for more details, see the References). Each plot is also colored by the number of distinct TF-peak connections that fall into the particular bin. Mostly, correlation bins with smaller absolute correlation values have higher frequencies (i.e., more TF-peak links fall into them) while correlation bins with more extreme correlation values are less frequent. In the end, for the resulting TF-peak and ultimately the eGRN network, the directionality can be ignored and only those TF-peak links with small FDRs are kept, irrespective of the directionality.
For a particular (set of) TF-peak pair(s), the underlying TF-peak data
and the correlation can be visualized with plotCorrelations()
. For
more information, see the R help for the function.
When GC background adjustment was activated (useGCCorrection = TRUE
in
the function addConnections_TF_peak()
), additional QC plots can
optionally be generated with the function
plotDiagnosticPlots_TFPeaks_GC()
. They summarize all relevant data and
adjustments related to the GC adjustment. Note that this function is
not run automatically as part of the regular TF-peak diagnostic plots at
the moment when plotDiagnosticPlots_TFPeaks()
is executed.
For each TF, 2 plots (1 per page in the corresponding PDF file) are produced. One shows the relative frequency of each of the GC bins, stratified by the peak set origin (foreground, background before and after GC adjustment), the second the absolute frequencies (log10 + 1 transformed). For the first plot, GC bins for which resampling was performed are additionally labeled with a percentage value that indicates which percentage of the required number of peaks was actually available in the background. Thus, a value of 95% indicates that only 5% of peaks where missing to have the required number of peaks available in the background (small number of resampled peaks), while a value of 5% indicates the opposite: a large resampling happened to select the required number of peaks, and a potential sampling bias from the small number of peaks to select from in the first place may have occurred.
Typically, though, resampling happens only for extreme GC bins that have a relatively low relative percentage in the foreground. Ideally, one should see a perfect match between foreground and GC-adjusted background in the plots, both in relative frequencies (page 1) and absolute numbers (page 2) for as many GC bins as possible. The number of GC bins for which resampling occurred should be as small as possible, on the other hand.
The pages after summarize further aspects of the GC adjustment in the form of heatmaps. Each page shows the heatmap for a particular variable, each row in the heatmap is a TF, and columns are either the peak GC bins (stratified into ten bins, see above) or the TF-peak correlation bins (40 bins, in steps of 0.05, see above). The color scale is either blue-white-red (for negative and positive scales), white-red (for only positive ones) or black-white (for binary variables).
Details for peak GC bins from either foreground or background that
are the basis for the bin-specific FDR calculation (labeled GC
adjustment details in the plot title) As the legend indicates,
variables are either foreground or background (n.bg.needed
,
n.bg.needed.ratio
, n.bg.needed.ratio.atLeast1
) associated,
belong to both (n
, peak_width mean
, peak_width sd
, n_rel
)
or they capture the difference of foreground and background
(diff_fg_bg
). For each plot, TFs are clustered according to their
similarity, so the TF ordering differs for each plot.
TF-peak connection summary from the actual connections before and
after the GC background adjustment (labeled GRN connections in the
plot title) as stored in GRN@connections$TF_peaks
. Four plots are
shown, three of which show absolute numbers (n_beforeGC
,
n_afterGC
, n_diff_abs
) and one a relative number (n_diff_rel
,
calculated as n_afterGC
/ n_beforeGC
). For ratios that R deems
as Infinite like this (e.g, 1 connection after GC correction but
no connection without it - 1 / 0
), we set the ratio to a value of
either the maximum of all finite values in the heatmap or a
hard-coded value of 10 - depending on what the maximum of these two
values is. Unlike the heatmaps above, TFs are not clustered here but
appear in the same alphabetical order on every plot.
It is also possible to extract the results from the AR classification
out of a GRN
object. Currently, this can only be done manually,
extractor functions are in the works that will further enhance the user
experience. The results are stored in the slot
GRN@data$TFs$classification[[permIndex]] [[connectionType]]$TF.classification
.
Here, permIndex
refers to the original, non-permuted (0
) or permuted
(1
) data, while connectionType
here is either expression
or
TFActivity
, depending on whether the pipeline has also be run for TF
Activity in addition to expression (see function
addConnections_TF_peak()
). Thus, typically, the results for the
original data are stored in
GRN@data$TFs$classification[["0"]] [["expression"]]$TF.classification
.
If intermediate results from the classification have not been deleted
(the default is to delete them as they can occupy a large amount of
memory in the object, see the parameters of AR_classification_wrapper
for details), they can be accessed similarly within
GRN@data$TFs$classification[[permIndex]] [[connectionType]]
:
TF_cor_median_foreground
, TF_cor_median_background
,
TF_peak_cor_foreground
, TF_peak_cor_background
, and
act.rep.thres.l
.
The pipeline produces 3 different plot types related to the
activator-repressor (AR) classification that can optionally be run as
part of the GRaNIE
workflow. For each of the 3 types, plots are
produced for both the original (labeled as original
) as well as the
permuted (labeled as permuted
) data.
The AR classification is run for the RNA expression data (labeled as
expression
) and can additionally also be run for TF activity data
(labeled as TFActivity
, see the function addConnections_TF_peak()
and its parameter options).
In the following, the 3 plot types are briefly explained:
Summary heatmaps (files starting with
TF_classification_summaryHeatmap
): This is described in detail
in the diffTF
documentation.
We explain and summarize this type of plot in the Workflow Vignette. Please check there for details.
Classification stringency summary plots (files starting with
TF_classification_stringencyThresholds
): This is described in
detail in the diffTF
documentation.
Density plots per TF (files starting with
TF_classification_densityPlotsForegroundBackground
): Density
plots for each TF, with one TF per page. The plot shows the
foreground (red, labeled as Motif
) and background (gray, labeled
as Non-motif
) densities of the correlation coefficient (either
Pearson or Spearman, see x-axis label) from peaks with (foreground)
or without (background) a (predicted) TFBS in the peak for the
particular TF. The numbers in the parenthesis summarize the
underlying total number of peaks.
The peak-gene connections are stored in GRN@connections$peak_genes
.
This list is structured as follows:
GRN@connections$peak_genes[["0"]]
and
GRN@connections$peak_genes[["1"]]
), a data frame with the
peak-gene connections along with additional metadata (e.g.,
peak.ID
, gene.ENSEMBL
, peak_gene.distance
), their connection
origin (e.g TADs
, knownLinks
or neighborhood
), and statistical
properties are stored (peak_gene.r
, peak_gene.p_raw
).The function plotDiagnosticPlots_peakGene()
or indirectly the function
addConnections_peak_gene()
(when plotDiagnosticPlots = TRUE
) plots
various diagnostic plots for the peak-gene links that are imperative for
understanding the biological system and resulting eGRN. In what
follows, we describe them briefly, along with some notes on expected
patterns, implications etc. Note that this section is subject to regular
change.
The main QC summary figure on page 1 is divided into two rows: the upper
row focuses on the peak-gene raw p-value from the correlation tests,
while the lower row focuses on the peak-gene correlation coefficient.
The left side visualizes the data of the corresponding metrics via
density plots, while the right side bins the metrics and visualizes them
with barplots for highlighting differences between real and background
data as well as negatively and positively correlated peak-gene links
(denoted as r+
and r-
, respectively).
We will now discuss and explain both of them in more detail.
First and most importantly, we focus on the distribution of the raw p-values from the correlation tests (i.e., peak accessibility correlated with gene expression) of all peak-gene links. We can investigate these from multiple perspectives.
Let’s start with a density plot. The upper left plot shows the raw p-value density for the particular gene type as indicated in the title (here: all gene types), stratified on two levels:
r-
, gray) and positive (r+
,
black) correlation coefficient, respectivelyGenerally, we also consider r-
connections as background. The
reasoning for this is as follows: Since chromatin accessibility in
regulatory elements is generally associated with active gene regulation
and transcription, we only expect positive correlations for functional
peak-gene links. Notably, the same is still true even for
repressor-bound elements, where binding of most repressors leads to loss
of both accessibility and transcription. Negative correlations have no
clear biological meaning and therefore may indicate remaining batch
effects or random noise.
What we would like to see is:
r+
and r-
connectionsr+
links, a strong peak at small
p-values, and a (marginally) flat distribution for higher ones
(similar to a well-calibrated raw p-value distribution for any
hypothesis test such as differential expression). For r-
links,
the peak at small p-values should be much smaller and ideally the
curve is completely flat. However, from the many datasets we
examined, this is rarely the case.If any of these quality controls fails, it indicates at least one assumption of the framework is violated. This could be due to an issue with data normalization, the underlying biology and what the eGRN represents, or an issue with data size or covariates influencing the results. Often, when the number of samples is small, the QC does not look satisfactory. Thus, it is important to use as many samples as possible, preferably at least 12 or so (see here for more details).
To quantify the difference between r+
and r-
links within each
connection type (background vs real), we can also plot the results in
form of ratios rather than densities for either the r+
/ r-
or the
real / background dimension. These plots are shown in the upper right
panel of the summary plot.
For the r+
/ r-
dimension and background data, the ratios should be
close to 1 across all p-value bins, while for the real data, a high
ratio is typically seen for small p-values. In general, the difference
between the background and real bar should be large for small p-values
and close to 1 for larger ones.
For the real / background dimension, what we want to see is again a high
ratio for small p-value bins for the r+
links, indicating that when
comparing background vs real, there are many more small p-value links in
real data as compared to background. This usually does not hold true for
the r-
links, though, as can be seen also from the plot: the gray bars
are smaller and closer to 1 across the whole binned p-value range.
We also provide a summary of the distribution of the peak-gene correlation coefficient. Typically, one sees that the distribution of the real links is wider (i.e., more links with high absolute values - in either direction) and less pronounced around 0 as compared to the background, an indication of the captured signal in the real data.
Lastly, we also provide additional, advanced QC plots in which we
stratify the aforementioned raw p-value distributions across real data
according to various metadata and additional biological and statistical
features that we calculate during the GRaNIE
workflow.
For each of these variables, two plots are shown based on the real data
only. First, density plots for the raw peak-gene p-value distribution,
stratified by r+
and r-, along with a summary graph showing the ratio
of r+
/ r-
in a barplot. Second, a page with two barplots showing
the same information as before, just visually differently.
The features include, for example, the mean, median, or the coefficient
of variation (ratio of the standard deviation to the mean, a unitless
and standardized measure of dispersion/variability; abbreviated CV in
what follows) for both peak and RNA data. The individual values are all
calculated based on the normalized input data across all samples
(excluding the filtered peaks / genes, in analogy how the eGRN is
constructed) as stored in GRN@annotation
. We also stratify by peak
annotation as identified by the ChIPSeeker
package, the GC content of
the peaks as well as the combined CV of peak and gene:
Lastly, a few density plots are shown that stratify by the peak-gene
distance. In total 10 equidistant bins (which results in the bins
containing a non-equal number of data points but genomic distance is
increased uniformly from bin to bin) are constructed, ranging from 0 to
the user-defined value of the maximum peak-gene distance, which was
defined as promoterRange
in the function addConnections_peak_gene()
.
In addition, a background bin is shown, labeled as “randomized”, from
the background data (see above):
r+
/r-
)We generally (hope to) see that for smaller peak-gene distances (in
particular those that overlap, i.e., the peak and the gene are in direct
vicinity or even overlapping), the difference between r+
and r-
links is bigger as for more distant links. We also include the random
links, for which no difference between r+
and r-
links is visible,
as expected for a well-calibrated background.
For a particular (set of) peak-gene pair(s), the underlying peak-gene
data and the correlation can be visualized with plotCorrelations()
.
For more information, see the R help for the function.
The peak-gene connections are stored in GRN@connections$all.filtered
.
This list is structured as follows:
GRN@connections$all.filtered[["0"]]
and
GRN@connections$all.filtered[["1"]]
), a data frame with the
TF-peak-gene connections along with additional metadata and
statistical properties are stored. This table is not to be used
directly, however; instead, the function getGRNConnections()
has
to be used to retrieve them and add additional information as needed
along with it.No plots are available here, but the eGRN can be visualized with
visualizeGRN()
.
The peak-gene connections are stored in
GRN@connections$$TF_genes.filtered
. This list is structured as
follows:
GRN@connections$$TF_genes.filtered[["0"]]
and
GRN@connections$$TF_genes.filtered[["1"]]
), a data frame with the
TF-gene connections along with additional metadata and statistical
properties are stored. This table is not to be used directly,
however; instead, the function getGRNConnections()
with
include_TF_gene_correlations
has to be used to retrieve them.For a particular (set of) TF-gene pair(s), the underlying TF-gene data
and the correlation can be visualized with plotCorrelations()
. For
more information, see the R help for the function.
In addition, the eGRN can be visualized with visualizeGRN()
.
The results of the function generateStatsSummary()
is stored in
GRN@stats$connections
. This table contains a lot of information,
including but not limited to the various parameters the function
iterated over to generate multiple eGRNs for different parameter
thresholds and combinations such as the number of distinct TFs, peaks,
genes, their connectivity, etc.
We currently offer two different connection summary PDFs, both of which
are produced from the function plot_stats_connectionSummary()
. Both
PDFs shows the number of connections for each node type (TF, peak, and
gene), while in the boxplots, peaks are further differentiated into
TF-peak and peak-gene entities. They also iterate over various
parameters and plot one plot per page and parameter combination, as
indicated in the title:
allowMissingTFs
: TRUE
or FALSE
(i.e., allow TFs to be missing
when summarizing the eGRN network. If set to TRUE
, a valid
connection may consist of just peak to gene, with no TF being
connected to the peak. For more details, see the R
help for
plot_stats_connectionSummary()
)allowMissingGenes
: TRUE
or FALSE
(i.e., allow genes to be
missing when summarizing the eGRN network. If set to TRUE
, a
valid connection may consist of just TF to peak, with no gene being
connected to the peak. For more details, see the R
help for
plot_stats_connectionSummary()
)TF_peak.connectionType
. Either expression
or TFActivity
to
denote which connection type the summary is based on.Both plot types compare the connectivity for the real and background
data (denoted as Network type
in the boxplot PDF), which allows a
better judgment of the connectivity from the real data.
An example page for the summary heatmap looks like this:
Here, two heatmaps are shown, one for real (top) and one for background (bottom) network. Each of them shows for different combinations of TF-peak and peak-gene FDRs (0.01 to 0.2) the number of unique node types for the given FDR combination (here: TFs).
Second, a multi-page summary PDF for the connections in form of a boxplot, as exemplified with the following Figure:
The actual graph structure is stored in GRN@graph
after successful
execution of the function build_eGRN_graph()
.It contains two different
kind of graphs, namely the TF-peak-gene and the TF-gene graph. Each
of them is associated with a list element with the same name, which
stores also the table that was used to create the graph along with the
graph itself and the used parameters for creating it.
The function visualizeGRN()
can be used for visualization of the
eGRN graph.
The identified communities for each node are stored within igraph
objects in GRN@graph$TF_gene$clusterGraph
(retrievable via
GRN@graph$TF_gene$clusterGraph$membership
, for example) and as vertex
attributes in GRN@graph$TF_gene$graph
(retrievable via
igraph::vertex.attributes(GRN@graph$TF_gene$graph)$community
, for
example).
The function visualizeGRN()
can be used for visualization of the
identified communities in the eGRN graph. The functions
plotCommunitiesStats()
and plotCommunitiesEnrichment()
can be used
for plotting and summarizing community properties and enrichments,
respectively.
All enrichment results are stored inside GRN@stats$Enrichment
. This
contains a nested list with the named elements general
, byCommunity
,
and byTF
for general, community-specific and TF-specific enrichments,
respectively.
Inside each of these elements, there is further nesting by the chosen
ontology (e.g, `GO_BP
), and whether results or the parameters of the
enrichment function should be retrieved.
The output is structured as follows: - one page per ontology for the whole eGRN with an enrichment summary that includes the ontology terms (y-axis), the gene ratio (the number of genes found divided by the total number of genes in the foreground) on the x-axis, the raw p-value (color) and the absolute number of genes found (dot size). The plot title gives some additional information such as the chosen ontology, the statistic used, the used background, the number of genes in foreground and background, respectively, and the chosen multiple testing adjustment (if applicable; in some cases, this is ignored due to the way the enrichment calculation works)
The output is structured as follows: - one page per ontology and
community with an enrichment summary (see general enrichment above for
explanations) - two pages in the end that compare the
community-enrichment with the general enrichment. Here, all ontology
terms that are deemed significant according to the chosen significance
threshold p
(see function parameters) in either the general or the
community-specific enrichment are included, along with their
-log10-transformed statistical confidence (either raw or adjusted
p-value). The top part shows the size of the general enrichment and the
community-specific subnetworks (in number of genes they include). In
contrast, the last page shows not all but only the top 10 enriched terms
per column.
The output is structured in analogy to the community enrichment, but instead of communities, the regulon that a particular TF spans is chosen as subnetwork (i.e., all genes the TF is connected to).
The results are added to GRN@stats$variancePartition
as well as within
the various feature-related elements of GRN@annotation
. The former
slot and results can be used for the various diagnostic and plotting
functions from the variancePartition
package.
The easiest way to retrieve and include the results into the final
eGRN is to rerun the function getGRNConnections()
and set
include_variancePartitionResults = TRUE
, as the results are NOT added
automatically to GRN@connections$all.filtered
either.
QC plots for the feature variation results are available via the
function plotDiagnosticPlots_featureVariation()
.
We are currently working on updating details for this section. Stay tuned, coming soon!
SNP data is added via the function addSNPData()
. When
add_SNPs_LD = TRUE
, additional information for SNPs in LD with any of
the user-provided input SNPs are stored within GRN@annotation$SNPs
and
GRN@annotation$SNPs_filtered
for the unfiltered and filtered SNP
table, respectively, and the list of peak-associated SNPs is extended to
SNPs in LD with overlapping SNPs from the user input.
Final peak-SNP overlaps are stored in GRN@data$peaks$counts_metadata
.
For more details, see ?addSNPData
.
No plots are specifically available for the SNp data at the moment.
In this section, we provide a few guidelines and recommendations that may be helpful for your analysis.
In this section, we want explicitly mention the designated scope of the
GRaNIE
package, its limitations and additional / companion packages
that may be used subsequently or beforehand.
This section is still being completed, and we regularly extend it.
The number of samples is very important. Due to the nature of the method
(correlating sample data), a small number of samples is likely to
violate the assumptions of our framework, in particular the peak-gene
links (see here for more details). We generally
recommend at least 10, better 15 samples that are shared between the RNA
and peak data - remember that in the GRaNIE
framework, only samples
for which both modalities are available can be included in the analysis.
Thus, the overlap of RNA and peak data should be at least 10-15 samples,
the more the better in general.
The number of peaks that is provided as input matters greatly for the resulting eGRN and its connectivity. From our experience, this number should be in a reasonable range so that there is enough data and TFBS overlap to build a eGRN, but also not too many so that the whole pipeline runs unnecessarily long. We have good experience with the number of peaks ranging between 50,000 and 200,000, although these are not hard thresholds but rather recommendations.
With respect to the recommended width of the peaks, we usually use peaks
that have a width of a couple of hundred base pairs until a few kb,
while the default is to filter peaks if they are wider than 10,000 bp
(parameter maxSize_peaks
in the function filterData()
). Remember:
peaks are used to overlap them with TFBS, so if a particular peak is too
narrow, the likelihood of not overlapping with any (predicted) TFBS from
any TF increases, and such a peak is subsequently essentially ignored.
The following list is subject to change and provides some rough guidelines for the RNA-Seq data:
filterData()
, see the argument
minNormalizedMeanRNA
for more information. You may want to check
beforehand how many gens have a row mean of >1. This number is
usually in the tens of thousands.TFBS are a crucial input for any GRaNIE
analysis. Our GRaNIE
approach is very agnostic as to how these files are generated - as long
as one BED file per TF is provided with TFBS positions, the TF can be
integrated. As explained above, we usually work with TFBS as predicted
by PWMScan
based on HOCOMOCO
TF motifs, while this is by no means a
requirement of the pipeline. Instead, JASPAR
TFBS or TFBS from any
other database can also be used. The total number of TF and TFBS per TF
seems more relevant here, due to the way we integrate TFBS: We create a
binary 0/1 overlap matrix for each peak and TF, with 0 indicating that
no TFBS for a particular TF overlaps with a particular peak, while 1
indicates that at least 1 TFBS from the TFBS input data does indeed
overlap with the particular peak by at least 1 bp. Thus, having more
TFBS in general also increases the number of 1s and therefore the
foreground of the TF (see the diagnostic plots) while it makes the
foreground also more noisy if the TFBS list contains too many false
positives. As always in biology, this is a trade-off.
This section is partially based on and inspired by this nice summary about association measures in the context of gene expression networks.
We currently offer 3 different choices for how to measure the correlation (generally defined as an association measure that is used to estimate the relationships between two random variables) between different pairs of features such as TF-peaks, TF-genes, and peak-genes. All correlation coefficients take on values between −1 and 1 where negative values indicate an inverse relationship.
The choice depends on the user, and is dependent on whether or not the data contain outlier points that may heavily affect the correlation measures. While we generally advise to remove obvious outlier samples, as identified in a PCA, for example, from the analysis altogether in case they come from a single sample, in other cases this may not be the best approach and a more robust correlation measure such as Spearman or bicor seems more appropriate. We are currently also testing and evaluating which of them are best under different criteria and data typoes. For single-cell derived data, this choice seems to be particularly relevant.
As outlined already above, the peak-gene diagnostic plots are very important to check and to verify that the assumptions of our framework and the underlying data are met. We highly recommend doing so, and we provide detailed recommendations and our experiences here.
We now also provide some preliminary guidelines on whether and how to
use GRaNIE
for single-cell eGRN inference. For more details, see the
designated vignette for single-cell
data.
A nice little helper feature for the GRaNIE
package is that the object
history, including all function calls that have been applied to the
object, including the function parameters and their actual values
(unless a variable has been provided, then only the variable name is
stored and not the actual value), the date, and the actual call are
stored in the actual GRN
object in a nested list inside
GRN@GRN@config$functionParameters
. This looks like the following, for
example, for the function add_TF_gene_correlation
that has been
executed at 2023-01-23 21:50:52 CET:
> GRN@config$functionParameters$add_TF_gene_correlation$`2023-01-23_21:50:52`
$call
add_TF_gene_correlation(GRN = GRN, nCores = 5)
$parameters
$parameters$GRN
GRN
$parameters$nCores
[1] 5
$parameters$corMethod
[1] "pearson"
$parameters$forceRerun
[1] FALSE
This can be very helpful for recapitulating at a later point which
functions have been applied to a GRN
object, and to verify specific
parameter assignments.
In this section, we will give an overview over CPU and memory
requirements when running or planning an analysis with GRaNIE
.
Both CPU time and memory footprint primarily depend on similar factors, namely the number of
While the number of TFs is typically similar across analyses when the
default database is used (HOCOMOCO
+ PWMScan
), the number of peaks
and samples may vary greatly across analyses.
For optimized CPU times, make sure to have the package BiocParallel
installed, which is not automatically installed with GRANIE
, even
though it should be already installed for most Bioconductor
installations as it is one of the core packages.
A typical analysis runs within an hour or two on a standard machine with 2 to 4 cores or so. CPU time non-surprisingly depends primarily on the number of used cores for those functions that support multiple cores and that are time-consuming in nature.
The maximum memory footprint even for a large dataset during a typical
GRaNIE
workflow usually does not exceed a few GB in R
. Instead, it
is usually in the range of 1-2 GB only maximum. Thus, GRaNIE
can
usually be run on a normal laptop without problems.
We recommend not using more than a few 100,000 or so peaks as the memory footprint as well as running time may otherwise increase notably. However, there is no package-defined upper limit, it all depends on the available system memory.
Given that our approach is correlation-based, it seems preferable to maximize the number of samples while retaining only peaks that carry a biological signal that is differing among samples.
The size of the GRN
object itself is typically in the range of a few
hundred MB, but can go up to 1-2 GB for large datasets. If you have
troubles with big datasets, let us know! We always look for ways to
further optimize the memory footprint. However, with the recent
optimizations to store the count matrices as sparse matrix if the
fraction of 0s is too large, the overall memory footprint significantly
reduced.
1. [GRaNIE and GRaNPA: inference and evaluation of enhancer-mediated gene regulatory networks](https://www.embopress.org/doi/full/10.15252/msb.202311627)
2. [Comparison of co-expression measures: mutual information, correlation, and model based indices](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3586947/)