In this vignette, we present GRaNIE (Gene Regulatory Network Inference including Enhancers), a framework to reconstruct predictive enhancer-mediated regulatory network models that are based on integrating of expression and chromatin accessibility/activity pattern across individuals, and provide a comprehensive resource of cell-type specific gene regulatory networks for particular cell types. For an extended biological motivation, see the first section below. In the following, we summarize how to use the GRaNIE
package in a real-world example, illustrate most features and how to work with a GRaNIE
object. Importantly, this vignette will be continuously updated whenever new functionality becomes available or when we receive user feedback.
GRaNIE 1.9.7
GRaNIE
packageGRaNIE
objectGRaNIE
object to disk (optional)eGRN
grapheGRN
Genetic variants associated with diseases often affect non-coding regions, thus likely having a regulatory role. To understand the effects of genetic variants in these regulatory regions, identifying genes that are modulated by specific regulatory elements (REs) is crucial. The effect of gene regulatory elements, such as enhancers, is often cell-type specific, likely because the combinations of transcription factors (TFs) that are regulating a given enhancer have cell-type specific activity. This TF activity can be quantified with existing tools such as diffTF
and captures differences in binding of a TF in open chromatin regions. Collectively, this forms a enhancer-mediated gene regulatory network (eGRN
) with cell-type and data-specific TF-RE and RE-gene links. Here, we reconstruct such a eGRN
using bulk RNA-seq and open chromatin (e.g., using ATAC-seq or ChIP-seq for open chromatin marks) and optionally TF activity data. Our network contains different types of links, connecting TFs to regulatory elements, the latter of which is connected to genes in the vicinity or within the same chromatin domain (TAD). We use a statistical framework to assign empirical FDRs and weights to all links using a background-based approach.
For a more detailed description of the package, its mode of action, guidelines, recommendations, limitations, scope, etc,please see the Package Details Vignette on the GRaNIE website.
Before we start with the package, let’s retrieve some example data! For the purpose of this vignette, the data we will use is taken from here 1, has been minimally processed to meet the requirements of the GRaNIE
package and consists of the following files:
In general, the dataset is from human macrophages (both naive and IFNg primed) of healthy individuals and various stimulations / infections (naive vs primed and infected with SL1344 vs not), with 4 groups in total: control/infected(SL1344) and naive/primed(IFNg). However, here for the example data, all ~30 samples are from IFNg primed and infected cells (as summarized as IFNg_SL1344
in the sample metadata column condition
).
Furthermore, the example dataset is accompanied by the following files:
PWMScan
and the HOCOMOCO
database, see 2 for details.In the following example, you will use the example data to construct a eGRN
from ATAC-seq, RNA-seq data as well transcription factor (TF) data.
First, let’s load the required libraries. The readr
package is already loaded and attached when loading the GRaNIE
package, but we nevertheless load it here explicitly to highlight that we’ll use various `readr`` functions for data import.
For reasons of brevity, we omit the output of this code chunk.
library(readr)
library(GRaNIE)
When installing GRaNIE, all required dependency packages are automatically installed. In addition, GRaNIE needs some additional packages for special functionality, packages that are not strictly necessary for the workflow but which enhance the functionality, may be required depending on certain parameters (such as your genome assembly version), or may be required only when using a particular functionality (such as the WGCNA
package for a more robust correlation method called bicor that is based on medians). The package will automatically check if any of these packages are missing during execution, and inform the user when a package is missing, along with a line to copy for pasting into R for installation.
We are actively working on the package and regularly improve upon features, add features, or change features for increased clarity. This sometimes results in minor changes to the workflow, changed argument names or other small incompatibilities that may result in errors when running a version of the package that differs from the version this vignette has been run for.
Thus, make sure to run a version of GRaNIE
that is compatible with this vignette. If in doubt or when you receive errors, check the R help, which always contains the most up-to-date documentation.
Each of the GRaNIE
functions we mention here in this Vignette comes with sensible default parameters that we found to work well for most of the datasets we tested it with so far. For the purpose of this Vignette, however, and the resulting running times, we here try to provide a good compromise between biological necessity and computational efficacy. However, always check the validity and usefulness of the parameters before starting an analysis to avoid unreasonable results.
Also, always check the Package Details Vignette, all methdological details are in there, and we regularly update it.
GRaNIE
packageTo set up a GRaNIE
analysis, we first need to read in some data into R
.
For a more detailed description and list of the required and optional input data, please see the Package Details Vignette on the GRaNIE website.
Briefly, the following data can be used for the GRaNIE
package:
The following data can be used optionally but are not required:
So, let’s import the enhancer and RNA-seq data as a data frame as well as some sample metadata. This can be done in any way you want as long as you end up with the right format.
# We load the example data directly from the web:
file_peaks = "https://www.embl.de/download/zaugg/GRaNIE/countsATAC.filtered.tsv.gz"
file_RNA = "https://www.embl.de/download/zaugg/GRaNIE/countsRNA.filtered.tsv.gz"
file_sampleMetadata = "https://www.embl.de/download/zaugg/GRaNIE/sampleMetadata.tsv.gz"
countsRNA.df = read_tsv(file_RNA, col_types = cols())
countsPeaks.df = read_tsv(file_peaks, col_types = cols())
sampleMetadata.df = read_tsv(file_sampleMetadata, col_types = cols())
# Let's check how the data looks like
countsRNA.df
countsPeaks.df
sampleMetadata.df
# Save the name of the respective ID columns
idColumn_peaks = "peakID"
idColumn_RNA = "ENSEMBL"
## # A tibble: 18,972 × 30
## ENSEMBL babk_D bima_D cicb_D coyi_D diku_D eipl_D eiwy_D eofe_D fafq_D febc_D
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00… 48933 48737 60581 93101 84980 91536 85728 35483 69674 58890
## 2 ENSG00… 49916 44086 50706 55893 57239 76418 75934 27926 57526 50491
## 3 ENSG00… 281733 211703 269460 239116 284509 389989 351867 164615 257471 304203
## 4 ENSG00… 98943 77503 92402 80927 96690 138149 115875 64368 91627 100039
## 5 ENSG00… 14749 15571 16540 16383 16886 21892 18045 10026 14663 15830
## 6 ENSG00… 64459 63734 71317 69612 72097 100487 78536 38572 65446 76910
## 7 ENSG00… 57449 55736 70798 66334 66424 91801 94729 40413 56916 66382
## 8 ENSG00… 15451 15570 15534 15945 10583 22601 16086 9275 16092 15291
## 9 ENSG00… 18717 18757 20051 18066 19648 28572 25240 11258 17739 20347
## 10 ENSG00… 168054 147822 178164 154220 168837 244731 215862 89368 158845 180734
## # ℹ 18,962 more rows
## # ℹ 19 more variables: fikt_D <dbl>, guss_D <dbl>, hayt_D <dbl>, hehd_D <dbl>,
## # heja_D <dbl>, hiaf_D <dbl>, iill_D <dbl>, kuxp_D <dbl>, nukw_D <dbl>,
## # oapg_D <dbl>, oevr_D <dbl>, pamv_D <dbl>, pelm_D <dbl>, podx_D <dbl>,
## # qolg_D <dbl>, sojd_D <dbl>, vass_D <dbl>, xugn_D <dbl>, zaui_D <dbl>
## # A tibble: 60,698 × 32
## peakID babk_D bima_D cicb_D coyi_D diku_D eipl_D eiwy_D eofe_D fafq_D febc_D
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 chrX:1… 3 7 10 5 4 6 3 18 4 22
## 2 chr15:… 5 28 38 11 20 19 7 53 5 22
## 3 chr12:… 5 14 18 5 3 13 5 15 2 25
## 4 chr1:1… 12 21 36 6 20 29 12 44 2 105
## 5 chr16:… 3 17 16 9 8 16 6 28 3 33
## 6 chr17:… 4 11 6 3 0 3 2 9 1 14
## 7 chr13:… 10 34 44 12 31 29 9 22 5 82
## 8 chr1:2… 21 113 46 28 44 57 47 146 12 91
## 9 chr14:… 5 9 14 6 6 9 8 16 2 26
## 10 chr8:1… 6 4 10 5 8 12 2 5 1 21
## # ℹ 60,688 more rows
## # ℹ 21 more variables: fikt_D <dbl>, guss_D <dbl>, hayt_D <dbl>, hehd_D <dbl>,
## # heja_D <dbl>, hiaf_D <dbl>, iill_D <dbl>, kuxp_D <dbl>, nukw_D <dbl>,
## # oapg_D <dbl>, oevr_D <dbl>, pamv_D <dbl>, pelm_D <dbl>, podx_D <dbl>,
## # qolg_D <dbl>, sojd_D <dbl>, vass_D <dbl>, xugn_D <dbl>, zaui_D <dbl>,
## # uaqe_D <dbl>, qaqx_D <dbl>
## # A tibble: 31 × 16
## sample_id assigned assigned_frac atac_date clone condition diff_start donor
## <chr> <dbl> <dbl> <date> <dbl> <chr> <date> <chr>
## 1 babk_D 5507093 0.211 2015-12-04 2 IFNg_SL13… 2015-10-12 babk
## 2 bima_D 23275756 0.677 2014-12-12 1 IFNg_SL13… 2014-11-07 bima
## 3 cicb_D 19751751 0.580 2015-04-24 3 IFNg_SL13… 2015-03-30 cicb
## 4 coyi_D 6733642 0.312 2015-11-05 3 IFNg_SL13… 2015-09-30 coyi
## 5 diku_D 7010213 0.195 2015-11-13 1 IFNg_SL13… 2015-10-15 diku
## 6 eipl_D 16923025 0.520 2015-08-04 1 IFNg_SL13… 2015-06-30 eipl
## 7 eiwy_D 9807860 0.404 2015-12-02 1 IFNg_SL13… 2015-10-23 eiwy
## 8 eofe_D 25687477 0.646 2014-12-12 1 IFNg_SL13… 2014-11-01 eofe
## 9 fafq_D 4600004 0.415 2015-10-14 1 IFNg_SL13… 2015-09-16 fafq
## 10 febc_D 31712153 0.430 2015-08-04 2 IFNg_SL13… 2015-07-06 febc
## # ℹ 21 more rows
## # ℹ 8 more variables: EB_formation <date>, macrophage_diff_days <dbl>,
## # medium_changes <dbl>, mt_frac <dbl>, percent_duplication <dbl>,
## # received_as <chr>, sex <chr>, short_long_ratio <dbl>
While we recommend raw counts for both enhancers and RNA-Seq as input and offer several normalization choices in the pipeline, it is also possible to provide pre-normalized data. Note that the normalization method may have a large influence on the resulting eGRN
network, so make sure the choice of normalization is reasonable. For more details, see the next sections.
As you can see, both enhancers and RNA-Seq counts must have exactly one ID column, with all other columns being numeric. For enhancers, this column may be called peakID
, for example, but the exact name is not important and can be specified as a parameter later when adding the data to the object. The same applies for the RNA-Seq data, whereas a sensible choice here is ensemblID
, for example.
For the enhancer ID column, the required format is chr:start-end
, with chr
denoting the chromosome, followed by “:”, and then start
, -
, and end
for the enhancer start and end, respectively. As the coordinates for the enhancers are needed in the pipeline, the format must be exactly as stated here.
You may notice that the enhancers and RNA-seq data have different samples being included, and not all are overlapping. This is not a problem and as long as some samples are found in both of them, the GRaNIE
pipeline can work with it. Note that only the shared samples between both data modalities are kept, however, so make sure that the sample names match between them and share as many samples as possible.
GRaNIE
objectWe got all the data in the right format, we can start with our GRaNIE
analysis now! We start by specifying some parameters such as the genome assembly version the data have been produced with, as well as some optional object metadata that helps us to distinguish this GRaNIE
object from others.
genomeAssembly = "hg38" #Either hg19, hg38 or mm10. Both enhancers and RNA data must have the same genome assembly
# Optional and arbitrary list with information and metadata that is stored
# within the GRaNIE object
objectMetadata.l = list(name = paste0("Macrophages_infected_primed"), file_peaks = file_peaks,
file_rna = file_RNA, file_sampleMetadata = file_sampleMetadata, genomeAssembly = genomeAssembly)
dir_output = "."
GRN = initializeGRN(objectMetadata = objectMetadata.l, outputFolder = dir_output,
genomeAssembly = genomeAssembly)
GRN
## INFO [2024-09-26 18:59:29] Empty GRN object created successfully. Type the object name (e.g., GRN) to retrieve summary information about it at any time.
## INFO [2024-09-26 18:59:29] Default output folder: /tmp/RtmpXRZrkb/Rbuild29a9215b996824/GRaNIE/vignettes/
## INFO [2024-09-26 18:59:29] Genome assembly: hg38
## INFO [2024-09-26 18:59:29] Finished successfully. Execution time: 0 secs
## GRN object from package GRaNIE (created with version 1.9.7)
## Data summary:
## Peaks: No peak data found.
## Genes: No RNA-seq data found.
## TFs: No TF data found.
## Parameters:
## Output directory: /tmp/RtmpXRZrkb/Rbuild29a9215b996824/GRaNIE/vignettes/
## Provided metadata:
## name : Macrophages_infected_primed
## file_peaks : https://www.embl.de/download/zaugg/GRaNIE/countsATAC.filtered.tsv.gz
## file_rna : https://www.embl.de/download/zaugg/GRaNIE/countsRNA.filtered.tsv.gz
## file_sampleMetadata : https://www.embl.de/download/zaugg/GRaNIE/sampleMetadata.tsv.gz
## genomeAssembly : hg38
## Connections:
## TF-peak links: none found
## peak-gene links: none found
## TF-peak-gene links (filtered): none found
## Network-related:
## eGRN network: not found
Initializing a GRaNIE
object occurs in the function initializeGRN()
and is trivial: All we need to specify is an output folder (this is where all the pipeline output is automatically being saved unless specified otherwise) and the genome assembly shortcut of the data. We currently support hg19
, hg38
, and mm10
. Please contact us if you need additional genomes. The objectMetadata
argument is recommended but optional and may contain an arbitrarily complex named list that is stored as additional metadata for the GRaNIE
object. Here, we decided to specify a name for the GRaNIE
object as well as the original paths for all 3 input files and the genome assembly.
For more parameter details, see the R help (?initializeGRN
).
At any time point, we can simply “print” a GRaNIE
object by typing its name and a summary of the content is printed to the console (as done above in the last line of the code block).
We are now ready to fill our empty object with data! After preparing the data beforehand, we can now use the data import function addData()
to import both enhancers and RNA-seq data to the GRaNIE
object. In addition to the count tables, we explicitly specify the name of the ID columns. As mentioned before, the sample metadata is optional but recommended if available.
An important consideration is data normalization for RNA and ATAC. We support many different choices of normalization, the selection of which also depends on whether RNA or peaks is considered, and possible choices are: limma_quantile
, DESeq2_sizeFactors
and none
and refer to the R help for more details (?addData
). The default for RNA-Seq is a quantile normalization, while for the open chromatin enhancer data, it is DESeq2_sizeFactors
(i.e., a “regular” DESeq2
size factor normalization). Importantly, DESeq2_sizeFactors
requires raw data, while limma_quantile
does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then topping this up with another normalization method or none
.
GRN = addData(GRN, counts_peaks = countsPeaks.df, normalization_peaks = "DESeq2_sizeFactors",
idColumn_peaks = idColumn_peaks, counts_rna = countsRNA.df, normalization_rna = "limma_quantile",
idColumn_RNA = idColumn_RNA, sampleMetadata = sampleMetadata.df, forceRerun = TRUE)
GRN
Only overlapping samples between the two data modalities are kept in the GRaNIE
object. Here, all 29 samples from the RNA data are kept because they are also found in the peak data, while only 29 out of 31 samples from the peak data are also found in the RNA data, resulting in 29 shared samples overall. The RNA counts are also shuffled, which will be the basis for all analysis and plots in subsequent steps that repeat the analysis for the background eGRN in addition to the real one.
When we print the GRN
object again, we see that the added information from addData
is now also printed in a summarized manner.
The package also provides a history or tracking function: In a GRN
object, all previously used function calls that modified the object are stored for user convenience and reproducibility purposes.
For example, to retrieve the information about how the addData
function was used in the context of the GRN
object we have here, simply type GRN@config$functionParameters$addData
to retrieve a (nested) list with all necessary details.
For more details, see the Package Details.
It is time for our first QC plots using the function plotPCA_all()
! Now that we added peak and RNA data to the object, let’s check with a Principal Component Analysis (PCA) for both peak and RNA-seq data as well as the original input and the normalized data (unless normalization has been set to none, in which case they are identical to the original data) where the variation in the data comes from. If sample metadata has been provided in the addData()
function (something we strongly recommend), they are automatically added to the PCA plots by coloring the PCA results according to the provided metadata, so that potential batch effects can be examined and identified. For more details, see the R help (?plotPCA_all
).
Note that while this step is recommended to do, it is fully optional from a workflow point of view.
GRN = plotPCA_all(GRN, data = c("rna"), topn = 500, type = "normalized", plotAsPDF = FALSE,
pages = c(2, 3, 14), forceRerun = TRUE)
Depending on the parameters, multiple output files (and plots) may be produced, with up to two files for each of the specified data
modalities (that is, RNA-Seq counts, as specified with rna
here, as well as the peak counts, peaks
, not done here for reasons of brevity). For each of them, PCA plots can be produced for both raw
and normalized
data (here: only raw
). With raw
, we here denote the original counts as given as input with the addData()
function, irrespective of whether this was already pre-normalized or not. The topn
argument specifies the number of top variable features to do PCA for - here 500.
There are more plots that are generated, make sure to examine these plots closely! For all details, which plots are produced and further comments on how to understand and interpret them, see the Package Details.
Now it is time to add data for TFs and predicted TF binding sites (TFBS)! Our GRaNIE
package requires pre-computed TFBS that need to be in a specific format. In brief, a 6-column bed file must be present for each TF, with a specific file name that starts with the name of the TF, an arbitrary and optional suffix (here: _TFBS
) and a particular file ending (supported are bed
or bed.gz
; here, we specify the latter). All these files must be located in a particular folder that the addTFBS()
functions then searches in order to identify those files that match the specified patterns. We provide example TFBS for the 3 genome assemblies we support. After setting this up, we are ready to overlap the TFBS and the peaks by calling the function overlapPeaksAndTFBS()
.
For more details how to download the full set of TF and TFBS data, see the Package Details.
For more parameter details, see the R help (?addTFBS
and ?overlapPeaksAndTFBS
).
folder_TFBS_6TFs = "https://www.embl.de/download/zaugg/GRaNIE/TFBS_selected.zip"
# Download the zip of all TFBS files. Takes too long here, not executed
# therefore
download.file(folder_TFBS_6TFs, file.path("TFBS_selected.zip"), quiet = FALSE)
unzip(file.path("TFBS_selected.zip"), overwrite = TRUE)
motifFolder = tools::file_path_as_absolute("TFBS_selected")
GRN = addTFBS(GRN, motifFolder = motifFolder, TFs = "all", filesTFBSPattern = "_TFBS",
fileEnding = ".bed.gz", forceRerun = TRUE)
GRN = overlapPeaksAndTFBS(GRN, nCores = 1, forceRerun = TRUE)
We see from the output (omitted here for brevity) that 6 TFs have been found in the specified input folder, and the number of TFBS that overlap our peaks for each of them. We successfully added our TFs and TFBS to the GRaNIE
object"
Optionally, we can filter both peaks and RNA-Seq data according to various criteria using the function filterData()
.
For the open chromatin peaks, we currently support three filters:
minNormalizedMean_peaks
, default 5)maxSize_peaks
, default: 10000 bp)chrToKeep_peaks
)For RNA-seq, we currently support the analogous filter as for open chromatin for normalized mean counts as explained above (minNormalizedMeanRNA
).
The default values are usually suitable for bulk data and should result in the removal of very few peaks / genes; however, for single-cell data, lowering them may more reasonable. The output will print clearly how many peaks and genes have been filtered, so you can rerun the function with different values if needed.
For more parameter details, see the R help (?filterData
).
# Chromosomes to keep for peaks. This should be a vector of chromosome names
chrToKeep_peaks = c(paste0("chr", 1:22), "chrX")
GRN = filterData(GRN, minNormalizedMean_peaks = 5, minNormalizedMeanRNA = 1, chrToKeep_peaks = chrToKeep_peaks,
maxSize_peaks = 10000, forceRerun = TRUE)
## INFO [2024-09-26 18:59:35] FILTER PEAKS
## INFO [2024-09-26 18:59:35] Number of peaks before filtering : 75000
## INFO [2024-09-26 18:59:35] Filter peaks by CV: Min = 0
## INFO [2024-09-26 18:59:35] Filter peaks by mean: Min = 5
## INFO [2024-09-26 18:59:35] Number of peaks after filtering : 64008
## INFO [2024-09-26 18:59:35] Finished successfully. Execution time: 0 secs
## INFO [2024-09-26 18:59:35] Filter and sort peaks by size and remain only those bigger than 20 and smaller than 10000
## INFO [2024-09-26 18:59:35] Filter and sort peaks and remain only those on the following chromosomes: chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX
## INFO [2024-09-26 18:59:35] Number of peaks before filtering: 75000
## INFO [2024-09-26 18:59:35] Number of peaks after filtering : 75000
## INFO [2024-09-26 18:59:35] Finished successfully. Execution time: 0 secs
## INFO [2024-09-26 18:59:35] Collectively, filter 10992 out of 75000 peaks.
## INFO [2024-09-26 18:59:35] Number of remaining peaks: 64008
## INFO [2024-09-26 18:59:35] FILTER RNA-seq
## INFO [2024-09-26 18:59:35] Number of genes before filtering : 61534
## INFO [2024-09-26 18:59:35] Filter genes by CV: Min = 0
## INFO [2024-09-26 18:59:35] Filter genes by mean: Min = 1
## INFO [2024-09-26 18:59:35] Number of genes after filtering : 18924
## INFO [2024-09-26 18:59:35] Finished successfully. Execution time: 0 secs
## INFO [2024-09-26 18:59:35] Flagged 16129 rows due to filtering criteria
## INFO [2024-09-26 18:59:35] Finished successfully. Execution time: 0.6 secs
We can see from the output that no peaks have been filtered due to their size and almost 11,000 have been filtered due to their small mean read counts, which collectively leaves around 64,000 peaks out of 75,000 originally. For the RNA data, almost half of the data has been filtered (16,211 out of around 35,000 genes).
We now have all necessary data in the object to start constructing our network. As explained elsewhere, we currently support two types of links for our GRaNIE
approach:
Let’s start with TF-enhancer links! For this, we employ the function addConnections_TF_peak()
. By default, we use Pearson to calculate the correlations between TF expression and enhancer accessibility, but Spearman may sometimes be a better alternative, especially if the diagnostic plots show that the background is not looking as expected.
In addition to creating TF-enhancer links based on TF expression, we can also correlate enhancer accessibility with other measures. We call this the connection type, and expression
is the default one in our framework. However, we implemented a flexible way of allowing also additional or other connection types. Briefly, this works as follows: Additional data has to be imported beforehand with a particular name (the name of the connection type). For example, measures that are related to so-called TF activity can be used in addition or as a replacement of TF expression. For each connection type that we want to include, we simply add it to the parameter connectionTypes
along with the binary vector removeNegativeCorrelation
that specifies whether or not negatively correlated pairs should be removed or not. For expression, the default is to not remove them, while removal may be more reasonable for measures related to TF activity (see here for more details).
Lastly, we offer a so called GC-correction that uses a GC-matching background to compare it with the foreground instead of using the full background as comparison. For more details, see here. We are still investigating the plausibility and effects of this and therefore mark this feature as experimental as of now.
Note that the TF-enhancer links are constructed for both the original data (in the corresponding output plots that are produced, this is labeled as original
) and the (shuffled) background (background
). For more methodological details, see the Package Details Vignette. For more parameter options and details, see the R help (?addConnections_TF_peak
).
GRN = addConnections_TF_peak(GRN, plotDiagnosticPlots = FALSE, connectionTypes = c("expression"),
corMethod = "pearson", forceRerun = TRUE)
## INFO [2024-09-26 18:59:35]
## real data
##
## INFO [2024-09-26 18:59:38] Calculate TF-peak links for connection type expression
## INFO [2024-09-26 18:59:38] Correlate expression and peak counts
## INFO [2024-09-26 18:59:38] Retain 59 unique genes from TF/gene data out of 18904 (filter 18845 non-TF genes and 0 TF genes with 0 counts throughout).
## INFO [2024-09-26 18:59:38] Correlate TF/gene data for 59 unique Ensembl IDs (TFs) and peak counts for 64008 peaks.
## INFO [2024-09-26 18:59:38] Note: For subsequent steps, the same gene may be associated with multiple TF, depending on the translation table.
## INFO [2024-09-26 18:59:40] Finished successfully. Execution time: 1.5 secs
## INFO [2024-09-26 18:59:40] Run FDR calculations for 65 TFs for which TFBS predictions and expression data for the corresponding gene are available.
## INFO [2024-09-26 18:59:40] Skip the following 10 TF due to missing data or because they are marked as filtered: ATOH1.0.B,CDX1.0.C,CTCFL.0.A,DLX6.0.D,DMRT1.0.D,EHF.0.B,ESR2.0.A,ESR2.1.A,FOXA3.0.B,FOXB1.0.D
## INFO [2024-09-26 18:59:40] Compute FDR for each TF. This may take a while...
## INFO [2024-09-26 18:59:40] TF AIRE.0.C
## INFO [2024-09-26 18:59:40] TF ANDR.0.A
## INFO [2024-09-26 18:59:40] TF ANDR.1.A
## INFO [2024-09-26 18:59:40] TF ANDR.2.A
## INFO [2024-09-26 18:59:40] TF AP2A.0.A
## INFO [2024-09-26 18:59:40] TF AP2B.0.B
## INFO [2024-09-26 18:59:40] TF ARI3A.0.D
## INFO [2024-09-26 18:59:41] TF ARNT2.0.D
## INFO [2024-09-26 18:59:41] TF ASCL1.0.A
## INFO [2024-09-26 18:59:41] TF ASCL2.0.D
## INFO [2024-09-26 18:59:41] TF ATF2.1.B
## INFO [2024-09-26 18:59:41] TF BACH1.0.A
## INFO [2024-09-26 18:59:41] TF BATF3.0.B
## INFO [2024-09-26 18:59:41] TF BC11A.0.A
## INFO [2024-09-26 18:59:41] TF BCL6.0.A
## INFO [2024-09-26 18:59:41] TF BHA15.0.B
## INFO [2024-09-26 18:59:42] TF BHE41.0.D
## INFO [2024-09-26 18:59:42] TF BPTF.0.D
## INFO [2024-09-26 18:59:42] TF BRAC.0.A
## INFO [2024-09-26 18:59:42] TF BRCA1.0.D
## INFO [2024-09-26 18:59:42] TF CDX2.0.A
## INFO [2024-09-26 18:59:42] TF CEBPA.0.A
## INFO [2024-09-26 18:59:42] TF CENPB.0.D
## INFO [2024-09-26 18:59:42] TF CLOCK.0.C
## INFO [2024-09-26 18:59:42] TF COE1.0.A
## INFO [2024-09-26 18:59:42] TF COT1.0.C
## INFO [2024-09-26 18:59:42] TF COT1.1.C
## INFO [2024-09-26 18:59:43] TF COT2.0.A
## INFO [2024-09-26 18:59:43] TF COT2.1.A
## INFO [2024-09-26 18:59:43] TF CTCF.0.A
## INFO [2024-09-26 18:59:43] TF CUX2.0.D
## INFO [2024-09-26 18:59:43] TF DLX1.0.D
## INFO [2024-09-26 18:59:43] TF DLX2.0.D
## INFO [2024-09-26 18:59:43] TF DLX4.0.D
## INFO [2024-09-26 18:59:43] TF DMBX1.0.D
## INFO [2024-09-26 18:59:43] TF E2F1.0.A
## INFO [2024-09-26 18:59:43] TF E2F3.0.A
## INFO [2024-09-26 18:59:43] TF E2F4.0.A
## INFO [2024-09-26 18:59:43] TF E2F6.0.A
## INFO [2024-09-26 18:59:44] TF E2F7.0.B
## INFO [2024-09-26 18:59:44] TF EGR1.0.A
## INFO [2024-09-26 18:59:44] TF EGR2.0.A
## INFO [2024-09-26 18:59:44] TF EGR2.1.A
## INFO [2024-09-26 18:59:44] TF ELF1.0.A
## INFO [2024-09-26 18:59:44] TF ELF3.0.A
## INFO [2024-09-26 18:59:45] TF ELK3.0.D
## INFO [2024-09-26 18:59:45] TF ERR1.0.A
## INFO [2024-09-26 18:59:45] TF ESR1.0.A
## INFO [2024-09-26 18:59:45] TF ESR1.1.A
## INFO [2024-09-26 18:59:45] TF ETS1.0.A
## INFO [2024-09-26 18:59:45] TF ETS2.0.B
## INFO [2024-09-26 18:59:45] TF ETV2.0.B
## INFO [2024-09-26 18:59:46] TF ETV4.0.B
## INFO [2024-09-26 18:59:46] TF ETV5.0.C
## INFO [2024-09-26 18:59:46] TF EVI1.0.B
## INFO [2024-09-26 18:59:46] TF FEZF1.0.C
## INFO [2024-09-26 18:59:46] TF FLI1.1.A
## INFO [2024-09-26 18:59:47] TF FOXC2.0.D
## INFO [2024-09-26 18:59:47] TF FOXD2.0.D
## INFO [2024-09-26 18:59:47] TF FOXD3.0.D
## INFO [2024-09-26 18:59:47] TF FOXF1.0.D
## INFO [2024-09-26 18:59:47] TF FOXO4.0.C
## INFO [2024-09-26 18:59:47] TF FOXP1.0.A
## INFO [2024-09-26 18:59:47] TF FOXP3.0.D
## INFO [2024-09-26 18:59:47] TF FUBP1.0.D
## INFO [2024-09-26 18:59:47] Finished successfully. Execution time: 9.2 secs
## INFO [2024-09-26 18:59:47] Finished successfully. Execution time: 12.3 secs
## INFO [2024-09-26 18:59:47] Finished. Stored 23096 connections with an FDR <= 0.3
## INFO [2024-09-26 18:59:47]
## permuted data
##
## INFO [2024-09-26 18:59:47] Shuffling rows per column
## INFO [2024-09-26 18:59:48] Finished successfully. Execution time: 0.7 secs
## INFO [2024-09-26 18:59:48] Calculate TF-peak links for connection type expression
## INFO [2024-09-26 18:59:48] Correlate expression and peak counts
## INFO [2024-09-26 18:59:48] Retain 59 unique genes from TF/gene data out of 18904 (filter 18845 non-TF genes and 0 TF genes with 0 counts throughout).
## INFO [2024-09-26 18:59:48] Correlate TF/gene data for 59 unique Ensembl IDs (TFs) and peak counts for 64008 peaks.
## INFO [2024-09-26 18:59:48] Note: For subsequent steps, the same gene may be associated with multiple TF, depending on the translation table.
## INFO [2024-09-26 18:59:49] Finished successfully. Execution time: 0.5 secs
## INFO [2024-09-26 18:59:49] Run FDR calculations for 65 TFs for which TFBS predictions and expression data for the corresponding gene are available.
## INFO [2024-09-26 18:59:49] Skip the following 10 TF due to missing data or because they are marked as filtered: ATOH1.0.B,CDX1.0.C,CTCFL.0.A,DLX6.0.D,DMRT1.0.D,EHF.0.B,ESR2.0.A,ESR2.1.A,FOXA3.0.B,FOXB1.0.D
## INFO [2024-09-26 18:59:49] Compute FDR for each TF. This may take a while...
## INFO [2024-09-26 18:59:49] TF AIRE.0.C
## INFO [2024-09-26 18:59:49] TF ANDR.0.A
## INFO [2024-09-26 18:59:49] TF ANDR.1.A
## INFO [2024-09-26 18:59:49] TF ANDR.2.A
## INFO [2024-09-26 18:59:49] TF AP2A.0.A
## INFO [2024-09-26 18:59:49] TF AP2B.0.B
## INFO [2024-09-26 18:59:49] TF ARI3A.0.D
## INFO [2024-09-26 18:59:50] TF ARNT2.0.D
## INFO [2024-09-26 18:59:50] TF ASCL1.0.A
## INFO [2024-09-26 18:59:50] TF ASCL2.0.D
## INFO [2024-09-26 18:59:50] TF ATF2.1.B
## INFO [2024-09-26 18:59:50] TF BACH1.0.A
## INFO [2024-09-26 18:59:50] TF BATF3.0.B
## INFO [2024-09-26 18:59:50] TF BC11A.0.A
## INFO [2024-09-26 18:59:50] TF BCL6.0.A
## INFO [2024-09-26 18:59:50] TF BHA15.0.B
## INFO [2024-09-26 18:59:50] TF BHE41.0.D
## INFO [2024-09-26 18:59:50] TF BPTF.0.D
## INFO [2024-09-26 18:59:50] TF BRAC.0.A
## INFO [2024-09-26 18:59:50] TF BRCA1.0.D
## INFO [2024-09-26 18:59:51] TF CDX2.0.A
## INFO [2024-09-26 18:59:51] TF CEBPA.0.A
## INFO [2024-09-26 18:59:51] TF CENPB.0.D
## INFO [2024-09-26 18:59:51] TF CLOCK.0.C
## INFO [2024-09-26 18:59:51] TF COE1.0.A
## INFO [2024-09-26 18:59:51] TF COT1.0.C
## INFO [2024-09-26 18:59:51] TF COT1.1.C
## INFO [2024-09-26 18:59:52] TF COT2.0.A
## INFO [2024-09-26 18:59:52] TF COT2.1.A
## INFO [2024-09-26 18:59:52] TF CTCF.0.A
## INFO [2024-09-26 18:59:52] TF CUX2.0.D
## INFO [2024-09-26 18:59:52] TF DLX1.0.D
## INFO [2024-09-26 18:59:52] TF DLX2.0.D
## INFO [2024-09-26 18:59:52] TF DLX4.0.D
## INFO [2024-09-26 18:59:52] TF DMBX1.0.D
## INFO [2024-09-26 18:59:52] TF E2F1.0.A
## INFO [2024-09-26 18:59:53] TF E2F3.0.A
## INFO [2024-09-26 18:59:53] TF E2F4.0.A
## INFO [2024-09-26 18:59:53] TF E2F6.0.A
## INFO [2024-09-26 18:59:53] TF E2F7.0.B
## INFO [2024-09-26 18:59:53] TF EGR1.0.A
## INFO [2024-09-26 18:59:53] TF EGR2.0.A
## INFO [2024-09-26 18:59:53] TF EGR2.1.A
## INFO [2024-09-26 18:59:54] TF ELF1.0.A
## INFO [2024-09-26 18:59:54] TF ELF3.0.A
## INFO [2024-09-26 18:59:54] TF ELK3.0.D
## INFO [2024-09-26 18:59:54] TF ERR1.0.A
## INFO [2024-09-26 18:59:54] TF ESR1.0.A
## INFO [2024-09-26 18:59:55] TF ESR1.1.A
## INFO [2024-09-26 18:59:55] TF ETS1.0.A
## INFO [2024-09-26 18:59:55] TF ETS2.0.B
## INFO [2024-09-26 18:59:55] TF ETV2.0.B
## INFO [2024-09-26 18:59:55] TF ETV4.0.B
## INFO [2024-09-26 18:59:55] TF ETV5.0.C
## INFO [2024-09-26 18:59:55] TF EVI1.0.B
## INFO [2024-09-26 18:59:55] TF FEZF1.0.C
## INFO [2024-09-26 18:59:56] TF FLI1.1.A
## INFO [2024-09-26 18:59:56] TF FOXC2.0.D
## INFO [2024-09-26 18:59:56] TF FOXD2.0.D
## INFO [2024-09-26 18:59:56] TF FOXD3.0.D
## INFO [2024-09-26 18:59:56] TF FOXF1.0.D
## INFO [2024-09-26 18:59:56] TF FOXO4.0.C
## INFO [2024-09-26 18:59:56] TF FOXP1.0.A
## INFO [2024-09-26 18:59:56] TF FOXP3.0.D
## INFO [2024-09-26 18:59:56] TF FUBP1.0.D
## INFO [2024-09-26 18:59:56] Finished successfully. Execution time: 8.1 secs
## INFO [2024-09-26 18:59:56] Finished successfully. Execution time: 8.9 secs
## INFO [2024-09-26 18:59:56] Finished. Stored 59 connections with an FDR <= 0.3
## INFO [2024-09-26 18:59:56] Finished successfully. Execution time: 21.3 secs
From the output, we see that all of the 6 TFs also have RNA-seq data available and consequently will be included and correlated with the peak accessibility.
After adding the TF-enhancer links to our GRaNIE
object, let’s look at some diagnostic plots. Depending on the user parameters, the plots are either directly plotted to the currently active graphics device or to PDF files as specified in the object or via the function parameters. If plotted to a PDF, within the specified or default output folder (when initializing the GRaNIE
object) should contain two new files that are named TF_peak.fdrCurves_original.pdf
and TF_peak.fdrCurves_background.pdf
, for example.
For real data and all TFs, this function may run a while, and each time-consuming step has a built-in progress bar for the plot-related parts so the remaining time can be estimated.
For reasons of brevity and organization, we fully describe their interpretation and meaning in detail in the Package Details Vignette.
In summary, we provide an overview of the total number of TF-peak connections for a range of typically used FDR values for both real and background TF-peak links, stratified by the TF-peak correlation bin
GRN = plotDiagnosticPlots_TFPeaks(GRN, dataType = c("real"), plotAsPDF = FALSE, pages = c(1))