FindIT2 1.11.0
FindIT2
FindIT2
is available on Bioconductor repository for
packages, you can install it by:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("FindIT2")
# Check that you have a valid Bioconductor installation
BiocManager::valid()
citation("FindIT2")
#> To cite FindIT2 in publications use: Guandong Shang(2022)
#>
#> Shang, G.-D., Xu, Z.-G., Wan, M.-C., Wang, F.-X. & Wang, J.-W.
#> FindIT2: an R/Bioconductor package to identify influential
#> transcription factor and targets based on multi-omics data. BMC
#> Genomics 23, 272 (2022)
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {FindIT2: an R/Bioconductor package to identify influential transcription factor and targets based on multi-omics data},
#> author = {Guandong Shang and Zhougeng Xu and Muchun Wan and Fuxiang Wang and Jiawei Wang},
#> journal = {BMC Genomics},
#> year = {2022},
#> volume = {23},
#> number = {S1},
#> pages = {272},
#> url = {https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08506-8},
#> doi = {10.1186/s12864-022-08506-8},
#> }
I benefited a lot from X. Shirley Liu lab’s tools. The integrate_ChIP_RNA
model
borrow the idea from BETA(Wang et al. 2013) and cistromeGO
(Li et al. 2019). The calcRP
model borrow the idea from regulation
potential(Wang et al. 2016). And the FindIT_regionRP
model borrow idea from
lisa(Qin et al. 2020).
I also want to thanks the Allen Lynch in Liu lab, it is please to talk with him
on the github about lisa.
I want to thanks for the memberships in our lab. Many ideas in this packages appeared when I talk with them.
The origin name of FindIT2 is MPMG(Multi Peak Multi Gene) :), which means this package origin purpose is to do mutli-peak-multi-gene annotation. But as the diversity of analysis increase, it gradually extend its function and rename into FindIT2(Find influential TF and Target). But the latter function are still build on the MPMG. Now, it have five module:
And there are also some other useful function like integrate different source information, calculate jaccard similarity for your TF. I will introduce all these function in below manual. And for each part, I will also show the file type you may need prepare, which can help you prepare the correct file format.
The ChIP and ATAC datasets in this vignettes are from (Wang et al. 2020). For the speed, I only use the data in chrosome 5.
# load packages
# If you want to run this manual, please check you have install below packages.
library(FindIT2)
library(TxDb.Athaliana.BioMart.plantsmart28)
library(SummarizedExperiment)
library(dplyr)
library(ggplot2)
# because of the fa I use, I change the seqlevels of Txdb to make the chrosome levels consistent
Txdb <- TxDb.Athaliana.BioMart.plantsmart28
seqlevels(Txdb) <- c(paste0("Chr", 1:5), "M", "C")
all_geneSet <- genes(Txdb)
Because of the storage restriction, the data used here is a small data set, which may not show the deatiled information for result. The user can read the FindIT2 paper and related github repo to see more detailed information and practical example.
The multi-peak multi-gene annotation(mmPeakAnno) is the basic of this package. Most function will use the result of mmPeakAnno. So I explain them first.
The object you may need
FindIT2 provides loadPeakFile
to load peak and store in GRanges
object.
Meanwhile, it also rename one of your GRange column name into feature_id
. The
feature_id
is the most important column in FindIT2, which will be used as a
link to join information from different source. The feature_id
column
represents your peak name, which is often the forth column in bed file and the
first column in GRange metadata column . If you have a GRange without
feature_id
column, you can rename your first metadata column or just add a
column named feature_id
like below
# when you make sure your first metadata column is peak name
colnames(mcols(yourGR))[1] <- "feature_id"
# or you just add a column
yourGR$feature_id <- paste0("peak_", seq_len(length(yourGR)))
you can see the detailed Txdb description in Making and Utilizing TxDb Objects
Here I take the ChIP-Seq data as example.
# load the test ChIP peak bed
ChIP_peak_path <- system.file("extdata", "ChIP.bed.gz", package = "FindIT2")
ChIP_peak_GR <- loadPeakFile(ChIP_peak_path)
# you can see feature_id is in your first column of metadata
ChIP_peak_GR
#> GRanges object with 4288 ranges and 2 metadata columns:
#> seqnames ranges strand | feature_id score
#> <Rle> <IRanges> <Rle> | <character> <numeric>
#> [1] Chr5 6236-6508 * | peak_14125 27
#> [2] Chr5 7627-8237 * | peak_14126 51
#> [3] Chr5 9730-10211 * | peak_14127 32
#> [4] Chr5 12693-12867 * | peak_14128 22
#> [5] Chr5 13168-14770 * | peak_14129 519
#> ... ... ... ... . ... ...
#> [4284] Chr5 26937822-26938526 * | peak_18408 445
#> [4285] Chr5 26939152-26939267 * | peak_18409 21
#> [4286] Chr5 26949581-26950335 * | peak_18410 263
#> [4287] Chr5 26952230-26952558 * | peak_18411 30
#> [4288] Chr5 26968877-26969091 * | peak_18412 26
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
The nearest mode is the most widely used annotation mode. It will link the peak
to its nearest gene, which means every peak only have one related gene. The
disadvantage is sometimes you can not link the correct gene for your peak because
of the complexity in the genomic feature. But this annotation mode is simple
enough and at most time, for most peak, the result is correct.
The skeleton function is distanceToNearest
from GenomicRanges
. I add some
modification so that it will output more human readable result.
mmAnno_nearestgene <- mm_nearestGene(peak_GR = ChIP_peak_GR,
Txdb = Txdb)
#> >> checking seqlevels match... 2024-05-01 17:18:46
#> >> your peak_GR seqlevel:Chr5...
#> >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 M C...
#> Good, your Chrs in peak_GR is all in Txdb
#> ------------
#> annotating Peak using nearest gene mode begins
#> >> preparing gene features information... 2024-05-01 17:18:46
#> >> finding nearest gene and calculating distance... 2024-05-01 17:18:47
#> >> dealing with gene strand ... 2024-05-01 17:18:47
#> >> merging all info together ... 2024-05-01 17:18:47
#> >> done 2024-05-01 17:18:47
mmAnno_nearestgene
#> GRanges object with 4288 ranges and 4 metadata columns:
#> seqnames ranges strand | feature_id score gene_id
#> <Rle> <IRanges> <Rle> | <character> <numeric> <character>
#> [1] Chr5 6236-6508 * | peak_14125 27 AT5G01015
#> [2] Chr5 7627-8237 * | peak_14126 51 AT5G01020
#> [3] Chr5 9730-10211 * | peak_14127 32 AT5G01030
#> [4] Chr5 12693-12867 * | peak_14128 22 AT5G01030
#> [5] Chr5 13168-14770 * | peak_14129 519 AT5G01040
#> ... ... ... ... . ... ... ...
#> [4284] Chr5 26937822-26938526 * | peak_18408 445 AT5G67510
#> [4285] Chr5 26939152-26939267 * | peak_18409 21 AT5G67520
#> [4286] Chr5 26949581-26950335 * | peak_18410 263 AT5G67560
#> [4287] Chr5 26952230-26952558 * | peak_18411 30 AT5G67570
#> [4288] Chr5 26968877-26969091 * | peak_18412 26 AT5G67630
#> distanceToTSS
#> <numeric>
#> [1] -344
#> [2] 206
#> [3] 0
#> [4] 2823
#> [5] 1402
#> ... ...
#> [4284] 0
#> [4285] 0
#> [4286] 0
#> [4287] 0
#> [4288] 302
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
You can also use this the annotation result to check your TF type using
plot_annoDistance
. For most TF, the distance density plot maybe like below,
which means your TF is promoter-type. But for some TF, its density plot will be
different, like GATA4, MYOD1(Li et al. 2019).
plot_annoDistance(mmAnno = mmAnno_nearestgene)
Sometimes, you may interested in the number peaks of each gene linked. Or
reciprocally, how many genes of each peak link. you can use the
getAssocPairNumber
to see the deatailed number or summary.
getAssocPairNumber(mmAnno = mmAnno_nearestgene)
#> # A tibble: 2,757 × 2
#> gene_id peakNumber
#> <chr> <int>
#> 1 AT5G01015 1
#> 2 AT5G01020 1
#> 3 AT5G01030 2
#> 4 AT5G01040 2
#> 5 AT5G01050 2
#> 6 AT5G01070 1
#> 7 AT5G01090 1
#> 8 AT5G01100 1
#> 9 AT5G01170 1
#> 10 AT5G01175 3
#> # ℹ 2,747 more rows
getAssocPairNumber(mmAnno = mmAnno_nearestgene,
output_summary = TRUE)
#> # A tibble: 8 × 2
#> N gene_freq
#> <fct> <int>
#> 1 1 1793
#> 2 2 606
#> 3 3 229
#> 4 4 75
#> 5 5 38
#> 6 6 9
#> 7 7 4
#> 8 >=8 3
# you can see all peak's related gene number is 1 because I use the nearest gene mode
getAssocPairNumber(mmAnno_nearestgene, output_type = "feature_id")
#> # A tibble: 4,288 × 2
#> feature_id geneNumber
#> <chr> <int>
#> 1 peak_14125 1
#> 2 peak_14126 1
#> 3 peak_14127 1
#> 4 peak_14128 1
#> 5 peak_14129 1
#> 6 peak_14130 1
#> 7 peak_14131 1
#> 8 peak_14132 1
#> 9 peak_14133 1
#> 10 peak_14134 1
#> # ℹ 4,278 more rows
getAssocPairNumber(mmAnno = mmAnno_nearestgene,
output_type = "feature_id",
output_summary = TRUE)
#> # A tibble: 1 × 2
#> N feature_freq
#> <fct> <int>
#> 1 1 4288
And if you want the summary plot, you can use the plot_peakGeneAlias_summary
function.
plot_peakGeneAlias_summary(mmAnno_nearestgene)