1 Basics

1.1 Install FindIT2

FindIT2 is available on Bioconductor repository for packages, you can install it by:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("FindIT2")

# Check that you have a valid Bioconductor installation
BiocManager::valid()

1.2 Citation

citation("FindIT2")
#> To cite FindIT2 in publications use: Guandong Shang(2022)
#> 
#>   Shang, G.-D., Xu, Z.-G., Wan, M.-C., Wang, F.-X. & Wang, J.-W.
#>   FindIT2: an R/Bioconductor package to identify influential
#>   transcription factor and targets based on multi-omics data.  BMC
#>   Genomics 23, 272 (2022)
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {FindIT2: an R/Bioconductor package to identify influential transcription factor and targets based on multi-omics data},
#>     author = {Guandong Shang and Zhougeng Xu and Muchun Wan and Fuxiang Wang and Jiawei Wang},
#>     journal = {BMC Genomics},
#>     year = {2022},
#>     volume = {23},
#>     number = {S1},
#>     pages = {272},
#>     url = {https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08506-8},
#>     doi = {10.1186/s12864-022-08506-8},
#>   }

1.3 Acknowledgments

I benefited a lot from X. Shirley Liu lab’s tools. The integrate_ChIP_RNA model borrow the idea from BETA(Wang et al. 2013) and cistromeGO (Li et al. 2019). The calcRP model borrow the idea from regulation potential(Wang et al. 2016). And the FindIT_regionRP model borrow idea from lisa(Qin et al. 2020). I also want to thanks the Allen Lynch in Liu lab, it is please to talk with him on the github about lisa.

I want to thanks for the memberships in our lab. Many ideas in this packages appeared when I talk with them.

1.4 Introduction

The origin name of FindIT2 is MPMG(Multi Peak Multi Gene) :), which means this package origin purpose is to do mutli-peak-multi-gene annotation. But as the diversity of analysis increase, it gradually extend its function and rename into FindIT2(Find influential TF and Target). But the latter function are still build on the MPMG. Now, it have five module:

  • Multi-peak multi-gene annotaion(mmPeakAnno module)
  • Calculate regulation potential(calcRP module)
  • Find influential Target based on ChIP-Seq and RNA-Seq data(Find influential Target module)
  • Find influential TF based on different input(Find influential TF module)
  • Calculate peak-gene or peak-peak correlation(peakGeneCor module)

And there are also some other useful function like integrate different source information, calculate jaccard similarity for your TF. I will introduce all these function in below manual. And for each part, I will also show the file type you may need prepare, which can help you prepare the correct file format.

The ChIP and ATAC datasets in this vignettes are from (Wang et al. 2020). For the speed, I only use the data in chrosome 5.

# load packages
# If you want to run this manual, please check you have install below packages.
library(FindIT2)
library(TxDb.Athaliana.BioMart.plantsmart28)
library(SummarizedExperiment)

library(dplyr)
library(ggplot2)

# because of the fa I use, I change the seqlevels of Txdb to make the chrosome levels consistent
Txdb <- TxDb.Athaliana.BioMart.plantsmart28
seqlevels(Txdb) <- c(paste0("Chr", 1:5), "M", "C")

all_geneSet <- genes(Txdb)

Because of the storage restriction, the data used here is a small data set, which may not show the deatiled information for result. The user can read the FindIT2 paper and related github repo to see more detailed information and practical example.

2 Multi-peak multi-gene annotation

The multi-peak multi-gene annotation(mmPeakAnno) is the basic of this package. Most function will use the result of mmPeakAnno. So I explain them first.

The object you may need

  • peak_GR: a GRange object with a column named feature_id

FindIT2 provides loadPeakFile to load peak and store in GRanges object. Meanwhile, it also rename one of your GRange column name into feature_id. The feature_id is the most important column in FindIT2, which will be used as a link to join information from different source. The feature_id column represents your peak name, which is often the forth column in bed file and the first column in GRange metadata column . If you have a GRange without feature_id column, you can rename your first metadata column or just add a column named feature_id like below

# when you make sure your first metadata column is peak name
colnames(mcols(yourGR))[1] <- "feature_id"

# or you just add a column
yourGR$feature_id <- paste0("peak_", seq_len(length(yourGR)))
  • Txdb: a Txdb object that manages genomic locations and the relationships between genomic feature

you can see the detailed Txdb description in Making and Utilizing TxDb Objects

  • input_genes: gene id

Here I take the ChIP-Seq data as example.

# load the test ChIP peak bed
ChIP_peak_path <- system.file("extdata", "ChIP.bed.gz", package = "FindIT2")
ChIP_peak_GR <- loadPeakFile(ChIP_peak_path)

# you can see feature_id is in your first column of metadata
ChIP_peak_GR
#> GRanges object with 4288 ranges and 2 metadata columns:
#>          seqnames            ranges strand |  feature_id     score
#>             <Rle>         <IRanges>  <Rle> | <character> <numeric>
#>      [1]     Chr5         6236-6508      * |  peak_14125        27
#>      [2]     Chr5         7627-8237      * |  peak_14126        51
#>      [3]     Chr5        9730-10211      * |  peak_14127        32
#>      [4]     Chr5       12693-12867      * |  peak_14128        22
#>      [5]     Chr5       13168-14770      * |  peak_14129       519
#>      ...      ...               ...    ... .         ...       ...
#>   [4284]     Chr5 26937822-26938526      * |  peak_18408       445
#>   [4285]     Chr5 26939152-26939267      * |  peak_18409        21
#>   [4286]     Chr5 26949581-26950335      * |  peak_18410       263
#>   [4287]     Chr5 26952230-26952558      * |  peak_18411        30
#>   [4288]     Chr5 26968877-26969091      * |  peak_18412        26
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

2.1 annotate peak using nearest mode

The nearest mode is the most widely used annotation mode. It will link the peak to its nearest gene, which means every peak only have one related gene. The disadvantage is sometimes you can not link the correct gene for your peak because of the complexity in the genomic feature. But this annotation mode is simple enough and at most time, for most peak, the result is correct. The skeleton function is distanceToNearest from GenomicRanges. I add some modification so that it will output more human readable result.

mmAnno_nearestgene <- mm_nearestGene(peak_GR = ChIP_peak_GR,
                                     Txdb = Txdb)
#> >> checking seqlevels match...       2024-05-01 17:18:46
#> >> your peak_GR seqlevel:Chr5...
#> >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 M C...
#> Good, your Chrs in peak_GR is all in Txdb
#> ------------
#> annotating Peak using nearest gene mode begins
#> >> preparing gene features information...        2024-05-01 17:18:46
#> >> finding nearest gene and calculating distance...      2024-05-01 17:18:47
#> >> dealing with gene strand ...      2024-05-01 17:18:47
#> >> merging all info together ...     2024-05-01 17:18:47
#> >> done      2024-05-01 17:18:47

mmAnno_nearestgene
#> GRanges object with 4288 ranges and 4 metadata columns:
#>          seqnames            ranges strand |  feature_id     score     gene_id
#>             <Rle>         <IRanges>  <Rle> | <character> <numeric> <character>
#>      [1]     Chr5         6236-6508      * |  peak_14125        27   AT5G01015
#>      [2]     Chr5         7627-8237      * |  peak_14126        51   AT5G01020
#>      [3]     Chr5        9730-10211      * |  peak_14127        32   AT5G01030
#>      [4]     Chr5       12693-12867      * |  peak_14128        22   AT5G01030
#>      [5]     Chr5       13168-14770      * |  peak_14129       519   AT5G01040
#>      ...      ...               ...    ... .         ...       ...         ...
#>   [4284]     Chr5 26937822-26938526      * |  peak_18408       445   AT5G67510
#>   [4285]     Chr5 26939152-26939267      * |  peak_18409        21   AT5G67520
#>   [4286]     Chr5 26949581-26950335      * |  peak_18410       263   AT5G67560
#>   [4287]     Chr5 26952230-26952558      * |  peak_18411        30   AT5G67570
#>   [4288]     Chr5 26968877-26969091      * |  peak_18412        26   AT5G67630
#>          distanceToTSS
#>              <numeric>
#>      [1]          -344
#>      [2]           206
#>      [3]             0
#>      [4]          2823
#>      [5]          1402
#>      ...           ...
#>   [4284]             0
#>   [4285]             0
#>   [4286]             0
#>   [4287]             0
#>   [4288]           302
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

You can also use this the annotation result to check your TF type using plot_annoDistance. For most TF, the distance density plot maybe like below, which means your TF is promoter-type. But for some TF, its density plot will be different, like GATA4, MYOD1(Li et al. 2019).

plot_annoDistance(mmAnno = mmAnno_nearestgene)

Sometimes, you may interested in the number peaks of each gene linked. Or reciprocally, how many genes of each peak link. you can use the getAssocPairNumber to see the deatailed number or summary.

getAssocPairNumber(mmAnno = mmAnno_nearestgene)
#> # A tibble: 2,757 × 2
#>    gene_id   peakNumber
#>    <chr>          <int>
#>  1 AT5G01015          1
#>  2 AT5G01020          1
#>  3 AT5G01030          2
#>  4 AT5G01040          2
#>  5 AT5G01050          2
#>  6 AT5G01070          1
#>  7 AT5G01090          1
#>  8 AT5G01100          1
#>  9 AT5G01170          1
#> 10 AT5G01175          3
#> # ℹ 2,747 more rows

getAssocPairNumber(mmAnno = mmAnno_nearestgene,
                   output_summary = TRUE)
#> # A tibble: 8 × 2
#>   N     gene_freq
#>   <fct>     <int>
#> 1 1          1793
#> 2 2           606
#> 3 3           229
#> 4 4            75
#> 5 5            38
#> 6 6             9
#> 7 7             4
#> 8 >=8           3

# you can see all peak's related gene number is 1 because I use the nearest gene mode
getAssocPairNumber(mmAnno_nearestgene, output_type = "feature_id")
#> # A tibble: 4,288 × 2
#>    feature_id geneNumber
#>    <chr>           <int>
#>  1 peak_14125          1
#>  2 peak_14126          1
#>  3 peak_14127          1
#>  4 peak_14128          1
#>  5 peak_14129          1
#>  6 peak_14130          1
#>  7 peak_14131          1
#>  8 peak_14132          1
#>  9 peak_14133          1
#> 10 peak_14134          1
#> # ℹ 4,278 more rows

getAssocPairNumber(mmAnno = mmAnno_nearestgene,
                   output_type = "feature_id",
                   output_summary = TRUE)
#> # A tibble: 1 × 2
#>   N     feature_freq
#>   <fct>        <int>
#> 1 1             4288

And if you want the summary plot, you can use the plot_peakGeneAlias_summary function.

plot_peakGeneAlias_summary(mmAnno_nearestgene)