plyinteractions 1.3.1
The plyinteractions package
introduces tidy methods for the GInteractions
class defined in the
InteractionSet package (Lun, Perry, and Ing-Simmons, 2016).
GInteractions
objectsGInteractions
are objects describing interactions between two parallel
sets of genomic ranges.
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted, lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
library(InteractionSet)
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
anchor1 <- GRanges("chr1:10-20:+")
anchor2 <- GRanges("chr1:50-60:-")
gi <- GInteractions(anchor1, anchor2)
gi
#> GInteractions object with 1 interaction and 0 metadata columns:
#> seqnames1 ranges1 seqnames2 ranges2
#> <Rle> <IRanges> <Rle> <IRanges>
#> [1] chr1 10-20 --- chr1 50-60
#> -------
#> regions: 2 ranges and 0 metadata columns
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
The InteractionSet package provides basic methods to interact with this class, but does not support tidy grammar principles.
The grammar of tidy genomic data transformation defined in
plyranges
and available for GInteractions
currently supports:
dplyr verbs (for GInteractions
and GroupedGInteractions
):
group_by()
;summarize()
;tally
and count()
;mutate()
;filter()
using
<data-masking>
and logical expressions;select()
using <tidy-select>
arguments;slice()
;arrange()
using categorical/numerical
variables.plyranges verbs (for PinnedGInteractions
and AnchoredPinnedGInteractions
):
stretch()
;anchor_*()
functions to control how stretching is performed;shift()
;GRanges
from specific anchors of genomic interactions with flank()
.plyinteractions provides a consistent interface for importing
genomic interactions from pairs
and bedpe
files into GInteractions in R,
following grammar of tidy data manipulation defined in the
tidyverse ecosystem.
Tidy genomic data maniuplation implies that we first parse genomic files stored on disk as tabular data frames.
## This uses an example `bedpe` file provided in the `rtracklayer` package
bedpe_file <- system.file("tests", "test.bedpe", package = "rtracklayer")
bedpe_df <- read.delim(bedpe_file, header = FALSE, sep = '\t')
bedpe_df
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> 1 chr7 118965072 118965122 chr7 118970079 118970129 TUPAC_0001:3:1:0:1452#0 37 + -
#> 2 chr11 46765606 46765656 chr10 46769934 46769984 TUPAC_0001:3:1:0:1472#0 37 + -
#> 3 chr20 54704674 54704724 chr20 54708987 54709037 TUPAC_0001:3:1:1:1833#0 37 + -
Genomic interactions in tabular format are not easy to manipulate.
We can easily parse a data.frame
into a GInteractions
object using
the as_ginteractions()
function.
library(plyinteractions)
#>
#> Attaching package: 'plyinteractions'
#> The following object is masked from 'package:matrixStats':
#>
#> count
#> The following object is masked from 'package:IRanges':
#>
#> slice
#> The following object is masked from 'package:S4Vectors':
#>
#> rename
#> The following object is masked from 'package:stats':
#>
#> filter
gi <- bedpe_df |>
as_ginteractions(
seqnames1 = V1, start1 = V2, end1 = V3, strand1 = V9,
seqnames2 = V4, start2 = V5, end2 = V6, strand2 = V10,
starts.in.df.are.0based = TRUE
)
#> Warning in .merge_two_Seqinfo_objects(x, y): Each of the 2 combined objects has sequence levels not in the other:
#> - in 'x': chr11
#> - in 'y': chr10
#> Make sure to always combine/compare objects based on the same reference
#> genome (use suppressWarnings() to suppress this warning).
gi
#> GInteractions object with 3 interactions and 2 metadata columns:
#> seqnames1 ranges1 strand1 seqnames2 ranges2 strand2 | V7 V8
#> <Rle> <IRanges> <Rle> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr7 118965073-118965122 + --- chr7 118970080-118970129 - | TUPAC_0001:3:1:0:145.. 37
#> [2] chr11 46765607-46765656 + --- chr10 46769935-46769984 - | TUPAC_0001:3:1:0:147.. 37
#> [3] chr20 54704675-54704724 + --- chr20 54708988-54709037 - | TUPAC_0001:3:1:1:183.. 37
#> -------
#> regions: 6 ranges and 0 metadata columns
#> seqinfo: 4 sequences from an unspecified genome; no seqlengths
The columns containing information for core fields of the future GInteractions
object (e.g. seqnames1
, strand2
, …) can be specified using the
key = value
(supported by quasiquotation).
pairs
filesThe pairs
file format has been formally defined by the 4DN consortium.
Its specifications are available here.
It can be imported in R as a data.frame
using read.delim()
or any other
tabular data import functions (including fread()
or vroom()
for larger files),
and readily coerced into GInteractions
with as_ginteractions()
.
## This uses an example `pairs` file provided in this package
pairs_file <- system.file('extdata', 'pairs.gz', package = 'plyinteractions')
pairs_df <- read.delim(pairs_file, sep = "\t", header = FALSE, comment.char = "#")
head(pairs_df)
#> V1 V2 V3 V4 V5 V6 V7 V8 V9
#> 1 NS500150:527:HHGYNBGXF:3:21611:19085:3986 II 105 II 48548 + - 1358 1681
#> 2 NS500150:527:HHGYNBGXF:4:13604:19734:2406 II 113 II 45003 - + 1358 1658
#> 3 NS500150:527:HHGYNBGXF:2:11108:25178:11036 II 119 II 687251 - + 1358 5550
#> 4 NS500150:527:HHGYNBGXF:1:22301:8468:1586 II 160 II 26124 + - 1358 1510
#> 5 NS500150:527:HHGYNBGXF:4:23606:24037:2076 II 169 II 39052 + + 1358 1613
#> 6 NS500150:527:HHGYNBGXF:1:12110:9220:19806 II 177 II 10285 + - 1358 1416
pairs <- as_ginteractions(pairs_df,
seqnames1 = V2, start1 = V3, strand1 = V6,
seqnames2 = V4, start2 = V5, strand2 = V7,
width1 = 1, width2 = 1,
keep.extra.columns = FALSE
)
pairs
#> GInteractions object with 50000 interactions and 0 metadata columns:
#> seqnames1 ranges1 strand1 seqnames2 ranges2 strand2
#> <Rle> <IRanges> <Rle> <Rle> <IRanges> <Rle>
#> [1] II 105 + --- II 48548 -
#> [2] II 113 - --- II 45003 +
#> [3] II 119 - --- II 687251 +
#> [4] II 160 + --- II 26124 -
#> [5] II 169 + --- II 39052 +
#> ... ... ... ... ... ... ... ...
#> [49996] II 86996 + --- II 487591 +
#> [49997] II 86997 + --- II 96353 -
#> [49998] II 86997 + --- II 114748 -
#> [49999] II 86998 + --- II 88955 +
#> [50000] II 86999 + --- II 87513 +
#> -------
#> regions: 62911 ranges and 0 metadata columns
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
GInteractions
to tabular data framesThe reverse operation to coerce GInteractions
back to a tabular form is
also possible using the as_tibble()
function from the tibble package:
tibble::as_tibble(gi)
#> # A tibble: 3 × 12
#> seqnames1 start1 end1 width1 strand1 seqnames2 start2 end2 width2 strand2 V7 V8
#> <fct> <int> <int> <int> <fct> <fct> <int> <int> <int> <fct> <chr> <int>
#> 1 chr7 118965073 118965122 50 + chr7 118970080 118970129 50 - TUPAC_0001:3:1:0:1452#0 37
#> 2 chr11 46765607 46765656 50 + chr10 46769935 46769984 50 - TUPAC_0001:3:1:0:1472#0 37
#> 3 chr20 54704675 54704724 50 + chr20 54708988 54709037 50 - TUPAC_0001:3:1:1:1833#0 37
anchors{12}
A GInteractions
object consists of two sets of anchors:
anchors1
and anchors2
. Each set can be accessed with the corresponding
function (anchors1()
or anchors2()
):