Contents

Package: metagene2
Modified: April 2nd, 2019
Compiled: Sat Jul 13 01:35:34 2019
License: Artistic-2.0

1 Introduction

This package produces metagene plots, and is the successor to the metagene package. Users of metagene can find a list of differences between metagene2 and metagene in the Differences with metagene section of this vignette.

Metagene plots aggregate coverages from multiple sources (bam files) over multiple regions (genes, cofactor binding sites, etc.) to provide profiles of average coverage. They are useful for many different purposes, such as comparing the binding profiles of DNA-interacting proteins at selected groups of features. In a typical analysis, these features will be the transcription start sites (TSS) of genes, transcription factor binding sites, or enhancer regions. Multiple combinations of groups of features and/or groups of bam files can be compared in a single analysis. The metagene2 package uses bootstrap analysis to provide an estimation of the mean enrichment and a confidence interval for each group of samples.

This vignette will introduce the main features of the metagene2 package. You can load the metagene2 package by calling library(metagene2):

library(metagene2)

2 Creating a metagene object

metagene2 objects are used to perform all of the analysis steps necessary to produce metagene plots. Calling metagene2$new creates a metagene2 object and requires only two mandatory parameters: bam_files, which is the list of bam files from which coverages should be extracted, and regions, which is the list of regions over which said coverages are computed. We also recommend using the optional assay parameter, which can be one of 'chipseq' or 'rnaseq', and will automatically set other optional parameters to convenient defaults. We discuss each of these arguments below.

# metagene objects are created by calling metagene2$new and providing
# regions and bam files:
mg <- metagene2$new(regions = get_demo_regions(), 
                   bam_files = get_demo_bam_files(), 
                   assay='chipseq')

# We can then plot coverage over those regions across all bam files.
mg$produce_metagene(title = "Demo metagene plot")

2.1 Specifying alignment files (BAM files)

There is no hard limit on the number of BAM files that can be included in an analysis. However, loading a large number of bam files might also require large amounts of memory. The provided bam files must be indexed: a file named file.bam, must have an accompanying file.bam.bai or file.bai in its directory.

The paths (relative or absolute) to the BAM files must be provided in a vector. If the vector is named, then those names will be used to refer to the bam files in subsequent steps. Otherwise, metagene2 will attempt to generate appropriate names.

# We create a vector with paths to the bam files of interest.
bam_files <- get_demo_bam_files()
basename(bam_files)
## [1] "align1_rep1.bam" "align1_rep2.bam" "align2_rep1.bam" "align2_rep2.bam"
## [5] "ctrl.bam"

Each bam file must have a corresponding index file:

# List .bai matching our specified bam files.
basename(Sys.glob(gsub(".bam", ".ba*", bam_files)))
##  [1] "align1_rep1.bam"     "align1_rep1.bam.bai" "align1_rep2.bam"    
##  [4] "align1_rep2.bam.bai" "align2_rep1.bam"     "align2_rep1.bam.bai"
##  [7] "align2_rep2.bam"     "align2_rep2.bam.bai" "ctrl.bam"           
## [10] "ctrl.bam.bai"

If no names were provided for the bam files, metagene automatically generates some:

mg <- metagene2$new(regions = get_demo_regions(), bam_files = bam_files)
names(mg$get_params()[["bam_files"]])
## [1] "align1_rep1" "align1_rep2" "align2_rep1" "align2_rep2" "ctrl"

We also could have explicitly named our bam files.

names(bam_files) = c("a1_1", "a1_2", "a2_1", "a2_2", "ctrl")
mg <- metagene2$new(regions = get_demo_regions(), bam_files = bam_files)
names(mg$get_params()[["bam_files"]])
## [1] "a1_1" "a1_2" "a2_1" "a2_2" "ctrl"

2.2 Specifying genomic regions

The regions for the metagene analysis can be provided in one of three different formats:

  • A character vector, containing the paths to bed, narrowPeak, broadPeak or gtf files describing the regions to be used.
  • A GRanges or GRangesList object defining a set of contiguous regions.
  • A GRangesList where each element defines a set of regions to be stitched together to be considered as a single logical region.

2.2.1 Defining regions using BED, narrowPeak, broadPeak and GTF files

metagene2 can automatically import your regions of interest if they are already defined in a file with one of the following formats:

A file’s extension will usually reflect the format it is stored in.

regions <- get_demo_region_filenames()
regions
## [1] "/tmp/Rtmpp75B4J/Rinst6add59aa8a/metagene2/extdata/list1.bed"
## [2] "/tmp/Rtmpp75B4J/Rinst6add59aa8a/metagene2/extdata/list2.bed"

By providing those two file names to metagene2$new, they will be loaded and converted into appropriate objects:

mg <- metagene2$new(regions = get_demo_region_filenames(),
                   bam_files = get_demo_bam_files())
mg$get_regions()
## GRanges object with 100 ranges and 3 metadata columns:
##         seqnames            ranges strand |        name     score
##            <Rle>         <IRanges>  <Rle> | <character> <numeric>
##     [1]     chr1 16103663-16105662      + |     list1_1         0
##     [2]     chr1 23921318-23923317      + |     list1_2         0
##     [3]     chr1 34848977-34850976      + |     list1_3         0
##     [4]     chr1 36368182-36370181      + |     list1_4         0
##     [5]     chr1 36690488-36692487      + |     list1_5         0
##     ...      ...               ...    ... .         ...       ...
##    [96]     chr1 81075951-81077950      - |    list2_46         0
##    [97]     chr1 85108854-85110853      - |    list2_47         0
##    [98]     chr1 85960056-85962055      - |    list2_48         0
##    [99]     chr1 86110971-86112970      - |    list2_49         0
##   [100]     chr1 87155522-87157521      - |    list2_50         0
##         region_name
##         <character>
##     [1]       list1
##     [2]       list1
##     [3]       list1
##     [4]       list1
##     [5]       list1
##     ...         ...
##    [96]       list2
##    [97]       list2
##    [98]       list2
##    [99]       list2
##   [100]       list2
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

2.2.2 Defining contiguous regions using GRanges or GRangesList objects

As an alternative to a list of BED files, GRanges objects can be used to define contiguous regions of interest. Each range defined within the GRanges object is treated separately from the others. GRangesList objects are also accepted, but they are automatically coerced into GRanges objects, and a column named region_name bearing the name of the list elements is added to the coerced GRanges. Here is an example of valid regions provided as a GRangesList:

regions <- get_demo_regions()
regions
## GRangesList object of length 2:
## $list1
## GRanges object with 50 ranges and 2 metadata columns:
##        seqnames              ranges strand |        name     score
##           <Rle>           <IRanges>  <Rle> | <character> <numeric>
##    [1]     chr1   16103663-16105662      + |     list1_1         0
##    [2]     chr1   23921318-23923317      + |     list1_2         0
##    [3]     chr1   34848977-34850976      + |     list1_3         0
##    [4]     chr1   36368182-36370181      + |     list1_4         0
##    [5]     chr1   36690488-36692487      + |     list1_5         0
##    ...      ...                 ...    ... .         ...       ...
##   [46]     chr1 172081530-172083529      + |    list1_46         0
##   [47]     chr1 172081796-172083795      + |    list1_47         0
##   [48]     chr1 172147016-172149015      + |    list1_48         0
##   [49]     chr1 172205805-172207804      + |    list1_49         0
##   [50]     chr1 172260642-172262641      + |    list1_50         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $list2
## GRanges object with 50 ranges and 2 metadata columns:
##        seqnames            ranges strand |        name     score
##           <Rle>         <IRanges>  <Rle> | <character> <numeric>
##    [1]     chr1   3670499-3672498      - |     list2_1         0
##    [2]     chr1   5916399-5918398      - |     list2_2         0
##    [3]     chr1   9699210-9701209      - |     list2_3         0
##    [4]     chr1   9907639-9909638      - |     list2_4         0
##    [5]     chr1 10718946-10720945      - |     list2_5         0
##    ...      ...               ...    ... .         ...       ...
##   [46]     chr1 81075951-81077950      - |    list2_46         0
##   [47]     chr1 85108854-85110853      - |    list2_47         0
##   [48]     chr1 85960056-85962055      - |    list2_48         0
##   [49]     chr1 86110971-86112970      - |    list2_49         0
##   [50]     chr1 87155522-87157521      - |    list2_50         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

When loaded by metagene2, they are converted to a GRanges:

mg <- metagene2$new(regions = regions,
                   bam_files = get_demo_bam_files())
mg$get_regions()
## GRanges object with 100 ranges and 3 metadata columns:
##         seqnames            ranges strand |        name     score
##            <Rle>         <IRanges>  <Rle> | <character> <numeric>
##     [1]     chr1 16103663-16105662      + |     list1_1         0
##     [2]     chr1 23921318-23923317      + |     list1_2         0
##     [3]     chr1 34848977-34850976      + |     list1_3         0
##     [4]     chr1 36368182-36370181      + |     list1_4         0
##     [5]     chr1 36690488-36692487      + |     list1_5         0
##     ...      ...               ...    ... .         ...       ...
##    [96]     chr1 81075951-81077950      - |    list2_46         0
##    [97]     chr1 85108854-85110853      - |    list2_47         0
##    [98]     chr1 85960056-85962055      - |    list2_48         0
##    [99]     chr1 86110971-86112970      - |    list2_49         0
##   [100]     chr1 87155522-87157521      - |    list2_50         0
##         region_name
##         <character>
##     [1]       list1
##     [2]       list1
##     [3]       list1
##     [4]       list1
##     [5]       list1
##     ...         ...
##    [96]       list2
##    [97]       list2
##    [98]       list2
##    [99]       list2
##   [100]       list2
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

For more details about each datasets, please refer to their documentation (i.e.:?promoters_hg19).

2.2.3 GRangesList objects for stitching ranges together

For certain types of analyses, it is useful to stitch together several regions into one logical unit. This is the case in RNA-seq data, where exons are individual regions which make more sense when grouped together into a single transcript.

For these cases, regions can be a GRangesList object where each element is one such logical region. One must also specify the region_mode="stitch" parameter when creating the new metagene object. When assay='rnaseq', region_mode is automatically set to "stitch".

regions <- get_demo_rna_regions()
regions
## GRangesList object of length 2:
## $DPM1
## GRanges object with 10 ranges and 2 metadata columns:
##        seqnames            ranges strand |        name     score
##           <Rle>         <IRanges>  <Rle> | <character> <numeric>
##    [1]    chr20 50934868-50935236      - |        <NA>      <NA>
##    [2]    chr20 50936149-50936262      - |        <NA>      <NA>
##    [3]    chr20 50940866-50940955      - |        <NA>      <NA>
##    [4]    chr20 50941106-50941209      - |        <NA>      <NA>
##    [5]    chr20 50942032-50942126      - |        <NA>      <NA>
##    [6]    chr20 50945738-50945762      - |        <NA>      <NA>
##    [7]    chr20 50945848-50945923      - |        <NA>      <NA>
##    [8]    chr20 50948630-50948662      - |        <NA>      <NA>
##    [9]    chr20 50955187-50955285      - |        <NA>      <NA>
##   [10]    chr20 50958364-50958555      - |        <NA>      <NA>
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
## 
## $NDUFAB1
## GRanges object with 7 ranges and 2 metadata columns:
##       seqnames            ranges strand |        name     score
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>
##   [1]    chr16 23581003-23581173      - |        <NA>      <NA>
##   [2]    chr16 23581872-23582375      - |        <NA>      <NA>
##   [3]    chr16 23585337-23585423      - |        <NA>      <NA>
##   [4]    chr16 23587198-23587319      - |        <NA>      <NA>
##   [5]    chr16 23590887-23591153      - |        <NA>      <NA>
##   [6]    chr16 23595430-23595638      - |        <NA>      <NA>
##   [7]    chr16 23596124-23596356      - |        <NA>      <NA>
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

In stitch mode, the loaded regions remain in a GRangesList, rather than being coerced into a GRanges.

mg <- metagene2$new(regions = regions,
                   bam_files = get_demo_rna_bam_files(),
                   region_mode="stitch")
mg$get_regions()
## GRangesList object of length 2:
## $DPM1
## GRanges object with 10 ranges and 3 metadata columns:
##        seqnames            ranges strand |        name     score region_name
##           <Rle>         <IRanges>  <Rle> | <character> <numeric> <character>
##    [1]    chr20 50934868-50935236      - |        <NA>      <NA>        DPM1
##    [2]    chr20 50936149-50936262      - |        <NA>      <NA>        DPM1
##    [3]    chr20 50940866-50940955      - |        <NA>      <NA>        DPM1
##    [4]    chr20 50941106-50941209      - |        <NA>      <NA>        DPM1
##    [5]    chr20 50942032-50942126      - |        <NA>      <NA>        DPM1
##    [6]    chr20 50945738-50945762      - |        <NA>      <NA>        DPM1
##    [7]    chr20 50945848-50945923      - |        <NA>      <NA>        DPM1
##    [8]    chr20 50948630-50948662      - |        <NA>      <NA>        DPM1
##    [9]    chr20 50955187-50955285      - |        <NA>      <NA>        DPM1
##   [10]    chr20 50958364-50958555      - |        <NA>      <NA>        DPM1
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
## 
## $NDUFAB1
## GRanges object with 7 ranges and 3 metadata columns:
##       seqnames            ranges strand |        name     score region_name
##          <Rle>         <IRanges>  <Rle> | <character> <numeric> <character>
##   [1]    chr16 23581003-23581173      - |        <NA>      <NA>     NDUFAB1
##   [2]    chr16 23581872-23582375      - |        <NA>      <NA>     NDUFAB1
##   [3]    chr16 23585337-23585423      - |        <NA>      <NA>     NDUFAB1
##   [4]    chr16 23587198-23587319      - |        <NA>      <NA>     NDUFAB1
##   [5]    chr16 23590887-23591153      - |        <NA>      <NA>     NDUFAB1
##   [6]    chr16 23595430-23595638      - |        <NA>      <NA>     NDUFAB1
##   [7]    chr16 23596124-23596356      - |        <NA>      <NA>     NDUFAB1
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

2.2.4 Generating common ranges (Promoters, gene bodies)

Some common ranges that can be useful for plotting include the set of all TSSes or gene bodies. While metagene2 does not provide those, they can easily be generated using packages from BioConductor:

    # First locate the TxDb package containing the geneset of interest.
    # Some of the most common TxDb packages include:
    #  - TxDb.Hsapiens.UCSC.hg38.knownGene
    #  - TxDb.Hsapiens.UCSC.hg19.knownGene
    #  - TxDb.Mmusculus.UCSC.mm10.knownGene
    #  - TxDb.Mmusculus.UCSC.mm10.ensGene
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    
    # We'll use the GenomicFeatures package to obtain gene/TSS coordinates
    # from the TxDb package.
    library(GenomicFeatures)
    
    # The GenomicFeatures::genes function provides us with gene bodies.
    all_gene_bodies = GenomicFeatures::genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
    
    # The GenomicFeatures::promoters function gets a region flanking the TSS.
    # By using it directly on TxDb.Hsapiens.UCSC.hg38.knownGene, we would get
    # the TSSes of all transcripts. Here, we use it on the gene_bodies GRanges
    # we've just created, and limit ourselves to one TSS per gene.
    all_TSS = GenomicFeatures::promoters(all_gene_bodies,
                                         upstream=2000, downstream=2000)

3 Grouping regions and bam files

By default, metagene2 aggregates all passed-in regions together, and treats all bam files separately. However, most non-trivial analyses will benefit from more granularity. Bam files can be split among different ChIP-seq experiments and/or multiple replicates. Regions can likewise be split according to multiple criteria: is the underlying gene up- or down-regulated? Is the enhancer bound by a cofactor of interest? Below, we discuss how metagene2 allows the user to specify those groupings to produce relevant analyses.

3.1 Grouping bam files

3.1.1 Using an experimental design

In metagene2, an experimental design is a set of design groups, each of which is defined as a set of “input” bam files and a set of “control” bam files. There is no limit to the number of design groups, though a large number of design groups will require a proportionately large amount of memory. A BAM file can be assigned to more than one design group.

The experimental design is expressed using a data-frame, where each row represents a bam file. The very first column of the data-frame must identify the bam files, using either their paths or their names as specified in the bam_files argument. Each subsequent column then represents an individual design group. The column name defines the design group’s name, and the column values determine how each bam file relates to the design group:

* 0: ignore file
* 1: input
* 2: control

A design group does not need to have a control, but it must have at least one input. Control samples are ignored when no normalization or “RPM” normalization is chosen. However, they are used to remove background noise using “NCIS” normalization is selected, or to compute coverage ratios with a control sample when “log2_ratio” normalization is applied.

example_design <- data.frame(Samples = bam_files,
                             align1 = c(1,1,0,0,2),
                             align2 = c(0,0,1,1,2))
kable(example_design)
Samples align1 align2
a1_1 /tmp/Rtmpp75B4J/Rinst6add59aa8a/metagene2/extdata/align1_rep1.bam 1 0
a1_2 /tmp/Rtmpp75B4J/Rinst6add59aa8a/metagene2/extdata/align1_rep2.bam 1 0
a2_1 /tmp/Rtmpp75B4J/Rinst6add59aa8a/metagene2/extdata/align2_rep1.bam 0 1
a2_2 /tmp/Rtmpp75B4J/Rinst6add59aa8a/metagene2/extdata/align2_rep2.bam 0 1
ctrl /tmp/Rtmpp75B4J/Rinst6add59aa8a/metagene2/extdata/ctrl.bam 2 2
# Initializing the metagene object.
mg <- metagene2$new(regions = get_demo_regions(),
                   bam_files = get_demo_bam_files(),
                   assay='chipseq')

# Plotting while grouping the bam files by design group
mg$produce_metagene(design=example_design)