Contents

1 Introduction

GeneStructureTools is a package for the manipulation and analysis of transcribed gene structures.

We have provided functions for importing Whippet and leafcutter alternative splicing data, and the analysis of these splicing events. Splicing events can also be defined manually if you are using a different splicing analysis tool to Whippet. For specific events - currently including exon skipping, intron retention, alternative splice site usage and alternative first/last exons - transcripts can be made in silico which use the two splicing modes - i.e. transcripts containing and transcripts skipping an exon. These transcripts do not have to be pre-annotated, and thus all potential isoforms can be compared for an event.

Current comparisons of transcripts include annotating and analysing ORF and UTR features (length, locations, difference/similarity between transcripts), and predicting nonsense mediated decay (NMD) potential.

We also have functions for re-annotation of .GTF features, such as annotating UTRs as 3’ or 5’, and assigning a broader biotype for genes and transcripts so more informative analysis can be performed between these classes.

2 Importing Differential Splicing Data

Currently, very few available tools output splicing event type information (i.e. exon skipping, intron retention) within tested genes. GeneStructureTools currently has functions for processing data from:

whippet

DEXSeq

leafcutter

Data Preparation

We have pre-prepared data from mouse embryonic stem cell (ESC) development (Gloss et. al, 2017, Accession Number GSE75028), at days 0 and 5, and run whippet on each replicate using the reccomended parameters and the Gencode vM14 annotation. You can download the whippet, leafcutter, and DEXSeq files here.

You will also need to download the Gencode GTF file from here.

For the purposes of this vignette, small subsets of these data are available in the package data (inst/extdata).

Data provided is typical output for leafcutter and Whippet. For details on what each file contains, please refer to their respective manuals ( leafcutter | Whippet ).

DEXSeq data is processed as reccomended by the (DEXSeq) manual. The script used to process raw output is here

2.1 Whippet

To run a full analysis on Whippet output, you will need the raw .psi.gz (percent spliced in) and .jnc.gz (junction read counts) files for each sample. In addition, you will need to compare conditions using whippet-delta.jl and have a resulting .diff.gz file.

Read in Whippet files from downloaded data

# Load packages
library(GeneStructureTools)
library(GenomicRanges)
library(stringr)
library(BSgenome.Mmusculus.UCSC.mm10)
library(Gviz)
library(rtracklayer)


# list of files in the whippet directory
whippet_file_directory <- "~/Downloads/GeneStructureTools_VignetteFiles/"

# read in files as a whippetDataSet
wds <- readWhippetDataSet(whippet_file_directory)

# create a sample table with sample id, condition and replicate
whippet_sampleTable <- data.frame(sample=c("A01","B01","A21","B21"),
                                  condition=c("01","01","21","21"),
                                  replicate=(c("A","B","A","B")))
                                  
# read in gtf annotation
gtf <- rtracklayer::import("~/Downloads/gencode.vM14.annotation.gtf.gz")
exons <- gtf[gtf$type=="exon"]
transcripts <- gtf[gtf$type=="transcript"]

# add first/last annotation (speeds up later steps)
if(!("first_last" %in% colnames(mcols(exons)))){
    t <- as.data.frame(table(exons$transcript_id))
    exons$first_last <- NA
    exons$first_last[exons$exon_number == 1] <- "first"
    exons$first_last[exons$exon_number == 
    t$Freq[match(exons$transcript_id, t$Var1)]] <- "last"
}

# specify the BSGenome annotation
g <- BSgenome.Mmusculus.UCSC.mm10::BSgenome.Mmusculus.UCSC.mm10

Read in Whippet files from package data

# Load packages
suppressPackageStartupMessages({
library(GeneStructureTools)
library(GenomicRanges)
library(stringr)
library(BSgenome.Mmusculus.UCSC.mm10)
library(Gviz)
library(rtracklayer)
})
# list of files in the whippet directory
whippet_file_directory <- system.file("extdata","whippet/", 
package = "GeneStructureTools")

# read in files as a whippetDataSet
wds <- readWhippetDataSet(whippet_file_directory)

# create a sample table with sample id, condition and replicate
whippet_sampleTable <- data.frame(sample=c("A01","B01","A21","B21"),
                                  condition=c("01","01","21","21"),
                                  replicate=(c("A","B","A","B")))

# read in gtf annotation
gtf <- rtracklayer::import(system.file("extdata","example_gtf.gtf", 
package = "GeneStructureTools"))
exons <- gtf[gtf$type=="exon"]
transcripts <- gtf[gtf$type=="transcript"]

# add first/last annotation (speeds up later steps)
if(!("first_last" %in% colnames(mcols(exons)))){
    t <- as.data.frame(table(exons$transcript_id))
    exons$first_last <- NA
    exons$first_last[exons$exon_number == 1] <- "first"
    exons$first_last[exons$exon_number == 
    t$Freq[match(exons$transcript_id, t$Var1)]] <- "last"
}

# specify the BSGenome annotation
g <- BSgenome.Mmusculus.UCSC.mm10::BSgenome.Mmusculus.UCSC.mm10

2.2 leafcutter

Only the file containg the leafcutter results for each intron, and the .gtf file used with leafcutter needs to be read in for results processing. The leafcutter results file is generated after running prepare_results.R on your data, then extracting out the intron data table.

First, find the location of the leafviz2table.R script:

#find location of the script
system.file("extdata","leafviz2table.R", package = "GeneStructureTools")

Then run it on your leafviz output .RData file. The first argument is the leafviz output RData file, and the second is the name of the table you wish to write the intron results to.

Rscript leafviz2table.R leafviz.RData per_intron_results.tab

We have an processed example file available in extdata/leafcutter with a small sample of significant events.

# read in gtf annotation
gtf <- rtracklayer::import(system.file("extdata","example_gtf.gtf", package = "GeneStructureTools"))
exons <- gtf[gtf$type=="exon"]

# specify the BSGenome annotation
g <- BSgenome.Mmusculus.UCSC.mm10::BSgenome.Mmusculus.UCSC.mm10

# list of files in the leafcutter directory
leafcutter_files <- list.files(system.file("extdata","leafcutter/", package = "GeneStructureTools"),full.names = TRUE)

intron_results <- read.delim(leafcutter_files[grep("per_intron_results.tab", leafcutter_files)], stringsAsFactors = FALSE)

3 Summarise changes in gene structures due to splicing

3.1 Whippet

# filter events for significance
wds <- filterWhippetEvents(
whippetDataSet = wds,
probability = 0.95, # min probability
psiDelta = 0.1, # min change in PSI
eventTypes = "all", # all event types
minCounts = 100, # mean of at least 100 counts in one condition
sampleTable = whippet_sampleTable)

# check for changes in gene/transcript structure
whippet_summary <- whippetTranscriptChangeSummary(wds,
exons = exons,
transcripts = transcripts,
BSgenome = g,
NMD = FALSE # ignore nonsense mediated decay
)
head(whippet_summary)
##                    gene node                    coord strand type   psi_a
## 1 ENSMUSG00000032883.15    3   chr1:78663141-78663280      +   CE 0.35304
## 2 ENSMUSG00000024038.16    6  chr17:31527310-31528401      +   CE 0.13662
## 3 ENSMUSG00000018379.17    5  chr11:88049216-88049411      +   RI 0.37136
## 4 ENSMUSG00000034064.14    3  chr16:38549434-38549636      -   AA 0.38094
## 5 ENSMUSG00000035478.14    2  chr10:80399122-80399217      -   AD 0.45017
## 6 ENSMUSG00000026421.14    2 chr1:135729147-135729274      +   AF 0.84840
##     psi_b psi_delta probability complexity entropy
## 1 0.13494   0.21810       1.000         K1  0.9619
## 2 0.27537  -0.13875       0.998         K1  0.9006
## 3 0.21276   0.15861       1.000         K1  0.9590
## 4 0.23379   0.14716       0.990         K1  0.9638
## 5 0.33980   0.11036       0.953         K1  0.9937
## 6 0.99487  -0.14647       1.000         K1  0.6421
##                                           unique_name comparison condition_1
## 1   ENSMUSG00000032883.15_chr1:78663141-78663280_CE_3    01_v_21          01
## 2  ENSMUSG00000024038.16_chr17:31527310-31528401_CE_6    01_v_21          01
## 3  ENSMUSG00000018379.17_chr11:88049216-88049411_RI_5    01_v_21          01
## 4  ENSMUSG00000034064.14_chr16:38549434-38549636_AA_3    01_v_21          01
## 5  ENSMUSG00000035478.14_chr10:80399122-80399217_AD_2    01_v_21          01
## 6 ENSMUSG00000026421.14_chr1:135729147-135729274_AF_2    01_v_21          01
##   condition_2 coord_match condition_1_counts condition_2_counts node
## 1          21           1               89.5              213.5    3
## 2          21           2              706.0              456.5    6
## 3          21           3              569.5              662.0    5
## 4          21           4              103.5              109.0    3
## 5          21           5               98.5              107.5    2
## 6          21           6               51.5              196.5    2
##                      coord strand type   psi_a   psi_b psi_delta probability
## 1   chr1:78663141-78663280      +   CE 0.35304 0.13494   0.21810       1.000
## 2  chr17:31527310-31528401      +   CE 0.13662 0.27537  -0.13875       0.998
## 3  chr11:88049216-88049411      +   RI 0.37136 0.21276   0.15861       1.000
## 4  chr16:38549434-38549636      -   AA 0.38094 0.23379   0.14716       0.990
## 5  chr10:80399122-80399217      -   AD 0.45017 0.33980   0.11036       0.953
## 6 chr1:135729147-135729274      +   AF 0.84840 0.99487  -0.14647       1.000
##   complexity entropy                                         unique_name
## 1         K1  0.9619   ENSMUSG00000032883.15_chr1:78663141-78663280_CE_3
## 2         K1  0.9006  ENSMUSG00000024038.16_chr17:31527310-31528401_CE_6
## 3         K1  0.9590  ENSMUSG00000018379.17_chr11:88049216-88049411_RI_5
## 4         K1  0.9638  ENSMUSG00000034064.14_chr16:38549434-38549636_AA_3
## 5         K1  0.9937  ENSMUSG00000035478.14_chr10:80399122-80399217_AD_2
## 6         K1  0.6421 ENSMUSG00000026421.14_chr1:135729147-135729274_AF_2
##   comparison condition_1 condition_2 coord_match condition_1_counts
## 1    01_v_21          01          21           1               89.5
## 2    01_v_21          01          21           2              706.0
## 3    01_v_21          01          21           3              569.5
## 4    01_v_21          01          21           4              103.5
## 5    01_v_21          01          21           5               98.5
## 6    01_v_21          01          21           6               51.5
##   condition_2_counts                       id orf_length_bygroup_x
## 1              213.5   chr1:78663141-78663280                  720
## 2              456.5  chr17:31527310-31528401                  104
## 3              662.0  chr11:88049216-88049411                  201
## 4              109.0  chr16:38549434-38549636                  410
## 5              107.5  chr10:80399122-80399217                  313
## 6              196.5 chr1:135729147-135729274                  166
##   orf_length_bygroup_y utr3_length_bygroup_x utr3_length_bygroup_y
## 1                  720                  1328                  1328
## 2                  468                    74                    74
## 3                  253                   763                   411
## 4                  317                  1485                  1488
## 5                  281                   185                   185
## 6                  166                  1898                   709
##   utr5_length_bygroup_x utr5_length_bygroup_y filtered percent_orf_shared
## 1                   460                   320    FALSE          1.0000000
## 2                    57                    57    FALSE          0.2222222
## 3                    37                    37    FALSE          0.0000000
## 4                     2                   520    FALSE          0.7731707
## 5                   192                   192    FALSE          0.8977636
## 6                  1135                  2596    FALSE          1.0000000
##   max_percent_orf_shared orf_percent_kept_x orf_percent_kept_y
## 1              1.0000000          1.0000000          1.0000000
## 2              0.2222222          1.0000000          0.2222222
## 3              0.7944664          0.0000000          0.0000000
## 4              0.7731707          0.7731707          1.0000000
## 5              0.8977636          0.8977636          1.0000000
## 6              1.0000000          1.0000000          1.0000000

###leafcutter

leafcutter_summary <- leafcutterTranscriptChangeSummary(intron_results, 
                                                        exons = exons,
                                                        BSgenome = g,
                                                        NMD = FALSE,
                                                        showProgressBar = FALSE)

head(leafcutter_summary[!duplicated(leafcutter_summary$cluster),])
##          cluster  status    loglr df            p     p.adjust  genes
## 5 chr16:clu_1396 Success 20.16004  2 1.756329e-09 2.720554e-06 Eif4a2
## 1 chr10:clu_1204 Success 18.48627  2 9.365149e-09 7.253308e-06 Bclaf1
## 6 chr15:clu_1488 Success 23.03735  6 2.860878e-08 1.107875e-05 Tarbp2
## 2 chr17:clu_1281 Success 20.57316  4 2.506720e-08 1.107875e-05  Rnps1
## 8   chr5:clu_225 Success 16.41082  2 7.462287e-08 2.311817e-05 Hnrnpd
## 7 chr14:clu_1512 Success 15.00163  2 3.054042e-07 7.884518e-05   Ktn1
##            FDR                             intron       logef       T01
## 5 2.720554e-06   chr16:23111896:23112351:clu_1396  0.30823371 0.2657849
## 1 7.253308e-06   chr10:20333572:20334499:clu_1204 -0.56647927 0.3521539
## 6 1.107875e-05 chr15:102518602:102519122:clu_1488 -0.47200141 0.2294254
## 2 1.107875e-05   chr17:24414734:24415041:clu_1281 -0.07302291 0.1145809
## 8 2.311817e-05     chr5:99967387:99976534:clu_225  0.55565460 0.3165242
## 7 7.884518e-05   chr14:47724970:47725929:clu_1512 -0.46578048 0.3338764
##         T02      deltapsi   chr     start       end clusterID   verdict   gene
## 5 0.4410376  0.1752527121 chr16  23111896  23112351  clu_1396 annotated Eif4a2
## 1 0.2269207 -0.1252331984 chr10  20333572  20334499  clu_1204 annotated Bclaf1
## 6 0.2597079  0.0302825713 chr15 102518602 102519122  clu_1488 annotated Tarbp2
## 2 0.1140728 -0.0005080613 chr17  24414734  24415041  clu_1281 annotated  Rnps1
## 8 0.3905444  0.0740202209  chr5  99967387  99976534   clu_225 annotated Hnrnpd
## 7 0.1656536 -0.1682228169 chr14  47724970  47725929  clu_1512 annotated   Ktn1
##               ensemblID transcripts constitutive.score orf_length_bygroup_x
## 5 ENSMUSG00000022884.14           +                  1                  408
## 1 ENSMUSG00000037608.16           +                  1                  919
## 6 ENSMUSG00000023051.10           +                  1                  132
## 2 ENSMUSG00000034681.16           +                  1                  305
## 8 ENSMUSG00000000568.15           -                  1                  336
## 7 ENSMUSG00000021843.16           +                  1                 1327
##   orf_length_bygroup_y utr3_length_bygroup_x utr3_length_bygroup_y
## 5                  363                   615                   857
## 1                  870                  2442                  2442
## 6                  274                  1265                   334
## 2                  282                   710                   710
## 8                  361                   366                   366
## 7                 1303                   469                   469
##   utr5_length_bygroup_x utr5_length_bygroup_y filtered percent_orf_shared
## 5                    34                    34    FALSE          0.7671569
## 1                   236                   236    FALSE          0.9466812
## 6                   224                    92    FALSE          0.0000000
## 2                   293                    59    FALSE          0.9245902
## 8                   294                   294    FALSE          0.9307479
## 7                    32                    32    FALSE          0.9819141
##   max_percent_orf_shared orf_percent_kept_x orf_percent_kept_y
## 5              0.8897059          0.7671569          0.8622590
## 1              0.9466812          0.9466812          1.0000000
## 6              0.4817518          0.0000000          0.0000000
## 2              0.9245902          0.9245902          1.0000000
## 8              0.9307479          1.0000000          0.9307479
## 7              0.9819141          0.9819141          1.0000000

4 Altering Gene and Transcript Structures

whippetTranscriptChangeSummary() combines several functions for analysing changes in gene structures. While this has been made to simplify analysis from whippet data, individual functions can be used on other data sources or manually annotated gene structures. It may also be helpful to run each individual step if you would like to manually investigate changes to genes.

4.1 Exon skipping

Exon skipping, or cassette exon usage, occurs when a single exon is spliced out of the mature transcript.

Removal of a skipped exon
Removal of a skipped exon

1.a. Find skipped exon events

# filter out skipped exon events (coded as "CE")
# we will be looking at Ndufv3 (ENSMUSG00000024038.16)
wds.ce <- filterWhippetEvents(wds, psiDelta=0,probability=0, 
event="CE", idList="ENSMUSG00000024038.16")

diffSplicingResults(wds.ce)
##                    gene node                   coord strand type   psi_a
## 2 ENSMUSG00000024038.16    6 chr17:31527310-31528401      +   CE 0.13662
##     psi_b psi_delta probability complexity entropy
## 2 0.27537  -0.13875       0.998         K1  0.9006
##                                          unique_name comparison condition_1
## 2 ENSMUSG00000024038.16_chr17:31527310-31528401_CE_6    01_v_21          01
##   condition_2 coord_match condition_1_counts condition_2_counts
## 2          21           2                706              456.5
# psi_a = 0.137, psi_b = 0.275
# percentage of transcripts skipping exon 3 decreases from timepoint 1 to 21

# whippet outputs the skipped exon coordinates
coordinates(wds.ce)
## GRanges object with 1 range and 1 metadata column:
##       seqnames            ranges strand |                     id
##          <Rle>         <IRanges>  <Rle> |            <character>
##   [1]    chr17 31527310-31528401      + | chr17:31527310-31528..
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

2. Find transcripts which overlap the skipped exon, and create normal & skipped exon isoforms

# find exons in the gtf that overlap the skipped exon event
exons.ce <- findExonContainingTranscripts(wds.ce, 
                                          exons = exons,
                                          transcripts = transcripts)
# make skipped and included exon transcripts
# removes the skipped exon from all transcripts which contain it
skippedExonTranscripts <- skipExonInTranscript(skippedExons = exons.ce,
                                               exons=exons, 
                                               match="skip",
                                               whippetDataSet = wds.ce)
## make Gvis models
# set up for visualisation
gtr <- GenomeAxisTrack()

# all transcripts for the gene
geneModel.all <- GeneRegionTrack(makeGeneModel(exons[exons$gene_id ==
skippedExonTranscripts$gene_id[1]]),
                                 name="Reference Gene",
                                 showId=TRUE,
                                 transcriptAnnotation = "transcript")
# reference transcript
geneModelNormal <- GeneRegionTrack(makeGeneModel(
skippedExonTranscripts[skippedExonTranscripts$set=="included_exon"]),
                                   name="Reference Isoform",
                                   showId=TRUE, fill="#4D7ABE",
                                   transcriptAnnotation = "transcript")
# for the skipped exon transcript
geneModelSkipped <- GeneRegionTrack(makeGeneModel(
skippedExonTranscripts[skippedExonTranscripts$set=="skipped_exon"]),
                                   name="Alternative Isoform",
                                   showId=TRUE, fill="#94AFD8",
                                   transcriptAnnotation = "transcript")

plotTracks(list(gtr,geneModel.all,geneModelNormal, geneModelSkipped), 
extend.left = 1000, extend.right = 1000)

# Only the transcript isoform containing the skipped exon (exon 3) 
# is used for analysis, and a 'novel' isoform is created by exon skipping

4.2 Intron Retention

Intron Retention occurs when an intron is not spliced out of the mature transcript.

Addition of a retained intron
Addition of a retained intron

1. Create normal and retained isoform structures from whippet coordinates

# filter out retained events (coded as "RI")
# we will be looking at Srsf1 (ENSMUSG00000018379.17)
wds.ri <- filterWhippetEvents(wds, psiDelta=0,probability=0, 
event="RI", idList="ENSMUSG00000018379.17")

diffSplicingResults(wds.ri)
##                    gene node                   coord strand type   psi_a
## 3 ENSMUSG00000018379.17    5 chr11:88049216-88049411      +   RI 0.37136
##     psi_b psi_delta probability complexity entropy
## 3 0.21276   0.15861           1         K1   0.959
##                                          unique_name comparison condition_1
## 3 ENSMUSG00000018379.17_chr11:88049216-88049411_RI_5    01_v_21          01
##   condition_2 coord_match condition_1_counts condition_2_counts
## 3          21           3              569.5                662

2. Find transcripts which overlap the intron, and create normal & retained intron isoforms

# find exons pairs in the gtf that bound the retained intron event
exons.ri <- findIntronContainingTranscripts(wds.ri, 
                                            exons)

# make retained and non-retained transcripts
# adds the intron into all transcripts which overlap it
retainedIntronTranscripts <- addIntronInTranscript(exons.ri, 
                                                   exons = exons, 
                                                   whippetDataSet = wds.ri,
                                                   glueExons = TRUE)
## make Gviz models
# all transcripts for the gene
geneModel.all <- GeneRegionTrack(makeGeneModel(exons[exons$gene_id == retainedIntronTranscripts$gene_id[1]]),
                                 name="Reference Gene",
                                 showId=TRUE,
                                 transcriptAnnotation = "transcript")
# reference transcript
geneModelNormal <- GeneRegionTrack(makeGeneModel(
retainedIntronTranscripts[retainedIntronTranscripts$set=="spliced_intron"]),
                                   name="Reference Isoform",
                                   showId=TRUE, fill="#4D7ABE",
                                   transcriptAnnotation = "transcript")
# for the retained intron transcript
geneModelRetained <- GeneRegionTrack(makeGeneModel(
retainedIntronTranscripts[retainedIntronTranscripts$set=="retained_intron"]),
                                   name="Alternative Isoform",
                                   showId=TRUE, fill="#94AFD8",
                                   transcriptAnnotation = "transcript")

# Only the transcript isoforms with exons at the boundries of the retained intron are used for analysis, and 'novel' isoforms are created by intron retention
plotTracks(list(gtr,geneModel.all,geneModelNormal, geneModelRetained), 
extend.left = 1000, extend.right = 1000)