Abstract

SplicingFactory is an R package for the analysis of alternative splicing isoform diversity, with expression levels generally coming from RNA-Seq data. Expression levels are generated by the usual tools, for example a STAR alignment followed by featureCounts, RSEM, salmon or kallisto. However, you need transcript isoform level data, where the expression estimates of the different isoforms for a gene are as specific, as possible. The package provides the ability to generate gene-level Shannon-entropy values and other diversity measures, such as the Gini- or Simpson-index. These measures can quantify transcript isoform diversity within samples or between conditions. Additionally, the package analyzes the isoform diversity data, looking for significant changes between conditions. A basic task during analysis is to use read counts or TPM to calculate a diversity measure and then perform a differential analysis to identify genes with a significant change between conditions. The vignette explains the use of the package with a typical workflow.SplicingFactory package version: 1.5.0

As input, the SplicingFactory package can use 5 different data types. Besides matrices and data frames, you can use the output of tximport, DGELists and SummarizedExperiments. The tximport R package is able to import files containing transcript abundance estimates generated by tools such as Salmon, Kallisto or RSEM. The tximport datasets are stored in a list by default, and they can serve as input for SplicingFactory package.

DGELists are a list-based S4 class for storing read counts and associated information, while SummarizedExperiment objects contain and describe high-throughput expression level assays. SplicingFactory automatically uses the necessary functions for all of these data types, on condition that they were not previously modified (e.g. list elements renamed).

While SplicingFactory can process any kind of numeric value, used to measure expression levels, we recommend TPM or similar length normalized values. If the read count values are not normalized for the transcript isoform lengths, the read count proportions, therefore the diversity values, will be misleading. For example, a gene with three transcript isoforms, lengths of 100, 100, and 1000, and read counts of 20, 20 and 200 for each of them is detected in an experiment. Simply using the read counts to calculate proportions, will lead to the values of 0.083, 0.083 and 0.83, and to the conclusion that we have a single dominant isoform based on the diversity value. However,the third isoform is 10 times longer than the other two, leading to a larger number of reads originating from this isoform. Normalizing for isoform length will lead to the same 0.33 proportion for all of them, therefore no dominant isoform and a very different diversity value.

Besides the format, the input table (matrix, data.frame or the tabular expression data extracted from other data structures) needs to be arranged properly. Every row identifies a distinct transcript in your dataset, while every column belongs to a distinct sample. The table can contain only numeric values. Supplementing your table, the package will need a gene and a sample vector, identifying the genes and the sample conditions in your analysis. The gene vector assigns genes to every row in your data, that will be used to aggregate transcript level expression information at the gene level. It is important that the columns and the rows of the table (genes and samples) are in the same order, as the gene and the sample vectors. In case of `SummarizedExperiment`

, the genes vector can be empty (but not exclusively) as the package can automatically extract the necessary vector from the object.

The package contains an example dataset called `tcga_brca_luma_dataset`

. The data was downloaded from The Cancer Genome Atlas (TCGA) on 12th of April, 2020. It contains transcript level read counts for 300 pre-selected genes of 40 patients with Luminal A type breast cancer (primary tumor and solid normal samples). Transcript level expression was estimated with RSEM.

You can check the list of TCGA sample ids selected with the following code.

```
sample_id_file <- system.file("extdata/tcga_sample_ids.tsv",
package = "SplicingFactory")
sample_ids <- read.table(sample_id_file)
```

You can check the list of pre-selected genes with the following code.

```
library("SplicingFactory")
library("SummarizedExperiment")
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
# Load dataset
data(tcga_brca_luma_dataset)
# Extract gene names
genes <- tcga_brca_luma_dataset[, 1]
# Extract read count data without gene names
readcounts <- tcga_brca_luma_dataset[, -1]
# Check read count dataset
dim(readcounts)
#> [1] 1100 40
head(readcounts[, 1:5])
#> TCGA-A7-A0CH_N TCGA-A7-A0CH_T TCGA-A7-A0D9_N TCGA-A7-A0D9_T TCGA-A7-A0DB_T
#> 1 2858.04 743.56 812.59 0.00 1508.90
#> 2 127.82 21.28 50.87 38.21 30.10
#> 3 370.22 94.38 368.76 98.12 167.89
#> 4 7472.00 3564.87 7647.76 3236.28 6360.80
#> 5 25.44 16.91 4.07 7.55 34.74
#> 6 0.00 0.00 0.00 0.00 0.00
```

As a first step, before doing the diversity calculation, you might want to filter out genes with a low overall expression or limit the analysis to transcripts with a sufficient minimum expression level. Expression estimates of transcript isoforms with zero or low expression might be highly variable. For more details on the effect of transcript isoform prefiltering on differential transcript usage, see this paper.

Here, we are filtering out transcript isoforms with less than 5 reads in more than 5 samples. Additionally, we update the `genes`

vector to match the new filtered matrix.

We are going to use the `calculate_diversity`

function to calculate two different types of transcript diversity.There are several mandatory and optional pamaters for the function. Even though we are using only a limited number of genes for a set of 40 samples, the analysis can be done using a full transcriptome annotation and much larger sample sets.

- x - Input data, in various formats, discussed in more detail in the Input data structure and Data arrangement section.
- genes - A vector with gene names used for aggregating the transcript level data.
- method - Method to use for splicing diversity calculation.
- norm - If set to
`TRUE`

, the entropy values are normalized to the number of transcripts for each gene. - tpm - In the case of a tximport list, you might want set the
`tpm`

argument. As the default option is`FALSE`

, the raw read counts will be extracted from your input data. Set it to`TRUE`

if you want to use TPM values. - assayno - An optional argument is
`assayno`

, which is useful if you are planning to analyze a`SummarizedExperiment`

input, containing multiple assays.`assayno`

is a numeric value, specifying the assay to be analyzed. - verbose - Set it to
`TRUE`

if you want more detailed diagnostic messages.

To calculate Laplace entropy, where values are normalized between 0 and 1, use:

```
laplace_entropy <- calculate_diversity(readcounts, genes, method = "laplace",
norm = TRUE, verbose = TRUE)
#> Note: There are 70 genes with single isoforms,
#> which will be exluded from the analysis.
head(assay(laplace_entropy)[, 1:5])
#> TCGA-A7-A0CH_N TCGA-A7-A0CH_T TCGA-A7-A0D9_N TCGA-A7-A0D9_T
#> AADAT 0.97882503 0.12394987 0.93798705 0.0503176129
#> AAGAB 0.84165885 0.34697215 0.99995622 0.0007424084
#> AASS 0.08596202 0.03280399 0.01264964 0.0169753750
#> ABCA5 0.58404988 0.42518099 0.49292670 0.5902030861
#> ACACB 0.72444702 0.49913583 0.65958666 0.6333204035
#> ACE2 0.17556503 0.81127812 0.08146203 0.5032583348
#> TCGA-A7-A0DB_T
#> AADAT 0.1485495
#> AAGAB 0.1490526
#> AASS 0.9984193
#> ABCA5 0.4478578
#> ACACB 0.5544559
#> ACE2 0.5916728
```

To calculate Gini index, you don’t need to specify the `norm`

argument, as the Gini index is by definition ranges between 0 (complete equality) and 1 (complete inequality).

```
gini_index <- calculate_diversity(readcounts, genes, method = "gini",
verbose = TRUE)
#> Note: There are 70 genes with single isoforms,
#> which will be exluded from the analysis.
head(assay(gini_index)[, 1:5])
#> TCGA-A7-A0CH_N TCGA-A7-A0CH_T TCGA-A7-A0D9_N TCGA-A7-A0D9_T
#> AADAT 0.1714243 1.0000000 0.294838710 1.0000000
#> AAGAB 0.4600646 0.8703584 0.007806598 1.0000000
#> AASS 1.0000000 1.0000000 1.000000000 1.0000000
#> ABCA5 0.7559249 0.8488397 0.767124239 0.7476021
#> ACACB 0.5876599 0.7557049 0.674832742 0.6524809
#> ACE2 1.0000000 1.0000000 1.000000000 1.0000000
#> TCGA-A7-A0DB_T
#> AADAT 1.00000000
#> AAGAB 0.95771019
#> AASS 0.04696541
#> ABCA5 0.86402952
#> ACACB 0.69320036
#> ACE2 1.00000000
```

Both for the Laplace-entropy and Gini index calculation, the package returns a `SummarizedExperiment`

object, that you can investigate further with the `assay`

function.

The package automatically filters out genes with a single isoform, as splicing diversity values can only be calculated for genes with at least 2 splicing isoforms.

Some genes might show `NA`

diversity values. This means that the expression was zero for all transcript isoforms of the gene in these samples and the package could not calculate any diversity value as there is no meaningful diversity for genes which did not show any expression in your experiment. Lack of expression might also be the result of technical issues.

To further analyze and visualize your data, you might do the following. To see the distribution and density of the splicing diversity data, you can visualize it by using ggplot2 from the tidyverse package collection.

```
library("tidyr")
#>
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:S4Vectors':
#>
#> expand
library("ggplot2")
# Construct data.frame from SummarizedExperiment result
laplace_data <- cbind(assay(laplace_entropy),
Gene = rowData(laplace_entropy)$genes)
# Reshape data.frame
laplace_data <- pivot_longer(laplace_data, -Gene, names_to = "sample",
values_to = "entropy")
# Add sample type information
laplace_data$sample_type <- apply(laplace_data[, 2], 1,
function(x) ifelse(grepl("_N", x),
"Normal", "Tumor"))
# Filter genes with NA entropy values
laplace_data <- drop_na(laplace_data)
# Update gene names and add diversity type column
laplace_data$Gene <- paste0(laplace_data$Gene, "_", laplace_data$sample_type)
laplace_data$diversity <- "Normalized Laplace entropy"
# Construct data.frame from SummarizedExperiment result
gini_data <- cbind(assay(gini_index), Gene = rowData(gini_index)$genes)
# Reshape data.frame
gini_data <- pivot_longer(gini_data, -Gene, names_to = "sample",
values_to = "gini")
# Add sample type information
gini_data$sample_type <- apply(gini_data[, 2], 1,
function(x) ifelse(grepl("_N", x),
"Normal", "Tumor"))
# Filter genes with NA gini values
gini_data <- drop_na(gini_data)
# Update gene names and add diversity type column
gini_data$Gene <- paste0(gini_data$Gene, "_", gini_data$sample_type)
gini_data$diversity <- "Gini index"
# Plot diversity data
ggplot() +
geom_density(data = laplace_data, alpha = 0.3,
aes(x = entropy, group = sample, color = diversity)) +
geom_density(data = gini_data, alpha = 0.3,
aes(x = gini, group = sample, color = diversity)) +
facet_grid(. ~ sample_type) +
scale_color_manual(values = c("black", "darkorchid4")) +
guides(color = FALSE) +
theme_minimal() +
labs(x = "Diversity values",
y = "Density")
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
#> "none")` instead.
```

```
# Mean entropy calculation across samples for each gene/sample type combination
laplace_entropy_mean <- aggregate(laplace_data$entropy,
by = list(laplace_data$Gene), mean)
colnames(laplace_entropy_mean)[2] <- "mean_entropy"
laplace_entropy_mean <- as_tibble(laplace_entropy_mean)
# Add sample type information
laplace_entropy_mean$sample_type <- apply(laplace_entropy_mean[, 1], 1,
function(x) ifelse(grepl("_Normal", x),
"Normal", "Tumor"))
# Add diversity type column
laplace_entropy_mean$diversity <- "Normalized Laplace entropy"
# Mean gini calculation across samples for each gene/sample type combination
gini_mean <- aggregate(gini_data$gini, by = list(gini_data$Gene), mean)
colnames(gini_mean)[2] <- "mean_gini"
gini_mean <- as_tibble(gini_mean)
# Add sample type information
gini_mean$sample_type <- apply(gini_mean[, 1], 1,
function(x) ifelse(grepl("_Normal", x),
"Normal", "Tumor"))
# Add diversity type column
gini_mean$diversity <- "Gini index"
ggplot() +
geom_violin(data = laplace_entropy_mean, aes(x = sample_type, y = mean_entropy,
fill = diversity),
alpha = 0.6) +
geom_violin(data = gini_mean, aes(x = sample_type, y = mean_gini,
fill = diversity),
alpha = 0.6) +
scale_fill_viridis_d(name = "Diversity") +
coord_flip() +
theme_minimal() +
labs(x = "Samples",
y = "Diversity")
```

The two methods are calculating different results. Genes with a single dominant isoform have a near-zero entropy, while they have a Gini index close to 1. The overall distribution of the data is similar between the Normal and Tumor conditions.

To further analyze the data, the steps of a differential diversity analysis are implemented in the `calculate_difference`

function, aiming to identify genes with significant changes in splicing diversity. The result table will contain the mean or median values of the diversity across sample categories, the difference of these values, the log2 fold change of the two different conditions, p-values and adjusted p-values for each genes.

There are several mandatory and optional arguments for this function:

- x - Output of the previous function, which is a data.frame with gene names as the first column and splicing diversity values for each sample in additional columns.
- samples - The previously defined vector with sample conditions specifying the category of each sample in the correct order.
- control - Name of the control sample category, defined in the samples vector, e.g. control = “Normal” or control = “WT”.
- method - Method to use for calculating the average splicing diversity value in a condition. Can be “mean” or “median”.
- test - Method to use for p-value calculation. There are two statistical tests implemented at the moment - Wilcoxon rank sum test and label shuffling test.
- randomizations - Optional parameter for the label shuffling test. You can specify the number of random shuffles (default = 100).
- pcorr - P-value correction method to use for the test results. Benjamini & Hochberg by default.
- assayno - An optional argument is
`assayno`

, which is useful if you are planning to analyze a`SummarizedExperiment`

input, containing multiple assays.`assayno`

is a numeric value, specifying the assay to be analyzed. - verbose - Set it to
`TRUE`

if you want more detailed diagnostic messages.

To analyze the previously calculated normalized Laplace entropy values stored in a `SummarizedExperiment`

object with a Wilcoxon sum rank test, use the `calculate_difference`

function as follows:

```
# Update the SummarizedExperiment object with a new sample metadata column for
# sample types, as the the object returned by calculate_diversity does not
# contain this information.
colData(laplace_entropy) <- cbind(colData(laplace_entropy),
sample_type = ifelse(grepl("_N", laplace_entropy$samples),
"Normal", "Tumor"))
# Calculate significant entropy changes
entropy_significance <- calculate_difference(x = laplace_entropy, samples = "sample_type",
control = "Normal",
method = "mean", test = "wilcoxon",
verbose = TRUE)
head(entropy_significance)
#> genes Tumor_mean Normal_mean mean_difference log2_fold_change raw_p_values
#> 1 AADAT 0.25574615 0.39521523 -0.13946909 -0.62792606 8.817198e-01
#> 2 AAGAB 0.08392113 0.35296374 -0.26904261 -2.07241399 1.348582e-03
#> 3 AASS 0.24982525 0.04153589 0.20828935 2.58848881 5.775309e-04
#> 4 ABCA5 0.51455877 0.54811526 -0.03355649 -0.09114345 2.502968e-01
#> 5 ACACB 0.57465202 0.61468593 -0.04003391 -0.09716086 1.135514e-01
#> 6 ACE2 0.60674130 0.14586248 0.46087882 2.05647271 1.054912e-06
#> adjusted_p_values
#> 1 0.9159727642
#> 2 0.0099516083
#> 3 0.0058853148
#> 4 0.5005935950
#> 5 0.2982452455
#> 6 0.0000893763
```

The package sends a note about 11 genes, with low sample size, excluded from the analysis. as these genes had several NA diversity values, the result of 0 expression values. You need at least 3 samples in each sample category and a total of 8 samples for a Wilcoxon test and at least 5 samples in each sample category for the label shuffling.

Genes with a significant change in splicing diversity can be further analyzed and visualized by using e.g. MA-plots, where the log2 fold change or mean difference is shown on the y axis, and the mean diversity values on the x axis. Dots are colored red if the adjusted p-value is smaller than 0.05. It is recommended to filter for an absolute mean difference larger than 0.1 besides the adjusted p-value.

The normalized naive and Laplace entropy, the Gini index, and the Simpson index are bounded in [0, 1], and we recommend using the mean or median difference when filtering for biologically meaningful changes. The non-normalized naive and Laplace entropy and the inverse Simpson index are not bounded in [0, 1], and the log2 fold change might give more useful results.

```
entropy_significance$label <- apply(entropy_significance[, c(4, 7)], 1,
function(x) ifelse(abs(x[1]) >= 0.1 & x[2] < 0.05,
"significant", "non-significant"))
entropy_significance$mean <- apply(entropy_significance[, c(2, 3)], 1,
function(x) (x[1] + x[2]) / 2)
ggplot(entropy_significance, aes(x = mean, y = mean_difference)) +
geom_point(color = "lightgrey", size = 1) +
geom_point(data = entropy_significance[entropy_significance$label == "significant", ],
color = "red", size = 1) +
theme_minimal() +
labs(title = "Normalized Laplace entropy",
subtitle = "Wilcoxon signed rank test",
x = "Mean entropy",
y = "Mean difference")
```

The analyzed genes also can be visualized on a Volcano-plot, which shows the adjusted p-values of the genes on a logarithmic scale on the y axis and the mean difference values between the two conditions on the x axis. We used a cutoff of 0.1 and -0.1 for the mean difference.

```
ggplot(entropy_significance, aes(x = mean_difference,
y = -log10(adjusted_p_values),
color = label)) +
geom_point() +
scale_color_manual(values = c("grey", "red"), guide = "none") +
geom_hline(yintercept = -log10(0.05), color = "red") +
geom_vline(xintercept = c(0.1, -0.1)) +
theme_minimal() +
labs(title = "Normalized Laplace entropy",
subtitle = "Wilcoxon signed rank test",
x = "Mean difference of entropy values",
y = "-Log10(adjusted p-value")
```