ELViS Vignette

Hyo Young Choi1, Jin-Young Lee2, Jeremiah R. Holt3, Katherine A. Hoadley4 and D. Neil Hayes1

1Department of Preventive Medicine, Division of Biostatistics, University of Tennessee Health Science Center, Memphis, TN
2Center for Cancer Research, University of Tennessee Health Science Center, Memphis, TN
3Department of Medicine, College of Medicine, University of Tennessee Health Science Center, Memphis, TN
4Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC

21 February 2025

1 1. Introduction

Tumor viruses cause a significant proportion of human cancers by hijacking cell signaling and inducing proliferation. For example, HPV E6 and E7 proteins can promote tumorigenesis by inducing degradation of tumor suppressors p53 and pRb. Despite the importance of the dosages of the viral genes like these, existing tools cover viral integration, emphasizing the need for viral copy number analysis framework.

1.1 1.1 Motivation for submitting to Bioconductor

ELViS (Estimating Copy Number Levels of Viral Genomic Segments) is an R package that addresses this need through the analysis of copy numbers within viral genomes. ELViS utilizes base-resolution read depth data over viral genome to find copy number segments with two-dimensional segmentation approach. It also provides publish-ready figures, including histograms of read depths for sample filtering, coverage line plots over viral genome annotated with copy number change events and viral genes, and heatmaps showing overall pictures with multiple types of data with integrative clustering of samples. Taken together, our framework helps computational biologists and bioinformaticians to analyze viral copy number changes with ease.

# 2. Installation

To install this package, start R (version “4.5”) and enter:

``` r if (!require(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)

BiocManager::install(“ELViS”) ```

# 3. Run Example

## 3.1 Data preparation

#### 3.1.1 Virus Reference Genome File

The following reference files are included in extdata directory of ELViS.

- HPV16.fa - HPV16.fa.fai - HPV16REF_PaVE.gff

HPV16.fa is a reference sequence file with the associated fasta index file HPV16.fa.fai HPV16REF_PaVE.gff are viral gene annotation file in GFF3 format. They were downloaded from PaVE database (https://pave.niaid.nih.gov/locus_viewer?seq_id=HPV16REF)

You can get your own files(.fa,.fai,.gff) for the virus of your interest.

But, you should check if this viral reference genome FASTA(.fa) file and gene annotation GFF3(.gff) file have the same set of sequence names.

For the FASTA file, we recommend you add viral genome files to the host genome file (i.e. human reference genome hg38) to simultaneously align your reads to both host and viral genome. Multiple viral genome can be added. The reason we recommend this is as such. If there is a significant portion of viral reads in the sequencing data, and you align reads to either host or viral genome, there is a higher chance of false alignment and time it takes to finish the read alignment step can be extended.

# linux bash cat hg38.fa virus1.fa virus2.fa > hg38_with_virus.fa

Even if you use FASTA file containing both viral genomes and host genome, ELViS require a separate GFF file for each virus. Don’t merge GFF3 files together for using ELViS.

FASTA index file(.fai) can be created by running samtools faidx. For more detail, please refer to https://www.htslib.org/doc/samtools-faidx.html

# linux bash samtools faidx hg38_with_virus.fa

#### 3.1.2 BAM Files The following BAM files are included in extdata directory of ELViS.

- Control_100X_1102.bam - Control_100X_1102.bam.bai - Control_100X_1119.bam - Control_100X_1119.bam.bai

These are simulation data made using w-WESSIM-2. (https://github.com/GeorgetteTanner/w-Wessim2) Briefly, sequencing reads were simulated using this tool and aligned to HPV16.fa to generate the BAM file.

For your case, you should align your sequencing reads in FASTQ format to the reference genome prepared according to instructions in 2.2.1. As mentioned, we recommend aligning reads to a FASTQ file contains host genome and viral genomes.

However, you may have discovered our package after aligning all the reads to the host genome only already. No worries since there is a work around.

In that case, you can extract the unaligned reads and align that to viral genome FASTA file.

For example, say you have a BAM file seq_align_to_host.bam that is aligned to host genome only.

Extract unaligned reads using samtools as follows.

samtools view seq_align_to_host.bam --incl-flags 12 -b | # extract reads that are unaligned or those with unaligned mates samtools sort -n | # sort reads by read name samtools fastq -1 seq_r1.fastq -2 seq_r2.fastq # save as read1 and read2 files.

Align seq_r1.fastq and seq_r2.fastq to merged reference of viral genomes using sequence aligners of your choice, such as BWA MEM or Bowtie2.

## 3.2 Generate Raw Read Depth Matrix with Toy Examples

Load required libraries.

r library(ELViS) library(ggplot2) library(glue) library(dplyr) #> #> Attaching package: 'dplyr' #> The following objects are masked from 'package:stats': #> #>     filter, lag #> The following objects are masked from 'package:base': #> #>     intersect, setdiff, setequal, union library(ComplexHeatmap) #> Loading required package: grid #> ======================================== #> ComplexHeatmap version 2.23.0 #> Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/ #> Github page: https://github.com/jokergoo/ComplexHeatmap #> Documentation: http://jokergoo.github.io/ComplexHeatmap-reference #> #> If you use it in published research, please cite either one: #> - Gu, Z. Complex Heatmap Visualization. iMeta 2022. #> - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional #>     genomic data. Bioinformatics 2016. #> #> #> The new InteractiveComplexHeatmap package can directly export static #> complex heatmaps into an interactive Shiny app with zero effort. Have a try! #> #> This message can be suppressed by: #>   suppressPackageStartupMessages(library(ComplexHeatmap)) #> ======================================== theme_set(theme_bw())

Prepare BAM file name vector.

``` r analysis_dir = tempdir() dir.create(analysis_dir,showWarnings = FALSE)

package_name = “ELViS”

# load toy example meta data data(toy_example,package = package_name)

# get lust of bam file paths ext_path = system.file(“extdata”,package = package_name) bam_files = list.files(ext_path,full.names = TRUE,pattern = “bam$”)

```

Generate base-resolution read depth matrix from a list of BAM files. Parallel package is used to read BAM files fast.

``` r os_name = Sys.info()[“sysname”] if( os_name == “Windows” ){ N_cores <- 1L }else{ N_cores <- 2L }

# the name of the reference viral sequence the reads were aligned to target_virus_name = “gi|333031|lcl|HPV16REF.1|”

# temporary file directory tmpdir=tempdir()

# generate read depth matrix system.time({ mtrx_samtools_reticulate__example = get_depth_matrix( bam_files = bam_files,coord_or_target_virus_name = target_virus_name ,mode = “samtools_basilisk” ,N_cores = N_cores ,min_mapq = 30 ,tmpdir=tmpdir ,condaenv = “env_samtools” ,condaenv_samtools_version=“1.21” ) }) #> + /home/biocbuild/.cache/R/basilisk/1.19.1/0/bin/conda create –yes –prefix /home/biocbuild/.cache/R/basilisk/1.19.1/ELViS/0.99.11/env_samtools ‘python=3.12.8’ –quiet -c conda-forge -c bioconda –override-channels #> + /home/biocbuild/.cache/R/basilisk/1.19.1/0/bin/conda install –yes –prefix /home/biocbuild/.cache/R/basilisk/1.19.1/ELViS/0.99.11/env_samtools ‘python=3.12.8’ -c conda-forge -c bioconda –override-channels #> + /home/biocbuild/.cache/R/basilisk/1.19.1/0/bin/conda install –yes –prefix /home/biocbuild/.cache/R/basilisk/1.19.1/ELViS/0.99.11/env_samtools -c conda-forge -c bioconda ‘python=3.12.8’ ‘samtools=1.21’ –override-channels #> user system elapsed #> 37.731 5.350 41.942 ```

## 3.3 Filtering Out Low Depth Samples

Determine sample filtering threshold using histogram and filter out low depth samples

``` r # loading precalculated depth matrix data(mtrx_samtools_reticulate)

# threshold th = 50 # histogram with adjustable thresholds for custom function depth_hist(mtrx_samtools_reticulate,th=th,smry_fun=max) #> Warning in scale_x_continuous(trans = “log10”): log-10 transformation #> introduced infinite values. #> stat_bin() using bins = 30. Pick better value with binwidth. ```

r depth_hist(mtrx_samtools_reticulate,th=th,smry_fun=quantile,prob=0.75) #> Warning in scale_x_continuous(trans = "log10"): log-10 transformation #> introduced infinite values. #> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

``` r

# filtered matrix base_resol_depth = filt_samples(mtrx_samtools_reticulate,th=th,smry_fun=max) print(base_resol_depth[1:4,1:4]) #> Control_100X_58.bam Control_100X_61.bam Control_100X_64.bam #> [1,] 55 57 60 #> [2,] 56 59 62 #> [3,] 57 61 62 #> [4,] 57 61 63 #> Control_100X_83.bam #> [1,] 49 #> [2,] 49 #> [3,] 52 #> [4,] 53 ```

## 3.4 Run ELViS using the Filtered Depth Matrix

Running ELViS using the filtered read depth matrix(base_resol_depth).

``` r system.time({ result = run_ELViS( X = base_resol_depth ,N_cores=N_cores ,reduced_output=TRUE )

})

ELViS_toy_run_result = result use_data(ELViS_toy_run_result)

# 4min for 120 samples using 10 threads ```

## 3.5 Plotting Figures

Prepare plotting data

``` r # ELViS run result data(ELViS_toy_run_result) result = ELViS_toy_run_result

# Directory where figures will be saved figure_dir = glue(“{analysis_dir}/figures”) dir.create(figure_dir)

# give the gff3 file of the virus of your interest. Sequence name or chromosome name should match with that in the reference genome FASTA file. gff3_fn = system.file(“extdata”,“HPV16REF_PaVE.gff”,package = package_name) ```

Raw read depth profile line plots.

``` r # Plotting raw depth profile gg_lst_x = plot_pileUp_multisample( result = result, X_raw = base_resol_depth, plot_target = “x”, gff3 = gff3_fn, baseline=1, exclude_genes = c("E6*“,”E1^E4“,”E8E2“), ) #> Import genomic features from the file as a GRanges object … #> Warning in .local(con, format, text, …): gff-version directive indicates #> version is 3.1.26, not 3 #> OK #> Prepare the ‘metadata’ data frame … OK #> Make the TxDb object … #> Warning in .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : the transcript names (”tx_name" column in the TxDb object) imported from the #> “Name” attribute are not unique #> Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information #> is not available for this TxDb object #> OK

# Save to pdf file, set SKIP = FALSE if you want to save as pdf SKIP = TRUE if(!SKIP){ pdf(glue(“{figure_dir}/Raw_Depth_CNV_call.pdf”),height=4,width=6) gg_lst_x dev.off() }

# an example of raw read depth line plot print(gg_lst_x[[1]]) ```

You can adjust baseline after examining depth profile plots.

``` r # set the longest segment as a new baseline new_baseline = get_new_baseline(result,mode=“longest”)

# Plotting raw depth profile with new baseline gg_lst_x = plot_pileUp_multisample( result = result, X_raw = base_resol_depth, plot_target = “x”, gff3 = gff3_fn, baseline=new_baseline, exclude_genes = c("E6*“,”E1^E4“,”E8E2“), ) # Save to pdf file, set SKIP = FALSE if you want to save as pdf SKIP = TRUE if(!SKIP){ # Save to pdf file pdf(”figures/Raw_Depth_new_baseline_CNV_call.pdf",height=4,width=6) gg_lst_x dev.off() }

# an example of raw read depth line plot with new baseline gg_lst_x[[1]] ```

Normalized read depth profile line plots.

``` r # Plotting normalized depth profile gg_lst_y = plot_pileUp_multisample( result = result, X_raw = base_resol_depth, plot_target = “y”, gff3 = gff3_fn, baseline=new_baseline, exclude_genes = c("E6*“,”E1^E4“,”E8E2"), )

# Save to pdf file SKIP = TRUE if(!SKIP){ pdf(“figures/Normalized_Depth_CNV_call.pdf”,height=4,width=6) gg_lst_y dev.off() }

# an example of normalized read depth line plot with new baseline gg_lst_y[[1]] ```

Robust Z-score profile line plots.

``` r # Plotting robust Z-score profile gg_lst_z = plot_pileUp_multisample( result = result, X_raw = base_resol_depth, plot_target = “z”, gff3 = gff3_fn, baseline=new_baseline, exclude_genes = c("E6*“,”E1^E4“,”E8E2") )

SKIP = TRUE if(!SKIP){ # Save to pdf file pdf(“figures/Robust-Z-score_CNV_call.pdf”,height=4,width=6) gg_lst_z dev.off() }

# an example of Z-score line plot with new baseline gg_lst_z[[1]] ```

Generating heatmaps with integrative clustering.

Calculation of viral loads. - Get total aligned base using tools such as picard. Here we use randomly generated numbers instead.

``` r data(total_aligned_base__host_and_virus)

viral_load = (10^6)*(apply(base_resol_depth,2,(x) sum(x)) )/total_aligned_base__host_and_virus

# distribtuion of overall viral load viral_load %>%log10 %>% hist ```

Generate heatmaps with integrative clustering using data transformed in various ways.

r exclude_genes = c("E6*","E1^E4","E8^E2") integ_ht_result = integrative_heatmap( X_raw = base_resol_depth, result = result, gff3_fn = gff3_fn, exclude_genes = exclude_genes, baseline=1, total_aligned_base__host_and_virus = total_aligned_base__host_and_virus ) #> Import genomic features from the file as a GRanges object ... #> Warning in .local(con, format, text, ...): gff-version directive indicates #> version is 3.1.26, not 3 #> OK #> Prepare the 'metadata' data frame ... OK #> Make the TxDb object ... #> Warning in .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : the transcript names ("tx_name" column in the TxDb object) imported from the #>   "Name" attribute are not unique #> Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information #> is not available for this TxDb object #> OK #> Import genomic features from the file as a GRanges object ... #> Warning in .local(con, format, text, ...): gff-version directive indicates #> version is 3.1.26, not 3 #> OK #> Prepare the 'metadata' data frame ... OK #> Make the TxDb object ... #> Warning in .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : the transcript names ("tx_name" column in the TxDb object) imported from the #>   "Name" attribute are not unique #> Warning in .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : genome version information is not available for this TxDb object #> OK #>

use_rasteris automatically set to TRUE for a matrix with more than #> 2000 rows. You can controluse_rasterargument by explicitly setting #> TRUE/FALSE to it. #> #> Setht_opt$message = FALSE` to turn off this message. #> Warning: The input is a data frame-like object, convert it to a matrix. #> `use_raster` is automatically set to TRUE for a matrix with more than #> 2000 rows. You can control `use_raster` argument by explicitly setting #> TRUE/FALSE to it. #> #> Set `ht_opt$message = FALSEto turn off this message. #>use_rasteris automatically set to TRUE for a matrix with more than #> 2000 rows. You can controluse_rasterargument by explicitly setting #> TRUE/FALSE to it. #> #> Setht_opt$message = FALSE` to turn off this message. #> `use_raster` is automatically set to TRUE for a matrix with more than #> 2000 rows. You can control `use_raster` argument by explicitly setting #> TRUE/FALSE to it. #> #> Set `ht_opt$message = FALSEto turn off this message. #>use_rasteris automatically set to TRUE for a matrix with more than #> 2000 rows. You can controluse_rasterargument by explicitly setting #> TRUE/FALSE to it. #> #> Setht_opt$message = FALSE` to turn off this message. #> Warning: The input is a data frame-like object, convert it to a matrix. #> `use_raster` is automatically set to TRUE for a matrix with more than #> 2000 rows. You can control `use_raster` argument by explicitly setting #> TRUE/FALSE to it. #> #> Set `ht_opt$message = FALSE

to turn off this message. #> Warning: The heatmap has not been initialized. You might have different results #> if you repeatedly execute this function, e.g. when row_km/column_km was #> set. It is more suggested to do as

ht = draw(ht); column_order(ht)`.

# top annotation top_ant = HeatmapAnnotation( Log2 Overall\nViral Load = anno_points(log2(viral_load)), annotation_name_side = “left”,annotation_name_rot=0) ```

Generate heatmap showing maximum number of intact copies - min copy of the overlapping copy segments - ratio relative to certain gene(gene_ref)

``` r gene_ref=“E7”

gene_cn = gene_cn_heatmaps( X_raw = base_resol_depth, result = result, gff3_fn = gff3_fn, baseline = new_baseline, gene_ref = gene_ref, exclude_genes = exclude_genes ) #> Import genomic features from the file as a GRanges object … #> Warning in .local(con, format, text, …): gff-version directive indicates #> version is 3.1.26, not 3 #> OK #> Prepare the ‘metadata’ data frame … OK #> Make the TxDb object … #> Warning in .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : the transcript names (“tx_name” column in the TxDb object) imported from the #> “Name” attribute are not unique #> Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information #> is not available for this TxDb object #> OK ```

Generate final heatmap in a single panel.

r draw(top_ant %v% integ_ht_result$Heatmap %v% gene_cn$Heatmaps$intact_gene_cn %v% gene_cn$Heatmaps$rel_dosage)

2 4. sessionInfo

sessionInfo()
#> R Under development (unstable) (2025-02-19 r87757)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] ComplexHeatmap_2.23.0 dplyr_1.1.4           glue_1.8.0           
#> [4] ggplot2_3.5.1         ELViS_0.99.11         knitr_1.49           
#> [7] BiocStyle_2.35.0     
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3                   bitops_1.0-9               
#>   [3] httr2_1.1.0                 biomaRt_2.63.1             
#>   [5] rlang_1.1.5                 magrittr_2.0.3             
#>   [7] clue_0.3-66                 GetoptLong_1.0.5           
#>   [9] matrixStats_1.5.0           compiler_4.5.0             
#>  [11] RSQLite_2.3.9               GenomicFeatures_1.59.1     
#>  [13] dir.expiry_1.15.0           txdbmaker_1.3.1            
#>  [15] png_0.1-8                   vctrs_0.6.5                
#>  [17] stringr_1.5.1               pkgconfig_2.0.3            
#>  [19] shape_1.4.6.1               crayon_1.5.3               
#>  [21] fastmap_1.2.0               magick_2.8.5               
#>  [23] dbplyr_2.5.0                XVector_0.47.2             
#>  [25] labeling_0.4.3              Rsamtools_2.23.1           
#>  [27] rmarkdown_2.29              UCSC.utils_1.3.1           
#>  [29] tinytex_0.55                bit_4.5.0.1                
#>  [31] xfun_0.51                   cachem_1.1.0               
#>  [33] GenomeInfoDb_1.43.4         jsonlite_1.9.0             
#>  [35] progress_1.2.3              blob_1.2.4                 
#>  [37] DelayedArray_0.33.6         uuid_1.2-1                 
#>  [39] BiocParallel_1.41.2         prettyunits_1.2.0          
#>  [41] parallel_4.5.0              cluster_2.1.8              
#>  [43] R6_2.6.1                    stringi_1.8.4              
#>  [45] bslib_0.9.0                 RColorBrewer_1.1-3         
#>  [47] reticulate_1.40.0           rtracklayer_1.67.1         
#>  [49] GenomicRanges_1.59.1        jquerylib_0.1.4            
#>  [51] Rcpp_1.0.14                 bookdown_0.42              
#>  [53] SummarizedExperiment_1.37.0 iterators_1.0.14           
#>  [55] zoo_1.8-12                  IRanges_2.41.3             
#>  [57] Matrix_1.7-2                igraph_2.1.4               
#>  [59] tidyselect_1.2.1            abind_1.4-8                
#>  [61] yaml_2.3.10                 doParallel_1.0.17          
#>  [63] codetools_0.2-20            curl_6.2.1                 
#>  [65] lattice_0.22-6              tibble_3.2.1               
#>  [67] withr_3.0.2                 Biobase_2.67.0             
#>  [69] basilisk.utils_1.19.1       KEGGREST_1.47.0            
#>  [71] evaluate_1.0.3              segclust2d_0.3.3           
#>  [73] BiocFileCache_2.15.1        xml2_1.3.6                 
#>  [75] circlize_0.4.16             Biostrings_2.75.4          
#>  [77] pillar_1.10.1               BiocManager_1.30.25        
#>  [79] filelock_1.0.3              MatrixGenerics_1.19.1      
#>  [81] foreach_1.5.2               stats4_4.5.0               
#>  [83] generics_0.1.3              RCurl_1.98-1.16            
#>  [85] hms_1.1.3                   S4Vectors_0.45.4           
#>  [87] munsell_0.5.1               scales_1.3.0               
#>  [89] tools_4.5.0                 BiocIO_1.17.1              
#>  [91] data.table_1.16.4           GenomicAlignments_1.43.0   
#>  [93] XML_3.99-0.18               Cairo_1.6-2                
#>  [95] AnnotationDbi_1.69.0        colorspace_2.1-1           
#>  [97] GenomeInfoDbData_1.2.13     patchwork_1.3.0            
#>  [99] basilisk_1.19.1             restfulr_0.0.15            
#> [101] cli_3.6.4                   rappdirs_0.3.3             
#> [103] viridisLite_0.4.2           S4Arrays_1.7.3             
#> [105] gtable_0.3.6                sass_0.4.9                 
#> [107] digest_0.6.37               BiocGenerics_0.53.6        
#> [109] SparseArray_1.7.6           rjson_0.2.23               
#> [111] farver_2.1.2                memoise_2.0.1              
#> [113] htmltools_0.5.8.1           lifecycle_1.0.4            
#> [115] httr_1.4.7                  GlobalOptions_0.1.2        
#> [117] bit64_4.6.0-1