
Progenetix is an open data resource that provides curated individual cancer copy number aberration (CNA) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette is a guide to accessing and visualizing CNV frequency data from the Progenetix database. CNV frequency is pre-calculated from the CNV segment data in Progenetix and reflects the CNV pattern in a cohort. It is defined as the percentage of samples showing a CNV in a genomic region (here, 1 Mb genomic bins) out of the total number of samples in the cohort specified by filters. If your focus is on cancer cell lines, you can access data from cancercelllines.org by setting the dataset parameter to “cancercelllines”. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and now also includes additional types of genomic mutations.
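As a minimal sketch of this definition, the base R snippet below computes gain frequencies for a made-up cohort. The matrix has_gain and its values are hypothetical and only illustrate the arithmetic; they are not Progenetix data.

```r
# Hypothetical cohort: 4 samples (rows) x 3 genomic bins (columns).
# TRUE means the sample carries a gain overlapping that bin.
has_gain <- matrix(
  c(TRUE,  FALSE, FALSE,
    TRUE,  TRUE,  FALSE,
    FALSE, FALSE, FALSE,
    TRUE,  FALSE, FALSE),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample", 1:4), paste0("bin", 1:3))
)

# Frequency = percentage of samples with a gain in each bin
gain_frequency <- 100 * colSums(has_gain) / nrow(has_gain)
gain_frequency
```

The same calculation applies to losses, using a matrix that flags deletions instead of gains.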

1 Load library

library(pgxRpi)
library(SummarizedExperiment) # for pgxmatrix data
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
library(GenomicRanges) # for pgxfreq data

1.1 pgxLoader function

This function loads various types of data from the Progenetix database.

The parameters of this function used in this tutorial:

  • type A string specifying the output data type. Available options are “biosample”, “individual”, “variant” or “frequency”.
  • output A string specifying the output file format. When the parameter type is “frequency”, available options are “pgxfreq” or “pgxmatrix”.
  • filters Identifiers for cancer type, literature, cohorts, and age such as c(“NCIT:C7376”, “pgx:icdom-98353”, “PMID:22824167”, “pgx:cohort-TCGAcancers”, “age:>=P50Y”). For more information about filters, see the documentation.
  • codematches A logical value determining whether to exclude samples annotated with child concepts of the specified filters. It applies to filters from cancer type/tissue encoding systems (NCIt, icdom/t, Uberon). If TRUE, only samples encoded exactly by the specified filters are retrieved. Do not use this parameter when filters include identifiers unrelated to cancer type, such as PMID and cohort identifiers. Default is FALSE.
  • dataset A string specifying the dataset to query. Default is “progenetix”. The other available option is “cancercelllines”.

2 Retrieve CNV frequency data

2.1 Relevant parameters

type, output, filters, codematches, dataset

2.2 The first output format (output = “pgxfreq”)

freq_pgxfreq <- pgxLoader(type="frequency", output ="pgxfreq",
                         filters=c("NCIT:C4038","pgx:icdom-85003"))

freq_pgxfreq
#> GRangesList object of length 2:
#> $`NCIT:C4038`
#> GRanges object with 3106 ranges and 3 metadata columns:
#>          seqnames            ranges strand | gain_frequency loss_frequency
#>             <Rle>         <IRanges>  <Rle> |      <numeric>      <numeric>
#>      [1]        1          0-400000      * |          0.625          1.875
#>      [2]        1    400000-1400000      * |          1.250          8.750
#>      [3]        1   1400000-2400000      * |          2.500          9.375
#>      [4]        1   2400000-3400000      * |          2.500         13.750
#>      [5]        1   3400000-4400000      * |          3.125         14.375
#>      ...      ...               ...    ... .            ...            ...
#>   [3102]        Y 52400000-53400000      * |              0          0.625
#>   [3103]        Y 53400000-54400000      * |              0          0.625
#>   [3104]        Y 54400000-55400000      * |              0          0.625
#>   [3105]        Y 55400000-56400000      * |              0          0.625
#>   [3106]        Y 56400000-57227415      * |              0          0.625
#>                 no
#>          <integer>
#>      [1]         1
#>      [2]         2
#>      [3]         3
#>      [4]         4
#>      [5]         5
#>      ...       ...
#>   [3102]      3102
#>   [3103]      3103
#>   [3104]      3104
#>   [3105]      3105
#>   [3106]      3106
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths
#> 
#> $`pgx:icdom-85003`
#> GRanges object with 3106 ranges and 3 metadata columns:
#>          seqnames            ranges strand | gain_frequency loss_frequency
#>             <Rle>         <IRanges>  <Rle> |      <numeric>      <numeric>
#>      [1]        1          0-400000      * |          7.333          6.041
#>      [2]        1    400000-1400000      * |          9.644         11.995
#>      [3]        1   1400000-2400000      * |          7.437         14.554
#>      [4]        1   2400000-3400000      * |          9.564         25.883
#>      [5]        1   3400000-4400000      * |          7.373         24.655
#>      ...      ...               ...    ... .            ...            ...
#>   [3102]        Y 52400000-53400000      * |          0.056          1.051
#>   [3103]        Y 53400000-54400000      * |          0.056          1.051
#>   [3104]        Y 54400000-55400000      * |          0.056          1.051
#>   [3105]        Y 55400000-56400000      * |          0.056          1.051
#>   [3106]        Y 56400000-57227415      * |          0.064          1.059
#>                 no
#>          <integer>
#>      [1]         1
#>      [2]         2
#>      [3]         3
#>      [4]         4
#>      [5]         5
#>      ...       ...
#>   [3102]      3102
#>   [3103]      3103
#>   [3104]      3104
#>   [3105]      3105
#>   [3106]      3106
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

The returned data is stored in a GRangesList container, which consists of multiple GRanges objects. Each GRanges object stores the CNV frequency from samples specified by a particular filter. Within each GRanges object, the annotation columns “gain_frequency” and “loss_frequency” give, for each row, the percentage of samples (%) with gains and losses that overlap the corresponding genomic interval.

These genomic intervals are derived from the partitioning of the entire genome (GRCh38). Most of these bins have a size of 1MB, except for a few bins located near the telomeres. In total, there are 3106 intervals encompassing the genome.

To access the CNV frequency data for a specific filter, subset the list by the filter name:

freq_pgxfreq[["NCIT:C4038"]]
#> GRanges object with 3106 ranges and 3 metadata columns:
#>          seqnames            ranges strand | gain_frequency loss_frequency
#>             <Rle>         <IRanges>  <Rle> |      <numeric>      <numeric>
#>      [1]        1          0-400000      * |          0.625          1.875
#>      [2]        1    400000-1400000      * |          1.250          8.750
#>      [3]        1   1400000-2400000      * |          2.500          9.375
#>      [4]        1   2400000-3400000      * |          2.500         13.750
#>      [5]        1   3400000-4400000      * |          3.125         14.375
#>      ...      ...               ...    ... .            ...            ...
#>   [3102]        Y 52400000-53400000      * |              0          0.625
#>   [3103]        Y 53400000-54400000      * |              0          0.625
#>   [3104]        Y 54400000-55400000      * |              0          0.625
#>   [3105]        Y 55400000-56400000      * |              0          0.625
#>   [3106]        Y 56400000-57227415      * |              0          0.625
#>                 no
#>          <integer>
#>      [1]         1
#>      [2]         2
#>      [3]         3
#>      [4]         4
#>      [5]         5
#>      ...       ...
#>   [3102]      3102
#>   [3103]      3103
#>   [3104]      3104
#>   [3105]      3105
#>   [3106]      3106
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

To get metadata such as the number of samples used to calculate the frequency, use the mcols function, which is available once the GenomicRanges package is loaded:

mcols(freq_pgxfreq)
#> DataFrame with 2 rows and 3 columns
#>                          filter                  label sample_count
#>                     <character>            <character>    <numeric>
#> NCIT:C4038           NCIT:C4038   Lung Carcinoid Tumor          160
#> pgx:icdom-85003 pgx:icdom-85003 Infiltrating duct ca..        12464

The parameter codematches determines whether the calculation of CNV frequency excludes samples from child terms.
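The effect of excluding child terms can be sketched with a toy example in base R. Everything here is hypothetical (the codes PARENT, CHILD_A and CHILD_B, the sample table and the helper select_samples); it only illustrates how the cohort, and therefore the frequency, changes, and is not how pgxLoader is implemented.

```r
# Hypothetical samples annotated with a parent term or one of its child terms
samples <- data.frame(
  id = paste0("s", 1:6),
  code = c("PARENT", "PARENT", "CHILD_A", "CHILD_A", "CHILD_B", "CHILD_B"),
  has_gain = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
# Hypothetical term hierarchy
children <- list(PARENT = c("CHILD_A", "CHILD_B"))

# Keep exact matches only (codematches = TRUE) or also child terms (FALSE)
select_samples <- function(samples, filter, children, codematches = FALSE) {
  codes <- if (codematches) filter else c(filter, children[[filter]])
  samples[samples$code %in% codes, ]
}

gain_percent <- function(x) 100 * mean(x$has_gain)

freq_with_children <- gain_percent(select_samples(samples, "PARENT", children))
freq_exact_only <- gain_percent(select_samples(samples, "PARENT", children,
                                               codematches = TRUE))
c(with_children = freq_with_children, exact_only = freq_exact_only)
```

In this toy cohort the frequency is computed over six samples when child terms are included and over two samples when only exact matches are kept.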

2.3 The second output format (output = “pgxmatrix”)

Choose 8 NCIT codes of interest that correspond to different tumor types:

code <- c("C3059","C3716","C4917","C3512","C3493","C3771","C4017","C4001")
# add the "NCIT:" prefix required for queries
code <- paste0("NCIT:", code)

Load data with the specified codes:

freq_pgxmatrix <- pgxLoader(type="frequency",output ="pgxmatrix",filters=code)
freq_pgxmatrix
#> class: RangedSummarizedExperiment 
#> dim: 6212 8 
#> metadata(0):
#> assays(1): frequency
#> rownames(6212): 1 2 ... 6211 6212
#> rowData names(1): type
#> colnames(8): NCIT:C3059 NCIT:C3493 ... NCIT:C4017 NCIT:C4917
#> colData names(3): filter label sample_count

The returned data is stored in a RangedSummarizedExperiment object, which is a matrix-like container where rows represent ranges of interest (as a GRanges object) and columns represent filters.

To get metadata such as the number of samples used to calculate the frequency, use the colData function from the SummarizedExperiment package:

colData(freq_pgxmatrix)
#> DataFrame with 8 rows and 3 columns
#>                 filter                  label sample_count
#>            <character>            <character>    <numeric>
#> NCIT:C3059  NCIT:C3059                 Glioma         8183
#> NCIT:C3493  NCIT:C3716 Lung Squamous Cell C..         1938
#> NCIT:C3512  NCIT:C4917    Lung Adenocarcinoma         4664
#> NCIT:C3716  NCIT:C3512 Primitive Neuroectod..         2214
#> NCIT:C3771  NCIT:C3493 Breast Lobular Carci..          904
#> NCIT:C4001  NCIT:C3771 Breast Inflammatory ..           27
#> NCIT:C4017  NCIT:C4017 Breast Ductal Carcin..        10183
#> NCIT:C4917  NCIT:C4001 Lung Small Cell Carc..          558

To access the CNV frequency matrix, use the assay accessor from the SummarizedExperiment package:

head(assay(freq_pgxmatrix))
#>   NCIT:C3059 NCIT:C3493 NCIT:C3512 NCIT:C3716 NCIT:C3771 NCIT:C4001 NCIT:C4017
#> 1      3.410      5.315      1.951      3.568      0.996      3.704      8.720
#> 2      8.457      8.978      6.089      7.678      2.434      3.704     11.166
#> 3     10.644      6.553      5.832      8.988      2.655      3.704      8.642
#> 4     11.964     10.475     12.693      8.762      4.204      3.704     10.734
#> 5     12.440      8.824     11.342      9.033      3.761      3.704      8.279
#> 6      9.068      8.359     10.828      7.949      2.655      7.407      5.755
#>   NCIT:C4917
#> 1     11.470
#> 2     25.448
#> 3     24.373
#> 4     29.032
#> 5     27.419
#> 6     27.240

The matrix has 6212 rows (genomic regions) and 8 columns (filters). The rows comprise 3106 intervals with “gain status” plus 3106 intervals with “loss status”.

Each value is the percentage of samples from the corresponding filter that have one or more CNV events in the specific genomic interval. You can get the interval information with the rowRanges function from the SummarizedExperiment package:

rowRanges(freq_pgxmatrix)
#> GRanges object with 6212 ranges and 1 metadata column:
#>        seqnames            ranges strand |        type
#>           <Rle>         <IRanges>  <Rle> | <character>
#>      1        1          0-400000      * |         DUP
#>      2        1    400000-1400000      * |         DUP
#>      3        1   1400000-2400000      * |         DUP
#>      4        1   2400000-3400000      * |         DUP
#>      5        1   3400000-4400000      * |         DUP
#>    ...      ...               ...    ... .         ...
#>   6208        Y 52400000-53400000      * |         DEL
#>   6209        Y 53400000-54400000      * |         DEL
#>   6210        Y 54400000-55400000      * |         DEL
#>   6211        Y 55400000-56400000      * |         DEL
#>   6212        Y 56400000-57227415      * |         DEL
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

For example, the value in the second row and first column is 8.457, which means that 8.457% of samples from the corresponding filter NCIT:C3059 have one or more duplication events in the genomic interval chromosome 1: 400000-1400000.
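Because the value is a percentage of the cohort, it can be converted back to an approximate number of samples using the sample_count reported by colData. The snippet is plain arithmetic on the numbers printed in this vignette; the result is only approximate, since the printed frequency appears to be rounded to three decimals.

```r
frequency_percent <- 8.457  # value printed for NCIT:C3059 in row 2
sample_count <- 8183        # sample_count printed for NCIT:C3059 in colData

# Approximate number of samples with a duplication in chr1:400000-1400000
approx_samples <- round(frequency_percent / 100 * sample_count)
approx_samples
```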

Note: this matrix is different from the CNV status matrix introduced in Introduction_2_loadvariants. A value in this matrix is the percentage (%) of samples having one or more CNVs that overlap the binned interval. A value in the CNV status matrix is a fraction for an individual sample, indicating how much of the binned interval overlaps one or more CNVs in that sample.
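The difference can be sketched in base R with a small made-up status matrix. The values are hypothetical, and the sketch assumes that any non-zero coverage counts as a CNV in the bin, which may differ from the overlap criteria used by Progenetix.

```r
# Hypothetical CNV status matrix: 3 samples (rows) x 4 bins (columns).
# Each value is the fraction of the bin covered by gains in that sample.
status <- matrix(
  c(0.0, 0.7, 1.0, 0.0,
    0.0, 0.0, 0.2, 0.0,
    0.0, 1.0, 0.0, 0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("sample", 1:3), paste0("bin", 1:4))
)

# Cohort-level frequency: percentage of samples with any gain in each bin
frequency <- 100 * colMeans(status > 0)
frequency
```

The status matrix keeps one row per sample, while the frequency vector summarizes the whole cohort into one value per bin.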

3 Calculate CNV frequency data

3.1 segtoFreq function

This function computes the binned CNV frequency from segment data.

The parameters of this function:

  • data: Segment data with CNV states. The first four columns should specify sample ID, chromosome, start position, and end position, respectively. The column representing CNV states should contain either “DUP” for duplications or “DEL” for deletions.
  • cnv_column_idx: Index of the column specifying CNV state. Default is 6, following the “pgxseg” format used in Progenetix. If the input segment data uses the general .seg file format, it might need to be set differently.
  • cohort_name: A string specifying the cohort name. Default is “unspecified cohort”.
  • assembly: A string specifying the genome assembly version for CNV frequency calculation. Allowed options are “hg19” or “hg38”. Default is “hg38”.
  • bin_size: Size of genomic bins used to split the genome, in base pairs (bp). Default is 1,000,000.
  • overlap: Numeric value defining the amount of overlap between bins and segments considered as bin-specific CNV, in base pairs (bp). Default is 1,000.
  • soft_expansion: Fraction of bin_size to determine merge criteria. During the generation of genomic bins, division starts at the centromere and expands towards the telomeres on both sides. If the size of the last bin is smaller than soft_expansion * bin_size, it will be merged with the previous bin. Default is 0.1.
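The interplay of bin_size and soft_expansion can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used by segtoFreq: the helper names arm_breaks and make_bins, the toy chromosome length and the centromere position are all hypothetical.

```r
# Break points along one chromosome arm, as distances from the centromere.
# A final bin shorter than soft_expansion * bin_size is merged into the
# previous bin.
arm_breaks <- function(arm_length, bin_size = 1e6, soft_expansion = 0.1) {
  breaks <- seq(0, arm_length, by = bin_size)
  remainder <- arm_length - breaks[length(breaks)]
  if (remainder == 0) return(breaks)
  if (remainder < soft_expansion * bin_size && length(breaks) > 1) {
    breaks[length(breaks)] <- arm_length  # merge short tail into previous bin
  } else {
    breaks <- c(breaks, arm_length)       # keep the tail as its own bin
  }
  breaks
}

# Bins for one chromosome, expanding from the centromere to both telomeres
make_bins <- function(chrom_length, centromere, bin_size = 1e6,
                      soft_expansion = 0.1) {
  p_arm <- centromere - rev(arm_breaks(centromere, bin_size, soft_expansion))
  q_arm <- centromere + arm_breaks(chrom_length - centromere, bin_size,
                                   soft_expansion)
  breaks <- unique(c(p_arm, q_arm))
  data.frame(start = head(breaks, -1), end = tail(breaks, -1))
}

# Toy chromosome: 6.45 Mb long with a hypothetical centromere at 3.4 Mb
toy_bins <- make_bins(chrom_length = 6450000, centromere = 3400000)
toy_bins$size <- toy_bins$end - toy_bins$start
toy_bins
```

In this toy case the p-arm ends with a 400,000 bp bin, which is kept because it exceeds 10% of the bin size, while the 50,000 bp remainder on the q-arm is merged into the last full bin, giving a final bin of 1,050,000 bp.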

Suppose you have segment data from several biosamples:

# access variant data
vardata <- pgxLoader(type="variant",biosample_id = c("pgxbs-kftvhmz9", "pgxbs-kftvhnqz","pgxbs-kftvhupd"),output="pgxseg")
# only keep segment cnv data
segdata <- vardata[vardata$variant_type %in% c("DUP","DEL"),]

You can then calculate the CNV frequency from this cohort comprised of these samples. The output is stored in “pgxfreq” format:

segfreq <- segtoFreq(segdata,cohort_name="c1")
segfreq
#> GRangesList object of length 1:
#> $c1
#> GRanges object with 3106 ranges and 2 metadata columns:
#>          seqnames            ranges strand | gain_frequency loss_frequency
#>             <Rle>         <IRanges>  <Rle> |      <numeric>      <numeric>
#>      [1]        1          0-400000      * |              0         0.0000
#>      [2]        1    400000-1400000      * |              0         0.0000
#>      [3]        1   1400000-2400000      * |              0         0.0000
#>      [4]        1   2400000-3400000      * |              0         0.0000
#>      [5]        1   3400000-4400000      * |              0        33.3333
#>      ...      ...               ...    ... .            ...            ...
#>   [3102]        Y 52400000-53400000      * |              0              0
#>   [3103]        Y 53400000-54400000      * |              0              0
#>   [3104]        Y 54400000-55400000      * |              0              0
#>   [3105]        Y 55400000-56400000      * |              0              0
#>   [3106]        Y 56400000-57227415      * |              0              0
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths
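To make the calculation more concrete, here is a rough base R sketch of a binned frequency computation on made-up segments. It is not the segtoFreq implementation. The helper bin_frequency and the toy data are hypothetical, and the sketch assumes that a segment counts for a bin when it overlaps the bin by at least min_overlap base pairs and that the cohort size is the number of distinct sample IDs in the segment data.

```r
# Hypothetical segment data for three samples on chromosome 1
segs <- data.frame(
  sample_id = c("s1", "s1", "s2", "s3"),
  chr = c("1", "1", "1", "1"),
  start = c(500000, 3500000, 3600000, 100),
  end = c(1200000, 4200000, 3600500, 900),
  variant_type = c("DUP", "DEL", "DEL", "DUP"),
  stringsAsFactors = FALSE
)

# First five bins of chromosome 1, as printed in the output above
bins <- data.frame(
  chr = "1",
  start = c(0, 400000, 1400000, 2400000, 3400000),
  end = c(400000, 1400000, 2400000, 3400000, 4400000),
  stringsAsFactors = FALSE
)

# Percentage of samples with a segment of the given state in each bin
bin_frequency <- function(segs, bins, state, min_overlap = 1000) {
  n_samples <- length(unique(segs$sample_id))
  hits <- segs[segs$variant_type == state, ]
  vapply(seq_len(nrow(bins)), function(i) {
    # overlap width in bp between each segment and bin i (negative if none)
    ov <- pmin(hits$end, bins$end[i]) - pmax(hits$start, bins$start[i])
    keep <- hits$chr == bins$chr[i] & ov >= min_overlap
    100 * length(unique(hits$sample_id[keep])) / n_samples
  }, numeric(1))
}

gain <- bin_frequency(segs, bins, "DUP")
loss <- bin_frequency(segs, bins, "DEL")
cbind(bins, gain_frequency = gain, loss_frequency = loss)
```

With three samples, a single affected sample yields a frequency of 33.3%, and the two short segments (under 1,000 bp of overlap) do not contribute.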

4 Visualization of CNV frequency data using ‘pgxfreq’ format

4.1 pgxFreqplot function

This function draws CNV frequency plots for the whole genome or for selected chromosomes.

The parameters of this function:

  • data: frequency object returned by pgxLoader function.
  • chrom: a vector with chromosomes to be plotted. If NULL, return the plot by genome. If specified the frequencies are plotted with one panel for each chromosome. Default is NULL.
  • layout: number of columns and rows in plot. Only used in plot by chromosome. Default is c(1,1).
  • filters: Index or string value to indicate which filter to be plotted. The length of filters is limited to one if the parameter circos is False. Default is the first filter.
  • circos: a logical value to indicate if return a circos plot. If TRUE, it can return a circos plot with multiple group ids for display and comparison. Default is FALSE.
  • highlight: Indices of genomic bins to be highlighted with red color.
  • assembly: A string specifying which genome assembly version should be applied to CNV frequency plotting. Allowed options are “hg19”, “hg38”. Default is “hg38” (genome version used in Progenetix).

4.2 CNV frequency plot by genome

4.2.1 Input is pgxfreq object.

pgxFreqplot(freq_pgxfreq, filters="pgx:icdom-85003")

4.2.2 Input is pgxmatrix object.

pgxFreqplot(freq_pgxmatrix, filters = "NCIT:C3512")

4.3 CNV frequency plot by chromosomes

pgxFreqplot(freq_pgxfreq, filters='NCIT:C4038',chrom=c(1,2,3), layout = c(3,1))  

4.4 CNV frequency circos plot

pgxFreqplot(freq_pgxfreq, filters='pgx:icdom-85003', circos = TRUE)

The circos plot also supports comparison of multiple groups:

pgxFreqplot(freq_pgxfreq,filters= c("NCIT:C4038","pgx:icdom-85003"),circos = TRUE) 

4.5 Highlight interesting genomic intervals

If you want to look at the CNV frequency at specific genomic bins, you can use the highlight parameter. For example, suppose you are interested in the CNV pattern of the CCND1 gene in samples with infiltrating duct carcinoma (icdom-85003). You can first find the genomic bin where CCND1 (chr11:69641156-69654474) is located.

# Extract the CNV frequency data frame of samples from 'icdom-85003' from 
# the previously returned object
freq_IDC <- freq_pgxfreq[['pgx:icdom-85003']]
# search the genomic bin where CCND1 is located
bin <- which(seqnames(freq_IDC) == 11 & start(freq_IDC) <= 69641156 &  
             end(freq_IDC) >= 69654474)
freq_IDC[bin,]
#> GRanges object with 1 range and 3 metadata columns:
#>       seqnames            ranges strand | gain_frequency loss_frequency
#>          <Rle>         <IRanges>  <Rle> |      <numeric>      <numeric>
#>   [1]       11 69400000-70400000      * |         30.375          9.074
#>              no
#>       <integer>
#>   [1]      1887
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Then you can highlight this genomic bin as follows:

pgxFreqplot(freq_pgxfreq,filters = 'pgx:icdom-85003', chrom = 11,highlight = bin)

Note: For CNV analysis of specific genes, the highlighted plot is only a rough reference, because the bin size in frequency plots is 1 Mb and a single bin may cover multiple genes.
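The bin search shown earlier requires the gene to lie entirely within one bin, so it returns nothing for a gene that straddles a bin boundary. A possible alternative is an overlap-based lookup, sketched here in base R on a small toy bin table. The helper overlapping_bins and the table are hypothetical; only the 69400000-70400000 bin and the CCND1 coordinates come from this vignette, and the returned indices refer to the toy table, not to the 3106 Progenetix bins.

```r
# Toy bin table around CCND1 on chromosome 11 (neighbouring bins are assumed)
bins <- data.frame(
  chr = "11",
  start = c(67400000, 68400000, 69400000, 70400000),
  end = c(68400000, 69400000, 70400000, 71400000),
  stringsAsFactors = FALSE
)

# Indices of all bins that overlap the query interval
overlapping_bins <- function(bins, chr, start, end) {
  which(bins$chr == chr & bins$start < end & bins$end > start)
}

# CCND1 falls within a single bin
ccnd1_bins <- overlapping_bins(bins, "11", 69641156, 69654474)
# A hypothetical interval crossing the 69400000 boundary hits two bins
straddling_bins <- overlapping_bins(bins, "11", 69390000, 69410000)

ccnd1_bins
straddling_bins
```

The strict inequalities treat a position shared by two adjacent bins as a boundary, so an interval that only touches a bin edge is not counted as overlapping that bin.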

Highlighting is also available for genome plots and circos plots, and you can highlight multiple bins by passing a vector of indices.

pgxFreqplot(freq_pgxfreq,filters = 'pgx:icdom-85003',highlight = c(1:100))

5 Session Info

#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.35.0 Biobase_2.65.0             
#>  [3] GenomicRanges_1.57.0        GenomeInfoDb_1.41.0        
#>  [5] IRanges_2.39.0              S4Vectors_0.43.0           
#>  [7] BiocGenerics_0.51.0         MatrixGenerics_1.17.0      
#>  [9] matrixStats_1.3.0           pgxRpi_1.1.2               
#> [11] BiocStyle_2.33.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9              SparseArray_1.5.0       shape_1.4.6.1          
#>  [4] lattice_0.22-6          digest_0.6.35           magrittr_2.0.3         
#>  [7] evaluate_0.23           attempt_0.3.1           grid_4.4.0             
#> [10] bookdown_0.39           circlize_0.4.16         fastmap_1.1.1          
#> [13] plyr_1.8.9              jsonlite_1.8.8          Matrix_1.7-0           
#> [16] GlobalOptions_0.1.2     tinytex_0.50            BiocManager_1.30.22    
#> [19] httr_1.4.7              UCSC.utils_1.1.0        jquerylib_0.1.4        
#> [22] abind_1.4-5             cli_3.6.2               rlang_1.1.3            
#> [25] crayon_1.5.2            XVector_0.45.0          cachem_1.0.8           
#> [28] DelayedArray_0.31.0     yaml_2.3.8              S4Arrays_1.5.0         
#> [31] tools_4.4.0             colorspace_2.1-0        GenomeInfoDbData_1.2.12
#> [34] curl_5.2.1              R6_2.5.1                lifecycle_1.0.4        
#> [37] magick_2.8.3            zlibbioc_1.51.0         bslib_0.7.0            
#> [40] Rcpp_1.0.12             xfun_0.43               highr_0.10             
#> [43] knitr_1.46              htmltools_0.5.8.1       rmarkdown_2.26         
#> [46] compiler_4.4.0