Progenetix is an open data resource that provides curated individual cancer copy number aberration (CNA) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette shows how to access genomic variant data in the Progenetix database. If your focus is cancer cell lines, you can access data from cancercelllines.org by setting the dataset parameter to “cancercelllines”. That repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and now also includes additional types of genomic mutations.

1 Load library

library(pgxRpi)

1.1 pgxLoader function

This function loads various types of data from the Progenetix database.

The parameters of this function used in this tutorial:

  • type A string specifying the output data type. Available options are “biosample”, “individual”, “variant” or “frequency”.
  • output A string specifying the output file format. When the parameter type is “variant”, available options are NULL, “pgxseg”, “pgxmatrix”, “coverage” or “seg”.
  • filters Identifiers for cancer type, literature, cohorts, and age such as c(“NCIT:C7376”, “pgx:icdom-98353”, “PMID:22824167”, “pgx:cohort-TCGAcancers”, “age:>=P50Y”). For more information about filters, see the documentation.
  • individual_id Identifiers used in the Progenetix database for identifying individuals.
  • biosample_id Identifiers used in the Progenetix database for identifying biosamples.
  • codematches A logical value determining whether to exclude samples from child concepts of the specified filters that belong to a cancer type/tissue encoding system (NCIt, icdom/t, Uberon). If TRUE, only samples encoded exactly by the specified filters are kept. Do not use this parameter when filters include identifiers unrelated to cancer type, such as PMID and cohort identifiers. Default is FALSE.
  • limit Integer specifying the number of CNV coverage profiles returned for each filter. Default is 0 (return all).
  • skip Integer specifying the number of pages of CNV coverage profiles to skip for each filter, where each page holds limit profiles. E.g. if skip = 2 and limit = 500, the first 2*500 = 1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip).
  • save_file A logical value determining whether to save the segment variant data to a file instead of returning it directly. Only used when the parameter type is “variant” and output is “pgxseg” or “seg”. Default is FALSE.
  • filename A string specifying the path and name of the file to be saved. Only used if the parameter save_file is TRUE. Default is “variants.seg/pgxseg” in the current working directory.
  • dataset A string specifying the dataset to query. Default is “progenetix”. The other available option is “cancercelllines”.

2 Retrieve CNV coverage of biosamples

2.1 Relevant parameters

type, output, filters, individual_id, biosample_id, codematches, skip, limit, dataset

2.2 Across genomic bins

cnv_matrix <- pgxLoader(type="variant", output="pgxmatrix", filters = "NCIT:C2948")

The data looks like this

print(dim(cnv_matrix))
#> [1]   47 6215
cnv_matrix[c(1:3), c(1:5,6213:6215)]
#>      analysis_id   biosample_id   group_id chr1.0.400000.DUP
#> 1 pgxcs-kftvs0ri pgxbs-kftvh262 NCIT:C2948                 0
#> 2 pgxcs-kftvu9w2 pgxbs-kftvh9fp NCIT:C2948                 0
#> 3 pgxcs-kftvw1kw pgxbs-kftvhf4h NCIT:C8893                 0
#>   chr1.400000.1400000.DUP chrY.54400000.55400000.DEL chrY.55400000.56400000.DEL
#> 1                       0                          0                          0
#> 2                       0                          1                          1
#> 3                       0                          0                          0
#>   chrY.56400000.57227415.DEL
#> 1                          0
#> 2                          1
#> 3                          0

In this data frame, analysis_id identifies an individual analysis and biosample_id identifies an individual biosample. Note that the number of analysis profiles does not necessarily equal the number of samples, because one biosample_id may correspond to multiple analysis_id values. group_id corresponds to the filters parameter. These columns are followed by all “gain status” columns (3106 intervals) and then all “loss status” columns (3106 intervals). The status is given as a coverage value, i.e. the fraction of the binned interval that overlaps with one or more CNVs of the given type (DUP/DEL). For example, if the column chr1.400000.1400000.DUP is 0.200 in one row, one or more duplication events overlap 20% of the genomic bin at chromosome 1: 400000-1400000 in the corresponding analysis.
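
The column naming scheme described above can be unpacked with base R. The following is a minimal sketch, assuming every status column follows the chr.start.end.type layout seen in the output and that a bin's width is simply end minus start; parse_bin_names is a hypothetical helper and not part of pgxRpi.

```r
# Hypothetical helper (not part of pgxRpi): split status column names such as
# "chr1.400000.1400000.DUP" into chromosome, start, end and CNV type.
parse_bin_names <- function(bin_names) {
  parts <- strsplit(bin_names, ".", fixed = TRUE)
  data.frame(
    chrom = vapply(parts, `[`, character(1), 1),
    start = as.numeric(vapply(parts, `[`, character(1), 2)),
    end   = as.numeric(vapply(parts, `[`, character(1), 3)),
    type  = vapply(parts, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

# Column names copied from the output shown above
bins <- parse_bin_names(c("chr1.0.400000.DUP",
                          "chr1.400000.1400000.DUP",
                          "chrY.56400000.57227415.DEL"))

# Assumption: bin width is end - start
bins$width <- bins$end - bins$start

# A coverage value of 0.2 in chr1.400000.1400000.DUP then corresponds to
# 0.2 * width duplicated bases within that bin
covered_bases <- 0.2 * bins$width[2]

# Column count: 3 identifier columns + 3106 DUP bins + 3106 DEL bins
n_columns <- 3 + 2 * 3106
```

With real data, the same helper could be applied to colnames(cnv_matrix)[-(1:3)] to obtain the coordinates of every bin.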

2.3 Across chromosomes or the whole genome

cnv_covergae <- pgxLoader(type="variant", output="coverage", filters = "NCIT:C2948")

The result includes CNV coverage across chromosome arms, whole chromosomes, and the whole genome.

names(cnv_covergae)
#> [1] "chrom_arm_coverage"    "whole_chrom_coverage"  "whole_genome_coverage"

The data of CNV coverage across chromosomal arms looks like this

head(cnv_covergae$chrom_arm_coverage)[,c(1:4, 49:52)]
#>                chr1p.dup chr1q.dup chr2p.dup chr2q.dup chr1p.del chr1q.del
#> pgxbs-kftvh262         0     0.000     0.000     0.000     0.000         0
#> pgxbs-kftvh9fp         0     0.000     0.000     0.000     0.000         0
#> pgxbs-kftvhf4h         0     0.000     0.000     0.000     0.000         0
#> pgxbs-kftvhf4i         0     0.000     0.000     0.000     0.000         0
#> pgxbs-kftvhf4k         0     0.000     0.000     0.000     0.225         0
#> pgxbs-kftvhf4m         0     0.979     0.989     0.003     0.000         0
#>                chr2p.del chr2q.del
#> pgxbs-kftvh262         0         0
#> pgxbs-kftvh9fp         0         0
#> pgxbs-kftvhf4h         0         0
#> pgxbs-kftvhf4i         0         0
#> pgxbs-kftvhf4k         0         0
#> pgxbs-kftvhf4m         0         0

The row names are the ids of biosamples from the group NCIT:C2948. There are 96 columns. The first 48 columns are duplication coverage across chromosomal arms, followed by 48 columns of deletion coverage. The CNV coverage data across whole chromosomes has the same layout and differs only in its columns.
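
To make the 96-column layout concrete, here is a small sketch on mock data. It assumes 24 chromosomes (1 to 22, X, Y) with a p and a q arm each, and that the X and Y columns follow the same naming pattern as the chr1 and chr2 columns shown above; the sample names and values are invented for illustration.

```r
# Assumed naming: chr<N><arm>.<dup|del> for chromosomes 1-22, X and Y
chroms <- c(1:22, "X", "Y")
arms <- paste0("chr", rep(chroms, each = 2), c("p", "q"))
col_names <- c(paste0(arms, ".dup"), paste0(arms, ".del"))

# Mock coverage table with two invented samples (all values are made up)
mock_arm_cov <- matrix(0, nrow = 2, ncol = length(col_names),
                       dimnames = list(c("sample-a", "sample-b"), col_names))
mock_arm_cov["sample-b", "chr1q.dup"] <- 0.979

# Select duplication and deletion columns by suffix rather than by position
dup_cov <- mock_arm_cov[, grepl("\\.dup$", colnames(mock_arm_cov)), drop = FALSE]
del_cov <- mock_arm_cov[, grepl("\\.del$", colnames(mock_arm_cov)), drop = FALSE]
```

Selecting by suffix avoids hard-coding the positions 1:48 and 49:96, so the same lines should also work on the whole-chromosome table if its columns use the same suffixes.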

The data of CNV coverage across the whole genome (hg38) looks like this

head(cnv_covergae$whole_genome_coverage)
#>                cnvfraction dupfraction delfraction
#> pgxbs-kftvh262       0.080       0.036       0.044
#> pgxbs-kftvh9fp       0.058       0.010       0.048
#> pgxbs-kftvhf4h       0.000       0.000       0.000
#> pgxbs-kftvhf4i       0.000       0.000       0.000
#> pgxbs-kftvhf4k       0.027       0.000       0.027
#> pgxbs-kftvhf4m       0.176       0.159       0.017

The first column is the total called coverage, followed by duplication coverage and deletion coverage.
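
As a quick sanity check, the six rows printed above can be rebuilt by hand and compared. In these rows the total fraction happens to equal the sum of the duplication and deletion fractions; this sketch does not assume that holds for every profile.

```r
# Values copied from the output shown above
genome_cov <- data.frame(
  cnvfraction = c(0.080, 0.058, 0.000, 0.000, 0.027, 0.176),
  dupfraction = c(0.036, 0.010, 0.000, 0.000, 0.000, 0.159),
  delfraction = c(0.044, 0.048, 0.000, 0.000, 0.027, 0.017),
  row.names = c("pgxbs-kftvh262", "pgxbs-kftvh9fp", "pgxbs-kftvhf4h",
                "pgxbs-kftvhf4i", "pgxbs-kftvhf4k", "pgxbs-kftvhf4m")
)

# Compare the total with dup + del, allowing for three-decimal rounding
difference <- abs(genome_cov$cnvfraction -
                  (genome_cov$dupfraction + genome_cov$delfraction))
consistent <- difference < 1e-3

# Biosample with the largest fraction of the genome covered by CNVs
most_altered <- rownames(genome_cov)[which.max(genome_cov$cnvfraction)]
```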

2.4 Parameter codematches use

Setting codematches = TRUE excludes profiles whose group_id belongs to a child term of the input filters.

In this case, 21 of the original 47 samples are excluded.

cnv_covergae_2 <- pgxLoader(type="variant", output="coverage", filters = "NCIT:C2948",
                            codematches = TRUE)

print(dim(cnv_covergae$chrom_arm_coverage))
#> [1] 47 96
print(dim(cnv_covergae_2$chrom_arm_coverage))
#> [1] 26 96
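
The effect of codematches can be pictured as an exact match on group_id. The sketch below uses a small mock table with invented sample names; it only illustrates the idea and is not how pgxLoader applies the parameter internally.

```r
# Mock profiles: sample names are invented; NCIT:C8893 is the group_id that
# appears in the third row of the matrix output shown earlier
mock_profiles <- data.frame(
  biosample_id = c("sample-1", "sample-2", "sample-3", "sample-4"),
  group_id     = c("NCIT:C2948", "NCIT:C2948", "NCIT:C8893", "NCIT:C2948"),
  stringsAsFactors = FALSE
)

query_filter <- "NCIT:C2948"

# Keep only profiles encoded exactly by the queried filter
exact_matches <- mock_profiles[mock_profiles$group_id == query_filter, ]

# Profiles that were returned only because of a different (child) code
dropped <- setdiff(mock_profiles$biosample_id, exact_matches$biosample_id)
```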

2.5 Access a subset of samples

By default, pgxLoader returns all available profiles (limit=0), so the query may take a while when the number of retrieved samples is large. You can use the parameters limit and skip to access a subset of samples.

cnv_matrix_2 <- pgxLoader(type="variant", output="pgxmatrix", 
                          filters = "NCIT:C2948",
                          skip = 0, limit=10)
# the dimension of the subset
print(dim(cnv_matrix_2))
#> [1]   10 6215
# the dimension of the original set
print(dim(cnv_matrix))
#> [1]   47 6215
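
The interplay of skip and limit is page arithmetic: skip counts pages of limit profiles each. The following sketch reproduces that arithmetic offline, assuming profiles are numbered from 1; page_indices is a hypothetical helper and does not query Progenetix.

```r
# Hypothetical helper: which profile positions would one page contain,
# given skip (pages to skip), limit (page size) and the total profile count?
page_indices <- function(skip, limit, total) {
  if (is.null(skip)) skip <- 0                # default: nothing skipped
  if (limit == 0) return(seq_len(total))      # limit = 0 returns everything
  first <- skip * limit + 1
  if (first > total) return(integer(0))       # page lies beyond the data
  first:min(first + limit - 1, total)
}

# skip = 2, limit = 500: the first 1000 profiles are skipped
page_a <- page_indices(skip = 2, limit = 500, total = 2000)

# 47 profiles in pages of 10: the fifth page (skip = 4) is only partly filled
page_b <- page_indices(skip = 4, limit = 10, total = 47)
```

To walk through all profiles in pages, one could call pgxLoader repeatedly with increasing skip until fewer than limit rows come back.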

2.6 Access by biosample id and individual id

cnv_ind_matrix <- pgxLoader(type="variant", output="pgxmatrix", 
                          biosample_id = "pgxbs-kftva604",
                          individual_id = "pgxind-kftx5g4t")

cnv_ind_cov <- pgxLoader(type="variant", output="coverage", 
                          biosample_id = "pgxbs-kftva604",
                          individual_id = "pgxind-kftx5g4t")

3 Retrieve segment variants

Because of time-out problems, segment variant data can only be accessed by biosample id, not by filters.

3.1 Relevant parameters

type, output, biosample_id, save_file, filename, dataset

3.2 Get biosample id

Biosample information is also obtained with pgxLoader. For the vignette on metadata queries, see Introduction_1_loadmetadata.

biosamples <- pgxLoader(type="biosample", filters = "PMID:20229506", limit=2)

biosample_id <- biosamples$biosample_id

There are three output formats.

3.3 The first output format (by default)

This format contains the variant id together with the associated biosample id and analysis id. Each variant is represented as ‘DUP’ (duplication) or ‘DEL’ (deletion) at a specific chromosomal location.

variant_1 <- pgxLoader(type="variant", biosample_id = biosample_id)
head(variant_1)
#>                        variant_id   biosample_id    analysis_id
#> 1 pgxvar-5c865fd809d374f2dc35e41e pgxbs-kftviq25 pgxcs-kftwah0f
#> 2 pgxvar-5c865fd809d374f2dc35e41f pgxbs-kftviq25 pgxcs-kftwah0f
#> 3 pgxvar-5c865fd809d374f2dc35e420 pgxbs-kftviq25 pgxcs-kftwah0f
#> 4 pgxvar-5c865fd809d374f2dc35e421 pgxbs-kftviq25 pgxcs-kftwah0f
#> 5 pgxvar-5c865fd809d374f2dc35e422 pgxbs-kftviq25 pgxcs-kftwah0f
#> 6 pgxvar-5c865fd809d374f2dc35e423 pgxbs-kftviq25 pgxcs-kftwah0f
#>      reference_genome                 variant variant_log2 variant_copychange
#> 1 refseq:NC_000001.11      1:61736-785910:DUP       0.4964        efo:0030070
#> 2 refseq:NC_000001.11    1:787938-3830146:DEL      -0.3306        efo:0030067
#> 3 refseq:NC_000001.11  1:3830314-34272616:DEL      -0.4466        efo:0030067
#> 4 refseq:NC_000001.11 1:34273097-72284670:DEL      -0.4986        efo:0030067
#> 5 refseq:NC_000001.11 1:72290431-72297688:DEL      -2.1683        efo:0030067
#> 6 refseq:NC_000001.11 1:72302736-72345466:DUP       1.3714        efo:0030070

3.4 The second output format (output = “pgxseg”)

This is the ‘.pgxseg’ file format. It contains segment mean values (in the log2 column), which are equal to log2(copy number of measured sample/copy number of control sample (usually 2)). A few variants are point mutations, represented by the columns reference_bases and alternate_bases.

variant_2 <- pgxLoader(type="variant", biosample_id = biosample_id, output = "pgxseg")
head(variant_2)
#>     biosample_id reference_name    start      end    log2 variant_type
#> 1 pgxbs-kftviq25              1    61736   785910  0.4964          DUP
#> 2 pgxbs-kftviq25              1   787938  3830146 -0.3306          DEL
#> 3 pgxbs-kftviq25              1  3830314 34272616 -0.4466          DEL
#> 4 pgxbs-kftviq25              1 34273097 72284670 -0.4986          DEL
#> 5 pgxbs-kftviq25              1 72290431 72297688 -2.1683          DEL
#> 6 pgxbs-kftviq25              1 72302736 72345466  1.3714          DUP
#>   reference_bases alternate_bases variant_state_id variant_state_label
#> 1               .               .      EFO:0030070    copy number gain
#> 2               .               .      EFO:0030067    copy number loss
#> 3               .               .      EFO:0030067    copy number loss
#> 4               .               .      EFO:0030067    copy number loss
#> 5               .               .      EFO:0030067    copy number loss
#> 6               .               .      EFO:0030070    copy number gain
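
The log2 definition above can be inverted to obtain an idealized copy number estimate. This is a minimal sketch, assuming a control copy number of 2; both function names are hypothetical. Measured log2 values are often compressed relative to the true copy number, so treat the result as a rough guide only.

```r
# Hypothetical helpers implementing log2 = log2(sample copies / control copies)
log2_to_copy_number <- function(log2_ratio, control_copies = 2) {
  control_copies * 2^log2_ratio
}

copy_number_to_log2 <- function(copy_number, control_copies = 2) {
  log2(copy_number / control_copies)
}

# Segment means copied from the log2 column shown above
segment_log2 <- c(0.4964, -0.3306, -2.1683, 1.3714)

# Idealized copy number estimates for these segments
estimated_copies <- log2_to_copy_number(segment_log2)
```

A log2 value of 0 maps back to 2 copies, and a value of 1 maps to 4 copies, which matches the definition in the text.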

3.5 The third output format (output = “seg”)

This format is similar to the general ‘.seg’ file format and is compatible with the IGV tool for visualization. The only difference from the general ‘.seg’ file format is the fifth column: here it holds the variant type, whereas in the general ‘.seg’ file format it holds the number of probes or bins covered by the segment. In addition, point mutation variants are excluded from this file format.

variant_3 <- pgxLoader(type="variant", biosample_id = biosample_id, output = "seg")
head(variant_3)
#>     biosample_id reference_name    start      end variant_type    log2
#> 1 pgxbs-kftviq25              1    61736   785910          DUP  0.4964
#> 2 pgxbs-kftviq25              1   787938  3830146          DEL -0.3306
#> 3 pgxbs-kftviq25              1  3830314 34272616          DEL -0.4466
#> 4 pgxbs-kftviq25              1 34273097 72284670          DEL -0.4986
#> 5 pgxbs-kftviq25              1 72290431 72297688          DEL -2.1683
#> 6 pgxbs-kftviq25              1 72302736 72345466          DUP  1.3714
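
The relationship between the two segment formats can be sketched on mock data: keep only the CNV rows and reorder the columns so that variant_type comes fifth. The third mock row is an invented point mutation, and its “SNV” label and coordinates are assumptions made purely for illustration; pgxLoader performs the real conversion when output = "seg".

```r
# Mock pgxseg-like table: the first two rows are copied from the output above,
# the third row is an invented point mutation (label "SNV" is an assumption)
mock_pgxseg <- data.frame(
  biosample_id    = rep("pgxbs-kftviq25", 3),
  reference_name  = c("1", "1", "1"),
  start           = c(61736, 787938, 1000000),
  end             = c(785910, 3830146, 1000001),
  log2            = c(0.4964, -0.3306, NA),
  variant_type    = c("DUP", "DEL", "SNV"),
  reference_bases = c(".", ".", "A"),
  alternate_bases = c(".", ".", "G"),
  stringsAsFactors = FALSE
)

# Keep copy number variants only, then put the columns in seg-like order
is_cnv <- mock_pgxseg$variant_type %in% c("DUP", "DEL")
seg_columns <- c("biosample_id", "reference_name", "start", "end",
                 "variant_type", "log2")
seg_like <- mock_pgxseg[is_cnv, seg_columns]
```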

4 Export variants data for visualization

Setting save_file to TRUE makes pgxLoader save the retrieved variant data to a file instead of returning it directly. The file is written to the current working directory by default, or to the path specified by filename. Export is only available for variant data (type=‘variant’).

4.1 Upload ‘pgxseg’ file to Progenetix website

The following command creates a ‘.pgxseg’ file named “variants.pgxseg” in the “~/Downloads/” folder.

pgxLoader(type="variant", output="pgxseg", biosample_id=biosample_id, save_file=TRUE, 
          filename="~/Downloads/variants.pgxseg")

To visualize the ‘.pgxseg’ file, you can either upload it to this link or use the byconaut package for local visualization when dealing with a large number of samples.

4.2 Upload ‘.seg’ file to IGV

The following command creates a special ‘.seg’ file named “variants.seg” in the “~/Downloads/” folder.

pgxLoader(type="variant", output="seg", biosample_id=biosample_id, save_file=TRUE, 
          filename="~/Downloads/variants.seg")

You can upload this ‘.seg’ file to the IGV tool for visualization.

5 Session Info

#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pgxRpi_1.1.2     BiocStyle_2.33.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.35       R6_2.5.1            bookdown_0.39      
#>  [4] fastmap_1.1.1       xfun_0.43           cachem_1.0.8       
#>  [7] knitr_1.46          htmltools_0.5.8.1   attempt_0.3.1      
#> [10] rmarkdown_2.26      lifecycle_1.0.4     cli_3.6.2          
#> [13] sass_0.4.9          jquerylib_0.1.4     compiler_4.4.0     
#> [16] plyr_1.8.9          httr_1.4.7          tools_4.4.0        
#> [19] curl_5.2.1          evaluate_0.23       bslib_0.7.0        
#> [22] Rcpp_1.0.12         yaml_2.3.8          BiocManager_1.30.22
#> [25] jsonlite_1.8.8      rlang_1.1.3