RTCGA package to download mutations data that are included in RTCGA.mutations package
The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes. The key is to understand genomics to improve cancer care.
The RTCGA package supports downloading and integrating the variety and volume of TCGA data using the patient barcode as a key, which makes the data easier to obtain. This may have a beneficial influence on the development of science and the improvement of patients' treatment.
RTCGA is an open-source R package, available to download from Bioconductor or from GitHub.
Furthermore, the RTCGA package transforms TCGA data into a form that is convenient to use in the R statistical package. These data transformations can be part of a statistical analysis pipeline, which can be made more reproducible with RTCGA.
Use cases and examples are shown in the RTCGA package vignettes.
TCGA data have been released on many dates. To see them all, call checkTCGA("Dates") from the RTCGA package.
Version 20151101.0.0 of the RTCGA.mutations package contains mutations datasets that were released on 2015-11-01. They were downloaded in the following way (mainly copied from http://rtcga.github.io/RTCGA/):
All cohort names can be checked using infoTCGA().
For all cohorts, the following code downloads the mutations data.
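library(RTCGA)
library(magrittr)  # provides the %>% pipe used below

# Assumption: cohort names are the row names of infoTCGA() without the "-counts" suffix
cohorts <- infoTCGA() %>%
  rownames() %>%
  sub("-counts", "", x = .)

# dir.create("data2") # name of a directory in which data will be stored
releaseDate <- "2015-11-01"
sapply(cohorts, function(element) {
  tryCatch({
    downloadTCGA(cancerTypes = element,
                 dataSet = "Mutation_Packager_Calls.Level",
                 destDir = "data2",
                 date = releaseDate)
  },
  error = function(cond) {
    cat("Error: Maybe there were no mutations data for", element, "cancer.\n")
  })
})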
NA files from data2 folder
If there were no mutations data for some cohorts, we should remove the corresponding NA files.
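The removal step can be sketched with base R alone. This is only an illustration under stated assumptions: the file names are hypothetical, the leftover files are assumed to be recognisable by `_NA.` in their names, and a temporary directory stands in for the real data2 folder so that nothing is deleted by accident.

```r
# Sketch only: work in a throwaway directory instead of the real "data2".
demo_dir <- file.path(tempdir(), "data2_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Hypothetical file names; real downloads are named differently.
demo_files <- c("demo_BRCA.mutations", "demo_NA.mutations")
created <- file.create(file.path(demo_dir, demo_files))

# Assumption: files left behind by failed downloads contain "_NA." in their name.
na_files <- grep("_NA\\.", list.files(demo_dir), value = TRUE)
removed <- file.remove(file.path(demo_dir, na_files))

# Files that survive the clean-up.
remaining <- list.files(demo_dir)
```

Before running something similar on real data, it may be worth printing `na_files` first to confirm that only unwanted files are selected.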
The code below automatically assigns the paths of all mutations files, for all available cohort types downloaded to the data2 folder, to variables named after each cohort.
cohorts %>%
  sapply(function(element) {
    grep(paste0("_", element, "\\."),
         x = list.files("data2") %>%
           file.path("data2", .),
         value = TRUE)
  }) -> potential_datasets

for (i in seq_along(potential_datasets)) {
  if (length(potential_datasets[[i]]) > 0) {
    assign(value = potential_datasets[[i]],
           x = paste0(names(potential_datasets)[i], ".mutations.path"),
           envir = .GlobalEnv)
  }
}
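To see why the pattern is built as `_COHORT\.`, here is a small self-contained sketch. The paths and the `cohorts_demo` vector are hypothetical and much shorter than real TCGA directory names. It also keeps the matches in a named list rather than in global variables, which is one possible alternative to `assign()` and not what the package itself does.

```r
# Hypothetical paths; real TCGA directory names are longer.
paths <- file.path("data2", c("demo_BRCA.Mutation",
                              "demo_COAD.Mutation",
                              "demo_COADREAD.Mutation"))
cohorts_demo <- c("BRCA", "COAD", "COADREAD", "UVM")

# The leading "_" and trailing "\\." anchor the cohort name on both sides,
# so "COAD" does not also pick up the "COADREAD" path.
matches <- lapply(setNames(cohorts_demo, cohorts_demo), function(element) {
  grep(paste0("_", element, "\\."), paths, value = TRUE)
})

# Keep only cohorts for which a path was found.
found <- Filter(length, matches)
```

Here `found[["COAD"]]` holds only the COAD path, and UVM is dropped because no file matched it.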
readTCGA
Because mutations data are stored in separate files, a special function, readTCGA, has been prepared to read and merge the data automatically. The code is below.
ls() %>%
  grep("mutations\\.path", x = ., value = TRUE) %>%
  sapply(function(element) {
    tryCatch({
      readTCGA(get(element, envir = .GlobalEnv),
               dataType = "mutations") -> mutations_file
      # drop characters that cannot be represented in ASCII
      for (i in seq_len(ncol(mutations_file))) {
        mutations_file[, i] <- iconv(mutations_file[, i],
                                     "UTF-8", "ASCII", sub = "")
      }
      assign(value = mutations_file,
             x = sub("\\.path", "", x = element),
             envir = .GlobalEnv)
    }, error = function(cond) {
      # report which path variable could not be read
      cat(element, "\n")
    })
    invisible(NULL)
  })
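The `iconv()` step in the loop above can be tried in isolation. The sketch below uses a hypothetical two-row data frame, `mutations_demo`, in place of a real mutations table, and replaces the `for` loop with `lapply()`, which is an equivalent formulation and not the code used to build the package.

```r
# Hypothetical data frame standing in for one mutations table.
mutations_demo <- data.frame(
  Hugo_Symbol = c("TP53", "BRCA1"),
  note = c("caf\u00e9", "plain"),
  stringsAsFactors = FALSE
)

# Same iconv() call as in the loop above, applied to every column.
# sub = "" drops any character that has no ASCII representation.
mutations_demo[] <- lapply(mutations_demo, function(column) {
  iconv(column, from = "UTF-8", to = "ASCII", sub = "")
})
```

Assigning into `mutations_demo[]` keeps the object a data frame with its original column names. Note that, as in the original loop, every column comes back as character.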
RTCGA.mutations package
grep("mutations", ls(), value = TRUE) %>%
  grep("path", x = ., value = TRUE, invert = TRUE) %>%
  cat(sep = ",") # can one do it better? as from use_data documentation:
# ... Unquoted names of existing objects to save
devtools::use_data(ACC.mutations, BLCA.mutations, BRCA.mutations,
                   CESC.mutations, CHOL.mutations, COAD.mutations,
                   COADREAD.mutations, DLBC.mutations, ESCA.mutations,
                   GBMLGG.mutations, GBM.mutations, HNSC.mutations,
                   KICH.mutations, KIPAN.mutations, KIRC.mutations,
                   KIRP.mutations, LAML.mutations, LGG.mutations,
                   LIHC.mutations, LUAD.mutations, LUSC.mutations,
                   OV.mutations, PAAD.mutations, PCPG.mutations,
                   PRAD.mutations, READ.mutations, SARC.mutations,
                   SKCM.mutations, STAD.mutations, STES.mutations,
                   TGCT.mutations, THCA.mutations, UCEC.mutations,
                   UCS.mutations, UVM.mutations,
                   # overwrite = TRUE,
                   compress = "xz")
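The comment above asks whether the object names could be passed without pasting them by hand. One possible approach, sketched here with base R only, is `save(list = ...)`, which accepts object names as character strings. Everything in the sketch is a stand-in: the environment `demo_env`, the tiny data frames, and the temporary output directory are hypothetical, and writing one .rda file per object is an assumption about the desired layout, not a description of how the package data were actually saved.

```r
# Hypothetical stand-ins for the real *.mutations data frames and path variables.
demo_env <- new.env()
assign("ACC.mutations", data.frame(x = 1:2), envir = demo_env)
assign("BLCA.mutations", data.frame(x = 3:4), envir = demo_env)
assign("ACC.mutations.path", "data2/some/path", envir = demo_env)

# Select object names the same way as above: keep "mutations", drop "path".
object_names <- grep("mutations", ls(demo_env), value = TRUE)
object_names <- grep("path", object_names, value = TRUE, invert = TRUE)

# Save each object to its own .rda file; a temporary directory stands in for data/.
out_dir <- file.path(tempdir(), "data_demo")
dir.create(out_dir, showWarnings = FALSE)
for (name in object_names) {
  save(list = name,
       file = file.path(out_dir, paste0(name, ".rda")),
       envir = demo_env,
       compress = "xz")
}
saved_files <- list.files(out_dir)
```

Because the names are computed rather than typed, a cohort added or removed upstream is picked up automatically.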