
library(MungeSumstats)

MungeSumstats now offers high-throughput querying and import of data from the MRC IEU Open GWAS Project.

1 Find GWAS datasets

#### Search for datasets ####
metagwas <- MungeSumstats::find_sumstats(traits = c("parkinson","alzheimer"), 
                                         min_sample_size = 1000)
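# Preview the first three matching datasets (output shown below)
head(metagwas, 3)
# Collect the dataset IDs, ordered from fewest to most SNPs
ids <- (dplyr::arrange(metagwas, nsnp))$id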
##          id               trait group_name year    author
## 1 ieu-a-298 Alzheimer's disease     public 2013   Lambert
## 2   ieu-b-2 Alzheimer's disease     public 2019 Kunkle BW
## 3 ieu-a-297 Alzheimer's disease     public 2013   Lambert
##                                                                                                                                                                                                                                                                                                                    consortium
## 1                                                                                                                                                                                                                                                                                                                        IGAP
## 2 Alzheimer Disease Genetics Consortium (ADGC), European Alzheimer's Disease Initiative (EADI), Cohorts for Heart and Aging Research in Genomic Epidemiology Consortium (CHARGE), Genetic and Environmental Risk in AD/Defining Genetic, Polygenic and Environmental Risk for Alzheimer's Disease Consortium (GERAD/PERADES),
## 3                                                                                                                                                                                                                                                                                                                        IGAP
##                 sex population     unit     nsnp sample_size       build
## 1 Males and Females   European log odds    11633       74046 HG19/GRCh37
## 2 Males and Females   European       NA 10528610       63926 HG19/GRCh37
## 3 Males and Females   European log odds  7055882       54162 HG19/GRCh37
##   category                subcategory ontology mr priority     pmid sd
## 1  Disease Psychiatric / neurological       NA  1        1 24162737 NA
## 2   Binary Psychiatric / neurological       NA  1        0 30820047 NA
## 3  Disease Psychiatric / neurological       NA  1        2 24162737 NA
##                                                                      note ncase
## 1 Exposure only; Effect allele frequencies are missing; forward(+) strand 25580
## 2                                                                      NA 21982
## 3                Effect allele frequencies are missing; forward(+) strand 17008
##   ncontrol     N
## 1    48466 74046
## 2    41944 63926
## 3    37154 54162
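
The ordering step above can be reproduced without a network connection. The following is a minimal sketch in base R that rebuilds only the three rows and a few of the columns printed above; the object names `metagwas_head`, `ids_by_nsnp` and `recent_large`, and the year and sample-size thresholds, are illustrative assumptions rather than part of MungeSumstats.

```r
# The three rows printed above (selected columns only), rebuilt by hand.
metagwas_head <- data.frame(
  id          = c("ieu-a-298", "ieu-b-2", "ieu-a-297"),
  trait       = rep("Alzheimer's disease", 3),
  year        = c(2013, 2019, 2013),
  nsnp        = c(11633, 10528610, 7055882),
  sample_size = c(74046, 63926, 54162),
  stringsAsFactors = FALSE
)

# Base-R equivalent of (dplyr::arrange(metagwas, nsnp))$id:
# order() sorts ascending, so the smallest dataset comes first.
ids_by_nsnp <- metagwas_head$id[order(metagwas_head$nsnp)]
print(ids_by_nsnp)

# Hypothetical extra filter: studies from 2015 onwards with >= 60,000 samples.
recent_large <- subset(metagwas_head, year >= 2015 & sample_size >= 60000)
print(recent_large$id)
```

Sorting by `nsnp` puts the smallest datasets first, which is convenient when you want a quick test import before committing to the larger downloads.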

2 Import full results

You can supply import_sumstats() with as many Open GWAS IDs as you want, but we’ll give just one here to save time.

datasets <- MungeSumstats::import_sumstats(ids = "ieu-a-298",
                                           ref_genome = "GRCH37")

2.1 Summarise results

By default, import_sumstats() returns a named list where the names are the Open GWAS dataset IDs and the items are the respective paths to the formatted summary statistics.

print(datasets)
## $`ieu-a-298`
## [1] "/tmp/RtmpZU6bK9/ieu-a-298.tsv.gz"

You can easily turn this into a data.frame as well.

results_df <- data.frame(id=names(datasets), 
                         path=unlist(datasets))
print(results_df)
##                  id                             path
## ieu-a-298 ieu-a-298 /tmp/RtmpZU6bK9/ieu-a-298.tsv.gz
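
Because the returned paths point into a temporary directory, it can help to confirm that each file still exists before reading it. The sketch below assumes the path printed above (it will differ on your machine) and uses a hypothetical helper, `peek_sumstats()`, which is not part of MungeSumstats.

```r
# Path as printed above; the temporary directory differs between sessions.
datasets <- list(`ieu-a-298` = "/tmp/RtmpZU6bK9/ieu-a-298.tsv.gz")
results_df <- data.frame(id = names(datasets),
                         path = unlist(datasets))

# Flag which files are actually present on disk.
results_df$exists <- file.exists(results_df$path)
print(results_df)

# Hypothetical helper: read the first n rows of one formatted file,
# or return NULL (with a message) if the file is missing.
peek_sumstats <- function(path, n = 5) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  # read.delim() reads gzip-compressed files transparently.
  utils::read.delim(path, nrows = n)
}

# One preview (or NULL) per dataset ID.
previews <- lapply(datasets, peek_sumstats)
```

For full-size files, a faster reader such as data.table::fread() may be preferable; base R is used here only to keep the sketch free of extra dependencies.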

3 Import full results (parallel)

Optional: speed up the import with multi-threaded downloads via axel, an external command-line tool that must be installed on your system.

datasets <- MungeSumstats::import_sumstats(ids = ids, 
                                           vcf_download = TRUE, 
                                           download_method = "axel", 
                                           nThread = max(2,future::availableCores()-2))
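
Since axel is an external program, one option is to check for it first and fall back to a single-threaded download if it is missing. This is a minimal sketch, not the package's own logic: it assumes that "download.file" is the accepted non-axel value of `download_method`, substitutes parallel::detectCores() for future::availableCores(), and the names `has_axel`, `n_thread` and `import_args` are illustrative.

```r
# Sys.which() returns "" when the executable is not on the PATH.
has_axel <- nzchar(Sys.which("axel"))[[1]]

# Assumption: "download.file" is the single-threaded alternative to "axel".
download_method <- if (has_axel) "axel" else "download.file"

# detectCores() can return NA on some platforms, so guard against that.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Same rule as above: leave two cores free, but never go below two threads.
n_thread <- max(2, n_cores - 2)

# Arguments for the import call, collected in one place.
import_args <- list(ids = "ieu-a-298",
                    vcf_download = TRUE,
                    download_method = download_method,
                    nThread = n_thread)
str(import_args)

# To run the import (requires MungeSumstats and a network connection):
# datasets <- do.call(MungeSumstats::import_sumstats, import_args)
```

Keeping the arguments in a list makes it easy to inspect the chosen download method and thread count before starting a long download.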

4 Further functionality

See the Getting started vignette for more information on how to use MungeSumstats and its functionality.

5 Session Info

utils::sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MungeSumstats_1.11.10 BiocStyle_2.31.0     
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1                           
##  [2] dplyr_1.1.4                                
##  [3] blob_1.2.4                                 
##  [4] R.utils_2.12.3                             
##  [5] Biostrings_2.71.6                          
##  [6] bitops_1.0-7                               
##  [7] fastmap_1.1.1                              
##  [8] RCurl_1.98-1.14                            
##  [9] VariantAnnotation_1.49.7                   
## [10] GenomicAlignments_1.39.5                   
## [11] XML_3.99-0.16.1                            
## [12] digest_0.6.35                              
## [13] lifecycle_1.0.4                            
## [14] KEGGREST_1.43.0                            
## [15] RSQLite_2.3.6                              
## [16] magrittr_2.0.3                             
## [17] googleAuthR_2.0.1                          
## [18] compiler_4.4.0                             
## [19] rlang_1.1.3                                
## [20] sass_0.4.9                                 
## [21] tools_4.4.0                                
## [22] utf8_1.2.4                                 
## [23] yaml_2.3.8                                 
## [24] data.table_1.15.4                          
## [25] rtracklayer_1.63.2                         
## [26] knitr_1.46                                 
## [27] S4Arrays_1.3.7                             
## [28] bit_4.0.5                                  
## [29] curl_5.2.1                                 
## [30] DelayedArray_0.29.9                        
## [31] abind_1.4-5                                
## [32] BiocParallel_1.37.1                        
## [33] BiocGenerics_0.49.1                        
## [34] R.oo_1.26.0                                
## [35] grid_4.4.0                                 
## [36] stats4_4.4.0                               
## [37] fansi_1.0.6                                
## [38] SummarizedExperiment_1.33.3                
## [39] cli_3.6.2                                  
## [40] rmarkdown_2.26                             
## [41] crayon_1.5.2                               
## [42] generics_0.1.3                             
## [43] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
## [44] httr_1.4.7                                 
## [45] rjson_0.2.21                               
## [46] DBI_1.2.2                                  
## [47] cachem_1.0.8                               
## [48] stringr_1.5.1                              
## [49] zlibbioc_1.49.3                            
## [50] assertthat_0.2.1                           
## [51] parallel_4.4.0                             
## [52] AnnotationDbi_1.65.2                       
## [53] BiocManager_1.30.22                        
## [54] XVector_0.43.1                             
## [55] restfulr_0.0.15                            
## [56] matrixStats_1.3.0                          
## [57] vctrs_0.6.5                                
## [58] Matrix_1.7-0                               
## [59] jsonlite_1.8.8                             
## [60] bookdown_0.39                              
## [61] IRanges_2.37.1                             
## [62] S4Vectors_0.41.7                           
## [63] bit64_4.0.5                                
## [64] GenomicFiles_1.39.0                        
## [65] GenomicFeatures_1.55.4                     
## [66] jquerylib_0.1.4                            
## [67] glue_1.7.0                                 
## [68] codetools_0.2-20                           
## [69] stringi_1.8.3                              
## [70] GenomeInfoDb_1.39.14                       
## [71] BiocIO_1.13.0                              
## [72] GenomicRanges_1.55.4                       
## [73] UCSC.utils_0.99.7                          
## [74] tibble_3.2.1                               
## [75] pillar_1.9.0                               
## [76] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.24   
## [77] htmltools_0.5.8.1                          
## [78] GenomeInfoDbData_1.2.12                    
## [79] BSgenome_1.71.4                            
## [80] R6_2.5.1                                   
## [81] evaluate_0.23                              
## [82] lattice_0.22-6                             
## [83] Biobase_2.63.1                             
## [84] R.methodsS3_1.8.2                          
## [85] png_0.1-8                                  
## [86] Rsamtools_2.19.4                           
## [87] gargle_1.5.2                               
## [88] memoise_2.0.1                              
## [89] bslib_0.7.0                                
## [90] SparseArray_1.3.5                          
## [91] xfun_0.43                                  
## [92] fs_1.6.3                                   
## [93] MatrixGenerics_1.15.1                      
## [94] pkgconfig_2.0.3