NHGRI maintains and routinely updates a database of selected genome-wide association studies. This document describes R/Bioconductor facilities for working with contents of this database.
Once the package has been installed, load it to obtain interactive access to all the facilities, and consult the package help index for an overview. The current version of this vignette can always be accessed at www.bioconductor.org, or by navigating the package's help pages from within R.
We can produce a GRanges in two forms. By default we get an mcols that has a small set of columns. Note that records that lack a CHR_POS value are omitted. Records that have complicated CHR_POS values, including semicolons or " x " notation, are kept, but only the first position is retained. The CHR_ID field may have complicated character values; these are not normalized and are simply used as seqnames "as is".
## dropping 89449 records that have NA for CHR_POS
## 1952 records have semicolon in CHR_POS; splitting and using first entry.
## 3270 records have ' x ' in CHR_POS indicating multiple SNP effects, using first.
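The handling of CHR_POS reported in these messages can be illustrated with a minimal base R sketch. This is not the package's own code: the values in chr_pos are made up for illustration, and the regular expression is just one plausible way to keep the first position.

```r
# Toy CHR_POS values (hypothetical, not taken from the catalog).
chr_pos <- c("1234567", NA, "2345678;2345999", "3456789 x 4567890")

# Drop records that have no position at all.
kept <- chr_pos[!is.na(chr_pos)]

# Keep only the text before the first ';' or ' x ' separator,
# then convert the remaining string to an integer position.
first_pos <- as.integer(sub("(;| x ).*$", "", kept))

first_pos
```

Entries without a separator pass through unchanged, so a single expression covers all three cases.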
We can set the seqinfo as follows, retaining only records that employ standard chromosomes.
gg = keepStandardChromosomes(gg, pruning.mode="coarse")
seqlevels(gg) = seqlevels(si.hs.38)
seqinfo(gg) = si.hs.38
We use BiocFileCache to manage the TSV file downloaded from EBI's download site. The file is provided without compression, so expect a download of more than 200 MB if you are not working from a cache. No ETag is set on the file, so you have to check for updates yourself.
This is converted to a manageable extension of GRanges using
Available functions are:
An extended GRanges instance with a sample of 50000 SNP-disease associations reported on 30 April 2020 is obtained as follows, with addresses based on the GRCh38 genome build. We use gwtrunc to refer to it in the sequel.
For a given trait, obtain a GRanges with all recorded associations; here only three associations are shown:
A basic Manhattan plot is easily constructed with the ggbio package facilities. Here we confine attention to chromosomes 4:6. First, we create a version of the catalog with \(-\log_{10} p\) truncated at a maximum value of 25.
mcols = S4Vectors::mcols
mlpv = mcols(gwtrunc)$PVALUE_MLOG
mlpv = ifelse(mlpv > 25, 25, mlpv)
S4Vectors::mcols(gwtrunc)$PVALUE_MLOG = mlpv
# seqlevelsStyle(gwtrunc) = "UCSC" # no more!
seqlevels(gwtrunc) = paste0("chr", seqlevels(gwtrunc))
gwlit = gwtrunc[ which(as.character(seqnames(gwtrunc)) %in% c("chr4", "chr5", "chr6")) ]
mlpv = mcols(gwlit)$PVALUE_MLOG
mlpv = ifelse(mlpv > 25, 25, mlpv)
S4Vectors::mcols(gwlit)$PVALUE_MLOG = mlpv
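The truncation step used above can be checked on a small vector. The sketch below uses hypothetical values and compares the ifelse() form against base R's pmin(), which expresses the same cap more directly; it is an illustration rather than a change to the vignette's code.

```r
# Hypothetical -log10(p) values, including one missing value.
mlpv <- c(3.2, 25, 48.7, NA, 7.9)

# Cap at 25 using the same ifelse() form as in the vignette code.
capped_ifelse <- ifelse(mlpv > 25, 25, mlpv)

# Equivalent cap using pmin(); NA stays NA in both versions.
capped_pmin <- pmin(mlpv, 25)

# Both approaches agree element by element.
stopifnot(isTRUE(all.equal(capped_ifelse, capped_pmin)))
```

Because the cap is idempotent, applying it again to a subset such as gwlit leaves already-truncated values unchanged.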
