The GBScleanR package has been developed to conduct error correction on genotype data obtained via NGS-based genotyping methods such as RAD-seq and GBS (Miller et al. 2007; Elshire et al. 2011). It is designed to estimate true genotypes along chromosomes from given allele read counts in the VCF file generated by SNP callers like GATK and TASSEL-GBS (McKenna et al. 2010; Glaubitz et al. 2014). The current implementation supports genotype data of a mapping population derived from two or more diploid founders followed by selfings, sibling crosses, or random crosses. e.g. F\(_2\) and 8-way RILs. Our method supposes markers to be biallelic and ordered along chromosomes by mapping reads on a reference genome sequence. To support smooth access to large size genotype data, every input VCF file is first converted to a genomic data structure (GDS) file (Zheng et al. 2012). The current implementation cannot convert a VCF file containing non-biallelic markers to a GDS file correctly. Thus, any input VCF file should be subjected to filtering for retaining only biallelic SNPs using, for example, bcftools (Li and Barrett 2011). GBScleanR provides functions for data visualization, filtering, and loading/writing a VCF file. Furthermore, the data structure of the GDS file created via this package is compatible with those used in the SNPRelate, GWASTools and GENESIS packages those are designed to handle large variant data and conduct regression analysis (Zheng et al. 2012; Gogarten et al. 2012, 2019). In this document, we first walk through the utility functions implemented in GBScleanR to introduce a basic usage. Then, the core function of GBScleanR which estimates true genotypes for error correction will be introduced.
You can install GBScleanR from the Bioconductor repository with the following code.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GBScleanR")
The code below let you install the package from the github repository. The package released in the github usually get frequent update more than that in Bioconductor due to the release schedule.
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") devtools::install_github("tomoyukif/GBScleanR", build_vignettes = TRUE)
The main class of the GBScleanR package is
GbsrGenotypData which inherits the
GenotypeData class in the SeqArray package. The
gbsrGenotypeData class object has three slots:
data slot holds genotype data as a
gds.class object which is defined in the
gdsfmt package while
scanAnnot contain objects storing annotation information of SNPs and samples, which are the
ScanAnnotationDataFrame objects defined in the GWASTools package. See the vignette of GWASTools for more detail. GBScleanR follows the way of GWASTools in which a unique genotyping instance (genotyped sample) is called “sample”.
Load the package.
GBScleanR only supports a VCF file as input. As an example data, we use sample genotype data for a biparental F2 population derived from inbred founders.
vcf_fn <- system.file("extdata", "sample.vcf", package = "GBScleanR") gds_fn <- tempfile("sample", fileext = ".gds")
As mentioned above, the
GbsrGenotypeData class requires genotype data in the
gds.class object which enable us quick access to the genotype data without loading the whole data on RAM. At the beginning of the processing, we need to convert data format of our genotype data from VCF to GDS. This conversion can be achieved using
gbsrVCF2GDS() as shown below. A compressed VCF file (.vcf.gz) also can be the input.
# `force = TRUE` allow the function to over write the GDS file, # even if a GDS file exists at `out_fn`. gbsrVCF2GDS(vcf_fn = vcf_fn, out_fn = gds_fn, force = TRUE, verbose = FALSE)
##  "/tmp/Rtmp2LGE7i/sample27687b16763629.gds"
Once we converted the VCF to the GDS, we can create a
GbsrGenotypeData instance for our data.
gds <- loadGDS(gds_fn, verbose = FALSE)
The first time to load a newly produced GDS file will take long time due to data reformatting for quick access.
Getter functions allow you to retrieve basic information of genotype data, e.g. the number of SNPs and samples, chromosome names, physical position of SNPs and alleles.
# Number of samples nsam(gds)
##  102
# Number of SNPs nmar(gds)
##  242
# Indices of chromosome ID of all markers head(getChromosome(gds))
##  "1" "1" "1" "1" "1" "1"
# Chromosome names of all markers head(getChromosome(gds))
##  "1" "1" "1" "1" "1" "1"
# Position (bp) of all markers head(getPosition(gds))
##  522289 571177 577905 720086 735019 841286
# Reference allele of all markers head(getAllele(gds))
##  "A,G" "A,G" "C,T" "G,C" "C,T" "G,T"
# SNP IDs head(getMarID(gds))
##  1 2 3 4 5 6
# sample IDs head(getSamID(gds))
##  "F2_1054" "F2_1055" "F2_1059" "F2_1170" "F2_1075" "F2_1070"
getGenotype() returns genotype call data in which integer numbers 0, 1, and 2 indicate the number of reference allele.
geno <- getGenotype(gds)
getRead() returns read count data as a list with two elements
alt containing reference read counts and alternative read counts, respectively.
geno <- getRead(gds)
countRead() are class methods of
GbsrGenotypeData and they summarize genotype counts and read counts per marker and per sample.
gds <- countGenotype(gds) gds <- countRead(gds)
These summary statistics can be visualized via plotting functions.
With the values obtained via
countGenotype(), we can plot histograms of missing rate , heterozygosity, reference allele frequency as shown below.
# Histgrams of missing rate histGBSR(gds, stats = "missing")
# Histgrams of heterozygosity histGBSR(gds, stats = "het")
# Histgrams of reference allele frequency histGBSR(gds, stats = "raf")