mviterson@gmail.com

davycats.dc@gmail.com

pjhop@gmail.com

b.t.heijmans@lumc.nl

Abstract

The Aanalysis of multiple omics datatypes from the same individuals is becoming increasingly common. For example, several data repositories contain genetic, transcriptomic and epigenetic (DNA methylation) measurements on the same individuals, e.g., TCGA, Geuvadis, BBMRI/BIOS, etc. However, errors in the data linkage are almost inevitable in such complex projects (e.g.Â due to sample mix-ups) and unknown family relationships may be present. If not corrected, such errors may introduce false positive or false negative findings. To streamline necessary quality control, wWe have developed a tool to reliably verify the sample relationships between and across omics data types using genotype data that is directly measured or inferred from for example RNA-seq or DNA methylation array data.

Here we illustrate how to detect data linkage errors with the *omicsPrint*
package using both artificially generated data and experimental data.
Several additional vignettes are available that show the use of the
package on further experimental data in different settings, i.e., 450k
DNA methylation and imputed genotypes.

Here we generate a single vector with 100 randomly drawn integers from the set; 1, 2, 3, representing 100 SNP calls from a single individual. Three additional individuals are generated by randomly swapping a certain fraction of the SNPs. Swapping 5 SNPs will introduce a few mismatches mimicking a situation where the same individual was measured twice (replicate) but with measurement error. Swapping 50% of the SNPs will be similar to the difference in genotypes between parents and offspring. Swapping all SNPs will result in a situation similar to comparing two unrelated individuals.

```
swap <- function(x, frac=0.05) {
n <- length(x)
k <- floor(n*frac)
x1 <- sample(1:n,k)
x2 <- sample(1:n,k) ##could be overlapping
x[x2] <- x[x1]
x
}
x1 <- 1 + rbinom(100, size=2, prob=1/3)
x2 <- swap(x1, 0.05) ##strongly related e.g. replicate
x3 <- swap(x1, 0.5) ##related e.g. parent off spring
x4 <- swap(x1, 1) ##unrelated
x <- cbind(x1, x2, x3, x4)
```

Now `x`

contains the 100 SNPs for the four individuals
using `head`

we can inspect the first six SNPs.

`head(x)`

```
## x1 x2 x3 x4
## [1,] 1 1 2 2
## [2,] 2 2 2 1
## [3,] 1 1 3 1
## [4,] 1 1 2 1
## [5,] 1 1 2 2
## [6,] 2 2 2 2
```

`allelesharing`

algorithmWe use Identity by State (IBS) for the set of SNPs to infer sample relations. See Abecasis et al. (2001), for the description of this approach applied to genetic data. Briefly, between each sample pair, the IBS-vector is calculated, which is a measure of genetic distance between individuals. Next, the vector is summarized by its mean and variance. A mean of 2 and variance of 0 indicates that the samples are identical.

`data <- alleleSharing(x, verbose=TRUE)`

`## Hash relations`

`## Pruning 100 SNPs ...`

`## 0 SNPs removed because of low call rate!`

`## 0 samples removed because too few SNPs called!`

`## Using 100 polymorphic SNPs to determine allele sharing.`

`## Running `square` IBS algorithm!`

The set of SNPs may contain uninformative SNPs, SNPs of bad quality
or even SNPs could be missing. The following pruning steps are
implemented to yield the most informative set of SNPs (thresholds can be
adjusted see `?alleleSharing`

):

- a SNP is remove if missing in >5% of the samples
(
`callRate = 0.95`

) - a sample should have at least 2/3 of the SNPs called
(
`coverageRate = 2/3`

) - a SNP is can be removed if it violates the Hardy-Weinberg
equilibrium (
`alpha = 0`

). - a SNP is can be removed if the minor allele frequency is below the
given threshold (
`maf = 0`

)

Hardy-Weinberg test-statistics is calculated using a \(\chi^2\)-test and Bonferonni multiple testing correction is performed.

`data`

```
## mean var colnames.x colnames.y relation
## 1 2.00 0.0000000 x1 x1 identical
## 2 1.96 0.0589899 x2 x1 unrelated
## 3 1.62 0.3187879 x3 x1 unrelated
## 4 1.40 0.3636364 x4 x1 unrelated
## 5 2.00 0.0000000 x2 x2 identical
## 6 1.60 0.3434343 x3 x2 unrelated
## 7 1.38 0.3793939 x4 x2 unrelated
## 8 2.00 0.0000000 x3 x3 identical
## 9 1.40 0.4242424 x4 x3 unrelated
## 10 2.00 0.0000000 x4 x4 identical
```

By default no relations are assumed except for the self-self relations.

The output is a `data.frame`

containing all pairwise
comparisons with the mean and variance of the IBS over the set of SNPs
and the reported sample relationship, including the identifiers.

Since we provided a list of known relations and assume that the majority is correct, we can build a classifier to discover misclassified relations. Linear discriminant analysis is used to generate a confusion-matrix, which is subsequently used to graphically represent the classification boundaries and generate an output file with misclassified sample pairs.

`mismatches <- inferRelations(data)`

`mismatches`

```
## mean var colnames.x colnames.y relation predicted
## 2 1.96 0.0589899 x2 x1 unrelated identical
```

There is one misclassified sample, namely the replicate that we
introduced but was not a priori specified as an existing relationship.
The true relationship with between sample `x1`

and sample
`x2`

is an identical relation. Furthermore, we see two sample
pairs with mean IBS of 1.96 and variance 0.06 which is an indication
that also these pairs share a considerable number of alleles. If known,
such relationships can be specified prior to analysis.

```
relations <- expand.grid(idx = colnames(x), idy= colnames(x))
relations$relation_type <- "unrelated"
relations$relation_type[relations$idx == relations$idy] <- "identical"
relations$relation_type[c(2,5)] <- "identical" ##replicate
relations$relation_type[c(3,7,9,10)] <- "parent offspring"
relations
```

```
## idx idy relation_type
## 1 x1 x1 identical
## 2 x2 x1 identical
## 3 x3 x1 parent offspring
## 4 x4 x1 unrelated
## 5 x1 x2 identical
## 6 x2 x2 identical
## 7 x3 x2 parent offspring
## 8 x4 x2 unrelated
## 9 x1 x3 parent offspring
## 10 x2 x3 parent offspring
## 11 x3 x3 identical
## 12 x4 x3 unrelated
## 13 x1 x4 unrelated
## 14 x2 x4 unrelated
## 15 x3 x4 unrelated
## 16 x4 x4 identical
```

Rerun the allelesharing algorithm now provided with the known relations.

`data <- alleleSharing(x, relations=relations)`

`## Hash relations`

`## Pruning 100 SNPs ...`

`## 0 SNPs removed because of low call rate!`

`## 0 samples removed because too few SNPs called!`

`## Using 100 polymorphic SNPs to determine allele sharing.`

`## Running `square` IBS algorithm!`

`data`

```
## mean var colnames.x colnames.y relation
## 1 2.00 0.0000000 x1 x1 identical
## 2 1.96 0.0589899 x2 x1 identical
## 3 1.62 0.3187879 x3 x1 parent offspring
## 4 1.40 0.3636364 x4 x1 unrelated
## 5 2.00 0.0000000 x2 x2 identical
## 6 1.60 0.3434343 x3 x2 parent offspring
## 7 1.38 0.3793939 x4 x2 unrelated
## 8 2.00 0.0000000 x3 x3 identical
## 9 1.40 0.4242424 x4 x3 unrelated
## 10 2.00 0.0000000 x4 x4 identical
```

`mismatches <- inferRelations(data)`