# 1 Inspiration: correlated coin tosses

Wolfgang mentioned on Monday that observing 50 heads and then 50 tails would not be evidence of a ‘fair’ coin, even though the observed number of head (50) was exactly that expected of a fair coin.

I wonder, how do you generate correlated coin tosses?

Criteria (most to least important)

1. Correct
2. Understandable
3. Robust
4. Fast

# 2 First approach

## 2.1 Implementation

Suppose we represent heads as ‘0’ and tails as ‘1’. Let p be the probability that a coin that is currently a head is tossed and remains a head, and similarly for tails. If p == 0.5, then the tosses would seem to be uncorrelated. We implement this idea as a simple function

f1 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- NULL
for (i in 1:n) {
if (runif(1) > p)
current <- (current + 1) %% 2
outcome <- c(outcome, current)
}
outcome
}

## 2.2 Assessment

Is it correct?

f1(10, .5)
##   0 0 0 1 1 1 1 1 1 1
table(f1(1000, .5))
##
##   0   1
## 510 490
res <- f1(1000, .9)
mean(rle(res)\$length) # expectation: 1 / (1 - p)
##  10.10101

Is it understandable? Pretty much

Is it robust? We’ll come back to this.

Is it fast? Seems so initially, but it scales poorly!

system.time(f1(5000, .5))
##    user  system elapsed
##   0.188   0.028   0.215
system.time(f1(10000, .5))
##    user  system elapsed
##   0.587   0.001   0.587
system.time(f1(20000, .5))
##    user  system elapsed
##   2.033   0.015   2.050

Doubling the problem size (n) causes a 4-fold increase in execution time.

What’s the problem? c(outcome, current) causes outcome to be copied each time through the loop!

# 3 Second approach

## 3.1 Implementation

Pre-allocate and fill the outcome vector; outome updated in place, without copying.

f2 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
for (i in 1:n) {
if (runif(1) > p)
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}

## 3.2 Assessment

Is it correct?

set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res2 <- f2(100, .9)
identical(res1, res2)
##  TRUE

Is it understandable? As understandable as f1().

Is it robust? In a minute…

Is it fast?

system.time(f2(5000, .5))
##    user  system elapsed
##    0.05    0.00    0.05
system.time(f2(10000, .5))
##    user  system elapsed
##   0.076   0.000   0.076
system.time(f2(20000, .5))
##    user  system elapsed
##   0.149   0.000   0.150

f2() seems to scale linearly, so performs increasingly well compared to f1().

# 4 Third approach

## 4.1 Implementation

Note that runif() is called n times, but the program could be modified, and still be understandable, if it were called just once – hoist runif(1) > p out of the loop.

f3 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
test <- runif(n) > p
for (i in 1:n) {
if (test[i])
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}

## 4.2 Assessment

1. Correct?
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res3 <- f3(100, .9)
identical(res1, res3)
##  TRUE
1. Understandable? Yes.

2. Robust? None of them have been, and f3() really isn’t!

set.seed(123)
f1(0, .5)
##  1 1
set.seed(123)
try(f3(0, .5))
## Error in if (test[i]) current <- (current + 1)%%2 :
##   missing value where TRUE/FALSE needed
1. Fast? Yes, about 10 times faster than f3()
n <- 100000
system.time(f2(n, .5))
##    user  system elapsed
##   0.788   0.000   0.788
system.time(f3(n, .5))
##    user  system elapsed
##   0.060   0.000   0.059

… with linear scaling.

system.time(f3(n * 10, .5))
##    user  system elapsed
##   0.570   0.012   0.583

# 5 Fourth approach

## 5.1 Implementation

The problem is that 1:n is not robust, especially 1:0 generates the sequence c(1, 0), whereas we were expecting a zero-length sequence!

Solution: use seq_len(n)

lapply(3:1, seq_len)
## []
##  1 2 3
##
## []
##  1 2
##
## []
##  1
f4 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
test <- runif(n) > p
for (i in seq_len(n)) {
if (test[i])
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}

## 5.2 Assessment

1. Correct? Yes

2. Understandable? Yes

3. Robust? Yes

set.seed(123)
f4(3, .5)
##  1 1 0
f4(2, .5)
##  1 0
f4(1, .5)
##  0
f4(0, .5)
## numeric(0)
1. Fast? Yes

# 6 Fifth approach

## 6.1 Implementation

Use cumsum() (cummulative sum) to produce sequential groups that have the same head or tail status. Use %% on the cummulative sum to translate those groups into heads (cummsum() %% 2 == 0 or tails (cummsum() %% 2 == 1).

f5 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
test <- runif(n) > p
cumsum(current + test) %% 2
}

## 6.2 Assessment

1. Correct?
set.seed(123); res1 <- f1(10, .8)
set.seed(123); res5 <- f5(10, .8)
identical(res1, res5)
##  TRUE
1. Understandable? Harder to understand…

2. Robust? Yes

f5(0, .8)
## numeric(0)
1. Fast?
n <- 1000000
system.time(f4(n, .5))
##    user  system elapsed
##   0.562   0.000   0.562
system.time(f5(n, .5))
##    user  system elapsed
##   0.102   0.000   0.102
system.time(f5(n * 10, .5))
##    user  system elapsed
##   1.259   0.076   1.334

About 4x faster than f4(), scales linearly, fast even for very large n.

Could be used to generate a large data set for developing methods about correlated samples, along the lines of

correlated_tosses_expts <- function(m, n, p) {
## m tosses (rows) in each of n experiments
start0 <- rep(rbinom(m, 1, .5), each = m)
group0 <- cumsum(runif(m * n) > p)
toss <- (start0 + group0) %% 2
matrix(toss, m)
}
system.time({
expt <- correlated_tosses_expts(1000, 10000, .8)
})
##    user  system elapsed
##   1.158   0.092   1.250
hist(colSums(expt)) # 7 XXX Approach

Probably we have reached the point of diminishing gains, we’ve already spent far more time developing f5() than we’ll ever save by further investigation… However,

• Avoid adding current to cumsum() vector.
• Use rgeom() to generate change points.
• ‘Is there a package for that?’

Other tools

• microbenchmark for comparing fine-scale performance differences (but do we really care about speed when, e.g., timing differences are less than a couple of seconds for large-scale data?)
• testthat for writing ‘unit tests’ that allow easy implementation of tests for correct and robust code.

# 8 Session info

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19
##
## Matrix products: default
## BLAS:   /home/msmith/Applications/R/R-3.6.0/lib/libRblas.so
## LAPACK: /home/msmith/Applications/R/R-3.6.0/lib/libRlapack.so
##
## locale:
##   LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##   LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##   LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8
##   LC_PAPER=de_DE.UTF-8       LC_NAME=C
##  LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
##  stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
##   BiocManager_1.30.4 compiler_3.6.0     magrittr_1.5
##   bookdown_0.10      tools_3.6.0        htmltools_0.3.6
##   yaml_2.2.0         Rcpp_1.0.1         codetools_0.2-16
##  stringi_1.4.3      rmarkdown_1.12     knitr_1.23
##  stringr_1.4.0      xfun_0.7           digest_0.6.20
##  evaluate_0.13