# 1 Inspiration: correlated coin tosses

Wolfgang mentioned on Monday that observing 50 heads and then 50 tails would not be evidence of a ‘fair’ coin, even though the observed number of head (50) was exactly that expected of a fair coin.

I wonder, how do you generate correlated coin tosses?

Criteria (most to least important)

1. Correct
2. Understandable
3. Robust
4. Fast

# 2 First approach

## 2.1 Implementation

Suppose we represent heads as ‘0’ and tails as ‘1’. Let p be the probability that a coin that is currently a head is tossed and remains a head, and similarly for tails. If p == 0.5, then the tosses would seem to be uncorrelated. We implement this idea as a simple function

f1 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- NULL
for (i in 1:n) {
if (runif(1) > p)
current <- (current + 1) %% 2
outcome <- c(outcome, current)
}
outcome
}

## 2.2 Assessment

Is it correct?

f1(10, .5)
##  [1] 1 1 0 0 0 0 0 0 1 0
table(f1(1000, .5))
##
##   0   1
## 496 504
res <- f1(1000, .9)
mean(rle(res)\$length) # expectation: 1 / (1 - p)
## [1] 10.75269

Is it understandable? Pretty much

Is it robust? We’ll come back to this.

Is it fast? Seems so initially, but it scales poorly!

system.time(f1(5000, .5))
##    user  system elapsed
##   0.141   0.019   0.161
system.time(f1(10000, .5))
##    user  system elapsed
##   0.455   0.047   0.503
system.time(f1(20000, .5))
##    user  system elapsed
##   1.796   0.261   2.062

Doubling the problem size (n) causes a 4-fold increase in execution time.

What’s the problem? c(outcome, current) causes outcome to be copied each time through the loop!

# 3 Second approach

## 3.1 Implementation

Pre-allocate and fill the outcome vector; outome updated in place, without copying.

f2 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
for (i in 1:n) {
if (runif(1) > p)
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}

## 3.2 Assessment

Is it correct?

set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res2 <- f2(100, .9)
identical(res1, res2)
## [1] TRUE

Is it understandable? As understandable as f1().

Is it robust? In a minute…

Is it fast?

system.time(f2(5000, .5))
##    user  system elapsed
##   0.029   0.000   0.029
system.time(f2(10000, .5))
##    user  system elapsed
##   0.055   0.001   0.057
system.time(f2(20000, .5))
##    user  system elapsed
##   0.106   0.003   0.111

f2() seems to scale linearly, so performs increasingly well compared to f1().

# 4 Third approach

## 4.1 Implementation

Note that runif() is called n times, but the program could be modified, and still be understandable, if it were called just once – hoist runif(1) > p out of the loop.

f3 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
test <- runif(n) > p
for (i in 1:n) {
if (test[i])
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}

## 4.2 Assessment

1. Correct?
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res3 <- f3(100, .9)
identical(res1, res3)
## [1] TRUE
1. Understandable? Yes.

2. Robust? None of them have been, and f3() really isn’t!

set.seed(123)
f1(0, .5)
## [1] 1 1
set.seed(123)
try(f3(0, .5))
## Error in if (test[i]) current <- (current + 1)%%2 :
##   missing value where TRUE/FALSE needed
1. Fast? Yes, about 10 times faster than f3()
n <- 100000
system.time(f2(n, .5))
##    user  system elapsed
##   0.551   0.033   0.587
system.time(f3(n, .5))
##    user  system elapsed
##   0.043   0.001   0.044

… with linear scaling.

system.time(f3(n * 10, .5))
##    user  system elapsed
##   0.424   0.007   0.434

# 5 Fourth approach

## 5.1 Implementation

The problem is that 1:n is not robust, especially 1:0 generates the sequence c(1, 0), whereas we were expecting a zero-length sequence!

Solution: use seq_len(n)

lapply(3:1, seq_len)
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1
f4 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
test <- runif(n) > p
for (i in seq_len(n)) {
if (test[i])
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}

## 5.2 Assessment

1. Correct? Yes

2. Understandable? Yes

3. Robust? Yes

set.seed(123)
f4(3, .5)
## [1] 1 1 0
f4(2, .5)
## [1] 1 0
f4(1, .5)
## [1] 0
f4(0, .5)
## numeric(0)
1. Fast? Yes

# 6 Fifth approach

## 6.1 Implementation

Use cumsum() (cummulative sum) to produce sequential groups that have the same head or tail status. Use %% on the cummulative sum to translate those groups into heads (cummsum() %% 2 == 0 or tails (cummsum() %% 2 == 1).

f5 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
test <- runif(n) > p
cumsum(current + test) %% 2
}

## 6.2 Assessment

1. Correct?
set.seed(123); res1 <- f1(10, .8)
set.seed(123); res5 <- f5(10, .8)
identical(res1, res5)
## [1] TRUE
1. Understandable? Harder to understand…

2. Robust? Yes

f5(0, .8)
## numeric(0)
1. Fast?
n <- 1000000
system.time(f4(n, .5))
##    user  system elapsed
##   0.416   0.002   0.420
system.time(f5(n, .5))
##    user  system elapsed
##   0.087   0.002   0.088
system.time(f5(n * 10, .5))
##    user  system elapsed
##   1.086   0.063   1.151

About 4x faster than f4(), scales linearly, fast even for very large n.

Could be used to generate a large data set for developing methods about correlated samples, along the lines of

correlated_tosses_expts <- function(m, n, p) {
## m tosses (rows) in each of n experiments
start0 <- rep(rbinom(m, 1, .5), each = m)
group0 <- cumsum(runif(m * n) > p)
toss <- (start0 + group0) %% 2
matrix(toss, m)
}
system.time({
expt <- correlated_tosses_expts(1000, 10000, .8)
})
##    user  system elapsed
##   0.977   0.058   1.035
hist(colSums(expt))

# 7 XXX Approach

Probably we have reached the point of diminishing gains, we’ve already spent far more time developing f5() than we’ll ever save by further investigation… However,

• Avoid adding current to cumsum() vector.
• Use rgeom() to generate change points.
• ‘Is there a package for that?’

Other tools

• microbenchmark for comparing fine-scale performance differences (but do we really care about speed when, e.g., timing differences are less than a couple of seconds for large-scale data?)
• testthat for writing ‘unit tests’ that allow easy implementation of tests for correct and robust code.

# 8 Session info

sessionInfo()
## R version 3.6.1 Patched (2019-07-16 r76845)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] BiocStyle_2.13.2
##
## loaded via a namespace (and not attached):
##  [1] BiocManager_1.30.4 compiler_3.6.1     magrittr_1.5
##  [4] bookdown_0.12      htmltools_0.3.6    tools_3.6.1
##  [7] yaml_2.2.0         Rcpp_1.0.1         codetools_0.2-16
## [10] stringi_1.4.3      rmarkdown_1.14     knitr_1.23
## [13] stringr_1.4.0      digest_0.6.20      xfun_0.8
## [16] evaluate_0.14