Suppose we represent heads as ‘0’ and tails as ‘1’. Let `p`

be the
probability that a coin that is currently a head is tossed and remains
a head, and similarly for tails. If `p == 0.5`

, then the tosses would
seem to be uncorrelated. We implement this idea as a simple function

```
f1 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- NULL
for (i in 1:n) {
if (runif(1) > p)
current <- (current + 1) %% 2
outcome <- c(outcome, current)
}
outcome
}
```

Is it correct?

`f1(10, .5)`

`## [1] 0 0 0 1 1 1 1 1 1 1`

`table(f1(1000, .5))`

```
##
## 0 1
## 510 490
```

```
res <- f1(1000, .9)
mean(rle(res)$length) # expectation: 1 / (1 - p)
```

`## [1] 10.10101`

Is it understandable? Pretty much

Is it robust? We’ll come back to this.

Is it fast? Seems so initially, but it scales poorly!

`system.time(f1(5000, .5))`

```
## user system elapsed
## 0.188 0.028 0.215
```

`system.time(f1(10000, .5))`

```
## user system elapsed
## 0.587 0.001 0.587
```

`system.time(f1(20000, .5))`

```
## user system elapsed
## 2.033 0.015 2.050
```

Doubling the problem size (`n`

) causes a 4-fold increase in execution time.

What’s the problem? `c(outcome, current)`

causes `outcome`

to be
copied each time through the loop!

Pre-allocate and fill the outcome vector; `outome`

updated in place,
without copying.

```
f2 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
for (i in 1:n) {
if (runif(1) > p)
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}
```

Is it correct?

```
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res2 <- f2(100, .9)
identical(res1, res2)
```

`## [1] TRUE`

Is it understandable? As understandable as `f1()`

.

Is it robust? In a minute…

Is it fast?

`system.time(f2(5000, .5))`

```
## user system elapsed
## 0.05 0.00 0.05
```

`system.time(f2(10000, .5))`

```
## user system elapsed
## 0.076 0.000 0.076
```

`system.time(f2(20000, .5))`

```
## user system elapsed
## 0.149 0.000 0.150
```

`f2()`

seems to scale linearly, so performs increasingly well compared
to `f1()`

.

Note that `runif()`

is called `n`

times, but the program could be
modified, and still be understandable, if it were called just once –
*hoist* `runif(1) > p`

out of the loop.

```
f3 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
test <- runif(n) > p
for (i in 1:n) {
if (test[i])
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}
```

- Correct?

```
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res3 <- f3(100, .9)
identical(res1, res3)
```

`## [1] TRUE`

Understandable? Yes.

Robust? None of them have been, and

`f3()`

really isn’t!

```
set.seed(123)
f1(0, .5)
```

`## [1] 1 1`

```
set.seed(123)
try(f3(0, .5))
```

```
## Error in if (test[i]) current <- (current + 1)%%2 :
## missing value where TRUE/FALSE needed
```

- Fast? Yes, about 10 times faster than
`f3()`

```
n <- 100000
system.time(f2(n, .5))
```

```
## user system elapsed
## 0.788 0.000 0.788
```

`system.time(f3(n, .5))`

```
## user system elapsed
## 0.060 0.000 0.059
```

… with linear scaling.

`system.time(f3(n * 10, .5))`

```
## user system elapsed
## 0.570 0.012 0.583
```

The problem is that `1:n`

is not robust, especially `1:0`

generates
the sequence `c(1, 0)`

, whereas we were expecting a zero-length
sequence!

Solution: use `seq_len(n)`

`lapply(3:1, seq_len)`

```
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1
```

```
f4 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
test <- runif(n) > p
for (i in seq_len(n)) {
if (test[i])
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}
```

Correct? Yes

Understandable? Yes

Robust? Yes

```
set.seed(123)
f4(3, .5)
```

`## [1] 1 1 0`

`f4(2, .5)`

`## [1] 1 0`

`f4(1, .5)`

`## [1] 0`

`f4(0, .5)`

`## numeric(0)`

- Fast? Yes

Use `cumsum()`

(cummulative sum) to produce sequential groups that
have the same head or tail status. Use `%%`

on the cummulative sum to
translate those groups into heads (`cummsum() %% 2 == 0`

or tails
`(cummsum() %% 2 == 1`

).

```
f5 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
test <- runif(n) > p
cumsum(current + test) %% 2
}
```

- Correct?

```
set.seed(123); res1 <- f1(10, .8)
set.seed(123); res5 <- f5(10, .8)
identical(res1, res5)
```

`## [1] TRUE`

Understandable? Harder to understand…

Robust? Yes

`f5(0, .8)`

`## numeric(0)`

- Fast?

```
n <- 1000000
system.time(f4(n, .5))
```

```
## user system elapsed
## 0.562 0.000 0.562
```

`system.time(f5(n, .5))`

```
## user system elapsed
## 0.102 0.000 0.102
```

`system.time(f5(n * 10, .5))`

```
## user system elapsed
## 1.259 0.076 1.334
```

About 4x faster than `f4()`

, scales linearly, fast even for very large `n`

.

Could be used to generate a large data set for developing methods about correlated samples, along the lines of

```
correlated_tosses_expts <- function(m, n, p) {
## m tosses (rows) in each of n experiments
start0 <- rep(rbinom(m, 1, .5), each = m)
group0 <- cumsum(runif(m * n) > p)
toss <- (start0 + group0) %% 2
matrix(toss, m)
}
system.time({
expt <- correlated_tosses_expts(1000, 10000, .8)
})
```

```
## user system elapsed
## 1.158 0.092 1.250
```

`hist(colSums(expt))`

Probably we have reached the point of diminishing gains, we’ve already
spent far more time developing `f5()`

than we’ll ever save by further
investigation… However,

- Avoid adding
`current`

to`cumsum()`

vector. - Use
`rgeom()`

to generate change points. - ‘Is there a package for that?’

Other tools

- microbenchmark for comparing fine-scale performance differences (but do we really care about speed when, e.g., timing differences are less than a couple of seconds for large-scale data?)
- testthat for writing ‘unit tests’ that allow easy implementation of tests for correct and robust code.

`sessionInfo()`

```
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19
##
## Matrix products: default
## BLAS: /home/msmith/Applications/R/R-3.6.0/lib/libRblas.so
## LAPACK: /home/msmith/Applications/R/R-3.6.0/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] BiocManager_1.30.4 compiler_3.6.0 magrittr_1.5
## [4] bookdown_0.10 tools_3.6.0 htmltools_0.3.6
## [7] yaml_2.2.0 Rcpp_1.0.1 codetools_0.2-16
## [10] stringi_1.4.3 rmarkdown_1.12 knitr_1.23
## [13] stringr_1.4.0 xfun_0.7 digest_0.6.20
## [16] evaluate_0.13
```