performance.Rmd

---
output: html_document
editor_options: 
  chunk_output_type: console
---

# Performance

```{r, include=FALSE}
source("knitr-options.R")
WORDS_TO_IGNORE <- c("truelength", "growable", "precomputing", "vectorize", "JIT")
source("spelling-check.R")
```

Some resources used here or for further reading:

- [Advanced R](https://adv-r.hadley.nz/perf-improve.html)
- [Efficient R programming](https://bookdown.org/csgillespie/efficientR/)

The people who say that "R is just always slow" are usually not great R programmers. It is true that writing inefficient R code is easy, yet writing efficient R code is also possible when you know what you're doing. 
In this chapter, you will learn how to write R(cpp) code that is fast.

<!-- Some years ago, when introducing the Julia language, some really unfair benchmark comparisons were released (e.g. with loops where obvious and simpler vectorized code would have been much faster). I'm still mad at Julia for this and [I am not the only one](https://matloff.wordpress.com/2014/05/21/r-beats-python-r-beats-julia-anyone-else-wanna-challenge-r/). -->

## R's memory management

Read more with [this chapter of Advanced R](https://adv-r.hadley.nz/names-values.html).


### Understanding binding basics

```{r, echo = TRUE}
x <- c(1, 2, 3)
```

* It's creating an object, a vector of values, `c(1, 2, 3)`.
* And it's binding that object to a name, `x`.

```{r, out.width="32%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/bd90c87ac98708b1731c92900f2f53ec6a71edaf/ce375/diagrams/name-value/binding-1.png")
```

<br>

```{r, echo = TRUE}
y <- x
```
```{r, out.width="30%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/bdc72c04d3135f19fb3ab13731129eb84c9170af/f0ab9/diagrams/name-value/binding-2.png")
```

There are now two names for the same object in memory.


### Copy-on-modify

```{r, echo=TRUE}
x <- c(1, 2, 3)
y <- x
y[3] <- 4
```

```{r, echo=TRUE}
x
```

```{r, out.width="28%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/ef9f480effa2f1d0e401d1f94218d0cf118433c0/b56e9/diagrams/name-value/binding-3.png")
```

The object in memory is copied before being modified, so that `x` is not modified.


### Copy-on-modify: what about inside functions?

```{r, echo=TRUE}
f <- function(a) {
  a
}
x <- c(1, 2, 3)
z <- f(x)
```

```{r, out.width="30%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/e8718027aabaed377da311f45b45a179588e4dcf/6bf90/diagrams/name-value/binding-f2.png")
```

```{r, echo=TRUE}
f2 <- function(a) {
  a[1] <- 10
  a
}
z2 <- f2(x)
```

```{r}
cbind(x, z2)
```

The input parameter is not modified; you operate on a local copy of `a = x` in `f2()`.


### Lists

It's not just names (i.e. variables) that point to values; elements of lists do too. 

```{r, echo=TRUE}
l1 <- list(1, 2, 3)
```

```{r, out.width="32%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/dae84980f1586fc4ef47091c91f51a5737b38135/a1403/diagrams/name-value/list.png")
```

```{r, echo=TRUE}
l2 <- l1
```

```{r, out.width="30%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/52bc0e3da3382cba957a9d83397b6c9200906ce2/c72aa/diagrams/name-value/l-modify-1.png")
```


### Copy-on-modify for lists?

```{r, echo=TRUE}
l2[[3]] <- 4
```

```{r, out.width="38%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/b844bb5a3443e1344299627f5760e2ae3a9885b5/e1c76/diagrams/name-value/l-modify-2.png")
```

Only the third element needs to be copied.


### Data frames

**Data frames are lists of vectors.**

```{r, echo=TRUE}
d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))
```

```{r, out.width="28%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/80d8995999aa240ff4bc91bb6aba2c7bf72afc24/95ee6/diagrams/name-value/dataframe.png")
```

```{r, echo=TRUE}
d2 <- d1
d2[, 2] <- d2[, 2] * 2  # modify one column
```

```{r, out.width="30%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/c19fd7e31bf34ceff73d0fac6e3ea22b09429e4a/23d8d/diagrams/name-value/d-modify-c.png")
```

```{r, echo=TRUE}
d3 <- d1
d3[1, ] <- d3[1, ] * 3  # modify one row
```

```{r, out.width="45%", echo=FALSE}
knitr::include_graphics("https://d33wubrfki0l68.cloudfront.net/36df61f54d1ac62e066fb814cb7ba38ea6047a74/facf8/diagrams/name-value/d-modify-r.png")
```

By modifying the first row, you're modifying the first element of all vectors, therefore the full data frame is copied..


## Early advice

### NEVER GROW A VECTOR

Example computing the cumulative sums of a vector:

```{r}
x <- rnorm(2e4)  # Try also with n = 1e5
system.time({
  current_sum <- 0
  res <- c()
  for (x_i in x) {
    current_sum <- current_sum + x_i
    res <- c(res, current_sum)
  }
})
```

Here, at each iteration, you are reallocating a vector (of increasing size). Not only computations take time, memory allocations do too. This makes your code quadratic with the size of `x` (if you multiply the size of `x` by 2, you can expect the execution time to be multiplied by 4, for large sample sizes), whereas it should be only linear. 

What happens is similar to if you would like to climb these stairs, you climb one stair, go to the bottom, then climb two stairs, go to bottom, climb three, and so on. That takes way more time than just climbing all stairs at once.

```{r, out.width = "50%", echo=FALSE}
knitr::include_graphics("https://privefl.github.io/blog/images/stairs.jpg")
```

A good solution is to always pre-allocate your results (if you know the size):

```{r}
system.time({
  current_sum <- 0
  res2 <- double(length(x))
  for (i in seq_along(x)) {
    current_sum <- current_sum + x[i]
    res2[i] <- current_sum
  }
})
all.equal(res2, res)
```

If you don't know the size of the results, you can store them in a list and merge them afterwards:

```{r}
system.time({
  current_sum <- 0
  res3 <- list()
  for (i in seq_along(x)) {
    current_sum <- current_sum + x[i]
    res3[[i]] <- current_sum
  }
})
all.equal(unlist(res3), res)
```

With recent versions of R (>= 3.4), you can efficiently grow a vector using

```{r}
system.time({
  current_sum <- 0
  res4 <- c()
  for (i in seq_along(x)) {
    current_sum <- current_sum + x[i]
    res4[i] <- current_sum
  }
})
all.equal(res4, res)
```

> Assigning to an element of a vector beyond the current length now over-allocates by a small fraction. The new vector is marked internally as growable, and the true length of the new vector is stored in the truelength field. This makes building up a vector result by assigning to the next element beyond the current length more efficient, though pre-allocating is still preferred. The implementation is subject to change and not intended to be used in packages at this time. ([NEWS](https://cran.r-project.org/doc/manuals/r-release/NEWS.html))

An even better solution would be to avoid the loop by using a vectorized function:

```{r}
system.time(res5 <- cumsum(x))
all.equal(res5, res)
x <- rnorm(1e7)
system.time(cumsum(x))
```

***

As a second example, let us generate a matrix of uniform values (max changing for every column):

```{r}
n <- 1e3
max <- 1:1000
system.time({
  mat <- NULL
  for (m in max) {
    mat <- cbind(mat, runif(n, max = m))
  }
})
apply(mat, 2, max)[1:10]
```

Instead, we should pre-allocate a matrix of the right size:

```{r}
system.time({
  mat3 <- matrix(0, n, length(max))
  for (i in seq_along(max)) {
    mat3[, i] <- runif(n, max = max[i])
  }
})
apply(mat3, 2, max)[1:10]
```

Or we could use a list instead. What is nice with using a list is that you don't need to pre-allocate. Indeed, as opposed to atomic vectors, each element of a list is in different places in memory so that you don't have to reallocate all the data when you add an element to a list.

```{r}
system.time({
  l <- list()
  for (i in seq_along(max)) {
    l[[i]] <- runif(n, max = max[i])
  }
  mat4 <- do.call("cbind", l)
})
apply(mat4, 2, max)[1:10]
```

Instead of pre-allocating yourself, you can use `sapply` (or `lapply` and calling `do.call()` after, as previously done):

```{r}
system.time(
  mat4 <- sapply(max, function(m) runif(n, max = m))
)
apply(mat4, 2, max)[1:10]
```

Don't listen to people telling you that `sapply()` is a vectorized operation that is so much faster than loops. That's false, and for-loops can actually be much faster than `sapply()` when using just-in-time (JIT) compilation. You can learn more with [this blog post](https://privefl.github.io/blog/why-loops-are-slow-in-r/).

<!-- ### Access columns of a matrix -->

<!-- When you do computations on a matrix, recall that a matrix is just a vector with some dimensions. -->

<!-- ```{r} -->
<!-- vec <- 1:20 -->
<!-- dim(vec) <- c(4, 5) -->
<!-- vec -->
<!-- ``` -->

<!-- So, as you can see in this example, R matrices are column-oriented, which means that elements of the same column are stored contiguously in memory. Therefore, accessing elements of the same column is fast.  -->


### Use the right function

Often, in order to optimize your code, you can simply find the right function to do what you need to do. 

For example, `rowMeans(x)` is much faster than `apply(x, 1, mean)`. 
Similarly, if you want more efficient functions that apply to rows and columns of matrices, you can check [package {matrixStats}](https://github.com/HenrikBengtsson/matrixStats). 

Another example is when reading large text files; in such cases, prefer using `data.table::fread()` rather than `read.table()`.

Generally, packages that uses C/Rcpp are efficient.

### Do not try to optimize everything

> "Programmers waste enormous amounts of time thinking about, or worrying
> about, the speed of noncritical parts of their programs, and these attempts 
> at efficiency actually have a strong negative impact when debugging and
> maintenance are considered."
>
> --- Donald Knuth.

If you try to optimize each and every part of your code, you will end up losing a lot of time writing it, and it will probably make your code less readable.

R is great at prototyping quickly because you can write code in a concise and easy way. Start by doing just that. If performance matters, then profile your code to see which part of your code is taking too much time and optimize only this part!

Learn more on how to profile your code in RStudio in [this article](https://support.posit.co/hc/en-us/articles/218221837-Profiling-R-code-with-the-RStudio-IDE).


## Vectorization

I call *vectorized* a function that takes vectors as arguments and operate on each element of these vectors in another (compiled) language (such as C, C++ and Fortran). Again, `sapply()` is not a vectorized function (cf. above).

Take this code:
```{r, eval=FALSE}
N <- 10e3; x <- runif(N); y <- rnorm(N)

res <- double(length(x))
for (i in seq_along(x)) {
  res[i] <- x[i] + y[i]
}
```

As an interpreted language, for each iteration `res[i] <- x[i] + y[i]`, R has to ask:

- what is the type of `x[i]` and `y[i]`?

- can I add these two types? 

- what is the type of `x[i] + y[i]` then?

- can I store this result in `res` or do I need to convert it?

These questions must be answered for each iteration, which takes time.    
Some of this is alleviated by JIT compilation.

On the contrary, for vectorized functions, these questions must be answered only once, which saves a lot of time. 
Read more with [Noam Ross’s blog post on vectorization](http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html).

### Exercise

Monte-Carlo integration (example from [book Efficient R programming](https://bookdown.org/csgillespie/efficientR/programming.html#vectorised-code))

Suppose we wish to estimate the integral $\int_0^1 x^2 dx$ using a Monte-Carlo method. Essentially, we throw darts (at random) and count the proportion of darts that fall below the curve (as in the following figure).

```{r monte-carlo, echo=FALSE}
knitr::include_graphics("https://bookdown.org/csgillespie/efficientR/_main_files/figure-html/3-1-1.png")
```

Naively implementing this Monte-Carlo algorithm in R would typically lead to something like:

```{r}
monte_carlo <- function(N) {
  
  hits <- 0
  
  for (i in seq_len(N)) {
    x <- runif(1)
    y <- runif(1)
    if (y < x^2) {
      hits <- hits + 1
    }
  }
  
  hits / N
}
```

This takes a few seconds for `N = 1e6`:

```{r cache=TRUE}
N <- 2e5
system.time(res <- monte_carlo(N))
res
```

Your task: find a vectorized solution for this problem:

```{r echo=FALSE}
monte_carlo_vec <- function(N) mean(runif(N)^2 > runif(N))
```

```{r}
system.time(res2 <- monte_carlo_vec(N))
res2
```


## Rcpp {#Rcpp}

See [this presentation](https://privefl.github.io/R-presentation/Rcpp.html).

You have this data and this working code (a loop) that is slow 

```{r, eval=FALSE}
mydf <- readRDS(system.file("extdata/one-million.rds", package = "advr38pkg"))

QRA_3Dmatrix <- array(0, dim = c(max(mydf$ID), max(mydf$Volume), 2))
transform_energy <- function(x) {
  1 - 1.358 / (1 + exp( (1000 * x - 129000) / 120300 ))
}

for (i in seq_len(nrow(mydf))) {
  # Row corresponds to the ID class
  row    <- mydf$ID[i]
  # Column corresponds to the volume class
  column <- mydf$Volume[i]
  # Number of events, initially zero, then increment
  QRA_3Dmatrix[row, column, 1] <- QRA_3Dmatrix[row, column, 1] + 1  
  # Sum energy 
  QRA_3Dmatrix[row, column, 2] <- QRA_3Dmatrix[row, column, 2] + 
    transform_energy(mydf$Energy[i])
}
```

Rewrite this for-loop with Rcpp. 

You can also try to use {dplyr} for this problem.


## Linear algebra

It is faster to use `crossprod(X)` and `tcrossprod(X)` instead of `t(X) %*% X` and `X %*% t(X)`. Moreover, using `A %*% (B %*% y)` is faster than `A %*% B %*% y`, and `solve(A, y)` is faster than `solve(A) %*% y`.

Don't re-implement linear algebra operations (such as matrix products) yourself. There exist some highly optimized libraries for this. If you want to use linear algebra in Rcpp, try [RcppArmadillo](http://dirk.eddelbuettel.com/code/rcpp.armadillo.html) or [RcppEigen](http://dirk.eddelbuettel.com/code/rcpp.eigen.html).

If you want to use some optimized multi-threaded linear library, you can try [Microsoft R Open](https://www.microsoft.com/en-US/download/details.aspx?id=51205). 

### Exercises

Compute the Euclidean distances between each of row of `X` and each row of `Y`:

```{r}
set.seed(1)
X <- matrix(rnorm(5000), ncol = 5)
Y <- matrix(rnorm(2000), ncol = 5)
```

A naive implementation would be:

```{r}
system.time({
  dist <- matrix(NA_real_, nrow(X), nrow(Y))
  for (i in seq_len(nrow(X))) {
    for (j in seq_len(nrow(Y))) {
      dist[i, j] <- sqrt(sum((Y[j, ] - X[i, ])^2))
    }
  }
})
```

Try first to remove one of the two loops using `sweep()` instead. Which loop should you choose to remove ideally?
A solution with `sweep()` can take

```{r, echo=FALSE}
system.time({
  dist2 <- matrix(NA_real_, nrow(X), nrow(Y))
  for (i in seq_len(nrow(X))) {
    dist2[i, ] <- sqrt(rowSums(sweep(Y, 2, X[i, ], '-')^2))
  }
})
stopifnot(all.equal(dist2, dist))
```

Then, try to implement a fully vectorized solution based on this hint: $\text{dist}(X_i, Y_j)^2 = (X_i - Y_j)^T (X_i - Y_j) = X_i^T X_i + Y_j^T Y_j - 2 X_i^T Y_j$. A faster solution using `outer()` and `tcrossprod()` takes

```{r, echo=FALSE}
fast_dist <- function(x, y) {
  sqrt(outer(rowSums(x^2), rowSums(y^2), '+') - tcrossprod(x, 2 * y))
}
stopifnot(all.equal(fast_dist(X, Y), dist))
system.time(fast_dist(X, Y))
```


## Algorithms & data structures

Sometimes, getting the right data structure (e.g. using a matrix instead of a data frame or integers instead of characters) can save you some computation time.

Is your algorithm doing some redundant computations making it e.g. quadratic instead of linear with respect to the dimension of your data?

See exercises (section \@ref(exos)) for some insights.

You can also find a detailed example in [this blog post](https://privefl.github.io/blog/performance-when-algorithmics-meets-mathematics/).

## Exercises {#exos}

Generate $10^7$ (start with $10^5$) steps of the process described by the formula:$$X(0)=0$$$$X(t+1)=X(t)+Y(t)$$ where $Y(t)$ are independent random variables with the distribution $N(0,1)$. 
Then, calculate the percentage of $X(t)$ that are negative. 
You do not need to store all values of $X$.

A naive implementation with a for-loop could be:

```{r}
set.seed(1)

system.time({
  N <- 1e5
  x <- 0
  count <- 0
  for (i in seq_len(N)) {
    y <- rnorm(1)
    x <- x + y
    if (x < 0) count <- count + 1
  }
  p <- count / N
})
p
```

Try to vectorize this after having written the value of X(0), X(1), X(2), and X(3).
What would be the benefit of writing an Rcpp function over a simple vectorized R function?

```{r}
set.seed(1)
system.time(p2 <- advr38pkg::random_walk_neg_prop(1e5))
p2
```

```{r}
set.seed(1)
system.time(p3 <- advr38pkg::random_walk_neg_prop(1e7))
p3
```

***

```{r}
mat <- as.matrix(mtcars)
ind <- seq_len(nrow(mat))
mat_big <- mat[rep(ind, 1000), ]  ## 1000 times bigger dataset
last_row <- mat_big[nrow(mat_big), ]
```

Speed up these loops (vectorize):

```{r}
system.time({
  for (j in 1:ncol(mat_big)) {
    for (i in 1:nrow(mat_big)) {
      mat_big[i, j] <- 10 * mat_big[i, j] * last_row[j]
    }
  }
})
```

***

Why `colSums()` on a whole matrix is faster than on only half of it?

```{r}
m0 <- matrix(rnorm(1e6), 1e3, 1e3)
microbenchmark::microbenchmark(
  colSums(m0[, 1:500]), 
  colSums(m0)
)
```

***

Try to speed up this code by vectorizing it first, and/or by precomputing. Then, recode it in Rcpp and benchmark all the solutions you came up with.

```{r}
M <- 50
step1 <- runif(M)
A <- rnorm(M)
N <- 1e4

tau <- matrix(0, N + 1, M)
tau[1, ] <- A
for (j in 1:M) {
  for (i in 2:nrow(tau)) {
    tau[i, j] <- tau[i - 1, j] + step1[j] * 1.0025^(i - 2)
  }
} 
```

***

Make a fast function that counts the number of elements between a sequence of breaks. Can you do it in base R? Try also implementing it in Rcpp. How can you implement a solution whose computation time doesn't depend on the number of breaks? [Which are the special cases that you should consider?]

```{r}
x <- sample(10, size = 1e4, replace = TRUE)
breaks <- c(1, 3, 8.5, 9.5, 10)
table(cut(x, breaks), exclude = NULL) # does not include first break (1)
hist(x, breaks, plot = FALSE)$counts  # includes first break
advr38pkg::count_by_breaks(x, breaks)
advr38pkg::count_by_breaks_fast(x, breaks)

microbenchmark::microbenchmark(
  table(cut(x, breaks)), 
  hist(x, breaks, plot = FALSE)$counts, 
  advr38pkg::count_by_breaks(x, breaks, use_outer = TRUE),
  advr38pkg::count_by_breaks(x, breaks, use_outer = FALSE),
  advr38pkg::count_by_breaks_fast(x, breaks)
)

x2 <- sample(100, size = 1e5, replace = TRUE)
breaks2 <- breaks * 10
breaks3 <- seq(0, 100, length.out = 100)
microbenchmark::microbenchmark(
  advr38pkg::count_by_breaks(x2, breaks2),
  advr38pkg::count_by_breaks_fast(x2, breaks2),
  advr38pkg::count_by_breaks(x2, breaks3),
  advr38pkg::count_by_breaks_fast(x2, breaks3)
)
```

***

An R user wants to implement some sampling on a sparse matrix and provides this working code:

```{r}
N <- 2000
system.time({
  m <- Matrix::Matrix(0, nrow = N, ncol = N)
  for (j in 1:N) {
    cols <- sample((1:N)[-j], 2)  # pick 2 columns that are not j 
    m[j, cols] <- 1
  }
})
```

This code is slow; can you find two major reasons why? 

How can you more efficiently assign 1s? A faster solution would take:

```{r, echo=FALSE}
system.time({
  m <- Matrix::Matrix(0, nrow = N, ncol = N)
  ind <- sapply(1:N, function(j) sample((1:N)[-j], 2))
  m[t(ind)] <- 1
})
```

Can you use sampling with replacement (to avoid unnecessarily allocating memory) in this example? A faster solution would take:

```{r, echo=FALSE}
sample2 <- function(N) {
  
  res <- rep(NA_integer_, 2 * N)
  
  for (j in seq_len(N)) {
    
    # sample first one
    repeat {
      ind1 <- sample(N, 1)
      if (ind1 != j) break
    }
    
    # sample second one
    repeat {
      ind2 <- sample(N, 1)
      if (ind2 != j && ind2 != ind1) break
    }
    
    res[2 * j - 1:0] = c(ind1, ind2)
  }
  
  cbind(rep(1:N, each = 2), res)
}

system.time({
  m <- Matrix::Matrix(0, nrow = N, ncol = N)
  m[sample(N)] <- 1
})
```

It would be even faster using Rcpp (cf. [this SO answer](https://stackoverflow.com/a/45796424/6103040)).


***

Make a fast function that returns all prime numbers up to a number `N`.

```{r, dev='png'}
N <- 1e6
system.time(
  primes <- advr38pkg::AllPrimesUpTo(N)
)
plot(primes, pch = 20, cex = 0.5)
```

## Parallel computing

I basically always use `foreach` and recommend to do so. See [my guide to parallelism in R with `foreach`](https://privefl.github.io/blog/a-guide-to-parallelism-in-r/). 

**Just remember to optimize your code before trying to parallelize it.**

Try to parallelize some of your best solutions for the previous exercises.