Performance optimization

Now that we know how to measure our code’s speed, we can focus on making it faster. We’ve actually covered a number of important performance optimizations already.

To recap:

  • Avoid memory bloat and costly reassignment by doing things in-place: do not create unnecessary variables.

  • Factors are an efficient way of saving memory when working with repetitive strings (see the short sketch after this list).

  • Avoid indexing out-of-bounds. We’ll cover another variant of this here as well.

  • tidyverse packages are typically optimized out of the box - as long as we’re using tidyverse functions, we can be reasonably confident that some optimization has already been done (these functions also sidestep several performance traps).

  • The apply() function offers superior performance when working with matrices.

  • We don’t need to do anything to gain the benefits of R’s JIT compiler - it’s on by default.
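As a quick refresher on the factor point above, here is a minimal sketch (the variable names are just for illustration) comparing the size of a repetitive character vector with its factor equivalent:

strings <- rep(c("treatment", "control"), 50000)
object.size(strings)            # character vector: one pointer per element
object.size(as.factor(strings)) # factor: one small integer code per element, plus two levels

On a typical system the factor comes out noticeably smaller (roughly half the size here), since each element is stored as an integer code rather than a pointer to a string.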

We’ll now walk through a few other common optimizations we can use. This will be done by example - we’ll show exactly why each tip matters.

Preallocate your results

We’ll compare two versions of the same piece of code: in this case, just creating a sequence of numbers of a certain length. We will use a for loop for clarity. In the first version (dynamic), we grow the variable sequence with each iteration of the loop. In the second version (preallocated), we create sequence beforehand at its expected final size.

dynamic <- function(size) {
  sequence <- NA
  for (i in 1:size) {
    sequence[i] <- i
  }
  return(sequence)
}

preallocated <- function(size) {
  sequence <- rep(NA, size)
  for (i in 1:size) {
    sequence[i] <- i
  }
  return(sequence)
}

library(microbenchmark)
microbenchmark(dynamic(10),
               dynamic(100),
               dynamic(1000),
               dynamic(10000),
               preallocated(10),
               preallocated(100),
               preallocated(1000),
               preallocated(10000))
## Unit: microseconds
##                 expr      min        lq       mean    median        uq
##          dynamic(10)    4.302    4.8495    5.75981    5.3815    6.2745
##         dynamic(100)   34.733   36.9985   39.01158   38.0150   40.3305
##        dynamic(1000)  312.952  325.1245  384.67507  330.9940  338.5085
##       dynamic(10000) 3045.775 3116.5280 4043.69206 3186.0660 3416.4905
##     preallocated(10)    2.517    2.8455    3.92232    3.1150    4.1320
##    preallocated(100)   10.605   11.4570   13.44438   11.9510   14.2180
##   preallocated(1000)   89.400   91.2730  113.40652   94.0870  102.4435
##  preallocated(10000)  882.232  895.7860 1042.75275  906.0015 1010.1895
##        max neval
##     10.895   100
##     58.897   100
##   3643.213   100
##  54640.227   100
##     11.720   100
##     35.359   100
##   1649.024   100
##   4624.827   100

Preallocating the sequence variable performed better in every case. Intriguingly, the speed advantage of preallocating sequence vs. dynamically growing it actually increases with larger sizes. The difference in speed comes from how R handles stored objects in memory. Every time it has to make sequence bigger, R actually has to create a copy of sequence at the new size, and only then write i into the new object. By preallocating sequence at its final size, we avoid all of this unnecessary copying.

microbenchmark(
  dynamic(10000),
  preallocated(10000)
)
## Unit: microseconds
##                 expr      min        lq     mean    median       uq
##       dynamic(10000) 2937.264 3035.9860 3271.389 3084.1785 3205.791
##  preallocated(10000)  877.958  895.9275 1040.761  907.2745  940.223
##       max neval
##  4992.752   100
##  8034.196   100

This applies to complex structures like dataframes as well. Never use a function like rbind() to dynamically grow your dataframe. An excellent way of sidestepping this issue is to use the corresponding tidyverse functionality.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(nycflights13)
microbenchmark(
  rbind = {
    results <- NA
    for (row in 1:nrow(airports)) {
      results <- rbind(airports[row, ], results)
    }
  },
  # map_df returns a dataframe
  map_df = 1:nrow(airports) %>% map_df(~airports[.x, ]),
  times = 5
)
## Unit: milliseconds
##    expr       min        lq      mean    median        uq       max neval
##   rbind 1083.8573 1085.1102 1101.7861 1090.9935 1107.7997 1141.1700     5
##  map_df  417.6141  421.7491  422.7264  421.8737  424.2527  428.1426     5
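If you prefer an explicit loop, another common pattern (shown here as a sketch rather than as part of the benchmark above) is to preallocate a list, fill in one piece per iteration, and combine everything with a single call to bind_rows() at the end:

# Sketch: preallocate a list of pieces, fill it in a loop,
# then combine them once at the end with dplyr's bind_rows()
pieces <- vector("list", nrow(airports))
for (row in 1:nrow(airports)) {
  pieces[[row]] <- airports[row, ]
}
results <- bind_rows(pieces)

This combines the preallocation advice from the previous section with a single, final combine step, so no intermediate dataframes are copied on every iteration.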

Use vectorized code

Many of R’s core functions actually use ultra-fast code written in other programming languages like C and Fortran. A major way to take advantage of this is to use R’s vectorized operators and functions (ones that take a vector as input and return a vector as output). A simple example is adding a number to a vector: in the first case, we simply add 10 to a set of 1000 random numbers with the + operator. In the second case, we loop over all 1000 numbers and add 10 to each. We’ll also use map() to see how the tidyverse compares.

randoms <- runif(1000)

microbenchmark(vectorized = randoms + 10, 
               map = map(randoms, ~.x + 10),
               for_loop = for (i in 1:1000) {
                 randoms[i] + 10
               })
## Unit: microseconds
##        expr      min        lq       mean    median        uq      max
##  vectorized    2.071    3.0035    3.94643    3.7795    4.5705    7.655
##         map  817.490  847.7375  934.70673  865.8595  892.3615 2929.554
##    for_loop 2491.274 2543.4375 2694.67620 2576.5930 2636.8385 4699.892
##  neval
##    100
##    100
##    100

Generally speaking, vectorized code is the fastest choice. Use it whenever you can!

Avoid creating unnecessary variables - Part II

The more things you make your computer do, the longer it will take. Saving variables with <- is not always an instantaneous operation, especially if the variables you are saving are comparatively large. This is true even for reassignments like x <- x + 1. In this example we will calculate the mean gdpPercap for a particular country using the gapminder dataset. In our extra_variables() function, we save all of the data for the country of interest as countryData, then calculate the mean gdpPercap from it. In our less_variables() function, we calculate the mean gdpPercap in a single step.

library(gapminder)

extra_variables <- function(country) {
  countryData <- gapminder[gapminder$country == country, ]
  return(mean(countryData$gdpPercap))
}

less_variables <- function(country) {
  return(mean(gapminder$gdpPercap[gapminder$country == country]))
}

microbenchmark(
  extra_variables = extra_variables("Germany"), 
  less_variables = less_variables("Germany")
)
## Unit: microseconds
##             expr     min       lq     mean   median       uq      max
##  extra_variables 518.789 525.6185 586.3183 528.9365 532.5105 4039.341
##   less_variables 199.159 202.6400 266.4636 204.5095 206.2915 3476.021
##  neval
##    100
##    100

Avoid using the hard disk if you can

Reading from or writing to your computer’s hard disk is an extremely slow operation. Objects stored in memory (the R workspace) are orders of magnitude faster to access. If one of your scripts reads from or writes to disk repeatedly, your code will be limited by your disk I/O speed, which is well below your memory I/O speed.

As an example of this, let’s save the flights dataset in memory and then read it back. We will then do the same thing, but instead of using memory, we will save it to disk and then read it back.

microbenchmark(
  memory = {
    new_flights <- flights
    head(new_flights)
  },
  disk = {
    write.csv(flights, "flights.csv")
    new_flights <- read.csv("flights.csv")
    head(new_flights)
  },
  times = 1
)
## Unit: microseconds
##    expr          min           lq         mean       median           uq
##  memory      558.677      558.677      558.677      558.677      558.677
##    disk 11274470.967 11274470.967 11274470.967 11274470.967 11274470.967
##           max neval
##       558.677     1
##  11274470.967     1

Using memory was much faster than using the disk. However, if you do need to use the disk somewhere in your analysis, there are a few things you can do to speed things along.

When you write to disk, use readr

readr is an I/O library from the tidyverse. It has a number of improvements over base R’s I/O functions, both in terms of usability and performance.

microbenchmark(
  base_r = {
    write.csv(flights, "flights.csv")
    new_flights <- read.csv("flights.csv")
    head(new_flights)
  },
  readr = {
    write_csv(flights, "flights.csv")
    new_flights <- read_csv("flights.csv")
    head(new_flights)
  },
  times = 1
)
## Parsed with column specification:
## cols(
##   year = col_integer(),
##   month = col_integer(),
##   day = col_integer(),
##   dep_time = col_integer(),
##   sched_dep_time = col_integer(),
##   dep_delay = col_integer(),
##   arr_time = col_integer(),
##   sched_arr_time = col_integer(),
##   arr_delay = col_integer(),
##   carrier = col_character(),
##   flight = col_integer(),
##   tailnum = col_character(),
##   origin = col_character(),
##   dest = col_character(),
##   air_time = col_integer(),
##   distance = col_integer(),
##   hour = col_integer(),
##   minute = col_integer(),
##   time_hour = col_datetime(format = "")
## )
## Unit: seconds
##    expr      min       lq     mean   median       uq      max neval
##  base_r 9.869690 9.869690 9.869690 9.869690 9.869690 9.869690     1
##   readr 2.799264 2.799264 2.799264 2.799264 2.799264 2.799264     1

In this case, using readr’s I/O functions was roughly 3.5 times as fast.

One big write beats a thousand little writes

One thing scientists really like to do is iterate through a dataset, appending the results to a file after every iteration. This is extremely slow, especially on compute clusters where network filesystem latency is high. Don’t do it. Wherever possible, aggregate the results in memory, then write them to a file all at once.

microbenchmark(
  individual_writes = {
    for (row in 1:10000) {
      write_csv(flights[row, ], "flights.csv", append = TRUE)
    }
  },
  sequential_write = {
    write_csv(flights[1:10000, ], "flights.csv")
  },
  times = 1
)
## Unit: milliseconds
##               expr        min         lq       mean     median         uq
##  individual_writes 7759.64379 7759.64379 7759.64379 7759.64379 7759.64379
##   sequential_write   48.24504   48.24504   48.24504   48.24504   48.24504
##         max neval
##  7759.64379     1
##    48.24504     1

Avoid type coercion

R is meant to be a user-friendly language. It sometimes does things behind the scenes without telling us, usually to make our lives easier. One of these is automatic type coercion: if data of a new type is added to a collection of elements of a different type, R will silently convert the whole data structure, which carries a significant performance hit. This often happens without us knowing, and is especially common when working with integers.

microbenchmark(
  type_coercion = {
    integers <- as.integer(1:100000)
    integers[7] <- as.numeric(7.3)
  },
  no_coercion = {
    integers <- as.integer(1:100000)
    integers[7] <- as.integer(7.3)
  }
)
## Unit: microseconds
##           expr     min       lq     mean  median       uq      max neval
##  type_coercion 237.261 327.2840 325.8342 331.289 338.8950  380.453   100
##    no_coercion  94.456 117.0215 176.7763 119.051 123.5325 5311.687   100

This is a fairly tame example, but when working with a matrix of millions (or billions) of elements, the performance hit can be massive. The type-specific map functions from purrr (map_int(), map_dbl(), and friends) are a useful tool here, because they will throw an error if an unwanted type tries to sneak in and coerce our dataset.
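For example (a minimal sketch - purrr is loaded as part of the tidyverse), map_int() insists that every result be an integer and fails loudly instead of silently coercing:

map_int(1:3, ~ .x + 1L)    # fine: every result is an integer
# map_int(1:3, ~ .x + 0.5) # errors: the results are doubles that cannot be converted to integers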

Calling code from other languages

Throughout this tutorial, we’ve mentioned that other languages like C++ and Fortran are much faster than R. It is possible to write functions in these other languages that we can then use in R, resulting in a massive speedup over their R equivalents. Doing this won’t be covered here (this tutorial is intended as a “pure R” lesson that does not require prerequisite knowledge), but you can see an excellent overview of doing so in Hadley Wickham’s Advanced R book.