Performance optimization
Now that we know how to measure our code’s speed, we can focus on making it faster. We’ve actually covered a number of important performance optimizations already.
To recap:
- Avoid memory bloat and costly reassignment by doing things in-place: do not create unnecessary variables.
- Factors are an efficient way of saving memory when working with repetitive strings (a quick sketch of this follows the list).
- Avoid indexing out-of-bounds. We’ll cover another variant of this here as well.
- tidyverse packages are typically optimized out of the box - as long as we’re using tidyverse functions, we can be reasonably confident that some optimization has already been done (these functions also sidestep several performance traps).
- The apply() function offers superior performance when working with matrices.
- We don’t need to do anything to gain the benefits of R’s JIT compiler - it’s on by default.
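As a quick illustration of the factor point (a sketch that is not part of the original lesson; the exact sizes will vary by platform and R version), base R’s object.size() shows the saving directly:
# Compare a repetitive character vector with the same data stored as a factor.
cities <- sample(c("Vancouver", "Toronto", "Montreal"), 100000, replace = TRUE)
object.size(cities)          # one string pointer stored per element
object.size(factor(cities))  # one integer code per element plus 3 levels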
We’ll now walk through a few other common optimizations we can use. This will be done by example - we’ll show exactly why each tip is here.
Preallocate your results
We’ll compare two versions of the same piece of code: in this case, just creating a sequence of numbers of a certain length. We will use a for loop for clarity. In the first case (dynamic), we will grow the variable sequence with each iteration of our loop. In the second case (preallocated), we will create sequence beforehand at its expected final size.
dynamic <- function(size) {
  # grow the result vector one element at a time
  sequence <- NA
  for (i in 1:size) {
    sequence[i] <- i
  }
  return(sequence)
}

preallocated <- function(size) {
  # create the result vector at its final size, then fill it in
  sequence <- rep(NA, size)
  for (i in 1:size) {
    sequence[i] <- i
  }
  return(sequence)
}
library(microbenchmark)
microbenchmark(dynamic(10),
               dynamic(100),
               dynamic(1000),
               dynamic(10000),
               preallocated(10),
               preallocated(100),
               preallocated(1000),
               preallocated(10000))
## Unit: microseconds
## expr min lq mean median uq
## dynamic(10) 4.302 4.8495 5.75981 5.3815 6.2745
## dynamic(100) 34.733 36.9985 39.01158 38.0150 40.3305
## dynamic(1000) 312.952 325.1245 384.67507 330.9940 338.5085
## dynamic(10000) 3045.775 3116.5280 4043.69206 3186.0660 3416.4905
## preallocated(10) 2.517 2.8455 3.92232 3.1150 4.1320
## preallocated(100) 10.605 11.4570 13.44438 11.9510 14.2180
## preallocated(1000) 89.400 91.2730 113.40652 94.0870 102.4435
## preallocated(10000) 882.232 895.7860 1042.75275 906.0015 1010.1895
## max neval
## 10.895 100
## 58.897 100
## 3643.213 100
## 54640.227 100
## 11.720 100
## 35.359 100
## 1649.024 100
## 4624.827 100
Preallocating the sequence variable of our loop performed better in every case. Intriguingly, the speed advantage of preallocating sequence vs. dynamically growing it actually increases with larger sizes. The difference in speed comes from how R handles stored objects in memory: every time it has to make sequence bigger, R actually has to create a copy of sequence at the new size, and then add i to the new object. By preallocating sequence at its final size, we avoid all of this unnecessary copying.
microbenchmark(
  dynamic(10000),
  preallocated(10000)
)
## Unit: microseconds
## expr min lq mean median uq
## dynamic(10000) 2937.264 3035.9860 3271.389 3084.1785 3205.791
## preallocated(10000) 877.958 895.9275 1040.761 907.2745 940.223
## max neval
## 4992.752 100
## 8034.196 100
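R’s copying behaviour is easy to see for yourself: base R’s tracemem() prints a message whenever the tracked object is duplicated. A small sketch (not part of the original lesson; the exact output depends on your R version):
x <- c(1, 2, 3)
y <- x            # x and y now refer to the same underlying vector
tracemem(x)       # start reporting copies of x
x[1] <- 10        # modifying x forces a copy so that y is left unchanged
untracemem(x)     # stop reporting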
This applies to complex structures like dataframes as well. Never use a function like rbind to dynamically grow your dataframe. An excellent way of sidestepping this issue is to use the corresponding tidyverse functionality.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(nycflights13)
microbenchmark(
  rbind = {
    results <- NA
    for (row in 1:nrow(airports)) {
      results <- rbind(airports[row, ], results)
    }
  },
  # map_df returns a dataframe
  map_df = 1:nrow(airports) %>% map_df(~airports[.x, ]),
  times = 5
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## rbind 1083.8573 1085.1102 1101.7861 1090.9935 1107.7997 1141.1700 5
## map_df 417.6141 421.7491 422.7264 421.8737 424.2527 428.1426 5
Use vectorized code
Many of R’s core functions actually use ultra-fast code written in other programming languages like C and Fortran. One major example of this is R’s vectorized operators and functions (those that take a vector as input and return a vector as output). As a demonstration, we will add a number to a vector: in the first case, we simply add 10 to a set of 1000 random numbers with the + operator. In the second case, we loop over all 1000 numbers and add 10 to each. We’ll also use map to compare against the tidyverse.
randoms <- runif(1000)
microbenchmark(vectorized = randoms + 10,
               map = map(randoms, ~.x + 10),
               for_loop = for (i in 1:1000) {
                 randoms[i] + 10
               })
## Unit: microseconds
## expr min lq mean median uq max
## vectorized 2.071 3.0035 3.94643 3.7795 4.5705 7.655
## map 817.490 847.7375 934.70673 865.8595 892.3615 2929.554
## for_loop 2491.274 2543.4375 2694.67620 2576.5930 2636.8385 4699.892
## neval
## 100
## 100
## 100
Generally speaking, vectorized code is the fastest choice. Do it whenever you can!
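Many everyday loops already have a vectorized equivalent. A few illustrative patterns (a sketch, not from the original lesson), reusing the randoms vector from above:
sqrt(randoms)        # element-wise square root
sum(randoms > 0.5)   # count how many elements exceed 0.5
cumsum(randoms)      # running total of the elements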
Avoid creating unnecessary variables - Part II
The more things you make your computer do, the longer it will take. Saving variables with <- is not always an instantaneous operation, especially if the variables you are saving are comparatively large. This is true even for inline assignments (x <- x + 1). In this case we will calculate the mean gdpPercap for a particular country using the gapminder dataset. In our extra_variables() function, we will save all of the data for the country of interest as countryData, then calculate the mean gdpPercap. In the second case (less_variables()), we calculate the mean gdpPercap in a single step.
library(gapminder)
extra_variables <- function(country) {
countryData <- gapminder[gapminder$country == country, ]
return(mean(countryData$gdpPercap))
}
less_variables <- function(country) {
return(mean(gapminder$gdpPercap[gapminder$country == country]))
}
microbenchmark(
  extra_variables = extra_variables("Germany"),
  less_variables = less_variables("Germany")
)
## Unit: microseconds
## expr min lq mean median uq max
## extra_variables 518.789 525.6185 586.3183 528.9365 532.5105 4039.341
## less_variables 199.159 202.6400 266.4636 204.5095 206.2915 3476.021
## neval
## 100
## 100
Avoid using the hard disk if you can
Using the hard disk on your computer is an extremely slow operation. Objects stored in memory (the R workspace) are orders of magnitude faster to access. If one of your scripts is reading or writing to disk repeatedly, your code will be limited by your disk I/O speed, which is well below your memory I/O speed.
As an example of this, let’s save the flights dataset in memory and then read it back. We will then do the same thing, but instead of using memory, we will save it to disk and then read it back.
microbenchmark(
  memory = {
    new_flights <- flights
    head(new_flights)
  },
  disk = {
    write.csv(flights, "flights.csv")
    new_flights <- read.csv("flights.csv")
    head(new_flights)
  },
  times = 1
)
## Unit: microseconds
## expr min lq mean median uq
## memory 558.677 558.677 558.677 558.677 558.677
## disk 11274470.967 11274470.967 11274470.967 11274470.967 11274470.967
## max neval
## 558.677 1
## 11274470.967 1
Using memory was much faster than using the disk. However, if you do need to use the disk somewhere in your analysis, there are a few things you can do to speed things along.
When you write to disk, use readr
readr is an I/O library from the tidyverse. It has a number of improvements over base R’s I/O functions, both in terms of usability and performance.
microbenchmark(
  base_r = {
    write.csv(flights, "flights.csv")
    new_flights <- read.csv("flights.csv")
    head(new_flights)
  },
  readr = {
    write_csv(flights, "flights.csv")
    new_flights <- read_csv("flights.csv")
    head(new_flights)
  },
  times = 1
)
## Parsed with column specification:
## cols(
## year = col_integer(),
## month = col_integer(),
## day = col_integer(),
## dep_time = col_integer(),
## sched_dep_time = col_integer(),
## dep_delay = col_integer(),
## arr_time = col_integer(),
## sched_arr_time = col_integer(),
## arr_delay = col_integer(),
## carrier = col_character(),
## flight = col_integer(),
## tailnum = col_character(),
## origin = col_character(),
## dest = col_character(),
## air_time = col_integer(),
## distance = col_integer(),
## hour = col_integer(),
## minute = col_integer(),
## time_hour = col_datetime(format = "")
## )
## Unit: seconds
## expr min lq mean median uq max neval
## base_r 9.869690 9.869690 9.869690 9.869690 9.869690 9.869690 1
## readr 2.799264 2.799264 2.799264 2.799264 2.799264 2.799264 1
In this case, using readr’s I/O functions was roughly 3.5 times as fast.
One big write beats a thousand little writes
One thing scientists really like to do is iterate through a dataset and append the results to a file after every iteration. This is extremely slow, especially on compute clusters where network filesystem latency is high. Don’t do it. Wherever possible, aggregate the results in memory, then write them to a file all at once.
microbenchmark(
  individual_writes = {
    for (row in 1:10000) {
      write_csv(flights[row, ], "flights.csv", append = TRUE)
    }
  },
  sequential_write = {
    write_csv(flights[1:10000, ], "flights.csv")
  },
  times = 1
)
## Unit: milliseconds
## expr min lq mean median uq
## individual_writes 7759.64379 7759.64379 7759.64379 7759.64379 7759.64379
## sequential_write 48.24504 48.24504 48.24504 48.24504 48.24504
## max neval
## 7759.64379 1
## 48.24504 1
Avoid type coercion
R is meant to be a user-friendly language, and it sometimes does things behind the scenes without telling us, usually to make our lives easier. One of these is automatic type coercion: if a value of a new type is added to a collection of elements of a different type, R silently converts the entire data structure, which carries a real performance cost. This often happens without us knowing, and is especially common when working with integers.
microbenchmark(
  type_coercion = {
    integers <- as.integer(1:100000)
    integers[7] <- as.numeric(7.3)
  },
  no_coercion = {
    integers <- as.integer(1:100000)
    integers[7] <- as.integer(7.3)
  }
)
## Unit: microseconds
## expr min lq mean median uq max neval
## type_coercion 237.261 327.2840 325.8342 331.289 338.8950 380.453 100
## no_coercion 94.456 117.0215 176.7763 119.051 123.5325 5311.687 100
This is a fairly tame example, but when working with a matrix of millions (or billions) of elements, the performance hit can be massive. The map functions from purrr are a useful tool here, because they will throw an error if an unwanted type tries to sneak in and convert our dataset.
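As a quick illustration (a sketch, not part of the original lesson), the type-specific variants such as map_int() refuse to return anything but their declared type:
library(purrr)
map_int(1:3, ~ .x * 2L)            # fine: every result is an integer
# map_int(list(1L, 2.5, 3L), ~ .x) # errors instead of silently coercing - uncomment to see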
Calling code from other languages
Throughout this tutorial, we’ve mentioned that other languages like C++ and Fortran are much faster than R. It is possible to write functions in these languages and then call them from R, resulting in a massive speedup over their pure-R equivalents. We won’t cover how to do this here (this tutorial is intended as a “pure R” lesson that does not require prerequisite knowledge), but you can find an excellent overview in Hadley Wickham’s Advanced R book.