This month we share the article “R Tip: Break up Function Nesting for Legibility” by John Mount on R-bloggers.
There are a number of easy ways to avoid illegible code nesting problems in R.
In this R tip we will expand upon the above statement with a simple example.
At some point it becomes illegible and undesirable to compose operations by nesting them, such as in the following code.
head(mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")])
#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25
One popular way to break up nesting is to use magrittr's “%>%” in combination with dplyr transform verbs, as we show below.
library("dplyr")
mtcars %>%
  filter(cyl == 8) %>%
  select(mpg, cyl, wt) %>%
  head
#    mpg cyl   wt
# 1 18.7   8 3.44
# 2 14.3   8 3.57
# 3 16.4   8 4.07
# 4 17.3   8 3.73
# 5 15.2   8 3.78
# 6 10.4   8 5.25
Note: the above code lost (without warning) the row names that are part of mtcars; row-name handling may differ between dplyr versions. We also pass over the details of how pipe notation works. It is sufficient to say the notational convention is: each stage is treated, approximately, as a function call with the value of the pipeline up to that point inserted as a new first argument.
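The first-argument convention can be seen directly with base R's native pipe. This is only a sketch, and it rests on two assumptions: it needs R 4.1 or newer, and it uses `|>` rather than magrittr's `%>%`. The two pipes differ in their details, but both insert the left-hand value as the first argument of the next call.

```r
# Assumes R >= 4.1 for the native pipe `|>`; this is not magrittr's `%>%`.

# The parser rewrites a piped stage into an ordinary call, with the
# left-hand side inserted as the first argument.
piped_call  <- quote(mtcars |> subset(cyl == 8))
nested_call <- quote(subset(mtcars, cyl == 8))
print(piped_call)
same_call <- identical(piped_call, nested_call)

# So a pipeline and its nested form compute the same value.
piped  <- mtcars |> subset(cyl == 8) |> head()
nested <- head(subset(mtcars, cyl == 8))
same_value <- identical(piped, nested)
```

Both `same_call` and `same_value` should come out `TRUE`, which is the sense in which a pipeline is nested calls written in reading order.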
Many R users already routinely avoid nested notation problems through a convention I call “name re-use.” Such code looks like the following.
result <- mtcars
result <- filter(result, cyl == 8)
result <- select(result, mpg, cyl, wt)
head(result)
The above convention is enough to get around all problems of nesting. It also has the great advantage that it is step-debuggable.
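As a rough illustration of what step-debuggable buys you, the sketch below puts checks between the steps. It uses base R stand-ins for the dplyr verbs so that it runs without extra packages; the variable `rows_after_filter` is a hypothetical name introduced only for this example.

```r
# Base R stand-ins for filter() and select(), so the sketch has no dependencies.
result <- mtcars
result <- result[result$cyl == 8, , drop = FALSE]

# Each step is its own statement, so a check (or a breakpoint, or a call
# to browser()) can sit between any two steps.
stopifnot(nrow(result) > 0, all(result$cyl == 8))
rows_after_filter <- nrow(result)

result <- result[, c("mpg", "cyl", "wt"), drop = FALSE]
stopifnot(identical(names(result), c("mpg", "cyl", "wt")))

head(result)
```

In a nested or piped expression the intermediate values never get a name, so the same checks would need the expression to be taken apart first.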
I like a variation I call “dot intermediates”, which looks like the code below (notice we are switching back from dplyr verbs to base R operators).
. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .
head(result)
#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25
The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. As with all conventions, it is just a matter of teaching, learning, and repetition to make this seem natural, familiar, and legible.
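One way to check the "correct" part of that claim is to compare the dot-intermediate result with the nested expression from the start of the article. The sketch below uses only base R; the names `nested`, `matches`, and `keeps_row_names` are introduced here for illustration.

```r
# Dot-intermediate version, as in the text.
. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .

# The nested expression from the start of the article.
nested <- mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")]

# Compare values and attributes, then the row names specifically.
matches <- isTRUE(all.equal(result, nested))
keeps_row_names <- identical(rownames(result), rownames(nested))
```

Both flags should be `TRUE`: the base R steps select the same rows and columns as the nested form and carry the car names along as row names.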
library("dplyr")
library("microbenchmark")
library("ggplot2")
timings <- microbenchmark(
  base = {
    . <- mtcars
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr = {
    mtcars %>%
      filter(cyl == 8) %>%
      select(mpg, cyl, wt) %>%
      nrow
  })
print(timings)
## Unit: microseconds
##   expr      min       lq      mean   median       uq       max neval
##   base  122.948  136.948  167.2253  159.688  179.924   349.328   100
##  dplyr 1570.188 1654.700 2537.2912 1699.744 1785.611 50759.770   100
autoplot(timings)
[Figure: durations for related tasks, smaller is better.]
Contrary to what many repeat, base R is often faster than the dplyr alternative. In this case base R is about 15 times faster by mean time, or about 11 times faster by median time (possibly due to magrittr overhead and the small size of this example). We also see that, with some care, base R can be quite legible. dplyr is a useful tool and convention; however, it is not the only allowed tool or the only allowed convention.
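The speed-up factor can be worked out from the timing table above. The short calculation below copies the mean and median columns by hand; the vector names are introduced here for illustration.

```r
# Mean and median timings in microseconds, copied from the printed table.
mean_us   <- c(base = 167.2253, dplyr = 2537.2912)
median_us <- c(base = 159.688,  dplyr = 1699.744)

# How many times longer the dplyr pipeline took than the base R steps.
mean_ratio   <- unname(mean_us["dplyr"] / mean_us["base"])
median_ratio <- unname(median_us["dplyr"] / median_us["base"])

round(c(mean = mean_ratio, median = median_ratio), 1)
```

The mean ratio comes out near 15 and the median ratio near 11. The dplyr mean is pulled up by a single slow run (the 50759.770 microsecond maximum), so the median is arguably the steadier of the two comparisons.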