Stat Tricks of the Month

By Zheyu Wang posted 04-04-2018 00:50

  

This month we share the article “R Tip: Break up Function Nesting for Legibility” by John Mount on R-bloggers.

There are a number of easy ways to avoid illegible code nesting problems in R.

In this R tip we will expand upon the above statement with a simple example.

At some point it becomes illegible and undesirable to compose operations by nesting them, such as in the following code.

head(mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")])

#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25

One popular way to break up nesting is to use magrittr’s “%>%” pipe in combination with dplyr transform verbs, as we show below.

library("dplyr")

mtcars                 %>%
  filter(cyl == 8)     %>%
  select(mpg, cyl, wt) %>%
  head

#    mpg cyl   wt
# 1 18.7   8 3.44
# 2 14.3   8 3.57
# 3 16.4   8 4.07
# 4 17.3   8 3.73
# 5 15.2   8 3.78
# 6 10.4   8 5.25

Note: the above code lost (without warning) the row names that are part of mtcars. We also pass over the details of how pipe notation works; it is enough to say that each stage is treated, approximately, as a function call whose newly inserted first argument is the value of the pipeline up to that point.
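The pipe convention just described can be illustrated with a small sketch. It assumes the magrittr package is installed; the helper `scale_by()` is a hypothetical function made up for this illustration and is not part of the original article or of any package.

```r
library("magrittr")

# Hypothetical helper, defined only for this illustration.
scale_by <- function(x, factor) {
  x * factor
}

# Piped form: the left-hand value is inserted as the first argument.
piped <- c(1, 2, 3) %>% scale_by(10)

# Equivalent direct call, with the first argument written out.
direct <- scale_by(c(1, 2, 3), 10)

# Both forms yield the same value.
identical(piped, direct)
```

Read this way, a long pipeline is a sequence of ordinary function calls, each taking the previous result as its first argument.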

Many R users already routinely avoid nested notation problems through a convention I call “name re-use.” Such code looks like the following.

result <- mtcars
result <- filter(result, cyl == 8)
result <- select(result, mpg, cyl, wt)
head(result)

The above convention is enough to get around all problems of nesting. It also has the great advantage that it is step-debuggable.
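As a rough illustration of what "step-debuggable" means here, the sketch below applies the same name re-use pattern with base R operators (an assumption made so the snippet runs without dplyr) and checks the intermediate value between steps. The `stopifnot()` checks are illustrative additions, not part of the original article.

```r
# Name re-use with base R: each step is its own statement, so the
# intermediate value can be inspected or checked between steps.
result <- mtcars

result <- result[result$cyl == 8, ]          # keep the 8-cylinder cars
stopifnot(all(result$cyl == 8))              # check the filtering step

result <- result[, c("mpg", "cyl", "wt")]    # keep three columns
stopifnot(identical(names(result), c("mpg", "cyl", "wt")))

head(result)
```

Because every stage is a separate statement, a debugger can pause on any line, and the current `result` can be printed before moving on.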

I like a variation I call “dot intermediates”, which looks like the code below (notice we are switching back from dplyr verbs to base R operators).

. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .
head(result)

#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25

The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. Like all conventions, it is just a matter of teaching, learning, and repetition to make this seem natural, familiar, and legible.

library("dplyr")
library("microbenchmark")
library("ggplot2")

timings <- microbenchmark(
  base = {
    . <- mtcars
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr = {
    mtcars                 %>%
      filter(cyl == 8)     %>%
      select(mpg, cyl, wt) %>%
      nrow
  })

print(timings)

## Unit: microseconds
##   expr      min       lq      mean   median       uq       max neval
##   base  122.948  136.948  167.2253  159.688  179.924   349.328   100
##  dplyr 1570.188 1654.700 2537.2912 1699.744 1785.611 50759.770   100

autoplot(timings)

Durations for related tasks, smaller is better.

Contrary to what many repeat, base R is often faster than the dplyr alternative. In this case the base R code is about 15 times faster by mean time, or about 11 times faster by median time (possibly due to magrittr overhead and the small size of this example). We also see that, with some care, base R can be quite legible. dplyr is a useful tool and convention; however, it is not the only allowed tool or the only allowed convention.
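The speed-up factor can be checked against the printed timing summary. The sketch below simply copies the mean and median columns from that output by hand (all values in microseconds) and divides them; the variable names are illustrative.

```r
# Values copied from the printed microbenchmark summary (microseconds).
base_mean    <- 167.2253
dplyr_mean   <- 2537.2912
base_median  <- 159.688
dplyr_median <- 1699.744

# Ratio of dplyr time to base R time; larger means base R is faster.
mean_ratio   <- dplyr_mean / base_mean       # roughly 15.2
median_ratio <- dplyr_median / base_median   # roughly 10.6

round(c(mean = mean_ratio, median = median_ratio), 1)
```

The mean is pulled upward by the single slow dplyr run visible in the max column, so the median ratio may be the more representative of the two.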

 
