I recently watched the “Tidy eval: Programming with dplyr, tidyr, and ggplot2” video. It’s an excellent introduction to tidy evaluation, the core concept behind programming with dplyr and friends.

In the video, Hadley shows the grouped_mean function on a slide (12:48). Implementing this function is a good exercise in tidy evaluation, and an excellent opportunity to compare that approach with the standard evaluation rules provided by the seplyr package.

Let’s start with a simple example:

library(dplyr)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(mean = mean(hp))
## # A tibble: 3 x 2
##     cyl  mean
##   <dbl> <dbl>
## 1     4  82.6
## 2     6 122. 
## 3     8 209.

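The code below shows the first version of this function (based on what I learned from the video). enquo() captures the expression the caller supplied for each argument, and !! unquotes it inside the dplyr verb.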

grouped_mean <- function(dt, group, value) {
  group <- enquo(group)
  value <- enquo(value)
  dt %>% 
    group_by(!!group) %>% 
    summarise(mean = mean(!!value))
}

Let’s try it:

grouped_mean(mtcars, cyl, hp)
## # A tibble: 3 x 2
##     cyl  mean
##   <dbl> <dbl>
## 1     4  82.6
## 2     6 122. 
## 3     8 209.
grouped_mean(mtcars, gear, mpg)
## # A tibble: 3 x 2
##    gear  mean
##   <dbl> <dbl>
## 1     3  16.1
## 2     4  24.5
## 3     5  21.4
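
For comparison, here is a minimal sketch of the same function written with the embrace operator {{ }}. This is an assumption on my side about the reader's setup: the operator needs rlang 0.4.0 or later with a matching dplyr, which is newer than the versions listed in the session info of this post. The name grouped_mean_embrace is made up for illustration.

```r
library(dplyr)

# {{ arg }} captures the argument and unquotes it in one step,
# so the explicit enquo() / !! pair is not needed.
# Assumes rlang >= 0.4.0 (newer than the versions used in this post).
grouped_mean_embrace <- function(dt, group, value) {
  dt %>%
    group_by({{ group }}) %>%
    summarise(mean = mean({{ value }}))
}

grouped_mean_embrace(mtcars, cyl, hp)
```

It should give the same result as grouped_mean(mtcars, cyl, hp) above.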

But what if we want to group by more than one variable? This use case is described here in the section “Capturing multiple variables”. The second version might look like this (I had to change the order of the arguments, because the grouping variables are now collected by ... at the end):

grouped_mean2 <- function(dt, value, ...) {
  value <- enquo(value)
  groups <- quos(...)
  dt %>% 
    group_by(!!!groups) %>%
    summarise(mean = mean(!!value))
}

grouped_mean2(mtcars, mpg) # without grouping
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  20.1
grouped_mean2(mtcars, mpg, cyl) # one variable used for grouping
## # A tibble: 3 x 2
##     cyl  mean
##   <dbl> <dbl>
## 1     4  26.7
## 2     6  19.7
## 3     8  15.1
grouped_mean2(mtcars, mpg, cyl, gear) # two variables
## # A tibble: 8 x 3
## # Groups:   cyl [?]
##     cyl  gear  mean
##   <dbl> <dbl> <dbl>
## 1     4     3  21.5
## 2     4     4  26.9
## 3     4     5  28.2
## 4     6     3  19.8
## 5     6     4  19.8
## 6     6     5  19.7
## 7     8     3  15.0
## 8     8     5  15.4
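
A small variation on grouped_mean2, shown only as a sketch: since the dots are not modified, they can be forwarded to group_by() directly instead of being captured with quos() and spliced with !!!. I also added ungroup() at the end so the result carries no leftover grouping (the "Groups: cyl" header above); that is my own choice, not something from the video. The name grouped_mean_dots is hypothetical.

```r
library(dplyr)

grouped_mean_dots <- function(dt, value, ...) {
  value <- enquo(value)
  dt %>%
    group_by(...) %>%                    # forward the grouping variables as they are
    summarise(mean = mean(!!value)) %>%
    ungroup()                            # drop the grouping that summarise() leaves behind
}

grouped_mean_dots(mtcars, mpg)            # without grouping
grouped_mean_dots(mtcars, mpg, cyl, gear) # two grouping variables
```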

seplyr

However, we might want to pass the column names as strings, and then nonstandard evaluation gets in the way. The seplyr package provides another interface to dplyr that accepts a vector of strings. It works perfectly for grouping, but for other functions like summarise or mutate it’s not as elegant as the tidy solution.

library(seplyr)

grouped_mean_se <- function(dt, group, value) {
  # I pass the R code to summarise_se as a string
  # it's not very elegant:(
  dt %>% 
    group_by_se(group) %>%
    summarise_se(setNames(sprintf("mean(`%s`)", value), "mean"))
}

grouped_mean_se(mtcars, "cyl", "hp")
## # A tibble: 3 x 2
##     cyl  mean
##   <dbl> <dbl>
## 1     4  82.6
## 2     6 122. 
## 3     8 209.
grouped_mean_se(mtcars, "gear", "mpg")
## # A tibble: 3 x 2
##    gear  mean
##   <dbl> <dbl>
## 1     3  16.1
## 2     4  24.5
## 3     5  21.4

The good thing about this solution is that grouping by multiple columns works without any modifications. See the example below:

grouped_mean_se(mtcars, c("gear", "cyl"), "mpg")
## # A tibble: 8 x 3
## # Groups:   gear [?]
##    gear   cyl  mean
##   <dbl> <dbl> <dbl>
## 1     3     4  21.5
## 2     3     6  19.8
## 3     3     8  15.0
## 4     4     4  26.9
## 5     4     6  19.8
## 6     5     4  28.2
## 7     5     6  19.7
## 8     5     8  15.4

You can combine the seplyr approach with tidyeval to make it nicer. Note that rlang::parse_quosure works like enquo, but it builds the quosure from the string stored in the variable rather than from the expression passed by the caller.

grouped_mean_se2 <- function(dt, group, value) {
  value <- rlang::parse_quosure(value)
  dt %>% 
    group_by_se(group) %>%
    summarise(mean = mean(!!value))
}

grouped_mean_se2(mtcars, c("gear", "cyl"), "hp")
## # A tibble: 8 x 3
## # Groups:   gear [?]
##    gear   cyl  mean
##   <dbl> <dbl> <dbl>
## 1     3     4   97 
## 2     3     6  108.
## 3     3     8  194.
## 4     4     4   76 
## 5     4     6  116.
## 6     5     4  102 
## 7     5     6  175 
## 8     5     8  300.
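
String column names can also be handled with dplyr and rlang alone. The sketch below is an alternative I am adding for comparison, not part of seplyr: rlang::syms() turns the character vector into a list of symbols that can be spliced into group_by(), and the .data pronoun looks up the value column by name. The function name grouped_mean_str is hypothetical.

```r
library(dplyr)

grouped_mean_str <- function(dt, group, value) {
  groups <- rlang::syms(group)             # character vector -> list of symbols
  dt %>%
    group_by(!!!groups) %>%                # splice the symbols in as grouping variables
    summarise(mean = mean(.data[[value]])) # look up the column by its name
}

grouped_mean_str(mtcars, c("gear", "cyl"), "hp")
```

The call mirrors grouped_mean_se2(mtcars, c("gear", "cyl"), "hp") above.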

There are also other ways to combine the tidyeval approach with seplyr. One that seems useful is to pass the grouping variables as a string vector, but keep dplyr’s usual nonstandard evaluation in summarise.

grouped_summarise <- function(dt, group, ...) {
  dt %>% 
    group_by_se(group) %>%
    summarise(...)
}

grouped_summarise(
  mtcars, "gear",
  mean_hp = mean(hp),
  mean_mpg = mean(mpg)
)
## # A tibble: 3 x 3
##    gear mean_hp mean_mpg
##   <dbl>   <dbl>    <dbl>
## 1     3   176.      16.1
## 2     4    89.5     24.5
## 3     5   196.      21.4
grouped_summarise(
  mtcars, c("gear", "cyl"),
  mean_hp = mean(hp),
  mean_mpg = mean(mpg),
  n = n()
)
## # A tibble: 8 x 5
## # Groups:   gear [?]
##    gear   cyl mean_hp mean_mpg     n
##   <dbl> <dbl>   <dbl>    <dbl> <int>
## 1     3     4     97      21.5     1
## 2     3     6    108.     19.8     2
## 3     3     8    194.     15.0    12
## 4     4     4     76      26.9     8
## 5     4     6    116.     19.8     4
## 6     5     4    102      28.2     2
## 7     5     6    175      19.7     1
## 8     5     8    300.     15.4     2

The same function written with standard evaluation only is a bit less elegant, because the user needs to pass the summarise expressions as strings. This might be a problem, because syntax highlighting and code analysis tools do not work inside strings. But this approach might sometimes be useful.

grouped_summarise_se <- function(dt, group, vals) {
  dt %>% 
    group_by_se(group) %>%
    summarise_se(summarizeTerms = vals)
}

grouped_summarise_se(
  mtcars, "gear",
  vals = list(
    mean_hp = "mean(hp)",
    mean_mpg = "mean(mpg)")
)
## # A tibble: 3 x 3
##    gear mean_hp mean_mpg
##   <dbl>   <dbl>    <dbl>
## 1     3   176.      16.1
## 2     4    89.5     24.5
## 3     5   196.      21.4
grouped_summarise_se(
  mtcars, c("gear", "cyl"),
  vals = list(
    mean_hp = "mean(hp)",
    mean_mpg = "mean(mpg)",
    n = "n()"
  )
)
## # A tibble: 8 x 5
## # Groups:   gear [?]
##    gear   cyl mean_hp mean_mpg     n
##   <dbl> <dbl>   <dbl>    <dbl> <int>
## 1     3     4     97      21.5     1
## 2     3     6    108.     19.8     2
## 3     3     8    194.     15.0    12
## 4     4     4     76      26.9     8
## 5     4     6    116.     19.8     4
## 6     5     4    102      28.2     2
## 7     5     6    175      19.7     1
## 8     5     8    300.     15.4     2
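
To make it clearer what happens to those strings, here is a rough sketch of the same idea using only dplyr and rlang. It is not how seplyr is implemented; it only illustrates the technique. Each string is parsed into an expression with rlang::parse_expr(), and the named list of expressions is spliced into summarise() with !!!. The name grouped_summarise_str is hypothetical.

```r
library(dplyr)

grouped_summarise_str <- function(dt, group, vals) {
  # lapply() keeps the list names, so they become the output column names
  exprs <- lapply(vals, rlang::parse_expr)
  dt %>%
    group_by(!!!rlang::syms(group)) %>%
    summarise(!!!exprs)
}

grouped_summarise_str(
  mtcars, "gear",
  vals = list(
    mean_hp = "mean(hp)",
    mean_mpg = "mean(mpg)"
  )
)
```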

wrapr

The last topic related to nonstandard evaluation is the wrapr package. It allows substituting a variable name in a code block with something else. Consider this simple example, in which the variable VALUE is replaced by xxx. I set the eval parameter to FALSE to capture the expression without evaluating it. For more information please check the articles here or here.

value <- "xxx"
wrapr::let(
    c(VALUE = value), eval = FALSE,
    dt %>% 
      group_by_se(group) %>%
      summarise(mean = mean(VALUE))
)
## dt %>% group_by_se(group) %>% summarise(mean = mean(xxx))
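
The idea of name substitution can be illustrated with base R alone. The sketch below is only a conceptual analogy and makes no claim about how wrapr::let works internally: it uses substitute() to replace the placeholder VALUE in a quoted expression with the symbol named by a string, and then evaluates the result with the data frame columns in scope.

```r
value <- "hp"

# a quoted template with the placeholder VALUE
template <- quote(mean(VALUE))

# replace the symbol VALUE with the symbol named by the string in `value`
expr <- do.call(substitute, list(template, list(VALUE = as.name(value))))
expr

# evaluate the substituted expression using the columns of mtcars
eval(expr, mtcars)
```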

So the final version of grouped_mean using wrapr::let might look like this (and for me, it’s the most elegant solution if we want to use standard evaluation rules and pass string arguments):

grouped_mean_wrapr <- function(dt, group, value) {
  wrapr::let(
    c(VALUE = value),
    dt %>% 
      group_by_se(group) %>%
      summarise(mean = mean(VALUE))
  )
}

grouped_mean_wrapr(mtcars, c("cyl", "gear"), "hp")
## # A tibble: 8 x 3
## # Groups:   cyl [?]
##     cyl  gear  mean
##   <dbl> <dbl> <dbl>
## 1     4     3   97 
## 2     4     4   76 
## 3     4     5  102 
## 4     6     3  108.
## 5     6     4  116.
## 6     6     5  175 
## 7     8     3  194.
## 8     8     5  300.
codetools::checkUsage(grouped_mean_wrapr, all = TRUE)
## <anonymous>: no visible binding for global variable 'VALUE' (<text>:2-7)

But there’s one caveat. Automatic code-checking tools (like codetools::checkUsage) might treat VALUE as an undefined variable. This can cause a warning in R CMD check, so such code would have trouble getting onto CRAN. The easy fix is to use the name value instead of VALUE inside let. However, I think that uppercase variable names are the better choice, because they’re more visible and make it easier to see which variables are going to be substituted inside the code block. So the other solution is to create an empty variable VALUE to turn off this warning.

grouped_mean_wrapr_clean <- function(dt, group, value) {
  
  VALUE <- NULL
  wrapr::let(
    c(VALUE = value),
    dt %>% 
      group_by_se(group) %>%
      summarise(mean = mean(VALUE))
  )
}

codetools::checkUsage(grouped_mean_wrapr_clean, all = TRUE)

Summary

In this post I tried to show how you can program with dplyr, which is based on the tidyeval principle, and with two other approaches: seplyr, which is mostly dplyr with standard evaluation rules, and wrapr::let, which uses substitution to build the expected code. Of these three approaches, my gut feeling tells me that wrapr::let is the most elegant and precise, but I can’t tell whether it is sufficient. Probably all three have their use cases.

Session info

R.version
##                _                           
## platform       x86_64-pc-linux-gnu         
## arch           x86_64                      
## os             linux-gnu                   
## system         x86_64, linux-gnu           
## status                                     
## major          3                           
## minor          5.1                         
## year           2018                        
## month          07                          
## day            02                          
## svn rev        74947                       
## language       R                           
## version.string R version 3.5.1 (2018-07-02)
## nickname       Feather Spray
packageVersion("dplyr")
## [1] '0.7.6'
packageVersion("seplyr")
## [1] '0.5.9'
packageVersion("wrapr")
## [1] '1.6.1'