R Metaprogramming - Misc Patterns

This is a miscellaneous collection of metaprogramming examples I've used in production code.

Set up a docker container to run R code in

FROM rocker/rstudio:latest
# rocker/tidyverse didn't seem to have a prebuilt arm64 image, so
# we can build our own
RUN Rscript -e "install.packages(c('rlang', 'tidyverse', 'lobstr'))"

docker build . -t r_meta:latest

docker run --rm --name r_meta -dti r_meta

bba8f0404662ee369f8b4cce781a8c797016922e1a16e6eed826641840962ca4

Preliminaries

rlang provides a nice wrapper over base R's metaprogramming capabilities and extends it with a few useful facilities.

library(rlang)

Expressions and expression lists

R code (aka language objects) can be created, inspected, and modified much the same as any other object in R. To prevent code from being evaluated, it must be "quoted" which base R accomplishes with the quote function. rlang accomplishes this with the expr function, short for "expression".

(two_expr <- expr(1 + 1))
class(two_expr)
is.language(two_expr)

1 + 1
[1] "call"
[1] TRUE

Expression lists are unsurprisingly lists of expressions, easily made by exprs.

exprs(
  x + 5,
  unevaluated_call(y),
  z
)

[[1]]
x + 5

[[2]]
unevaluated_call(y)

[[3]]
z

Partial evaluation

Code evaluation need not be all or nothing, however. Sometimes it is useful to only partially evaluate an expression, which in R is usually referred to as "quasiquotation", sometimes "backquote". In base R, this is accomplished with the bquote function, which will accept part of an expression wrapped in .() to force evaluation.

numerator <- 5
bquote(.(numerator) / denominator)

5/denominator

rlang uses the "bang-bang" operator !!

expr(!!numerator / denominator)

5/denominator

Patterns / Examples

`filter` Expressions

Sometimes it may be useful to store filters in a variable such that they may be reused or be set as the result of some computation or argument to a function. For functions that support "tidy dots", expression lists may be "spliced" in to the function call, essentially unpacking the list (powered by rlang::list2). This is useful in applying multiple filter conditions.

my_filters <- exprs(
  cyl == 6,
  mpg >= 20
)

# filter supports tidy dots
dplyr::filter(mtcars, !!!my_filters) |> knitr::kable()



|               |  mpg| cyl| disp|  hp| drat|    wt|  qsec| vs| am| gear| carb|
|:--------------|----:|---:|----:|---:|----:|-----:|-----:|--:|--:|----:|----:|
|Mazda RX4      | 21.0|   6|  160| 110| 3.90| 2.620| 16.46|  0|  1|    4|    4|
|Mazda RX4 Wag  | 21.0|   6|  160| 110| 3.90| 2.875| 17.02|  0|  1|    4|    4|
|Hornet 4 Drive | 21.4|   6|  258| 110| 3.08| 3.215| 19.44|  1|  0|    3|    1|

Creating/Modifying Variables in a `mutate`

Very similarly to filtering expressions, mutations can be encoded in expressions with one added complication. Often when using mutate, we want to name the result of a computation (i.e. we give the new column a name). If we know the resulting name, this is fairly straightforward:

times_two <- function(x) {
  x * 2
}

mutations <- exprs(
  mpg = times_two(mpg),
  hp = times_two(mpg)
)

dplyr::transmute(head(mtcars), !!!mutations) |> knitr::kable()



|                  |  mpg|   hp|
|:-----------------|----:|----:|
|Mazda RX4         | 42.0| 84.0|
|Mazda RX4 Wag     | 42.0| 84.0|
|Datsun 710        | 45.6| 91.2|
|Hornet 4 Drive    | 42.8| 85.6|
|Hornet Sportabout | 37.4| 74.8|
|Valiant           | 36.2| 72.4|

We run in to issues, however, if we don't know the resulting name (i.e. we want the name of the column to also be a variable):

times_two_expression <- function(var) {
  exprs(!!var = times_two(!!var))
}

times_two_expression(cyl)

Error: unexpected '=' in:
"times_two_expression <- function(var) {
  exprs(!!var ="
Error: unexpected '}' in "}"
Error: object 'cyl' not found

As the errors say, we can't have a variable on the left hand side of an assignment. Additionally, since the name cyl doesn't exist outside of mtcars, an error is thrown when trying to use it. To address these, we can use the fact that the walrus operator := allows variables on the LHS of an assignment. However, it only allows strings on the LHS. So this will work:

times_two_expression <- function(var) {
  exprs(!!var := times_two(!!sym(var)))
}

times_two_expression("cyl")

dplyr::transmute(head(mtcars), !!!times_two_expression("cyl"))  |> knitr::kable()

$cyl
times_two(cyl)


|                  | cyl|
|:-----------------|---:|
|Mazda RX4         |  12|
|Mazda RX4 Wag     |  12|
|Datsun 710        |   8|
|Hornet 4 Drive    |  12|
|Hornet Sportabout |  16|
|Valiant           |  12|

This only works if we expect a string argument for var. Alternatively, we could expect symbols as arguments and adjust appropriately:

times_two_expression <- function(var) {
  exprs(!!as_name(var) := times_two(!!var))
}

times_two_expression(expr(cyl))

dplyr::transmute(head(mtcars), !!!times_two_expression(expr(cyl))) |> knitr::kable()

$cyl
times_two(cyl)


|                  | cyl|
|:-----------------|---:|
|Mazda RX4         |  12|
|Mazda RX4 Wag     |  12|
|Datsun 710        |   8|
|Hornet 4 Drive    |  12|
|Hornet Sportabout |  16|
|Valiant           |  12|

I prefer to use a exprs list that uses symbols inside expressions and set the list names (as strings) to handle the naming. This avoids having to worry about strings inside our expressions:

columns_to_double <- exprs(cyl, mpg, hp)

times_two_expression <- function(var) {
  ## using expr now (i.e. this doesn't return a list, but an
  ## expression now)
  expr(times_two(!!var))
}

(my_exprs <- set_names(
  ## list of expressions
  purrr::map(columns_to_double, times_two_expression),
  ## character vector of names
  purrr::map_chr(columns_to_double, as_name)
))

dplyr::transmute(head(mtcars), !!!my_exprs) |> knitr::kable()

$cyl
times_two(cyl)

$mpg
times_two(mpg)

$hp
times_two(hp)


|                  | cyl|  mpg|  hp|
|:-----------------|---:|----:|---:|
|Mazda RX4         |  12| 42.0| 220|
|Mazda RX4 Wag     |  12| 42.0| 220|
|Datsun 710        |   8| 45.6| 186|
|Hornet 4 Drive    |  12| 42.8| 220|
|Hornet Sportabout |  16| 37.4| 350|
|Valiant           |  12| 36.2| 210|

Chaining Expressions with a Pipe

Usually when working with lists of expressions, we need to combine them somehow. A pattern I've often used is to build up a list of expressions that form the steps in a pipeline. To turn the list into one expression I put all elements together with a pipe, which is easy enough with purrr::reduce.

pipe_together <- function(x, y) {
  expr(!!x %>% !!y)
}

purrr::reduce(
  exprs(
    mtcars,
    dplyr::filter(!!!my_filters),
    dplyr::transmute(!!!my_exprs)
  ),
  pipe_together
)

mtcars %>% dplyr::filter(cyl == 6, mpg >= 20) %>% dplyr::transmute(cyl = times_two(cyl), 
    mpg = times_two(mpg), hp = times_two(hp))

Dynamic modification of functions

I once had a project that was essentially a data pipeline with many steps that needed to run in a specific order due to dependencies between steps (i.e. it had to solve a DAG). The python Hamilton library encodes DAGs by having functions whose arguments were named the same as a function whose result should be used in place of that argument. With function and argument names defining the DAG, some introspection of functions allowed one to build the DAG out. That introspection is straightforward with R.

Here's some functions that use function and argument names to define the DAG as described above.

mtcars_data <- function(cyls) mtcars[mtcars$cyl %in% cyls, ]

horsepower <- function(mtcars_data) mtcars_data$hp

weight <- function(mtcars_data) mtcars_data$wt

power_to_weight <- function(horsepower, weight) horsepower / weight

To actually solve the connections between functions, we can recurse through functions and their arguments, putting calls to those functions in a new child environment and returning that child environment.

solve_dataflow <- function(fun_name, envir = rlang::caller_env()) {
  ## setup a child environment in which we can define how to call
  ## functions (with the parent env defining those functions)
  env_plan <- new_environment(parent = envir)

  ## we need to be able to tell if a given name corresponds to a
  ## function we know about
  is_function <- function(name) {
    name %in% lsf.str(envir)
  }

  ## get previously defined function names from our environment to be used as
  ## argument to dependent functions
  formals_from_plan <- function(formals_names) {
    if (!is.null(formals_names)) {
      env_get_list(env_plan, formals_names)
    } else {
      list()
    }
  }

  ## we need to recurse (starting from function `fun_name`) down all the
  ## function arguments, replacing argument names with calls to corresponding
  ## functions (if so defined)
  recur <- function(sym_name) {
    if (is_function(sym_name)) {
      formals_names <- fn_fmls_names(as_function(sym_name, envir))

      # arguments to this function that aren't in the planning environment
      unresolved <- setdiff(formals_names, env_names(env_plan))

      # make sure all arguments are resolved to a call or symbol
      purrr::walk(unresolved, recur)

      ## "install" in the env_plan environment a name whose value is how to call
      ## the function corresponding to that name
      env_bind(
        env_plan,
        !!sym_name := call2(sym_name, !!!formals_from_plan(formals_names))
      )
    } else {
      # its just a symbol
      env_bind(env_plan, !!sym_name := sym(sym_name))
    }
  }
  recur(fun_name)

  ## return the new environment
  env_plan
}

"Solving" the DAG makes the connections between functions and arguments. Note how function names have been replace by calls to those functions, and its arguments replaced with calls to calculate that function argument.

execution_plan <- solve_dataflow("power_to_weight")
lobstr::tree(execution_plan)

<environment: 0xaaab03b3ef40>
├─weight: <language> weight(mtcars_data = mtcars_data(cyls = cyls))
├─power_to_weight: <language> power_to_weight(horsepower = horsepower(mtcars_data = mtcars_data(cyls = cyls)), power_to_weight: <language>     weight = weight(mtcars_data = mtcars_data(cyls = cyls)))
├─cyls: <symbol> cyls
├─horsepower: <language> horsepower(mtcars_data = mtcars_data(cyls = cyls))
└─mtcars_data: <language> mtcars_data(cyls = cyls)

Viewing the AST of function calls demonstrates the order in which function calls will happen (with █ indicating a function call).

lobstr::ast(!!execution_plan$power_to_weight)

█─power_to_weight 
├─horsepower = █─horsepower 
│              └─mtcars_data = █─mtcars_data 
│                              └─cyls = cyls 
└─weight = █─weight 
           └─mtcars_data = █─mtcars_data 
                           └─cyls = cyls

Note that since none of these functions are defined in the execution_plan environment, their actual implementations will be looked up in the parent of execution_plan.

eval ing the tree of calls enables a final calculation, defining any symbols that aren't function calls (e.g. cyls here).

eval(execution_plan$power_to_weight, list(cyls = 4))

[1] 40.08621 19.43574 30.15873 30.00000 32.19814 35.42234 39.35091 34.10853 42.52336 74.68605 39.20863

Shut down docker container

docker stop r_meta

r_meta