Break data.table chain into two lines of code for readability

For many years, the way that automatic indentation in RStudio mis-aligns data.table pipes has been a source of frustration to me. I only recently realized that there is a neat way to get around this, simply by enclosing the piped operations in parentheses.

Here's a simple example:

library(data.table)

# b is recycled explicitly, since 5 does not divide 26 evenly
x <- data.table(a = letters, b = rep_len(LETTERS[1:5], 26), c = rnorm(26))
y <- (
  x
  [, c := round(c, 2)]
  [sample(26)]
  [, d := paste(a,b)]
  [, .(d, foo = mean(c)), by = b]
  )

Why does this work? Because the unclosed parenthesis tells the R parser that the expression is not yet complete, so the whole chain is parsed as if it were one continuous line of code.
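To see the parser behaviour directly, here is a minimal base-R sketch. The vector x and the code strings are made up for illustration; the point is only that the same text fails to parse without the parentheses and parses as one expression with them.

```r
# Illustrative vector; any subsettable object would do.
x <- c(10, 20, 30)

# Without parentheses, the first line is already a complete expression,
# so the parser rejects the next line, which starts with `[`.
unwrapped <- tryCatch(
  parse(text = "x\n[2]"),
  error = function(e) "parse error"
)

# Inside parentheses, newlines are skipped until the closing `)`,
# so the same code parses as a single expression.
wrapped <- parse(text = "(\nx\n[2]\n)")

print(unwrapped)            # the fallback string from the error handler
print(length(wrapped))      # number of parsed expressions
print(eval(wrapped[[1]]))   # evaluates (x[2])
```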


Chaining data.tables with magrittr

I use a method based on magrittr, combining the . placeholder with [:

library(magrittr)
library(data.table)

bar <- foo %>%
        .[etcetera] %>%
        .[etcetera] %>%
        .[etcetera]

Working example:

out <- data.table(expand.grid(x = 1:10,y = 1:10))
out %>% 
  .[,z := x*y] %>% 
  .[,w := x*z] %>% 
  .[,v := w*z]
print(out)

Additional examples

Edit: this is not just syntactic sugar. Because the table from the previous step is available as ., you can do a self join, or you can use %T>% for logging between steps (using futile.logger or the like):

out %>%
 .[etcetera] %>%
 .[etcetera] %T>% 
 .[loggingstep] %>%
 .[etcetera] %>%
 .[., on = SOMEVARS, allow.cartesian = TRUE]
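A runnable sketch of the pattern above, under stated assumptions: the table dt, its columns, and the helper log_rows() are hypothetical stand-ins (log_rows() plays the role of a futile.logger call). It assumes data.table and magrittr are installed.

```r
library(data.table)
library(magrittr)

# Hypothetical logging helper; swap in futile.logger or similar.
log_rows <- function(d) message("rows so far: ", nrow(d))

# Made-up data for illustration.
dt <- data.table(id = c(1L, 1L, 2L), val = c(10, 20, 30))

res <- dt %>%
  .[, total := sum(val), by = id] %T>%     # adds a column by reference
  log_rows() %>%                           # side effect only; the table passes through
  .[., on = "id", allow.cartesian = TRUE]  # self join on id

print(res)
```

Note that := modifies dt itself, so dt also carries the total column afterwards. In the join result, the columns coming from the second copy of the table are prefixed with i.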

EDIT:

This is much later, and I still use this regularly, with the following caveat:

magrittr adds overhead

I really like doing this at the top level of a script. It has a very clear and readable flow, and there are a number of neat tricks you can do with it.

But I've had to remove it when optimizing functions that get called many times, because the overhead of each pipe call adds up.

You're better off chaining data.tables the old-fashioned way in that case.
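For comparison, a sketch of the earlier working example rewritten with plain bracket chaining and no magrittr. It reuses the same made-up table and assumes only that data.table is installed.

```r
library(data.table)

out <- data.table(expand.grid(x = 1:10, y = 1:10))

# Same three steps as the magrittr working example, chained with plain brackets.
out[, z := x * y][, w := x * z][, v := w * z]

print(out)
```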


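You have to put the line break inside the brackets, that is, before the closing ] of each step, so that R sees an incomplete expression and keeps reading on the next line. An example of how to divide your data.table code over several lines: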

bar <- foo[, .(long_name_here = sum(foo2)), by = var
           ][order(-long_name_here)]
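The same layout with concrete data, as a small sketch: the values in foo are made up, and the names foo, var and foo2 simply mirror the placeholders in the snippet above.

```r
library(data.table)

# Made-up data; foo, var and foo2 mirror the placeholder names above.
foo <- data.table(var = c("a", "a", "b"), foo2 = c(1, 2, 5))

# The first line ends before the closing `]`, so the expression is still open
# and the parser continues onto the second line.
bar <- foo[, .(long_name_here = sum(foo2)), by = var
           ][order(-long_name_here)]

print(bar)
```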

You can also put a line break before or after each comma. An example with the break before the comma (my preference):

bar <- foo[, .(long_name_here = sum(foo2))
           , by = var
           ][order(-long_name_here)
             , long_name_2 := long_name_here * 10]

See this answer for an extended example