Is grouping parallelised in data.table 1.12.0?

Yes, grouping is parallelised in v1.12.0.

Your benchmark is a bit of a red herring: to isolate the speed of grouping itself, you want a trivially fast f(x, y). Using the cardinalities of your examples but with a trivial function, we get:

library(data.table)
packageVersion("data.table")
#> [1] '1.12.0'

n = 5e6
N <- n
k = 1e4

print(getDTthreads())
#> [1] 12

DT = data.table(x = rep_len(runif(n), N),
                y = rep_len(runif(n), N),
                grp = rep_len(sample(1:k, n, TRUE), N))
bench::system_time(DT[, .(a = 1L), by = "grp"])
#>   process      real 
#> 250.000ms  72.029ms

setDTthreads(1)

bench::system_time(DT[, .(a = 1L), by = "grp"])
#>   process      real 
#> 125.000ms 126.385ms

Created on 2019-02-01 by the reprex package (v0.2.1)

That is, the parallel case was slightly faster, but only by about 50 ms, which is negligible compared to the 3 s your function takes.
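
As a hedged aside (not part of the original benchmark): with any non-trivial R function evaluated once per group, that function's runtime dominates and the threading difference disappears. Here f is a hypothetical stand-in for the question's f(x, y):

f = function(x, y) cor(x, y)  # hypothetical stand-in for the slow f(x, y) in the question
bench::system_time(DT[, .(a = f(x, y)), by = "grp"])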

If we bump up the size of the DT, we see a more dramatic difference:

library(data.table)
packageVersion("data.table")
#> [1] '1.12.0'

n = 5e6
N <- 1e9
k = 1e4

print(getDTthreads())
#> [1] 12

DT = data.table(x = rep_len(runif(n), N),
                y = rep_len(runif(n), N),
                grp = rep_len(sample(1:k, n, TRUE), N))
bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process    real 
#> 45.719s 14.485s

setDTthreads(1)

bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process    real 
#> 24.859s 24.890s
sessioninfo::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.2 (2018-12-20)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_Australia.1252      
#>  ctype    English_Australia.1252      
#>  tz       Australia/Sydney            
#>  date     2019-02-01                  
#> 

Created on 2019-02-01 by the reprex package (v0.2.1)


Here is an answer not validated by any authority (e.g. a member of the data.table team), based on my exploration of the issues on the data.table GitHub repository.

From issue #3042 I understand that sum and mean are optimized. We can benchmark to verify that this is the case:

library(data.table)
n = 1e7 ; k = 1e5
DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE))

setDTthreads(1)                        # single-threaded
system.time(DT[, mean(x), by = grp])   #> 0.8 s
setDTthreads(0)                        # use all available threads
system.time(DT[, mean(x), by = grp])   #> 0.4 s
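
As a side check not taken from the linked issues: data.table reports its query optimisation when verbose output is requested, so we can confirm that mean(x) is rewritten by GForce rather than called once per group:

DT[, mean(x), by = grp, verbose = TRUE]  # verbose output shows whether GForce rewrote j (e.g. to gmean(x))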

However, Matt Dowle wrote in the same issue #3042:

There is much left to do on extending to other gforce functions and grouping arbitrary functions

And in #3130, sritchie73 wrote:

Worth noting here that R functions are inherently not thread safe, e.g. so they can't be passed to multithreaded C++ code via Rcpp.

So it seems that parallelizing user-defined functions is not a simple task, and it is not currently implemented in data.table.
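
As a rough, unvalidated illustration of that limitation: wrapping the very same computation in a user-defined function prevents the GForce rewrite, so data.table falls back to calling the R function once per group (using the DT from the benchmark above):

system.time(DT[, mean(x), by = grp])                    # recognised by GForce, optimised
system.time(DT[, (function(v) mean(v))(x), by = grp])   # ordinary R call per group, not optimised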