Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?
What might be a negligible effect in a real-world full application becomes non-negligible when you write one-liners whose run time is dominated by that formerly "negligible" cost. I suspect that if you profile your tests, most of the time will be spent in the summarize clause, so let's microbenchmark something similar to that:
> set.seed(99); z = sample(10000, 4, TRUE)
> microbenchmark(z %>% unique %>% list, list(unique(z)))
Unit: microseconds
                  expr     min      lq      mean   median      uq     max neval
 z %>% unique %>% list 142.617 144.433 148.06515 145.0265 145.969 297.735   100
       list(unique(z))   9.289   9.988  10.85705  10.5820  11.804  12.642   100
This is doing something a bit different from your code, but it illustrates the point: pipes are slower. The pipe has to restructure R's call into the same form that an ordinary function evaluation would use, and then evaluate it, so it has to be slower. How much slower depends on how speedy the functions themselves are. Calls to unique and list are pretty fast in R, so the whole difference here is the pipe overhead.
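To see that the cost is per-call dispatch rather than anything the functions do, here is a minimal sketch (the function name f is just an illustrative placeholder) comparing a direct call against the same call routed through magrittr's pipe:

```r
library(microbenchmark)
library(magrittr)

# A do-nothing function: any difference in timings is pure call overhead,
# i.e. the work %>% does to rebuild and then evaluate the call.
f <- function(x) x

# Both forms compute the same thing; only the dispatch path differs.
stopifnot(identical(f(1), 1 %>% f))

print(microbenchmark(direct = f(1), piped = 1 %>% f))
```

On my understanding of the timings above, the piped form's median should come out one to two orders of magnitude above the direct call, though the exact figures will vary by machine and magrittr version.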
Profiling expressions like this showed me that most of the time is spent in the pipe machinery:
                 total.time total.pct self.time self.pct
"microbenchmark"      16.84     98.71      1.22     7.15
"%>%"                 15.50     90.86      1.22     7.15
"eval"                 5.72     33.53      1.18     6.92
"split_chain"          5.60     32.83      1.92    11.25
"lapply"               5.00     29.31      0.62     3.63
"FUN"                  4.30     25.21      0.24     1.41
..... stuff .....
then somewhere down in about 15th place the real work gets done:
"as.list"              1.40      8.13      0.66     3.83
"unique"               1.38      8.01      0.88     5.11
"rev"                  1.26      7.32      0.90     5.23
Whereas if you just call the functions as Chambers intended, R gets straight down to it:
                 total.time total.pct self.time self.pct
"microbenchmark"       2.30     96.64      1.04    43.70
"unique"               1.12     47.06      0.38    15.97
"unique.default"       0.74     31.09      0.64    26.89
"is.factor"            0.10      4.20      0.10     4.20
Hence the oft-quoted recommendation: pipes are okay on the command line, where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out by one call to glm with a few hundred data points, but that's another story....
So, I finally got around to running the expressions in OP's question:
set.seed(0)
dummy_data <- dplyr::data_frame(
  id    = floor(runif(100000, 1, 100000)),
  label = floor(runif(100000, 1, 4))
)
microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))
This took so long that I thought I'd run into a bug, and force-interrupted R.
Trying again, with the number of repetitions cut down, I got the following times:
microbenchmark(
  b = dummy_data %>% group_by(id) %>% summarise(list(unique(label))),
  d = dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list),
  times = 2)
#Unit: seconds
# expr      min       lq     mean   median       uq      max neval
#    b 2.091957 2.091957 2.162222 2.162222 2.232486 2.232486     2
#    d 7.380610 7.380610 7.459041 7.459041 7.537471 7.537471     2
The times are in seconds! So much for milliseconds or microseconds. No wonder it seemed like R had hung at first, with the default value of times=100.
But why is it taking so long? First, the way the dataset is constructed, the id column contains about 63000 distinct values:
length(unique(dummy_data$id))
#[1] 63052
Second, the expression that is being summarised over in turn contains several pipes, and each set of grouped data is going to be relatively small.
This is essentially the worst-case scenario for a piped expression: it's being called very many times, and each time, it's operating over a very small set of inputs. This results in plenty of overhead, and not much computation for that overhead to be amortised over.
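A back-of-the-envelope check supports this. With 100000 rows spread over ~63000 groups, each group holds about 1.6 rows, and the summarise expression runs once per group. Multiplying the per-call means measured in the first microbenchmark above by the number of groups predicts an overhead on the scale of seconds:

```r
# Per-call means from the first microbenchmark above (microseconds):
per_call_pipe_us   <- 148.1  # z %>% unique %>% list
per_call_direct_us <- 10.9   # list(unique(z))
n_groups           <- 63052  # distinct ids in dummy_data

# Predicted extra wall time from pipe overhead alone, in seconds:
extra_s <- (per_call_pipe_us - per_call_direct_us) * n_groups / 1e6
extra_s
```

This predicts roughly 8.7 seconds of pure pipe overhead; the measured gap between b and d (~5.3 seconds) is the same order of magnitude, which is as much as an estimate this crude can promise.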
By contrast, if we just switch the variables that are being grouped and summarized over:
microbenchmark(
  b = dummy_data %>% group_by(label) %>% summarise(list(unique(id))),
  d = dummy_data %>% group_by(label) %>% summarise(id %>% unique %>% list),
  times = 2)
#Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
#    b 12.00079 12.00079 12.04227 12.04227 12.08375 12.08375     2
#    d 10.16612 10.16612 12.68642 12.68642 15.20672 15.20672     2
Now everything looks much more equal.
But here is something I have learnt today (I am using R 3.5.0). Code with x = 100 (1e2):
library(microbenchmark)
library(dplyr)
set.seed(99)
x <- 1e2
z <- sample(x, x / 2, TRUE)
timings <- microbenchmark(
  dp = z %>% unique %>% list,
  bs = list(unique(z)))
print(timings)
Unit: microseconds
 expr    min      lq      mean   median       uq     max neval
   dp 99.055 101.025 112.84144 102.7890 109.2165 312.359   100
   bs  6.590   7.653   9.94989   8.1625   8.9850  63.790   100
Although, if x = 1e6:
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
   dp 27.77045 31.78353 35.09774 33.89216 38.26898  52.8760   100
   bs 27.85490 31.70471 36.55641 34.75976 39.12192 138.7977   100
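This is the flip side of the worst case above: the pipe's per-call cost is roughly constant, so as the real work inside the call grows, the overhead shrinks to a rounding error. A sketch that makes the trend visible across sizes (same set-up as the benchmark above; the dp/bs ratio should approach 1 as x grows):

```r
library(microbenchmark)
library(magrittr)
set.seed(99)

# As x grows, unique() dominates and the constant pipe overhead
# becomes negligible relative to the total run time.
for (x in c(1e2, 1e4, 1e6)) {
  z <- sample(x, x / 2, TRUE)
  s <- summary(microbenchmark(
    dp = z %>% unique %>% list,
    bs = list(unique(z)),
    times = 20L))
  # summary() reports both rows in a common unit, so the ratio is unit-free.
  cat("x =", format(x, scientific = TRUE),
      " median dp/bs =", round(s$median[1] / s$median[2], 2), "\n")
}
```

The exact ratios will vary by machine, but the direction of the trend should not.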