Elegantly assigning multiple columns in data.table with lapply()
Is this what you are looking for?
dt[ , names(dt)[20:100] :=lapply(.SD, function(x) sqrt(x) ) , .SDcols=20:100]
I have heard tell that using .SD
is not so efficient because it makes a copy of the table beforehand, but if your table isn't huge (obviously that's relative depending on your system specs) I doubt it will make much of a difference.
Yes, you're right in question here :
I understand it is more efficient to loop over a vector of column names using
:=
to assign:
for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]
Aside: note that the new way of doing that is :
for (col in paste0("V", 20:100))
dt[ , (col) := sqrt(dt[[col]])]
because the with = FALSE
wasn't easy to read whether it referred to the LHS or the RHS of :=
. End aside.
As you know, that's efficient because that does each column one by one, so working memory is only needed for one column at a time. That can make a difference between it working and it failing with the dreaded out-of-memory error.
The problem with lapply
on the RHS of :=
is that the RHS (the lapply
) is evaluated first; i.e., the result for the 80 columns is created. That's 80 column's worth of new memory which has to be allocated and populated. So you need 80 column's worth of free RAM for that operation to succeed. That RAM usage dominates vs the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.
As @Frank pointed to, if you have a lot of columns (say 10,000 or more) then the small overhead of dispatching to the [.data.table
method starts to add up). To eliminate that overhead that there is data.table::set
which under ?set
is described as a "loopable" :=
. I use a for
loop for this type of operation. It's the fastest way and is fairly easy to write and read.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(dt[[col]]))
Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set
over a large number of rows than a large number of columns.) However, looped set
doesn't solve the problem of the repeated reference to the dt
symbol name that you mentioned in the question :
I don't like this because I don't like reference the data.table in a j expression.
Agreed. So the best I can do is revert to your looping of :=
but use get
instead.
for (col in paste0("V", 20:100))
dt[, (col) := sqrt(get(col))]
However, I fear that using get
in j
carry an overhead. Benchmarking made in #1380. Also, perhaps it is confusing to use get()
on the RHS but not on the LHS. To address that we could sugar the LHS and allow get()
as well, #1381 :
for (col in paste0("V", 20:100))
dt[, get(col) := sqrt(get(col))]
Also, maybe value
of set
could be run within scope of DT
, #1382.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(get(col))
These should work if you want to refer to the columns by string name:
n = paste0("V", 20:100)
dt[, (n) := lapply(n, function(x) {sqrt(get(x))})]
or
dt[, (n) := lapply(n, function(x) {sqrt(dt[[x]])})]