R data.table weird value/reference semantics

A supplement to GKi's answer:

The placement of setalloccol is indeed the direct culprit: it performs a shallow copy (i.e., it creates a new vector of pointers to the existing data columns) and in addition over-allocates 1024 extra slots (by default) for columns added later. If the class is set to data.table after this shallow copy (either by class(z)<- or by setattr), the change is applied to this new vector and not to the original argument.
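To see the shallow copy for yourself, here is a minimal sketch run at top level (not inside a function). It assumes a reasonably recent data.table with default options; the variable names are only for illustration. The idea is to compare the address of a column with the address of the table object before and after setDT, and to look at the over-allocation with truelength.

```r
library(data.table)

df <- data.frame(a = c(1.5, 2.5))

col_before <- address(df$a)  # address of the column vector
tbl_before <- address(df)    # address of the vector of column pointers

setDT(df)                    # at top level, df is rebound to the shallow copy

col_after <- address(df$a)
tbl_after <- address(df)

print(identical(col_before, col_after))  # is the column data shared (not copied)?
print(identical(tbl_before, tbl_after))  # was the pointer vector reallocated?

print(length(df))      # column slots in use
print(truelength(df))  # column slots allocated (length + 1024 under the default option)
```

If the first comparison is TRUE and the second is FALSE, that matches the description above: new pointer vector, same columns.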

However.

Even after using a fixed version of setDT (with setattr called after setalloccol; called setDT.fixed below), there seems to be no way to get consistent behaviour. Some operations apply to the caller's copy, and some don't.

df <- data.frame(a=1:2, b=3:4)

foo1 <- function(z) { 
  setDT.fixed(z)
  z[, b:=5]   # will apply to the caller copy
  data.table::setDF(z)
}

foo1(df)
#    a b
# 1: 1 5
# 2: 2 5
class(df)
# [1] "data.frame"
df
#   a b
# 1 1 5
# 2 2 5

foo2 <- function(z) { 
  setDT.fixed(z)
  z[, c:=5]   # will NOT apply to the caller copy
  data.table::setDF(z)
}
foo2(df)
#    a b c
# 1: 1 3 5
# 2: 2 4 5
# Warning message:
# In `[.data.table`(z, , `:=`(c, 5)) :
#  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
class(df)
# [1] "data.table" "data.frame"
df
#    a b
# 1: 1 3
# 2: 2 4

(Using the i argument, e.g., z[!is.na(a), b:=6], gives an extra dimension of weirdness which I won't go into here.)

Bottom line: the data.table package took on the brave task of punching a hole in R's all-value semantics. It was pretty successful until setDT came along (which, by the way, was added in response to a Stack Overflow question). Using setDT within a function on an argument will probably never have consistent semantics and is almost guaranteed to give you nasty surprises.
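One way to sidestep the problem, sketched here as a suggestion rather than an official recommendation: take an explicit deep copy inside the function, convert only that local copy, and return the result. The function name add_cols is hypothetical; copy, setDT and setDF are the exported data.table functions.

```r
library(data.table)

# Hypothetical helper: works on a private copy, so the caller's object is never touched
add_cols <- function(z) {
  dt <- copy(z)    # deep copy of the argument
  setDT(dt)        # by-reference conversion of the local copy only
  dt[, b := 5L]    # update an existing column
  dt[, c := 5]     # add a new column
  setDF(dt)        # convert the local copy back to a plain data.frame
  dt               # return the result explicitly
}

df  <- data.frame(a = 1:2, b = 3:4)
res <- add_cols(df)

print(class(df))   # the caller's object keeps its class
print(df)          # and its original contents
print(res)         # the modified copy
```

The price is one full copy of the data, which is exactly what setDT was meant to avoid, so this only makes sense when predictable semantics matter more than memory.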

In your function, z refers to the same object as x until setDT is called.

library(data.table)
foo <- function(z) {print(address(z)); setDT(z); print(address(z))} 
x <- data.frame(a = 1:2)
address(x)
#[1] "0x555ec9a471e8"
foo(x)
#[1] "0x555ec9a471e8"
#[1] "0x555ec9ede300"

Inside setDT, execution reaches the following line while z still points to the same address as x:

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

setattr does not make a copy, so x and z still point to the same address and both now have the class data.table:

x <- data.frame(a = 1:2)
z <- x
class(x)
#[1] "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

class(x)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"

Then setalloccol is called, which in this case amounts to:

assign("z", .Call(data.table:::Calloccolwrapper, z, 1024, FALSE))

after which x and z point to different addresses:

address(x)
#[1] "0x555ecaa09c00"
address(z)
#[1] "0x555ec95de600"

And both have the class data.table:

class(x)
#[1] "data.table" "data.frame"
class(z)
#[1] "data.table" "data.frame"

I think that if they had used

class(z) <- data.table:::.resetclass(z, "data.frame")

instead of

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

the problem would not occur.

x <- data.frame(a = 1:2)
z <- x
address(x)
#[1] "0x555ec9cd2228"
class(z) <- data.table:::.resetclass(z, "data.frame")
class(x)
#[1] "data.frame"
class(z)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec9cd2228"
address(z)
#[1] "0x555ec9cd65a8"

But after class(z) <- value, z no longer points to the address it pointed to before:

z <- data.frame(a = 1:2)
address(z)
#[1] "0x5653dbe72b68"
address(z$a)
#[1] "0x5653db82e140"
class(z) <- c("data.table", "data.frame")
address(z)
#[1] "0x5653dbe82d98"
address(z$a)
#[1] "0x5653db82e140"

Then again, after setDT, z also no longer points to the address it pointed to before:

z <- data.frame(a = 1:2)
address(z)
#[1] "0x55b6f04d0db8"
setDT(z)
address(z)
#[1] "0x55b6efe1e0e0"

As @Matt-dowle pointed out, it is also possible to change the data in x through z:

x <- data.frame(a = c(1,3))
z <- x
setDT(z)
z[, b:=3:4]
z[2, a:=7]
z
#   a b
#1: 1 3
#2: 7 4
x
#   a
#1: 1
#2: 7

R.version.string
#[1] "R version 4.0.2 (2020-06-22)"
packageVersion("data.table")
#[1] ‘1.12.8’
