rbindlist two data.tables where one has factor and other has character type for a column
UPDATE - This bug (#2650) was fixed on 17 May 2013 in v1.8.9
I believe that rbindlist, when applied to factors, is combining the numerical values of the factors and using only the levels associated with the first list element.
As in this bug report: http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975
# Temporary workaround:
# unique(): factor levels must not contain duplicates
levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))
DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]
rbindlist(list(DT.1, DT.2))
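For a self-contained run of the workaround above, here is a minimal sketch. It assumes the data.table package is installed. The example tables are illustrative (they mirror the ones used further down), and unique() is an addition that guards against repeated values, since factor levels cannot contain duplicates.

```r
library(data.table)

# Illustrative data: x is a factor in one table and character in the other
DT.1 <- data.table(x = factor(letters[1:5]), y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)

# Shared level set covering the values of both tables
levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))

# Re-encode both columns against the same levels, by reference
DT.1[, x := factor(x, levels = levs)]
DT.2[, x := factor(x, levels = levs)]

res <- rbindlist(list(DT.1, DT.2))
res$x
```

Because both columns now share one level set, the integer codes mean the same thing in each table, so it should not matter which table's levels are kept.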
As another view of what's going on:
DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)
DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]
DT3
DT4
# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd
do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd
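The difference can be reproduced in base R alone. The following is only a sketch of the suspected mechanism (integer codes concatenated, levels taken from the first element), not the actual rbindlist source. The names f3, f4, codes, buggy and correct are made up for illustration.

```r
# The same two factors as DT3$x and DT4$x above
f3 <- factor(c("1st", "2nd"), levels = c("1st", "2nd"))
f4 <- factor(c("1st", "2nd"), levels = c("1st", "2nd"), labels = c("2nd", "1st"))

# Suspected buggy behaviour: concatenate the integer codes and
# reuse the levels of the first factor only
codes <- c(as.integer(f3), as.integer(f4))
buggy <- factor(levels(f3)[codes], levels = levels(f3))

# Label-aware combination: go through the character values instead
correct <- factor(c(as.character(f3), as.character(f4)),
                  levels = union(levels(f3), levels(f4)))

buggy    # 1st 2nd 1st 2nd
correct  # 1st 2nd 2nd 1st
```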
Edit as per comments:
As for observation 1, what's happening is similar to:
x <- factor(LETTERS[1:5])
x[6:10] <- letters[1:5] # warning: lowercase letters are not levels of x, so NAs are generated
x
# Notice however, if you are assigning a value that is already present
x[11] <- "S" # warning, since `S` is not one of the levels of x
x[12] <- "D" # all good, since `D` *is* one of the levels of x
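If the new value really is wanted, one option is to widen the level set before assigning. This is a minimal base-R sketch. The names x_bad and x_ok are illustrative, and suppressWarnings() is used only to keep the output quiet.

```r
x <- factor(LETTERS[1:5])

# Assigning a value that is not a level yields NA (with a warning)
x_bad <- x
suppressWarnings(x_bad[6] <- "a")
is.na(x_bad[6])

# Extending the levels first makes the same assignment valid
x_ok <- factor(x, levels = c(levels(x), "a"))
x_ok[6] <- "a"
x_ok
```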
rbindlist is superfast because it skips the checking done by rbindfill or do.call(rbind.data.frame, ...).
You can use a workaround like this to ensure that factors are coerced to characters.
DT.1 <- data.table(x = factor(letters[1:5]), y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)
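DDL <- list(DT.1, DT.2)
# Convert every factor column to character, by reference, in each data.table
for (ii in seq_along(DDL)) {
  ff <- Filter(function(x) is.factor(DDL[[ii]][[x]]), names(DDL[[ii]]))
  for (fn in ff) {
    set(DDL[[ii]], j = fn, value = as.character(DDL[[ii]][[fn]]))
  }
}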
rbindlist(DDL)
Or (less memory-efficiently):
rbindlist(rapply(DDL, classes = 'factor', f = as.character, how = 'replace'))