Unseen factor levels when appending new records with unseen string values to a data frame cause a warning and result in NA
It could be caused by a mismatch of types between the two data.frames.
First of all, check the types (classes). For diagnostic purposes, do this:
new2old <- rbind( alltime, all2008 ) # this gives you a warning
old2new <- rbind( all2008, alltime ) # this should be without warning
cbind(
alltime = sapply( alltime, class),
all2008 = sapply( all2008, class),
new2old = sapply( new2old, class),
old2new = sapply( old2new, class)
)
I expect there will be a row that looks like:
            alltime   all2008   new2old  old2new
...         ...       ...       ...      ...
some_column "factor"  "numeric" "factor" "character"
...         ...       ...       ...      ...
If so, here is the explanation:
rbind doesn't check that the types match. If you analyse the rbind.data.frame code, you can see that the first argument initializes the output types. If the column in the first data.frame is a factor, then the output data.frame column is a factor with levels unique(c(levels(x1), levels(x2))). But when the column in the second data.frame isn't a factor, levels(x2) is NULL, so the levels aren't extended.
It means that your output data are wrong! There are NAs instead of the true values.
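A minimal sketch of this behaviour, using two made-up data frames (old_df, new_df and their id column are hypothetical stand-ins, not the poster's data). It only illustrates the asymmetry between the two rbind orders:

```r
# Hypothetical toy data: the same column is a factor in one frame
# and numeric in the other.
old_df <- data.frame(id = factor(c("a", "b")))
new_df <- data.frame(id = c(1, 2))

# Factor first: the result column stays a factor whose levels are only
# "a" and "b", so the numeric values match no level and are stored as NA.
# The warning is suppressed here only to keep the output short.
new2old <- suppressWarnings(rbind(old_df, new_df))
class(new2old$id)
is.na(new2old$id)

# Numeric first: the result column is not a factor, so no value is lost.
old2new <- rbind(new_df, old_df)
class(old2new$id)
anyNA(old2new$id)
```

Running the first rbind without suppressWarnings should show the "invalid factor level" warning discussed in this thread.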
I suppose that:
- you created your old data with another R/RODBC version, so the types were created by different methods (different settings, maybe the decimal separator)
- there are NULLs or some unusual data in the problematic column, e.g. someone changed the column in the database.
Solution:
Find the wrong column, find out why it's wrong, and fix it. Eliminate the cause, not the symptoms.
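One possible way to locate the wrong column is to compare the classes of the shared columns programmatically. The helper name find_class_mismatch and the toy data frames below are assumptions made for illustration; substitute your own alltime and all2008:

```r
# Hypothetical helper: returns the names of shared columns whose
# (first) class differs between the two data frames.
find_class_mismatch <- function(a, b) {
  common <- intersect(names(a), names(b))
  cls_a <- vapply(a[common], function(x) class(x)[1], character(1))
  cls_b <- vapply(b[common], function(x) class(x)[1], character(1))
  common[cls_a != cls_b]
}

# Toy data standing in for alltime and all2008.
alltime_demo <- data.frame(doctor = factor(c("x", "y")), visits = c(1, 2))
all2008_demo <- data.frame(doctor = c(10, 20), visits = c(3, 4))

mismatched <- find_class_mismatch(alltime_demo, all2008_demo)
mismatched
```

Only the first class of each column is compared, which is enough to spot a factor versus numeric or character mismatch.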
An "easy" way is to simply not have your strings set as factors when importing text data.
Note that the read.{table,csv,...} functions take a stringsAsFactors parameter, which defaulted to TRUE before R 4.0.0 (from R 4.0.0 onward the default is FALSE). You can set this to FALSE while you're importing and rbind-ing your data.
If you'd like to set the column to be a factor at the end, you can do that too.
For example:
alltime <- read.table("alltime.txt", stringsAsFactors=FALSE)
all2008 <- read.table("all2008.txt", stringsAsFactors=FALSE)
alltime <- rbind(alltime, all2008)
# If you want the doctor column to be a factor, make it so:
alltime$doctor <- as.factor(alltime$doctor)
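If the data are already loaded (for example from a database rather than a text file), a similar effect can be had by converting factor columns to character before rbind. This is only a sketch; factors_to_character and the toy frames part1 and part2 are hypothetical names:

```r
# Hypothetical helper: turn every factor column into a character column,
# leaving all other columns untouched.
factors_to_character <- function(df) {
  df[] <- lapply(df, function(x) if (is.factor(x)) as.character(x) else x)
  df
}

# Toy frames whose doctor column is a factor with different levels.
part1 <- data.frame(doctor = factor(c("smith", "jones")), n = c(1, 2))
part2 <- data.frame(doctor = factor("lee"), n = 3)

combined <- rbind(factors_to_character(part1), factors_to_character(part2))

# Optional: make the column a factor again once everything is combined.
combined$doctor <- as.factor(combined$doctor)
levels(combined$doctor)
```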
1) Create the data frame with stringsAsFactors set to FALSE. This should resolve the factor issue.
2) Afterwards, don't use rbind: it messes up the column names if the data frame is empty. Simply do it this way:
df[nrow(df)+1,] <- c("d","gsgsgd",4)
> df <- data.frame(a = character(0), b=character(0), c=numeric(0))
> df[nrow(df)+1,] <- c("d","gsgsgd",4)
Warning messages:
1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
invalid factor level, NAs generated
> df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
> df[nrow(df)+1,] <- c("d","gsgsgd",4)
> df
  a      b c
1 d gsgsgd 4
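One caveat about the session above: c("d","gsgsgd",4) is a character vector, so the 4 is passed as "4" and column c probably ends up as character rather than numeric. A small variation, offered as a sketch rather than the original poster's method, is to pass a list so that each value keeps its own type:

```r
# Same empty data frame as above, with strings kept as character.
df <- data.frame(a = character(0), b = character(0), c = numeric(0),
                 stringsAsFactors = FALSE)

# A list keeps each element's type, so the third column stays numeric.
df[nrow(df) + 1, ] <- list("d", "gsgsgd", 4)

# Inspect the resulting column classes.
sapply(df, class)
```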