Replace NA with 0, only in numeric columns in data.table

We can use set

for(j in seq_along(DT)){
    set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
 }

Or create a index for numeric columns, loop through it and set the NA values to 0

ind <-   which(sapply(DT, is.numeric))
for(j in ind){
    set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}

data

set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))

I wanted to explore and possibly improve on the excellent answer given above by @akrun. Here's the data he used in his example:

library(data.table)

set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT

#>    v1   v2         v3
#> 1: NA <NA> -0.5458808
#> 2:  1    A  0.5365853
#> 3:  2    B  0.4196231
#> 4:  3    C -0.5836272
#> 5:  4    D         NA

And the two methods he suggested to use:

fun1 <- function(x){
  for(j in seq_along(x)){
  set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
  }
}

fun2 <- function(x){
  ind <-   which(sapply(x, is.numeric))
  for(j in ind){
    set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

I think the first method above is really genius as it exploits the fact that NAs are typed.

First of all, even though .SD is not available in i argument, it is possible to pull the column name with get(), so I thought I could sub-assign data.table this way:

fun3 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  for(j in nms){
    x[is.na(get(j)), (j):=0]
  }
}

Generic case, of course would be to rely on .SD and .SDcols to work only on numeric columns

fun4 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  x[, (nms):=lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols=nms]  
}

But then I thought to myself "Hey, who says we can't go all the way to base R for this sort of operation. Here's simple lapply() with conditional statement, wrapped into setDT()

fun5 <- function(x){
setDT(
  lapply(x, function(i){
    if(is.numeric(i))
         i[is.na(i)]<-0
    i
  })
)
}

Finally,we could use the same idea of conditional to limit the columns on which we apply the set()

fun6 <- function(x){
  for(j in seq_along(x)){
    if (is.numeric(x[[j]]) )
      set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

Here are the benchmarks:

microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DT)),
  for.set.ind = fun2(copy(DT)),
  for.get = fun3(copy(DT)),
  for.SDcol = fun4(copy(DT)),
  for.list = fun5(copy(DT)),
  for.set.if =fun6(copy(DT))
)

#> Unit: microseconds
#>           expr     min      lq     mean   median       uq      max neval cld
#>  for.set.2cond  59.812  67.599 131.6392  75.5620 114.6690 4561.597   100 a  
#>    for.set.ind  71.492  79.985 142.2814  87.0640 130.0650 4410.476   100 a  
#>        for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202   100   c
#>      for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491   100  b 
#>       for.list  69.722  81.932 137.2275  87.7720 123.6935 3906.149   100 a  
#>     for.set.if  52.380  58.397 116.1909  65.1215  72.5535 4570.445   100 a

Replace NA with 0, only in numeric columns in data.table

data

Tags:

R

Numeric

Na

Data.Table

Related

Recent Posts