split data.table

I was looking for a way to do a split in data.table and came across this old question.

Sometimes a split is what you want, and the data.table "by" approach is not convenient.

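You can do the split by hand using only data.table operations, and it is efficient. Note that the function below splits on runs of consecutive equal values, so sort the table by that column first (for example with setkeyv(dt, attr)) if you want exactly one piece per distinct value: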

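SplitDataTable <- function(dt, attr) {
  # An empty table has no pieces (1:0 would otherwise index backwards)
  if (nrow(dt) == 0L) return(list())
  x <- dt[[attr]]
  # Positions where the value changes, i.e. the last row of each run
  boundaries <- c(0L, which(head(x, -1) != tail(x, -1)), nrow(dt))
  # One sub-table per run of consecutive equal values
  mapply(
    function(start, end) dt[start:end],
    head(boundaries, -1) + 1L,
    tail(boundaries, -1),
    SIMPLIFY = FALSE)
}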
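To see how the boundary computation works, here is a minimal base R sketch on a plain vector. The vector x and the names starts, ends and pieces are made up for illustration; the head/tail comparison is the same one used in the function above.

```r
# Illustrative input: "a" occurs in two separate runs
x <- c("a", "a", "b", "b", "b", "a")

# Compare each element with its successor; TRUE marks the last element of a run
boundaries <- c(0L, which(head(x, -1) != tail(x, -1)), length(x))

starts <- head(boundaries, -1) + 1L   # 1 3 6
ends   <- tail(boundaries, -1)        # 2 5 6

# One piece per run, so the unsorted "a" values end up in two pieces
pieces <- Map(function(s, e) x[s:e], starts, ends)
```

Because the last "a" is not adjacent to the first two, it gets its own piece. Sorting x first would give one piece per distinct value.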

This works in v1.8.7 (and may work in v1.8.6 too):

> sdt = lapply(split(1:nrow(dt), dt$b==2), function(x)dt[x])
> sdt
$`FALSE`
   a b
1: 1 1
2: 2 1

$`TRUE`
   a b
1: 3 2
2: 3 2

> sdt[[1]][,c:=.N,by=a]     # now no warning
> sdt
$`FALSE`
   a b c
1: 1 1 1
2: 2 1 1

$`TRUE`
   a b
1: 3 2
2: 3 2

But, as @mnel said, that's inefficient. Please avoid splitting if possible.
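As a rough sketch of what avoiding the split could look like, the same column can be added on the full table by putting the subset condition in i and grouping with by. The example table mirrors the one printed above. The second part assumes data.table 1.9.8 or later, which (as far as I know) ships its own split method for data.table objects; on older versions that call would not apply.

```r
library(data.table)

# Same toy data as in the printed output above
dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))

# No split: add the count only for rows where b != 2, grouped by a.
# Rows outside the subset get NA in the new column.
dt[b != 2, c := .N, by = a]

# If a list of tables really is needed (assumes data.table >= 1.9.8):
pieces <- split(dt, by = "b")
```

The first form modifies dt by reference and never materialises the sub-tables, which is the main reason it tends to be cheaper than splitting.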

Tags: R, data.table