Select nth observation and sum by group using data.table

There are several things you can improve:

  1. fread will return a data.table, so no need to wrap it in data.table. You can check with class(DT).
  2. Use the na.strings parameter when reading in the data. See below for an example.
  3. Summarise with:

    DT[, .(a = a[.N], 
           b = b[.N], 
           c = c[1], 
           d = sum(d), 
           e = sum(e), 
           f = unique(na.omit(f)))
       , by = ID]
    

you will then get:

   ID  a   b    c      d       e          f
1:  1 11 111 1000  30030  300300  5/07/1977
2:  2 22 222 2000  20000  200000  6/02/1980
3:  3 33 333 3000  60010  600100 20/12/1978
4:  4 44 444 4000 240150 2401500  7/06/1944
5:  5 55 555 5000 100010 1000100 31/05/1976

Some explanations & other notes:

  • Subsetting with [1] will give you the first value of a group. You could also use the first-function which is optimized in data.table, and thus faster.
  • Subsetting with [.N] will give you the last value of a group. You could also use the last-function which is optimized in data.table, and thus faster.
  • Don't use variable names that are also functions in R (in this case, don't use c as a variable name). See also ?c for an explanation of what the c-function does.
  • For summarising the f-variable, I used unique in combination with na.omit. If there is more than one unique date by ID, you could also use for example na.omit(f)[1].

If speed is an issue, you could optimize the above to (thx to @Frank):

DT[order(f)
   , .(a = last(a), 
       b = last(b), 
       c = first(c), 
       d = sum(d), 
       e = sum(e), 
       f = first(f))
   , by = ID]

Ordering by f will put NA-values last. As a result now the internal GForce-optimization is used for all calculations.


Used data:

DT <- fread("ID   a    b    c        d        e          f
             1   10  100 1000    10000   100000  ?
             1   10  100 1001    10010   100100  5/07/1977
             1   11  111 1002    10020   100200  5/07/1977
             2   22  222 2000    20000   200000  6/02/1980
             3   33  333 3000    30000   300000  20/12/1978
             3   33  333 3001    30010   300100  ?
             4   40  400 4000    40000   400000  ?
             4   40  400 4001    40010   400100  ?
             4   40  400 4002    40020   400200  7/06/1944
             4   44  444 4003    40030   400300  ?
             4   44  444 4004    40040   400400  ?
             4   44  444 4005    40050   400500  ?
             5   55  555 5000    50000   500000  31/05/1976
             5   55  555 5001    50010   500100  31/05/1976", na.strings='?')

We can use tidyverse. After grouping by 'ID', we summarise the columns based on the first or last observation

library(dplyr) 
DT %>% 
   group_by(ID) %>% 
   summarise(a = last(a),
             b = last(b), 
             c = first(c), 
             d = sum(d), 
             e = sum(e), 
             f = f[f!="?"][1])
# A tibble: 5 × 7
#     ID     a     b     c      d       e          f
#  <int> <int> <int> <int>  <int>   <int>      <chr>
#1     1    11   111  1000  30030  300300  5/07/1977
#2     2    22   222  2000  20000  200000  6/02/1980
#3     3    33   333  3000  60010  600100 20/12/1978
#4     4    44   444  4000 240150 2401500  7/06/1944
#5     5    55   555  5000 100010 1000100 31/05/1976

Tags:

R

Data.Table