Select nth observation and sum by group using data.table
There are several things you can improve:
fread
will return a data.table, so no need to wrap it indata.table
. You can check withclass(DT)
.- Use the
na.strings
parameter when reading in the data. See below for an example. Summarise with:
DT[, .(a = a[.N], b = b[.N], c = c[1], d = sum(d), e = sum(e), f = unique(na.omit(f))) , by = ID]
you will then get:
ID a b c d e f 1: 1 11 111 1000 30030 300300 5/07/1977 2: 2 22 222 2000 20000 200000 6/02/1980 3: 3 33 333 3000 60010 600100 20/12/1978 4: 4 44 444 4000 240150 2401500 7/06/1944 5: 5 55 555 5000 100010 1000100 31/05/1976
Some explanations & other notes:
- Subsetting with
[1]
will give you the first value of a group. You could also use thefirst
-function which is optimized in data.table, and thus faster. - Subsetting with
[.N]
will give you the last value of a group. You could also use thelast
-function which is optimized in data.table, and thus faster. - Don't use variable names that are also functions in R (in this case, don't use
c
as a variable name). See also?c
for an explanation of what thec
-function does. - For summarising the
f
-variable, I usedunique
in combination withna.omit
. If there is more than one unique date byID
, you could also use for examplena.omit(f)[1]
.
If speed is an issue, you could optimize the above to (thx to @Frank):
DT[order(f)
, .(a = last(a),
b = last(b),
c = first(c),
d = sum(d),
e = sum(e),
f = first(f))
, by = ID]
Ordering by f
will put NA
-values last. As a result now the internal GForce-optimization is used for all calculations.
Used data:
DT <- fread("ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976", na.strings='?')
We can use tidyverse
. After grouping by 'ID', we summarise
the columns based on the first
or last
observation
library(dplyr)
DT %>%
group_by(ID) %>%
summarise(a = last(a),
b = last(b),
c = first(c),
d = sum(d),
e = sum(e),
f = f[f!="?"][1])
# A tibble: 5 × 7
# ID a b c d e f
# <int> <int> <int> <int> <int> <int> <chr>
#1 1 11 111 1000 30030 300300 5/07/1977
#2 2 22 222 2000 20000 200000 6/02/1980
#3 3 33 333 3000 60010 600100 20/12/1978
#4 4 44 444 4000 240150 2401500 7/06/1944
#5 5 55 555 5000 100010 1000100 31/05/1976