Replace a sequence of values by group depending on preceeding values
Here's another data.table approach:
dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(id, x=x-1, values="a")],
on = .(id, x, values),
values := "b"
][, x := NULL]
- create a new column "x" with the run length ids per value grouped by id
- join on itself while modifying the run length ids (x) to be the preceeding value and values to be "a" (the specific value you want to change), then update values with "b"
- delete column x afterwards
The result is:
dt
# id values
# 1: 1 a
# 2: 1 c
# 3: 1 b
# 4: 1 b
# 5: 1 a
# 6: 2 c
# 7: 2 c
# 8: 2 b
# 9: 2 b
# 10: 2 c
# 11: 3 c
# 12: 3 b
# 13: 3 b
# 14: 3 b
# 15: 3 b
And here's a generalization to the case where you want to replace values "a", "x", or "y" followed by "b" with "b":
dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(values=c("a", "x", "y")), by = .(id, x=x-1)],
on = .(id, x, values),
values := "b"
][, x := NULL]
Late to the party and several nice run length alternatives were already provided ;) So here I try nafill
instead.
(1) Create a variable 'v2' which is NA
when 'values' are "a". (2) Fill missing values by next observation carried backward. (3) When the original 'values' are "a" and the corresponding filled values in 'v2' are "b", update 'v' with 'v2'.
# 1
dt[values != "a" , v2 := values]
# 2
d1[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")], by = id]
# 3
dt[values == "a" & v2 == "b", values := v2]
# clean-up
dt[ , v2 := NULL]
Currently, nafill
only works with numeric variables, hence replace
step in chunk # 2
(modified from @chinsoon12 in the issue nafill, setnafill for character, factor and other types).
The NA
replacement code may be slightly shortened by using zoo::nalocf
:
dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = id]
However, note that na.locf
is slower.
When comparing the answers on larger data (data.table(id = rep(1:1e4, each = 1e4, replace = TRUE), values = sample(c("a", "b", "c"), 1e8, replace = TRUE)
), it turns out that this alternative actually is faster than the others.