Find indices of duplicated rows
If you are using a keyed data.table, you can use the following elegant syntax:
library(data.table)
DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(1:4, each = 3),
                 C = rep(1:2, 6),
                 key = c("A", "B", "C"))
# Row numbers of every row that has at least one duplicate
DT[unique(DT[duplicated(DT)]), which = TRUE]
To unpack:
DT[duplicated(DT)] subsets those rows which are duplicates.
unique(...) returns only the unique combinations of the duplicated rows. This deals with any cases with more than one duplicate (triplicates, etc.).
DT[..., which = TRUE] joins the duplicate rows with the original, with which = TRUE returning the row numbers (without which = TRUE it would just return the data).
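The three steps can also be run one at a time, which makes the intermediate results easier to inspect. This is only a sketch of the same one-liner split apart; the names dups, combos and idx are made up for illustration.

```r
library(data.table)

# Same example table as above, keyed on all three columns
DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(1:4, each = 3),
                 C = rep(1:2, 6),
                 key = c("A", "B", "C"))

dups   <- DT[duplicated(DT)]       # second and later copies of each row
combos <- unique(dups)             # one row per duplicated combination
idx    <- DT[combos, which = TRUE] # join back on the key, keep row numbers
idx
```

Note that setting a key sorts the table, so the row numbers in idx refer to DT in its keyed order, not to the order in which the data was originally entered.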
You could also use:
DT[, count := .N, by = list(A, B, C)][count > 1, which = TRUE]
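One thing to be aware of: := adds the count column to DT by reference, so the table is modified as a side effect. If that is not wanted, a possible variant (not part of the original answer) uses the special symbols .I and .N to collect the row numbers of groups with more than one row, without touching DT. The name idx2 is made up for illustration.

```r
library(data.table)

DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(1:4, each = 3),
                 C = rep(1:2, 6),
                 key = c("A", "B", "C"))

# .I holds the row numbers of the current group, .N its size;
# groups of size 1 contribute nothing, larger groups contribute all their rows
idx2 <- DT[, .I[.N > 1], by = list(A, B, C)]$V1
idx2
```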
Here's an example:
df <- data.frame(a = c(1,2,3,4,1,5,6,4,2,1))
duplicated(df) | duplicated(df, fromLast = TRUE)
#[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
How does it work?
duplicated(df) returns a logical vector that marks each row repeating an earlier row; the first occurrence of a value is not marked. fromLast = TRUE makes the same check run from the reverse side, so it marks each row that is repeated later on. The two resulting logical vectors are combined using | since a TRUE in at least one of them indicates a duplicated value.
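Since the question asks for indices rather than a logical vector, the result can be passed to which(). A minimal sketch on the same data; the names is_dup and dup_idx are made up for illustration.

```r
df <- data.frame(a = c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1))

# TRUE for every row whose value occurs more than once
is_dup <- duplicated(df) | duplicated(df, fromLast = TRUE)

# Convert the logical vector to row indices
dup_idx <- which(is_dup)
dup_idx
#[1]  1  2  4  5  8  9 10

# The duplicated rows themselves (drop = FALSE keeps the data.frame shape)
df[dup_idx, , drop = FALSE]
```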