Find duplicated rows (based on 2 columns) in Data Frame in R
You can always try simply passing those first two columns to the function duplicated
:
duplicated(dat[,1:2])
assuming your data frame is called dat
. For more information, we can consult the help files for the duplicated
function by typing ?duplicated
at the console. This will provide the following sentences:
Determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.
So duplicated
returns a logical vector, which we can then use to extract a subset of dat
:
ind <- duplicated(dat[,1:2])
dat[ind,]
or you can skip the separate assignment step and simply use:
dat[duplicated(dat[,1:2]),]
Here's a dplyr
option for tagging duplicates based on two (or more) columns. In this case ric
and date
:
df <- data_frame(ric = c('S1A.PA', 'ABC.PA', 'EFG.PA', 'S1A.PA', 'ABC.PA', 'EFG.PA'),
date = c('2011-06-30 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00', '2011-07-05 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00'),
open = c(23.7, 24.31, 24.495, 24.23, 24.31, 24.495))
df %>%
group_by(ric, date) %>%
mutate(dupe = n()>1)
# A tibble: 6 x 4
# Groups: ric, date [4]
ric date open dupe
<chr> <chr> <dbl> <lgl>
1 S1A.PA 2011-06-30 20:00:00 23.7 FALSE
2 ABC.PA 2011-07-03 20:00:00 24.3 TRUE
3 EFG.PA 2011-07-04 20:00:00 24.5 TRUE
4 S1A.PA 2011-07-05 20:00:00 24.2 FALSE
5 ABC.PA 2011-07-03 20:00:00 24.3 TRUE
6 EFG.PA 2011-07-04 20:00:00 24.5 TRUE
Easy way to get the information you want is to use dplyr
.
yourDF %>%
group_by(RIC, Date) %>%
mutate(num_dups = n(),
dup_id = row_number()) %>%
ungroup() %>%
mutate(is_duplicated = dup_id > 1)
Using this:
num_dups
tells you how many times that particular combo is duplicateddup_id
tells you which duplicate number that particular row is (e.g. 1st, 2nd, or 3rd, etc)is_duplicated
gives you an easy condition you can filter on later to remove all the duplicate rows (e.g.filter(!is_duplicated)
), though you could also usedup_id
for this (e.g.filter(dup_id == 1)
)
dplyr is so much nicer for this sort of thing:
library(dplyr)
yourDataFrame %>%
distinct(RIC, Date, .keep_all = TRUE)
(the ".keep_all is optional. if not used, it will return only the deduped 2 columns. when used, it returns the deduped whole data frame)