Remove duplicated rows using dplyr

Note: dplyr now contains the distinct function for this purpose.

Original answer below:

library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)

One approach would be to group, and then only keep the first row:

df %>% group_by(x, y) %>% filter(row_number(z) == 1)

## Source: local data frame [3 x 3]
## Groups: x, y
## 
##   x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4

(In dplyr 0.2 you won't need the dummy z variable and will just be able to write row_number() == 1)

I've also been thinking about adding a slice() function that would work like:

df %>% group_by(x, y) %>% slice(from = 1, to = 1)

Or maybe a variation of unique() that would let you select which variables to use:

df %>% unique(x, y)

Here is a solution using dplyr >= 0.5.

library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)

> df %>% distinct(x, y, .keep_all = TRUE)
    x y z
  1 0 1 1
  2 1 0 2
  3 1 1 4

Remove duplicated rows using dplyr

Tags:

R

Dplyr

Related

Recent Posts