Deleting reversed duplicates with R
A dplyr possibility could be:
mydf %>%
group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
slice(1) %>%
ungroup() %>%
select(-grp)
gene_x gene_y
<chr> <chr>
1 AT1 AT2
2 AT1 AT3
3 AT3 AT4
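To see why this works, it can help to look at the grouping key on its own. The following is only a sketch in base R; it rebuilds the sample data by hand, and the names key and dup are illustrative rather than part of the answer above.

```r
# Sample data, copied by hand from the read.table() call used on this page
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# pmax()/pmin() compare the two columns element-wise, so a pair and its
# reverse produce the same "larger_smaller" key
key <- paste(pmax(mydf$gene_x, mydf$gene_y),
             pmin(mydf$gene_x, mydf$gene_y), sep = "_")

# Rows whose key has already been seen are the (possibly reversed) duplicates
dup <- duplicated(key)

print(data.frame(mydf, key, dup))
```

Rows 3 and 5 share their key with row 1, which is why only one row per key survives the slice(1) step.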
Or:
mydf %>%
group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-grp)
Or:
mydf %>%
group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
distinct(grp, .keep_all = TRUE) %>%
ungroup() %>%
select(-grp)
Or using dplyr and purrr:
mydf %>%
group_by(grp = paste(invoke(pmax, .), invoke(pmin, .), sep = "_")) %>%
slice(1) %>%
ungroup() %>%
select(-grp)
And as of purrr 0.3.0, invoke() is retired; exec() should be used instead:
mydf %>%
group_by(grp = paste(exec(pmax, !!!.), exec(pmin, !!!.), sep = "_")) %>%
slice(1) %>%
ungroup() %>%
select(-grp)
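Both invoke(pmax, .) and exec(pmax, !!!.) pass every column of the data frame to pmax() as a separate argument. As a rough base R analogue (an assumption for illustration, not part of the answer above), do.call() does the same thing; the names hi, lo and key are hypothetical.

```r
# Sample data, copied by hand from the read.table() call used on this page
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so do.call() hands gene_x and gene_y
# to pmax()/pmin() as individual arguments
hi <- do.call(pmax, mydf)
lo <- do.call(pmin, mydf)

# Same key as in the dplyr pipelines
key <- paste(hi, lo, sep = "_")
print(key)
```

Because every column is passed along, this form only makes sense when the data frame holds nothing but the columns to compare.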
Or:
mydf %>%
rowwise() %>%
mutate(grp = paste(sort(c(gene_x, gene_y)), collapse = "_")) %>%
group_by(grp) %>%
slice(1) %>%
ungroup() %>%
select(-grp)
mydf <- read.table(text="gene_x gene_y
AT1 AT2
AT3 AT4
AT1 AT2
AT1 AT3
AT2 AT1", header=TRUE, stringsAsFactors=FALSE)
Here's one strategy using apply, sort, paste, and duplicated:
mydf[!duplicated(apply(mydf,1,function(x) paste(sort(x),collapse=''))),]
gene_x gene_y
1 AT1 AT2
2 AT3 AT4
4 AT1 AT3
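The one-liner can be unpacked into steps to make the intermediate key visible. This is a sketch only: keys and res are illustrative names, and it swaps the empty separator for "_", since with collapse='' two different pairs such as ("A", "BC") and ("AB", "C") would paste to the same string.

```r
# Sample data, copied by hand from the read.table() call used on this page
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# apply() walks the rows; each row is sorted and pasted into a single string,
# so the order of gene_x and gene_y no longer matters
keys <- apply(mydf, 1, function(x) paste(sort(x), collapse = "_"))

# Keep only the first row for each key
res <- mydf[!duplicated(keys), ]
print(res)
```

The kept rows are the same as with the original one-liner for this data; the separator only matters when values could run together ambiguously.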
And here's a slightly different solution:
mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
gene_x gene_y
1 AT1 AT2
2 AT3 AT4
4 AT1 AT3
Another tidyverse-centric approach, but using purrr:
library(tidyverse)
c_sort_collapse <- function(...){
c(...) %>%
sort() %>%
str_c(collapse = ".")
}
mydf %>%
mutate(x_y = map2_chr(gene_x, gene_y, c_sort_collapse)) %>%
distinct(x_y, .keep_all = TRUE) %>%
select(-x_y)
#> gene_x gene_y
#> 1 AT1 AT2
#> 2 AT3 AT4
#> 3 AT1 AT3