Select groups based on number of unique / distinct values

data.table version:

library(data.table)
sample <- as.data.table(sample)
sample[ , if(uniqueN(Value) == 1) .SD, by = Group]

#   Group ID Value
#1:    AA  1     1
#2:    AA  2     1
#3:    AA  3     1
#4:    BB  4     2
#5:    BB  5     2
#6:    BB  8     2

An alternative using ave if the data is numeric, is to check if the variance is 0:

sample[with(sample, ave(Value, Group, FUN=var ))==0,]

An alternative solution that could be faster on large data is:

setkey(sample, Group, Value)
ans <- sample[unique(sample)[, .N, by=Group][N==1, Group]]

The point is that calculating unique values for each group could be time consuming when there are more groups. Instead, we can set the key on the data.table, then take unique values by key (which is extremely fast) and then count the total values for each group. We then require only those where it is 1. We can then perform a join (which is once again very fast). Here's a benchmark on large data:

require(data.table)
set.seed(1L)
sample <- data.table(ID=1:1e7, 
        Group = sample(rep(paste0("id", 1:1e5), each=100)), 
        Value = sample(2, 1e7, replace=TRUE, prob=c(0.9, 0.1)))

system.time (
    ans1 <- sample[,if(length(unique(Value))==1) .SD ,by=Group]
)
# minimum of three runs
#   user  system elapsed 
# 14.328   0.066  14.382 

system.time ({
    setkey(sample, Group, Value)
    ans2 <- sample[unique(sample)[, .N, by=Group][N==1, Group]]
})
# minimum of three runs
#   user  system elapsed 
#  5.661   0.219   5.877 

setkey(ans1, Group, ID)
setkey(ans2, Group, ID)
identical(ans1, ans2) # [1] TRUE

Here's a solution using dplyr:

library(dplyr)

sample <- data.frame(
  ID = 1:9,  
  Group= c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),  
  Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
)

sample %>%
  group_by(Group) %>%
  filter(n_distinct(Value) == 1)

We group the data by Group, and then only select groups where the number of distinct values of Value is 1.

Here's an approach

> ind <- aggregate(Value~Group, FUN=function(x) length(unique(x))==1, data=sample)[,2]
> sample[sample[,"Group"] %in% levels(sample[,"Group"])[ind], ]
  ID Group Value
1  1    AA     1
2  2    AA     1
3  3    AA     1
4  4    BB     2
5  5    BB     2
8  8    BB     2

You can make a selector for sample using ave many different ways.

sample[ ave( sample$Value, sample$Group, FUN = function(x) length(unique(x)) ) == 1,]

sample[ ave( sample$Value, sample$Group, FUN = function(x) sum(x - x[1]) ) == 0,]

sample[ ave( sample$Value, sample$Group, FUN = function(x) diff(range(x)) ) == 0,]

Select groups based on number of unique / distinct values

Tags:

Unique

R

Dataframe

Related

Recent Posts