Count unique values for every column

Using the lengths function:

lengths(lapply(Testdata, unique))

# var_1 var_2 var_3 
#     1     1     3 
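For anyone who wants to run this, here is a minimal reproducible sketch. The contents of Testdata are not shown in the answer, so the data frame below is a made-up stand-in, chosen only so that the counts match the output above. Note that lengths() needs R 3.2.0 or later.

```r
# Hypothetical stand-in for Testdata (the real data is not shown);
# values are picked so the counts come out as 1, 1, 3.
Testdata <- data.frame(
  var_1 = c("a", "a", "a"),
  var_2 = c("b", "b", "b"),
  var_3 = c("c", "d", "e"),
  stringsAsFactors = FALSE
)

# lapply() returns one vector of unique values per column (a list);
# lengths() then counts the elements of each list entry.
counts <- lengths(lapply(Testdata, unique))
counts
```

The result is a named integer vector, so a single column's count can be pulled out with counts[["var_3"]].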

You could use apply:

apply(Testdata, 2, function(x) length(unique(x)))
# var_1 var_2 var_3 
#     1     1     3

In dplyr:

Testdata %>% summarise_all(n_distinct)

🙂

(For those curious about the complete syntax:

In dplyr >= 0.8.0, using purrr-style lambda syntax:

Testdata %>% summarise_all(list(~n_distinct(.)))

In dplyr <0.8.0:

Testdata %>% summarise_all(funs(n_distinct(.)))

)

More information on summarizing multiple columns can be found here: https://dplyr.tidyverse.org/reference/summarise_all.html
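As a side note, since dplyr 1.0.0 the scoped verbs such as summarise_all() are superseded by across(). A sketch of the equivalent call follows, assuming dplyr >= 1.0.0 is installed; the Testdata here is again a made-up stand-in, not the original data.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0

# Hypothetical stand-in for Testdata (the real data is not shown)
Testdata <- data.frame(
  var_1 = c("a", "a", "a"),
  var_2 = c("b", "b", "b"),
  var_3 = c("c", "d", "e"),
  stringsAsFactors = FALSE
)

# across(everything(), n_distinct) applies n_distinct() to every column
distinct_counts <- Testdata %>%
  summarise(across(everything(), n_distinct))
distinct_counts
```

The result is a one-row data frame with one column per original column, the same shape summarise_all(n_distinct) returns.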


This is an improvement on the comment by @Ananda Mahto. It didn't fit in a comment, so I decided to add it as an answer.

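In the single run below, sapply comes out marginally faster than lapply (sapply is a wrapper around lapply, so a gap this small may just be noise), and it gives the output in a more compact form, just like the output from apply. Both are much faster than apply.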

A test run result on actual data:

> start <- Sys.time()
> apply(datafile, 2, function(x)length(unique(x)))
          symbol.           date     volume 
             1371            261      53647 
> Sys.time() - start
Time difference of 1.619567 secs
> 
> start <- Sys.time()
> lapply(datafile, function(x)length(unique(x)))
$symbol.
[1] 1371

$date
[1] 261

$volume
[1] 53647

> Sys.time() - start
Time difference of 0.07129478 secs
> 
> start <- Sys.time()
> sapply(datafile, function(x)length(unique(x)))
          symbol.              date             volume 
             1371               261              53647 
> Sys.time() - start
Time difference of 0.06939292 secs

The datafile has around 3.5 million rows.
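The original datafile is not available, so the comparison can only be sketched on simulated data. The block below assumes a made-up data frame (datafile_sim, with far fewer rows) and uses system.time() instead of manual Sys.time() differences. One likely reason for the gap: apply() first coerces a data frame to a matrix via as.matrix(), while lapply() and sapply() work on the columns directly.

```r
set.seed(1)
n <- 1e5  # assumption: much smaller than the ~3.5 million rows above

# Simulated stand-in for datafile; column names and ranges are illustrative
datafile_sim <- data.frame(
  symbol = sample(sprintf("S%04d", 1:1371), n, replace = TRUE),
  date   = sample.int(261L, n, replace = TRUE),
  volume = sample.int(60000L, n, replace = TRUE),
  stringsAsFactors = FALSE
)

count_unique <- function(x) length(unique(x))

# Time each approach once; res_* hold the per-column counts
t_apply  <- system.time(res_apply  <- apply(datafile_sim, 2, count_unique))
t_lapply <- system.time(res_lapply <- lapply(datafile_sim, count_unique))
t_sapply <- system.time(res_sapply <- sapply(datafile_sim, count_unique))

# Elapsed seconds per approach (values vary by machine and run)
c(apply  = t_apply[["elapsed"]],
  lapply = t_lapply[["elapsed"]],
  sapply = t_sapply[["elapsed"]])
```

A single timing is noisy; repeating the calls several times, or using a benchmarking package, would give a more reliable comparison.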

Quoting the help text:

sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
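The relationship described in the help text can be checked directly. This is a small sketch on a made-up data frame (df is hypothetical); it also shows vapply(), a base R relative of sapply() that requires the result type to be declared up front.

```r
# Hypothetical example data
df <- data.frame(
  a = c(1, 1, 2),
  b = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

count_unique <- function(x) length(unique(x))

# With simplification and name handling switched off, sapply() returns
# the same list that lapply() does
as_list <- sapply(df, count_unique, simplify = FALSE, USE.NAMES = FALSE)
same_as_lapply <- identical(as_list, lapply(df, count_unique))
same_as_lapply

# vapply() simplifies like sapply() but errors if any result is not
# a single integer, which makes it safer inside functions
counts <- vapply(df, count_unique, FUN.VALUE = integer(1))
counts
```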