Count unique values for every column
Using the lengths function:
lengths(lapply(Testdata, unique))
# var_1 var_2 var_3
#     1     1     3
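The original Testdata isn't shown, so here is a hypothetical data frame consistent with the output above (the column contents are assumptions, chosen only to reproduce the counts):

Testdata <- data.frame(
  var_1 = c("a", "a", "a"),  # 1 unique value (assumed)
  var_2 = c("b", "b", "b"),  # 1 unique value (assumed)
  var_3 = c("x", "y", "z")   # 3 unique values (assumed)
)
lengths(lapply(Testdata, unique))
# var_1 var_2 var_3
#     1     1     3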
You could use apply:
apply(Testdata, 2, function(x) length(unique(x)))
# var_1 var_2 var_3
#     1     1     3
In dplyr:
Testdata %>% summarise_all(n_distinct)
(For those curious about the complete syntax:
In dplyr >= 0.8.0, using purrr-style syntax:
Testdata %>% summarise_all(list(~ n_distinct(.)))
In dplyr < 0.8.0:
Testdata %>% summarise_all(funs(n_distinct(.)))
)
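In dplyr >= 1.0.0, summarise_all() is superseded by across(); a sketch of the equivalent call (not from the original answer) would be:

library(dplyr)
Testdata %>% summarise(across(everything(), n_distinct))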
More information on summarizing multiple columns can be found here: https://dplyr.tidyverse.org/reference/summarise_all.html
This is actually an improvement on the comment by @Ananda Mahto. It didn't fit in a comment, so I decided to add it as an answer.
sapply is actually marginally faster than lapply, and gives the output in a more compact form, just like the output from apply.
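If you also want an explicit guarantee on the result type, base R's vapply() is a type-checked alternative with the same compact output (a sketch, not part of the benchmark below):

vapply(datafile, function(x) length(unique(x)), integer(1))
# returns a named integer vector; errors if any per-column result is not a single integer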
A test run result on actual data:
> start <- Sys.time()
> apply(datafile, 2, function(x)length(unique(x)))
symbol. date volume
   1371  261  53647
> Sys.time() - start
Time difference of 1.619567 secs
>
> start <- Sys.time()
> lapply(datafile, function(x)length(unique(x)))
$symbol.
[1] 1371
$date
[1] 261
$volume
[1] 53647
> Sys.time() - start
Time difference of 0.07129478 secs
>
> start <- Sys.time()
> sapply(datafile, function(x)length(unique(x)))
symbol. date volume
   1371  261  53647
> Sys.time() - start
Time difference of 0.06939292 secs
The datafile has around 3.5 million rows.
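As an aside, differencing Sys.time() is a fairly rough measurement; base R's system.time() reports the same comparison more directly, and much of apply()'s extra cost likely comes from it coercing the data frame to a matrix first. A sketch (datafile is the answerer's private data, so this is illustrative only):

system.time(apply(datafile, 2, function(x) length(unique(x))))
system.time(lapply(datafile, function(x) length(unique(x))))
system.time(sapply(datafile, function(x) length(unique(x))))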
Quoting the help text:
sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
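To illustrate the quoted behaviour, passing simplify = FALSE makes sapply() return a list, just like lapply() (a small sketch reusing the hypothetical Testdata above):

sapply(Testdata, function(x) length(unique(x)), simplify = FALSE)
# same list-shaped result as lapply(Testdata, function(x) length(unique(x)))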