Creating a vector in R of counts for number of times each element appears in another vector
You could use outer with colSums:
colSums(outer(a, b, `==`))
[1] 0 0 2 1 1 0 0
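The call above assumes two vectors a and b already exist. As a self-contained sketch, the following inputs are reconstructed from the output shown (an assumption, not necessarily the OP's exact data):
a <- c("dog", "cat", "dog", "banana")                                  # vector whose elements get counted
b <- c("salamander", "worm", "dog", "banana", "cat", "yellow", "blue") # values to count occurrences of
# outer(a, b, `==`) builds a length(a) x length(b) logical matrix of element-wise comparisons;
# colSums() then adds up each column, i.e. how many times each element of b appears in a.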
You can do:
res <- table(factor(b, levels=b)[match(a, b, nomatch=0)])
salamander worm dog banana cat yellow blue
0 0 2 1 1 0 0
If you want a vanilla vector, there's as.vector(res).
Comments
- (Thanks to @HectorHaffenden) This approach assumes that all values in b are distinct.
- I expect this to be faster than making exhaustive comparisons with == as in some other answers. The steps are pretty similar to @GKi's double merge: find where the vectors match, then map back to b.
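To make those steps concrete, here is a minimal walkthrough of the same expression, using the assumed example vectors a and b defined above:
idx  <- match(a, b, nomatch = 0)    # for each element of a, its position in b (0 if absent)
hits <- factor(b, levels = b)[idx]  # map positions back to b's values; zero indices drop out, all levels of b are kept
table(hits)                         # tabulating over all levels gives 0 for values of b never seen in a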
Benchmarks
Required packages: data.table, purrr, microbenchmark
Various options
library(data.table)
# NelsonGon's answer
purrem <- function() purrr::map_dbl(b, ~sum(.x==a))
# Andrew's answer
vappem <- function() vapply(b, function(x) sum(x == a), FUN.VALUE = integer(1))
# Andrew's answer
collem <- function() colSums(outer(a, b, `==`))
# arg0naut91's answer
lappem <- function() unlist(lapply(b, function(x) sum(x == a)))
# this answer
matchem <- function() table(factor(b, levels=b)[match(a, b, nomatch=0)])
# this answer + data.table
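# (builds a one-column data.table from b with counts initialised to 0, then update-joins in the per-value counts of a)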
matchem2<- function()
setDT(list(b))[, n := 0L][setDT(list(a))[, .N, by=V1], on=.(V1), n := N]$n
# @GKi's answer
mergem <- function() merge(b, table(merge(a, b, by=1)), by=1, all.x=T)[,2]
Example input and benchmarking code
nv = 1e4 # values that can appear in a
nb = 1e3 # values to look up, nb <= nv
na = 1e5 # length of a
set.seed(1)
a <- sample(nv, na, replace=TRUE)
b <- seq_len(nb)
microbenchmark::microbenchmark(times = 10,
pur_res <- purrem(),
vap_res <- vappem(),
col_res <- collem(),
lap_res <- lappem(),
mat_res <- matchem(),
mat_res2<- matchem2(),
mer_res <- mergem()
)
# make sure results match
# left as an exercise for the cautious user
identical(as.vector(mat_res), lap_res) # ok
identical(as.integer(col_res), lap_res) # ok
# etc
Results
Unit: milliseconds
expr min lq mean median uq max neval
pur_res <- purrem() 373.488498 389.331825 479.039835 430.363183 500.948370 858.77997 10
vap_res <- vappem() 367.247322 397.516902 472.635368 505.782597 532.951841 570.68548 10
col_res <- collem() 1353.356494 1481.029982 1507.536324 1515.966781 1552.886597 1650.93967 10
lap_res <- lappem() 352.197701 394.562073 469.988534 507.935397 525.426475 559.56388 10
mat_res <- matchem() 3.032507 3.230309 5.101941 3.371101 3.874484 15.31595 10
mat_res2 <- matchem2() 7.591947 11.666453 12.809046 12.266796 13.676658 22.04095 10
mer_res <- mergem() 23.448314 23.712974 27.730525 24.547323 24.716967 46.92548 10
If it takes under a second, fits in memory, and is run only once, choosing among these options probably isn't too important. The ranking among the not-slow options likely depends on the parameters of the OP's actual problem (which nv, na and nb can hopefully be adjusted to approximate here).
Feel free to edit in more options and rerun, copying your results over mine here. For example, I couldn't get @NelsonGon's stringi approach to work with these parameters, but maybe someone else has more patience or a more powerful computer. I'd also be curious to see memory usage, but haven't learned the packages that support measuring it yet.
If there is some nv/na/nb configuration where one answer performs particularly well, editing that answer with a similar benchmark highlighting that case is an option.
Just FYI, bench::mark also reports memory allocation (the mem_alloc column):
bench::mark(
pur_res <- purrem(),
vap_res <- vappem(),
col_res <- collem(),
lap_res <- lappem(),
mat_res <- matchem(),
mat_res2<- matchem2(),
mer_res <- mergem(),
stringi <- sapply(b, function(x) sum(stringi::stri_count(x, regex=a))),
check=FALSE
)
# A tibble: 8 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time result memory time gc
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <bch:tm> <list> <list> <list> <list>
1 pur_res <- purrem() 421.14ms 424.65ms 424.65ms 428.15ms 2.35 382.21MB 0 2 849.29ms <dbl [1,0~ <Rprofmem [2,1~ <bch:~ <tibble [2 ~
2 vap_res <- vappem() 367.88ms 370.61ms 370.61ms 373.34ms 2.70 381.52MB 0 2 741.23ms <int [1,0~ <Rprofmem [1,0~ <bch:~ <tibble [2 ~
3 col_res <- collem() 1.64s 1.64s 1.64s 1.64s 0.608 1.12GB 2 1 1.64s <dbl [1,0~ <Rprofmem [32 ~ <bch:~ <tibble [1 ~
4 lap_res <- lappem() 411.25ms 506.67ms 506.67ms 602.1ms 1.97 381.53MB 3 2 1.01s <int [1,0~ <Rprofmem [1,0~ <bch:~ <tibble [2 ~
5 mat_res <- matchem() 3.11ms 3.48ms 3.44ms 5.79ms 287. 1.4MB 0 144 501.66ms <S3: tabl~ <Rprofmem [90 ~ <bch:~ <tibble [14~
6 mat_res2 <- matchem2() 5.22ms 6.26ms 5.96ms 27.7ms 160. 4.83MB 1 80 501.18ms <int [1,0~ <Rprofmem [435~ <bch:~ <tibble [80~
7 mer_res <- mergem() 19.88ms 22.75ms 22.02ms 33.6ms 44.0 6.59MB 1 23 523.3ms <int [1,0~ <Rprofmem [410~ <bch:~ <tibble [23~
8 stringi <- sapply(b, function(x) sum(string~ 6.57m 6.57m 6.57m 6.57m 0.00254 1.12GB 1 1 6.57m <int [1,0~ <Rprofmem [2,3~ <bch:~ <tibble [1 ~
Perhaps this is a little faster, but I'm not sure it's a major improvement:
vapply(b, function(x) sum(x == a), FUN.VALUE = integer(1))
Output:
salamander worm dog banana cat yellow blue
0 0 2 1 1 0 0
Also, unlist with lapply can be a slightly better-performing combination in the apply family:
unlist(lapply(b, function(x) sum(x == a)))
Output:
[1] 0 0 2 1 1 0 0
I don't have the opportunity to benchmark properly right now, but I believe the unnecessary use of curly brackets ({}) can also negatively impact performance.
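As a minimal sketch of how one could test that claim (assuming the same vectors a and b as in the benchmark above), microbenchmark can compare the two forms directly; any difference is likely small relative to the cost of the comparisons themselves:
# same anonymous function with and without redundant braces
microbenchmark::microbenchmark(times = 100,
plain = unlist(lapply(b, function(x) sum(x == a))),
braced = unlist(lapply(b, function(x) { sum(x == a) }))
)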