Creating a vector in R of counts for number of times each element appears in another vector

You could use outer with colSums:

colSums(outer(a, b, `==`))
[1] 0 0 2 1 1 0 0

You can do:

res <- table(factor(b, levels=b)[match(a, b, nomatch=0)])

salamander       worm        dog     banana        cat     yellow       blue 
         0          0          2          1          1          0          0 

If you want a vanilla vector, there's as.vector(res).


Comments

  • (Thanks to @HectorHaffenden) This approach assumes that all values in b are distinct.
  • I expect this to be faster than making exhaustive comparisons with == as in some other answers. The steps are pretty similar to @GKi's double merge: find where the vectors match, then map back to b.

Benchmarks

Required packages: data.table, purrr, microbenchmark

Various options

library(data.table)
# NelsonGon's answer
purrem <- function() purrr::map_dbl(b, ~sum(.x==a))
# Andrew's answer
vappem <- function() vapply(b, function(x) sum(x == a), FUN.VALUE = integer(1))
# Andrew's answer
collem <- function() colSums(outer(a, b, `==`)) 
# arg0naut91's answer
lappem  <- function() unlist(lapply(b, function(x) sum(x == a)))
# this answer
matchem <- function() table(factor(b, levels=b)[match(a, b, nomatch=0)])
# this answer + data.table
matchem2<- function() 
  setDT(list(b))[, n := 0L][setDT(list(a))[, .N, by=V1], on=.(V1), n := N]$n
# @GKi's answer
mergem <- function() merge(b, table(merge(a, b, by=1)), by=1, all.x=T)[,2]

Example input and benchmarking code

nv = 1e4 # values that can appear in a
nb = 1e3 # values to look up, nb <= na
na = 1e5 # length of a

set.seed(1)
a <- sample(nv, na, replace=TRUE) 
b <- seq_len(nb)

microbenchmark::microbenchmark(times = 10,
pur_res <- purrem(),
vap_res <- vappem(),
col_res <- collem(),
lap_res <- lappem(),
mat_res <- matchem(),
mat_res2<- matchem2(),
mer_res <- mergem()
)

# make sure results match
# left as an exercise for the cautious user
identical(as.vector(mat_res), lap_res) # ok
identical(as.integer(col_res), lap_res) # ok
# etc

Results

Unit: milliseconds
                   expr         min          lq        mean      median          uq        max neval
    pur_res <- purrem()  373.488498  389.331825  479.039835  430.363183  500.948370  858.77997    10
    vap_res <- vappem()  367.247322  397.516902  472.635368  505.782597  532.951841  570.68548    10
    col_res <- collem() 1353.356494 1481.029982 1507.536324 1515.966781 1552.886597 1650.93967    10
    lap_res <- lappem()  352.197701  394.562073  469.988534  507.935397  525.426475  559.56388    10
   mat_res <- matchem()    3.032507    3.230309    5.101941    3.371101    3.874484   15.31595    10
 mat_res2 <- matchem2()    7.591947   11.666453   12.809046   12.266796   13.676658   22.04095    10
    mer_res <- mergem()   23.448314   23.712974   27.730525   24.547323   24.716967   46.92548    10

If it takes under a second, fits in memory and is run once, choosing among these options probably isn't too important. The ranking among the not-slow options probably depends on the parameters of the OP's actual problem (which nv, na, nb can hopefully be adjusted to approximate here).

Feel free to edit in more options and rerun, copying your results over mine here. For example, I couldn't get @NelsonGon's stringi approach to work with these parameters, but maybe someone else has more patience or a more powerful computer. I'd also be curious to see memory usage, but haven't learned the packages that support measuring it yet.

If there is some nv/na/nb configuration where one answer performs particularly well, editing that answer with a similar benchmark highlighting that case is an option.


Just FYI:

bench::mark(
    pur_res <- purrem(),
    vap_res <- vappem(),
    col_res <- collem(),
    lap_res <- lappem(),
    mat_res <- matchem(),
    mat_res2<- matchem2(),
    mer_res <- mergem(),
    stringi <- sapply(b, function(x) sum(stringi::stri_count(x, regex=a))),
    check=FALSE
)

# A tibble: 8 x 14
  expression                                          min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result     memory          time   gc          
  <chr>                                          <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>     <list>          <list> <list>      
1 pur_res <- purrem()                            421.14ms 424.65ms 424.65ms 428.15ms   2.35     382.21MB     0     2   849.29ms <dbl [1,0~ <Rprofmem [2,1~ <bch:~ <tibble [2 ~
2 vap_res <- vappem()                            367.88ms 370.61ms 370.61ms 373.34ms   2.70     381.52MB     0     2   741.23ms <int [1,0~ <Rprofmem [1,0~ <bch:~ <tibble [2 ~
3 col_res <- collem()                               1.64s    1.64s    1.64s    1.64s   0.608      1.12GB     2     1      1.64s <dbl [1,0~ <Rprofmem [32 ~ <bch:~ <tibble [1 ~
4 lap_res <- lappem()                            411.25ms 506.67ms 506.67ms  602.1ms   1.97     381.53MB     3     2      1.01s <int [1,0~ <Rprofmem [1,0~ <bch:~ <tibble [2 ~
5 mat_res <- matchem()                             3.11ms   3.48ms   3.44ms   5.79ms 287.          1.4MB     0   144   501.66ms <S3: tabl~ <Rprofmem [90 ~ <bch:~ <tibble [14~
6 mat_res2 <- matchem2()                           5.22ms   6.26ms   5.96ms   27.7ms 160.         4.83MB     1    80   501.18ms <int [1,0~ <Rprofmem [435~ <bch:~ <tibble [80~
7 mer_res <- mergem()                             19.88ms  22.75ms  22.02ms   33.6ms  44.0        6.59MB     1    23    523.3ms <int [1,0~ <Rprofmem [410~ <bch:~ <tibble [23~
8 stringi <- sapply(b, function(x) sum(string~      6.57m    6.57m    6.57m    6.57m   0.00254    1.12GB     1     1      6.57m <int [1,0~ <Rprofmem [2,3~ <bch:~ <tibble [1 ~

Perhaps this is a little bit faster, but not sure if a major improvement:

vapply(b, function(x) sum(x == a), FUN.VALUE = integer(1))

Output:

salamander       worm        dog     banana        cat     yellow       blue 
         0          0          2          1          1          0          0 

Also unlist with lapply can be a slightly better performing combination in the apply family:

unlist(lapply(b, function(x) sum(x == a)))

Output:

[1] 0 0 2 1 1 0 0

I don't have the opportunity to properly benchmark right now, however I believe also the unnecessary use of curly brackets ({}) can negatively impact the performance.

Tags:

R