Mutate with a list column function in dplyr
You could simply add rowwise
df_comp_jaccard <- df_comp %>%
rowwise() %>%
dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/
length(union(names_vec, source_vec)))
# A tibble: 3 x 3
names_ names_vec jaccard_sim
<chr> <list> <dbl>
1 b d f <chr [3]> 0.2
2 u k g <chr [3]> 0.0
3 m o c <chr [3]> 0.2
Using rowwise
you get the intuitive behavior some would expect when using mutate
: "do this operation for every row".
Not using rowwise
means you take advantage of vectorized functions, which is much faster, that's why it's the default, but may yield unexpected results if you're not careful.
The impression that mutate
(or other dplyr
functions) works row-wise is an illusion due to the fact you're working with vectorized functions, in fact you're always juggling with full columns.
I'll illustrate with a couple of examples:
Sometimes the result is the same, with a vectorized function such as paste
:
tibble(a=1:10,b=10:1) %>% mutate(X = paste(a,b,sep="_"))
tibble(a=1:10,b=10:1) %>% rowwise %>% mutate(X = paste(a,b,sep="_"))
# # A tibble: 5 x 3
# a b X
# <int> <int> <chr>
# 1 1 5 1_5
# 2 2 4 2_4
# 3 3 3 3_3
# 4 4 2 4_2
# 5 5 1 5_1
And sometimes it's different, with a function that is not vectorized, such as max
:
tibble(a=1:5,b=5:1) %>% mutate(max(a,b))
# # A tibble: 5 x 3
# a b `max(a, b)`
# <int> <int> <int>
# 1 1 5 5
# 2 2 4 5
# 3 3 3 5
# 4 4 2 5
# 5 5 1 5
tibble(a=1:5,b=5:1) %>% rowwise %>% mutate(max(a,b))
# # A tibble: 5 x 3
# a b `max(a, b)`
# <int> <int> <int>
# 1 1 5 5
# 2 2 4 4
# 3 3 3 3
# 4 4 2 4
# 5 5 1 5
Note that in this case you shouldn't use rowwise
in a real life situation, but pmax
which is vectorized for this purpose:
tibble(a=1:5,b=5:1) %>% mutate(pmax(a,b))
# # A tibble: 5 x 3
# a b `pmax(a, b)`
# <int> <int> <int>
# 1 1 5 5
# 2 2 4 4
# 3 3 3 3
# 4 4 2 4
# 5 5 1 5
Intersect is such function, you fed this function one list column containing vectors and one other vector, these 2 objects have no intersection.
We can use map
to loop through the list
library(tidyverse)
df_comp %>%
mutate(jaccard_sim = map_dbl(names_vec, ~length(intersect(.x,
source_vec))/length(union(.x, source_vec))))
# A tibble: 3 x 3
# names_ names_vec jaccard_sim
# <chr> <list> <dbl>
#1 b d f <chr [3]> 0.2
#2 u k g <chr [3]> 0.0
#3 m o c <chr [3]> 0.2
The map
functions are optimized. Below are the system.time
for a slightly bigger dataset
df_comp1 <- df_comp[rep(1:nrow(df_comp), 1e5),]
system.time({
df_comp1 %>%
rowwise() %>%
dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/length(union(names_vec, source_vec)))
})
#user system elapsed
# 25.59 0.05 25.96
system.time({
df_comp1 %>%
mutate(jaccard_sim = map_dbl(names_vec, ~length(intersect(.x,
source_vec))/length(union(.x, source_vec))))
})
#user system elapsed
# 13.22 0.00 13.22