R grep: Match one string against multiple patterns

What about applying the regexpr function over a vector of keywords?

keywords <- c("dog", "cat", "bird")

strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")

sapply(keywords, regexpr, strings, ignore.case=TRUE)

     dog cat bird
[1,]  15  -1   -1
[2,]  -1   4   15
[3,]  -1  -1   -1

For a single string, as in the question title, pass just that element:

sapply(keywords, regexpr, strings[1], ignore.case=TRUE)

 dog  cat bird 
  15   -1   -1 

Values returned are the position of the first character in the match, with -1 meaning no match.
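
As a small sketch of working with those positions for a single string: the -1 sentinel can be recoded to NA so that it does not get mistaken for a real position, and the earliest-matching keyword can then be picked out. The names `string`, `pos`, and `first_hit` are just illustrative.

```r
keywords <- c("dog", "cat", "bird")
string <- "Do you have a dog?"

# One start position per keyword; -1 means the keyword was not found
pos <- sapply(keywords, regexpr, string, ignore.case = TRUE)

# Recode the "no match" sentinel as NA
pos[pos == -1] <- NA

# Keyword whose match starts earliest (which.min() skips NA values)
first_hit <- names(pos)[which.min(pos)]
first_hit
```

If no keyword matches at all, `which.min()` returns an empty index and `first_hit` is `character(0)`.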

If the position of the match is irrelevant, use grepl instead:

sapply(keywords, grepl, strings, ignore.case=TRUE)

       dog   cat  bird
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE  TRUE
[3,] FALSE FALSE FALSE

Update: This runs relatively quickly on my system, even with a large number of keywords:

# Available on most *nix systems
words <- scan("/usr/share/dict/words", what="")
length(words)
[1] 234936

system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))

   user  system elapsed 
  7.495   0.155   7.596 

dim(matches)
[1]      3 234936
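
If the keywords are plain words rather than regular expressions, literal matching with fixed = TRUE is another option, and it may be faster on long keyword lists (not timed here, so treat that as an assumption). Since grepl() ignores ignore.case when fixed = TRUE, this sketch lower-cases both sides instead; `lower_strings` and `matches_fixed` are illustrative names.

```r
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")

# Lower-case once up front; fixed = TRUE cannot be combined with ignore.case
lower_strings <- tolower(strings)

# vapply() guarantees one logical column per keyword, one row per string
matches_fixed <- vapply(
  tolower(keywords),
  function(k) grepl(k, lower_strings, fixed = TRUE),
  logical(length(lower_strings))
)
matches_fixed
```

For these inputs the result has the same shape and values as the sapply()/grepl() matrix above.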

To expand on the other answer: to turn the sapply() output into a useful logical vector, you need a further apply() step.

keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
(matches <- sapply(keywords, grepl, strings, ignore.case=TRUE))
#        dog   cat  bird
# [1,]  TRUE FALSE FALSE
# [2,] FALSE  TRUE  TRUE
# [3,] FALSE FALSE FALSE

To know which strings contain any of the keywords (patterns):

apply(matches, 1, any)
# [1]  TRUE  TRUE FALSE
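
If this per-string answer is all you need, a shortcut is to collapse the keywords into one alternation pattern and call grepl() once. This is a sketch that assumes the keywords contain no regex metacharacters; anything like "." or "(" would need escaping first. `any_match` is an illustrative name.

```r
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")

# "dog|cat|bird" matches a string if any one keyword occurs in it
pattern <- paste(keywords, collapse = "|")
any_match <- grepl(pattern, strings, ignore.case = TRUE)
any_match
# [1]  TRUE  TRUE FALSE
```

The trade-off is that the combined pattern no longer tells you which keyword matched.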

To know which keywords (patterns) were matched in the supplied strings:

apply(matches, 2, any)
#  dog  cat bird 
# TRUE TRUE TRUE
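
The same matrix can also be used to list which keywords each string matched, by indexing the keywords with each row. A minimal sketch; `matched_keywords` is an illustrative name.

```r
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
matches <- sapply(keywords, grepl, strings, ignore.case = TRUE)

# Row i of `matches` is a logical mask over `keywords` for strings[i]
matched_keywords <- lapply(
  seq_len(nrow(matches)),
  function(i) keywords[matches[i, ]]
)
names(matched_keywords) <- strings
matched_keywords
```

lapply() is used rather than apply() so that the result is always a list, even when every string matches the same number of keywords.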

The re2r package can match multiple patterns (in parallel). A minimal example:

# compile patterns
re <- re2r::re2(keywords)
# match strings
re2r::re2_detect(strings, re, parallel = TRUE)
