Matching words from vectors of strings in R

Maybe you are looking for adist, which computes approximate (generalized Levenshtein) string distances:

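# Distance from each messy string to its best-fitting substring of each approved name
x <- adist(messy, approved, fixed=FALSE, ignore.case = TRUE)
# The reverse distances, transposed so rows are still messy and columns approved
y <- t(adist(approved, messy, fixed=FALSE, ignore.case = TRUE))
# Keep y only where the approved name is a closest match for that messy string
i <- x == apply(x, 1, min)
y[!i]  <- NA
colnames(y) <- approved
# Break remaining ties by the smallest reverse distance, then collect the messy strings per approved name
i <- apply(y == apply(y, 1, min, na.rm=TRUE), 2, function(i) messy[i & !is.na(i)])
# Pad every group with NA to a common length and bind the groups into a matrix
do.call(cbind, lapply(i, function(x) x[seq_len(max(lengths(i)))]))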
#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "14"                       "Robinswood"   
#[2,] "28"                       NA                         NA             
#[3,] "CWP Pit 28"               NA                         NA             
#[4,] "Cotswold 28"              NA                         NA
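The distance matrix is easier to follow on concrete data. Below is a minimal sketch, assuming inputs reconstructed from the printed output (the values of messy and approved are a guess, since the answer does not show them). It uses a simplified rule that takes the first closest approved name with which.min, so it does not reproduce the reverse-distance tie-break used above.

```r
# Hypothetical inputs, reconstructed from the output shown above
approved <- c("Cotswold Water Park Pit 28",
              "Cotswold Water Park Pit 14",
              "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Rows are messy strings, columns are approved names.
# With fixed = FALSE the match is partial, so a messy string that occurs
# verbatim inside an approved name has distance 0.
x <- adist(messy, approved, fixed = FALSE, ignore.case = TRUE)
dimnames(x) <- list(messy, approved)
x

# Simplified assignment: the first approved name with the smallest distance
best <- approved[apply(x, 1, which.min)]

# Group the messy strings by the approved name they were assigned to
groups <- split(messy, factor(best, levels = approved))
groups
```

Printing x before grouping makes it easy to spot rows where two approved names are equally close, which is the case the second adist call in the answer is meant to resolve.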

A base R option using regular expressions would be:

result <- sapply(approved, function(x) grep(gsub('\\s+', '|', x), messy, value = TRUE))
result
#$`Cotswold Water Park Pit 28`
#[1] "Pit 28"      "28"          "CWP Pit 28"  "Cotswold 28"

#$`Cotswold Water Park Pit 14`
#[1] "Pit 28"      "CWP Pit 28"  "Cotswold 28" "14"         

#$`Robinswood Hill`
#[1] "Robinswood"

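The logic here is that we replace each run of whitespace in an approved name with a pipe (|), which turns the name into a regex alternation of its words, and return every element of messy that contains at least one of those words. Because a single shared word is enough, "Pit 28" is also returned for "Cotswold Water Park Pit 14".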
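To see what the pattern looks like for a single approved name, here is a small sketch. The messy vector is an assumption reconstructed from the output above, not something the answer defines.

```r
# One approved name taken from the example output
approved_name <- "Cotswold Water Park Pit 28"

# Replace each run of whitespace with "|" to build an alternation of the words
pattern <- gsub("\\s+", "|", approved_name)
pattern
# [1] "Cotswold|Water|Park|Pit|28"

# Hypothetical messy input, reconstructed from the output shown above
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Keep the elements of messy that contain at least one of the words
matched <- grep(pattern, messy, value = TRUE)
matched
# [1] "Pit 28"      "28"          "CWP Pit 28"  "Cotswold 28"
```

The match is case-sensitive and works on substrings, so a word such as "Pit" would also match inside a longer word.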

To get the output in the same format as shown, we can pad each element to the same length:

sapply(result, `[`, seq_len(max(lengths(result))))

#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "Pit 28"                   "Robinswood"   
#[2,] "28"                       "CWP Pit 28"               NA             
#[3,] "CWP Pit 28"               "Cotswold 28"              NA             
#[4,] "Cotswold 28"              "14"                       NA
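The padding works because indexing a vector past its end yields NA. A minimal sketch with a hypothetical two-element list (the names and values are borrowed from the output above for illustration only):

```r
# Hypothetical list with groups of unequal length
groups <- list(
  `Cotswold Water Park Pit 14` = c("Pit 28", "CWP Pit 28", "14"),
  `Robinswood Hill` = "Robinswood"
)

# Length of the longest group
n <- max(lengths(groups))

# Indexing each group with 1..n pads the shorter ones with NA,
# so sapply can simplify the result to a character matrix
padded <- sapply(groups, `[`, seq_len(n))
padded
```

Once every element has the same length, sapply returns a matrix with one column per list element instead of a list.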
