Matching words from vectors of strings in R
Maybe you are looking for adist:
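# Distance from each messy string (used as a pattern) to each approved name; rows = messy, columns = approved
x <- adist(messy, approved, fixed=FALSE, ignore.case = TRUE)
# Same in the other direction, transposed so that rows = messy, columns = approved
y <- t(adist(approved, messy, fixed=FALSE, ignore.case = TRUE))
# For each messy string, flag the approved name(s) at minimum distance
i <- x == apply(x, 1, min)
# Keep the reverse distances only for those best matches
y[!i] <- NA
colnames(y) <- approved
# Per approved name, collect the messy strings whose remaining distance is the row minimum
i <- apply(y == apply(y, 1, min, na.rm=TRUE), 2, function(i) messy[i & !is.na(i)])
# Pad each group with NA to a common length and bind the groups as columns
do.call(cbind, lapply(i, function(x) x[seq_len(max(lengths(i)))]))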
#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "14"                       "Robinswood"
#[2,] "28"                       NA                         NA
#[3,] "CWP Pit 28"               NA                         NA
#[4,] "Cotswold 28"              NA                         NA
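To make the core idea easier to follow, here is a minimal sketch of nearest-match lookup with adist. It is not the code above: it uses the default full-string distance rather than fixed=FALSE (which treats the first argument as a pattern and matches it partially), and the vectors candidates and targets are hypothetical toy data.

```r
# Hypothetical toy data, unrelated to the vectors in the question.
candidates <- c("kitten", "mitten", "sitting")
targets <- c("kitten", "sitting")

# Generalized Levenshtein distances: rows are candidates, columns are targets.
d <- adist(candidates, targets)
dimnames(d) <- list(candidates, targets)
d

# For each candidate (row), pick the target with the smallest distance.
nearest <- targets[apply(d, 1, which.min)]
nearest
```

Here "mitten" ends up assigned to "kitten", since that is one substitution away. The answer above applies the same row-minimum step, then uses the reverse distances to break ties.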
A base R option using grep would be:
result <- sapply(approved, function(x) grep(gsub('\\s+', '|', x), messy, value = TRUE))
result
#$`Cotswold Water Park Pit 28`
#[1] "Pit 28" "28" "CWP Pit 28" "Cotswold 28"
#$`Cotswold Water Park Pit 14`
#[1] "Pit 28" "CWP Pit 28" "Cotswold 28" "14"
#$`Robinswood Hill`
#[1] "Robinswood"
The logic here is that we insert a pipe (|) symbol at every whitespace in approved, which turns each name into a regular expression of alternatives, and return every entry in messy that matches any of those words.
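The two steps can be looked at separately. In the sketch below, the definitions of approved and messy are an assumption: they are reconstructed from the printed output, since the original vectors are not shown here.

```r
# Assumed inputs, reconstructed from the printed output (not given in the text).
approved <- c("Cotswold Water Park Pit 28",
              "Cotswold Water Park Pit 14",
              "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Step 1: turn each approved name into an alternation pattern.
patterns <- gsub("\\s+", "|", approved)
patterns[1]
# [1] "Cotswold|Water|Park|Pit|28"

# Step 2: keep the messy entries that contain at least one alternative.
grep(patterns[1], messy, value = TRUE)
# [1] "Pit 28"      "28"          "CWP Pit 28"  "Cotswold 28"
```

Note that grep matches substrings, so a shared word such as "Pit" or "Cotswold" is enough for a hit. That is why "Pit 28" also shows up under "Cotswold Water Park Pit 14".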
To get the output in the same matrix format as shown, we can do:
sapply(result, `[`, 1:max(lengths(result)))
#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "Pit 28"                   "Robinswood"
#[2,] "28"                       "CWP Pit 28"               NA
#[3,] "CWP Pit 28"               "Cotswold 28"              NA
#[4,] "Cotswold 28"              "14"                       NA
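The padding works because indexing a vector past its end returns NA. A small sketch with a hypothetical list res (not the result object above) shows the effect. It uses seq_len() instead of 1:n, which is my substitution and gives the same index here.

```r
# Hypothetical list with elements of unequal length.
res <- list(first = c("a", "b", "c"), second = "d")

# Indexing beyond the end of a vector yields NA.
res$second[1:3]
# [1] "d" NA  NA

# Extract positions 1..n from every element; sapply then simplifies
# the equal-length results into a character matrix, one column per element.
n <- max(lengths(res))
m <- sapply(res, `[`, seq_len(n))
m
```

Because every element is stretched to the same length n, sapply can return a matrix rather than a list.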