How to know the operations made to calculate the Levenshtein distance between strings?
With adist()
, you can retrieve the operations:
drop(attr(adist("abc abc","abcde acc", count = TRUE), "counts"))
ins del sub
2 0 1
From ?adist
:
If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively.
This is known as the Needleman–Wunsch algorithm. It calculates both the distance between two strings as well as the so-called traceback, which allows you to reconstruct the alignment.
Since this problem mostly crops up in biology when comparing biological sequences, this algorithm (and related ones) are implemented in the R package {Biostrings}, which is part of Bioconductor.
Since this package implements are more general solution than the simple Levenshtein distance, the usage is unfortunately more complex, and the usage vignette is correspondingly long. But the fundamental usage for your purposes is as follows:
library(Biostrings)
dist_mat = diag(27L)
colnames(dist_mat) = rownames(dist_mat) = c(letters, ' ')
result = pairwiseAlignment(
"abc abc", "abcde acc",
substitutionMatrix = dist_mat,
gapOpening = 1, gapExtension = 1
)
This won’t simply give you the list c('b', 'c', 'c')
, though, because that list does not fully represent what actually happened here. Instead, it will return an alignment between the two strings. This can be represented as a sequence with substitutions and gaps:
score(result)
# [1] 3
aligned(result)
as.matrix(aligned(result))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "a" "b" "c" "-" "-" " " "a" "b" "c"
aligned(result)
— For each character in the second string it provides the corresponding character in the original string, replacing inserted characters by -
. Basically, this is a “recipe” for transforming the first string into the second string. Note that it will only contain insertions and substitutions, not deletions. To get these, you need to perform the alignment the other way round (i.e. swapping the string arguments).