A better similarity ranking algorithm for variable length strings
Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs that works really well for my purposes:
http://www.catalysoft.com/articles/StrikeAMatch.html
Simon has a Java version of the algorithm and below I wrote a PL/Ruby version of it (taken from the plain ruby version done in the related forum entry comment by Mark Wong-VanHaren) so that I can use it in my PostgreSQL queries:
CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '
str1.downcase!
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
|pair| pair.include? " "}
str2.downcase!
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
|pair| pair.include? " "}
union = pairs1.size + pairs2.size
intersection = 0
pairs1.each do |p1|
0.upto(pairs2.size-1) do |i|
if p1 == pairs2[i]
intersection += 1
pairs2.slice!(i)
break
end
end
end
(2.0 * intersection) / union
' LANGUAGE 'plruby';
Works like a charm!
marzagao's answer is great. I converted it to C# so I thought I'd post it here:
Pastebin Link
/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
/// <summary>
/// Compares the two strings based on letter pair matches
/// </summary>
/// <param name="str1"></param>
/// <param name="str2"></param>
/// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
public double CompareStrings(string str1, string str2)
{
List<string> pairs1 = WordLetterPairs(str1.ToUpper());
List<string> pairs2 = WordLetterPairs(str2.ToUpper());
int intersection = 0;
int union = pairs1.Count + pairs2.Count;
for (int i = 0; i < pairs1.Count; i++)
{
for (int j = 0; j < pairs2.Count; j++)
{
if (pairs1[i] == pairs2[j])
{
intersection++;
pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success
break;
}
}
}
return (2.0 * intersection) / union;
}
/// <summary>
/// Gets all letter pairs for each
/// individual word in the string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private List<string> WordLetterPairs(string str)
{
List<string> AllPairs = new List<string>();
// Tokenize the string and put the tokens/words into an array
string[] Words = Regex.Split(str, @"\s");
// For each word
for (int w = 0; w < Words.Length; w++)
{
if (!string.IsNullOrEmpty(Words[w]))
{
// Find the pairs of characters
String[] PairsInWord = LetterPairs(Words[w]);
for (int p = 0; p < PairsInWord.Length; p++)
{
AllPairs.Add(PairsInWord[p]);
}
}
}
return AllPairs;
}
/// <summary>
/// Generates an array containing every
/// two consecutive letters in the input string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private string[] LetterPairs(string str)
{
int numPairs = str.Length - 1;
string[] pairs = new string[numPairs];
for (int i = 0; i < numPairs; i++)
{
pairs[i] = str.Substring(i, 2);
}
return pairs;
}
}
Here is another version of marzagao's answer, this one written in Python:
def get_bigrams(string):
"""
Take a string and return a list of bigrams.
"""
s = string.lower()
return [s[i:i+2] for i in list(range(len(s) - 1))]
def string_similarity(str1, str2):
"""
Perform bigram comparison between two strings
and return a percentage match in decimal form.
"""
pairs1 = get_bigrams(str1)
pairs2 = get_bigrams(str2)
union = len(pairs1) + len(pairs2)
hit_count = 0
for x in pairs1:
for y in pairs2:
if x == y:
hit_count += 1
break
return (2.0 * hit_count) / union
if __name__ == "__main__":
"""
Run a test using the example taken from:
http://www.catalysoft.com/articles/StrikeAMatch.html
"""
w1 = 'Healed'
words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']
for w2 in words:
print('Healed --- ' + w2)
print(string_similarity(w1, w2))
print()