Find sql records containing similar strings

For all the Googlers out there that run into this question, though it's already been marked as answered, I figured I'd share some code to help with this. If you're able to do CLR user-defined functions on your SQL Server, you can implement your own Levensthein Distance algorithm and then from there create a function that gives you a 'similarity score' called dbo.GetSimilarityScore(). I've based my score case-insensitivity, without much weight to jumbled word order and non-alphanumeric characters. You can adjust your scoring algorithm as needed, but this is a good start. Credit to this code project link for getting me started.

Option Explicit On
Option Strict On
Option Compare Binary
Option Infer On

Imports System
Imports System.Collections.Generic
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports System.Text
Imports System.Text.RegularExpressions
Imports Microsoft.SqlServer.Server

Partial Public Class UserDefinedFunctions

    Private Const Xms As RegexOptions = RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline Or RegexOptions.Singleline
    Private Const Xmsi As RegexOptions = Xms Or RegexOptions.IgnoreCase

    ''' <summary>
    ''' Compute the distance between two strings.
    ''' </summary>
    ''' <param name="s1">The first of the two strings.</param>
    ''' <param name="s2">The second of the two strings.</param>
    ''' <returns>The Levenshtein cost.</returns>
    <Microsoft.SqlServer.Server.SqlFunction()> _
    Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32
        If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
        Dim s1 As String = string1.Value
        Dim s2 As String = string2.Value

        Dim n As Integer = s1.Length
        Dim m As Integer = s2.Length
        Dim d As Integer(,) = New Integer(n, m) {}

        ' Step 1
        If n = 0 Then Return m
        If m = 0 Then Return n

        ' Step 2
        For i As Integer = 0 To n
            d(i, 0) = i
        Next

        For j As Integer = 0 To m
            d(0, j) = j
        Next

        ' Step 3
        For i As Integer = 1 To n
            'Step 4
            For j As Integer = 1 To m
                ' Step 5
                Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)

                ' Step 6
                d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
            Next
        Next
        ' Step 7
        Return d(n, m)
    End Function

    ''' <summary>
    ''' Returns a score between 0.0-1.0 indicating how closely two strings match.  1.0 is a 100%
    ''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings.
    ''' </summary>
    <Microsoft.SqlServer.Server.SqlFunction()> _
    Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble
        If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null

        Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)
        Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)
        If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too

        Dim score1 As SqlDouble = InternalGetSimilarityScore(s1, s2)
        If score1.IsNull Then Return SqlDouble.Null

        Dim mod1 As String = GetSimilarityString(s1)
        Dim mod2 As String = GetSimilarityString(s2)
        Dim score2 As SqlDouble = InternalGetSimilarityScore(mod1, mod2)
        If score2.IsNull Then Return SqlDouble.Null

        If score1 = 1.0F AndAlso score2 = 1.0F Then Return 1.0F
        If score1 = 0.0F AndAlso score2 = 0.0F Then Return 0.0F
        ' Return weighted result
        Return (score1 * 0.2F) + (score2 * 0.8F)
    End Function

    Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As SqlDouble
        Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)
        Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)
        If maxLen = 0 Then Return 1.0F
        Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen)
    End Function

    ''' <summary>
    ''' Removes all non-alpha numeric characters and then sorts
    ''' the words in alphabetical order.
    ''' </summary>
    Private Shared Function GetSimilarityString(s1 As String) As String
        Dim normString = Regex.Replace(If(s1, ""), "\W|_", " ", Xms)
        normString = Regex.Replace(normString, "\s+", " ", Xms).Trim()
        Dim words As New List(Of String)(normString.Split(" "c))
        words.Sort()
        Return String.Join(" ", words.ToArray())
    End Function

End Class

If you really want to define similarity in the exact way that you have formulated in your question, then you would - as you say - have to implement the Levensthein Distance calculation. Either in code calculated on each row retrieved by a DataReader or as a SQL Server function.

The problem stated is actually more tricky than it may appear at first sight, because you cannot assume to know what the mutually shared elements between two strings may be.

So in addition to Levensthein Distance you probably also want to specify a minimum number of consecutive characters that actually have to match (in order for sufficient similarity to be concluded).

In sum: It sounds like an overly complicated and time consuming/slow approach.

Interestingly, in SQL Server 2008 you have the DIFFERENCE function which may be used for something like this.

It evaluates the phonetic value of two strings and calculates the difference. I'm unsure if you will get it to work properly for multi-word expressions such as movie titles since it doesn't deal well with spaces or numbers and puts too much emphasis on the beginning of the string, but it is still an interesting predicate to be aware of.

If what you are actually trying to describe is some sort of search feature, then you should look into the Full Text Search capabilities of SQL Server 2008. It provides built-in Thesaurus support, fancy SQL predicates and a ranking mechanism for "best matches"

EDIT: If you are looking to eliminate duplicates maybe you could look into SSIS Fuzzy Lookup and Fuzzy Group Transformation. I have not tried this myself, but it looks like a promising lead.

EDIT2: If you don't want to dig into SSIS and still struggle with the performance of the Levensthein Distance algorithm, you could perhaps try this algorithm which appears to be less complex.


select id, title
from my_table
where 
    title like 'Aliens%' 
    and 
    len(rtrim(title)) < len('Aliens') + 7