Comparing Similar Text Strings in Excel
You might consider using the Microsoft Fuzzy Lookup Addin.
From MS site:
Overview
The Fuzzy Lookup Add-In for Excel was developed by Microsoft Research and performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables. The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and added/missing data. For instance, it might detect that the rows “Mr. Andrew Hill”, “Hill, Andrew R.” and “Andy Hill” all refer to the same underlying entity, returning a similarity score along with each match. While the default configuration works well for a wide variety of textual data, such as product names or customer addresses, the matching may also be customized for specific domains or languages.
I would look into using this list (English section only) to help weed out the common shortenings.
Addition, you might want to consider using a function that will tell you, in exact terms, how "close" two string are. The following code came from here and thanks to smirkingman.
Option Explicit
Public Function Levenshtein(s1 As String, s2 As String)
Dim i As Integer
Dim j As Integer
Dim l1 As Integer
Dim l2 As Integer
Dim d() As Integer
Dim min1 As Integer
Dim min2 As Integer
l1 = Len(s1)
l2 = Len(s2)
ReDim d(l1, l2)
For i = 0 To l1
d(i, 0) = i
Next
For j = 0 To l2
d(0, j) = j
Next
For i = 1 To l1
For j = 1 To l2
If Mid(s1, i, 1) = Mid(s2, j, 1) Then
d(i, j) = d(i - 1, j - 1)
Else
min1 = d(i - 1, j) + 1
min2 = d(i, j - 1) + 1
If min2 < min1 Then
min1 = min2
End If
min2 = d(i - 1, j - 1) + 1
If min2 < min1 Then
min1 = min2
End If
d(i, j) = min1
End If
Next
Next
Levenshtein = d(l1, l2)
End Function
What this will do is tell you how many insertions and deletions one must do to one string to get to the other. I would try to keep this number low (and last names should be exact).
I have a (long) formula that you can use. It's not as well honed as those above – and only works for surname, rather than a full name – but you might find it useful.
So if you have a header row and want to compare A2
with B2
, place this in any other cell on that row (e.g., C2
) and copy down to the end.
=IF(A2=B2,"EXACT",IF(SUBSTITUTE(A2,"-"," ")=SUBSTITUTE(B2,"-"," "),"Hyphen",IF(LEN(A2)>LEN(B2),IF(LEN(A2)>LEN(SUBSTITUTE(A2,B2,"")),"Whole String",IF(MID(A2,1,1)=MID(B2,1,1),1,0)+IF(MID(A2,2,1)=MID(B2,2,1),1,0)+IF(MID(A2,3,1)=MID(B2,3,1),1,0)+IF(MID(A2,LEN(A2),1)=MID(B2,LEN(B2),1),1,0)+IF(MID(A2,LEN(A2)-1,1)=MID(B2,LEN(B2)-1,1),1,0)+IF(MID(A2,LEN(A2)-2,1)=MID(B2,LEN(B2)-2,1),1,0)&"°"),IF(LEN(B2)>LEN(SUBSTITUTE(B2,A2,"")),"Whole String",IF(MID(A2,1,1)=MID(B2,1,1),1,0)+IF(MID(A2,2,1)=MID(B2,2,1),1,0)+IF(MID(A2,3,1)=MID(B2,3,1),1,0)+IF(MID(A2,LEN(A2),1)=MID(B2,LEN(B2),1),1,0)+IF(MID(A2,LEN(A2)-1,1)=MID(B2,LEN(B2)-1,1),1,0)+IF(MID(A2,LEN(A2)-2,1)=MID(B2,LEN(B2)-2,1),1,0)&"°"))))
This will return:
- EXACT – if it's an exact match
- Hyphen – if it's a pair of double-barrelled names but on has a hyphen and the other a space
- Whole string – if all of one surname is part of the other (e.g., if a Smith has become a French-Smith)
After that it will give you a degree from 0° to 6° depending on the number of points of comparison between the two. (i.e., 6° compares better).
As I say a bit rough and ready, but hopefully gets you in roughly the right ball-park.