using REPLACE in WHERE clause to check spelling permutations - MS SQL

I believe the trouble you're having is that SQL-Server's repalce function doesn't accept [^A-Za-z] to mean "non-alpa characters". Instead it's actually looking for that exact string to replace it.

http://msdn.microsoft.com/en-us/library/ms186862%28v=sql.90%29.aspx


In terms of using Regular Expressions, I've only done that by using the CLR, which seems to be getting much too involved for this particular problem.


My advice would be to hold the searchable fields, in the two different formats, in the table itself. And then use a simple LIKE search.

WHERE last_name LIKE @last_name OR last_name_stripped LIKE @last_name

last_name_stripped could then be a computed column (maybe using a function to strip all non_alpha characters), or handled by your client at Insert time.


Depending on how complex your scenario can get, this will be lots of work, and slow too. But there's a more flexible approach. Consider something like this, referred to as initialTable:

| id | lastname | firstname |
|  1 | o'malley | josé      |
|  2 | omállèy  | dònáld    |
|  3 | o'neill  | jámès     |
|  4 | onackers | sharon    |

Maybe a bit much, but it illustrates the general problem. I had to implement a "fuzzy" search for our intranet website based on character data that looked very similar - there's many accents in french or spanish names or street addresses for example.

What I did was define a function that performed all replacements for a given string, for example (pseudocode):

function string replacestuff(string input)
{
  input = replace(input, "è", "e");
  input = replace(input, "é", "e");
  input = replace(input, "ò", "o");
  input = replace(input, "ó", "o");
  input = replace(input, "'", "");
  ...
  return input;
}

Using this conversion function, create a second table fuzzyTable that has the following content:

| id | lastname | firstname |
|  1 | omalley  | jose      |
|  2 | omalley  | donald    |
|  3 | oneill   | james     |
|  4 | onackers | sharon    |

Now, assume you'll get an input string for your search of josè. This can't be found in either table. What you'll have to do is this:

declare @input varchar(50)
declare @input_mod varchar(50)
set @input = 'josè'
set @input_mod = replacestuff(@input)

SELECT id FROM initialTable WHERE firstname like @input OR firstname like @input_mod
UNION
SELECT id FROM fuzzyTable WHERE firstname like @input OR firstname like @input_mod
GROUP BY id

(Of course, you'd have to add % to make LIKE work.) The key here is to modify your input search string using the replacement function; this way you'll get a match if searching for against a content of because both come down to se when being processed by the replacement function.

You could even do a two-level search; first check only the unmodified string against the proper table and then with the statement shown above do a fuzzy search if the user says so.

This is a very flexible approach and can handle all sorts of stuff, like finding german letters ä, ö, ü, ß by using two-letter expressions ae, oe, ue, ss. The disadvantage is that you'll have to keep duplicates of some data, and change those duplicates within fuzzyTable as the initialTable (or the replacement function) changes. In our current use case, the intranet database gets updated once a night, so it's not a problem.

EDIT

You need to be aware that, using this, in some cases you'll get false positives. For example, we're using this for an employee search, and if you've got a dutch name spelled Hoek, you'd also find this name searching for Hök, because in german the replacement for ö would be oe. This could be solved using country-aware replacement functions, but we never took the concept this far. Depending on your input data this is more or less academic, for our use case I can't remember anyone complaining.

The main reason why we came up with this approach in the first place was that some of the data we had to work with was riddled with spelling errors, ie. in french many vowels were accented the wrong way around, but still we needed to deliver a result.


Using:

WHERE ( REPLACE(people.lastname, '[^A-Za-z]', '') LIKE @last_name + '%' )

or

WHERE ( ComplexFunction( field ) LIKE whatever )

will most likely have the result that your query will not use the index (if there is one) of field people.lastname and thus scan the whole table every time you run the query.

I see two ways to avoid this:

One, add another field lastnameStripped to the table, where the ComplexFunction(lastname) is stored and an index to this field. Then you can search with either:

WHERE ( lastnameStripped LIKE REPLACE(@last_name, '[^A-Za-z]', '') + '%' )

or

WHERE ( lastnameStripped LIKE @last_name + '%' )

and both will use the index of lastnameStripped.

Two, create an indexed view with the ComplexFunction( lastname ) as a field.