Open Source Address Correction / Parser with Fuzzy Matching

You can try Gisgraphy. It includes an address parser, a geocoder, and a reverse geocoder. (Don't use the free service for batch jobs; install it on your own server instead.) Full-text search with synonyms and spellchecking can probably help too. High volumes are no problem, because Gisgraphy is exposed as web services in several formats (XML, JSON, PHP, Python, Ruby, YAML, GeoRSS, and Atom), so it can scale.
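For illustration, here's a minimal Python sketch of calling a self-hosted Gisgraphy address parser over HTTP. The host, port, and endpoint parameters are assumptions based on a default local install; check the documentation for your Gisgraphy version before relying on them.

```python
# Minimal sketch of querying a self-hosted Gisgraphy address parser.
# The URL, port, and parameter names below are assumptions for a default
# local install -- verify them against your Gisgraphy version's docs.
# Requires the third-party "requests" package (pip install requests).
import requests

GISGRAPHY_URL = "http://localhost:8080/addressparser/"  # assumed local install

def parse_address(raw_address: str) -> dict:
    """Send a free-form address to Gisgraphy and return the parsed fields."""
    response = requests.get(
        GISGRAPHY_URL,
        params={"address": raw_address, "country": "US", "format": "json"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(parse_address("1600 Pennsylvania Ave NW, Washington DC 20500"))
```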



I have some experience with this. At SmartyStreets (where I work), we make address verification software called US Street Address, which can be used through a bulk upload tool or an API. (It's all web-based; there's nothing to download or install.)

The challenges of validating and standardizing addresses are plentiful, I assure you. It gets even trickier when you attempt to parse the address into particular components yourself, or to implement "fuzzy search." But have no fear: we've unofficially published a basic procedure for performing free-form address validation. While our service isn't open source, we're fairly open about sharing our expertise with the community and setting new standards for quality and performance.

Anyway, I think you'll find that page somewhat helpful. An API such as ours can handle thousands upon thousands of requests per second, since we're geo-distributed across three data centers nationally. The US Street Address service should take care of the "fuzzy matching" for you and return only valid results, filling in the missing pieces and correcting misspellings.
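To give a rough sense of what the "fuzzy matching" step involves under the hood, here's a minimal, standard-library-only sketch in Python. The master list, abbreviation table, and similarity cutoff are invented for illustration; a real verification service matches against the full USPS database, not a hand-built list.

```python
# Illustrative sketch of fuzzy-matching a messy input address against a
# master list of known-good addresses, using only the standard library.
# MASTER_LIST and ABBREVIATIONS are made-up examples, not real reference data.
import difflib

MASTER_LIST = [
    "123 N MAIN ST, SPRINGFIELD IL 62701",
    "125 N MAIN ST, SPRINGFIELD IL 62701",
    "123 S MAIN ST, SPRINGFIELD IL 62703",
]

ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "NORTH": "N", "SOUTH": "S"}

def normalize(address: str) -> str:
    """Uppercase the input and collapse common street-name words."""
    words = address.upper().replace(",", " ,").split()
    words = [ABBREVIATIONS.get(w, w) for w in words]
    return " ".join(words).replace(" ,", ",")

def fuzzy_match(address: str, cutoff: float = 0.8) -> str | None:
    """Return the closest master-list entry, or None if nothing is close."""
    matches = difflib.get_close_matches(
        normalize(address), MASTER_LIST, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

print(fuzzy_match("123 north mane st, springfield il 62701"))
# -> "123 N MAIN ST, SPRINGFIELD IL 62701"
```

The key design point is that the comparison happens against a known-good list: normalization alone can't tell you whether "123 N Main St" exists, but a master list can.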

It takes into account official USPS aliases and even unofficial street or location names, matching them to official, deliverable endpoints. For your own custom names, though, you'll have to maintain the mapping in your own database.
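As a hedged sketch of what that custom mapping might look like, here's a tiny alias table resolved before addresses are handed to a verification step. Every name and address in it is invented for illustration.

```python
# Hypothetical custom alias table kept in your own database, mapping in-house
# location names to real deliverable addresses before verification.
CUSTOM_ALIASES = {
    "WAREHOUSE 4": "2200 INDUSTRIAL PKWY, DAYTON OH 45404",
    "MAIN CAMPUS": "500 COLLEGE AVE, DAYTON OH 45469",
}

def resolve_alias(name: str) -> str:
    """Swap a known custom name for its official address, else pass through."""
    return CUSTOM_ALIASES.get(name.strip().upper(), name)

print(resolve_alias("warehouse 4"))
# -> "2200 INDUSTRIAL PKWY, DAYTON OH 45404"
```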

A final word, too: while open source tools are great and free, you usually trade away some aspect of service, performance, or overall quality. Even if you host the service in-house, you're responsible for maintaining it and meeting the demands of what sounds like, in your case, a heavy workload.

I'll be happy to answer your questions about addresses personally -- I think the task before you is really quite interesting, and it may seem overwhelming without the right resources.


Address standardization (AKA address correction, address normalization, address parsing) is not a simple task. If you have swift fingers and ample creativity, you can concoct a very fine regex that does a remarkably good job. However, it doesn't handle the ambiguous edge cases well. The reason is a lack of context: you have to know what the correct result looks like in order to know that you've achieved the accuracy you need.

Certainly, taking a list of 100k addresses and parsing 70% of them accurately (using only regex) is better than not parsing any of them. But how long does it take to parse the remaining "hard" addresses? A LONG TIME. They require a large number of specialized parsing functions, because the context, the "right answer," is unknown. This is where address verification comes in handy: the "context" is known, the fully standardized and corrected address is known, and that master list can be used to check your results.
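To make the point concrete, here's a deliberately naive regex parser in Python. The pattern is an invented illustration, not a complete USPS grammar: it handles the common "number + street + suffix" shape and fails exactly on the context-dependent cases described above.

```python
# A deliberately simple regex parser of the kind described above. It covers
# the common "number + street + suffix" pattern and fails on ambiguous edge
# cases, because a regex has no context to resolve them. The suffix list and
# directional class are illustrative, not exhaustive.
import re

ADDRESS_RE = re.compile(
    r"""^\s*
    (?P<number>\d+)\s+                       # house number
    (?P<predir>[NSEW]{1,2}\s+)?              # optional directional (also
                                             # over-matches, e.g. "NS")
    (?P<street>.+?)\s+                       # street name (non-greedy)
    (?P<suffix>ST|AVE|BLVD|RD|DR|LN|CT)\.?   # common suffixes only
    \s*$""",
    re.IGNORECASE | re.VERBOSE,
)

def parse(line: str) -> dict | None:
    m = ADDRESS_RE.match(line)
    return m.groupdict() if m else None

print(parse("123 N Main St"))      # parses cleanly
print(parse("One Infinite Loop"))  # -> None: spelled-out number, no suffix
print(parse("123 Avenue B"))       # -> None: suffix-first street names break it
```

Every failing case needs another special rule, and the rules start to conflict; with a master list to verify against, you can instead accept fuzzy input and let the comparison supply the missing context.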

I get asked this a lot, since I work with address verification at SmartyStreets.