General Address Parser for Freeform Text

Simson Garfinkel wrote one for his nifty address book for NeXTSTEP (which was later recompiled and updated for Mac OS X and submitted to an Apple Design contest). It has since been open-sourced and is available from his website:

http://simson.net/ref/sbook5/


This is essentially an instance of the Named Entity Recognition (NER) problem; see the Wikipedia article on NER.

The best way to approach this is to parse the address with a language transducer that identifies the various constructs. The approach is similar to using regular expressions with a finite state machine.

I've had great success with GATE, a Java NLP and machine learning framework; its transducer library is called JAPE. Check out its GUI, and use that to write some Java code for it!

The built-in examples should get you started with the basics, and you can extend them as needed. Essentially, the rule engine applies your rules to split the text into components, so something like,

Xyz, Blah St,
Foo City, 11110, CA

would be translated to,

Place: Xyz
Street: Blah St
City: Foo
...

And then you can use your database of locations to do matches.
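
To make the translation above concrete, here is a minimal sketch in plain Python. It is not JAPE or GATE code: it uses one regular expression as a stand-in for a rule, and it assumes the toy layout from the example ("Place, Street," on one line, then "City, ZIP, STATE"). The names ADDRESS_RE and parse_address are hypothetical, and real addresses need many more rules than this.

```python
import re

# Hypothetical pattern for the toy layout only:
#   "Place, Street,\nCity, ZIP, STATE"
# A trailing word "City" is dropped, matching the example output above.
ADDRESS_RE = re.compile(
    r"(?P<place>[^,\n]+),\s*"
    r"(?P<street>[^,\n]+),\s*"
    r"(?P<city>[^,\n]+?)(?:\s+City)?,\s*"
    r"(?P<zip>\d{5}),\s*"
    r"(?P<state>[A-Z]{2})"
)


def parse_address(text):
    """Return a dict of address components, or None if the rule does not match."""
    match = ADDRESS_RE.search(text)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


if __name__ == "__main__":
    print(parse_address("Xyz, Blah St,\nFoo City, 11110, CA"))
```

The resulting dictionary (place, street, city, zip, state) is what you would then look up in your own location database.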

JAPE also supports dictionary lookups in addition to rules, so if you already have "Blah St" in your database and it has two parent cities, Foo and Bar, you can disambiguate by parsing the next line.
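
The disambiguation idea can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not how GATE implements it: STREET_TO_CITIES is a hypothetical in-memory stand-in for your database, and disambiguate_city is a made-up helper name.

```python
import re

# Hypothetical dictionary: street name -> cities known to contain that street.
STREET_TO_CITIES = {"Blah St": {"Foo", "Bar"}}


def disambiguate_city(street, next_line):
    """Pick the one candidate city that appears as a whole word in the next line.

    Returns None when the street is unknown, or when zero or several
    candidate cities appear, since the lookup alone cannot decide then.
    """
    candidates = STREET_TO_CITIES.get(street, set())
    found = [
        city
        for city in sorted(candidates)
        if re.search(rf"\b{re.escape(city)}\b", next_line)
    ]
    return found[0] if len(found) == 1 else None


if __name__ == "__main__":
    print(disambiguate_city("Blah St", "Foo City, 11110, CA"))
```

Returning None for the ambiguous cases keeps the decision with the caller, who can fall back to other evidence such as the ZIP code.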

Edit: GATE includes a tool called ANNIE, an information extraction system that you can experiment with to identify addresses. It uses some built-in JAPE rules that you can build upon.


Incidentally, have you seen the new API endpoint that SmartyStreets is experimenting with? It extracts addresses from text, validates them, and converts them into components.

Refer to this other Stack Overflow post, which goes into more detail. I work at SmartyStreets and helped to develop it, so I can tell you that this is a very hard problem, even if it seems simple on the surface.