Help needed using REGEXP for address string parsing
I don't think regular expressions will help you here, because they're designed for pattern matching rather than semantic interpretation, so your string.split() function will probably do just as well.
But without a database to compare each token against, it'd be pretty hard to determine what level a token represents. If, for instance, the right-most token is Zealand, it could be either a country or a province, depending on the next token to the left.
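To make the ambiguity concrete, here is a minimal Python sketch. The KNOWN_PLACES table and the classify_rightmost function are hypothetical stand-ins for a real gazetteer database, and the sample addresses are made up; the point is only that the right-most token can't be classified without also looking at the token to its left.

```python
# Hypothetical lookup table standing in for a real gazetteer database.
KNOWN_PLACES = {
    ("new", "zealand"): "country",
    ("zealand",): "province",
}


def classify_rightmost(address):
    """Return (matched name, level) for the right-most place name, or None."""
    # Treat commas as whitespace, then split into lower-case tokens.
    tokens = address.replace(",", " ").lower().split()
    # Try the longest match first: the last two tokens, then the last one.
    for size in (2, 1):
        key = tuple(tokens[-size:])
        if len(tokens) >= size and key in KNOWN_PLACES:
            return " ".join(key), KNOWN_PLACES[key]
    return None
```

With this table, "Auckland, New Zealand" resolves to a country and "Somewhere, Zealand" to a province, but only because both entries were listed up front; a regular expression alone has no way of knowing that.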
I know you said you've not run it through a geocoder yet, but I would say that is the best way to go. Using Google's Geocoding API, you can pass it addresses formatted in all sorts of ways, and it'll do a good job of returning a properly formatted version of each address.
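As a rough sketch of what that looks like in Python, assuming you have an API key and your usage is within Google's terms: the helper names below (build_geocode_url, formatted_address, geocode) are my own, while the endpoint, the address and key parameters, and the status, results and formatted_address response fields are those of the Geocoding web service.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for a free-form address string."""
    return GEOCODE_URL + "?" + urlencode({"address": address, "key": api_key})


def formatted_address(response):
    """Pull the first formatted address out of a decoded JSON response."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    return response["results"][0]["formatted_address"]


def geocode(address, api_key):
    """Send the request and return the cleaned-up address, or None."""
    with urlopen(build_geocode_url(address, api_key)) as reply:
        return formatted_address(json.load(reply))
```

Calling geocode("your messy address", "YOUR_API_KEY") returns Google's best match as a single formatted string, or None if nothing was found. Keep in mind that this is a best guess, not a guarantee that the address exists.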
In light of my comment on MerseyViking's answer, I thought I'd elaborate just for clarity and completeness.
I used to work in the address parsing/verification industry for SmartyStreets. What you're trying to do, I think, is called "single-line address processing" (we call it SLAP). It's a complicated task, though, because addresses inevitably vary a great deal depending on user input, the type of address, and whether the address is complete or correct.
There are too many factors for a regular expression to solve, as MerseyViking implied. Google's geocoder is otherwise a pretty good solution, but using it this way could potentially break Google's Terms of Service, and Google does address approximation, not address validation, so the address you get back may still be invalid. I suggest using a CASS-certified service (for US-based addresses) instead. Since it looks like you may be working with Australian addresses, see if there's a "land-down-under" alternative. SmartyStreets offers international address services that could help you.
We're developing a fine-tuned algorithm for doing SLAP accurately, and while we tweak it, finish it, and implement it, you might be interested in the rough idea of the algorithm. It is described here in some detail.