How to parse freeform street/postal address out of text, and into components

UPDATE: Geocode.xyz now works worldwide. For examples see https://geocode.xyz

For USA, Mexico and Canada, see geocoder.ca.

For example:

Input: something going on near the intersection of main and arthur kill rd new york

Output:

<geodata>
  <latt>40.5123510000</latt>
  <longt>-74.2500500000</longt>
  <AreaCode>347,718</AreaCode>
  <TimeZone>America/New_York</TimeZone>
  <standard>
    <street1>main</street1>
    <street2>arthur kill</street2>
    <stnumber/>
    <staddress/>
    <city>STATEN ISLAND</city>
    <prov>NY</prov>
    <postal>11385</postal>
    <confidence>0.9</confidence>
  </standard>
</geodata>

You may also check the results in the web interface or get output as Json or Jsonp. eg. I'm looking for restaurants around 123 Main Street, New York


There are many street address parsers. They come in two basic flavors - ones that have databases of place names and street names, and ones that don't.

A regular expression street address parser can get up to about a 95% success rate without much trouble. Then you start hitting the unusual cases. The Perl one in CPAN, "Geo::StreetAddress::US", is about that good. There are Python and Javascript ports of that, all open source. I have an improved version in Python which moves the success rate up slightly by handling more cases. To get the last 3% right, though, you need databases to help with disambiguation.

A database with 3-digit ZIP codes and US state names and abbreviations is a big help. When a parser sees a consistent postal code and state name, it can start to lock on to the format. This works very well for the US and UK.

Proper street address parsing starts from the end and works backwards. That's how the USPS systems do it. Addresses are least ambiguous at the end, where country names, city names, and postal codes are relatively easy to recognize. Street names can usually be isolated. Locations on streets are the most complex to parse; there you encounter things such as "Fifth Floor" and "Staples Pavillion". That's when a database is a big help.


I saw this question a lot when I worked for an address verification company. I'm posting the answer here to make it more accessible to programmers searching around with the same question. The company I was at processed billions of addresses, and we learned a lot in the process.

First, we need to understand a few things about addresses.

Addresses are not regular

This means that regular expressions are out. I've seen it all, from simple regular expressions that match addresses in a very specific format, to this:

/\s+(\d{2,5}\s+)(?![a|p]m\b)(([a-zA-Z|\s+]{1,5}){1,2})?([\s|,|.]+)?(([a-zA-Z|\s+]{1,30}){1,4})(court|ct|street|st|drive|dr|lane|ln|road|rd|blvd)([\s|,|.|;]+)?(([a-zA-Z|\s+]{1,30}){1,2})([\s|,|.]+)?\b(AK|AL|AR|AZ|CA|CO|CT|DC|DE|FL|GA|GU|HI|IA|ID|IL|IN|KS|KY|LA|MA|MD|ME|MI|MN|MO|MS|MT|NC|ND|NE|NH|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VA|VI|VT|WA|WI|WV|WY)([\s|,|.]+)?(\s+\d{5})?([\s|,|.]+)/i

... to this where a 900+ line-class file generates a supermassive regular expression on the fly to match even more. I don't recommend these (for example, here's a fiddle of the above regex, that makes plenty of mistakes). There isn't an easy magic formula to get this to work. In theory and by theory, it's impossible to match addresses with a regular expression.

USPS Publication 28 documents the many formats of addresses that are possible, with all their keywords and variations. Worst of all, addresses are often ambiguous. Words can mean more than one thing ("St" can be "Saint" or "Street"), and there are words that I'm pretty sure they invented. (Who knew that "Stravenue" was a street suffix?)

You'd need some code that really understands addresses, and if that code does exist, it's a trade secret. But you could probably roll your own if you're really into that.

Addresses come in unexpected shapes and sizes

Here are some contrived (but complete) addresses:

1)  102 main street
    Anytown, state

2)  400n 600e #2, 52173

3)  p.o. #104 60203

Even these are possibly valid:

4)  829 LKSDFJlkjsdflkjsdljf Bkpw 12345

5)  205 1105 14 90210

Obviously, these are not standardized. Punctuation and line break are not guaranteed. Here's what's going on:

  1. Number 1 is complete because it contains a street address and a city and state. With that information, there's enough to identify the address, and it can be considered "deliverable" (with some standardization).

  2. Number 2 is complete because it contains a street address (with secondary/unit number) and a 5-digit ZIP code, which is enough to identify an address.

  3. Number 3 is a complete post office box format, as it contains a ZIP code.

  4. Number 4 is also complete because the ZIP code is unique, meaning that a private entity or corporation has purchased that address space. A unique ZIP code is for high-volume or concentrated delivery spaces. Anything addressed to ZIP code 12345 goes to General Electric in Schenectady, NY. This example won't reach anyone in particular, but the USPS would still deliver it.

  5. Number 5 is also complete, believe it or not. With just those numbers, the full address can be discovered when parsed against a database of all possible addresses. Filling in the missing directionals, secondary designator, and ZIP+4 code is trivial when you see each number as a component. Here's what it looks like, fully expanded and standardized:

205 N 1105 W Apt 14

Beverly Hills CA 90210-5221

Address data is not your own

In most countries that provide official address data to licensed vendors, the address data itself belongs to the governing agency. In the US, the USPS owns the addresses. The same is true for Canada Post, Royal Mail, and others, though each country enforces or defines ownership a little differently. Knowing this is important since it usually forbids reverse-engineering the address database. You have to be careful how to acquire, store, and use the data.

Google Maps is a common go-to for quick address fixes, but the TOS is rather prohibitive; for example, you can't use their data or APIs without showing a Google Map, and for non-commercial purposes only (unless you pay), and you can't store the data (except for temporary caching). Makes sense. Google's data is some of the best in the world. However, Google Maps does not verify the address. If an address does not exist, it will still show you where the address would be if it did exist (try it on your own street; use a house number that you know doesn't exist). This is useful sometimes, but be aware of that.

Nominatim's usage policy is similarly limiting, especially for high volume and commercial use, and the data is mostly drawn from free sources, so it isn't as well maintained (such as the nature of open projects). However, this may still suit your needs. A great community supports it.

The USPS itself has an API, but it goes down a lot and comes with no guarantees nor support. It might also be hard to use. Some people use it sparingly with no problems. But it's easy to miss that the USPS requires that you use their API only for confirming addresses to ship through them.

People expect addresses to be hard

Unfortunately, we've conditioned our society to expect addresses to be complicated. There are dozens of good UX articles all over the Internet about this. Still, the fact is, if you have an address form with individual fields, that's what users expect, even though it makes it harder for edge-case addresses that don't fit the format the form is expecting, or maybe the form requires a field it shouldn't. Or users don't know where to put a certain part of their address.

I could go on and on about the bad UX of checkout forms these days, but instead, I'll say that combining the addresses into a single field will be a welcome change -- people will be able to type their address how they see fit, rather than trying to figure out your lengthy form. However, this change will be unexpected and users may find it a little jarring at first. Just be aware of that.

Part of this pain can be alleviated by putting the country field out front, before the address. When they fill out the country field first, you know how to make your form appear. Maybe you have a good way to deal with single-field US addresses, so if they select the United States, you can reduce your form to a single field, otherwise show the component fields. Just things to think about!

Now we know why it's hard; what can you do about it?

The USPS licenses vendors through a process called CASS™ Certification to provide verified addresses to customers. These vendors have access to the USPS database, updated monthly. Their software must conform to rigorous standards to be certified, and they don't often require agreement to such limiting terms as discussed above.

Many CASS-Certified companies can process lists or have APIs: Melissa Data, Experian QAS, and SmartyStreets, to name a few.

(Due to getting flak for "advertising," I've truncated my answer at this point. It's up to you to find a solution that works for you.)

The Truth: Really, folks, I don't work at any of these companies. It's not an advertisement.


libpostal: an open-source library to parse addresses, training with data from OpenStreetMap, OpenAddresses and OpenCage.

https://github.com/openvenues/libpostal (more info about it)

Other tools/services:

  • http://www.gisgraphy.com Free, open source, and ready to use geocoder and geolocalisation webservices, integrating OpenStreetMap, GeoNames and Quattroshapes.

  • https://github.com/kodapan/osm-common Library for accessing OpenStreetMap services, parsing and processing data.

  • http://wiki.openstreetmap.org/wiki/Nominatim

  • http://address-parser.net/

  • http://geoservices.tamu.edu/Services/AddressNormalization/