How can I extract an address from raw text using NLTK in Python?
Definitely regular expressions :)
Something like:
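import re
# txt holds the raw text to search
txt = ...
# Raw string, so backslashes added to the pattern later are not read as string escapes
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']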
Explanation:
[0-9]{1,3} : 1 to 3 digits, the address number
(space) : a space between the number and the street name
.+ : the street name, any character (except a newline) one or more times
, (comma, space) : a comma and a space before the city
.+ : the city, any character (except a newline) one or more times
, (comma, space) : a comma and a space before the state
[A-Z]{2} : exactly 2 uppercase characters from A to Z, the state
(space) : a space between the state and the ZIP code
[0-9]{5} : exactly 5 digits, the ZIP code
re.findall(expr, string) returns a list of all non-overlapping matches found in the string.
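To see how the pattern behaves in practice, here is a small self-contained sketch. The sample sentences and the second address are made up for illustration, and the TIGHT pattern is my own hypothetical variant, not part of the answer above. Because .+ is greedy, the original pattern can swallow two addresses on the same line into a single match; replacing .+ with [^,]+ is one possible way around that, assuming street and city names contain no commas.

```python
import re

# Pattern from the answer above, written as a raw string.
GREEDY = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"

# Hypothetical tighter variant (an assumption, not from the original answer):
# [^,]+ stops each field at the next comma, and \b anchors the match
# on word boundaries.
TIGHT = r"\b[0-9]{1,3} [^,]+, [^,]+, [A-Z]{2} [0-9]{5}\b"

# Made-up sample texts.
one = "Send it to 44 West 22nd Street, New York, NY 12345 by Friday."
two = ("Offices: 44 West 22nd Street, New York, NY 12345 and "
       "1 Main Street, Springfield, IL 62701.")

single = re.findall(GREEDY, one)   # one address, found as expected
merged = re.findall(GREEDY, two)   # greedy .+ glues both addresses together
split = re.findall(TIGHT, two)     # each address matched separately

print(single)       # ['44 West 22nd Street, New York, NY 12345']
print(len(merged))  # 1
print(split)        # ['44 West 22nd Street, New York, NY 12345', '1 Main Street, Springfield, IL 62701']
```

Either pattern only covers US-style "number street, city, ST 12345" addresses with a 1 to 3 digit house number, so treat it as a starting point rather than a general solution.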
Check out libpostal, a library dedicated to parsing and normalizing street addresses.
It cannot extract addresses from raw text, but it may help in related tasks.
pyap works well, not only for this particular example but also for other addresses embedded in text.
import pyap
text = ...
# Returns a list of the addresses detected in text
addresses = pyap.parse(text, country='US')