Combining a Tokenizer into a Grammar and Parser with NLTK
You could run a POS tagger over your text and then adapt your grammar to work on POS tags instead of words.
>>> import nltk
>>> text = nltk.word_tokenize("A car has a door")
>>> text
['A', 'car', 'has', 'a', 'door']
>>> tagged_text = nltk.pos_tag(text)
>>> tagged_text
[('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('a', 'DT'), ('door', 'NN')]
>>> pos_tags = [pos for (token, pos) in tagged_text]
>>> pos_tags
['DT', 'NN', 'VBZ', 'DT', 'NN']
>>> # Note: the Penn Treebank tagset uses 'IN' for prepositions, not 'PP'
>>> simple_grammar = nltk.CFG.fromstring("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP
... VP -> V NP | VP PP
... Det -> 'DT'
... N -> 'NN'
... V -> 'VBZ'
... P -> 'IN'
... """)
>>> parser = nltk.ChartParser(simple_grammar)
>>> for tree in parser.parse(pos_tags):  # parse() returns an iterator of trees
...     print(tree)
(S (NP (Det DT) (N NN)) (VP (V VBZ) (NP (Det DT) (N NN))))
I know this is a year later, but I wanted to add some thoughts.
I take a lot of different sentences and tag them with parts of speech for a project I'm working on. From there I was doing as StompChicken suggested: pulling the tags from the (word, tag) tuples and using those tags as the "terminals" (the bottom nodes of the tree as we create a completely tagged sentence).
Ultimately this doesn't suit my desire to mark head nouns in noun phrases, since I can't pull the head-noun word into the grammar; the grammar only has the tags.
So what I did instead was use the set of (word, tag) tuples to create a dictionary of tags, with all the words carrying a given tag as that tag's values. Then I print this dictionary to the screen/grammar.cfg (context-free grammar) file.
The form I print it in works perfectly for setting up a parser by loading a grammar file (parser = nltk.load_parser('grammar.cfg')). One of the lines it generates looks like this:
VBG -> "fencing" | "bonging" | "amounting" | "living" ... over 30 more words...
So now my grammar has the actual words as terminals and assigns the same tags that nltk.pos_tag does.
Hope this helps anyone else wanting to automate tagging a large corpus and still have the actual words as terminals in their grammar.
import nltk
from collections import defaultdict

tag_dict = defaultdict(list)

...

# (Looping through sentences)

# Tag the tokens of the current sentence
tagged_sent = nltk.pos_tag(tokens)

# Record each word under its tag, skipping duplicates;
# defaultdict creates the empty list on first access
for word, tag in tagged_sent:
    if word not in tag_dict[tag]:
        tag_dict[tag].append(word)

# Print one lexical production per tag, e.g. NN -> "car" | "door"
for tag, words in tag_dict.items():
    print(tag, '->', ' | '.join('"%s"' % word for word in words))
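To round this out, here is a minimal sketch of writing the dictionary to grammar.cfg and loading a parser from it. The three syntactic rules at the top and the 'file:' URL form are my assumptions for illustration, not part of the original setup:

import nltk

# A minimal sketch, assuming tag_dict was built as above.
# The syntactic rules here are hypothetical; tags containing
# punctuation (e.g. PRP$) would need renaming to be valid
# non-terminal names in an NLTK grammar file.
with open('grammar.cfg', 'w') as f:
    f.write('S -> NP VP\n')
    f.write('NP -> DT NN\n')
    f.write('VP -> VBZ NP\n')
    for tag, words in tag_dict.items():
        f.write('%s -> %s\n' % (tag, ' | '.join('"%s"' % w for w in words)))

# load_parser resolves the path via nltk.data.load; a 'file:' URL
# points it at the local file and returns a chart parser for the CFG
parser = nltk.load_parser('file:grammar.cfg')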
Parsing is a tricky problem, and a lot of things can go wrong!
You want (at least) three components here: a tokenizer, a tagger, and finally the parser.
First you need to tokenize the running text into a list of tokens. This can be as easy as splitting the input string around whitespace, but if you are parsing more general text you will also need to handle numbers and punctuation, which is non-trivial. For instance, sentence-ending periods are often not regarded as part of the word they are attached to, but periods marking an abbreviation often are.
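As a quick illustration with NLTK's default word_tokenize (exact output can vary across NLTK versions and tokenizer models), a sentence-final period is split off while an abbreviation keeps its period:

>>> import nltk
>>> nltk.word_tokenize("Dr. Smith bought 2 cars.")
['Dr.', 'Smith', 'bought', '2', 'cars', '.']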
Once you have a list of input tokens, you can use a tagger to try to determine the POS of each word and use it to disambiguate the input tag sequence. This has two main advantages. First, it speeds up parsing, since the parser no longer has to consider the alternative hypotheses licensed by ambiguous words; the POS tagger has already done that. Second, it improves unknown-word handling: words not in your grammar still receive a tag (hopefully the right one). Combining a parser and a tagger in this way is commonplace.
The POS tags will then make up the pre-terminals in your grammar. The pre-terminals are the left-hand sides of productions whose right-hand side is a single terminal; in N -> "house" and V -> "jump", N and V are pre-terminals. It is fairly common to split a grammar into syntactic productions, with only non-terminals on both sides, and lexical productions, where one non-terminal goes to one terminal. This makes linguistic sense most of the time, and most CFG parsers require the grammar to be in this form. In any case, any CFG can be represented this way by creating "dummy productions" for terminals that appear on right-hand sides alongside non-terminals.
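For example, here is a toy grammar in exactly this form, with the syntactic productions on top and the lexical productions below (the particular words are just illustrative):

import nltk

# Syntactic productions: only non-terminals on both sides.
# Lexical productions: one pre-terminal (Det, N, V) to one terminal.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'a' | 'the'
N -> 'house' | 'car'
V -> 'has' | 'jumps'
""")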
It may be necessary to have some sort of mapping between POS tags and pre-terminals if you want finer (or coarser) tag distinctions in your grammar than what your tagger outputs. You can then initialize the chart with the results from the tagger, i.e. passive items of the appropriate category spanning each input token. Sadly I do not know NLTK, but I'm sure there is a simple way to do this. When the chart is seeded, parsing can continue as normal, and any parse trees can be extracted (including the words) in the regular fashion.
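In NLTK, the closest simple approximation to seeding the chart is to map the tagger's output onto the grammar's pre-terminal alphabet before parsing. The mapping below is hypothetical and has to be adapted to whatever pre-terminal names your grammar actually uses:

import nltk

# Hypothetical mapping from fine-grained Penn Treebank tags to the
# coarser pre-terminals assumed by the grammar (i.e. a grammar with
# lexical rules like N -> 'N', V -> 'V', Det -> 'Det', P -> 'P')
tag_map = {'NN': 'N', 'NNS': 'N', 'NNP': 'N',
           'VB': 'V', 'VBZ': 'V', 'VBD': 'V',
           'DT': 'Det', 'IN': 'P'}

tagged = nltk.pos_tag(nltk.word_tokenize("A car has a door"))
coarse_tags = [tag_map.get(tag, tag) for (word, tag) in tagged]
# coarse_tags is now ['Det', 'N', 'V', 'Det', 'N']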
However, in most practical applications you will find that the parser can return several different analyses, since natural language is highly ambiguous. I don't know what kind of text corpus you are trying to parse, but if it's anything like natural language you will probably have to construct some sort of parse-selection model. This requires a treebank: a collection of parse trees, ranging in size from a couple of hundred to several thousand parses, depending on your grammar and on how accurate the results need to be. Given this treebank one can automagically infer a corresponding PCFG, which can then be used as a simple model for ranking the parse trees.
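NLTK supports this induction step directly: induce_pcfg counts the productions in a set of parsed sentences and turns their relative frequencies into rule probabilities, and ViterbiParser uses those probabilities to return the most probable parse. A minimal sketch using the Penn Treebank sample bundled with NLTK (the sample size and start symbol are choices, not requirements):

import nltk
from nltk.corpus import treebank

# Collect productions from a sample of hand-annotated parse trees
productions = []
for tree in treebank.parsed_sents()[:500]:
    productions += tree.productions()

# Induce a PCFG; each rule's probability is the relative frequency
# of that production among all productions with the same left-hand side
grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)

# ViterbiParser returns the single most probable parse under the PCFG.
# It fails on words never seen in the sample, so real use needs some
# unknown-word handling (e.g. parsing over POS tags, as above).
parser = nltk.ViterbiParser(grammar)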
All of this is a lot of work to do yourself. What are you using the parse results for? Have you looked at other resources in NLTK, or at other packages such as the StanfordParser or the BerkeleyParser?