Python and NLTK: How to analyze sentence grammar?
Let's do some reverse engineering:
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
Seems like the rules can't recognize even the first work as NP. So let's try injecting NP -> N
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP | N
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('NP', [Tree('N', ['Kim'])])]
So now it's working, let's continue Kim arrived or Dana and
:
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | VP PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>>
>>> sent = "Kim arrived or".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
Seem like there is no way to get the VP
with or without the P
, since V
requires either an NP
after, or it has to go up the tree to be a VP
before taking a P
, so it's relax the rules and say VP -> V PP
instead of VP -> VP PP
:
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('S', [Tree('NP', [Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['arrived']), Tree('PP', [Tree('P', ['or']), Tree('NP', [Tree('N', ['Dana'])])])])])]
Okay, we are getting closer, but seems like the next word broke the cfg rules again:
>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana left".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> sent = "Kim arrived or Dana left and".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>>
>>> sent = "Kim arrived or Dana left and everyone".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>>
>>> sent = "Kim arrived or Dana left and everyone cheered".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
So I hope the above example shows you that trying to change the rules to incorporate language phenomenon from left to right is hard.
Instead of doing it from left to right, and achieve
[[[[[[[[Kim] arrived] or] Dana] left] and] everyone] cheered]
why don't you try to make more linguistically sound rules to achieve:
[[[Kim arrived] or [Dana left]] and [everyone cheered]]
[[Kim arrived] or [[Dana left] and [everyone cheered]]]
Try this instead:
import nltk
grammar = nltk.parse_cfg("""
S -> CP | VP
CP -> VP C VP | CP C VP | VP C CP
VP -> NP V
NP -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
C -> 'or' | 'and'
""")
print "======= Kim arrived ========="
sent = "Kim arrived".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
print t
print "\n======= Kim arrived or Dana left ========="
sent = "Kim arrived or Dana left".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
print t
print "\n=== Kim arrived or Dana left and everyone cheered ===="
sent = "Kim arrived or Dana left and everyone cheered".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
print t
[out]
:
======= Kim arrived =========
(S (VP (NP Kim) (V arrived)))
======= Kim arrived or Dana left =========
(S (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left))))
=== Kim arrived or Dana left and everyone cheered ====
(S
(CP
(CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left)))
(C and)
(VP (NP everyone) (V cheered))))
(S
(CP
(VP (NP Kim) (V arrived))
(C or)
(CP
(VP (NP Dana) (V left))
(C and)
(VP (NP everyone) (V cheered)))))
The above solution show how your CFG rules needs to be robust enough to not only capture the full sentence but also part of the sentence too.
You don't have a Det
defined in your grammar, but each NP
(and consequently S
) has to have one by grammar definition.
Compare with
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... NP -> Det N | Det N PP
... VP -> V NP | VP PP
... Det -> 'a' | 'the'
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... """)
>>>
>>> parser = nltk.ChartParser(grammar)
>>> parser.nbest_parse('the Kim left a Dana'.split())
[Tree('S', [Tree('NP', [Tree('Det', ['the']), Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['left']), Tree('NP', [Tree('Det', ['a']), Tree('N', ['Dana'])])])])]