Extract verb phrases using Spacy
The above answer references textacy; this is all achievable with spaCy directly using the Matcher, with no need for the wrapper library.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm') # download model first
sentence = 'The author was staring pensively as she wrote'
pattern=[{'POS': 'VERB', 'OP': '?'},
{'POS': 'ADV', 'OP': '*'},
{'OP': '*'}, # additional wildcard - match any text in between
{'POS': 'VERB', 'OP': '+'}]
# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)
# Add pattern to matcher (spaCy v3 signature; in v2 use matcher.add("verb-phrases", None, pattern))
matcher.add("verb-phrases", [pattern])
doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)
N.b. this returns a list of tuples containing the match ID and the start, end index for each match, e.g.:
[(15658055046270554203, 0, 4),
(15658055046270554203, 1, 4),
(15658055046270554203, 2, 4),
(15658055046270554203, 3, 4),
(15658055046270554203, 0, 8),
(15658055046270554203, 1, 8),
(15658055046270554203, 2, 8),
(15658055046270554203, 3, 8),
(15658055046270554203, 4, 8),
(15658055046270554203, 5, 8),
(15658055046270554203, 6, 8),
(15658055046270554203, 7, 8)]
You can turn these matches into spans using the indexes.
spans = [doc[start:end] for _, start, end in matches]
# output
"""
The author was staring
author was staring
was staring
staring
The author was staring pensively as she wrote
author was staring pensively as she wrote
was staring pensively as she wrote
staring pensively as she wrote
pensively as she wrote
as she wrote
she wrote
wrote
"""
Note that I added the additional {'OP': '*'} token to the pattern. When it is not given a specific POS/DEP, it serves as a wildcard (i.e. it will match any token). This is useful here as the question is about verb phrases: the format VERB, ADV, VERB is an unusual structure (try to think of some example sentences), whereas VERB, ADV, [other text], VERB is likely (as in the example sentence 'The author was staring pensively as she wrote'). Optionally, you can refine the pattern to be more specific (displacy is your friend here).
Further note: all permutations of the match are returned due to the greediness of the matcher. You can optionally reduce this to just the longest form using filter_spans to remove duplicates or overlaps.
from spacy.util import filter_spans
filter_spans(spans)
# output
[The author was staring pensively as she wrote]
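To see why only one span survives, here is a minimal pure-Python sketch of the longest-first filtering idea, working on plain (start, end) tuples rather than spaCy Span objects. The helper name filter_longest is hypothetical, and this only approximates the documented behaviour (prefer the longest span, and the earlier one on ties); it is not spaCy's actual code.

```python
def filter_longest(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive."""
    # Longest spans first; among equal lengths, the one that starts earlier.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen = set()  # token indexes already covered by a kept span
    for start, end in ordered:
        # Skip the span if its first or last token is already covered.
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


# The (start, end) pairs from the match list shown earlier in this answer.
match_spans = [(0, 4), (1, 4), (2, 4), (3, 4),
               (0, 8), (1, 8), (2, 8), (3, 8),
               (4, 8), (5, 8), (6, 8), (7, 8)]

print(filter_longest(match_spans))               # [(0, 8)]
print(filter_longest([(3, 4), (7, 8), (2, 4)]))  # [(2, 4), (7, 8)]
```

Because the wildcard lets a single match cover the whole sentence, every shorter match overlaps it and is dropped.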
This might help you.
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The author is writing a new book.'
pattern = r'<VERB>?<ADV>*<VERB>+'
doc = textacy.Doc(sentence, lang='en_core_web_sm')
lists = textacy.extract.pos_regex_matches(doc, pattern)
for span in lists:
    print(span.text)
Output:
is writing
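The pattern string above is essentially a regular expression over the sequence of POS tags. A rough pure-Python sketch of that idea follows. It is not textacy's implementation, the helper name verb_phrases is hypothetical, and the POS tags are assigned by hand as an assumption rather than produced by a model. It also shows how the result depends on whether the tagger labels 'is' as VERB or as AUX.

```python
import re


def verb_phrases(tagged, pattern):
    """Return the phrases whose POS tag sequence matches the pattern."""
    # Encode the tags as one string, e.g. "<DET><NOUN><VERB>".
    tag_string = "".join(f"<{pos}>" for _, pos in tagged)
    # Character offset where each token's tag starts, plus the final end offset.
    boundaries = [0]
    for _, pos in tagged:
        boundaries.append(boundaries[-1] + len(pos) + 2)
    # Group every <TAG> so that ?, * and + apply to the whole tag.
    regex = re.sub(r"(<[A-Z]+>)", r"(?:\1)", pattern)
    phrases = []
    for m in re.finditer(regex, tag_string):
        start = boundaries.index(m.start())
        end = boundaries.index(m.end())
        phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases


pattern = r"<VERB>?<ADV>*<VERB>+"
words = "The author is writing a new book .".split()
# Hand-assigned tags (assumption): 'is' tagged as VERB vs. as AUX.
tags_verb = ["DET", "NOUN", "VERB", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
tags_aux = ["DET", "NOUN", "AUX", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]

print(verb_phrases(list(zip(words, tags_verb)), pattern))  # ['is writing']
print(verb_phrases(list(zip(words, tags_aux)), pattern))   # ['writing']
```

If your model tags auxiliaries as AUX, a pattern built only from VERB and ADV will miss them, which is one plausible reason the same pattern can give different results across model versions.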
For how to highlight the verb phrases, see the link below.
Highlight verb phrases using spacy and html
Another Approach:
I recently noticed that textacy has changed how its regex matching works. Based on the new approach, I tried it this way.
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The dog jumped into the water. The author is writing a book.'
pattern = [{'POS': 'VERB', 'OP': '?'},
{'POS': 'ADV', 'OP': '*'},
{'POS': 'VERB', 'OP': '+'}]
doc = textacy.make_spacy_doc(sentence, lang='en_core_web_sm')
lists = textacy.extract.matches(doc, pattern)
for span in lists:
    print(span.text)
Output:
sat
jumped
writing
I checked the POS matches at the link below, and the result does not seem to be the intended one.
https://explosion.ai/demos/matcher
Has anybody tried framing POS tags instead of a regexp pattern for finding verb phrases?
Edit 2:
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans
nlp = spacy.load('en_core_web_sm')
sentence = 'The cat sat on the mat. He quickly ran to the market. The dog jumped into the water. The author is writing a book.'
pattern = [{'POS': 'VERB', 'OP': '?'},
{'POS': 'ADV', 'OP': '*'},
{'POS': 'AUX', 'OP': '*'},
{'POS': 'VERB', 'OP': '+'}]
# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)
# spaCy v3 signature; in v2 use matcher.add("Verb phrase", None, pattern)
matcher.add("Verb phrase", [pattern])
doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
print(filter_spans(spans))
Output:
[sat, quickly ran, jumped, is writing]
Based on help from mdmjsh's answer.
Edit 3: Strange behavior. For the following pattern and sentence, the verb phrase gets identified correctly in https://explosion.ai/demos/matcher
pattern = [{'POS': 'VERB', 'OP': '?'},
{'POS': 'ADV', 'OP': '*'},
{'POS': 'VERB', 'OP': '+'}]
The very black cat must be really meowing really loud in the yard.
But running the same pattern from code outputs the following:
[must, really meowing]