Python (NLTK) - more efficient way to extract noun phrases?

Take a look at Why is my NLTK function slow when processing the DataFrame?, there's no need to iterate through all rows multiple times if you don't need intermediate steps.

With ne_chunk and solution from

NLTK Named Entity recognition to a Python list and
How can I extract GPE(location) using NLTK ne_chunk?

[code]:

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    return continuous_chunk

df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})

df['text'].apply(lambda sent: get_continuous_chunks((sent)))

[out]:

0                   [New York]
1    [Washington, Bruce Wayne]
Name: text, dtype: object

To use the custom RegexpParser :

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    return continuous_chunk


df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})


df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))

[out]:

0                  [bar sentence, New York city]
1    [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object

Python (NLTK) - more efficient way to extract noun phrases?

Tags:

Pandas

Python 3.X

Nlp

Nltk

Text Chunking

Related

Recent Posts