Use of PunktSentenceTokenizer in NLTK
PunktSentenceTokenizer is the class behind the default sentence tokenizer, sent_tokenize(), provided in NLTK. It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2006). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79
Given a paragraph with multiple sentences, e.g.:
>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '
You can use sent_tokenize():
>>> from nltk import sent_tokenize
>>> sent_tokenize(train_text[11])
[u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', u'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print(sent)
...     print('--------')
...
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world.
--------
sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the languages with pre-trained models available in NLTK are:
alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle finnish.pickle norwegian.pickle slovene.pickle
danish.pickle french.pickle polish.pickle spanish.pickle
dutch.pickle german.pickle portuguese.pickle swedish.pickle
english.pickle greek.pickle PY3 turkish.pickle
estonian.pickle italian.pickle README
Given a text in another language, pass the language name to sent_tokenize():
>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "
>>> for sent in sent_tokenize(german_text, language='german'):
...     print(sent)
...     print('---------')
...
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten.
---------
To train your own Punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the related question "training data format for nltk punkt".
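As a rough illustration of what training looks like, here is a minimal sketch that assumes only the nltk package is installed (no nltk_data download is needed). The training text and variable names are made up for the example; a real model needs a far larger plain-text corpus, and this is not necessarily how the bundled models were built.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical, tiny training text; in practice use a large corpus of
# ordinary, unannotated text in the target language.
training_text = (
    "The committee met on Monday in Washington. "
    "Mr. Jones presented the budget to the members. "
    "Mr. Jones then answered several questions. "
    "The members voted on Tuesday. The budget passed easily."
)

trainer = PunktTrainer()
# finalize=False lets you call train() again with more text later.
trainer.train(training_text, finalize=False)
trainer.finalize_training()

# Build a tokenizer from the learned parameters.
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = custom_tokenizer.tokenize("First sentence here. Second sentence here.")
print(sentences)
```

Passing a raw string straight to PunktSentenceTokenizer(...) trains in a single step; using PunktTrainer explicitly is mainly useful when the training text arrives in several chunks.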
PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before use [1]. NLTK already includes a pre-trained version of the PunktSentenceTokenizer, which is what sent_tokenize() loads.
If you initialize the tokenizer without any arguments, no training takes place and it starts from empty default parameters rather than the pre-trained model; that is still enough to split simple text:
In [1]: import nltk
In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
In [3]: txt = """ This is one sentence. This is another sentence."""
In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']
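What the tokenizer has learned matters mostly around abbreviations. The sketch below, which assumes only the nltk package is installed, compares a tokenizer built with no arguments against one given a hand-built PunktParameters object. The sample sentence and the two abbreviations are invented for illustration and stand in for what training would normally learn.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Made-up sample text containing two abbreviations.
text = "Dr. Smith arrived at 5 p.m. yesterday. He left soon after."

# No arguments: no training text is given, so no abbreviations are known.
plain_tokenizer = PunktSentenceTokenizer()
plain_sentences = plain_tokenizer.tokenize(text)

# Hand-built parameters standing in for what training would learn.
# Abbreviations are stored lowercased and without the final period.
params = PunktParameters()
params.abbrev_types.update(["dr", "p.m"])
informed_tokenizer = PunktSentenceTokenizer(params)
informed_sentences = informed_tokenizer.tokenize(text)

print(plain_sentences)
print(informed_sentences)
```

The tokenizer that knows the abbreviations should keep "Dr. Smith" in one sentence, while the plain one treats each abbreviation's period as a sentence end.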
You can also provide your own training data to train the tokenizer before using it. The Punkt tokenizer uses an unsupervised algorithm, meaning you train it on regular, unannotated text.
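from nltk.tokenize import PunktSentenceTokenizer
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)  # train_text must be a single string of plain text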
In most cases the pre-trained version is perfectly adequate, so you can simply call sent_tokenize() instead of training your own tokenizer.
So, "what does all this have to do with POS tagging"? The NLTK POS tagger works on tokenized sentences, so you need to break your text into sentences, and each sentence into word tokens, before you can POS tag.
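The two preprocessing steps can be sketched as follows. This assumes only the nltk package is installed and picks TreebankWordTokenizer as the word tokenizer because it needs no downloaded data; that choice is an assumption for the example, not something the tagger requires. The tagging call itself is left as a comment because it needs a tagger model from nltk_data.

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = "This is one sentence. This is another sentence."

sentence_tokenizer = PunktSentenceTokenizer()
word_tokenizer = TreebankWordTokenizer()

# Step 1: split the text into sentences.
# Step 2: split each sentence into word tokens.
tokenized_sentences = [
    word_tokenizer.tokenize(sentence)
    for sentence in sentence_tokenizer.tokenize(text)
]

for tokens in tokenized_sentences:
    print(tokens)
    # With a tagger model installed, each token list could now be
    # passed to nltk.pos_tag(tokens).
```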
NLTK's documentation.
[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection"