How do I use NLTK's default tokenizer to get spans instead of strings?
Most tokenizers in NLTK have a method called span_tokenize, but unfortunately the tokenizer you are using doesn't. By default, the word_tokenize function uses a TreebankWordTokenizer. The TreebankWordTokenizer implementation is fairly robust, but it currently lacks one important method: span_tokenize.
Since there is no span_tokenize implementation for TreebankWordTokenizer, I believe you will need to implement your own. Subclassing TokenizerI can make this process a little less complex.
You might find the span_tokenize method of PunktWordTokenizer useful as a starting point; a rough sketch of the subclassing approach follows.
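For example, here is a minimal sketch of such a wrapper, assuming a simple find()-based alignment; the class name and the alignment strategy are illustrative, not part of NLTK:

from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.api import TokenizerI

class SpanTreebankWordTokenizer(TokenizerI):
    """Wraps TreebankWordTokenizer and derives character spans by
    locating each token in the original text."""

    def __init__(self):
        self._tokenizer = TreebankWordTokenizer()

    def tokenize(self, text):
        return self._tokenizer.tokenize(text)

    def span_tokenize(self, text):
        position = 0
        for token in self.tokenize(text):
            # The Treebank tokenizer rewrites some tokens (e.g. '"' becomes
            # '``'), so a plain find() is only a rough alignment heuristic.
            start = text.find(token, position)
            if start == -1:
                continue  # token was rewritten; skip its span
            end = start + len(token)
            yield (start, end)
            position = end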
I hope this info helps.
At least since NLTK 3.4, TreebankWordTokenizer supports span_tokenize:
>>> from nltk.tokenize import TreebankWordTokenizer as twt
>>> list(twt().span_tokenize('What is the airspeed of an unladen swallow ?'))
[(0, 4),
(5, 7),
(8, 11),
(12, 20),
(21, 23),
(24, 26),
(27, 34),
(35, 42),
(43, 44)]
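The spans can then be used to slice the original string back into tokens:
>>> text = 'What is the airspeed of an unladen swallow ?'
>>> [text[start:end] for start, end in twt().span_tokenize(text)]
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']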