How do I use NLTK's default tokenizer to get spans instead of strings?
Most tokenizers in NLTK have a method called span_tokenize, but unfortunately the tokenizer you are using doesn't. By default, the word_tokenize function uses a TreebankWordTokenizer. The TreebankWordTokenizer implementation is fairly robust, but it currently lacks one important method: span_tokenize.
Since there is no span_tokenize implementation for TreebankWordTokenizer, I believe you will need to implement your own. Subclassing TokenizerI can make this process a little less complex.
You might find the span_tokenize method of PunktWordTokenizer useful as a starting point; a rough sketch of the subclassing approach follows.
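For example, here is a minimal sketch of such a wrapper, assuming a simple find()-based alignment; the class name and the alignment strategy are illustrative, not part of NLTK:

from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.api import TokenizerI

class SpanTreebankWordTokenizer(TokenizerI):
    """Wraps TreebankWordTokenizer and derives character spans by
    locating each token in the original text."""

    def __init__(self):
        self._tokenizer = TreebankWordTokenizer()

    def tokenize(self, text):
        return self._tokenizer.tokenize(text)

    def span_tokenize(self, text):
        position = 0
        for token in self.tokenize(text):
            # The Treebank tokenizer rewrites some tokens (e.g. '"' becomes
            # '``'), so a plain find() is only a rough alignment heuristic.
            start = text.find(token, position)
            if start == -1:
                continue  # token was rewritten; skip its span
            end = start + len(token)
            yield (start, end)
            position = end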
I hope this info helps.
At least since NLTK 3.4, TreebankWordTokenizer supports span_tokenize:
>>> from nltk.tokenize import TreebankWordTokenizer as twt
>>> list(twt().span_tokenize('What is the airspeed of an unladen swallow ?'))
[(0, 4),
(5, 7),
(8, 11),
(12, 20),
(21, 23),
(24, 26),
(27, 34),
(35, 42),
(43, 44)]
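The spans can then be used to slice the original string back into tokens:
>>> text = 'What is the airspeed of an unladen swallow ?'
>>> [text[start:end] for start, end in twt().span_tokenize(text)]
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']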