Python Untokenize a sentence

You can use "treebank detokenizer" - TreebankWordDetokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'The quick brown'

There is also MosesDetokenizer which was in nltk but got removed because of the licensing issues, but it is available as a Sacremoses standalone package.

To reverse word_tokenize from nltk, i suggest looking in http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and do some reverse engineering.

Short of doing crazy hacks on nltk, you can try this:

>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."

Python Untokenize a sentence

Tags:

Python

Python 2.7

Nltk

Related

Recent Posts