NLP reverse tokenizing (going from tokens to nicely formatted sentence)
You can use nltk's TreebankWordDetokenizer to some extent for detokenization. You'll need to do some post-processing or modify the regexes, but here is a sample idea:
import re
from nltk.tokenize.treebank import TreebankWordDetokenizer as Detok

detokenizer = Detok()
text = detokenizer.detokenize(tokens)
# Clean up any remaining whitespace around punctuation (raw strings
# avoid invalid-escape warnings in the regex patterns)
text = re.sub(r'\s*,\s*', ', ', text)
text = re.sub(r'\s*\.\s*', '. ', text)
text = re.sub(r'\s*\?\s*', '? ', text)
text = text.strip()
There are more edge cases with punctuation, but this is pretty simple and slightly better than ' '.join(tokens).
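To see why the post-processing step matters, here is a minimal self-contained sketch (the sample tokens are made up for illustration) applying the same regex cleanup to a naive ' '.join, which leaves stray spaces before punctuation:

```python
import re

tokens = ['Hello', ',', 'how', 'are', 'you', '?']

# Naive join puts a space before every token: "Hello , how are you ?"
text = ' '.join(tokens)

# Collapse whitespace around punctuation, as in the answer above
text = re.sub(r'\s*,\s*', ', ', text)
text = re.sub(r'\s*\?\s*', '? ', text)
text = text.strip()

print(text)  # Hello, how are you?
```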
Within spaCy you can always reconstruct the original string using ''.join(token.text_with_ws for token in doc), since each token stores its trailing whitespace. If all you have is a list of strings, there's no truly deterministic solution: you could train a reverse model or use some approximate rules. I don't know of a good general-purpose implementation of this detokenize() function.
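As an example of the "approximate rules" approach, here is a minimal sketch of such a detokenize() function. The helper and its rules are illustrative only, not a library API, and they will miss plenty of edge cases (quotes, currency, abbreviations):

```python
import re

def detokenize(tokens):
    """Join tokens, then close up spaces with a few heuristic rules."""
    text = ' '.join(tokens)
    # No space before closing punctuation
    text = re.sub(r"\s+([.,!?;:%)\]])", r"\1", text)
    # No space after opening brackets
    text = re.sub(r"([(\[])\s+", r"\1", text)
    # Re-attach Treebank-style contraction pieces: "do n't" -> "don't"
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)", r"\1", text)
    return text

print(detokenize(['I', 'do', "n't", 'know', ',', 'sorry', '.']))
# I don't know, sorry.
```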