Remove punctuation from Unicode formatted strings
You could use the unicode.translate() method:
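import unicodedata
import sys

# Python 2: map every punctuation code point (general category P*) to None,
# so that translate() deletes it.
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)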
You could also use the r'\p{P}' Unicode property class, which is supported by the third-party regex module (the standard re module does not support it):
import regex as re

def remove_punctuation(text):
    # The ur"" prefix is Python 2 only; in Python 3 write r"\p{P}+" instead.
    return re.sub(ur"\p{P}+", "", text)
If you want to use J.F. Sebastian's solution in Python 3:
import unicodedata
import sys

tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)
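
To see what the table-based approach actually removes, here is a small self-contained sketch for Python 3. The name PUNCT_TABLE and the sample strings are made up for illustration, and the range is extended to sys.maxunicode + 1 so that the last code point is also inspected; otherwise it follows the same idea as the code above.

```python
import sys
import unicodedata

# Map every code point whose general category starts with "P"
# (Pc, Pd, Ps, Pe, Pi, Pf, Po) to None; str.translate() deletes those.
# PUNCT_TABLE is an illustrative name, not taken from the answers above.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def remove_punctuation(text):
    """Return text with all Unicode punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

if __name__ == "__main__":
    # ASCII and non-ASCII punctuation are both stripped.
    print(remove_punctuation("Hello, world! «Привет» ¿qué tal?"))
    # Symbols such as "$" (category Sc) and "+" (category Sm) are kept.
    print(remove_punctuation("$5 + 3"))
```

Note that building the table scans the whole Unicode range once, so it is worth doing at import time and reusing. Characters in the symbol categories (S*), such as "$", "+", "<" or "^", are not punctuation in the Unicode sense and are left in place; if you want them removed too, the category test would have to be widened accordingly.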