Remove punctuation from Unicode formatted strings

You could use the unicode.translate() method:

import unicodedata
import sys

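# Map every punctuation code point (category P*) to None so translate() deletes it.
# The + 1 makes the scan include sys.maxunicode itself.
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode + 1)
                    if unicodedata.category(unichr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)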
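
The table works because translate() deletes any character whose code point maps to None, and dict.fromkeys() assigns None to every key by default. A minimal sketch of just that mechanism, written in Python 3 syntax and using a hand-built two-entry table (small_tbl is an illustrative name, not part of the answer above):

```python
# Illustrative only: a tiny hand-built table instead of the full Unicode scan.
# dict.fromkeys() assigns None to every key, and str.translate() drops
# any character whose code point maps to None.
small_tbl = dict.fromkeys([ord(","), ord("!")])

print(small_tbl)                             # {44: None, 33: None}
print("Hello, world!".translate(small_tbl))  # Hello world
```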

You could also use r'\p{P}', which is supported by the third-party regex module (not the standard-library re module):

import regex as re

def remove_punctuation(text):
    # The ur"" prefix is Python 2 only; in Python 3 write r"\p{P}+" instead.
    return re.sub(ur"\p{P}+", "", text)

If you want to use J.F. Sebastian's solution in Python 3:

import unicodedata
import sys

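# Map every punctuation code point (category P*) to None so translate() deletes it.
# The + 1 makes the scan include sys.maxunicode itself.
tbl = dict.fromkeys(i for i in range(sys.maxunicode + 1)
                    if unicodedata.category(chr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)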
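
For a quick check of what this does and does not strip, here is a small usage sketch in Python 3. The sample string is made up for illustration. Note that only characters in the Unicode "P" categories are removed, so currency and math symbols such as $ and + are left in place:

```python
import sys
import unicodedata

# Same table as above: every code point in a "P*" (punctuation) category.
tbl = dict.fromkeys(i for i in range(sys.maxunicode + 1)
                    if unicodedata.category(chr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)

sample = "Hello, world! «Привет»… $5 + 3"  # made-up sample text
print(remove_punctuation(sample))          # Hello world Привет $5 + 3

# '$' is category Sc and '+' is Sm, so neither is treated as punctuation.
print(unicodedata.category('$'), unicodedata.category('+'))  # Sc Sm
```

If symbols should go too, one option would be to also test for categories starting with 'S' when building the table.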