How to filter Chinese (ONLY Chinese)

I don't know of a good way to separate Chinese characters from other letters, but you can distinguish letters from other characters. With regexes, you can use r"\w" (compiled with the re.UNICODE flag if you're on Python 2; on Python 3, str patterns are Unicode-aware by default). That matches digits and the underscore as well as letters, but not punctuation.
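As a rough sketch of the \w approach (the helper name keep_word_chars is made up for illustration), this keeps every word character and drops spaces and punctuation. Note that it does not single out Chinese: Latin letters and digits survive too.

```python
import re

# \w matches letters, digits and the underscore.
# re.UNICODE is only needed on Python 2; it is harmless on Python 3.
WORD_CHARS = re.compile(r"\w", re.UNICODE)

def keep_word_chars(text):
    """Hypothetical helper: keep only characters matched by \\w."""
    return "".join(WORD_CHARS.findall(text))

print(keep_word_chars("你好, world! 123"))
```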

unicodedata.category(c) will tell you the Unicode general category of character c. Your Chinese characters are "Lo" (letter, other: a letter without case), while most of the punctuation is "Po" (punctuation, other).
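A minimal sketch of filtering by category (keep_caseless_letters is a hypothetical name). Keep in mind that "Lo" is not specific to Chinese: it also covers other caseless scripts such as Japanese kana, Arabic, and Hebrew, so this only works if the input contains nothing but Chinese and cased letters.

```python
import unicodedata

def keep_caseless_letters(text):
    """Hypothetical helper: keep only characters whose category is 'Lo'."""
    return "".join(c for c in text if unicodedata.category(c) == "Lo")

# Latin letters are 'Ll'/'Lu', digits are 'Nd', so they are dropped as well.
print(keep_caseless_letters("你好, world! 123"))
```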


The Zhon library provides you with a list of Chinese punctuation marks: https://pypi.python.org/pypi/zhon

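import re
import zhon.unicode  # assumption: newer Zhon releases expose this list as zhon.hanzi.punctuation instead

# Use a name other than str, so the built-in type is not shadowed.
text = re.sub('[%s]' % zhon.unicode.PUNCTUATION, "", "你好,这只是一些中文文本..,.,全角")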

This does almost what you want, but not exactly: the sentence you provided contains some very non-standard punctuation marks, such as ".", that the list does not cover. Anyway, I think Zhon might be useful to others with a similar issue.
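If the goal is to keep only Chinese characters rather than to enumerate the punctuation to remove, one possible alternative (not something Zhon does for you here, just a standard-library sketch) is to whitelist a code point range. The example below assumes that "Chinese" means the basic CJK Unified Ideographs block, U+4E00 to U+9FFF; rarer characters in the extension blocks would be dropped. The name keep_han is made up for illustration.

```python
import re

# Assumption: "Chinese" = the basic CJK Unified Ideographs block (U+4E00-U+9FFF).
# Everything outside that range, including all punctuation, is removed.
NON_HAN = re.compile(r"[^\u4e00-\u9fff]")

def keep_han(text):
    """Hypothetical helper: delete every character outside U+4E00-U+9FFF."""
    return NON_HAN.sub("", text)

print(keep_han("你好,这只是一些中文文本..,.,全角"))
```

Because this is a whitelist, it does not matter how unusual the punctuation is: anything that is not in the range is stripped.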

Tags:

Python