Find out the unicode script of a character
You can use ord() to retrieve the numeric value of a character (it works on both unicode and byte strings of length 1).
Unfortunately, the next step is to test that value against the script ranges yourself. The data here may be of assistance: http://cldr.unicode.org/index/downloads
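As a rough illustration of the range test described above, here is a minimal sketch, assuming Python 3 (where every str is a Unicode string). The table SCRIPT_RANGES and the function guess_script are hypothetical names, and the ranges are a tiny, approximate hand-written sample (some are whole blocks, which also contain a few characters belonging to other scripts), not authoritative data.

```python
# Hypothetical, deliberately incomplete table of (first, last, script) ranges.
# Real data should come from the Unicode Character Database, not from this list.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),     # A-Z
    (0x0061, 0x007A, "Latin"),     # a-z
    (0x0400, 0x04FF, "Cyrillic"),  # Cyrillic block (approximation)
    (0x3041, 0x3096, "Hiragana"),  # Hiragana letters
]


def guess_script(char):
    """Return the script name for a single character, or "Unknown"."""
    code_point = ord(char)  # numeric value of the character
    for first, last, script in SCRIPT_RANGES:
        if first <= code_point <= last:
            return script
    return "Unknown"  # not covered by the sample table
```

For example, guess_script("Ф") falls in the Cyrillic range, while a character the table does not cover, such as "★", falls through to "Unknown".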
The only way I know of is, unfortunately, to get the Unicode code point with ord() and then look it up in your own table (built from http://en.wikipedia.org/wiki/Unicode#Standardized_subsets and other sources). A preliminary conversion to a normal form may be in order, to handle the fact that a single "written" character can be expressed with different sequences of code points (the unicodedata module helps here).
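To make the normalization point concrete, here is a small sketch, assuming Python 3. It uses unicodedata.normalize from the standard library; the helper name code_points is made up for illustration.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def code_points(text, form="NFC"):
    """Normalize text to the given form and return its code points."""
    return [ord(ch) for ch in unicodedata.normalize(form, text)]


# The two spellings differ as raw strings but agree after NFC normalization.
same_after_nfc = code_points(composed) == code_points(decomposed)
```

Normalizing to NFC first means a table lookup sees one code point for the accented letter instead of a base letter plus a combining mark.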
I was hoping someone's done it before, but apparently not, so here's what I've ended up with. The module below (I call it unicodedata2) extends unicodedata and provides script_cat(chr), which returns a tuple (Script name, Category) for a unicode char. Example:
# coding=utf8
import unicodedata2
print unicodedata2.script_cat(u'Ф') #('Cyrillic', 'L')
print unicodedata2.script_cat(u'の') #('Hiragana', 'Lo')
print unicodedata2.script_cat(u'★') #('Common', 'So')
The module: https://gist.github.com/2204527
It seems to me that the Python unicodedata module contains tools for accessing the main file in the Unicode database but nothing for the other files: “The data in this database is based on the UnicodeData.txt file”.
The script information is in the Scripts.txt file. It is of relatively simple format (described in UAX #44) and not horribly large (131 kilobytes), so you might consider parsing it in your program. Note that in the Unicode classification, there’s the “Common” script that contains characters used in different scripts, like punctuation marks.
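A parser along those lines could look roughly like the sketch below, assuming Python 3 and the "code point or range ; script name # comment" line layout of Scripts.txt. The names parse_scripts, script_of, and SAMPLE are hypothetical, and SAMPLE is only a handful of abridged lines written in that layout; in practice you would pass in the lines of the real file.

```python
import bisect

# A few abridged lines in the Scripts.txt layout, for demonstration only.
SAMPLE = """\
# comment lines and blank lines are ignored

0041..005A    ; Latin # L&
00AA          ; Latin # Lo
0400..0481    ; Cyrillic # L&
3041..3096    ; Hiragana # Lo
""".splitlines()


def parse_scripts(lines):
    """Return a sorted list of (first, last, script) tuples."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        code_range, script = [field.strip() for field in data.split(";")]
        first, _, last = code_range.partition("..")
        # A single code point has no "..", so it is a one-element range.
        ranges.append((int(first, 16), int(last or first, 16), script))
    ranges.sort()
    return ranges


def script_of(char, ranges, default="Unknown"):
    """Look up the script of one character by binary search."""
    cp = ord(char)
    # Index of the last range whose start is <= cp.
    i = bisect.bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default  # code points absent from the data


ranges = parse_scripts(SAMPLE)
```

With the sample data, script_of("Ф", ranges) gives "Cyrillic" and script_of("の", ranges) gives "Hiragana". Sorting once and using bisect keeps each lookup logarithmic, which matters because the full file has far more ranges than this sample.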