Get unicode code point of a character using Python
If I understand your question correctly, you can do this.
>>> s='㈲'
>>> s.encode("unicode_escape")
b'\\u3232'
This shows the character's Unicode escape sequence as a bytes object.
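The transformation also runs in reverse: decoding with the same codec turns the escape sequence back into the character. A quick Python 3 sketch of the round trip:

```python
# Round-trip: escape the character, then decode the escape back.
escaped = '㈲'.encode("unicode_escape")    # b'\\u3232'
original = escaped.decode("unicode_escape")
print(escaped, original)
```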
>>> ord(u"ć")
263
>>> u"café"[2]
u'f'
>>> u"café"[3]
u'\xe9'
>>> for c in u"café":
...     print repr(c), ord(c)
...
u'c' 99
u'a' 97
u'f' 102
u'\xe9' 233
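In Python 3, print is a function and the u prefix is optional, so the same loop looks like this (a small sketch of the equivalent code, not part of the original answer):

```python
# Python 3: every str is already a Unicode string.
for c in "café":
    print(repr(c), ord(c))
# 'c' 99
# 'a' 97
# 'f' 102
# 'é' 233
```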
It turns out that getting this right is fairly tricky: Python 2 and Python 3 before 3.3 have some subtle issues with extracting Unicode code points from a string.
Up until Python 3.3, it was possible to compile Python in one of two modes:
sys.maxunicode == 0x10FFFF
In this mode, Python's Unicode strings support the full range of Unicode code points from U+0000 to U+10FFFF. One code point is represented by one string element:
>>> import sys
>>> hex(sys.maxunicode)
'0x10ffff'
>>> len(u'\U0001F40D')
1
>>> [c for c in u'\U0001F40D']
[u'\U0001f40d']
This is the default for Python 2.7 on Linux, as well as universally on Python 3.3 and later across all operating systems.
sys.maxunicode == 0xFFFF
In this mode, Python's Unicode strings only support the range of Unicode code points from U+0000 to U+FFFF. Any code point from U+10000 through U+10FFFF is represented using a pair of string elements in the UTF-16 encoding:
>>> import sys
>>> hex(sys.maxunicode)
'0xffff'
>>> len(u'\U0001F40D')
2
>>> [c for c in u'\U0001F40D']
[u'\ud83d', u'\udc0d']
This is the default for Python 2.7 on macOS and Windows.
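The arithmetic behind that surrogate pair is straightforward. Here is a sketch (the function names are mine, not from any standard library) showing how U+1F40D maps to the pair shown above:

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    cp -= 0x10000                   # 20 significant bits remain
    high = 0xD800 + (cp >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (cp & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a UTF-16 surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(x) for x in to_surrogate_pair(0x1F40D)])  # ['0xd83d', '0xdc0d']
```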
This runtime difference makes it quite inconvenient to write Python modules that manipulate Unicode strings as sequences of code points.
The codepoints module
To solve this, I contributed a new module, codepoints, to PyPI: https://pypi.python.org/pypi/codepoints/1.0
This module solves the problem by exposing APIs to convert Unicode strings to and from lists of code points, regardless of the underlying setting for sys.maxunicode:
>>> hex(sys.maxunicode)
'0xffff'
>>> snake = tuple(codepoints.from_unicode(u'\U0001F40D'))
>>> len(snake)
1
>>> snake[0]
128013
>>> hex(snake[0])
'0x1f40d'
>>> codepoints.to_unicode(snake)
u'\U0001f40d'
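The core idea can be sketched in a few lines. This is my illustration of the approach, not the module's actual source; on a narrow build, to_unicode would also need to emit surrogate pairs, while on modern Python 3 chr() covers the full range:

```python
def from_unicode(s):
    """Yield code points, merging UTF-16 surrogate pairs where present."""
    it = iter(s)
    for c in it:
        cp = ord(c)
        if 0xD800 <= cp <= 0xDBFF:      # high surrogate: consume the low one
            low = ord(next(it))
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
        yield cp

def to_unicode(cps):
    """Build a Unicode string from an iterable of code points."""
    return "".join(chr(cp) for cp in cps)
```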