Python3 convert Unicode String to int representation
The usual way to convert a Unicode string to a number is to convert it to a sequence of bytes first. Unicode characters are a pure abstraction: each character has its own number (its code point), but there are several ways to turn those numbers into a stream of bytes. Probably the most versatile one is to encode the string as UTF-8. From the bytes you can get an integer in many ways. Here is one (I have borrowed the nice string from Ivella -- I hope no bad words are inside :) ):
Python 3.2.1 (default, Jul 10 2011, 20:02:51) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> s = "Hello, World, عالَم, ދުނިޔެ, जगत, 世界"
>>> b = s.encode('utf-8')
>>> b
b'Hello, World, \xd8\xb9\xd8\xa7\xd9\x84\xd9\x8e\xd9\x85, \xde\x8b\xde\xaa\xde\x82\xde\xa8\xde\x94\xde\xac, \xe0\xa4\x9c\xe0\xa4\x97\xe0\xa4\xa4, \xe4\xb8\x96\xe7\x95\x8c'
Now we have a sequence of bytes, where the bytes with values from 128 to 255 are displayed as hex-coded escape sequences. Let's convert all the bytes to their hex codes as a bytestring.
>>> import binascii
>>> h = binascii.hexlify(b)
>>> h
b'48656c6c6f2c20576f726c642c20d8b9d8a7d984d98ed9852c20de8bdeaade82dea8de94deac2c20e0a49ce0a497e0a4a42c20e4b896e7958c'
We can look at it as a big number written (as text) in hexadecimal notation. The int() function lets us convert it to the abstract number, which is usually shown in decimal notation when printed.
>>> i = int(h, 16)
>>> i
52620351230730152682202055464811384749235956796562762198329268116226267262806875102376740945811764490696968801603738907493997296927348108
Now you can store it as a number, encrypt it (although it is more usual to encrypt the earlier sequence of bytes), and later recover the integer. Beware: not many languages (and probably no database) can work natively with integers that big.
Let's go back to the original string. First, convert the integer to its hexadecimal representation (a string).
>>> h2 = hex(i)
>>> h2
'0x48656c6c6f2c20576f726c642c20d8b9d8a7d984d98ed9852c20de8bdeaade82dea8de94deac2c20e0a49ce0a497e0a4a42c20e4b896e7958c'
>>> h3 = h2[2:] # remove the 0x from the beginning
>>> h3
'48656c6c6f2c20576f726c642c20d8b9d8a7d984d98ed9852c20de8bdeaade82dea8de94deac2c20e0a49ce0a497e0a4a42c20e4b896e7958c'
>>> type(h3)
<class 'str'>
We had to remove the 0x prefix, as it only says that the rest are hexadecimal digits that represent the number. Notice that h3 is of the str type. As we are in Python 3 (see the top), str means a Unicode string. The next step is to convert the pairs of hex digits back to bytes. Let's try unhexlify():
>>> binascii.unhexlify(h3)
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    binascii.unhexlify(h3)
TypeError: 'str' does not support the buffer interface
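Oops! In this version (Python 3.2) it accepts only bytestrings; since Python 3.3, unhexlify() also accepts ASCII-only str. So we convert each hex digit in the Unicode string to a hex digit in a bytestring. The way to go is to encode, and encoding to ASCII is enough because hex digits are all ASCII characters.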
>>> b2 = h3.encode('ascii') # character by character; subset of ascii only
>>> b2
b'48656c6c6f2c20576f726c642c20d8b9d8a7d984d98ed9852c20de8bdeaade82dea8de94deac2c20e0a49ce0a497e0a4a42c20e4b896e7958c'
>>> b3 = binascii.unhexlify(b2)
>>> b3
b'Hello, World, \xd8\xb9\xd8\xa7\xd9\x84\xd9\x8e\xd9\x85, \xde\x8b\xde\xaa\xde\x82\xde\xa8\xde\x94\xde\xac, \xe0\xa4\x9c\xe0\xa4\x97\xe0\xa4\xa4, \xe4\xb8\x96\xe7\x95\x8c'
Now we have the same bytestring as after the first .encode('utf-8'). Let's use the inverse operation -- decode from UTF-8. We should get the same Unicode string that we started with.
>>> s2 = b3.decode('utf-8')
>>> s2
'Hello, World, عالَم, ދުނިޔެ, जगत, 世界'
>>> s == s2 # is the original equal to the result?
True
:)
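The walk-through above goes through hexadecimal text. As a shorter alternative (not the method shown above, just a sketch of the same idea), int.from_bytes() and int.to_bytes() convert between bytes and an integer directly. The helper names str_to_int and int_to_str are made up for this illustration.

```python
def str_to_int(text, encoding="utf-8"):
    """Encode text to bytes and read those bytes as one big-endian integer."""
    return int.from_bytes(text.encode(encoding), "big")


def int_to_str(number, encoding="utf-8"):
    """Inverse of str_to_int.

    Assumption: the original text did not start with NUL characters,
    because leading zero bytes are not recoverable from the integer.
    """
    # Smallest number of bytes that can hold the integer.
    length = (number.bit_length() + 7) // 8
    return number.to_bytes(length, "big").decode(encoding)


s = "Hello, World, 世界"
i = str_to_int(s)
s2 = int_to_str(i)
print(s == s2)  # True
```

One reason to prefer this over hex(i)[2:] is that hex() drops a leading zero digit: if the first byte is below 0x10 (for example a tab character), the hex string has an odd number of digits and unhexlify() rejects it.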
You are looking for the ord() function, I think:
>>> ord('a')
97
>>> ord('\u00c2')
194
This gives you the integer number for the Unicode codepoint.
To convert a whole set of characters use a list comprehension:
>>> [ord(c) for c in 'Hello World!']
[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
Its inverse is the chr() function:
>>> chr(97)
'a'
>>> chr(193)
'Á'
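Putting the two together, a list of code points can be turned back into the original string by joining the chr() results. This is only a small sketch; the variable names are illustrative.

```python
text = "Hello World!"

# One integer (Unicode code point) per character.
codes = [ord(c) for c in text]

# chr() reverses ord(); join the single characters back into one string.
restored = "".join(chr(n) for n in codes)

print(restored == text)  # True
```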
Note that when you encrypt and decrypt text, you usually encode the text to a binary representation with a character encoding first. Unicode text can be encoded with different encodings, each with its own advantages and disadvantages. These days the most commonly used encoding for Unicode text is UTF-8, but others exist too.
In Python 3, binary data is represented by the bytes object; you encode text to bytes with the str.encode() method and go back with bytes.decode():
>>> 'Hello World!'.encode('utf8')
b'Hello World!'
>>> b'Hello World!'.decode('utf8')
'Hello World!'
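To illustrate that the choice of encoding matters, here is a small sketch that encodes the same two-character string with three encodings and compares the resulting sizes. The "-le" variants are used so that no byte order mark is prepended.

```python
text = "世界"

sizes = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)
    # Decoding with the same encoding always gives the text back.
    assert data.decode(encoding) == text
    sizes[encoding] = len(data)

print(sizes)  # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```

The bytes differ between encodings, so whoever decodes (or decrypts and then decodes) has to know which encoding was used.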
bytes values are really just sequences, like lists, tuples and strings, but consisting of integers from 0 to 255:
>>> list('Hello World!'.encode('utf8'))
[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
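For plain ASCII text the byte values happen to match the ord() values, which can hide the difference between code points and encoded bytes. A quick sketch with a non-ASCII character shows that they are not the same thing:

```python
text = "Á"

# One code point for the single character.
code_points = [ord(c) for c in text]

# UTF-8 needs two bytes to encode that one code point.
utf8_bytes = list(text.encode("utf-8"))

print(code_points)  # [193]
print(utf8_bytes)   # [195, 129]

# bytes() accepts a list of integers in range(256), so the trip is reversible.
print(bytes(utf8_bytes).decode("utf-8") == text)  # True
```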
In general, when encrypting, you want to encode the text first and then encrypt the resulting bytes.
If all this seems overwhelming or hard to follow, perhaps these articles on Unicode and character encodings can help out:
- What every developer needs to know about Unicode
- Ned Batchelder’s Pragmatic Unicode
- Python’s Unicode HOWTO