Converting emojis to Unicode and vice versa in Python 3
sentence = "Head-Up Displays (HUD)💻 for #automotive🚗 sector\n \nThe #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… "
print("normal sentence - ", sentence)
uc_sentence = sentence.encode('unicode-escape')
print("\n\nunicode represented sentence - ", uc_sentence)
decoded_sentence = uc_sentence.decode('unicode-escape')
print("\n\ndecoded sentence - ", decoded_sentence)
output
normal sentence - Head-Up Displays (HUD)💻 for #automotive🚗 sector
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l…
unicode represented sentence - b'Head-Up Displays (HUD)\\U0001f4bb for #automotive\\U0001f697 sector\\n \\nThe #UK-based #startup\\U0001f680 Envisics got \\u20ac42 million #funding\\U0001f4b0 from l\\u2026 '
decoded sentence - Head-Up Displays (HUD)💻 for #automotive🚗 sector
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l…
'😀' is already a Unicode string. UTF-8 is not Unicode; it's a byte encoding for Unicode. To get the code point number of a Unicode character, you can use the ord function, and to print it in the form you want, you can format it as hex. Like this:
s = '😀'
print('U+{:X}'.format(ord(s)))
output
U+1F600
If you have Python 3.6+, you can make it even shorter (and more efficient) by using an f-string:
s = '😀'
print(f'U+{ord(s):X}')
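Going the other way, from a code point back to the character, can be done with the built-in chr. Here is a minimal sketch; the helper names to_codepoint_label and from_codepoint_label are made up for illustration, and the label is zero-padded to at least four hex digits, which is the usual U+ notation. The last line prints the UTF-8 bytes of the same character, to show that the byte encoding and the code point are different things.

```python
def to_codepoint_label(ch):
    """Return the 'U+XXXX' label for a single character."""
    return f'U+{ord(ch):04X}'


def from_codepoint_label(label):
    """Turn a 'U+XXXX' label back into the character it names."""
    return chr(int(label[2:], 16))


s = '\U0001F600'  # the grinning-face emoji, written as an escape
label = to_codepoint_label(s)
print(label)                             # U+1F600
print(from_codepoint_label(label) == s)  # True
print(s.encode('utf-8'))                 # b'\xf0\x9f\x98\x80'
```

For a string with several characters you can apply the same helper to each character in a loop or comprehension, since iterating over a Python 3 str yields one code point at a time.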
BTW, if you want to create a Unicode escape sequence like '\U0001F600', there's the 'unicode-escape' codec. However, it returns a bytes string, and you may wish to convert that back to text. You could use the 'UTF-8' codec for that, but you might as well just use the 'ASCII' codec, since the result is guaranteed to contain only valid ASCII.
s = '😀'
print(s.encode('unicode-escape'))
print(s.encode('unicode-escape').decode('ASCII'))
output
b'\\U0001f600'
\U0001f600
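To reverse this and turn escape text such as \U0001f600 back into the character, one option is to encode the text to ASCII bytes and decode those with the same 'unicode-escape' codec. This is only a sketch, and it assumes the escaped text contains nothing but ASCII, which holds for anything produced by the encode step above:

```python
s = '\U0001F600'  # the grinning-face emoji

# str -> escaped text (pure ASCII)
escaped = s.encode('unicode-escape').decode('ascii')
print(escaped)        # \U0001f600

# escaped text -> original str
restored = escaped.encode('ascii').decode('unicode-escape')
print(restored == s)  # True
```

If the text might already contain non-ASCII characters mixed with escapes, the encode('ascii') step will raise UnicodeEncodeError, so that case needs separate handling.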
I suggest you take a look at this short article by Stack Overflow co-founder Joel Spolsky: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".