How to extract all the emojis from text?
I think it's important to point out that the previous answers won't work with emojis like 👨‍👩‍👦‍👦, because it consists of 4 emojis, and using ... in emoji.UNICODE_EMOJI will return 4 different emojis. The same goes for emojis with a skin color modifier, like 🙅🏽.
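To make the problem concrete, here is a minimal sketch using only the standard library. It lists the codepoints behind the two emojis mentioned above. The helper name codepoint_names is made up for this illustration, and the emojis are written as escapes so the snippet does not depend on the file's encoding.

```python
import unicodedata

# Family emoji: man, woman, boy, boy, glued together with zero width joiners.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466"
# A base emoji followed by a medium skin tone modifier.
skin_tone = "\U0001F645\U0001F3FD"


def codepoint_names(s):
    """Return the Unicode name of every codepoint in s (hypothetical helper)."""
    return [unicodedata.name(c, "UNKNOWN") for c in s]


family_names = codepoint_names(family)
skin_tone_names = codepoint_names(skin_tone)

# Iterating over a str yields codepoints, not what the reader sees as one emoji.
print(len(family), family_names)
print(len(skin_tone), skin_tone_names)
```

Any per-character check therefore sees seven codepoints for the family and two for the skin-toned emoji, which is why they come out as separate pieces.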
My solution
Include the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦 as one emoji.
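import emoji
import regex

def split_count(text):
    emoji_list = []
    # \X matches one extended grapheme cluster (one user-perceived character)
    data = regex.findall(r'\X', text)
    for word in data:
        # keep the cluster if any of its codepoints is a known emoji
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)
    return emoji_list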
Testing (with more emojis with skin color):
line = ["🤔 🙈 me así, se 😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"]
counter = split_count(line[0])
print(' '.join(emoji for emoji in counter))
output:
🤔 🙈 😌 💕 👭 👙 👩🏾‍🎓 👨‍👩‍👦‍👦 😊 🙅🏽 🙅🏽
Include flags
If you want to include flags, like 🇵🇰, the Unicode range would be from 🇦 to 🇿, so add:
flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text)
to the function above, and return emoji_list + flags.
See this answer to "A python regex that matches the regional indicator character class" for more information about the flags.
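One thing to keep in mind: a flag is made of two regional indicator codepoints, so a character class that matches one codepoint at a time returns the two halves separately. If whole flags are wanted, a possible variation is to match pairs. The sketch below uses the standard library re module instead of regex, and the names FLAG_PAIR and find_flags are made up for this example.

```python
import re

# Two consecutive regional indicator symbols (U+1F1E6..U+1F1FF) form one flag.
FLAG_PAIR = re.compile(r'[\U0001F1E6-\U0001F1FF]{2}')


def find_flags(text):
    """Return every pair of regional indicators found in text (hypothetical helper)."""
    return FLAG_PAIR.findall(text)


# Escapes for the flags of Pakistan (PK) and Japan (JP).
sample = "from \U0001F1F5\U0001F1F0 to \U0001F1EF\U0001F1F5"

flags = find_flags(sample)
halves = re.findall(r'[\U0001F1E6-\U0001F1FF]', sample)

print(flags)        # two whole flags
print(len(halves))  # four single regional indicators
```

This pairs indicators greedily from left to right, so it assumes the text contains well-formed flags and not stray single indicators.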
For newer emoji versions
To work with emoji >= v1.2.0 you have to add a language specifier (e.g. en, as in the code above):
emoji.UNICODE_EMOJI['en']
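As far as I know, emoji 2.0 and later dropped UNICODE_EMOJI in favor of a dict called EMOJI_DATA, so code that has to run against several versions may need a small compatibility layer. The following is only a sketch under that assumption. The helper names load_emoji_table and extract_emojis_compat are made up for this example, and the lookup returns None when the package is not installed.

```python
def load_emoji_table():
    """Return a container of known emoji strings, or None if emoji is missing."""
    try:
        import emoji
    except ImportError:
        return None
    if hasattr(emoji, "EMOJI_DATA"):
        # Assumed layout for emoji >= 2.0: a dict keyed by the emoji string.
        return emoji.EMOJI_DATA
    table = emoji.UNICODE_EMOJI
    # emoji >= 1.2.0 nests the table under a language code; older versions do not.
    return table.get("en", table)


def extract_emojis_compat(text, table):
    """Keep only the characters of text that are keys of table."""
    return "".join(c for c in text if c in table)


table = load_emoji_table()
if table is not None:
    print(extract_emojis_compat("hello \U0001F914 world", table))
```

Passing the table in as an argument keeps the filter itself independent of the installed version. It still works per codepoint, so the grapheme cluster approach above remains the one to use for joined emojis.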
You can use the emoji library. You can check if a single codepoint is an emoji codepoint by checking if it is contained in emoji.UNICODE_EMOJI.
import emoji

def extract_emojis(s):
    return ''.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])