Remove ✅, ✈, ♛ and other such emojis/images/signs from Java strings
Based on the Full Emoji List, v11.0, you have 1644 different Unicode code points to remove. For example, ✅ is on this list as U+2705.
Having the full list of emojis, you need to filter them out using code points. Iterating over single char or byte values won't work, as a single code point can span multiple bytes. Because Java uses UTF-16, emojis will usually take two chars.
String input = "ab✅cd";
for (int i = 0; i < input.length();) {
    int cp = input.codePointAt(i);
    // filter out if matches
    i += Character.charCount(cp);
}
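To make the loop above actually produce a filtered string, something along these lines should work. This is only a sketch: the class name EmojiCodePointFilter and the four-entry set are made up for illustration, and a real filter would load every code point from the emoji list instead. Set.of needs Java 9 or later.

```java
import java.util.Set;

final class EmojiCodePointFilter {
    // Hypothetical sample set; a real filter would hold the whole emoji list.
    private static final Set<Integer> EMOJI = Set.of(0x2705, 0x2708, 0x265B, 0x1F525);

    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length();) {
            int cp = input.codePointAt(i);
            // Copy the code point only when it is not in the set to remove.
            if (!EMOJI.contains(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("ab\u2705cd")); // prints "abcd"
    }
}
```

A hash set keeps the per-code-point lookup cheap even with the full list loaded.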
Mapping from the Unicode code point U+2705 to a Java int is straightforward:
int viSign = 0x2705;
or, since Java supports Unicode strings:
int viSign = "✅".codePointAt(0);
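A quick way to see why the loop has to step by code point rather than by char: a code point above U+FFFF is stored as two chars. The sketch below uses U+1F525 purely as an example of such a code point, and the CodePointDemo class and fromCodePoint helper are hypothetical names.

```java
final class CodePointDemo {
    // Builds a one-code-point string from its numeric value.
    static String fromCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        String check = fromCodePoint(0x2705);   // U+2705, inside the BMP
        String fire = fromCodePoint(0x1F525);   // U+1F525, above U+FFFF

        System.out.println(check.length());                       // 1
        System.out.println(fire.length());                        // 2
        System.out.println(fire.codePointCount(0, fire.length())); // 1
        System.out.println(fire.codePointAt(0) == 0x1F525);       // true
    }
}
```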
Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter,"");
So:
- [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s] is a character class representing all number (\\p{N}), letter (\\p{L}), mark (\\p{M}), punctuation (\\p{P}), whitespace/separator (\\p{Z}), other formatting (\\p{Cf}), surrogate (\\p{Cs}, the code units UTF-16 uses to encode characters above U+FFFF) and whitespace such as newline (\\s) characters.
- \\p{L} specifically includes the letters from other scripts such as Cyrillic, Latin, Kanji, etc.
- The ^ in the regex character set negates the match.
Example:
String str = "hello world _# 皆さん、こんにちは! 私はジョンと申します。🔥";
System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]",""));
// Output:
// "hello world _# 皆さん、こんにちは! 私はジョンと申します。"
If you need more information, check out the Java documentation for regexes.
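If the filter runs often, the same whitelist regex can be compiled once and reused. A minimal sketch, assuming a hypothetical WhitelistFilter class. One thing worth knowing before relying on it: the whitelist has no symbol category (\\p{S}), so ordinary symbols such as + and = are stripped as well. Adding \\p{Sm} or \\p{Sc} to the class should keep math and currency symbols if you need them.

```java
import java.util.regex.Pattern;

final class WhitelistFilter {
    // Same character class as above, compiled once instead of on every call.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    static String strip(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // '+' and '=' are math symbols, so the whitelist drops them too.
        System.out.println(strip("1 + 1 = 2")); // prints "1  1  2"
    }
}
```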
I'm not super into Java, so I won't try to write example code inline, but the way I would do this is to check what Unicode calls "the general category" of each character. There are several letter and punctuation categories.
You can use Character.getType to find the general category of a given character. You should probably retain those characters that fall in these general categories:
COMBINING_SPACING_MARK
CONNECTOR_PUNCTUATION
CURRENCY_SYMBOL
DASH_PUNCTUATION
DECIMAL_DIGIT_NUMBER
ENCLOSING_MARK
END_PUNCTUATION
FINAL_QUOTE_PUNCTUATION
FORMAT
INITIAL_QUOTE_PUNCTUATION
LETTER_NUMBER
LINE_SEPARATOR
LOWERCASE_LETTER
MATH_SYMBOL
MODIFIER_LETTER
MODIFIER_SYMBOL
NON_SPACING_MARK
OTHER_LETTER
OTHER_NUMBER
OTHER_PUNCTUATION
PARAGRAPH_SEPARATOR
SPACE_SEPARATOR
START_PUNCTUATION
TITLECASE_LETTER
UPPERCASE_LETTER
(All of the characters you listed as specifically wanting to remove have general category OTHER_SYMBOL, which I did not include in the above category whitelist.)
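For anyone who wants the category whitelist in code form, here is a rough sketch using Character.getType on code points. The CategoryFilter class and its method names are made up for illustration. Note that CONTROL is not in the list above, so newlines and tabs get removed too; adding a case for Character.CONTROL would presumably be the fix if you want to keep them.

```java
final class CategoryFilter {
    // True when the code point's general category is in the whitelist above.
    static boolean keep(int cp) {
        switch (Character.getType(cp)) {
            case Character.COMBINING_SPACING_MARK:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.CURRENCY_SYMBOL:
            case Character.DASH_PUNCTUATION:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.ENCLOSING_MARK:
            case Character.END_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.FORMAT:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.LETTER_NUMBER:
            case Character.LINE_SEPARATOR:
            case Character.LOWERCASE_LETTER:
            case Character.MATH_SYMBOL:
            case Character.MODIFIER_LETTER:
            case Character.MODIFIER_SYMBOL:
            case Character.NON_SPACING_MARK:
            case Character.OTHER_LETTER:
            case Character.OTHER_NUMBER:
            case Character.OTHER_PUNCTUATION:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.SPACE_SEPARATOR:
            case Character.START_PUNCTUATION:
            case Character.TITLECASE_LETTER:
            case Character.UPPERCASE_LETTER:
                return true;
            default:
                // OTHER_SYMBOL, CONTROL, SURROGATE, UNASSIGNED, ... are dropped.
                return false;
        }
    }

    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields whole code points, so surrogate pairs stay intact.
        input.codePoints().filter(CategoryFilter::keep).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("ab\u2705cd")); // prints "abcd"
    }
}
```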