How to remove invalid Unicode characters from strings in Java
In a way, the answers provided by Mukesh Kumar and GsusRecovery are both helpful, but not fully correct.
document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
seems to replace all invalid characters. But it turns out CoreNLP does not support even more characters beyond those. I figured them out manually by running the parser on my whole corpus, which led to this:
document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
So right now I am running two replaceAll() calls before handing the document to the parser. The complete code snippet is:
// remove invalid unicode characters
String tmpDoc1 = document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
// remove other unicode characters coreNLP can't handle
String tmpDoc2 = tmpDoc1.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(tmpDoc2));
for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    System.err.println(gs);
}
This is not necessarily a complete list of unsupported characters, though, which is why I opened an issue on GitHub.
Please note that CoreNLP automatically removes those unsupported characters. The only reason I want to preprocess my corpus is to avoid all those error messages.
UPDATE Nov 27th
Christopher Manning just answered the GitHub issue I opened. There are several ways to handle those characters using the class edu.stanford.nlp.process.TokenizerFactory. Take this code example to tokenize a document:
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
tokenizer.setTokenizerFactory(factory);

for (List<HasWord> sentence : tokenizer) {
    // do something with the sentence
}
You can replace noneDelete in the setOptions call with other options. Quoting Manning:
"(...) the complete set of six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep."
That means, to keep the characters without getting all those error messages, the best way is to use the option noneKeep. This approach is far more elegant than any attempt to remove those characters.
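For completeness, this is how the tokenization snippet above looks with that option. Treat it as a sketch only: document, tagger and parser are assumed to be set up exactly as in my first snippet.

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.trees.GrammaticalStructure;

// keep unsupported characters as single-character tokens and do not log warnings
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneKeep");
tokenizer.setTokenizerFactory(factory);

for (List<HasWord> sentence : tokenizer) {
    // tagger and parser are initialized as in the snippet at the top of this answer
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    System.err.println(gs);
}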
Remove specific unwanted chars with:
document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");
If you find other unwanted chars, simply add them to the list using the same schema.
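If you filter many documents, it can also pay off to compile the blacklist once and reuse it. A minimal sketch, assuming a small helper class of your own (the class name and the extra \u200B zero-width space are only illustrative additions):

import java.util.regex.Pattern;

public class CharFilter {
    // blacklist compiled once; append any newly found character to the class.
    // \u200B (zero-width space) is only an illustrative addition to the list above.
    private static final Pattern UNWANTED =
            Pattern.compile("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u200B]");

    public static String clean(String document) {
        return UNWANTED.matcher(document).replaceAll("");
    }
}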
UPDATE:
The Unicode chars are split by the regex engine into 7 macro-groups (and several sub-groups), identified by one letter (macro-group) or two letters (sub-group).
Basing my arguments on your examples and the Unicode classes indicated in the always useful resource Regular Expressions Site, I think you can try a single "keep only the good chars" pass such as this:
document.replaceAll("[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]","")
This regex removes anything that is not:

\p{L} : a letter in any language
\p{N} : a number
\p{Z} : any kind of whitespace or invisible separator
\p{Sm}\p{Sc}\p{Sk} : math, currency or generic symbols as single chars
\p{Mc}* : a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages)
\p{Pi}\p{Pf}\p{Pc}* : opening quotes, closing quotes, word connectors (i.e. underscore)

* : I think these groups can be eligible for removal as well for the purposes of CoreNLP.
This way you only need a single regex filter, and you can handle whole groups of chars (with the same purpose) instead of single cases.
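To make it concrete, here is a small self-contained sketch of that whitelist pass (the class name and the sample string are just for illustration):

import java.util.regex.Pattern;

public class WhitelistFilter {
    // anything outside the listed Unicode categories is stripped
    private static final Pattern NOT_ALLOWED = Pattern.compile(
            "[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]");

    public static String keepGoodChars(String document) {
        return NOT_ALLOWED.matcher(document).replaceAll("");
    }

    public static void main(String[] args) {
        // the euro sign (Sc) and the less-or-equal sign (Sm) survive,
        // while the NUL control char and the emoji (So) are removed
        String sample = "price 10\u20AC \u2264 x\u0000 \uD83D\uDE00";
        System.out.println(keepGoodChars(sample));
    }
}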