Remove characters not-suitable for UTF-8 encoding from String
UTF-8 is not a character set, it's a character encoding, just like UTF-16.
UTF-8 can encode any Unicode character, and therefore any Unicode text, as a sequence of bytes, so there is no such thing as a character that is not suitable for UTF-8.
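To see this for yourself, here is a minimal sketch that encodes a string to UTF-8 and decodes it back. The class name Utf8RoundTrip and the sample text are only illustrative, not taken from the question.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encodes the text to UTF-8 bytes and decodes those bytes back to a String.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ASCII, Latin-1, CJK, and a supplementary character (U+1F600,
        // stored as a surrogate pair in a Java String)
        String sample = "A \u00e9 \u4e2d \uD83D\uDE00";
        // The round trip is lossless, so this comparison holds.
        System.out.println(sample.equals(roundTrip(sample)));
        // Characters take 1 to 4 bytes each in UTF-8.
        System.out.println(sample.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The one edge case I'm aware of is a String that contains an unpaired surrogate: that is not a valid Unicode character, and getBytes substitutes "?" for it.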
You're using the String constructor that takes only a byte array (String(byte[] bytes)), which according to the Javadoc:
Constructs a new String by decoding the specified array of bytes using the platform's default charset.
It uses the platform's default charset to interpret the bytes (that is, to convert the bytes to characters). Do not use this. Instead, when converting a byte array to a String, specify the encoding explicitly with the String(byte[] bytes, Charset charset) constructor.
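A small sketch of the difference, assuming the bytes really are UTF-8. The class name ExplicitCharsetDemo and the helper decodeUtf8 are made up for illustration. The same two bytes give different text depending on the charset used to decode them, which is why relying on the platform default is risky.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Decodes the bytes as UTF-8, independent of the platform default charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+00E9 (e with acute accent)
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        // Decoded with the right charset: one character
        String correct = decodeUtf8(bytes);
        // Decoded with the wrong charset: each byte becomes its own character
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(correct.length());
        System.out.println(wrong.length());
    }
}
```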
If you have issues with certain characters, that is most likely due to using different character sets or encodings on the server side and on the client side (browser + HTML). Make sure you use UTF-8 everywhere, do not mix encodings, and do not use the default encoding of the platform.
Some reading on how to achieve this:
How to get UTF-8 working in Java webapps?
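Within the Java code itself, "UTF-8 everywhere" mostly means passing the charset explicitly wherever bytes and characters meet, including readers and writers. The following is only a sketch using in-memory streams. The class and method names are hypothetical, and in a real webapp the streams would come from the request and response instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ExplicitStreamCharsets {
    // Writes text through a Writer that is explicitly bound to UTF-8.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and flushed) at this point.
        return out.toByteArray();
    }

    // Reads all characters through a Reader that is explicitly bound to UTF-8.
    static String readUtf8(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Gr\u00fc\u00dfe";
        System.out.println(text.equals(readUtf8(writeUtf8(text))));
    }
}
```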
Maybe the answer using CharsetDecoder from this question helps. You can change the CodingErrorAction to REPLACE and set a replacement string ("?" in my example). The decoder then outputs that replacement for every invalid byte sequence. In this example, a UTF-8 decoder capability and stress test file is read and decoded:
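import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Decoder that substitutes "?" for invalid input instead of throwing
CharsetDecoder utf8Decoder = StandardCharsets.UTF_8.newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.REPLACE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
utf8Decoder.replaceWith("?");
// Read stress file
Path path = Paths.get("<path>/UTF-8-test.txt");
byte[] data = Files.readAllBytes(path);
ByteBuffer input = ByteBuffer.wrap(data);
// UTF-8 decoding
CharBuffer output = utf8Decoder.decode(input);
// Char buffer to string
String outputString = output.toString();
System.out.println(outputString);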
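If you want to try this without the test file, here is a self-contained sketch of the same technique on an in-memory byte array. The class name LenientUtf8Decoder, the method decodeLenient, and the sample bytes are my own illustrative choices. The byte 0xFF can never appear in valid UTF-8, so it triggers the replacement.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientUtf8Decoder {
    // Decodes bytes as UTF-8, replacing every invalid sequence with "?".
    static String decodeLenient(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both error actions are REPLACE.
            throw new IllegalStateException("unexpected decoding failure", e);
        }
    }

    public static void main(String[] args) {
        // 'o', 'k', an invalid byte, then '!'
        byte[] data = {'o', 'k', (byte) 0xFF, '!'};
        System.out.println(decodeLenient(data)); // prints "ok?!"
    }
}
```

A new decoder is created on every call because a CharsetDecoder is stateful and not safe to share between threads.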