Encode String to UTF-8
How about using
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(myString)
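If a byte[] is needed rather than a ByteBuffer, the encoded bytes can be copied out of the buffer. The following is only a minimal sketch; the class name Utf8BufferExample and the helper toUtf8Bytes are made up for illustration and are not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class Utf8BufferExample {
    // Hypothetical helper: encodes text as UTF-8 and returns exactly the encoded bytes.
    static byte[] toUtf8Bytes(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        // Copy only the bytes between position and limit; the buffer's
        // backing array may be larger than the encoded content.
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        // "h", e with acute accent, "llo"
        byte[] bytes = toUtf8Bytes("h\u00e9llo");
        System.out.println(bytes.length); // 6: the accented letter takes two bytes in UTF-8
    }
}
```

Calling buffer.array() directly is tempting, but it can return trailing bytes beyond the encoded data, which is why the sketch copies remaining() bytes instead.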
Use
byte[] ptext = myString.getBytes("UTF-8");
instead of getBytes(). The no-argument getBytes() uses the so-called "default encoding" (the platform's default charset), which may not be UTF-8.
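To see why the charset matters, the same string can be encoded with two explicitly named charsets and compared with the platform default. This is a rough sketch; the class name DefaultEncodingExample and the helper encode are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class DefaultEncodingExample {
    // Hypothetical helper: encodes text with an explicitly named charset.
    static byte[] encode(String text, String charsetName) throws UnsupportedEncodingException {
        return text.getBytes(charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u00e9"; // a single character: e with acute accent
        System.out.println(encode(text, "UTF-8").length);      // 2
        System.out.println(encode(text, "ISO-8859-1").length); // 1
        // Platform-dependent: the charset the no-argument getBytes() would use.
        System.out.println(Charset.defaultCharset());
    }
}
```

The first two results are fixed by the named charsets, while the last line can differ from one machine or JVM configuration to another.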
In Java 7 and later you can use:
import static java.nio.charset.StandardCharsets.*;
// Encode the string as ISO-8859-1 bytes...
byte[] ptext = myString.getBytes(ISO_8859_1);
// ...then decode those same bytes as UTF-8.
String value = new String(ptext, UTF_8);
This has the advantage over getBytes(String) that it does not declare throws UnsupportedEncodingException.
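As a small illustration of the Charset-based overloads, here is a UTF-8 round trip that needs no try/catch. It is a sketch only; the class name StandardCharsetsRoundTrip and the method roundTrip are invented for this example.

```java
import java.nio.charset.StandardCharsets;

class StandardCharsetsRoundTrip {
    // Hypothetical helper: encodes to UTF-8 bytes and decodes them back.
    // Neither call declares a checked exception.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo"; // contains e with acute accent
        System.out.println(roundTrip(original).equals(original)); // true
    }
}
```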
If you're using an older Java version you can declare the charset constants yourself:
import java.nio.charset.Charset;
public class StandardCharsets {
    public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
    public static final Charset UTF_8 = Charset.forName("UTF-8");
    //....
}
String objects in Java use the UTF-16 encoding, and that can't be changed*.
The only thing that can have a different encoding is a byte[]. So if you need UTF-8 data, then you need a byte[]. If you have a String that contains unexpected data, then the problem is at some earlier place that incorrectly converted some binary data to a String (i.e. it was using the wrong encoding).
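A short sketch of how such unexpected data arises: the same UTF-8 bytes are decoded once with the right charset and once with the wrong one. The class name WrongDecodingExample and the helper decode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongDecodingExample {
    // Hypothetical helper: decodes raw bytes using the given charset.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // e with acute accent, encoded as UTF-8: the two bytes 0xC3 0xA9
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1: the original character
        System.out.println(wrong.length()); // 2: U+00C3 followed by U+00A9
    }
}
```

The bytes are identical in both calls; only the decoding step differs, which is where the garbled String comes from.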
* As a matter of implementation, String can internally use an ISO-8859-1 encoded byte[] when the range of characters fits it, but that is an implementation-specific optimization that isn't visible to users of String (i.e. you'll never notice unless you dig into the source code or use reflection to dig into a String object).