How to remove bad characters that are not suitable for utf8 encoding in MySQL?
You can filter surrogate characters with this regex:
String str = "𠀀"; // U+20000, represented by 2 chars in Java (a UTF-16 surrogate pair)
str = str.replaceAll( "([\\ud800-\\udbff\\udc00-\\udfff])", "");
System.out.println(str.length()); //0
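The same filter can also be written against code points instead of a regex, which may be easier to read. This is only a sketch: the class name `BmpFilter` and the method `stripNonBmp` are made up for illustration, and it assumes Java 8 or later (for `String.codePoints()`). It drops everything above U+FFFF, which is what MySQL's three-byte `utf8` charset cannot store, and also drops any stray unpaired surrogate.

```java
class BmpFilter {
    // Hypothetical helper: keep only code points that fit in the Basic
    // Multilingual Plane, and drop unpaired surrogates as well.
    static String stripNonBmp(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> cp <= 0xFFFF)                     // drop supplementary characters
             .filter(cp -> cp < 0xD800 || cp > 0xDFFF)       // drop lone surrogates
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String str = "a\uD840\uDC00b"; // 'a', U+20000, 'b'
        System.out.println(stripNonBmp(str).length()); // 2
    }
}
```

Ordinary BMP text (Latin, Cyrillic, most CJK) passes through unchanged; only the characters that need four bytes in UTF-8 are removed.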
When I had a problem like this, I used a Perl script to make sure the data was converted to valid UTF-8, using code like this:
use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
    print Encode::decode('UTF-8', $_);
}
This script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with � (U+FFFD, the Unicode replacement character).
If you run this script on good UTF-8 input, the output should be identical to the input.
If you have data in a database, it makes sense to use DBI to scan your table(s) and scrub all the data with this approach, to make sure that everything is valid UTF-8.
This is a Perl one-liner version of the same script:
perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt
EDIT: Added Java-only solution.
This is an example of how to do this in Java:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
public class UtfFix {
    public static void main(String[] args) throws CharacterCodingException {
        // Replace invalid byte sequences instead of throwing an exception
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {
            (byte) 0xD0, (byte) 0x9F, // 'П'
            (byte) 0xD1, (byte) 0x80, // 'р'
            (byte) 0xD0,              // corrupted UTF-8, was 'и'
            (byte) 0xD0, (byte) 0xB2, // 'в'
            (byte) 0xD0, (byte) 0xB5, // 'е'
            (byte) 0xD1, (byte) 0x82  // 'т'
        });
        CharBuffer parsed = decoder.decode(bb);
        System.out.println(parsed);
        // this prints: Пр�вет (the corrupted byte becomes U+FFFD, which some consoles display as '?')
    }
}
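To get something closer to the Perl filter above, the same decoder settings can be applied to a stream. The following is a minimal sketch, not a definitive implementation: `Utf8Scrub` and `scrub` are hypothetical names, and the sample bytes are the ones from the example above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Scrub {
    // Hypothetical helper: copies `in` to `out` as UTF-8, replacing every
    // malformed byte sequence with U+FFFD.
    static void scrub(InputStream in, OutputStream out) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        Reader reader = new InputStreamReader(in, decoder);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // Same corrupted sample as above: a lone 0xD0 where 'и' used to be
        byte[] bad = {
            (byte) 0xD0, (byte) 0x9F, (byte) 0xD1, (byte) 0x80,
            (byte) 0xD0,
            (byte) 0xD0, (byte) 0xB2, (byte) 0xD0, (byte) 0xB5,
            (byte) 0xD1, (byte) 0x82
        };
        ByteArrayOutputStream good = new ByteArrayOutputStream();
        scrub(new ByteArrayInputStream(bad), good);
        System.out.println(good.size() + " bytes of valid UTF-8");
    }
}
```

Calling `scrub(System.in, System.out)` instead should give a command-line filter along the lines of `java Utf8Scrub < bad.txt > good.txt`.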
You can encode and then decode it to/from UTF-8:
String label = "look into my eyes 〠.〠";
Charset charset = Charset.forName("UTF-8");
label = charset.decode(charset.encode(label)).toString();
System.out.println(label);
output:
look into my eyes ?.?
edit: I think this might only work on Java 6.
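Here is a small sketch of what the round trip does to a string that is not well-formed UTF-16. The names `RoundTrip` and `roundTrip` are made up for illustration. As far as I can tell, `Charset.encode` substitutes the encoder's replacement (a `?` for UTF-8) for anything it cannot encode, such as a lone surrogate, while well-formed text should come back unchanged.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Hypothetical helper: encode to UTF-8 bytes and decode back again.
    static String roundTrip(String s) {
        Charset utf8 = StandardCharsets.UTF_8;
        return utf8.decode(utf8.encode(s)).toString();
    }

    public static void main(String[] args) {
        String broken = "a\uD800b"; // lone high surrogate between 'a' and 'b'
        System.out.println(roundTrip(broken)); // a?b
    }
}
```

So this approach cleans up broken surrogates, but on its own it will not strip valid four-byte characters.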