Postgresql and UTF8 to Latin1 conversion?

It looks like whatever client you are using is confused about the text encoding; it's sending utf-8 bytes as if they were latin-1, probably.

Check:

SHOW client_encoding;
SHOW server_encoding;
locale command in your terminal, if using psql

Your update is substituting the octal bytes \303\244 which are the utf-8 encoding for "ä" (U+00E4). You're not substituting latin-1 encoded data where you think you are.

Observe:

Click to copy

regress=> SELECT convert_from(BYTEA 'huvudv\303\244rke', 'latin-1');
 convert_from 
--------------
 huvudvÃ¤rke
(1 row)

regress=> SELECT convert_from(BYTEA 'huvudv\303\244rke', 'utf-8');
 convert_from 
--------------
 huvudvärke
(1 row)

Not only that, but your replace could only have matched in the first place if the replace target was the utf-8 encoded byte sequence for Ã¤ interpreted as latin-1, i.e. \303\203\302\244.

It's hard to be more specific without details about the Pg version, the client being used, etc, but the root cause is clearly your client doing something totally borked with encodings on I/O.

Your original text is totally mangled, it's not valid in UTF-8 or latin-1. It looks like someone's taken some UTF-8 data, decoded it as latin-1, and then encoded it as utf-8 again.

Yep, sure enough:

Click to copy

regress=> SELECT convert(convert_to('huvudvärke', 'utf-8'), 'latin-1', 'utf-8');
          convert          
---------------------------
 huvudv\303\203\302\244rke
(1 row)

there's your explanation.

I'd say you probably have a bunch of mis-encoded data in the DB already, and you noticed this one because it got mangled twice. You're probably doing something like routinely jamming utf-8 bytes into latin-1 encoded fields, but you usually get away with it because you decode them as utf-8 again - and this time you did something different.

If you want to get from your mangled text to the original, you simply have to reverse the incorrect encoding process. Decode the utf-8 and output latin-1, then re-interpet the latin-1 as utf-8 and decode again, e.g:

Click to copy

regress=> SELECT  convert_from(convert(BYTEA 'huvudv\303\203\302\244rke', 'utf-8', 'latin-1'), 'utf-8');
 convert_from 
--------------
 huvudvärke
(1 row)

Postgresql and UTF8 to Latin1 conversion?

Tags:

Postgresql

Bytea

Related

Recent Posts