How do I convert special UTF-8 chars to their iso-8859-1 equivalent using javascript?
Actually, everything is typically stored as Unicode of some kind internally, but lets not go into that. I'm assuming you're getting the iconic "åäö" type strings because you're using an ISO-8859 as your character encoding. There's a trick you can do to convert those characters. The escape
and unescape
functions used for encoding and decoding query strings are defined for ISO characters, whereas the newer encodeURIComponent
and decodeURIComponent
which do the same thing, are defined for UTF8 characters.
escape
encodes extended ISO-8859-1 characters (UTF code points U+0080-U+00ff) as %xx
(two-digit hex) whereas it encodes UTF codepoints U+0100 and above as %uxxxx
(%u
followed by four-digit hex.) For example, escape("å") == "%E5"
and escape("あ") == "%u3042"
.
encodeURIComponent
percent-encodes extended characters as a UTF8 byte sequence. For example, encodeURIComponent("å") == "%C3%A5"
and encodeURIComponent("あ") == "%E3%81%82"
.
So you can do:
fixedstring = decodeURIComponent(escape(utfstring));
For example, an incorrectly encoded character "å" becomes "Ã¥". The command does escape("Ã¥") == "%C3%A5"
which is the two incorrect ISO characters encoded as single bytes. Then decodeURIComponent("%C3%A5") == "å"
, where the two percent-encoded bytes are being interpreted as a UTF8 sequence.
If you'd need to do the reverse for some reason, that works too:
utfstring = unescape(encodeURIComponent(originalstring));
Is there a way to differentiate between bad UTF8 strings and ISO strings? Turns out there is. The decodeURIComponent function used above will throw an error if given a malformed encoded sequence. We can use this to detect with a great probability whether our string is UTF8 or ISO.
var fixedstring;
try{
// If the string is UTF-8, this will work and not throw an error.
fixedstring=decodeURIComponent(escape(badstring));
}catch(e){
// If it isn't, an error will be thrown, and we can assume that we have an ISO string.
fixedstring=badstring;
}
The problem is that once the page is served up, the content is going to be in the encoding described in the content-type meta tag. The content in "wrong" encoding is already garbled.
You're best to do this on the server before serving up the page. Or as I have been know to say: UTF-8 end-to-end or die.
Since the question on how to convert from ISO-8859-1 to UTF-8 is closed because of this one I'm going to post my solution here.
The problem is when you try to GET anything by using XMLHttpRequest, if the XMLHttpRequest.responseType is "text" or empty, the XMLHttpRequest.response is transformed to a DOMString and that's were things break up. After, it's almost impossible to reliably work with that string.
Now, if the content from the server is ISO-8859-1 you'll have to force the response to be of type "Blob" and later convert this to DOMSTring. For example:
var ajax = new XMLHttpRequest();
ajax.open('GET', url, true);
ajax.responseType = 'blob';
ajax.onreadystatechange = function(){
...
if(ajax.responseType === 'blob'){
// Convert the blob to a string
var reader = new window.FileReader();
reader.addEventListener('loadend', function() {
// For ISO-8859-1 there's no further conversion required
Promise.resolve(reader.result);
});
reader.readAsBinaryString(ajax.response);
}
}
Seems like the magic is happening on readAsBinaryString so maybe someone can shed some light on why this works.