Unicode Encoding and decoding issues in QRCode

Heuristics used by QR decoders often fails, BOM does not help

Most QR decoders use heuristics to automatically detect character encoding even if it is specified explicitly inside the QR code via the ECI extension.

It turned out that BOM helped to your decoder. But for most decoders, BOM does not help. As an example of a decoder that cannot display a proper UTF-8 string, take a Xiaomi phone with MIUI Global v11.0.3 (with their native scanner application). This phone cannot correctly show an UTF-8 QR code produced a link in your original question. Here is how it showed: R閙y Hubscher. With the BOM (using a link from your subsequent message) it showed this way: ?R閙y Hubscher (it just showed the BOM character as ?). But if you add a Chinese character like 日 before the string instead of BOM, Xiaomi will show the string correctly. Here is the link: chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=%E6%97%A5R%C3%A9my%20Hubscher Xiaomi correctly displays the string 日Rémy Hubscher from a QR code generated by this link.

Another example is “QR code reader & QR code Scanner” Android app by TWMobile. It did properly decode all the QR codes from all the links that you have provided. So you did not have to use BOM to make the scanner by TWMobile properly display the strings.

Why do QR decoders always use heuristics to detect character set even though these heuristics frequently fails as shown in your case? As you know, there are 4 modes of storing text in a QR code: (1) numeric, (2) alphanumeric, (3) 8-bit, and (4) Kanji. So, QR code standard does not inherently support UTF-8. To use UTF-8 encoding (instead of the default “ISO-8859-1” or “JIS8”) in the 8-bit string, the implementation has to insert an ECI (Extended Channel Interpretations) before that string. ECI is an optional, additional feature for a QR Code. Good point is that it was defined in earliest QR code standard at least in 2000. ECI enables data encoding using character sets other than the default. It also enables other data interpretations (e.g. compacted data using defined compression schemes) or other industry-specific requirements to be encoded. The ECI protocol is defined in a specification developed by AIM, Inc, and is not available for free but can be purchased for a fee. Unfortunately, not all QR decoders can handle the ECI protocol, even in such a basic thing as changing default encoding to UTF-8. And even for default encoding like “ISO-8859-1” (for a 8-bit string mode) or “Shift_JIS”(for Kanji mode), decoders still use heuristics to determine character set, because some applications that encode QR codes may not support ECI or specify incorrect character set.

Conclusion

Because of heuristics to automatically detect character set, QR decoders often fail do display the string properly, even when correct encoding is explicitly specified via ECI as it was in your case and the BOM character did not help as shown in the Xiaomi example. You have found a solution in your reply, but it did not help for Xiaomi. Some QR decoders use heuristics algorithms that are so dumb that even BOM does not help.

Although the BOM did help with your QR decoder, a better solution would be to stop using error-prone QR decoders that use heuristics even if the character encoding is explicitly specified via ECI.

Find a better QR decoder if a decoder cannot properly decode the text without BOM. The encoder that you have provided (using the links) is OK.


The solution that comes up, is to encode the text in UTF-8 and add a BOM to specify that the string is actually in UTF-8.

Here it works :

  • http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=%EF%BB%BFR%C3%A9my+Hubscher