What is the most common encoding of each language?

On the web, UTF-8 is by far the most common encoding for all languages.

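That said, here are the Windows XP locales grouped by default character encoding (the "Language for non-Unicode programs" setting). A locale listed under two encodings (az_AZ, sr_BA, sr_SP, uz_UZ) has both a Latin-script and a Cyrillic-script variant, each with its own code page: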

Big5: zh_HK, zh_MO, zh_TW
GBK (≈GB2312): zh_CN, zh_SG
Windows-31J (≈Shift_JIS): ja_JP
windows-874 (≈TIS-620, ISO-8859-11): th_TH
windows-949 (≈EUC-KR): ko_KR
windows-1250: bs_BA, cs_CZ, hr_BA, hr_HR, hu_HU, pl_PL, ro_RO, sk_SK, sl_SI, sq_AL, sr_BA, sr_SP
windows-1251: az_AZ, be_BY, bg_BG, kk_KZ, ky_KG, mk_MK, mn_MN, ru_RU, sr_BA, sr_SP, tt_RU, uk_UA, uz_UZ
windows-1252 (≈ISO-8859-1): af_ZA, arn_CL, ca_ES, cy_GB, da_DK, de_AT, de_CH, de_DE, de_LI, de_LU, en_AU, en_BZ, en_CA, en_CB, en_GB, en_IE, en_JM, en_NZ, en_PH, en_TT, en_US, en_ZA, en_ZW, es_AR, es_BO, es_CL, es_CO, es_CR, es_DO, es_EC, es_ES, es_GT, es_HN, es_MX, es_NI, es_PA, es_PE, es_PR, es_PY, es_SV, es_UY, es_VE, eu_ES, fi_FI, fil_PH, fo_FO, fr_BE, fr_CA, fr_CH, fr_FR, fr_LU, fr_MC, fy_NL, ga_IE, gl_ES, id_ID, is_IS, it_CH, it_IT, iu_CA, iv_IV, lb_LU, moh_CA, ms_BN, ms_MY, nb_NO, nl_BE, nl_NL, nn_NO, ns_ZA, pt_BR, pt_PT, qu_BO, qu_EC, qu_PE, rm_CH, se_FI, se_NO, se_SE, sv_FI, sv_SE, sw_KE, tn_ZA, xh_ZA, zu_ZA
windows-1253: el_GR
windows-1254 (≈ISO-8859-9): az_AZ, tr_TR, uz_UZ
windows-1255: he_IL
windows-1256: ar_AE, ar_BH, ar_DZ, ar_EG, ar_IQ, ar_JO, ar_KW, ar_LB, ar_LY, ar_MA, ar_OM, ar_QA, ar_SA, ar_SY, ar_TN, ar_YE, fa_IR, ps_AF, ur_PK
windows-1257: et_EE, lt_LT, lv_LV
windows-1258: vi_VN
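
If you want to use this grouping in code, a lookup table is probably enough. The following is a minimal Python sketch, not a complete mapping: it covers only one representative locale per encoding, the names `LEGACY_ENCODING_BY_LOCALE` and `legacy_encoding_for` are made up for illustration, and falling back to windows-1252 for unknown locales is my assumption. The values are Python codec names (for example `cp932` for Windows-31J).

```python
# Legacy Windows code page per locale, abridged from the list above.
# One representative locale per encoding; extend as needed.
LEGACY_ENCODING_BY_LOCALE = {
    "zh_TW": "big5",
    "zh_CN": "gbk",
    "ja_JP": "cp932",   # Windows-31J
    "th_TH": "cp874",
    "ko_KR": "cp949",
    "pl_PL": "cp1250",
    "ru_RU": "cp1251",
    "en_US": "cp1252",
    "el_GR": "cp1253",
    "tr_TR": "cp1254",
    "he_IL": "cp1255",
    "ar_EG": "cp1256",
    "lt_LT": "cp1257",
    "vi_VN": "cp1258",
}


def legacy_encoding_for(locale_name, default="cp1252"):
    """Return the legacy code page for a locale such as 'ru_RU' or 'ja-JP.UTF-8'.

    The default of cp1252 for unknown locales is an assumption of this sketch.
    """
    # Drop a trailing ".codeset" part and accept both "-" and "_" separators.
    key = locale_name.split(".")[0].replace("-", "_")
    return LEGACY_ENCODING_BY_LOCALE.get(key, default)
```

For example, `legacy_encoding_for("ru_RU")` gives `"cp1251"`, which can be passed directly to `bytes.decode`.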

The most common encodings overall on the web as of October 30, 2020:

UTF-8 95.7%
ISO-8859-1 1.8%
Windows-1251 1.0%
Windows-1252 0.4%
GB2312 0.3%
Shift JIS 0.2%
GBK 0.1%
EUC-KR 0.1%
ISO-8859-9 0.1%
Windows-1254 0.1%
EUC-JP 0.1%
Big5 0.1%

The HTML5 draft contains a table of default encodings for languages, reflecting what is regarded as common. Note, however, that the default is supposed to be based on the user's locale, i.e. the language of the browser or the operating system, not the language of the document. The latter is usually unknown until you actually read the document, and reading it already requires some assumption about the encoding.

I think you could in practice copy the list of encodings from a popular web browser. If it works well there, it probably works reasonably well in your application. Browsers do some clever things with the list and its order, but in practice I think it would suffice to have a short list like utf-8, utf-16, windows-1252, and maybe a few others, followed by an option to get the full list. Note that although utf-16 is practically unused and useless for web pages, it is common for plain text files. It is important to name the encodings well, preferably with a common English (or other language) name together with the IANA “charset” name in parentheses, much like browsers do.
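
If the application also has to guess an encoding on its own, the same short list can be tried in order. This is only a rough Python sketch under my own assumptions: the function name `guess_decode` and the `CANDIDATES` list are hypothetical, a byte order mark is checked first, and ISO-8859-1 is used as a last resort because it accepts every byte. It will happily return wrong text for other single-byte encodings, so it does not replace letting the user choose.

```python
import codecs

# Hypothetical short list; order matters, because windows-1252 accepts
# almost any byte sequence and must therefore come after utf-8.
CANDIDATES = ["utf-8", "windows-1252"]


def guess_decode(data, candidates=CANDIDATES):
    """Return (text, codec_name) for the first candidate that decodes cleanly."""
    ordered = list(candidates)
    # A byte order mark is the strongest hint, so try the matching codec first.
    if data.startswith(codecs.BOM_UTF8):
        ordered.insert(0, "utf-8-sig")
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        ordered.insert(0, "utf-16")
    for name in ordered:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: ISO-8859-1 maps every byte to a character, so it cannot fail.
    return data.decode("latin-1"), "latin-1"
```

The returned codec name can then be shown to the user as the preselected entry in the encoding list, with the other entries available as overrides.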
