What is the most common encoding of each language?
On the web, UTF-8 is by far the most common encoding for all languages.
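That said, here are the Windows XP locales grouped by their default legacy code page (the "Language for non-Unicode programs" setting). A few locales (az_AZ, sr_BA, sr_SP, uz_UZ) appear twice because Windows defines separate Latin and Cyrillic variants of them: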
- Big5: zh_HK, zh_MO, zh_TW
- GBK (≈GB2312): zh_CN, zh_SG
- Windows-31J (≈Shift_JIS): ja_JP
- windows-874 (≈TIS-620, ISO-8859-11): th_TH
- windows-949 (≈EUC-KR): ko_KR
- windows-1250: bs_BA, cs_CZ, hr_BA, hr_HR, hu_HU, pl_PL, ro_RO, sk_SK, sl_SI, sq_AL, sr_BA, sr_SP
- windows-1251: az_AZ, be_BY, bg_BG, kk_KZ, ky_KG, mk_MK, mn_MN, ru_RU, sr_BA, sr_SP, tt_RU, uk_UA, uz_UZ
- windows-1252 (≈ISO-8859-1): af_ZA, arn_CL, ca_ES, cy_GB, da_DK, de_AT, de_CH, de_DE, de_LI, de_LU, en_AU, en_BZ, en_CA, en_CB, en_GB, en_IE, en_JM, en_NZ, en_PH, en_TT, en_US, en_ZA, en_ZW, es_AR, es_BO, es_CL, es_CO, es_CR, es_DO, es_EC, es_ES, es_GT, es_HN, es_MX, es_NI, es_PA, es_PE, es_PR, es_PY, es_SV, es_UY, es_VE, eu_ES, fi_FI, fil_PH, fo_FO, fr_BE, fr_CA, fr_CH, fr_FR, fr_LU, fr_MC, fy_NL, ga_IE, gl_ES, id_ID, is_IS, it_CH, it_IT, iu_CA, iv_IV, lb_LU, moh_CA, ms_BN, ms_MY, nb_NO, nl_BE, nl_NL, nn_NO, ns_ZA, pt_BR, pt_PT, qu_BO, qu_EC, qu_PE, rm_CH, se_FI, se_NO, se_SE, sv_FI, sv_SE, sw_KE, tn_ZA, xh_ZA, zu_ZA
- windows-1253: el_GR
- windows-1254 (≈ISO-8859-9): az_AZ, tr_TR, uz_UZ
- windows-1255: he_IL
- windows-1256: ar_AE, ar_BH, ar_DZ, ar_EG, ar_IQ, ar_JO, ar_KW, ar_LB, ar_LY, ar_MA, ar_OM, ar_QA, ar_SA, ar_SY, ar_TN, ar_YE, fa_IR, ps_AF, ur_PK
- windows-1257: et_EE, lt_LT, lv_LV
- windows-1258: vi_VN
The most common encodings overall on the web, as of October 30, 2020, were:
- UTF-8 95.7%
- ISO-8859-1 1.8%
- Windows-1251 1.0%
- Windows-1252 0.4%
- GB2312 0.3%
- Shift_JIS 0.2%
- GBK 0.1%
- EUC-KR 0.1%
- ISO-8859-9 0.1%
- Windows-1254 0.1%
- EUC-JP 0.1%
- Big5 0.1%
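If you need a fallback for text that is not UTF-8, the locale groupings above can be turned into a lookup table. The following is only a minimal Python sketch: the names `LEGACY_ENCODING_BY_LOCALE` and `legacy_encoding_for` are made up for illustration, the table holds just one representative locale per group, the Python codec names are my assumed equivalents of the Windows code pages, and defaulting to `cp1252` for unknown locales is my own assumption rather than anything Windows specifies.

```python
# Partial table copied by hand from the Windows XP list above
# (one representative locale per code page; extend as needed).
# Values are Python codec names assumed to match the Windows code pages.
LEGACY_ENCODING_BY_LOCALE = {
    "zh_TW": "big5",
    "zh_CN": "gbk",
    "ja_JP": "cp932",   # Windows-31J
    "th_TH": "cp874",
    "ko_KR": "cp949",
    "pl_PL": "cp1250",
    "ru_RU": "cp1251",
    "en_US": "cp1252",
    "el_GR": "cp1253",
    "tr_TR": "cp1254",
    "he_IL": "cp1255",
    "ar_SA": "cp1256",
    "lt_LT": "cp1257",
    "vi_VN": "cp1258",
}


def legacy_encoding_for(locale_name, default="cp1252"):
    """Return the assumed legacy codec for a locale such as 'ru_RU' or 'ru-RU.UTF-8'."""
    # Drop any '.codeset' suffix and normalise 'ru-RU' to 'ru_RU'.
    key = locale_name.split(".")[0].replace("-", "_")
    return LEGACY_ENCODING_BY_LOCALE.get(key, default)
```

For example, `legacy_encoding_for("ru_RU")` gives `"cp1251"`. Locales that exist in both Latin and Cyrillic variants (az_AZ, sr_BA, sr_SP, uz_UZ) cannot be resolved from the locale name alone, so a table like this would need a script hint for them.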
The HTML5 draft contains a table of default encodings by language, reflecting what is regarded as common. Note, however, that the lookup is meant to be keyed on the user's locale (the language of the browser or the operating system), not on the language of the document. The document's language is usually unknown until you have actually read the document, and reading it already requires some assumption about its encoding.
In practice, I think you could copy the list of encodings from a popular web browser. If it works well there, it will probably work reasonably well in your application. Browsers do some clever things with the list and its order, but I think a short list such as utf-8, utf-16, windows-1252, and maybe a few others would suffice, followed by an option to show the full list. Note that although utf-16 is practically unused (and useless) for web pages, it is common in plain text files. It is important to name the encodings well, preferably with a common English (or other-language) name followed by the IANA "charset" name in parentheses, much as browsers do.
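As a rough illustration of such a short list, here is a Python sketch that picks an encoding automatically before falling back to the user's choice. It is not what any browser actually does: `decode_guess` and `LABELS` are hypothetical names, the labels are just examples of the "English name (charset)" style, and the strategy (trust a byte order mark, then try strict UTF-8, then a legacy fallback) is my own simplification. UTF-16 is only attempted when a byte order mark is present, because almost any even-length byte string decodes as UTF-16 without error.

```python
import codecs

# Example display names in the "English name (charset)" style; hypothetical.
LABELS = {
    "utf-8": "Unicode (UTF-8)",
    "utf-8-sig": "Unicode (UTF-8)",
    "utf-16": "Unicode (UTF-16)",
    "cp1252": "Western European (windows-1252)",
    "latin-1": "Western European (ISO-8859-1)",
}


def decode_guess(data, fallback="cp1252"):
    """Return (text, codec_name) using a short, ordered list of candidates."""
    # A byte order mark is the strongest hint, so check it first.
    # (UTF-32 is not handled in this sketch.)
    if data.startswith(codecs.BOM_UTF8):
        candidates = ["utf-8-sig"]
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        candidates = ["utf-16"]
    else:
        candidates = ["utf-8"]
    candidates.append(fallback)

    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails.
    return data.decode("latin-1"), "latin-1"
```

A caller might show `LABELS[codec]` next to the decoded text and let the user override the guess from the same short list, with the full list behind a "more" option.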