Encoding of Cyrillic filenames in ZIP files
I found a solution on the OpenNET.ru forum, a popular Russian-language resource that has been dedicated to open-source software and technologies since 1996. A post on OpenNET suggests that Info-ZIP, once a popular set of tools for handling ZIP archives on computers running MS-DOS, assumes that MS-DOS has only one 8-bit encoding, namely CP850, and therefore automatically runs all filenames through a CP850->CP1252 conversion. CP1252 was probably chosen as the most popular approximation of the ISO-8859-1 character set.
Therefore, the correct find command to run after extracting an archive containing Cyrillic filenames would be:
find -mindepth 1 -exec sh -c 'mv "$1" "$(echo "$1" | iconv -f cp1252 -t cp850 | iconv -f cp866 )"' sh {} \;
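The one-liner above converts the whole path and may rename a directory before find has descended into it. As a more defensive variant, here is a minimal sketch that assumes the same CP1252 -> CP850 -> CP866 chain and a filesystem that expects UTF-8 names. The helper name fix_tree is hypothetical. It walks the tree depth-first, converts only the last path component, and leaves alone any name that fails to convert or does not change.

```shell
# Hypothetical helper: repair mis-decoded names below directory $1.
# Assumes names went through CP850->CP1252 and were originally CP866,
# and that the target filesystem expects UTF-8.
fix_tree() {
    # -depth visits the contents of a directory before the directory itself,
    # so renaming a directory never invalidates paths still to be visited.
    find "$1" -mindepth 1 -depth -exec sh -c '
        dir=$(dirname "$1")
        old=$(basename "$1")
        # Undo the CP850->CP1252 step; leave the entry alone on failure.
        raw=$(printf "%s" "$old" | iconv -f CP1252 -t CP850) || exit 0
        # Interpret the recovered bytes as CP866 and emit UTF-8.
        new=$(printf "%s" "$raw" | iconv -f CP866 -t UTF-8) || exit 0
        # Rename only when the conversion gave a different, non-empty name.
        if [ -n "$new" ] && [ "$new" != "$old" ]; then
            mv -- "$1" "$dir/$new"
        fi
    ' sh {} \;
}
```

It would be called as fix_tree . from the directory where the archive was extracted. Names consisting only of ASCII characters pass through the conversion unchanged and are therefore skipped.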
Interestingly, one can find suggestions to use ISO-8859-1 rather than CP1252. This does not seem to be correct: on some of the archives that I have encountered, the transformation iconv -f iso8859-1 -t cp850 failed, while iconv -f cp1252 -t cp850 converted successfully.
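One plausible explanation for the difference: the two encodings disagree only in the range 0x80-0x9F, where ISO-8859-1 has control characters with no CP850 equivalent, while CP1252 has printable ones. For example, Я is 9F in CP866, CP850 reads 9F as ƒ, and CP1252 stores ƒ as 83. The sketch below assumes an iconv that knows these charset names and pushes that single byte back through both chains.

```shell
# Byte 0x83 (octal 203) is what Я (CP866 9F) turns into after CP850->CP1252.
# Going back through CP1252 recovers the original letter:
printf '\203' | iconv -f CP1252 -t CP850 | iconv -f CP866 -t UTF-8; echo

# Going back through ISO-8859-1 treats 0x83 as a control character (U+0083),
# which CP850 cannot represent, so this step is expected to fail or lose data:
printf '\203' | iconv -f ISO-8859-1 -t CP850 | od -An -tx1
```

How the second command reports the problem depends on the iconv implementation, so no particular error message is assumed here.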
Getting back to the individual characters:
         Р  о  с  К  о  с  м  о  с
CP866:   90 AE E1 8A AE E1 AC AE E1
Applying CP850 -> CP1252 now results in C9 AB DF E8 AB DF BC AB DF, exactly the sequence that we have observed.
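The byte values can be checked from the shell. This is a small sketch that assumes the script itself is saved as UTF-8; the helper name hex is made up for the example.

```shell
# Hypothetical helper: dump stdin as lowercase hex with all whitespace removed.
hex() { od -An -tx1 | tr -d ' \n'; }

# The name encoded as CP866 bytes.
printf '%s' 'РосКосмос' | iconv -f UTF-8 -t CP866 | hex; echo

# The same bytes misread as CP850 and re-encoded as CP1252.
printf '%s' 'РосКосмос' | iconv -f UTF-8 -t CP866 | iconv -f CP850 -t CP1252 | hex; echo
```

The first command should print 90aee18aaee1acaee1 and the second c9abdfe8abdfbcabdf, matching the two rows above.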
Another useful command lists the files in the archive with their names converted (CP1252 is used here for consistency with the find command):
unzip -l РосКосмос.zip | grep -aEv '^Archive:' | iconv -f cp1252 -t cp850 | iconv -f cp866
The output is:
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   РосКосмос/ict_inf.pdf
---------                     -------
        0                     1 file
Filtering out the line that starts with Archive: keeps the name of the archive, which is already correctly encoded, out of the conversion.
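If the Archive: line should be kept rather than dropped, one option is to let the shell read it before iconv sees the rest. This is only a sketch: the helper name convert_listing is hypothetical, it assumes that the Archive: line is the first line of the listing, and it uses the CP1252 chain with explicit UTF-8 output.

```shell
# Hypothetical helper: read an unzip -l listing on stdin, pass the first
# line (assumed to be the "Archive:" line) through untouched, and convert
# everything after it.
# Usage: unzip -l РосКосмос.zip | convert_listing
convert_listing() {
    # IFS= and -r keep leading spaces and backslashes in the line intact.
    IFS= read -r first && printf '%s\n' "$first"
    # The remaining lines go through the same chain as the filenames.
    iconv -f CP1252 -t CP850 | iconv -f CP866 -t UTF-8
}
```

The header and separator lines of the listing are plain ASCII, so they pass through the conversion unchanged.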