Encoding of Cyrillic filenames in ZIP files

I found a solution on the OpenNET.ru forum, a popular Russian-language resource that has been dedicated to open-source software and technologies since 1996. A post on OpenNET suggests that Info-ZIP, once a popular set of tools for handling ZIP archives on computers running MS-DOS, assumed that MS-DOS has only one 8-bit encoding, namely CP850, so all filenames are automatically run through a CP850->CP1252 conversion. CP1252 was probably chosen as the most popular approximation of the ISO-8859-1 character set.

Therefore the correct find command to run after extracting an archive containing Cyrillic filenames would be

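find . -mindepth 1 -depth -execdir sh -c 'new=$(printf "%s" "$1" | iconv -f cp1252 -t cp850 | iconv -f cp866) && [ "$new" != "$1" ] && mv -- "$1" "$new"' sh {} \;

Here -depth renames the contents of a directory before the directory itself, and -execdir converts each name relative to its parent directory, so renaming a directory never invalidates a path that find still has to visit. Names that come out unchanged, such as plain ASCII ones, are left alone.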
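Before renaming anything, it can help to preview what the conversion would produce. The following is only a minimal sketch, not part of the original recipe: fix_name and preview_renames are hypothetical helper names, and it assumes that the extracted names are stored on disk as raw CP1252 bytes, that the target encoding is UTF-8, and that no filename contains a newline.

```shell
# Hypothetical helper: undo the CP850->CP1252 step, then decode the bytes as CP866.
fix_name() {
    # Give up early if the name cannot be mapped back to CP850.
    step=$(printf '%s' "$1" | iconv -f CP1252 -t CP850) || return 1
    printf '%s' "$step" | iconv -f CP866 -t UTF-8
}

# Dry run: print "old -> new" for every name that would change; nothing is renamed.
preview_renames() {
    find "${1:-.}" -mindepth 1 | while IFS= read -r path; do
        base=${path##*/}
        new=$(fix_name "$base") || continue   # skip names iconv cannot convert
        [ "$new" = "$base" ] || printf '%s -> %s\n' "$path" "${path%/*}/$new"
    done
}
```

Calling preview_renames . in the directory where the archive was extracted should list the proposed renames, one per line, without touching any file.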

Interestingly, one can find suggestions to use ISO-8859-1 instead of CP1252. That does not seem to be right: on some of the archives that I have encountered, the transformation iconv -f iso8859-1 -t cp850 failed, while iconv -f cp1252 -t cp850 converted successfully.
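One byte where the two encodings part ways is 0x83. In CP1252 it is "ƒ", which CP850 has at 0x9F, and 0x9F is "Я" in CP866. In ISO-8859-1 the same byte is the control character U+0083, which CP850 has no slot for. The sketch below illustrates this under the assumption that an extracted name contains the byte 0x83; whether a given archive actually produces it is not something the snippet checks, and the exact failure behaviour may differ between iconv implementations.

```shell
# Byte 0x83, written as octal 203 so that any POSIX printf accepts it.
# CP1252 route: 0x83 -> "ƒ" -> CP850 0x9F -> read as CP866 -> "Я".
via_cp1252=$(printf '\203' | iconv -f CP1252 -t CP850 | iconv -f CP866 -t UTF-8)
echo "CP1252 route: $via_cp1252"

# ISO-8859-1 route: 0x83 is U+0083, which CP850 cannot represent,
# so this conversion is expected to report an error.
if printf '\203' | iconv -f ISO-8859-1 -t CP850 >/dev/null 2>&1; then
    echo "ISO-8859-1 route: converted"
else
    echo "ISO-8859-1 route: failed"
fi
```

If this is what happens in practice, any filename containing a capital "Я" would be enough to make the ISO-8859-1 variant fail.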

Getting back to individual characters

           Р  о  с  К  о  с  м  о  с
CP866:     90 AE E1 8A AE E1 AC AE E1

Applying the CP850->CP1252 conversion to these bytes yields C9 AB DF E8 AB DF BC AB DF, exactly the sequence that we observed.
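The byte arithmetic above can be checked with iconv itself. This is a minimal sketch, assuming an iconv that knows the CP850 and CP1252 names and an od that accepts -An -tx1; the variable names are only illustrative.

```shell
# "РосКосмос" as CP866 bytes (90 AE E1 8A AE E1 AC AE E1),
# written as octal escapes so that any POSIX printf accepts them.
cp866_bytes='\220\256\341\212\256\341\254\256\341'

# Read the bytes as CP850, re-encode them as CP1252, and dump the result as hex.
mangled_hex=$(printf "$cp866_bytes" | iconv -f CP850 -t CP1252 | od -An -tx1 | tr -d ' \n')
echo "$mangled_hex"
# expected: c9abdfe8abdfbcabdf
```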

Another useful command would be

 unzip -l РосКосмос.zip | grep -aEv '^Archive:' | iconv -f cp1252 -t cp850 | iconv -f cp866

It prints a list of the files in the archive:

 Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   РосКосмос/ict_inf.pdf
---------                     -------
        0                     1 file

Filtering out the line that starts with Archive: protects the name of the archive itself from being run through the conversion.
