Convert binary encoding that head and Notepad can read to UTF-8
"binary" isn't an encoding (character-set name). iconv needs an encoding name to do its job.
The file utility doesn't give useful information when it doesn't recognize the file format. The file could be UTF-16, for example, without a byte-order mark (BOM), and Notepad reads that. The same applies to UTF-8: head would display it, since your terminal is probably set to UTF-8 encoding and does not care about a BOM. Even if the file is UTF-16, your terminal may display much of it via head, because most of the characters would be ASCII (or even Latin-1), making the "other" byte of each UTF-16 character a null. In either case, the lack of a BOM will (depending on the version of file) confuse it. But other programs may cope, because these file formats are usable with Microsoft Windows as well as with portable applications that run on Windows.
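You can check for a BOM yourself by dumping the first few bytes (a sketch; yourfile is a placeholder):

$ head -c 3 yourfile | od -An -tx1

If the output starts with ff fe, the file is UTF-16LE; fe ff is UTF-16BE; ef bb bf is a UTF-8 BOM. Anything else means there is no BOM and you have to guess the encoding (or use a detector such as chardetect).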
To convert the file to UTF-8, you have to know which encoding it uses, and what iconv calls that encoding. If the file is already UTF-8, adding a BOM (at the beginning) is optional. UTF-16 comes in two flavors, depending on which byte comes first, and there is also UTF-32. iconv -l lists the names:
ISO-10646/UTF-8/
ISO-10646/UTF8/
UTF-7//
UTF-8//
UTF-16//
UTF-16BE//
UTF-16LE//
UTF-32//
UTF-32BE//
UTF-32LE//
UTF7//
UTF8//
UTF16//
UTF16BE//
UTF16LE//
UTF32//
UTF32BE//
UTF32LE//
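The full iconv -l output runs to hundreds of names; the UTF entries above can be pulled out with an ordinary filter (a sketch; glibc's listing may put several names on a line, hence the tr step, while other implementations print one name per line and need only the grep):

$ iconv -l | tr -s ' \t' '\n' | grep -i utf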
"LE" and "BE" refer to little-end and big-end for the byte-order.
Windows uses the "LE" flavors, and iconv
likely assumes that for the flavors lacking "LE" or "BE".
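For a quick test, you can make one sample file in each byte order from the output of date (a sketch; big-end and little-end are just illustrative names, and the source encoding defaults to your locale):

$ date | iconv -t UTF-16BE > big-end
$ date | iconv -t UTF-16LE > little-end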
You can see the difference using an octal dump:
$ od -bc big-end
0000000 000 124 000 150 000 165 000 040 000 101 000 165 000 147 000 040
         \0   T  \0   h  \0   u  \0      \0   A  \0   u  \0   g  \0
0000020 000 061 000 070 000 040 000 060 000 065 000 072 000 060 000 061
         \0   1  \0   8  \0      \0   0  \0   5  \0   :  \0   0  \0   1
0000040 000 072 000 065 000 067 000 040 000 105 000 104 000 124 000 040
         \0   :  \0   5  \0   7  \0      \0   E  \0   D  \0   T  \0
0000060 000 062 000 060 000 061 000 066 000 012
         \0   2  \0   0  \0   1  \0   6  \0  \n
0000072
$ od -bc little-end
0000000 124 000 150 000 165 000 040 000 101 000 165 000 147 000 040 000
          T  \0   h  \0   u  \0      \0   A  \0   u  \0   g  \0      \0
0000020 061 000 070 000 040 000 060 000 065 000 072 000 060 000 061 000
          1  \0   8  \0      \0   0  \0   5  \0   :  \0   0  \0   1  \0
0000040 072 000 065 000 067 000 040 000 105 000 104 000 124 000 040 000
          :  \0   5  \0   7  \0      \0   E  \0   D  \0   T  \0      \0
0000060 062 000 060 000 061 000 066 000 012 000
          2  \0   0  \0   1  \0   6  \0  \n  \0
0000072
Assuming UTF-16LE, you could convert using
iconv -f UTF-16LE// -t UTF-8// <input >output
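You can then check the result with file (a sketch using the little-end sample from above; since its content is pure ASCII, the UTF-8 output is reported as plain text, and the exact wording varies by file version):

$ iconv -f UTF-16LE// -t UTF-8// < little-end > utf8-out
$ file utf8-out
utf8-out: ASCII text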
strings (from binutils) managed to "print the strings of printable characters in files" in a case where both iconv and recode failed, and where file still reported the content as binary data:
$ file -i /tmp/textFile
/tmp/textFile: application/octet-stream; charset=binary
$ chardetect /tmp/textFile
/tmp/textFile: utf-8 with confidence 0.99
$ iconv -f utf-8 -t utf-8 /tmp/textFile -o /tmp/textFile.iconv
$ file -i /tmp/textFile.iconv
/tmp/textFile.iconv: application/octet-stream; charset=binary
$ cp /tmp/textFile /tmp/textFile.recode ; recode utf-8 /tmp/textFile.recode
$ file -i /tmp/textFile.recode
/tmp/textFile.recode: application/octet-stream; charset=binary
$ strings /tmp/textFile > /tmp/textFile.strings
$ file -i /tmp/textFile.strings
/tmp/textFile.strings: text/plain; charset=us-ascii
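Note that strings is lossy: by default it keeps only runs of at least 4 printable characters and drops everything in between, so it is a last resort rather than a real conversion. If shorter fragments matter, the threshold can be lowered (a sketch on the same file):

$ strings -n 1 /tmp/textFile > /tmp/textFile.strings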