Non-ISO extended-ASCII text

It is something that does not look like either UTF-8 or ISO-8859-1. It might be almost anything else, and it may not even be text at all. This type is a kind of fall-back description for anything that does not contain zero bytes.
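For example, here is a minimal way to reproduce the label (the exact wording depends on your version of file(1) and its magic database): write out a byte from the 0x80–0x9F range, which no ISO-8859 variant uses for printable characters:

$ printf 'Gr\x81n\n' > sample.txt   # 0x81 is not valid UTF-8 and is a C1 control in ISO-8859-1
$ file sample.txt
sample.txt: Non-ISO extended-ASCII text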

Even if it actually is a text file (the extension suggests it might be), there is unfortunately no automatic way to find out the encoding, because most encodings share the same range of valid codes. UTF-8 can be told apart with very high confidence, but beyond that it requires manual checking.
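A quick sketch of that UTF-8 check: asking iconv(1) to convert from UTF-8 to UTF-8 fails on the first invalid byte sequence, so a zero exit status is a strong hint that the file really is UTF-8 (file.txt is a placeholder name):

$ iconv -f UTF-8 -t UTF-8 file.txt > /dev/null \
    && echo "looks like valid UTF-8" \
    || echo "not UTF-8"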

First you have to find out what language the file is in, both to get some idea of what is correct content and what is garbled content, and to narrow down the list of possible encodings: there are zillions of encodings, but only a few were ever used for any particular language.

Then you need to try converting the file from each possible encoding and, for each conversion that succeeds technically (which unfortunately will be most of them), view the result and check whether it is correct or not.
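For example, to try one candidate (ISO-8859-15 here, purely as an illustration) and eyeball the result:

$ iconv -f ISO-8859-15 -t UTF-8 file.txt | less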

A spell checker might help you with the review, since an incorrect conversion will produce more spelling errors than the correct one.
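One way to mechanize that, assuming aspell(1) and a dictionary for the file's language (German here) are installed: count the words the spell checker does not recognize, and prefer the candidate encoding with the lowest count:

$ iconv -f CP850 -t UTF-8 file.txt | aspell --lang=de list | wc -l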

For the conversion you can use iconv(1), which on GNU/Linux comes with the libc package, or recode(1), which has more options and better error handling.
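The basic invocations look like this (CP850 is only a placeholder source encoding; note that recode rewrites the file in place unless you pipe through it):

$ iconv -f CP850 -t UTF-8 input.txt > output.txt
$ recode CP850..UTF-8 input.txt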


This won't fit into a comment, so here goes: I too had a strange file on my hands:

$ file systeminfo.txt 
systeminfo.txt: Non-ISO extended-ASCII text

I knew this file was generated by a German Windows XP installation and contained some umlauts, but iconv was not able to convert it to anything sensible:

$ iconv -t UTF-8 systeminfo.txt > systeminfo_utf8.txt 
iconv: illegal input sequence at position 308
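To see which byte iconv choked on, you can dump the area around the reported offset (xxd(1) is just one option; od -c would do as well):

$ xxd -s 300 -l 16 systeminfo.txt   # bytes 300-315, covering offset 308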

But since iconv knows so many encodings, I used a brute-force approach to find a working source encoding:

$ iconv --list | sed 's|//$||' | sort > encodings.list   # glibc lists names as NAME//; strip the slashes
$ for a in $(cat encodings.list); do
    iconv -f "$a" -t UTF-8 systeminfo.txt > /dev/null 2>&1 \
      && echo "ok: $a" || echo "fail: $a"
done | tee result.txt

Then I went through result.txt and looked for the encodings that didn't fail. In my case, -f CP850 -t UTF-8 worked just fine, and the umlauts are still there, only now encoded in UTF-8 :-)
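In shell terms, that review plus the final conversion boils down to:

$ grep '^ok:' result.txt   # list the candidate encodings that converted cleanly
$ iconv -f CP850 -t UTF-8 systeminfo.txt > systeminfo_utf8.txt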
