How can I identify a strange character?
Your file contains two bytes, EB and 0A in hex. It’s likely that the file is using a character set with one byte per character, such as ISO-8859-1; in that character set, EB is ë:
$ printf "\353\n" | iconv -f ISO-8859-1
ë
Other candidates would be δ in code page 437, Ù in code page 850...
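If your iconv knows those code pages (GNU iconv accepts the names CP437 and CP850; other implementations may use different names), you can check them the same way:
$ printf "\353\n" | iconv -f CP437
δ
$ printf "\353\n" | iconv -f CP850
Ù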
od -x’s output is confusing in this case because of endianness; a better option is -t x1, which uses single bytes:
$ printf "\353\n" | od -t x1
0000000 eb 0a
0000002
od -x maps to od -t x2, which reads two bytes at a time; on little-endian systems it outputs the bytes of each word in reverse order.
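For comparison, this is what od -x makes of the same two bytes on a little-endian machine (the word 0aeb is eb and 0a swapped):
$ printf "\353\n" | od -x
0000000 0aeb
0000002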
When you come across a file like this, which isn’t valid UTF-8 (or which makes no sense when interpreted as UTF-8), there’s no fool-proof way to automatically determine its encoding (and character set). Context can help: if it’s a file produced on a Western PC in the last couple of decades, there’s a fair chance it’s encoded in ISO-8859-1, -15 (the Euro variant), or Windows-1252; if it’s older than that, CP-437 and CP-850 are likely candidates. Files from Eastern European, Russian, or Asian systems would use different character sets that I don’t know much about. Then there’s EBCDIC... iconv -l will list all the character sets that iconv knows about, and you can proceed by trial and error from there.
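A crude trial-and-error loop along the following lines can narrow things down; the charset names are the ones GNU iconv accepts, and the output assumes a UTF-8 terminal:
$ for cs in ISO-8859-1 ISO-8859-15 WINDOWS-1252 CP437 CP850; do
>   printf '%-13s' "$cs"; printf '\353\n' | iconv -f "$cs" -t UTF-8
> done
ISO-8859-1   ë
ISO-8859-15  ë
WINDOWS-1252 ë
CP437        δ
CP850        Ù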
(At one point I knew most of CP-437 and ATASCII off by heart, them were the days.)
Note that od is short for octal dump: 005353 is the two bytes as one octal word, od -x gives 0aeb, the same word in hexadecimal, and the actual contents of your file are the two bytes eb and 0a in hexadecimal, in that order. So neither 005353 nor 0aeb can simply be interpreted as a "hex code point".
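For completeness, plain od (octal words, again byte-swapped on a little-endian machine) shows where the 005353 comes from:
$ printf "\353\n" | od
0000000 005353
0000002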
0a is a line feed (LF), and what eb represents depends on your encoding. file is just guessing the encoding; it could be anything. Without any further information about where the file came from, it will be difficult to find out.
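You can ask file for its guess explicitly with file -i (or file --mime-encoding); typical answers for a one-byte-per-character file look like charset=iso-8859-1 or charset=unknown-8bit, but treat them as hints, not verdicts (yourfile below stands for whatever file you are inspecting):
$ file -i yourfile
$ file --mime-encoding yourfile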
It is impossible to guess the charset of a text file with 100% accuracy.
When there is no explicit charset information (e.g. if an HTML file declares a meta charset=... in its head, things are easier), tools like chardet, Firefox, or file -i fall back on heuristics, which are not so bad if the text is long enough.
Below, I demonstrate charset detection with chardet (pip install chardet or apt-get install python-chardet if necessary).
$ echo "in Noël" | iconv -f utf8 -t latin1 | chardet
<stdin>: windows-1252 with confidence 0.73
Once we have a good charset candidate, we can use iconv, recode, or a similar tool to convert the file to your "active" charset (in my case utf-8) and check whether the guess was correct...
iconv -f windows-1252 -t utf-8 file
Some charsets (like iso-8859-3 and iso-8859-1) have many characters in common -- sometimes it is not easy to tell whether we have found the right one...
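For example, the byte eb decodes to the same character under both (assuming your iconv knows both charsets under these names):
$ printf "\353\n" | iconv -f iso-8859-1 -t utf-8
ë
$ printf "\353\n" | iconv -f iso-8859-3 -t utf-8
ë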
That is why it is very important to have charset metadata associated with the relevant text (as in XML, for example).
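For instance, an XML file declares its encoding on its very first line, so there is nothing to guess (data.xml here is just a hypothetical example):
$ head -n 1 data.xml
<?xml version="1.0" encoding="ISO-8859-1"?>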