How can I test the encoding of a text file... Is it valid, and what is it?
The file command makes "best guesses" about the encoding. Use the -i option to make file output MIME-type information, which includes the character set.
Demonstration:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
Here is how I created the files:
$ echo ä > umlaut-utf8.txt
Most systems nowadays default to UTF-8, so echo writes UTF-8 bytes. But convince yourself:
$ hexdump -C umlaut-utf8.txt
00000000 c3 a4 0a |...|
00000003
Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding
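The arithmetic checks out: ä is code point U+00E4, and UTF-8 encodes code points in this range as two bytes of the form 110xxxxx 10xxxxxx. Filling in the eleven bits 000 1110 0100 gives 11000011 10100100, i.e. c3 a4. If your shell's printf supports \u escapes (bash 4.2+ and zsh do; that is an assumption about your setup) and you are in a UTF-8 locale, you can verify this without creating a file:
$ printf '\u00e4\n' | hexdump -C
00000000 c3 a4 0a |...|
00000003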
Convert to the other encodings:
$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt
Check the hex dump:
$ hexdump -C umlaut-iso88591.txt
00000000 e4 0a |..|
00000002
$ hexdump -C umlaut-utf16.txt
00000000 ff fe e4 00 0a 00 |......|
00000006
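The leading ff fe is the UTF-16 byte-order mark (BOM), which iconv's utf16 target prepends to signal little-endian order; ä itself becomes e4 00 and the newline 0a 00. As a quick sanity check you can round-trip the file (with -f utf16, iconv uses the BOM to pick the byte order and does not copy it to the output):
$ iconv -f utf16 -t utf8 umlaut-utf16.txt | hexdump -C
00000000 c3 a4 0a |...|
00000003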
Create something "invalid" by mixing all three:
$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt
What file says:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt: application/octet-stream; charset=binary
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
Without -i:
$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt: data
umlaut-utf16.txt: Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt: UTF-8 Unicode text
The file command has no notion of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans we might recognize that a file is a text file with some umlauts in a "wrong" encoding; a computer would need some sort of artificial intelligence to do the same. One might argue that the heuristics of file are a primitive form of artificial intelligence, but even then it is a very limited one.
Here is more information about the file command: http://www.linfo.org/file_command.html
It isn't always possible to find out for sure what the encoding of a text file is. For example, the byte sequence \303\275 (c3 bd in hexadecimal) could be ý in UTF-8, or Ã½ in latin1, or Ă˝ in latin2, or 羸 in BIG-5, and so on.
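You can watch this ambiguity with iconv: write the two bytes to a scratch file (bytes.bin is just a throwaway name here) and decode it under different source encodings. Every interpretation succeeds, each producing different text:
$ printf '\303\275\n' > bytes.bin
$ iconv -f utf-8 -t utf-8 bytes.bin
ý
$ iconv -f latin1 -t utf-8 bytes.bin
Ã½
$ iconv -f latin2 -t utf-8 bytes.bin
Ă˝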
Some encodings have invalid byte sequences, so it's possible to rule them out for sure. This is true in particular of UTF-8: most texts in most 8-bit encodings are not valid UTF-8. You can test for valid UTF-8 with isutf8 from moreutils or with iconv -f utf-8 -t utf-8 >/dev/null (which exits with a nonzero status on invalid input), amongst others.
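For example, checking the files created above via iconv's exit status (a sketch; on GNU systems iconv exits nonzero and prints a diagnostic when the input contains an invalid sequence, and isutf8 behaves the same way):
$ iconv -f utf-8 -t utf-8 umlaut-utf8.txt >/dev/null 2>&1 && echo valid || echo invalid
valid
$ iconv -f utf-8 -t utf-8 umlaut-mixed.txt >/dev/null 2>&1 && echo valid || echo invalid
invalid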
There are tools that try to guess the encoding of a text file. They can make mistakes, but they often work in practice as long as you don't deliberately try to fool them.

- file
- Perl's Encode::Guess (part of the standard distribution) tries successive encodings on a byte string and returns the first encoding in which the string is valid text (see the sketch after this list).
- Enca is an encoding guesser and converter. You can give it a language name and text that you presume is in that language (the supported languages are mostly East European languages), and it tries to guess the encoding.
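Here is a minimal Encode::Guess sketch (assuming Perl with its standard library is available). guess_encoding returns an Encode::Encoding object when exactly one candidate matches, and a diagnostic string otherwise; with the default suspects (ASCII, UTF-8, and BOM-marked UTF-16/32) the UTF-8 file from above is identified unambiguously, but adding 8-bit suspects like latin1 would make it ambiguous:
$ perl -MEncode::Guess -0777 -ne \
    'my $enc = guess_encoding($_); print ref $enc ? $enc->name : "guess failed: $enc", "\n"' \
    umlaut-utf8.txt
utf8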
If there is metadata (HTML/XML charset=, TeX \inputenc, Emacs -*-coding-*-, …) in the file, advanced editors like Emacs or Vim are often able to parse that metadata. That's not easy to automate from the command line, though.
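Still, a crude first pass from the shell is possible. This sketch is not a real parser (page.html is a placeholder name); it merely greps the beginning of a file for an HTML/XML-style charset declaration so a human can inspect it. For a page that declares UTF-8, it prints:
$ head -c 1024 page.html | grep -aio 'charset=[a-z0-9_-]*'
charset=utf-8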