How to auto detect text file encoding?
Try the chardet Python module, which is available on PyPI:
pip install chardet
Then run chardetect myfile.txt.
Chardet is based on the detection code used by Mozilla, so it should give reasonable results, provided that the input text is long enough for statistical analysis. Do read the project documentation.
As mentioned in the comments, it is quite slow, but some distributions also ship the original C++ version, as @Xavier found in https://superuser.com/a/609056. There is also a Java version.
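A quick way to try it out (the file name and contents below are just illustrative):

```shell
# Create a small UTF-8 sample file, then let chardetect guess its
# encoding. chardetect is installed alongside "pip install chardet".
printf 'caf\303\251 na\303\257ve r\303\251sum\303\251\n' > myfile.txt
chardetect myfile.txt
```

It prints the file name, the detected encoding, and a confidence score; the confidence is low for very short inputs, which is why longer samples give better results.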
I would use this simple command:
encoding=$(file -bi myfile.txt)
Or if you want just the actual character set (like utf-8):
encoding=$(file -b --mime-encoding myfile.txt)
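The detected charset can be fed straight to iconv to convert the file; a minimal sketch, assuming a Latin-1 input file (the file names are examples):

```shell
# Write a Latin-1 sample ("café", with byte 0xE9 for "é"), detect its
# charset with file(1), then convert it to UTF-8 with iconv.
printf 'caf\351\n' > input.txt
enc=$(file -b --mime-encoding input.txt)   # typically "iso-8859-1"
iconv -f "$enc" -t utf-8 input.txt > output.txt
file -b --mime-encoding output.txt         # should now report "utf-8"
```

Note that file(1) guesses from the bytes, so on ambiguous 8-bit input it may report a related charset (e.g. iso-8859-1 vs. windows-1252); check the result before converting anything important.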
On Debian-based Linux, the uchardet package (Debian / Ubuntu) provides a command-line tool. Its package description reads:
universal charset detection library - cli utility

uchardet is a C language binding of the original C++ implementation of the universal charset detection library by Mozilla.

uchardet is an encoding detector library which takes a sequence of bytes in an unknown character encoding, without any additional information, and attempts to determine the encoding of the text.

The original code of universalchardet is available at
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
Techniques used by universalchardet are described at
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html