Is there a convenient way to classify files as "binary" or "text"?
If you ask file for just the MIME type, you'll get many different ones, such as text/x-shellscript and application/x-executable, but I imagine that checking only the "text" part should give good results. For example (-b omits the filename from the output):
file -b --mime-type filename | sed 's|/.*||'
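Wrapped in a function, the same check can drive an if or a loop. This is only a sketch: is_text is a made-up name, and it classifies purely on the major MIME type reported by file, so anything not reported as text/* counts as non-text (some versions of file report an empty file as inode/x-empty, for instance).

```shell
# Hypothetical helper: succeed (exit 0) if file(1) reports a text/* MIME type.
# Everything else, including types such as inode/x-empty, counts as non-text.
is_text() {
  case $(file -b --mime-type -- "$1") in
    text/*) return 0 ;;
    *)      return 1 ;;
  esac
}
```

Usage would be something like: is_text ./script.sh && echo "looks like text"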
Another approach would be to use isutf8 from the moreutils collection. It exits with 0 if the file is valid UTF-8 or ASCII; otherwise it stops at the first invalid sequence, prints an error message (silence it with -q), and exits with 1.
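Since only the exit status matters, a small wrapper is enough to label a batch of files. A rough sketch, assuming isutf8 from moreutils is installed; classify_utf8 is a made-up name, and any non-zero exit status is simply treated as "binary".

```shell
# Hypothetical wrapper: print "text" or "binary" for each file argument,
# based solely on the exit status of isutf8 (0 = valid UTF-8 or ASCII).
classify_utf8() {
  for f do
    if isutf8 -q "$f"; then
      printf '%s: text\n' "$f"
    else
      printf '%s: binary\n' "$f"
    fi
  done
}
```

Calling it as classify_utf8 ./* keeps names that start with a dash from being taken as options.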
If you like the heuristic used by GNU grep, you could use it:
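isbinary() {
  # GNU grep >= 3.5 reports "grep: FILE: binary file matches" on stderr;
  # older versions print "Binary file FILE matches" on stdout.
  LC_MESSAGES=C grep -Hm1 '^' < "${1-$REPLY}" 2>&1 |
    grep -Eq '^(Binary file |grep: .*: binary file matches)'
}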
It searches for NUL bytes in the first buffer read from the file (a few kilobytes for a regular file, but possibly much less for a pipe, a socket, or some devices like /dev/random). In UTF-8 locales, it also flags byte sequences that don't form valid UTF-8 characters. The function assumes LC_ALL is not set to a non-English locale, since LC_ALL overrides LC_MESSAGES and the message would then be translated.
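The NUL-byte part of that heuristic can be approximated without grep and without depending on its messages. This is a sketch only, not what grep does internally: has_nul is a made-up name, the 32 KiB limit is an arbitrary stand-in for grep's buffer size, it needs a head that supports -c, and it does not check for invalid UTF-8.

```shell
# Hypothetical helper: succeed (exit 0) if a NUL byte occurs in the first
# 32 KiB of the file. The 32 KiB limit is an assumption chosen for the sketch.
# The body runs in a subshell so the variables do not leak into the caller.
has_nul() (
  total=$(head -c 32768 < "$1" | wc -c)
  kept=$(head -c 32768 < "$1" | tr -d '\000' | wc -c)
  # Fewer bytes after deleting NULs means at least one NUL was present.
  [ "$((kept))" -lt "$((total))" ]
)
```

An empty file has no NUL bytes, so it is reported as non-binary.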
The ${1-$REPLY} form allows you to use it as a zsh glob qualifier:
ls -ld -- *(.+isbinary)
would list the binary files.