identify files with non-ASCII or non-printable characters in file names
Assuming that "foreign" means "not an ASCII character", you can use find with a pattern that matches every file whose name contains a character outside the printable ASCII range:
LC_ALL=C find . -name '*[! -~]*'
(The space is the first printable character listed on http://www.asciitable.com/, and ~ is the last.)
The LC_ALL=C setting is required (more precisely, LC_CTYPE=C and LC_COLLATE=C), otherwise the character range is interpreted incorrectly. See also the manual page glob(7). Since LC_ALL=C causes find to interpret strings as ASCII, it prints multi-byte characters (such as π) as question marks. To fix this, pipe the output through some program (e.g. cat) or redirect it to a file.
Instead of spelling out the character range, [:print:] can also be used to select printable characters. Be sure to set the C locale, or you get seemingly arbitrary behavior.
Example:
$ touch $(printf '\u03c0') "$(printf 'x\ty')"
$ ls -F
dir/ foo foo.c xrestop-0.4/ xrestop-0.4.tar.gz π
$ find -name '*[! -~]*' # this is broken (LC_COLLATE=en_US.UTF-8)
./x?y
./dir
./π
... (a lot more)
./foo.c
$ LC_ALL=C find . -name '*[! -~]*'
./x?y
./??
$ LC_ALL=C find . -name '*[! -~]*' | cat
./x y
./π
$ LC_ALL=C find . -name '*[![:print:]]*' | cat
./x y
./π
If you translate each file name using tr -d '\200-\377' and compare the result with the original name, then any file name that contains special characters will not be the same.
(The above assumes that by "foreign" you mean non-ASCII.)
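A minimal sketch of that check, assuming a POSIX shell, a byte-oriented tr (such as GNU tr), and file names without embedded newlines:
find . -type f | while IFS= read -r filename; do
    # Delete every byte in the range \200-\377, i.e. every non-ASCII byte.
    stripped="$(printf '%s\n' "$filename" | tr -d '\200-\377')"
    # If any bytes were deleted, the name contained non-ASCII characters.
    [ "$filename" = "$stripped" ] || printf '%s\n' "$filename"
done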
More generally, you can use tr to delete any foreign character from a filename and compare the result with the original filename to see whether it contained foreign characters:
find . -type f > filenames
while IFS= read -r filename; do
    # Keep alphanumerics, whitespace and punctuation; delete everything else.
    stripped="$(printf '%s\n' "$filename" | tr -d -C '[:alnum:][:space:][:punct:]')"
    # If the stripped name differs, the original contained foreign characters.
    test "$filename" = "$stripped" || printf '%s\n' "$filename"
done < filenames
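Note one difference from the find pattern above: tab is part of [:space:], so this loop accepts the x<tab>y file from the example, while '*[! -~]*' reports it. Adjust the tr set if such whitespace should count as foreign too.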