How to grep for unicode � in a bash script
grep is the wrong tool for the job.
You see the � (U+FFFD REPLACEMENT CHARACTER) not because it is literally in the file content, but because you looked at a binary file with a tool that is supposed to handle only text-based input. The standard way to handle invalid input (i.e., random binary data) is to replace everything that is not valid in the current locale (most probably UTF-8) with U+FFFD before it hits the screen.
That means it is very likely that a literal \xEF\xBF\xBD (the UTF-8 byte sequence for the U+FFFD character) never occurs in the file. grep is completely right in telling you there is none.
One way to detect whether a file contains some unknown binary is with the file(1) command:
$ head -c 100 /dev/urandom > rubbish.bin
$ file rubbish.bin
rubbish.bin: data
For any unknown file type it will simply say data. Try
$ file out.txt | grep '^out.txt: data$'
to check whether the file really contains any arbitrary binary and thus most likely rubbish.
If you want to make sure that out.txt is a UTF-8 encoded text file only, you can alternatively use iconv:
$ iconv -f utf-8 -t utf-16 out.txt >/dev/null
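What matters in that command is the exit status: iconv returns non-zero when it meets a byte sequence that is not valid UTF-8. A minimal sketch that wraps it in a test follows; the function name is_valid_utf8 is made up for illustration, and the error output is silenced on the assumption that only the yes/no answer is wanted.

```shell
# Hypothetical helper: succeeds only if the whole file decodes as UTF-8.
is_valid_utf8() {
    # The converted output is discarded; only iconv's exit status is used.
    iconv -f UTF-8 -t UTF-16 "$1" >/dev/null 2>&1
}

# Usage sketch:
#   if is_valid_utf8 out.txt; then cat out.txt; else echo "not valid UTF-8"; fi
```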
TL;DR:
grep -axv '.*' out.txt
Long answer
Both present answers are extremely misleading and basically wrong.
To test, get these two files (from a very well regarded developer, Markus Kuhn):
$ wget https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
$ wget https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
Demo
The first, UTF-8-demo.txt, is a file designed to show how well UTF-8 is able to present many languages, math, braille and many other useful types of characters. Take a look with a text editor (one that understands UTF-8) and you will see a lot of examples and no �.
The test that one answer proposes, limiting the character range to \x00-\x7F, will reject almost everything inside this file.
That is very wrong and will not remove any �, as there is none in that file.
Using the test recommended in that answer will remove 72.5 % of the file:
$ grep -oP "[^\x00-\x7F]" UTF-8-demo.txt | tr -d '\n' | wc -c
10192
$ cat UTF-8-demo.txt | wc -c
14058
That is (for most practical purposes) the whole file. A file very well designed to show perfectly valid characters.
Test
The second file is designed to exercise many border cases to confirm that UTF-8 readers are doing a good job. It contains many byte sequences that will cause a '�' to be shown. But the other answer's recommendation (the selected one), to use file, fails grossly with this file. Removing only the zero bytes (\0), which technically are valid ASCII, and the \x7f bytes (DEL, delete), which are clearly ASCII characters as well, is enough for the file command to accept the whole file as text:
$ cat UTF-8-test.txt | tr -d '\0\177' > a.txt
$ file a.txt
a.txt: Non-ISO extended-ASCII text, with LF, NEL line terminators
Not only does file fail to detect the many incorrect characters, it also fails to detect and report that this is a UTF-8 encoded file.
And yes, file is able to detect and report UTF-8 encoded text:
$ echo "ééakjfhhjhfakjfhfhaéá" | file -
/dev/stdin: UTF-8 Unicode text
Also, file fails to report most of the control characters in the range 1 to 31 as ASCII. It reports some ranges as data:
$ printf '%b' "$(printf '\\U%x' {1..6})" | file -
/dev/stdin: data
Others as ASCII text:
$ printf '%b' "$(printf '\\U%x' 7 {9..12})" | file -
/dev/stdin: ASCII text
So is the printable character range (with a trailing newline):
$ printf '%b' "$(printf '\\U%x' {32..126} 10)" | file -
/dev/stdin: ASCII text
But some ranges may cause weird results:
$ printf '%b' "$(printf '\\U%x' {14..26})" | file -
/dev/stdin: Atari MSA archive data, 4113 sectors per track, starting track: 5141, ending track: 5655
The program file is not a tool to detect text, but to detect magic numbers in executable programs or files.
The ranges that file detects, and the corresponding types it reported in my tests, were:
One byte values, mostly ascii:
{1..6} {14..26} {28..31} 127   :data
{128..132} {134..159}          :Non-ISO extended-ASCII text
133                            :ASCII text, with LF, NEL line terminators
27                             :ASCII text, with escape sequences
13                             :ASCII text, with CR, LF line terminators
8                              :ASCII text, with overstriking
7 {9..12} {32..126}            :ASCII text
{160..255}                     :ISO-8859 text
UTF-8 encoded ranges:
{1..6} {14..26} {28..31} 127   :data
27                             :ASCII text, with escape sequences
13                             :ASCII text, with CR, LF line terminators
8                              :ASCII text, with overstriking
7 {9..12} {32..126}            :ASCII text
{128..132} {134..159}          :UTF-8 Unicode text
133                            :UTF-8 Unicode text, with LF, NEL line terminators
{160..255}                     :UTF-8 Unicode text
{256..5120}                    :UTF-8 Unicode text
One possible solution lies below.
Previous Answer.
The Unicode value for the character you are posting is:
$ printf '%x\n' "'�"
fffd
Yes, that is the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD). It is used to replace any invalid byte sequence found in the text. It is a "visual aid", not a character actually present in the file.
To find and list every full line that contains invalid Unicode characters, use:
grep -axv '.*' out.txt
but if you only want to detect if any character is invalid, use:
grep -qaxv '.*' out.txt; echo $?
If the result is 1 the file is clean; otherwise it will be 0.
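As a sketch of how that exit status can drive the if/else from the question: -a treats the file as text, -x requires the pattern to match the whole line, and -v inverts the match, so only lines that '.*' cannot fully match (lines with undecodable bytes) are selected. The names has_invalid_utf8, report and UTF8_LOCALE are made up for illustration, and the code assumes GNU grep and an installed C.UTF-8 locale; set UTF8_LOCALE to something like en_US.UTF-8 if yours differs.

```shell
# Hypothetical helper: exit 0 if any line fails to decode in a UTF-8 locale.
# Assumption: GNU grep, and the locale named here is actually installed.
has_invalid_utf8() {
    LC_ALL="${UTF8_LOCALE:-C.UTF-8}" grep -qaxv '.*' "$1"
}

# Hypothetical wrapper mirroring the question's logic.
report() {
    if has_invalid_utf8 "$1"; then
        echo "working"
    else
        cat "$1"
    fi
}
```

Calling report out.txt would then print "working" when the file holds bytes that a terminal would render as �, and dump the file otherwise.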
If what you were asking was how to find the � character, then use this:
➤ a='Basically, if the file "out.txt" contains "�" anywhere in the file I'
➤ echo "$a" | grep -oP $(printf %b \\Ufffd)
�
Or, if your system processes UTF-8 text correctly, simply:
➤ echo "$a" | grep -oP '�'
�
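If the shell or locale at hand does not cope with the literal character, one alternative is to search for the raw UTF-8 bytes of U+FFFD (EF BF BD) instead. A rough sketch, assuming GNU grep; the function name contains_literal_fffd is hypothetical:

```shell
# Hypothetical helper: exit 0 if the file literally contains U+FFFD,
# i.e. the three bytes EF BF BD (octal 357 277 275).
contains_literal_fffd() {
    # LC_ALL=C makes grep compare bytes; -F takes the pattern as a fixed string.
    LC_ALL=C grep -aqF "$(printf '\357\277\275')" "$1"
}
```

A file with stray invalid bytes, which merely displays as �, would not match here; only a file in which the replacement character was actually written does.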
This very early answer was for the original post which was:
How to grep for unicode � in a bash script
if grep -q "�" out.txt
then
    echo "working"
else
    cat out.txt
fi
Basically, if the file "out.txt" contains "�" anywhere in the file I would like it to echo "working" AND if the file "out.txt" does NOT contain "�" anywhere in the file then I would like it to cat out.txt
Try grep -oP "[^\x00-\x7F]" with an if .. then statement as follows:
if grep -oP "[^\x00-\x7F]" file.txt; then
    echo "grep found something ..."
else
    echo "Nothing found!"
fi
Explanation:
-P, --perl-regexp: PATTERN is a Perl regular expression
-o, --only-matching: show only the part of a line matching PATTERN
[^\x00-\x7F] is a regex to match a single non-ASCII character.
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
In bash:
LC_COLLATE=C grep -o '[^ -~]' file
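A small sketch of the same bracket expression used as a yes/no test. The function name has_non_printable_ascii is made up, and LC_ALL=C is used here instead of LC_COLLATE=C (my choice, so that an exported LC_ALL cannot override the setting and grep compares plain bytes). Note that the range from space to tilde excludes tabs and other control bytes too, so those count as matches along with non-ASCII bytes.

```shell
# Hypothetical helper: exit 0 if the file has any byte outside ' ' .. '~'.
# Tabs and other control bytes match as well, not only non-ASCII bytes.
has_non_printable_ascii() {
    LC_ALL=C grep -aq '[^ -~]' "$1"
}

# Usage sketch:
#   if has_non_printable_ascii file; then echo "found something"; fi
```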