How to grep for unicode � in a bash script

grep is the wrong tool for the job.

You see the � (U+FFFD, REPLACEMENT CHARACTER) not because it is literally in the file content, but because you looked at a binary file with a tool that is meant to handle only text. The standard way to handle invalid input (i.e., random binary data) is to replace every byte sequence that is not valid in the current locale (most probably UTF-8) with U+FFFD before it reaches the screen.

That means it is very likely that the literal sequence \xEF\xBF\xBD (the UTF-8 encoding of U+FFFD) never occurs in the file, and grep is completely right in telling you there is none.
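You can verify this yourself: write a raw invalid byte into a file (demo.txt is an illustrative name). A UTF-8 terminal will display that byte as �, yet the literal three-byte sequence \xEF\xBF\xBD is nowhere in the file:

```shell
printf 'abc \377 def\n' > demo.txt    # \377 (0xFF) is never valid in UTF-8

# search for the literal U+FFFD byte sequence
if grep -qa "$(printf '\357\277\275')" demo.txt; then
    echo "literal U+FFFD found"
else
    echo "no literal U+FFFD in the file"    # this is what you get
fi
```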

One way to detect whether a file contains some unknown binary is with the file(1) command:

$ head -c 100 /dev/urandom > rubbish.bin
$ file rubbish.bin
rubbish.bin: data

For any unknown file type it will simply say data. Try

$ file out.txt | grep '^out.txt: data$'

to check whether the file really contains any arbitrary binary and thus most likely rubbish.
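As a hedged alternative for scripts, GNU file can print just the encoding with -b --mime-encoding (the file names below are illustrative); arbitrary binary typically comes back as binary, plain text as us-ascii or utf-8:

```shell
head -c 100 /dev/urandom > rubbish.bin
file -b --mime-encoding rubbish.bin    # typically prints: binary

printf 'plain text\n' > plain.txt
file -b --mime-encoding plain.txt      # typically prints: us-ascii
```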

If you want to make sure that out.txt is a purely UTF-8 encoded text file, you can alternatively use iconv:

$ iconv -f utf-8 -t utf-16 out.txt >/dev/null
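iconv stops with a non-zero exit status at the first invalid byte sequence, so a script can simply branch on it. A minimal sketch (sample.txt is an illustrative name; the conversion output itself is discarded):

```shell
printf 'valid text plus one bad byte: \377\n' > sample.txt

if iconv -f utf-8 -t utf-16 sample.txt >/dev/null 2>&1; then
    echo "sample.txt is valid UTF-8"
else
    echo "sample.txt contains invalid UTF-8"    # this branch runs here
fi
```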

TL;DR:

grep -axv '.*' out.txt 

long answer

Both of the other answers are extremely misleading and basically wrong.

To test, get these two files (from a very well regarded developer, Markus Kuhn):

$ wget https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
$ wget https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

Demo

The first file, UTF-8-demo.txt, is designed to show how well UTF-8 is able to represent many languages, math, Braille, and many other useful kinds of characters. Take a look with a text editor (one that understands UTF-8) and you will see a lot of examples and no �.

The test that one answer proposes, limiting the character range to \x00-\x7F, will reject almost everything inside this file.
That is very wrong, and it will not remove any �, because there is none in that file.

Using the test recommended in that answer removes 72.5 % of the file:

$ grep -oP "[^\x00-\x7F]" UTF-8-demo.txt | tr -d '\n' | wc -c
10192
$ cat UTF-8-demo.txt | wc -c
14058

That is (for most practical purposes) the whole file, a file very well designed to show perfectly valid characters.

Test

The second file is designed to exercise several border cases to confirm that UTF-8 readers are doing a good job. It contains many byte sequences that will cause a '�' to be shown. But the other answer's recommendation (the selected one) to use file fails grossly with this file. Removing only a zero byte (\0) and a \x7f (DEL) byte, both of which are technically valid ASCII characters, is enough to make the whole file acceptable to the file command:

$ cat UTF-8-test.txt | tr -d '\0\177' > a.txt
$ file a.txt 
a.txt: Non-ISO extended-ASCII text, with LF, NEL line terminators

Not only does file fail to detect the many incorrect characters, it also fails to detect and report that this is a UTF-8 encoded file.

And yes, file is able to detect and report UTF-8 encoded text:

$ echo "ééakjfhhjhfakjfhfhaéá" | file -
/dev/stdin: UTF-8 Unicode text

Also, file fails to report most of the control characters in the range 1 to 31 as ASCII. It reports some ranges as data:

$ printf '%b' "$(printf '\\U%x' {1..6})" | file -
/dev/stdin: data

Others as ASCII text:

$ printf '%b' "$(printf '\\U%x' 7 {9..12})" | file -
/dev/stdin: ASCII text

And the printable character range (with newlines) also as ASCII text:

$ printf '%b' "$(printf '\\U%x' {32..126} 10)" | file -
/dev/stdin: ASCII text

But some ranges may cause weird results:

$ printf '%b' "$(printf '\\U%x' {14..26})" | file -
/dev/stdin: Atari MSA archive data, 4113 sectors per track, starting track: 5141, ending track: 5655

The file program is not a tool to detect text, but a tool to detect magic numbers in executable programs and other files.

The ranges that file detects, and the corresponding types it reports, as I found them, were:

  • One-byte values, mostly ASCII:

    {1..6} {14..26} {28..31} 127   :data
    {128..132} {134..159}          :Non-ISO extended-ASCII text
    133                            :ASCII text, with LF, NEL line terminators
    27                             :ASCII text, with escape sequences
    13                             :ASCII text, with CR, LF line terminators
    8                              :ASCII text, with overstriking
    7 {9..12} {32..126}            :ASCII text
    {160..255}                     :ISO-8859 text
    
  • UTF-8 encoded ranges:

    {1..6} {14..26} {28..31} 127   :data
    27                             :ASCII text, with escape sequences
    13                             :ASCII text, with CR, LF line terminators
    8                              :ASCII text, with overstriking
    7 {9..12} {32..126}            :ASCII text
    {128..132} {134..159}          :UTF-8 Unicode text
    133                            :UTF-8 Unicode text, with LF, NEL line terminators
    {160..255}                     :UTF-8 Unicode text
    {256..5120}                    :UTF-8 Unicode text
    

One possible solution lies below.


Previous Answer.

The Unicode value for the character you are posting is:

$ printf '%x\n' "'�"
fffd

Yes, that is Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD). It is the character used to replace any invalid Unicode sequence found in the text: a "visual aid", not a real character stored in the file. To find and list every full line that contains invalid Unicode sequences, use:

grep -axv '.*' out.txt

This works because, in a UTF-8 locale, the regex . matches only valid characters, so -v (invert) combined with -x (whole-line match) selects exactly the lines that contain some invalid byte sequence; -a forces grep to treat the file as text.

But if you only want to detect whether any character is invalid, use:

grep -qaxv '.*' out.txt; echo $?

If the result is 1 the file is clean; otherwise it will be 0.
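Wired into the asker's original if/then, this becomes (a sketch; out.txt is created here only to make the example self-contained):

```shell
printf 'all good here\n' > out.txt    # illustrative clean file

if grep -qaxv '.*' out.txt; then
    echo "working"    # some line contains bytes invalid in the current locale
else
    cat out.txt       # the whole file is valid text, so print it
fi
```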


If what you were asking was how to find the character, then use this:

➤ a='Basically, if the file "out.txt" contains "�" anywhere in the file I'
➤ echo "$a" | grep -oP $(printf %b \\Ufffd)
�

Or, if your system processes UTF-8 text correctly, simply:

➤ echo "$a" | grep -oP '�'
�

This very early answer was for the original post, which was:

How to grep for unicode � in a bash script

if grep -q "�" out.txt
    then
        echo "working"
    else
        cat out.txt
fi

Basically, if the file "out.txt" contains "�" anywhere in the file I would like it to echo "working" AND if the file "out.txt" does NOT contain "�" anywhere in the file then I would like it to cat out.txt

Try

grep -oP "[^\x00-\x7F]"

with an if .. then statement as follows:

if grep -oP "[^\x00-\x7F]" file.txt; then
    echo "grep found something ..."
else
    echo "Nothing found!"
fi

Explanation:

  • -P, --perl-regexp: PATTERN is a Perl regular expression
  • -o, --only-matching: show only the part of a line matching PATTERN
  • [^\x00-\x7F] is a regex to match a single non-ASCII character.
  • [[:ascii:]] - matches a single ASCII char
  • [^[:ascii:]] - matches a single non-ASCII char

Or, in plain bash, without -P:

LC_COLLATE=C grep -o '[^ -~]' file
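As a sketch of what that matches (forcing the whole locale to C with LC_ALL makes grep operate byte-wise, so each byte of a multi-byte character is reported separately; t.txt is an illustrative name):

```shell
printf 'caf\303\251\n' > t.txt    # "café" in UTF-8; é is the two bytes C3 A9

# in the C locale each of those two bytes is outside the printable range [ -~]
LC_ALL=C grep -o '[^ -~]' t.txt | wc -l    # prints 2
```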