How to grep for unicode � in a bash script

grep is the wrong tool for the job.

You see the � U+FFFD REPLACEMENT CHARACTER not because it’s literally in the file content, but because you looked at a binary file with a tool that is supposed to handle only text-based input. The standard way to handle invalid input (i.e., random binary data) is to replace everything that is not valid in the current locale (most probably UTF-8) with U+FFFD before it hits the screen.

That means it is very likely that a literal \xEF\xBF\xBD (the UTF-8 byte sequence for the U+FFFD character) never occurs in the file. grep is completely right in telling you there is none.

One way to detect whether a file contains some unknown binary is with the file(1) command:

$ head -c 100 /dev/urandom > rubbish.bin
$ file rubbish.bin
rubbish.bin: data

For any unknown file type it will simply say data. Try

$ file out.txt | grep '^out.txt: data$'

to check whether the file really contains any arbitrary binary and thus most likely rubbish.

If you only want to make sure that out.txt is a valid UTF-8 encoded text file, you can use iconv instead, which exits with a non-zero status as soon as it meets an invalid byte sequence:

$ iconv -f utf-8 -t utf-16 out.txt >/dev/null
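A minimal sketch of how the iconv check could drive a script, relying only on iconv's exit status. The helper name is_valid_utf8 and the two sample file names are made up for illustration and are not part of the original answer.

```shell
# Hypothetical helper: succeeds only if the whole file decodes as UTF-8.
is_valid_utf8() {
    iconv -f utf-8 -t utf-16 "$1" >/dev/null 2>&1
}

# Two throwaway sample files (names are illustrative only).
printf 'caf\303\251\n' > valid-sample.txt    # "café", with é encoded as C3 A9
printf 'caf\351\n'     > invalid-sample.txt  # lone byte E9, not valid UTF-8

for f in valid-sample.txt invalid-sample.txt; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: NOT valid UTF-8"
    fi
done
```

Redirecting stderr as well keeps iconv's "illegal input sequence" message out of the script output; drop the 2>&1 if you want to see the byte offset it reports.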

TL;DR:

grep -axv '.*' out.txt

long answer

Both present answers are extremely misleading and basically wrong.

To test, get these two files (from a very well-regarded developer, Markus Kuhn):

$ wget https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
$ wget https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

Demo

The first, UTF-8-demo.txt, is a file designed to show how well UTF-8 is able to present many languages, math, braille and many other useful types of characters. Take a look with a text editor (one that understands UTF-8) and you will see a lot of examples and no �.

The test that one answer proposes, limiting the character range to \x00-\x7F, will reject almost everything inside this file.
That is very wrong and will not remove any �, as there is none in that file.

Using the test recommended in that answer would flag 72.5 % of the file's bytes:

$ grep -oP "[^\x00-\x7F]" UTF-8-demo.txt | tr -d '\n' | wc -c
10192
$ cat UTF-8-demo.txt | wc -c
14058

That is (for most practical purposes) the whole file, a file carefully designed to show perfectly valid characters.
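The 72.5 % figure is simply 10192 / 14058. As a rough sketch, the same ratio can be computed for any non-empty file by counting the bytes above 0x7F; the function name non_ascii_share and the tiny sample file are assumptions for illustration, and tr is used here instead of grep -oP so that the count does not depend on the locale.

```shell
# Hypothetical helper: percentage of bytes in a file that are outside ASCII.
# Assumes the file is not empty (otherwise awk would divide by zero).
non_ascii_share() {
    total=$(wc -c < "$1")
    # Delete every byte in the range 0x00-0x7F and count what is left.
    high=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)
    awk -v h="$high" -v t="$total" 'BEGIN { printf "%.1f%%\n", 100 * h / t }'
}

# The figures quoted above: 10192 non-ASCII bytes out of 14058.
awk 'BEGIN { printf "%.1f%%\n", 100 * 10192 / 14058 }'   # prints 72.5%

# A three-byte sample: "a" followed by "é" (C3 A9), so 2 of 3 bytes are non-ASCII.
printf 'a\303\251' > share-sample.txt
non_ascii_share share-sample.txt                         # prints 66.7%
```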

Test

The second file is designed to exercise several border cases to confirm that UTF-8 readers are doing a good job. It contains many byte sequences that will cause a '�' to be shown. But the other answer's recommendation (the selected one), to use file, fails grossly with this file. Removing only the zero byte (\0) (which technically is valid ASCII) and the \x7f byte (DEL, delete) (which is clearly an ASCII character as well) is enough to make the file command accept the whole file as text:

$ cat UTF-8-test.txt | tr -d '\0\177' > a.txt
$ file a.txt 
a.txt: Non-ISO extended-ASCII text, with LF, NEL line terminators

Not only does file fail to detect the many invalid sequences, it also fails to detect and report that this is a UTF-8 encoded file.

And yes, file is able to detect and report UTF-8 encoded text:

$ echo "ééakjfhhjhfakjfhfhaéá" | file -
/dev/stdin: UTF-8 Unicode text

Also, file does not report most of the control characters in the range 1 to 31 as ASCII. It reports some ranges as data:

$ printf '%b' "$(printf '\\U%x' {1..6})" | file -
/dev/stdin: data

Others as ASCII text:

$ printf '%b' "$(printf '\\U%x' 7 {9..12})" | file -
/dev/stdin: ASCII text

As well as the printable character range (with a newline):

$ printf '%b' "$(printf '\\U%x' {32..126} 10)" | file -
/dev/stdin: ASCII text

But some ranges may cause weird results:

$ printf '%b' "$(printf '\\U%x' {14..26})" | file -
/dev/stdin: Atari MSA archive data, 4113 sectors per track, starting track: 5141, ending track: 5655

The program file is not a tool to detect text, but to detect magic numbers in executable programs or files.

The ranges that file detects, and the corresponding types it reported when I tested, were:

One-byte values, mostly ASCII:

{1..6} {14..26} {28..31} 127   :data
{128..132} {134..159}          :Non-ISO extended-ASCII text
133                            :ASCII text, with LF, NEL line terminators
27                             :ASCII text, with escape sequences
13                             :ASCII text, with CR, LF line terminators
8                              :ASCII text, with overstriking
7 {9..12} {32..126}            :ASCII text
{160..255}                     :ISO-8859 text

UTF-8 encoded ranges:

{1..6} {14..26} {28..31} 127   :data
27                             :ASCII text, with escape sequences
13                             :ASCII text, with CR, LF line terminators
8                              :ASCII text, with overstriking
7 {9..12} {32..126}            :ASCII text
{128..132} {134..159}          :UTF-8 Unicode text
133                            :UTF-8 Unicode text, with LF, NEL line terminators
{160..255}                     :UTF-8 Unicode text
{256..5120}                    :UTF-8 Unicode text

One possible solution lies below.

Previous Answer.

The Unicode value for the character you are posting is:

$ printf '%x\n' "'�"
fffd

Yes, that is the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD). It is the character used to replace any invalid Unicode character found in the text. It is a "visual aid", not a real character. To find and list every full line that contains invalid Unicode characters, use:

grep -axv '.*' out.txt

But if you only want to detect whether any character is invalid, use:

grep -qaxv '.*' out.txt; echo $?

If the result is 1, the file is clean; otherwise it will be 0.
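As a sketch, that exit status can be plugged into the if/else from the question. The check only works when grep runs in a UTF-8 locale, so the example pins one explicitly; it assumes GNU grep and that a locale named C.UTF-8 is installed (check with locale -a and substitute, for example, en_US.UTF-8 if it is not). The helper name has_invalid_utf8 and the sample files are hypothetical.

```shell
# Assumption: GNU grep and an installed UTF-8 locale called C.UTF-8.
# Hypothetical helper: succeeds if any line contains an invalid sequence.
has_invalid_utf8() {
    LC_ALL=C.UTF-8 grep -qaxv '.*' "$1"
}

# Throwaway sample files (names are illustrative only).
printf 'clean line\n'                   > clean-sample.txt
printf 'clean line\nbroken \377 byte\n' > broken-sample.txt   # FF is never valid UTF-8

for f in clean-sample.txt broken-sample.txt; do
    if has_invalid_utf8 "$f"; then
        echo "working"      # invalid bytes present; a terminal would show them as �
    else
        cat "$f"
    fi
done
```

With these two samples the loop should print the contents of clean-sample.txt first and then "working" for broken-sample.txt.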

If what you were asking was: how to find the � character, then, use this:

➤ a='Basically, if the file "out.txt" contains "�" anywhere in the file I'
➤ echo "$a" | grep -oP "$(printf %b '\Ufffd')"
�

Or, if your system processes UTF-8 text correctly, simply:

➤ echo "$a" | grep -oP '�'
�
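If grep -P or bash's \U escape is not available, a possible alternative is to build the three UTF-8 bytes of U+FFFD (EF BF BD) with printf and search for them as a fixed string. This is only a sketch; the helper name contains_fffd and the sample files are invented for the example.

```shell
# U+FFFD encoded as UTF-8 is the byte sequence EF BF BD (octal 357 277 275).
fffd=$(printf '\357\277\275')

# Hypothetical helper: succeeds if the file contains a literal U+FFFD.
contains_fffd() {
    # -F: treat the pattern as a fixed string, -q: report via exit status only
    grep -qF -- "$fffd" "$1"
}

# Throwaway sample files (names are illustrative only).
printf 'text with %s inside\n' "$fffd" > with-fffd.txt
printf 'plain text\n'                  > without-fffd.txt

for f in with-fffd.txt without-fffd.txt; do
    if contains_fffd "$f"; then
        echo "$f: contains a literal U+FFFD"
    else
        echo "$f: no literal U+FFFD"
    fi
done
```

Note that this finds only a replacement character that is really stored in the file, not invalid bytes that a terminal merely displays as �.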

This very early answer was for the original post which was:

How to grep for unicode � in a bash script
if grep -q "�" out.txt
    then
        echo "working"
    else
        cat out.txt
fi
Basically, if the file "out.txt" contains "�" anywhere in the file I would like it to echo "working" AND if the file "out.txt" does NOT contain "�" anywhere in the file then I would like it to cat out.txt

Try

grep -oP "[^\x00-\x7F]"

with an if .. then statement as follows:

if grep -oP "[^\x00-\x7F]" file.txt; then
    echo "grep found something ..."
else
    echo "Nothing found!"
fi

Explanation:

-P, --perl-regexp: PATTERN is a Perl regular expression
-o, --only-matching: show only the part of a line matching PATTERN
[^\x00-\x7F] is a regex to match a single non-ASCII character.
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char

In bash, to match any character outside the printable ASCII range (space to tilde):

LC_COLLATE=C grep -o '[^ -~]' file
