How to specify characters using hexadecimal codes in `grep`?

See the related question "grep: Find all lines that contain Japanese kanjis".

Text is usually encoded in UTF-8, so you have to use the hex values of the bytes used in the UTF-8 encoding.

grep "["$'\xe0\xa4\x85'"-"$'\xe0\xa4\xb5'"]"

and

grep '[अ-व]'

are equivalent, and they perform locale-based matching. That is, matching depends on the sorting rules of the Devanagari script: it is NOT "any char between \u0905 and \u0935" but "anything sorting between Devanagari A and Devanagari VA". There may be differences.

($'...' is the "ANSI-C escape string" syntax for bash, ksh, and zsh. It is just an easier way to type the characters. You can also use the \uXXXX and \UXXXXXXXX escapes to directly ask for code points in bash and zsh.)

On the other hand, you have this (note -P):

grep -P "\xe0\xa4[\x85-\xb5]"

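that will do a binary matching with those byte values. In a UTF-8 locale, -P may interpret \xHH as a code point rather than as a byte, so run it as LC_ALL=C grep -P ... if you want strictly byte-level matching.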
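A minimal sketch of the same byte-level idea without `-P`, assuming GNU grep and a POSIX shell: in the C locale every byte counts as one character, so a bracket range is a plain byte range. The `hexbytes` helper and the sample file are made up for illustration; the helper goes through octal escapes because `\xHH` is not portable across `printf` implementations.

```shell
# hexbytes (hypothetical helper): print the raw bytes named by its
# hexadecimal arguments, via octal escapes that POSIX printf defines.
hexbytes() {
    for h in "$@"; do
        printf "\\$(printf '%03o' "0x$h")"
    done
}

# Sample input: one line starting with U+0915 (bytes e0 a4 95),
# and one plain ASCII line.
sample=$(mktemp)
{ hexbytes e0 a4 95; printf ' ka\nplain ascii\n'; } > "$sample"

# Pattern: the fixed prefix e0 a4 followed by one byte in 0x85..0xb5.
prefix=$(hexbytes e0 a4)
lo=$(hexbytes 85)
hi=$(hexbytes b5)

# LC_ALL=C makes grep compare single bytes instead of characters.
byte_matches=$(LC_ALL=C grep -c "${prefix}[${lo}-${hi}]" "$sample")
echo "$byte_matches"
rm -f "$sample"
```

Only the first sample line contains a byte sequence inside that range, so the count printed should be 1.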

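If shell escaping is enough, you can use the $'\xHH' syntax. Note that it produces raw bytes, so for UTF-8 text you need the UTF-8 bytes of each end point (U+0900 is e0 a4 80 and U+097F is e0 a5 bf), not the digits of the code point:

grep -v "<["$'\xe0\xa4\x80'"-"$'\xe0\xa5\xbf'"]*\s"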

Is that enough for your use case?

The "hexadecimal" value 0x0900 you wrote is exactly the value of the Unicode code point, which is also written in hexadecimal.

hexadecimal code 0900 (instead of अ)

I believe that what you mean is the hexadecimal Unicode code point U+0905.

The character at U+0900 is not the one you used (अ).
That character is U+0905, part of this Unicode page, or listed at this page.

In bash (installed by default in Ubuntu), or directly with the program /usr/bin/printf (but not with the sh builtin printf), a Unicode character can be produced with:

$ printf '\u0905'
अ
$ /usr/bin/printf '\u0905'
अ

However, that character, which comes from a code point number, can be represented by several byte sequences depending on which encoding is used.
It should be obvious that U+0905 is 0x09 0x05 in big-endian UTF-16 (UCS-2, etc.)
and 0x00 0x00 0x09 0x05 in big-endian UTF-32.
It may not be obvious, but in UTF-8 it is represented by 0xe0 0xa4 0x85:

$ /usr/bin/printf '\u0905' | od -vAn -tx1
e0 a4 85

That is the result if the locale of your console is something similar to en_US.UTF-8.
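The UTF-8 byte values can also be derived by hand from the code point, independent of the locale. A small sketch using only POSIX shell arithmetic; the function name `utf8_hex` is made up for illustration, and it does not reject surrogates or values above U+10FFFF.

```shell
# utf8_hex (hypothetical helper): print the UTF-8 bytes of a code point
# as hex. The argument is any shell integer constant, e.g. 0x0905.
utf8_hex() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | (cp >> 18) )) \
            $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}

utf8_hex 0x0905   # e0 a4 85
utf8_hex 0x0935   # e0 a4 b5
```

Devanagari code points all fall in the three-byte branch, which is why both end points of the range share the leading bytes e0 a4.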

And I am talking about the shell because it is the one that transforms a string into what the application receives. This:

grep "$(printf '\u0905')" file

makes grep "see" the character you need.
To understand the line above you may use echo:

$ echo grep "$(printf '\u0905')" file
grep अ file

Then, we can build a character range, as you requested:

$ echo grep "$(printf '[\u0905-\u097f]')" file
grep [अ-ॿ] file
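Where `printf '\u0905'` is not available (for example with the `sh` printf mentioned earlier), the same range can be assembled from the hexadecimal digits alone. This is a sketch assuming `iconv` is installed and supports UTF-16BE; `hexbytes` and `u16` are hypothetical helper names, and `u16` only handles four-digit code points (the Basic Multilingual Plane, surrogates excluded).

```shell
# hexbytes (hypothetical helper): print the raw bytes named by its
# hexadecimal arguments, via octal escapes that POSIX printf defines.
hexbytes() {
    for h in "$@"; do
        printf "\\$(printf '%03o' "0x$h")"
    done
}

# u16 (hypothetical helper): turn a four-digit code point such as 0905
# into UTF-8 by writing it as big-endian UTF-16 (the same two bytes)
# and letting iconv do the conversion.
u16() {
    hexbytes "${1%??}" "${1#??}" | iconv -f UTF-16BE -t UTF-8
}

# Build the bracket expression from the two end points.
pattern="[$(u16 0905)-$(u16 097f)]"

# Inspect the bytes that grep would receive:
# 5b e0 a4 85 2d e0 a5 bf 5d
printf '%s' "$pattern" | od -An -tx1
```

The resulting `$pattern` can then be passed as `grep "$pattern" file` in a UTF-8 locale, just like the `printf`-built version.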

That answers your question:

How I can use hexadecimal code in place of अ and व?
