Any non-whitespace regular expression
On non-GNU systems, the following explains why \S fails:
\S is part of PCRE (Perl Compatible Regular Expressions). It is not part of the BRE (Basic Regular Expressions) or ERE (Extended Regular Expressions) syntax used in shells.
The bash operator =~ inside the double-bracket test [[ uses ERE.
The only characters with special meaning in ERE (as opposed to any normal character) are .[\()*+?{|^$ and S is not among them. You need to construct the regex from more basic elements:
regex='^b[^[:space:]]+[a-z]$'
Here the bracket expression [^[:space:]] is the equivalent of the PCRE \S expression, the complement of \s:
The default \s
characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32).
The test would be:
var='big' regex='^b[^[:space:]]+[a-z]$'
[[ $var =~ $regex ]] && echo "$var" || echo 'none'
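As a quick check of that construction, here is a minimal sketch that wraps the same regex in a function and runs it over a few inputs. The function name is_b_word and the sample strings are made up for illustration; they are not part of the original question.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when "$1" starts with b, continues with
# one or more non-whitespace characters, and ends in a character from [a-z].
is_b_word() {
    local regex='^b[^[:space:]]+[a-z]$'   # [^[:space:]] stands in for \S
    [[ $1 =~ $regex ]]
}

# Illustrative samples: a plain word, a too-short word, and two strings
# containing whitespace (a space and a tab).
for sample in 'big' 'bg' 'b g' $'b\tig'; do
    if is_b_word "$sample"; then
        printf '%q: match\n' "$sample"
    else
        printf '%q: no match\n' "$sample"
    fi
done
```

Keeping the regex in a variable, as the original test does, avoids the quoting problems that appear when a pattern is written directly on the right-hand side of =~.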
However, the code above will also match bißß, for example, because the range [a-z] can include characters other than abcdefghijklmnopqrstuvwxyz when the selected locale is a Unicode (UTF-8) one.
To avoid that issue, use:
var='bißß' regex='^b[^[:space:]]+[a-z]$'
( LC_ALL=C;
[[ $var =~ $regex ]] && echo "$var" || echo 'none'
)
Be aware that this matches only characters from the list abcdefghijklmnopqrstuvwxyz in the last position, but will still match many others in the middle, e.g. bég.
Also, this use of LC_ALL=C affects the other bracket expression: [[:space:]] will match only the whitespace characters of the C locale.
To solve all the issues, we need to keep each regex separate:
reg1=[[:space:]] reg2='^b.*[a-z]$' out=none
if [[ $var =~ $reg1 ]] ; then out=none
elif ( LC_ALL=C; [[ $var =~ $reg2 ]] ); then out="$var"
fi
printf '%6.8s\t|' "$out"
Which reads as:
- If the input (var) has no spaces (in the present locale), then
- check that it starts with b and ends in a-z (in the C locale).
Note that both tests use positive ranges (as opposed to negated ones). The reason is that negating a handful of characters opens up far more possible matches. Unicode 8.0 already has 120,737 assigned characters. If a range negates 17 characters, it accepts the 120,720 others, which may include many non-printable control characters.
It would be a good idea to limit the characters allowed in the middle as well (they will not be spaces, but may be anything else).
[[ $var =~ ^b[^[:space:]]+[abcdefghijklmnopqrstuvwxyz]$ ]]
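One way to act on that advice is to spell out the allowed middle characters too. The sketch below assumes, purely for illustration, that the middle may contain ASCII lowercase letters, digits, underscore and hyphen; that set and the name is_strict_b_word are assumptions, not something the question specifies.

```bash
#!/usr/bin/env bash
# Hypothetical stricter check: every position uses an explicit list of
# characters, so no locale-dependent range such as [a-z] is involved.
is_strict_b_word() {
    local lower='abcdefghijklmnopqrstuvwxyz'
    # Assumed middle set: lowercase ASCII, digits, underscore, hyphen.
    # The hyphen comes last in the bracket so it is taken literally.
    local regex="^b[${lower}0123456789_-]+[${lower}]\$"
    [[ $1 =~ $regex ]]
}

for sample in 'big' 'b-2x' 'bég' 'b g'; do
    if is_strict_b_word "$sample"; then
        printf '%s: match\n' "$sample"
    else
        printf '%s: no match\n' "$sample"
    fi
done
```

Because whitespace is not in the explicit list, this form no longer needs a separate [[:space:]] test.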
What [a-z] matches depends on the locale and is generally not (only) one of abcdefghijklmnopqrstuvwxyz.
perl's \S (any character other than horizontal and vertical whitespace), now also recognised by some other regexp engines, is [^[:space:]] in POSIX and bash's EREs.
bash uses the system's regexp library to match those regular expressions, but even on systems (like recent GNU ones) where the regexps have a \S operator, that won't work, because in:
[[ x =~ \S ]]
bash calls regcomp("S"), and with:
[[ x =~ '\S' ]]
bash calls regcomp("\\S") (two backslashes).
However, with bash-3.1, or if you turn bash-3.1 compatibility on with shopt -s compat31, then:
[[ x =~ '\S' ]]
will work (will match a non-spacing character) on systems where EREs support \S.
$ bash -c "[[ x =~ '\S' ]]" || echo no
no
$ bash -O compat31 -c "[[ x =~ '\S' ]]" && echo yes
yes
Another option would be to put the regexp in a variable:
$ a='\S' bash -c '[[ x =~ $a ]]' && echo yes
yes
Again, that only works on systems that support that perl-like \S
in their regexps.
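Since \S support depends on the system's regexp library, a script can probe for it at run time and fall back to the POSIX bracket expression. This is only a sketch: the variable names probe and nonspace are invented here, and the probe assumes that a library without \S will not happen to both match x and reject a space with that pattern.

```bash
#!/usr/bin/env bash
# Probe: with real \S support, "x" matches and a single space does not.
# The pattern is kept in a variable so bash passes the backslash through.
probe='\S'
if [[ x =~ $probe ]] && ! [[ ' ' =~ $probe ]]; then
    nonspace='\S'                 # library understands the perl-like escape
else
    nonspace='[^[:space:]]'       # portable POSIX bracket expression
fi

# Build the full regex from whichever fragment was selected.
regex="^b${nonspace}+[abcdefghijklmnopqrstuvwxyz]\$"

for sample in 'big' 'b g'; do
    if [[ $sample =~ $regex ]]; then
        printf '%s: match\n' "$sample"
    else
        printf '%s: no match\n' "$sample"
    fi
done
```

In practice the bracket expression works everywhere, so the probe is mostly useful to show what the local library accepts.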
The POSIX equivalent of that bash-specific code would be:
if expr " $var" : \
  ' b[^[:space:]]\{1,\}[abcdefghijklmnopqrstuvwxyz]$' \
  > /dev/null; then
  printf '%s\n' "$var"
else
  echo none
fi
Or:
case $var in
  ([!b]* | *[!abcdefghijklmnopqrstuvwxyz] | *[[:space:]]* | "" | ? | ??)
    echo none;;
  (*) printf '%s\n' "$var";;
esac
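The case version works by rejection: anything not starting with b, not ending in a listed lowercase letter, containing whitespace, or shorter than three characters prints none, and everything else is accepted. A minimal sketch that wraps it in a function for reuse follows; the name classify and the sample inputs are hypothetical.

```sh
# Hypothetical wrapper around the case statement; plain POSIX sh.
classify() {
    case $1 in
        # Reject: wrong first character, wrong last character,
        # any whitespace, or fewer than three characters.
        ([!b]* | *[!abcdefghijklmnopqrstuvwxyz] | *[[:space:]]* | "" | ? | ??)
            echo none;;
        (*)
            printf '%s\n' "$1";;
    esac
}

for sample in big bg 'b g' bIG xig; do
    classify "$sample"
done
```

Listing the letters explicitly keeps the result independent of how the locale defines the a-z range.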