Why do I get different exit status for ps | grep in a script?
In general, it's a bad idea to rely on the simple approach with ps
and grep
to determine whether a given process is running.
You would be much better off using pgrep
for this:
if pgrep "varnish" >/dev/null; then
    echo "Varnish is running"
else
    echo "Varnish is not running"
fi
See the manual for pgrep
. On some systems (probably not on Linux), you get a -q
flag that corresponds to the same flag for grep
which gets rid of the need to redirect to /dev/null
. There's also a -f
flag that performs the match on the full command line rather than on only the process name. One may also limit the match to processes belonging to a specific user using -u
.
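As a rough sketch of how these flags combine, the helper functions below wrap pgrep. The function names is_running and is_running_as are made up for illustration, and -x (exact match on the process name) is an additional pgrep flag not mentioned above. Keep in mind that with -f the match is made against whole command lines, so a script whose own name contains the pattern can match itself, much like in the ps and grep case.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of pgrep.

# is_running NAME
# Succeeds if at least one process is named exactly NAME.
# -x requires the whole process name to match, so "varnish" does not
# match a process named "varnishd".
is_running() {
    pgrep -x "$1" >/dev/null
}

# is_running_as USER PATTERN
# Succeeds if a process running as USER has a command line matching PATTERN.
# -u limits the search to USER, -f matches against the full command line.
is_running_as() {
    pgrep -u "$1" -f "$2" >/dev/null
}
```

A call such as is_running varnishd && echo "varnishd is running" would then replace the ps and grep pipeline; the process name varnishd is an assumption here.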
Installing pgrep
also gives you access to pkill
which allows you to signal processes based on their names.
Also, if this is a service daemon, and if your Unix system has a way of querying it for information (e.g., whether it's up and running or not), then that is the proper way of checking on it.
On Linux, you have systemctl
(systemctl is-active --quiet varnish
will return 0 if it's running, 3 otherwise), on OpenBSD you have rcctl
, etc.
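For the systemd case, a minimal sketch of such a check might look like the following. The function name report_service is hypothetical, and the unit name you pass to it (varnish, for instance) is an assumption that may differ between distributions.

```shell
#!/bin/sh
# report_service UNIT  (hypothetical helper name)
# Prints whether the systemd unit UNIT is active and returns the exit
# status of "systemctl is-active" (0 when the unit is active).
report_service() {
    if systemctl is-active --quiet "$1"; then
        echo "$1 is running"
    else
        status=$?    # exit status of the systemctl call above
        echo "$1 is not running"
        return "$status"
    fi
}
```

Called as report_service varnish, it prints one line and passes the systemctl exit status on to the caller, so it can be used directly in an if statement.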
Now to your script:
In your script, you parse the output from ps ax
. This output will contain the name of the script itself, check_varnish_pro.sh
, which obviously contains the string varnish
. This gives you a false positive. You would have spotted this if you had run it without the -q
flag for grep
while testing.
#!/bin/bash
ps ax | grep '[v]arnish'
Running it:
$ ./check_varnish_pro.sh
31004 p1 SN+ 0:00.04 /bin/bash ./check_varnish_pro.sh
Another issue is that, although you try to "hide" the grep
process from being detected by grep
itself by using [v]
in the pattern, that approach will fail if you happen to run the script or the command line in a directory that has a file or directory named varnish
in it (in which case you will get a false positive, again). This is because the pattern is unquoted and the shell will perform filename globbing with it.
See:
bash-4.4$ set -x
bash-4.4$ ps ax | grep [v]arnish
+ ps ax
+ grep '[v]arnish'
bash-4.4$ touch varnish
+ touch varnish
bash-4.4$ ps ax | grep [v]arnish
+ ps ax
+ grep varnish
91829 p2 SN+p 0:00.02 grep varnish
The presence of the file varnish
will cause the shell to replace [v]arnish
with the filename varnish
and you get a hit on the pattern in the process table (the grep
process).
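The effect is easy to reproduce without ps at all. The sketch below, with a made-up function name glob_demo, uses echo in a scratch directory to show what the shell hands to a command for the unquoted and the quoted pattern. It assumes a POSIX-like shell with default globbing settings (in bash, options such as nullglob or failglob change what happens when nothing matches).

```shell
#!/bin/sh
# glob_demo  (hypothetical name)
# The parentheses run the body in a subshell, so the caller's working
# directory is left alone.
glob_demo() (
    # Work in an empty scratch directory.
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1

    echo [v]arnish      # nothing matches: the pattern is passed on as-is
    touch varnish
    echo [v]arnish      # a file matches: the shell substitutes its name
    echo '[v]arnish'    # quoted: the pattern always reaches the command intact

    cd / && rm -rf "$dir"
)
```

Calling glob_demo prints [v]arnish, then varnish, then [v]arnish again. Quoting the pattern, as in grep '[v]arnish', keeps the shell from touching it.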
When you run a script named check_varnish_pro.sh
the test
ps ax | grep -q [v]arnish
is successful because there is a script named check_varnish_pro
running.
@AlexP explains very succinctly what is actually happening, but @Kusalananda's idea of using pgrep
/pkill
for a critical process is strongly discouraged. Better solutions include:
- Asking the service whether it's running.
systemctl status varnishd
should take care of that on a modern *nix installation. If by some unfortunate circumstance you don't have a service available you can simply change the startup script to report the problem as soon as the process exits:
varnish || true
some_command_to_send_an_alert_that_the_service_has_died
- Alternatively, change the script that starts the service to record the PID, and then check the state periodically with
kill -0 "$pid"
.
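A minimal sketch of the PID-based check, assuming the startup script has written the PID to a file. The function name pid_alive and the PID file path in the usage line are hypothetical.

```shell
#!/bin/sh
# pid_alive PIDFILE  (hypothetical helper name)
# Succeeds if PIDFILE holds the PID of a process we are able to signal.
pid_alive() {
    [ -r "$1" ] || return 1
    _pid=$(cat "$1") || return 1
    # Reject anything that is empty or contains a non-digit.
    case $_pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Signal 0 sends nothing; kill only checks that the process exists
    # and that we have permission to signal it.
    kill -0 "$_pid" 2>/dev/null
}
```

A periodic job could then run something like pid_alive /var/run/varnish.pid || some_command_to_send_an_alert_that_the_service_has_died. Note that kill -0 also fails when the process exists but belongs to another user and the caller is unprivileged, and that a stale PID file may point at an unrelated process if the PID has been reused.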