How can I kill a process and be sure the PID hasn't been reused?
The best option is the timeout command, if you have it, since it is meant for exactly this:
timeout 86400 cmd
The current (8.23) GNU implementation, at least, works by using alarm() or an equivalent while waiting for the child process. It does not seem to guard against SIGALRM being delivered between waitpid() returning and timeout exiting (which effectively cancels the alarm). During that small window, timeout may even write messages to stderr (for instance if the child dumped core), which enlarges the race window further (indefinitely if stderr is a full pipe, for instance).
I personally can live with that limitation (which will probably be fixed in a future version). timeout also takes extra care to report the correct exit status, and it handles other corner cases (SIGALRM blocked or ignored on startup, other signals, and so on) better than you would probably manage by hand.
As an approximation, you could write it in perl like:
perl -MPOSIX -e '
$p = fork();
die "fork: $!\n" unless defined($p);
if ($p) {
$SIG{ALRM} = sub {
kill "TERM", $p;
exit 124;
};
alarm(86400);
wait;
exit (WIFSIGNALED($?) ? WTERMSIG($?)+128 : WEXITSTATUS($?))
} else {exec @ARGV}' cmd
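For comparison, here is a minimal Python sketch of the same parent-side idea. It is not how timeout itself is implemented. The function name run_with_timeout is made up for illustration, and the 124 and 128+signal exit codes simply mirror the Perl sketch above. It relies on the parent still holding the unreaped child, so the PID cannot have been recycled when the signal is sent.

```python
import subprocess


def run_with_timeout(argv, seconds):
    """Run argv, send SIGTERM if it outlives `seconds`, return an exit code.

    Mirrors the Perl sketch: 124 on timeout, 128+signal if the child was
    killed by a signal, otherwise the child's own exit status.
    """
    proc = subprocess.Popen(argv)
    try:
        status = proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        # The child has not been waited for yet, so its PID is still ours.
        proc.terminate()
        # Reap it; note this blocks if the child ignores SIGTERM.
        proc.wait()
        return 124
    # Popen reports death by signal N as the negative value -N.
    return 128 - status if status < 0 else status
```

Something like run_with_timeout(["cmd"], 86400) would then stand in for the one-liner above.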
There's a timelimit command at http://devel.ringlet.net/sysutils/timelimit/ (it predates GNU timeout by a few months).
timelimit -t 86400 cmd
That one uses an alarm()-like mechanism but installs a handler on SIGCHLD (ignoring stopped children) to detect the child dying. It also cancels the alarm before running waitpid() (that doesn't cancel the delivery of SIGALRM if it was pending, but the way it's written, I can't see it being a problem) and kills before calling waitpid() (so it can't kill a reused pid).
netpipes also has a timelimit command. That one predates all the others by decades and takes yet another approach, but it doesn't work properly for stopped commands and returns an exit status of 1 upon timeout.
As a more direct answer to your question, you could do something like:
if [ "$(ps -o ppid= -p "$p")" -eq "$$" ]; then
kill "$p"
fi
That is, check that the process is still a child of ours. Again, there's a small race window (between ps retrieving the status of that process and kill killing it) during which the process could die and its pid be reused by another process.
With some shells (zsh, bash, mksh), you can pass job specs instead of pids.
cmd &
sleep 86400
kill %
wait "$!" # to retrieve the exit status
That only works if you spawn only one background job (otherwise getting the right jobspec is not always possible reliably).
If that's an issue, just start a new shell instance:
bash -c '"$@" & sleep 86400; kill %; wait "$!"' sh cmd
That works because the shell removes the job from the job table upon the child dying. Here, there should not be any race window since by the time the shell calls kill(), either the SIGCHLD signal has not been handled and the pid can't be reused (since it has not been waited for), or it has been handled and the job has been removed from the job table (and kill would report an error). bash's kill at least blocks SIGCHLD before it accesses its job table to expand the % and unblocks it after the kill().
Another option, to avoid having that sleep process hanging around even after cmd has died, is to use a pipe with read -t instead of sleep (with bash or ksh93):
{
{
cmd 4>&1 >&3 3>&- &
printf '%d\n.' "$!"
} | {
read p
read -t 86400 || kill "$p"
}
} 3>&1
That one still has race conditions, and you lose the command's exit status. It also assumes cmd doesn't close its fd 4.
You could try implementing a race-free solution in perl like:
perl -MPOSIX -e '
$p = fork();
die "fork: $!\n" unless defined($p);
if ($p) {
$SIG{CHLD} = sub {
$ss = POSIX::SigSet->new(SIGALRM); $oss = POSIX::SigSet->new;
sigprocmask(SIG_BLOCK, $ss, $oss);
waitpid($p,WNOHANG);
exit (WIFSIGNALED($?) ? WTERMSIG($?)+128 : WEXITSTATUS($?))
unless $? == -1;
sigprocmask(SIG_SETMASK, $oss); # restore the previous signal mask
};
$SIG{ALRM} = sub {
kill "TERM", $p;
exit 124;
};
alarm(86400);
pause while 1;
} else {exec @ARGV}' cmd args...
(though it would need to be improved to handle other types of corner cases).
Another race-free method could be using process groups:
set -m
((sleep 86400; kill 0) & exec cmd)
Note, however, that using process groups can have side effects if there's I/O to a terminal device involved. It has the additional benefit, though, of killing all the other extra processes spawned by cmd.
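The process-group idea can also be sketched in Python. This is only an approximation of the shell version: start_new_session=True makes the child call setsid(), so it gets a new session as well as a new process group, which is not what set -m does. The function name run_group_with_timeout is hypothetical.

```python
import os
import signal
import subprocess


def run_group_with_timeout(argv, seconds):
    """Run argv in its own process group; SIGTERM the whole group on timeout."""
    # After setsid() the child leads a group whose ID equals its PID.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        try:
            # The leader has not been reaped, so the group ID is still ours.
            os.killpg(proc.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the group vanished in the meantime
        # Negative return value means the leader was killed by that signal.
        return proc.wait()
```

As with kill 0 in the shell version, every process that cmd left in the group receives the signal, not just cmd itself.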
In general, you can't. All of the answers given so far are buggy heuristics. There is only one case in which you can safely use the pid to send signals: when the target process is a direct child of the process that will be sending the signal, and the parent has not yet waited on it. In this case, even if it has exited, the pid is reserved (this is what a "zombie process" is) until the parent waits on it. I'm not aware of any way to do that cleanly with the shell.
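The zombie behaviour described above is easy to observe from Python. The sketch below (the function name child_pid_is_reserved is made up) forks a child that exits at once and then probes the PID with signal 0 before reaping it. The 0.2 second pause is an arbitrary assumption that merely makes it likely the child is already a zombie when probed.

```python
import os
import time


def child_pid_is_reserved():
    """Fork a child that exits at once, then probe its PID before reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)              # child: exit immediately
    time.sleep(0.2)              # let the child exit and become a zombie
    try:
        os.kill(pid, 0)          # signal 0 checks existence, sends nothing
        reserved = True
    except ProcessLookupError:
        reserved = False
    os.waitpid(pid, 0)           # reap; only now may the PID be recycled
    return reserved
```

Until the waitpid() call, the kernel keeps the PID allocated to the dead child, which is why signalling it from the parent is safe.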
An alternative safe way to kill processes is to start them with a controlling tty set to a pseudo-terminal for which you own the master side. You can then send signals via the terminal, e.g. by writing the character that generates SIGINT or SIGQUIT over the pty.
Yet another way that's more convenient with scripting is to use a named screen session and send commands to the screen session to end it. This process takes place over a pipe or unix socket named according to the screen session, which won't automatically be reused if you choose a safe unique name.
When launching the process save its start time:
longrunningthing &
p=$!
stime=$(TZ=UTC0 ps -p "$p" -o lstart=)
echo "Killing longrunningthing on PID $p in 24 hours"
sleep 86400
echo Time up!
Before trying to kill the process, stop it (this isn't truly essential, but it's a way to avoid race conditions: if you stop the process, its pid cannot be reused):
kill -s STOP "$p"
Check that the process with that PID has the same start time and if yes, kill it, otherwise let the process continue:
cur=$(TZ=UTC0 ps -p "$p" -o lstart=)
if [ "$cur" = "$stime" ]
then
    # Okay, we can kill that process
    kill "$p"
    # A stopped process only acts on SIGTERM once it is continued
    kill -s CONT "$p"
else
    # PID was reused. Better unblock the process!
    echo "long running task already completed!"
    kill -s CONT "$p"
fi
This works because there can be only one process with the same PID and start time on a given OS.
Stopping the process during the check makes race conditions a non-issue. Obviously this has the problem that some random process may be stopped for a few milliseconds. Depending on the type of process, this may or may not be an issue.
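The stop, compare, then kill-or-continue sequence could be sketched in Python as follows. Two things here are my own assumptions rather than part of the recipe above. First, the start time is read from field 22 of /proc/PID/stat, which is Linux-only, instead of ps -o lstart=. Second, SIGCONT is sent after SIGTERM so that the stopped process actually gets to act on it. The names start_ticks and kill_if_same are hypothetical.

```python
import os
import signal


def start_ticks(pid):
    """Return pid's start time in clock ticks since boot, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Field 2 (the command name) is parenthesised and may contain spaces,
    # so split after the last ")". starttime is field 22 of the full line.
    return int(stat.rsplit(")", 1)[1].split()[19])


def kill_if_same(pid, recorded):
    """Stop pid, then SIGTERM it only if its start time matches `recorded`."""
    try:
        os.kill(pid, signal.SIGSTOP)
    except ProcessLookupError:
        return False                      # nothing left to kill
    same = recorded is not None and start_ticks(pid) == recorded
    if same:
        os.kill(pid, signal.SIGTERM)      # stays pending while stopped
    os.kill(pid, signal.SIGCONT)          # resume in both cases
    return same
```

You would record start_ticks(pid) right after launching the task and pass that value to kill_if_same() once the delay has elapsed.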
Personally I'd simply use python and psutil, which handles PID reuse automatically:
import time
import psutil
# note: it would be better if you were able to avoid using
# shell=True here.
proc = psutil.Popen('longrunningtask', shell=True)
time.sleep(86400)
# PID reuse handled by the library, no need to worry.
proc.terminate() # or: proc.kill()