Locking mechanisms for shared-memory consistency
So, doing some research I've found that System V semaphores have a flag called SEM_UNDO which can revert the lock state when the program fails, but that's not guaranteed to work.
SEM_UNDO would unlock the semaphore if the process crashes. If a process crashed because the shared memory was corrupted, there is nothing semaphores can do for you. The OS can't undo the state of shared memory.
If you need to be able to roll back the state of the shared memory, then you have to implement something on your own. I have seen at least two models which deal with that.
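To make the SEM_UNDO part concrete, here is a minimal sketch of a one-semaphore lock. The helper names (lock_create, lock_acquire, lock_release, lock_destroy) and the union name are made up for illustration, and the semaphore is created with IPC_PRIVATE only to keep the example short.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* semctl() takes a caller-defined union; the name here is hypothetical. */
union lock_semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a set with one semaphore, initialised to 1 (unlocked).
 * Returns the semaphore id, or -1 on error. */
int lock_create(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1)
        return -1;

    union lock_semun arg;
    arg.val = 1;
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    return semid;
}

/* Take the lock. SEM_UNDO asks the kernel to reverse this -1
 * if the process terminates while still holding it. */
int lock_acquire(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
    return semop(semid, &op, 1);
}

/* Release the lock. Using SEM_UNDO here too brings the kernel's
 * per-process adjustment back to zero. */
int lock_release(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
    return semop(semid, &op, 1);
}

/* Remove the semaphore set. */
int lock_destroy(int semid)
{
    return semctl(semid, 0, IPC_RMID);
}
```

Note that this only restores the semaphore value. Whatever the dead process left half-written in the segment stays exactly as it was.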
The first model took a snapshot of the structure before modifying anything in shared memory and saved it in a list kept in the shared memory. If any other process was able to get the lock and the list wasn't empty, it undid whatever the crashed process might have changed.
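A rough sketch of that first model, simplified to a single undo slot instead of a list. The struct layout and the txn_* names are assumptions for illustration, not anything from a real library. The flag is written with release ordering so that the snapshot is complete before the slot is marked valid.

```c
#include <stdatomic.h>

/* Hypothetical record kept in shared memory. */
struct undo_record {
    int balance;
    char owner[16];
};

/* Hypothetical layout of the shared segment. */
struct shm_region {
    atomic_int undo_valid;        /* nonzero while a transaction is in flight */
    struct undo_record undo_copy; /* snapshot taken before modification */
    struct undo_record live;      /* the data everybody reads */
};

/* Call right after acquiring the lock. If the previous holder died
 * mid-transaction, restore the snapshot. Returns 1 if a rollback happened. */
int txn_recover(struct shm_region *r)
{
    if (!atomic_load_explicit(&r->undo_valid, memory_order_acquire))
        return 0;
    r->live = r->undo_copy;
    /* Clear the flag last, so a crash during recovery just repeats it. */
    atomic_store_explicit(&r->undo_valid, 0, memory_order_release);
    return 1;
}

/* Snapshot the live data, then mark the snapshot as valid. */
void txn_begin(struct shm_region *r)
{
    r->undo_copy = r->live;
    atomic_store_explicit(&r->undo_valid, 1, memory_order_release);
}

/* The changes are complete: discard the snapshot. */
void txn_commit(struct shm_region *r)
{
    atomic_store_explicit(&r->undo_valid, 0, memory_order_release);
}
```

The intended order is: take the lock, txn_recover(), txn_begin(), modify live, txn_commit(), release the lock. A process that dies between txn_begin() and txn_commit() leaves undo_valid set, and the next lock holder rolls the record back.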
The second model is to make copies of the shm structures in local memory and keep the lock held for the whole transaction. When the transaction is committed, before releasing the lock, simply copy the structures from local memory into the shared memory. The probability that the app would crash during the copy is lower, and intervention by external signals can be blocked by using sigprocmask(). (Locking in this case had better be well partitioned over the data. E.g. I have seen tests with a set of 1000 locks for 10 million records in shm accessed by 4 concurrent processes.)
There are only a few things that are guaranteed to be cleaned up when a program fails. The only thing that comes to my mind here is reference counts. An open file descriptor holds a reference on the underlying inode and a corresponding close drops it, including a forced close when the program fails. (This is the kernel's count of open references, not the hard-link count that stat() reports.)
So your processes could all open a common file (I don't remember if it works for shared memory segments) and you could trigger some sort of alarm if the count decreases when it shouldn't. E.g. instead of doing a plain wait, your processes could do a timedwait (for a second, say) in a loop and poll that count, so that they are alerted when something is going wrong.
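As a sketch of that polling idea, the code below swaps in something different from the common-file approach: the System V attach count (shm_nattch, read with shmctl(IPC_STAT)), which the kernel decrements when an attached process exits. It also uses a try-and-sleep loop rather than a real timed wait (Linux has semtimedop() for that). The function names and the 100 ms pause are assumptions.

```c
#include <errno.h>
#include <time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>

/* Number of processes currently attached to the segment, or -1 on error. */
long attached_count(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;
    return (long)ds.shm_nattch;
}

/* Try to take semaphore 0, re-checking the attach count between attempts.
 * Returns 0 once the lock is held, 1 if fewer than `expected` processes
 * are attached (a peer has probably died), -1 on error. */
int lock_or_detect(int semid, int shmid, long expected)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1,
                         .sem_flg = SEM_UNDO | IPC_NOWAIT };
    struct timespec pause = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };

    for (;;) {
        if (semop(semid, &op, 1) == 0)
            return 0;                 /* got the lock */
        if (errno != EAGAIN && errno != EINTR)
            return -1;                /* real failure */

        long n = attached_count(shmid);
        if (n == -1)
            return -1;
        if (n < expected)
            return 1;                 /* somebody detached unexpectedly */

        nanosleep(&pause, NULL);      /* wait a little, then retry */
    }
}
```

What to do on a return of 1 is up to the application, for example running a recovery routine before touching the segment.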