Safe locking mechanism in bash to serialize access to some resource
June 12, 2013

Say you have a makefile as part of a pipeline that tries to use ssh to spin up a number of parallel jobs running on a cluster (see earlier post). Running make in parallel mode with -j can then result in many concurrent calls to ssh. There may be, however, a maximum number of concurrent ssh sessions, which is set by default to 10 (see the MaxStartups and MaxSessions options in man sshd_config for more details). Any attempt to establish more concurrent ssh sessions results in an error. However, in this scenario, each ssh session is very short (just long enough to qsub a batch script), so we could just try to serialize access to ssh using some locking mechanism.
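For concreteness, here is a minimal sketch of the kind of makefile that runs into this (the host cluster-head, the scripts/ directory and the job names are illustrative, not from the original pipeline):

JOBS := job1 job2 job3

all: $(JOBS)

# each target submits one batch script to the scheduler over ssh;
# under make -j these recipes run concurrently
$(JOBS):
	ssh -q cluster-head "qsub scripts/$@.sh"

Running this with something like make -j 20 fires off all the ssh calls at once and quickly hits the session limit.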
One possible way to serialize access is to use a global lock file. However, it's easy to create a race condition when doing this. For example, the following code has a race condition:
if [[ ! -f ${SSH_LOCK_FILE} ]]
then
    touch ${SSH_LOCK_FILE}
    # ...some code here...
    rm -f ${SSH_LOCK_FILE}
fi
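The race is easy to provoke by running the same check-then-create pattern in two background subshells (an artificial demo; the sleep merely widens the window between the check and the touch):

LOCK=/tmp/demo.lock
rm -f "${LOCK}"
for i in 1 2; do
    (
        if [[ ! -f ${LOCK} ]]
        then
            sleep 0.1    # both subshells pass the check before either creates the file
            touch ${LOCK}
            echo "process $i acquired the lock"
        fi
    ) &
done
wait

This typically prints that both processes acquired the lock, i.e. the lock excluded no one.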
The problem here is that another process might create the lock file in the short time between this code checking for the file's existence and actually creating it. So the creation of the lock file needs to be made atomic. One possible solution (in bash) is to use set -o noclobber, which means that redirecting output to a file fails if the file already exists.
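A quick interactive illustration of noclobber (a throwaway sketch; /tmp/f is just a scratch path):

$ set -o noclobber
$ echo "$$" > /tmp/f   # succeeds: the file did not exist
$ echo "$$" > /tmp/f   # fails: the file now exists
bash: /tmp/f: cannot overwrite existing file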
Building on this, we can create the following functions to acquire and release an atomic file lock in bash:
SSH_LOCK_FILE=${HOME}/.sshlock

function get_ssh_lock() {
    for i in {1..30}; do
        if ( set -o noclobber; echo "$$" > "${SSH_LOCK_FILE}" ) 2> /dev/null
        then
            # ensure the lock file is deleted if the script exits unexpectedly
            trap 'rm -f "${SSH_LOCK_FILE}"; exit $?' INT TERM EXIT
            return 0
        else
            usleep $(( 2000 * ($RANDOM % 100) + 500000 ))
        fi
    done
    echo "Failed to acquire lock ${SSH_LOCK_FILE} after 30 attempts" >&2
    return 1
}
This particular function attempts to acquire the lock file 30 times. If it succeeds, it traps the INT, TERM and EXIT signals to ensure that the lock file is removed should something go wrong. It then returns 0 (OK). If it fails to acquire the lock, it sleeps between 0.5 and 0.7 s (with some randomness to spread out the timers) and tries again. If it fails 30 times, it returns 1 (FAIL).
The following function releases the lock:
function release_ssh_lock() {
    rm -f "${SSH_LOCK_FILE}"
    trap - INT TERM EXIT
}
These functions are then used like this:
if get_ssh_lock
then
    ssh -q remote_machine "some command"
    release_ssh_lock
fi
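To plug this back into the makefile scenario, one option is a small wrapper script that each recipe calls instead of invoking ssh directly (a sketch; the helpers file ${HOME}/bin/ssh_lock.sh and the script name ssh_locked.sh are hypothetical):

#!/bin/bash
# ssh_locked.sh -- run a single remote command under the global ssh lock.
# Usage: ssh_locked.sh <host> <remote command>
source "${HOME}/bin/ssh_lock.sh"   # defines SSH_LOCK_FILE, get_ssh_lock, release_ssh_lock

host=$1
shift

if get_ssh_lock
then
    ssh -q "${host}" "$@"
    status=$?
    release_ssh_lock
    exit ${status}
fi
exit 1

Makefile recipes can then call ssh_locked.sh cluster-head "qsub scripts/$@.sh", and the lock serializes the ssh sessions no matter how high -j is set.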
Another solution would be a locking directory instead of a file, since mkdir also operates atomically.
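A sketch of the same pair of functions built on mkdir (the directory name is illustrative):

SSH_LOCK_DIR=${HOME}/.sshlock.d

function get_ssh_lock_dir() {
    for i in {1..30}; do
        # mkdir fails if the directory already exists, and the
        # check-and-create happens atomically in the kernel
        if mkdir "${SSH_LOCK_DIR}" 2> /dev/null
        then
            trap 'rmdir "${SSH_LOCK_DIR}"; exit $?' INT TERM EXIT
            return 0
        fi
        usleep $(( 2000 * ($RANDOM % 100) + 500000 ))
    done
    echo "Failed to acquire lock ${SSH_LOCK_DIR} after 30 attempts" >&2
    return 1
}

function release_ssh_lock_dir() {
    rmdir "${SSH_LOCK_DIR}"
    trap - INT TERM EXIT
}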
Neither of these is guaranteed to work over NFS if the processes are running on different machines!