Redis Lock Leakage?

Charles · January 3, 2021, 12:46am

Question about the Redis lock. I have a use case where I want to scale out workers and use the Redis lock to maintain sanity.

However, I think what I’m seeing is the following, and maybe you can confirm it: 1) single or multiple requests hit the lock code; 2) the lock allows a single thread through as it sets the lock; 3) the code executes and the lock is removed; 4) all the remaining blocked threads are allowed to pass through the lock; 5) the cycle continues.

So my first question: if there are multiple requests hitting the lock, once the lock is free, does it let all the other requests through, or does it provide a proper lock? That is, regardless of the number of threads waiting behind the lock, does it allow only one through at a time, setting a new lock and forcing the others to “wait their turn”, so that all threads are processed one at a time?

If not the latter, and there is this “redis lock leakage”, is there a way to tighten up the lock so that the code is properly gated, with only one thread processed at a time and all others waiting their turn, regardless of how wide I scale out horizontally?

mythz · January 3, 2021, 1:03pm

Here is the implementation of RedisLock used when calling the AcquireLock() API. It follows the locking behavior defined in the Redis SETNX documentation:

Fortunately, it’s possible to avoid this issue using the following algorithm. Let’s see how C4, our sane client, uses the good algorithm:

C4 sends SETNX lock.foo in order to acquire the lock

The crashed client C3 still holds it, so Redis will reply with 0 to C4.

C4 sends GET lock.foo to check if the lock expired. If it is not, it will sleep for some time and retry from the start.

Instead, if the lock is expired because the Unix time at lock.foo is older than the current Unix time, C4 tries to perform:
GETSET lock.foo <current Unix timestamp + lock timeout + 1>
Because of the GETSET semantic, C4 can check if the old value stored at key is still an expired timestamp. If it is, the lock was acquired.

If another client, for instance C5, was faster than C4 and acquired the lock with the GETSET operation, the C4 GETSET operation will return a non-expired timestamp. C4 will simply restart from the first step. Note that even if C4 set the key a few seconds in the future, this is not a problem.
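The quoted steps can be sketched in a few lines. This is a minimal illustration against an in-memory stand-in, not the ServiceStack RedisLock source: `FakeRedis`, `try_acquire`, and the `now` parameter are hypothetical names introduced here, and the stored value is the expiry timestamp, as in the quoted text.

```python
import threading
import time


class FakeRedis:
    """In-memory stand-in (hypothetical) for the three commands the algorithm uses."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()  # each command runs atomically, as in Redis

    def setnx(self, key, value):
        with self._mutex:
            if key in self._data:
                return 0
            self._data[key] = value
            return 1

    def get(self, key):
        with self._mutex:
            return self._data.get(key)

    def getset(self, key, value):
        with self._mutex:
            old = self._data.get(key)
            self._data[key] = value
            return old


def try_acquire(store, key, lock_timeout, now=None):
    """One pass of the quoted steps. Returns the expiry written, or None to retry."""
    now = time.time() if now is None else now
    expiry = now + lock_timeout + 1

    # SETNX succeeds only when nobody holds the key.
    if store.setnx(key, expiry) == 1:
        return expiry

    # GET: if the stored timestamp is not older than now, the lock is still held.
    current = store.get(key)
    if current is None or current >= now:
        return None  # caller sleeps and retries from the start

    # GETSET: only the client that reads back an expired timestamp wins.
    previous = store.getset(key, expiry)
    if previous is None or previous < now:
        return expiry
    return None  # a faster client got there first


store = FakeRedis()
first = try_acquire(store, "lock.foo", lock_timeout=5, now=100)   # lock is free
second = try_acquire(store, "lock.foo", lock_timeout=5, now=101)  # still held
third = try_acquire(store, "lock.foo", lock_timeout=5, now=200)   # holder expired
```

Passing `now` explicitly only makes the expiry arithmetic easy to follow; a real client would read the clock and sleep between failed passes.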

It only allows one client through, whilst all other clients wait until their RedisLock timeout elapses, at which point they fail with a TimeoutException. If the lock is held for longer than the timeout, then the lock expires and another single client will be able to acquire it. If the lock is acquired without a timeout, the lock expiry is 1 year (and all blocked clients will retry indefinitely). So as long as the operation completes within the lock expiry timeout, there should only ever be a single client holding the lock. You’d want to ensure that the operation completes (i.e. whilst the client still holds the lock) within the lock timeout; if it doesn’t, another client will be able to acquire the lock, whilst all other blocked clients will have thrown a TimeoutException.
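To make the waiting behavior concrete, here is a small in-process model, assuming a lock that stores its own expiry and waiters that poll until a deadline. `ExpiringLock`, `acquire`, and `run_workers` are hypothetical names for illustration only, and Python's built-in `TimeoutError` stands in for the TimeoutException mentioned above.

```python
import threading
import time


class ExpiringLock:
    """In-process model (hypothetical) of a lock key holding its expiry time."""

    def __init__(self):
        self._expiry = None  # None means the key does not exist
        self._mutex = threading.Lock()

    def try_acquire(self, hold_for):
        now = time.monotonic()
        with self._mutex:
            if self._expiry is None or self._expiry < now:
                self._expiry = now + hold_for
                return self._expiry  # token identifying this acquisition
            return None

    def release(self, token):
        with self._mutex:
            # Only clear the lock if it is still the one we acquired.
            if self._expiry == token:
                self._expiry = None


def acquire(lock, wait_timeout, hold_for, poll=0.01):
    """Poll until the lock is taken, or raise TimeoutError after wait_timeout."""
    deadline = time.monotonic() + wait_timeout
    while True:
        token = lock.try_acquire(hold_for)
        if token is not None:
            return token
        if time.monotonic() >= deadline:
            raise TimeoutError("could not acquire lock in time")
        time.sleep(poll)


def run_workers(count, work_seconds, wait_timeout, hold_for):
    """Run `count` threads through the lock and report what happened."""
    lock = ExpiringLock()
    state = {"inside": 0, "max_inside": 0, "done": 0, "timed_out": 0}
    guard = threading.Lock()

    def worker():
        try:
            token = acquire(lock, wait_timeout, hold_for)
        except TimeoutError:
            with guard:
                state["timed_out"] += 1
            return
        with guard:
            state["inside"] += 1
            state["max_inside"] = max(state["max_inside"], state["inside"])
        time.sleep(work_seconds)  # the protected operation
        with guard:
            state["inside"] -= 1
            state["done"] += 1
        lock.release(token)

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


# Generous wait: every worker eventually gets a turn, one at a time.
patient = run_workers(count=5, work_seconds=0.02, wait_timeout=5, hold_for=5)
# Short wait: the first worker runs, the others give up with TimeoutError.
impatient = run_workers(count=3, work_seconds=0.3, wait_timeout=0.05, hold_for=5)
```

In this model the lock never "leaks": waiters either take their turn after a release or time out. A second worker gets in early only when `hold_for` is shorter than the work itself, which is the case the reply above warns about.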

Charles · January 5, 2021, 6:30am

Hi @mythz,

Thank you for the explanation. So if I understand this correctly, the key is the lock timeout. As long as I set this to a value high enough to ensure there is enough time for the task to complete, and assuming any failed tasks will be retried, then we should be good?

For example, I have a task that can take 30 seconds, but I noticed that when the pod gets saturated and a lot of activity is going on, it can take as long as 3 minutes. So if I set the timeout to, say, 5 minutes, then in the event a thread dies, the dead thread will fail to ACK, and the threads waiting will hit the 5-minute timeout, and they too would fail to ACK.

Assuming a retry policy is in place, all should work as expected. Does this sound about right to you?

mythz · January 5, 2021, 6:38am

Don’t know what you mean by “fail to ACK”. Are you talking about releasing the lock?

Automatic retries are for auto-reconnections. When lock timeouts occur they just throw a TimeoutException, which is up to your App to handle.

Charles · January 5, 2021, 7:05am

We are running this on a RabbitMQ message bus, so only when the task is completed does the message get ACKnowledged and thus removed from the queue. Otherwise it dead-letters or runs a retry schedule if configured.

So assuming the retry policy is in place, all those that fail to ACK should get retried, and in this use case, the Redis lock will ensure only one thread can run through at a time.
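One way to picture that flow is a handler that only ACKs after the locked work finishes and hands the message back for retry when the lock wait times out. This is a sketch under assumptions: `handle_message` and its callback parameters are hypothetical, not RabbitMQ or ServiceStack APIs, and `TimeoutError` again stands in for the TimeoutException.

```python
def handle_message(body, acquire, release, process, ack, nack):
    """Process one message under a lock; ACK only on success (hypothetical helper)."""
    try:
        token = acquire()  # assumed to raise TimeoutError if the wait runs out
    except TimeoutError:
        nack()  # no ACK: leave the message to the retry / dead-letter policy
        return "retry"
    try:
        process(body)  # if this raises, the message is never ACKed
    finally:
        release(token)  # always give the lock back
    ack()
    return "acked"


# Usage with stand-in callbacks that record what was called.
log = []
outcome = handle_message(
    "job-1",
    acquire=lambda: "token",
    release=lambda token: log.append(("release", token)),
    process=lambda body: log.append(("process", body)),
    ack=lambda: log.append("ack"),
    nack=lambda: log.append("nack"),
)
```

Releasing in `finally` keeps a failed task from holding the lock until it expires, so the next retry does not have to wait out the full lock timeout.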