Non-blocking lock acquisition failure can "leak" the ephemeral lock node. #732

Open
bmahler opened this issue Nov 12, 2023 · 1 comment · May be fixed by #760

bmahler commented Nov 12, 2023

A bit about our setup for context: we use znodes as representations of work items (typically there are hundreds of work items / znodes present), and we have many workers (e.g. 800) constantly trying to lock one of the work znodes via the Lock class. If a worker obtains the lock, it holds it, performs the work (which takes quite some time), and then releases the lock. The work loop in each worker looks something like this:

from random import shuffle
from time import sleep

# zk_client is a started KazooClient; path is the parent of the work-item znodes.
while True:
    children = zk_client.get_children(path)
    shuffle(children)
    for child in children:
        lock = zk_client.Lock(f"{path}/{child}")
        if not lock.acquire(blocking=False):
            continue
        work_finished = process(child)  # do work to process child (process() is an application-specific helper)
        lock.release()
        if work_finished:
            zk_client.delete(f"{path}/{child}")
    sleep(5)

As you can see, we stress Lock quite heavily: the typical load is something like (800 - 300) idle workers * 300 znodes / 5 seconds == 30,000 lock acquisition attempts per second. The vast majority of the time, when the lock is already held, the failed acquisition attempt creates a temporary (ephemeral) contender znode, and this node then gets deleted, since the lock was not gotten, by calling _best_effort_cleanup(): https://github.com/python-zk/kazoo/blob/2.8.0/kazoo/recipe/lock.py#L219-L220
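
For reference, here is a rough paraphrase of what _best_effort_cleanup() does in kazoo 2.8.0, written as a standalone function rather than the real method (the client / lock_path / contender_node parameters stand in for the lock's own attributes; see the linked source for the actual code):

from kazoo.exceptions import KazooException

def best_effort_cleanup(client, lock_path, contender_node):
    """Try once to delete the contender znode; swallow any kazoo error."""
    try:
        if contender_node:
            client.delete(lock_path + "/" + contender_node)
    except KazooException:
        # Swallowed silently: if the delete fails due to a transient connection
        # problem and the session later resumes, the ephemeral contender node
        # is left behind.
        pass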

However, we have observed that on occasion, the following occurs:

  1. A worker obtains the ephemeral lock node successfully.
  2. Some time later (e.g. seconds), a different worker creates another ephemeral lock node that never gets cleaned up! We know this worker got False back from lock.acquire, because a message is logged whenever lock.acquire succeeds, and in all instances of the issue we never see that log message. This second worker then moves on and processes further iterations of the loop.
  3. This second lock node never gets deleted! Presumably that is because we never experience a session expiration, so ZooKeeper never removes the ephemeral node on its own.

The current theory is that _best_effort_cleanup() hits a KazooException without the session expiring; since it swallows the exception rather than retrying the deletion, hitting it without a session expiration will leak the ephemeral lock znode in exactly the way we've observed.

The fix here seems to be to handle/retry the exception within _best_effort_cleanup(). Should this really be best-effort? Alternatively, is there something we should do on our end? E.g. perhaps the deletion within _best_effort_cleanup() is attempted while the KazooState is SUSPENDED? Let me know if additional info would be helpful.

We use kazoo 2.8.0.

jeblair added a commit to jeblair/kazoo that referenced this issue Feb 26, 2025
If a non-blocking acquire fails to obtain the lock, and then encounters
a KazooException (due to a suspended session), the _best_effort_cleanup method
will swallow the exception and control will return without the lock
contender node being deleted.  If the session resumes (does not expire),
then we will have left a lock contender in place, which will eventually
become an orphaned, stuck lock once the original actor releases it.

To correct this, retry deleting the lock contender in all cases.  Due
to the importance of this, we ignore the supplied timeout (in case the
acquire method was called with a timeout) and retry forever.

Closes: python-zk#732
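
A minimal sketch of the approach the commit message describes, assuming the deletion is wrapped in a KazooRetry configured to retry indefinitely (a standalone illustration under those assumptions, not the actual diff in the linked pull request):

from kazoo.exceptions import NoNodeError
from kazoo.retry import KazooRetry

def cleanup_with_retry(client, lock_path, contender_node):
    """Delete the lock contender node, retrying across connection losses."""
    if not contender_node:
        return
    forever = KazooRetry(max_tries=-1)  # -1 means retry without limit
    try:
        forever(client.delete, lock_path + "/" + contender_node)
    except NoNodeError:
        # Already gone, e.g. the session expired and ZooKeeper removed the
        # ephemeral node on its own.
        pass
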
jeblair linked a pull request Feb 26, 2025 that will close this issue

jeblair (Contributor) commented Feb 26, 2025

I independently came to the same conclusion as the OP about the cause of a stuck lock as well as the solution. I was able to reproduce the issue in a test script by triggering a disconnect immediately before _best_effort_cleanup(). I made a pull request to retry the cleanup routine.
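
For illustration, a rough reproduction along those lines could look like the sketch below. It assumes a local ZooKeeper at 127.0.0.1:2181 and a hypothetical lock path "/work/item-1", and it simulates the disconnect by forcing client.delete to raise ConnectionLoss at the moment the cleanup runs; it is not the test script or pull request referenced above.

from unittest import mock

from kazoo.client import KazooClient
from kazoo.exceptions import ConnectionLoss

# Hypothetical connection string and lock path; adjust for your environment.
holder = KazooClient(hosts="127.0.0.1:2181")
contender = KazooClient(hosts="127.0.0.1:2181")
holder.start()
contender.start()

held = holder.Lock("/work/item-1")
assert held.acquire(blocking=False)  # the lock is now held

lock = contender.Lock("/work/item-1")
# Simulate a connection loss exactly when the failed acquire tries to clean up
# its contender node.
with mock.patch.object(contender, "delete", side_effect=ConnectionLoss):
    assert not lock.acquire(blocking=False)

# The contender's ephemeral node should have been removed, but because the
# (simulated) error was swallowed, it lingers alongside the holder's node:
print(contender.get_children("/work/item-1"))

holder.stop()
contender.stop()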
