Non-blocking lock acquisition failure can "leak" the ephemeral lock node. #732

Open
bmahler opened this issue Nov 12, 2023 · 1 comment · May be fixed by #760

bmahler commented Nov 12, 2023

A bit about our setup for context: we use znodes as representations of work items (typically there are hundreds of work items / znodes present), and we have many workers (e.g. 800) constantly trying to lock one of the work znodes via the Lock class. If a worker obtains the lock, it holds it, performs the work (which takes quite some time), and then releases the lock. The work loop in each worker looks something like this:

from random import shuffle
from time import sleep

# zk_client is a started KazooClient; path is the parent of the work-item znodes.
while True:
    children = zk_client.get_children(path)
    shuffle(children)
    for child in children:
        lock = zk_client.Lock(f"{path}/{child}")
        if not lock.acquire(blocking=False):
            continue
        work_finished = process(child)  # do work to process child (process() is an application-specific helper)
        lock.release()
        if work_finished:
            zk_client.delete(f"{path}/{child}")
    sleep(5)

As you can see, we stress Lock quite heavily: the typical load is something like (800 - 300) idle workers * 300 znodes / 5 seconds == 30,000 lock acquisition attempts per second. The vast majority of the time, when the lock is already held, the failed acquisition attempt creates a temporary (ephemeral) contender znode, and this node then gets deleted, since the lock was not gotten, by calling _best_effort_cleanup(): https://github.com/python-zk/kazoo/blob/2.8.0/kazoo/recipe/lock.py#L219-L220
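
For reference, here is a rough paraphrase of what _best_effort_cleanup() does in kazoo 2.8.0, written as a standalone function rather than the real method (the client / lock_path / contender_node parameters stand in for the lock's own attributes; see the linked source for the actual code):

from kazoo.exceptions import KazooException

def best_effort_cleanup(client, lock_path, contender_node):
    """Try once to delete the contender znode; swallow any kazoo error."""
    try:
        if contender_node:
            client.delete(lock_path + "/" + contender_node)
    except KazooException:
        # Swallowed silently: if the delete fails due to a transient connection
        # problem and the session later resumes, the ephemeral contender node
        # is left behind.
        pass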

However, we have observed that on occasion, the following occurs:

  1. A worker obtains the ephemeral lock node successfully.
  2. Some time later (e.g. seconds), a different worker creates another ephemeral lock node that never gets cleaned up! We know this worker got False back from lock.acquire, because a message is logged whenever lock.acquire succeeds, and in all instances of the issue we never see that log message. This second worker then moves on and processes further iterations of the loop.
  3. This second lock node never gets deleted! Presumably that is because we never experience a session expiration, so ZooKeeper never removes the ephemeral node on its own.

The current theory is that _best_effort_cleanup() hits a KazooException without the session expiring; since it swallows the exception rather than retrying the deletion, hitting it without a session expiration will leak the ephemeral lock znode in exactly the way we've observed.

The fix here seems to be to handle/retry the exception within _best_effort_cleanup(). Should this really be best-effort? Alternatively, is there something we should do on our end? E.g. perhaps the deletion within _best_effort_cleanup() is attempted while the KazooState is SUSPENDED? Let me know if additional info would be helpful.

We use kazoo 2.8.0.

jeblair added a commit to jeblair/kazoo that referenced this issue Feb 26, 2025
If a non-blocking acquire fails to obtain the lock, and then encounters
a KazooException (due to a suspended session), the _best_effort_cleanup method
will swallow the exception and control will return without the lock
contender node being deleted.  If the session resumes (does not expire),
then we will have left a lock contender in place, which will eventually
become an orphaned, stuck lock once the original actor releases it.

To correct this, retry deleting the lock contender in all cases.  Due
to the importance of this, we ignore the supplied timeout (in case the
acquire method was called with a timeout) and retry forever.

Closes: python-zk#732
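
A minimal sketch of the approach the commit message describes, assuming the deletion is wrapped in a KazooRetry configured to retry indefinitely (a standalone illustration under those assumptions, not the actual diff in the linked pull request):

from kazoo.exceptions import NoNodeError
from kazoo.retry import KazooRetry

def cleanup_with_retry(client, lock_path, contender_node):
    """Delete the lock contender node, retrying across connection losses."""
    if not contender_node:
        return
    forever = KazooRetry(max_tries=-1)  # -1 means retry without limit
    try:
        forever(client.delete, lock_path + "/" + contender_node)
    except NoNodeError:
        # Already gone, e.g. the session expired and ZooKeeper removed the
        # ephemeral node on its own.
        pass
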
jeblair linked a pull request Feb 26, 2025 that will close this issue

jeblair (Contributor) commented Feb 26, 2025

I independently came to the same conclusion as the OP about the cause of a stuck lock as well as the solution. I was able to reproduce the issue in a test script by triggering a disconnect immediately before _best_effort_cleanup(). I made a pull request to retry the cleanup routine.
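
For illustration, a rough reproduction along those lines could look like the sketch below. It assumes a local ZooKeeper at 127.0.0.1:2181 and a hypothetical lock path "/work/item-1", and it simulates the disconnect by forcing client.delete to raise ConnectionLoss at the moment the cleanup runs; it is not the test script or pull request referenced above.

from unittest import mock

from kazoo.client import KazooClient
from kazoo.exceptions import ConnectionLoss

# Hypothetical connection string and lock path; adjust for your environment.
holder = KazooClient(hosts="127.0.0.1:2181")
contender = KazooClient(hosts="127.0.0.1:2181")
holder.start()
contender.start()

held = holder.Lock("/work/item-1")
assert held.acquire(blocking=False)  # the lock is now held

lock = contender.Lock("/work/item-1")
# Simulate a connection loss exactly when the failed acquire tries to clean up
# its contender node.
with mock.patch.object(contender, "delete", side_effect=ConnectionLoss):
    assert not lock.acquire(blocking=False)

# The contender's ephemeral node should have been removed, but because the
# (simulated) error was swallowed, it lingers alongside the holder's node:
print(contender.get_children("/work/item-1"))

holder.stop()
contender.stop()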
