Cluster down handling can lose replacement endpoint events while old down work is pending #861

@dkropachev

Description

## Problem

Cluster down handling is asynchronous. If a down event for one endpoint is still in progress and the Host endpoint changes, a later down or forced-down event for the replacement endpoint can be ignored, or overwritten by queued up/down state tied to the old endpoint.

## Impact

The driver can skip cleanup and reconnector handling for the replacement endpoint, or replay an up event that the newer down event should have cancelled.

## Expected fix

Track the endpoint associated with active and pending down handling, queue replacement-endpoint down events separately, and replay only the event that still matches the current endpoint/epoch.
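
The bookkeeping described in the expected fix could look roughly like the sketch below. It is illustrative only and is not the driver's code: `HostDownTracker`, `Event`, `change_endpoint`, `on_down`, `on_up`, and `finish_down` are hypothetical names, endpoints are plain strings, and the "epoch" is assumed to be a counter bumped whenever the host's endpoint changes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Event:
    kind: str      # "up" or "down"
    endpoint: str
    epoch: int     # endpoint epoch at the time the event was queued


@dataclass
class HostDownTracker:
    """Hypothetical per-host bookkeeping; not the driver's actual API."""
    endpoint: str
    epoch: int = 0
    active_down: Optional[Tuple[str, int]] = None
    pending: Dict[str, Event] = field(default_factory=dict)

    def change_endpoint(self, new_endpoint: str) -> None:
        # A new endpoint starts a new epoch, so older queued events go stale.
        if new_endpoint != self.endpoint:
            self.endpoint = new_endpoint
            self.epoch += 1

    def on_down(self, endpoint: str) -> bool:
        """Return True if down handling should start right now."""
        if self.active_down is None:
            self.active_down = (endpoint, self.epoch)
            return True
        if self.active_down == (endpoint, self.epoch):
            return False  # this exact down is already being handled
        # Down for a different endpoint/epoch: queue it under its own key.
        # This also replaces (cancels) any pending up for that endpoint.
        self.pending[endpoint] = Event("down", endpoint, self.epoch)
        return False

    def on_up(self, endpoint: str) -> bool:
        """Return True if the up event can be processed right now."""
        if self.active_down is None:
            return True
        self.pending[endpoint] = Event("up", endpoint, self.epoch)
        return False

    def finish_down(self) -> Optional[Event]:
        """Finish the active down and return the one event to replay, if any."""
        self.active_down = None
        queued, self.pending = self.pending, {}
        event = queued.get(self.endpoint)
        if event is not None and event.epoch == self.epoch:
            if event.kind == "down":
                # The replayed down becomes the new active down.
                self.active_down = (event.endpoint, event.epoch)
            return event
        return None  # nothing queued, or only stale events for old endpoints


# Usage: down on the old endpoint is in flight when the endpoint changes.
tracker = HostDownTracker("10.0.0.1")
tracker.on_down("10.0.0.1")          # starts down handling for the old endpoint
tracker.change_endpoint("10.0.0.2")  # replacement endpoint, epoch becomes 1
tracker.on_up("10.0.0.2")            # queued while the old down is active
tracker.on_down("10.0.0.2")          # queued, cancelling the pending up
replayed = tracker.finish_down()     # the down for 10.0.0.2 is replayed
```

Keying the pending map by endpoint keeps a replacement-endpoint down from being merged into state for the old endpoint, and the epoch check in `finish_down` drops anything queued before the latest endpoint change. A real fix would also need locking around these transitions, which the sketch omits.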

Metadata

Labels: bug
