clobrano commented Oct 15, 2025

Improve etcd recovery test validation to handle cases where a member is promoted from learner to voting member before the test can observe the learner state (due to RHEL-119495).

Key changes:

  • Add hasNodeRebooted() function using BootID comparison to detect node reboot
  • Enhance validateEtcdRecoveryState() to check for a node reboot, which proves that recovery occurred even when the intermediate learner state was not observed
  • Handle three scenarios: single reboot (expected), dual reboot (unexpected), no reboot (genuine failure)
  • Improve test logging with timeout information for better debugging
  • Simplify function signatures by removing hardcoded expectations for the surviving node (it is always expected to be started and non-learner)
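
A minimal sketch of how the three scenarios above could be told apart, assuming a two-node cluster and that the test already knows whether each node rebooted. The names `recoveryOutcome` and `classifyReboots` and the node names are hypothetical and not taken from this PR; the actual checks live in validateEtcdRecoveryState().

```go
package main

import "fmt"

// recoveryOutcome is a hypothetical label for the three scenarios.
type recoveryOutcome string

const (
	singleReboot recoveryOutcome = "single reboot (expected)"
	dualReboot   recoveryOutcome = "dual reboot (unexpected)"
	noReboot     recoveryOutcome = "no reboot (genuine failure)"
)

// classifyReboots maps per-node reboot observations to one of the three
// outcomes. Assumption: the map holds exactly two nodes, keyed by node name.
func classifyReboots(rebooted map[string]bool) (recoveryOutcome, error) {
	if len(rebooted) != 2 {
		return "", fmt.Errorf("expected 2 nodes, got %d", len(rebooted))
	}
	count := 0
	for _, r := range rebooted {
		if r {
			count++
		}
	}
	switch count {
	case 0:
		return noReboot, nil
	case 1:
		return singleReboot, nil
	default:
		return dualReboot, nil
	}
}
```

Counting reboots rather than naming which node rebooted keeps the sketch small; a real test would probably also check that the rebooted node is the one that was disrupted.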

The promotion from learner to voting member can happen faster than the time it takes to establish etcd client connections and query cluster state. Node reboot detection provides proof that disruption and recovery occurred even when intermediate states are missed due to timing.
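
The BootID comparison can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `nodeSnapshot` is a hypothetical stand-in for the Kubernetes Node object (which reports the value in `node.Status.NodeInfo.BootID`), and the signature of `hasNodeRebooted` shown here is assumed.

```go
package main

import "fmt"

// nodeSnapshot is a hypothetical stand-in for the Node fields that matter
// here. In a real cluster the boot ID is read from
// node.Status.NodeInfo.BootID.
type nodeSnapshot struct {
	Name   string
	BootID string
}

// hasNodeRebooted reports whether the boot ID changed between two snapshots
// of the same node. On Linux the boot ID is regenerated on every boot, so a
// different value means the node restarted at least once in between.
// A missing boot ID is reported as an error rather than treated as a reboot.
func hasNodeRebooted(before, after nodeSnapshot) (bool, error) {
	if before.Name != after.Name {
		return false, fmt.Errorf("snapshots are for different nodes: %q vs %q",
			before.Name, after.Name)
	}
	if before.BootID == "" || after.BootID == "" {
		return false, fmt.Errorf("node %q: boot ID missing in snapshot", before.Name)
	}
	return before.BootID != after.BootID, nil
}
```

The "before" snapshot would be taken prior to the disruption and the "after" snapshot once the cluster is healthy again, so the check does not depend on catching the learner state in flight.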

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Oct 15, 2025

openshift-ci bot commented Oct 15, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Oct 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano
Once this PR has been reviewed and has the lgtm label, please assign jeff-roche for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Oct 17, 2025

@clobrano: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/go-verify-deps | 7f73854 | link | true | /test go-verify-deps |

