Skip to content

Conversation

@richard-cox
Copy link
Member

@richard-cox richard-cox commented Nov 11, 2025

Summary

Fixes #15414

Occurred changes and/or fixed issues

This PR revolves around ensuring

  • requests triggered by socket changes that go to a stale replica are handled, multiple requests are tried until an up to date replicate is hit (scenario 2)
  • requests triggered by socket changes from a stale replica are not made (scenario 3)

Requests are made in two places

  • core socket world
    • resource.changes --> request --> response goes to cache
  • pagination wrapper world
    • resource.changes --> watchers --> request --> response stored in wrapper cache

Major Points

  • General
    • Backoff now handles two different methodologies
      • recursive - Call a function, if it fails keep trying but with a delay (aka back off)
      • delay based - Call a function, but if it's recently been called delay execution (aka back off)
    • This PR addresses scenarios 2 and 3 by using the 'recursive' back off when making http request to fetch resources triggered on socket activity
    • The triggers are in two places
      • sockets - ui receives a resource.changes socket msg and needs to directly fetch resources
      • pagination wrappers - only used atm for clusters in side bar. ui receives a resource.changes socket msg, notifies watchers of that type and the wrappers fetch resources
    • findPage response should persist it's revision to store. it will be used to confirm resource.changes revision is newer than our cached
    • findPage in transient mode (does not touch store) should return the revision
  • Scenario 2
    • Ensure all places that make a findPage requests as a result of resource.changes event contain revision. This happens in two places
      • store managed resources - directly in shell/plugins/steve/subscribe.js fetchPageResources
      • wrapper managed resources - shell/utils/pagination-wrapper.ts request
    • Ensure that when we make requests using a revision we back-off retry if failed given invalid revision
  • Scenario 3
    • Ensure that we cache the revision from findPage requests, and abort resource.changes process if the cached revision is newer than the resource.changes revision
  • Other
    • Fixes multiple request to fetch clusters when using side nav to move around
    • Improve TS typing for kube and steve request and response objects
    • Added class SteveRevision to handle weirdness with steve revisions
    • Added and updated lots of unit tests

Technical notes summary

  • Challenges
    • Reproducing scenarios
    • Reproducing combined scenarios
    • Validating scenarios in both places where the solution is wired in (subscribe and pagination wrapper)
  • Alternative
    • Currently there's two places scen 2+3 affect and the backoff recursive process is wired into both. The alternative is to have a single place where we handle the retry in findPage, however it gets very noodle-y and required lots of plumbing. So there's a bit of repetition in both places

Areas or cases that should be tested

There's no easy way to manufacture scenarios 1, 2 and 3. I've being doing this manually by code change. Recording here for posterity

  • code changes
    • scenario 1
      • shell/plugins/steve/subscribe.js:watch. before checking if revision is undefined and for a specific resource set the revision in the watch request to 'abc'. do this for the first 5 times it's hit.
    • scenario 2
      • shell/plugins/dashboard-store/actions.js:findPage. before dispatch('request'addthrow { status: 400, code: 'unknown revision' };`. only call it second to fifth times (the initial call should succeed)
    • scenario 3
      • shell/plugins/steve/subscribe.js:ws.resource.changes. before calling fetchResources and for a specific resource set msg.revision to 0. do this the first 2 times it's hit
  • exercise scenario
    • scenario 1
      • socket based changes
        • set the specific resource to pod
        • nav to pods list. track network requests and console, the backoff should repeat and then succeed
      • pagination wrapper based changes
        • set resource to management.cattle.io.cluster
        • nav to fleet page (there's no other cluster fetch/watch). track network requests and console, the backoff should repeat and then succeed
    • scenario 2
      • socket based changes
        • set the specific resource to pod
        • nav to pods list. in another tab make a change to pods (anything). track network requests and console, the backoff should repeat and then succeed
      • pagination wrapper based changes
        • set resource to management.cattle.io.cluster
        • nav to fleet page (there's no other cluster fetch/watch). in another tab make a change to clusters (for example changing badge or labels). track network requests and console, the backoff should repeat and then succeed
    • scenario 3
      • socket based changes
        • set the specific resource to pod
        • nav to pods list. in another tab make a change to pods (anything). track network console, the backoff should not run (given the socket update contains a version older than the current cache contains)
      • pagination wrapper based changes
        • set resource to management.cattle.io.cluster
        • nav to fleet page (there's no other cluster fetch/watch). track network requests and console, the backoff should not run (given the socket update contains a version older than the current cache contains)

Along with scenarios 2 + 3, plus the original 1, the angles I've looked at

  • combining scenarios
  • Sockets dying and watches are set up again
  • Sockets dying whilst backoffs running
  • Resource watch is stopped whilst backoffs running

Areas which could experience regressions

  • Side bar cluster list (vai on and off) all functionality (pinned split, pinning, automatically updating, searching)

Screenshot/Video

Checklist

  • The PR is linked to an issue and the linked issue has a Milestone, or no issue is needed
  • The PR has a Milestone
  • The PR template has been filled out
  • The PR has been self reviewed
  • The PR has a reviewer assigned
  • The PR has automated tests or clear instructions for manual tests and the linked issue has appropriate QA labels, or tests are not needed
  • The PR has reviewed with UX and tested in light and dark mode, or there are no UX changes
  • The PR has been reviewed in terms of Accessibility
  • The PR has considered, and if applicable tested with, the three Global Roles Admin, Standard User and User Base

@richard-cox richard-cox added this to the v2.14.0 milestone Nov 11, 2025
@richard-cox richard-cox self-assigned this Nov 11, 2025
@richard-cox richard-cox added QA/None ci/skip-e2e Forcibly skip E2E tests in the CI and removed QA/None labels Nov 11, 2025
if ( markedColumn ) {
this._defaultSortBy = markedColumn.name;
descending = markedColumn.defaultSortDescending;
descending = markedColumn.defaultSortDescending || false;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixes a console.warning related to initial sort order being undefined in sortable table

if (!canPagination) {
// Reduce the impact of the initial load, but only if we're not making a request
if (!canPagination || !sideNavServiceInitialized) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix multiple request to fetch clusters when using side nav to move around

  • When changing page the header + top level menu component is recreated. The impact of this is reduced a bit given the TopLevelMenuHelperService singleton, but multiple requests are still made
  • This is due to the immediate on a number of watches in TopLevelMenu triggering new requests
  • This is now fixed by making an initial request up front (only when needed) and not triggering the watches when we don't need to

@richard-cox richard-cox removed the ci/skip-e2e Forcibly skip E2E tests in the CI label Jan 7, 2026
private backOffId: string;
private classify: boolean;
private reactive: boolean;
private cachedRevision?: string;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache revision and result given there might be future calls that want an older result set... which we ignore and pass the current later result set

// 2. current version is older than target revision - reset previous (drop older requests with older revision, use new revision)
// 3. current version is same as target revision - we're retrying

if (currentRevision.isNewerThan(targetRevision)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do both

activeRevisionSt.isNewerThan(targetRevision)
cachedRevisionSt.isNewerThan(targetRevision)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shell/utils/pagination-wrapper.ts

  • needs to split them in to, one is an abort, the other is returning cached value
    shell/plugins/steve/subscribe.js
  • doesn't need to split them (does not need to return a result)

return;
}

if (targetRevision.isNewerThan(activeRevision)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

targetRevision.isNewerThan(currentRevision)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pagination.wrapper needed changing, not this one

id: safeBackOffId,
metadata: { revision },
description: `Fetching resources for ${ type }. Triggered by web socket`,
canFn: () => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

canFn in wrapper world should check if the watch is running

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skipping this

  • check 1 - canBackOff requires reference to this.$socket
  • check 2 - watchStarted check requires original params

- Add tests for SteveRevision handling undefined
- add comments linking two implementations
- better match HA revision checks between implementations
- better backoff logging of revision: undefined
- reduce dupe filter given reuse of opt
@richard-cox richard-cox marked this pull request as ready for review January 12, 2026 15:33
@richard-cox richard-cox requested a review from nwmac January 12, 2026 15:34
@richard-cox richard-cox merged commit f28f762 into rancher:master Jan 14, 2026
65 of 67 checks passed
@richard-cox richard-cox deleted the pagination-ha-2 branch January 14, 2026 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support HA & slow proliferation of cached value - Scenario 2 + 3

2 participants