Support HA & slow proliferation of cached value - Scenario 2 + 3 #15916

richard-cox · 2025-11-11T14:38:21Z

Summary

Fixes #15414

Occurred changes and/or fixed issues

This PR revolves around ensuring

requests triggered by socket changes that go to a stale replica are handled, multiple requests are tried until an up to date replicate is hit (scenario 2)
requests triggered by socket changes from a stale replica are not made (scenario 3)

Requests are made in two places

core socket world
- resource.changes --> request --> response goes to cache
pagination wrapper world
- resource.changes --> watchers --> request --> response stored in wrapper cache

Major Points

General
- Backoff now handles two different methodologies
  - recursive - Call a function, if it fails keep trying but with a delay (aka back off)
  - delay based - Call a function, but if it's recently been called delay execution (aka back off)
- This PR addresses scenarios 2 and 3 by using the 'recursive' back off when making http request to fetch resources triggered on socket activity
- The triggers are in two places
  - sockets - ui receives a resource.changes socket msg and needs to directly fetch resources
  - pagination wrappers - only used atm for clusters in side bar. ui receives a resource.changes socket msg, notifies watchers of that type and the wrappers fetch resources
- findPage response should persist it's revision to store. it will be used to confirm resource.changes revision is newer than our cached
- findPage in transient mode (does not touch store) should return the revision
Scenario 2
- Ensure all places that make a findPage requests as a result of resource.changes event contain revision. This happens in two places
  - store managed resources - directly in shell/plugins/steve/subscribe.js fetchPageResources
  - wrapper managed resources - shell/utils/pagination-wrapper.ts request
- Ensure that when we make requests using a revision we back-off retry if failed given invalid revision
Scenario 3
- Ensure that we cache the revision from findPage requests, and abort resource.changes process if the cached revision is newer than the resource.changes revision
Other
- Fixes multiple request to fetch clusters when using side nav to move around
  - More details inline (https://github.com/rancher/dashboard/pull/15916/changes#r2668150814)
- Improve TS typing for kube and steve request and response objects
- Added class SteveRevision to handle weirdness with steve revisions
- Added and updated lots of unit tests

Technical notes summary

Challenges
- Reproducing scenarios
- Reproducing combined scenarios
- Validating scenarios in both places where the solution is wired in (subscribe and pagination wrapper)
Alternative
- Currently there's two places scen 2+3 affect and the backoff recursive process is wired into both. The alternative is to have a single place where we handle the retry in findPage, however it gets very noodle-y and required lots of plumbing. So there's a bit of repetition in both places

Areas or cases that should be tested

There's no easy way to manufacture scenarios 1, 2 and 3. I've being doing this manually by code change. Recording here for posterity

code changes
- scenario 1
  - shell/plugins/steve/subscribe.js:watch. before checking if revision is undefined and for a specific resource set the revision in the watch request to 'abc'. do this for the first 5 times it's hit.
- scenario 2
  - shell/plugins/dashboard-store/actions.js:findPage. before dispatch('request'addthrow { status: 400, code: 'unknown revision' };`. only call it second to fifth times (the initial call should succeed)
- scenario 3
  - shell/plugins/steve/subscribe.js:ws.resource.changes. before calling fetchResources and for a specific resource set msg.revision to 0. do this the first 2 times it's hit
exercise scenario
- scenario 1
  - socket based changes
    - set the specific resource to pod
    - nav to pods list. track network requests and console, the backoff should repeat and then succeed
  - pagination wrapper based changes
    - set resource to management.cattle.io.cluster
    - nav to fleet page (there's no other cluster fetch/watch). track network requests and console, the backoff should repeat and then succeed
- scenario 2
  - socket based changes
    - set the specific resource to pod
    - nav to pods list. in another tab make a change to pods (anything). track network requests and console, the backoff should repeat and then succeed
  - pagination wrapper based changes
    - set resource to management.cattle.io.cluster
    - nav to fleet page (there's no other cluster fetch/watch). in another tab make a change to clusters (for example changing badge or labels). track network requests and console, the backoff should repeat and then succeed
- scenario 3
  - socket based changes
    - set the specific resource to pod
    - nav to pods list. in another tab make a change to pods (anything). track network console, the backoff should not run (given the socket update contains a version older than the current cache contains)
  - pagination wrapper based changes
    - set resource to management.cattle.io.cluster
    - nav to fleet page (there's no other cluster fetch/watch). track network requests and console, the backoff should not run (given the socket update contains a version older than the current cache contains)

Along with scenarios 2 + 3, plus the original 1, the angles I've looked at

combining scenarios
Sockets dying and watches are set up again
Sockets dying whilst backoffs running
Resource watch is stopped whilst backoffs running

Areas which could experience regressions

Side bar cluster list (vai on and off) all functionality (pinned split, pinning, automatically updating, searching)

Screenshot/Video

Checklist

The PR is linked to an issue and the linked issue has a Milestone, or no issue is needed
The PR has a Milestone
The PR template has been filled out
The PR has been self reviewed
The PR has a reviewer assigned
The PR has automated tests or clear instructions for manual tests and the linked issue has appropriate QA labels, or tests are not needed
The PR has reviewed with UX and tested in light and dark mode, or there are no UX changes
The PR has been reviewed in terms of Accessibility
The PR has considered, and if applicable tested with, the three Global Roles Admin, Standard User and User Base

- fix several issues with multiple side-nav related requests when changing page - lots of comments / descriptions updates

richard-cox · 2026-01-07T11:42:45Z

shell/components/SortableTable/sorting.js

      if ( markedColumn ) {
        this._defaultSortBy = markedColumn.name;
-        descending = markedColumn.defaultSortDescending;
+        descending = markedColumn.defaultSortDescending || false;


this fixes a console.warning related to initial sort order being undefined in sortable table

richard-cox · 2026-01-07T11:56:13Z

shell/components/nav/TopLevelMenu.vue


-    if (!canPagination) {
-      // Reduce the impact of the initial load, but only if we're not making a request
+    if (!canPagination || !sideNavServiceInitialized) {


Fix multiple request to fetch clusters when using side nav to move around

When changing page the header + top level menu component is recreated. The impact of this is reduced a bit given the TopLevelMenuHelperService singleton, but multiple requests are still made

This is due to the immediate on a number of watches in TopLevelMenu triggering new requests

This is now fixed by making an initial request up front (only when needed) and not triggering the watches when we don't need to

richard-cox · 2026-01-07T16:24:22Z

shell/utils/pagination-wrapper.ts

+  private backOffId: string;
  private classify: boolean;
  private reactive: boolean;
+  private cachedRevision?: string;


cache revision and result given there might be future calls that want an older result set... which we ignore and pass the current later result set

richard-cox · 2026-01-09T15:03:47Z

shell/plugins/steve/subscribe.js

+    // 2. current version is older than target revision - reset previous (drop older requests with older revision, use new revision)
+    // 3. current version is same as target revision - we're retrying
+
+    if (currentRevision.isNewerThan(targetRevision)) {


do both

activeRevisionSt.isNewerThan(targetRevision)
cachedRevisionSt.isNewerThan(targetRevision)

shell/utils/pagination-wrapper.ts

needs to split them in to, one is an abort, the other is returning cached value
shell/plugins/steve/subscribe.js

doesn't need to split them (does not need to return a result)

richard-cox · 2026-01-09T15:05:24Z

shell/plugins/steve/subscribe.js

+      return;
+    }
+
+    if (targetRevision.isNewerThan(activeRevision)) {


targetRevision.isNewerThan(currentRevision)

pagination.wrapper needed changing, not this one

richard-cox · 2026-01-09T15:06:39Z

shell/plugins/steve/subscribe.js

+        id:          safeBackOffId,
+        metadata:    { revision },
+        description: `Fetching resources for ${ type }. Triggered by web socket`,
+        canFn:       () => {


canFn in wrapper world should check if the watch is running

skipping this

check 1 - canBackOff requires reference to this.$socket

check 2 - watchStarted check requires original params

- Add tests for SteveRevision handling undefined - add comments linking two implementations - better match HA revision checks between implementations - better backoff logging of revision: undefined - reduce dupe filter given reuse of opt

richard-cox added this to the v2.14.0 milestone Nov 11, 2025

richard-cox self-assigned this Nov 11, 2025

richard-cox added QA/None ci/skip-e2e Forcibly skip E2E tests in the CI and removed QA/None labels Nov 11, 2025

richard-cox added 4 commits December 15, 2025 10:25

WIP

6781b99

very WIP

d3770b5

more WIP

93eef9a

more WIP

d42f064

richard-cox force-pushed the pagination-ha-2 branch from 1b07a61 to d42f064 Compare December 15, 2025 10:28

richard-cox added 10 commits December 15, 2025 16:34

minor refactors, plus move from resource.changes to fetchResources

03774eb

wgtsfsfsdfsfsdfggfgdfgdfg

5debeea

before rework...

714779e

not running, started split of execute and recurse

7b5ef37

scenarios 1-3 tested with socket...on to wrapper

cc7e9b2

started wrapper

9e08244

updates

27c7099

- fix several issues with multiple side-nav related requests when changing page - lots of comments / descriptions updates

all scenarios testdd

496aab9

Merge remote-tracking branch 'upstream/master' into pagination-ha-2

a280931

fix unit tests

68c7be0

richard-cox commented Jan 7, 2026

View reviewed changes

richard-cox added 3 commits January 7, 2026 13:14

tidying up, started adding unit tests

a17fdcf

more tidying up, more unit tests

eb433f2

add tests for backoff

477dd17

richard-cox removed the ci/skip-e2e Forcibly skip E2E tests in the CI label Jan 7, 2026

richard-cox commented Jan 7, 2026

View reviewed changes

tidying up + improving diff

731bc57

richard-cox commented Jan 9, 2026

View reviewed changes

Merge remote-tracking branch 'upstream/master' into pagination-ha-2

f669552

richard-cox force-pushed the pagination-ha-2 branch from c7f6fa8 to 7c78a03 Compare January 12, 2026 15:21

Updates following review

ab025cb

- Add tests for SteveRevision handling undefined - add comments linking two implementations - better match HA revision checks between implementations - better backoff logging of revision: undefined - reduce dupe filter given reuse of opt

richard-cox force-pushed the pagination-ha-2 branch from 7c78a03 to ab025cb Compare January 12, 2026 15:33

richard-cox marked this pull request as ready for review January 12, 2026 15:33

richard-cox requested a review from nwmac January 12, 2026 15:34

nwmac approved these changes Jan 14, 2026

View reviewed changes

richard-cox merged commit f28f762 into rancher:master Jan 14, 2026
65 of 67 checks passed

richard-cox deleted the pagination-ha-2 branch January 14, 2026 14:17

This was referenced Jan 14, 2026

[backport v2.13] Support HA & slow proliferation of cached value - Scenario 2 + 3 #15416

Closed

[backport v2.12] Support HA & slow proliferation of cached value - Scenario 2 + 3 #15415

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support HA & slow proliferation of cached value - Scenario 2 + 3 #15916

Support HA & slow proliferation of cached value - Scenario 2 + 3 #15916

Uh oh!

richard-cox commented Nov 11, 2025 •

edited

Loading

Uh oh!

richard-cox Jan 7, 2026

Uh oh!

richard-cox Jan 7, 2026

Uh oh!

richard-cox Jan 7, 2026

Uh oh!

richard-cox Jan 9, 2026

Uh oh!

richard-cox Jan 12, 2026

Uh oh!

richard-cox Jan 9, 2026

Uh oh!

richard-cox Jan 12, 2026

Uh oh!

richard-cox Jan 9, 2026

Uh oh!

richard-cox Jan 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Support HA & slow proliferation of cached value - Scenario 2 + 3 #15916

Support HA & slow proliferation of cached value - Scenario 2 + 3 #15916

Uh oh!

Conversation

richard-cox commented Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Occurred changes and/or fixed issues

Technical notes summary

Areas or cases that should be tested

Areas which could experience regressions

Screenshot/Video

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

richard-cox commented Nov 11, 2025 •

edited

Loading