fix: ensure navigation sidebar serves fresh data after course publish #38785

Open
wgu-taylor-payne wants to merge 2 commits into openedx:master from WGU-Open-edX:fix/stale-navigation-sidebar

Conversation

@wgu-taylor-payne

@wgu-taylor-payne wgu-taylor-payne commented Jun 19, 2026

Contributor

Description

After a course publish in Studio, the navigation sidebar (CourseNavigationBlocksView) can cache stale block structure data for up to 1 hour. The cache key uses course_version, which changes immediately on publish, so requests during the ~30s window before the async block structure rebuild task completes get a cache miss. The view then reads the still-stale block structure and caches that result under the new key for 1 hour, so all subsequent requests are served stale data.
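
The race can be reproduced in a few lines. This is a minimal sketch that uses plain dictionaries in place of the real caches. The names `nav_cache`, `block_structure`, `publish`, and `rebuild_block_structure`, as well as the key format, are hypothetical stand-ins for illustration and are not edx-platform APIs.

```python
# Hypothetical in-memory stand-ins; none of these names come from edx-platform.
nav_cache = {}                                    # sidebar cache (1 hour TTL in the real view)
block_structure = {"units": ["A", "B"]}           # last output of the async rebuild task
course = {"version": "v1", "units": ["A", "B"]}   # published course state


def get_navigation(course_id="demo"):
    # Old behaviour: the key is derived from course_version.
    key = f"nav:{course_id}:{course['version']}"
    if key not in nav_cache:
        # Cache miss: build from the block structure, which may lag the course.
        nav_cache[key] = list(block_structure["units"])
    return nav_cache[key]


def publish(units, new_version):
    course["units"] = units
    course["version"] = new_version               # changes immediately on publish


def rebuild_block_structure():
    # In production this runs asynchronously, roughly 30s after the publish.
    block_structure["units"] = list(course["units"])


get_navigation()                   # warms the entry for v1
publish(["A"], "v2")               # unit B is deleted
stale = get_navigation()           # miss on the v2 key, old structure gets cached
rebuild_block_structure()
after_rebuild = get_navigation()   # hit on the v2 key, still the stale entry
print(stale, after_rebuild)
```

Both reads return the pre-publish unit list, because the entry written during the rebuild window is never invalidated by the rebuild itself.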

Fix: Replace course_version with a block_structure_version in the navigation sidebar cache key. This version (a UUID stored in cache) only bumps when the async block structure rebuild task actually completes. During the ~30s rebuild window, the old cache entry continues to serve (consistent pre-publish data). Once the rebuild finishes, the version bumps, causing a cache miss that builds and caches fresh data.

Impacted user roles: Learner (sees correct course outline sooner after a publish), Course Author (changes reflected faster).

How it works

  1. update_course_in_cache now sets a block_structure_version UUID in cache after each rebuild
  2. CourseNavigationBlocksView uses this version in its cache key instead of course_version
  3. The cache key only changes when data is actually ready — no more caching stale data for 1 hour
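
The three steps above can be sketched as follows. This is an illustrative model only, assuming a dictionary as the cache: `rebuild_and_bump_version` plays the role of update_course_in_cache, and `version_key`, `get_navigation`, and the key formats are hypothetical names that do not appear in the PR.

```python
import uuid

cache = {}                          # stand-in for the shared cache
course = {"units": ["A", "B"]}      # published course content
block_structure = {"units": []}     # output of the async rebuild


def version_key(course_id):
    return f"block_structure_version:{course_id}"


def rebuild_and_bump_version(course_id):
    """Stand-in for the async task: rebuild first, then bump the version."""
    block_structure["units"] = list(course["units"])
    cache[version_key(course_id)] = str(uuid.uuid4())


def get_navigation(course_id):
    # One extra cache read to find the current block structure version.
    version = cache.get(version_key(course_id), "")
    key = f"nav:{course_id}:{version}"
    if key not in cache:
        cache[key] = list(block_structure["units"])
    return cache[key]


rebuild_and_bump_version("demo")
before = get_navigation("demo")     # caches ["A", "B"]
course["units"] = ["A"]             # publish: unit B deleted
during = get_navigation("demo")     # version unchanged, old entry still served
rebuild_and_bump_version("demo")    # rebuild completes, version bumps
after = get_navigation("demo")      # new key, fresh data
print(before, during, after)
```

The important ordering is inside the rebuild function: the version is written only after the block structure has been updated, so a new key can never be populated from old data.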

Why not synchronous rebuild on the request path?

The initial approach called update_collected_if_needed() on the GET path. As noted in review, block structure collection for large courses can exceed 30s and risks a stampede under concurrent load. The version-based approach avoids any expensive work on the request path — just a single cache.get() for the version.

Performance impact

  • Every request: One additional cache.get() call to read block_structure_version (sub-ms)
  • Cache miss (version bumped): Same as before — get_course_outline_block_tree() runs and result is cached
  • During 30s rebuild window: Cache hits continue (pre-publish data, same version key)
  • No stampede risk: The version bumps only once the rebuild is complete. All users then miss the cache at the same time, which matches the existing behavior on any cache miss
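
To make the per-request cost concrete, here is a small sketch that counts cache reads and outline builds. `CountingCache`, `build_outline`, and the key names are hypothetical; the real view and get_course_outline_block_tree() are not modelled beyond "one expensive build on a miss".

```python
class CountingCache:
    """Dictionary-backed cache that counts get() calls (illustration only)."""

    def __init__(self):
        self.data = {}
        self.gets = 0

    def get(self, key, default=None):
        self.gets += 1
        return self.data.get(key, default)

    def set(self, key, value):
        self.data[key] = value


cache = CountingCache()
builds = 0


def build_outline():
    # Stand-in for the expensive outline build that runs on a cache miss.
    global builds
    builds += 1
    return ["A", "B"]


def get_navigation(course_id):
    version = cache.get(f"block_structure_version:{course_id}", "")  # the extra read
    key = f"nav:{course_id}:{version}"
    tree = cache.get(key)
    if tree is None:
        tree = build_outline()
        cache.set(key, tree)
    return tree


get_navigation("demo")      # cold request: one build
cache.gets = 0
for _ in range(100):
    get_navigation("demo")  # warm requests: two reads each, no builds
print(cache.gets, builds)
```

In this model, 100 warm requests perform 200 cache reads (half of them the added version lookup) and trigger no further builds.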

Deploy considerations

On first deploy, block_structure_version will not exist in cache for any course (returns ""). This causes a one-time cold cache effect (equivalent to a cache flush). Entries repopulate as users arrive. After the next publish per course, the version key is set and the system operates as designed.
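
A rough sketch of the first-deploy behavior, assuming a dictionary cache and made-up key formats (`nav:<course>:<version>` and the version strings are hypothetical):

```python
# Entry written by the pre-deploy code under the old course_version-based key.
cache = {"nav:demo:course-version-1": ["A", "B"]}


def navigation_key(course_id):
    # A missing version reads back as "", as described above.
    version = cache.get(f"block_structure_version:{course_id}", "")
    return f"nav:{course_id}:{version}"


key_after_deploy = navigation_key("demo")
cold_miss = key_after_deploy not in cache        # one-time miss per course

# The first rebuild after deploy stores a version, so the key changes once more.
cache["block_structure_version:demo"] = "3f2a"   # placeholder for a UUID
key_after_publish = navigation_key("demo")
print(key_after_deploy, cold_miss, key_after_publish)
```

Under these assumptions each course sees one miss right after deploy and one more after its next publish, then settles into the steady-state behavior.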

Supporting information

This issue was discovered while testing an internal Open edX instance. After deleting a unit in Studio and refreshing the course in the learning MFE within a few seconds, the deleted unit remained visible in the sidebar for up to 1 hour.

The root cause involves three components:

Testing instructions

Automated:

tutor dev run lms pytest \
  lms/djangoapps/course_home_api/outline/tests/test_view.py::SidebarBlocksTestViews::test_navigation_does_not_cache_stale_data_after_publish \
  --ds=lms.envs.test --no-migrations -x

Manual (requires async Celery):

  1. Open a course in the learning MFE, note a unit in the sidebar
  2. In Studio, delete that unit (auto-publishes)
  3. Immediately refresh the course in the MFE
  4. Wait 30+ seconds, refresh again
  5. Verify the deleted unit is no longer in the sidebar

Without the fix: unit remains visible for up to 1 hour after step 2.
With the fix: unit disappears after step 4 (once the rebuild task completes).

Deadline

None

Other information

  • No database migrations
  • No new dependencies
  • The block_structure_version cache key is set with timeout=None (no expiry) to prevent eviction by default TTL policies. If the cache is flushed, it self-heals on the next publish.
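
The effect of timeout=None can be illustrated with a toy cache. In Django's cache API, passing timeout=None to set() stores the value with no expiry; the `TinyCache` class below is a hypothetical stand-in that mimics only that behavior, with a manually advanced clock and an assumed default TTL of 300 seconds.

```python
DEFAULT_TIMEOUT = 300   # assumed default TTL for this sketch, in seconds
_UNSET = object()       # sentinel so that None can mean "never expire"


class TinyCache:
    """Toy TTL cache with a manually advanced clock (illustration only)."""

    def __init__(self):
        self.now = 0.0
        self._data = {}

    def set(self, key, value, timeout=_UNSET):
        if timeout is _UNSET:
            timeout = DEFAULT_TIMEOUT
        expires = None if timeout is None else self.now + timeout
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        if key not in self._data:
            return default
        value, expires = self._data[key]
        if expires is not None and self.now >= expires:
            del self._data[key]
            return default
        return value


cache = TinyCache()
cache.set("block_structure_version:demo", "3f2a", timeout=None)  # no expiry
cache.set("nav:demo:3f2a", ["A", "B"])                           # default TTL
cache.now = 3600                                                  # one hour later
version = cache.get("block_structure_version:demo", "")
nav = cache.get("nav:demo:3f2a")
print(version, nav)
```

After an hour the navigation entry has expired but the version is still readable, so the next request rebuilds the sidebar under the same key instead of falling back to the empty-version key.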

AI Usage

Used Kiro to aid in root cause discovery, evaluate multiple fix approaches (synchronous rebuild, is_up_to_date gate, version-based key), write the regression test, and draft this PR description.

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Jun 19, 2026
@openedx-webhooks


Thanks for the pull request, @wgu-taylor-payne!

This repository is currently maintained by @openedx/wg-maintenance-openedx-platform.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

@openedx-webhooks openedx-webhooks added the core contributor PR author is a Core Contributor (who may or may not have write access to this repo). label Jun 19, 2026
@github-project-automation github-project-automation Bot moved this to Needs Triage in Contributions Jun 19, 2026
After a course publish in Studio, the CourseNavigationBlocksView can
cache stale block structure data for up to 1 hour. This happens because
the block structure rebuild task runs with a 30-second delay, but the
navigation view may be hit during that window, read the old block
structure from its cache, and store the stale result under the new
course_version key.

The fix adds an update_collected_if_needed() call on cache miss,
ensuring the block structure is fresh before we build and cache the
navigation tree. This only runs on cache misses and adds negligible
overhead for the common case (block structure already up-to-date).
@wgu-taylor-payne wgu-taylor-payne force-pushed the fix/stale-navigation-sidebar branch from 5103ff8 to 4374146 Compare June 19, 2026 03:04
@mariajgrimaldi mariajgrimaldi linked an issue Jun 22, 2026 that may be closed by this pull request

@ormsbee ormsbee left a comment

Contributor


I don't think this is operationally feasible for large courses and high traffic. Let's talk more about other possible mitigations.


if not course_blocks:
    # Ensure the block structure cache is up-to-date before reading.
    get_block_structure_manager(course_key).update_collected_if_needed()

Contributor


Going through the collection phase of a large course can be extremely expensive, which is why it's done asynchronously in celery tasks or management commands (it can often exceed the 30s timeout that many sites use for giving up on web worker requests). Placing it in the GET here also risks causing a stampede if it is a popular course that many concurrent users are trying to access, as parallel workers try to recompute the same collection phase data.

Contributor Author


Thank you, I appreciate this feedback. I'll look into another way of preventing the stale cache.

Contributor Author


@ormsbee I've pushed a new approach where instead of caching on the course version number, we cache on a block structure version which is updated after the block structure has been updated. We keep this block structure version in the cache for each course. I've updated the PR description with more details on this. Any thoughts on this approach?

@ormsbee

ormsbee commented Jun 22, 2026

Contributor

For instance, I think a course's navigation being incorrect for a minute after a deletion is a bad, but not necessarily release-blocking, bug (FYI @crathbun428 and @jmakowski1123, who can weigh in here). If the wrong navigation is getting cached for an hour, then maybe that's the part that we should focus on for this fix.

@jmakowski1123


I agree with Dave, I would not classify a 30-60sec cache as a blocker. But an hour is a bigger problem.

@mphilbrick211 mphilbrick211 added the mao-onboarding Reviewing this will help onboard devs from an Axim mission-aligned organization (MAO). label Jun 23, 2026
@mphilbrick211 mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Jun 23, 2026
@wgu-taylor-payne wgu-taylor-payne force-pushed the fix/stale-navigation-sidebar branch from 2985979 to ed9f8a0 Compare June 24, 2026 17:14
…publish

Replace synchronous update_collected_if_needed() with a version-based
cache key approach. Instead of eagerly rebuilding the block structure on
the request path (expensive, stampede risk), the navigation sidebar cache
key now uses a block_structure_version that only bumps when the async
rebuild task completes.

This ensures stale data is never cached for 1 hour while avoiding any
expensive work on the request path.
@wgu-taylor-payne wgu-taylor-payne force-pushed the fix/stale-navigation-sidebar branch from ed9f8a0 to 4490373 Compare June 24, 2026 17:54
@wgu-taylor-payne wgu-taylor-payne requested a review from ormsbee June 24, 2026 19:36

Labels

  • core contributor: PR author is a Core Contributor (who may or may not have write access to this repo).
  • mao-onboarding: Reviewing this will help onboard devs from an Axim mission-aligned organization (MAO).
  • open-source-contribution: PR author is not from Axim or 2U

Projects

Status: In Eng Review

Development

Successfully merging this pull request may close these issues.

Course outline sidebar content is not synced with Studio

5 participants