fix: ensure navigation sidebar serves fresh data after course publish #38785
wgu-taylor-payne wants to merge 2 commits into
After a course publish in Studio, the CourseNavigationBlocksView can cache stale block structure data for up to 1 hour. This happens because the block structure rebuild task runs with a 30-second delay, but the navigation view may be hit during that window, read the old block structure from its cache, and store the stale result under the new course_version key. The fix adds an update_collected_if_needed() call on cache miss, ensuring the block structure is fresh before we build and cache the navigation tree. This only runs on cache misses and adds negligible overhead for the common case (block structure already up-to-date).
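The race described above can be sketched with a toy model. This is only an illustration under stated assumptions: plain dicts stand in for the real caches, and `get_navigation`, the `"v1"`/`"v2"` version strings, and the course key are hypothetical names, not the view's actual code.

```python
# Toy model of the stale-cache race; dicts stand in for the real caches.
block_structure_cache = {"course-v1:demo": ["unit-1", "unit-2"]}  # rebuilt asynchronously
navigation_cache = {}  # keyed by (course_key, course_version); 1-hour TTL in the real view


def get_navigation(course_key, course_version):
    """Return the cached navigation tree, building it on a cache miss (hypothetical helper)."""
    key = (course_key, course_version)
    if key not in navigation_cache:
        # Cache miss: build from whatever block structure is stored right now.
        navigation_cache[key] = list(block_structure_cache[course_key])
    return navigation_cache[key]


course = "course-v1:demo"
get_navigation(course, "v1")  # warm the cache before the publish

# A publish deletes unit-2: course_version changes immediately,
# but the block structure rebuild is still pending.
stale = get_navigation(course, "v2")  # request lands inside the rebuild window

block_structure_cache[course] = ["unit-1"]  # async rebuild finishes
after_rebuild = get_navigation(course, "v2")  # still served from the stale entry
```

In this model the entry stored under the new version key keeps returning the pre-publish tree even after the rebuild has finished, which mirrors the behavior reported in this PR.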
5103ff8 to 4374146
ormsbee left a comment
I don't think this is operationally feasible for large courses and high traffic. Let's talk more about other possible mitigations.
    if not course_blocks:
        # Ensure the block structure cache is up-to-date before reading.
        get_block_structure_manager(course_key).update_collected_if_needed()
Going through the collection phase of a large course can be extremely expensive, which is why it's done asynchronously in celery tasks or management commands (it can often exceed the 30s timeout that many sites use for giving up on web worker requests). Placing it in the GET here also risks causing a stampede if it is a popular course that many concurrent users are trying to access, as parallel workers try to recompute the same collection phase data.
Thank you, I appreciate this feedback. I'll look into another way of preventing the stale cache.
@ormsbee I've pushed a new approach where instead of caching on the course version number, we cache on a block structure version which is updated after the block structure has been updated. We keep this block structure version in the cache for each course. I've updated the PR description with more details on this. Any thoughts on this approach?
For instance, I think a course's navigation being incorrect for a minute after a deletion is a bad, but not necessarily release-blocking bug (FYI @crathbun428 and @jmakowski1123, who can weigh in here). If the wrong navigation is getting cached for an hour, then maybe that's the part that we should focus on for this fix.

I agree with Dave, I would not classify a 30-60 sec cache as a blocker. But an hour is a bigger problem.
2985979 to ed9f8a0
…publish

Replace synchronous update_collected_if_needed() with a version-based cache key approach. Instead of eagerly rebuilding the block structure on the request path (expensive, stampede risk), the navigation sidebar cache key now uses a block_structure_version that only bumps when the async rebuild task completes. This ensures stale data is never cached for 1 hour while avoiding any expensive work on the request path.
ed9f8a0 to 4490373
Description
After a course publish in Studio, the navigation sidebar (CourseNavigationBlocksView) caches stale block structure data for up to 1 hour. This happens because the cache key uses course_version (which changes immediately on publish), causing a cache miss during the ~30s window before the async block structure rebuild task completes. The view reads the still-stale block structure, caches that stale result under the new key for 1 hour, and all subsequent requests are served stale data.

Fix: Replace course_version with a block_structure_version in the navigation sidebar cache key. This version (a UUID stored in cache) only bumps when the async block structure rebuild task actually completes. During the ~30s rebuild window, the old cache entry continues to serve (consistent pre-publish data). Once the rebuild finishes, the version bumps, causing a cache miss that builds and caches fresh data.

Impacted user roles: Learner (sees correct course outline sooner after a publish), Course Author (changes reflected faster).
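A minimal sketch of the version-based key, assuming a single dict as the shared cache. The key formats and the helpers `version_key`, `finish_rebuild`, and `get_navigation` are hypothetical names for illustration and are not taken from the PR's code.

```python
import uuid

cache = {}  # stand-in for the shared cache
block_structures = {"course-v1:demo": ["unit-1", "unit-2"]}


def version_key(course_key):
    # Hypothetical key format for the per-course version entry.
    return f"block_structure_version:{course_key}"


def finish_rebuild(course_key, blocks):
    """Model the async rebuild task completing: store blocks, then bump the version."""
    block_structures[course_key] = blocks
    cache[version_key(course_key)] = str(uuid.uuid4())


def get_navigation(course_key):
    """Build the navigation cache key from the block structure version."""
    version = cache.get(version_key(course_key), "")  # "" until the first rebuild
    nav_key = f"navigation:{course_key}:{version}"
    if nav_key not in cache:
        cache[nav_key] = list(block_structures[course_key])
    return cache[nav_key]


course = "course-v1:demo"
before = get_navigation(course)     # cached under the current version
# A publish alone does not touch the version, so the old entry keeps serving.
during = get_navigation(course)
finish_rebuild(course, ["unit-1"])  # rebuild completes and bumps the version
after = get_navigation(course)      # new key, so fresh data is built and cached
```

The request path only reads the version and never triggers a rebuild, so the expensive work stays in the asynchronous task.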
How it works
- update_course_in_cache now sets a block_structure_version UUID in cache after each rebuild
- CourseNavigationBlocksView uses this version in its cache key instead of course_version

Why not synchronous rebuild on the request path?
The initial approach called update_collected_if_needed() on the GET path. As noted in review, block structure collection for large courses can exceed 30s and risks a stampede under concurrent load. The version-based approach avoids any expensive work on the request path: just a single cache.get() for the version.

Performance impact
- cache.get() call to read block_structure_version (sub-ms)
- get_course_outline_block_tree() runs and result is cached

Deploy considerations
On first deploy, block_structure_version will not exist in cache for any course (returns ""). This causes a one-time cold cache effect (equivalent to a cache flush). Entries repopulate as users arrive. After the next publish per course, the version key is set and the system operates as designed.

Supporting information
This issue was discovered testing an internal Open edX instance. After deleting a unit in Studio and refreshing the course in the learning MFE within a few seconds, the deleted unit remained visible in the sidebar for up to 1 hour.
The root cause involves three components:
Testing instructions
Automated:
Manual (requires async Celery):
Without the fix: unit remains visible for up to 1 hour after step 2.
With the fix: unit disappears after step 4 (once the rebuild task completes).
Deadline
None
Other information
The block_structure_version cache key is set with timeout=None (no expiry) to prevent eviction by default TTL policies. If the cache is flushed, it self-heals on the next publish.

AI Usage
Used Kiro to aid in root cause discovery, evaluate multiple fix approaches (synchronous rebuild, is_up_to_date gate, version-based key), write the regression test, and draft this PR description.