
[detect_move] fix: reset collection status on repo move to prevent scheduler stall#3802

Open
mn-ram wants to merge 6 commits into augurlabs:main from mn-ram:fix/core-scheduling-blocked-by-move-retries

Conversation

@mn-ram

@mn-ram mn-ram commented Mar 27, 2026

Description

When detect_github_repo_move_core or detect_github_repo_move_secondary detects a 301-redirected repository, the task previously raised Retry() with no arguments. Celery does not forward positional args on a bare Retry, so the task would restart with the original (now-stale) repo_git, crash, and leave core_status stuck in COLLECTING. Once all 40 scheduler slots were occupied by these stalled tasks, augur_collection_monitor stopped dispatching any new Core collection work — requiring manual container restarts to recover.

This PR fixes the root cause by using self.retry(args=[new_url], countdown=0, max_retries=1) so the task restarts immediately with the correct, updated repo_git. All downstream tasks in the chain then run against the new URL. No collection-status columns are modified; the existing on_success handler manages state cleanup as normal.

Fixes #3667

Changes

  • augur/tasks/github/detect_move/tasks.py:
    • Add bind=True to both task decorators so self is available
    • Replace bare pass / Retry() with self.retry(args=[e.new_url], countdown=0, max_retries=1) on RepoMovedException
    • Fall back to Reject if new_url is not present

Notes for Reviewers

self.retry raises celery.exceptions.Retry internally, so the raise is the standard Celery pattern. countdown=0 means the retry is enqueued immediately with no delay. max_retries=1 prevents an infinite loop if the new URL also returns 301.
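The retry flow described above can be sketched with a small stand-in for Celery's bind/retry machinery. Everything below is illustrative: `FakeTask`, `ping`, and the local `Retry` class are toy substitutes for `@celery.task(bind=True)`, the real 301 probe, and `celery.exceptions.Retry`.

```python
class RepoMovedException(Exception):
    """Raised when GitHub answers a 301 redirect for the repo URL."""
    def __init__(self, new_url):
        self.new_url = new_url

class Retry(Exception):
    """Stand-in for celery.exceptions.Retry."""

class FakeTask:
    """Toy harness mimicking a bound Celery task's self.retry()."""
    def __init__(self):
        self.retried_with = None

    def retry(self, args=None, countdown=0, max_retries=1):
        # Celery re-enqueues the task with these args and raises Retry
        # so the current invocation stops here.
        self.retried_with = args
        raise Retry()

def ping(repo_git):
    # Hypothetical probe standing in for the GitHub 301 check.
    raise RepoMovedException("https://github.com/neworg/newrepo")

def detect_github_repo_move_core(self, repo_git):
    try:
        ping(repo_git)
    except RepoMovedException as e:
        # A bare Retry() would restart with the stale repo_git;
        # passing args re-runs the task with the updated URL.
        raise self.retry(args=[e.new_url], countdown=0, max_retries=1)

task = FakeTask()
try:
    detect_github_repo_move_core(task, "https://github.com/oldorg/oldrepo")
except Retry:
    pass
print(task.retried_with)  # ['https://github.com/neworg/newrepo']
```

The key point the sketch captures: the retried invocation receives the new URL as its argument, rather than re-reading the stale one.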

Signed commits

  • Yes, I signed my commits.

@mn-ram mn-ram requested a review from sgoggins as a code owner March 27, 2026 20:07
mn-ram added 2 commits March 28, 2026 01:42
…rying

When detect_github_repo_move_core found a 301-redirected repo, it raised
Retry() which left core_status stuck in COLLECTING. augur_collection_monitor
counts COLLECTING rows against max_repo (40). Once all 40 slots were occupied
by pending retries, the scheduler dispatched no new work until each retry
eventually failed and on_failure reset the status to Error.

Fix ping_github_for_repo_move to reset the hook's status to Pending (no prior
collection) or Success (prior data exists) and clear the task_id before raising
RepoMovedException. Change both detect_github_repo_move_core and
detect_github_repo_move_secondary to raise Reject instead of Retry so the slot
is freed immediately and the next scheduler cycle picks up the repo under its
updated URL without constraint violations.

Fixes augurlabs#3667

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from 7ee8e1d to 5193257 Compare March 27, 2026 20:12
Comment on lines +119 to +132
# Reset status so the scheduler re-queues the repo under the new URL.
status_field = f"{collection_hook}_status"
task_id_field = f"{collection_hook}_task_id"
last_collected_field = f"{collection_hook}_data_last_collected"

statusQuery = session.query(CollectionStatus).filter(CollectionStatus.repo_id == repo.repo_id)
collectionRecord = execute_session_query(statusQuery, 'one')
setattr(collectionRecord, task_id_field, None)
if getattr(collectionRecord, last_collected_field) is not None:
    setattr(collectionRecord, status_field, CollectionState.SUCCESS.value)
else:
    setattr(collectionRecord, status_field, CollectionState.PENDING.value)
session.commit()

Collaborator


I really don't think we should be changing the task status things as a solution to this problem.

The whole reason for calling retry is to cause the task to start over with the updated repo name, so that all the downstream tasks are run on the correct, newly updated repository name.

When a 301 redirect is detected, the repo URL is updated in the database.
Downstream collection tasks can continue under the old URL since GitHub
will redirect remaining API requests, and any new collection requests will
use the updated URL directly.

Removing the retry eliminates the pile-up of COLLECTING slots that blocked
augur_collection_monitor from dispatching new work once all max_repo slots
were occupied by pending retries.

Fixes augurlabs#3667

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram requested a review from MoralCode March 27, 2026 21:16
When a repo returns 301, retry the task immediately with the updated
repo_git so all downstream tasks run against the correct URL.
No collection-status manipulation; on_success handles state cleanup.

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from c66b91c to 3c75770 Compare March 28, 2026 20:32
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from 745c83f to 035fda2 Compare March 28, 2026 20:34
@mn-ram
Author

mn-ram commented Mar 28, 2026

Hi @MoralCode, I've addressed your feedback:

  • Removed all collection-status changes — core.py is untouched by this PR
  • Using self.retry(args=[new_url], countdown=0, max_retries=1) so the task restarts immediately with the correct URL
  • Added unit tests covering the retry and reject paths

Could you take another look when you get a chance? Thanks!

@MoralCode
Collaborator

If the task restarts, does the entire rest of the chain get the new URL?

@mn-ram
Author

mn-ram commented Mar 29, 2026

Yes, the rest of the chain gets the new URL. ping_github_for_repo_move calls update_repo_with_dict(), which updates repo_git in the DB before the exception is raised. So when self.retry(args=[new_url]) re-executes the current task, the DB already has the new URL, and all downstream tasks read from the DB, so they automatically pick it up.

@MoralCode
Collaborator

> …executes the current task, the DB already has the new URL, and all downstream tasks read from the DB, so they automatically pick it up.

I'm fairly sure that many of the tasks would already have been scheduled in celery, potentially before the move detection ever runs. Was there particular documentation you found that led you to your conclusion?

@mn-ram
Author

mn-ram commented Mar 29, 2026

You're right — I investigated further and confirmed the issue.

All downstream collection tasks are built with .si(repo_git) (Celery's immutable signature), so the old URL is baked into their callbacks at chain-construction time. Since detect_github_repo_move_core is the first task in the chain, the downstream tasks are stored as callbacks in the task message (not yet on the broker), so self.retry() only re-ran the detection step while leaving those stale callbacks in place.
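The `.si()` behavior described here can be demonstrated with a pure-Python stand-in for Celery's immutable signature (the `si` helper and `collect_issues` task below are illustrative, not Augur code):

```python
from functools import partial

def si(func, *args):
    # Stand-in for Celery's .si(): an immutable signature freezes its
    # positional args at construction time and ignores parent results.
    return partial(func, *args)

def collect_issues(repo_git):
    return f"collecting issues for {repo_git}"

db = {"repo_git": "https://github.com/oldorg/oldrepo"}

# The chain is assembled before move detection ever runs, so the old
# URL is baked into every downstream signature.
downstream = si(collect_issues, db["repo_git"])

# Move detection later rewrites the DB...
db["repo_git"] = "https://github.com/neworg/newrepo"

# ...but the already-built signature still carries the old URL.
result = downstream()
print(result)  # collecting issues for https://github.com/oldorg/oldrepo
```

This is why retrying only the detection step is not enough: the stale URL lives inside the already-constructed downstream signatures, not in the database.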

Fix applied:

Instead of self.retry(), the handler now:

  1. Sets self.request.chain = None — discards the downstream callbacks before Celery can dispatch them (same pattern already used in task_failed_util).
  2. Resets core_status / secondary_status to PENDING in the DB so the scheduler picks the repo up again in the next cycle with the correct new URL already in the DB.
  3. Returns cleanly.

Updated the test to assert request.chain is None and core_status == PENDING instead of the old retry assertion.
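The three steps above can be sketched as follows. This is a simplified stand-in, not the PR's actual diff: `ping`, `reset_status`, and the dict-backed `status_store` substitute for the real 301 probe and the CollectionStatus DB update, and `SimpleNamespace` fakes the bound task's `self.request`.

```python
from types import SimpleNamespace

PENDING = "Pending"

class RepoMovedException(Exception):
    pass

def ping(repo_git):
    # Hypothetical probe standing in for the GitHub 301 check.
    raise RepoMovedException()

def reset_status(repo_id, hook, status_store):
    # Stand-in for the DB update that clears the task id and sets
    # {hook}_status back to Pending so augur_collection_monitor
    # re-schedules the repo on its next cycle.
    status_store[repo_id][f"{hook}_task_id"] = None
    status_store[repo_id][f"{hook}_status"] = PENDING

def detect_github_repo_move_core(self, repo_git, repo_id, status_store):
    try:
        ping(repo_git)
    except RepoMovedException:
        # Discard the downstream callbacks that still carry the old URL
        # (the same pattern task_failed_util uses), reset the status,
        # and return cleanly.
        self.request.chain = None
        reset_status(repo_id, "core", status_store)
        return

task = SimpleNamespace(request=SimpleNamespace(chain=["stale", "callbacks"]))
store = {1: {"core_status": "Collecting", "core_task_id": "abc123"}}
detect_github_repo_move_core(task, "https://github.com/oldorg/oldrepo", 1, store)
print(task.request.chain, store[1]["core_status"])  # None Pending
```

Setting `request.chain = None` works here because detect_move is the first task in the chain, so the downstream signatures are still callbacks inside the current task message rather than messages already on the broker.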

Downstream collection tasks are built with .si(repo_git) — the URL is
baked into their Celery signatures at chain-construction time. Retrying
only the detect_move task with the new URL left all downstream tasks
queued with the old URL.

Fix: on RepoMovedException, set request.chain = None to discard the
stale downstream callbacks (they are not yet on the broker since
detect_move is the first task), then reset core/secondary_status to
PENDING so the scheduler re-collects the repo with the new URL.

Update test to verify the chain-cancel + PENDING-reset behaviour and
remove the now-incorrect retry assertion.

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from 06b70e2 to bce3ca2 Compare March 29, 2026 09:07

Development

Successfully merging this pull request may close these issues.

all Core tasks stop getting scheduled if there are more than 40 repos that need renaming

2 participants