
[detect_move] fix: reset collection status on repo move to prevent scheduler stall#3802

Open
mn-ram wants to merge 6 commits into augurlabs:main from mn-ram:fix/core-scheduling-blocked-by-move-retries

Conversation

@mn-ram

@mn-ram mn-ram commented Mar 27, 2026

Description

When detect_github_repo_move_core or detect_github_repo_move_secondary detects a 301-redirected repository, the task previously raised Retry() with no arguments. Celery does not forward positional args on a bare Retry, so the task would restart with the original (now-stale) repo_git, crash, and leave core_status stuck in COLLECTING. Once all 40 scheduler slots were occupied by these stalled tasks, augur_collection_monitor stopped dispatching any new Core collection work — requiring manual container restarts to recover.

This PR fixes the root cause by using self.retry(args=[new_url], countdown=0, max_retries=1) so the task restarts immediately with the correct, updated repo_git. All downstream tasks in the chain then run against the new URL. No collection-status columns are modified; the existing on_success handler manages state cleanup as normal.

Fixes #3667

Changes

  • augur/tasks/github/detect_move/tasks.py:
    • Add bind=True to both task decorators so self is available
    • Replace bare pass / Retry() with self.retry(args=[e.new_url], countdown=0, max_retries=1) on RepoMovedException
    • Fall back to Reject if new_url is not present

Notes for Reviewers

self.retry raises celery.exceptions.Retry internally, so the raise is the standard Celery pattern. countdown=0 means the retry is enqueued immediately with no delay. max_retries=1 prevents an infinite loop if the new URL also returns 301.
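The retry flow described above can be sketched with a small stand-in for Celery's bind/retry machinery. Everything below is illustrative: `FakeTask`, `ping`, and the local `Retry` class are toy substitutes for `@celery.task(bind=True)`, the real 301 probe, and `celery.exceptions.Retry`.

```python
class RepoMovedException(Exception):
    """Raised when GitHub answers a 301 redirect for the repo URL."""
    def __init__(self, new_url):
        self.new_url = new_url

class Retry(Exception):
    """Stand-in for celery.exceptions.Retry."""

class FakeTask:
    """Toy harness mimicking a bound Celery task's self.retry()."""
    def __init__(self):
        self.retried_with = None

    def retry(self, args=None, countdown=0, max_retries=1):
        # Celery re-enqueues the task with these args and raises Retry
        # so the current invocation stops here.
        self.retried_with = args
        raise Retry()

def ping(repo_git):
    # Hypothetical probe standing in for the GitHub 301 check.
    raise RepoMovedException("https://github.com/neworg/newrepo")

def detect_github_repo_move_core(self, repo_git):
    try:
        ping(repo_git)
    except RepoMovedException as e:
        # A bare Retry() would restart with the stale repo_git;
        # passing args re-runs the task with the updated URL.
        raise self.retry(args=[e.new_url], countdown=0, max_retries=1)

task = FakeTask()
try:
    detect_github_repo_move_core(task, "https://github.com/oldorg/oldrepo")
except Retry:
    pass
print(task.retried_with)  # ['https://github.com/neworg/newrepo']
```

The key point the sketch captures: the retried invocation receives the new URL as its argument, rather than re-reading the stale one.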

Signed commits

  • Yes, I signed my commits.

@mn-ram mn-ram requested a review from sgoggins as a code owner March 27, 2026 20:07
mn-ram added 2 commits March 28, 2026 01:42
…rying

When detect_github_repo_move_core found a 301-redirected repo, it raised
Retry() which left core_status stuck in COLLECTING. augur_collection_monitor
counts COLLECTING rows against max_repo (40). Once all 40 slots were occupied
by pending retries, the scheduler dispatched no new work until each retry
eventually failed and on_failure reset the status to Error.

Fix ping_github_for_repo_move to reset the hook's status to Pending (no prior
collection) or Success (prior data exists) and clear the task_id before raising
RepoMovedException. Change both detect_github_repo_move_core and
detect_github_repo_move_secondary to raise Reject instead of Retry so the slot
is freed immediately and the next scheduler cycle picks up the repo under its
updated URL without constraint violations.

Fixes augurlabs#3667

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from 7ee8e1d to 5193257 Compare March 27, 2026 20:12
Comment on lines +119 to +132
# Reset status so the scheduler re-queues the repo under the new URL.
status_field = f"{collection_hook}_status"
task_id_field = f"{collection_hook}_task_id"
last_collected_field = f"{collection_hook}_data_last_collected"

statusQuery = session.query(CollectionStatus).filter(CollectionStatus.repo_id == repo.repo_id)
collectionRecord = execute_session_query(statusQuery, 'one')
setattr(collectionRecord, task_id_field, None)
if getattr(collectionRecord, last_collected_field) is not None:
    setattr(collectionRecord, status_field, CollectionState.SUCCESS.value)
else:
    setattr(collectionRecord, status_field, CollectionState.PENDING.value)
session.commit()

Collaborator


I really don't think we should be changing the task status things as a solution to this problem.

The whole reason for calling retry is to cause the task to start over with the updated repo name, so that all the downstream tasks are run on the correct, newly updated repository name.

When a 301 redirect is detected, the repo URL is updated in the database.
Downstream collection tasks can continue under the old URL since GitHub
will redirect remaining API requests, and any new collection requests will
use the updated URL directly.

Removing the retry eliminates the pile-up of COLLECTING slots that blocked
augur_collection_monitor from dispatching new work once all max_repo slots
were occupied by pending retries.

Fixes augurlabs#3667

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram requested a review from MoralCode March 27, 2026 21:16
When a repo returns 301, retry the task immediately with the updated
repo_git so all downstream tasks run against the correct URL.
No collection-status manipulation; on_success handles state cleanup.

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from c66b91c to 3c75770 Compare March 28, 2026 20:32
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from 745c83f to 035fda2 Compare March 28, 2026 20:34
@mn-ram
Author

mn-ram commented Mar 28, 2026

Hi @MoralCode, I've addressed your feedback:

  • Removed all collection-status changes — core.py is untouched by this PR
  • Using self.retry(args=[new_url], countdown=0, max_retries=1) so the task restarts immediately with the correct URL
  • Added unit tests covering the retry and reject paths

Could you take another look when you get a chance? Thanks!

@MoralCode
Collaborator

If the task restarts, does the entire rest of the chain get the new URL?

@mn-ram
Author

mn-ram commented Mar 29, 2026

Yes, the rest of the chain gets the new URL. ping_github_for_repo_move calls update_repo_with_dict(), which updates repo_git in the DB before the exception is raised. So when self.retry(args=[new_url]) re-executes the current task, the DB already has the new URL, and all downstream tasks read from the DB, so they automatically pick it up.

@MoralCode
Collaborator

> …executes the current task, the DB already has the new URL, and all downstream tasks read from the DB, so they automatically pick it up.

I'm fairly sure that many of the tasks would already have been scheduled in celery, potentially before the move detection ever runs. Was there particular documentation you found that led you to your conclusion?

@mn-ram
Author

mn-ram commented Mar 29, 2026

You're right — I investigated further and confirmed the issue.

All downstream collection tasks are built with .si(repo_git) (Celery's immutable signature), so the old URL is baked into their callbacks at chain-construction time. Since detect_github_repo_move_core is the first task in the chain, the downstream tasks are stored as callbacks in the task message (not yet on the broker), so self.retry() only re-ran the detection step while leaving those stale callbacks in place.
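The `.si()` behavior described here can be demonstrated with a pure-Python stand-in for Celery's immutable signature (the `si` helper and `collect_issues` task below are illustrative, not Augur code):

```python
from functools import partial

def si(func, *args):
    # Stand-in for Celery's .si(): an immutable signature freezes its
    # positional args at construction time and ignores parent results.
    return partial(func, *args)

def collect_issues(repo_git):
    return f"collecting issues for {repo_git}"

db = {"repo_git": "https://github.com/oldorg/oldrepo"}

# The chain is assembled before move detection ever runs, so the old
# URL is baked into every downstream signature.
downstream = si(collect_issues, db["repo_git"])

# Move detection later rewrites the DB...
db["repo_git"] = "https://github.com/neworg/newrepo"

# ...but the already-built signature still carries the old URL.
result = downstream()
print(result)  # collecting issues for https://github.com/oldorg/oldrepo
```

This is why retrying only the detection step is not enough: the stale URL lives inside the already-constructed downstream signatures, not in the database.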

Fix applied:

Instead of self.retry(), the handler now:

  1. Sets self.request.chain = None — discards the downstream callbacks before Celery can dispatch them (same pattern already used in task_failed_util).
  2. Resets core_status / secondary_status to PENDING in the DB so the scheduler picks the repo up again in the next cycle with the correct new URL already in the DB.
  3. Returns cleanly.

Updated the test to assert request.chain is None and core_status == PENDING instead of the old retry assertion.
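The three steps above can be sketched as follows. This is a simplified stand-in, not the PR's actual diff: `ping`, `reset_status`, and the dict-backed `status_store` substitute for the real 301 probe and the CollectionStatus DB update, and `SimpleNamespace` fakes the bound task's `self.request`.

```python
from types import SimpleNamespace

PENDING = "Pending"

class RepoMovedException(Exception):
    pass

def ping(repo_git):
    # Hypothetical probe standing in for the GitHub 301 check.
    raise RepoMovedException()

def reset_status(repo_id, hook, status_store):
    # Stand-in for the DB update that clears the task id and sets
    # {hook}_status back to Pending so augur_collection_monitor
    # re-schedules the repo on its next cycle.
    status_store[repo_id][f"{hook}_task_id"] = None
    status_store[repo_id][f"{hook}_status"] = PENDING

def detect_github_repo_move_core(self, repo_git, repo_id, status_store):
    try:
        ping(repo_git)
    except RepoMovedException:
        # Discard the downstream callbacks that still carry the old URL
        # (the same pattern task_failed_util uses), reset the status,
        # and return cleanly.
        self.request.chain = None
        reset_status(repo_id, "core", status_store)
        return

task = SimpleNamespace(request=SimpleNamespace(chain=["stale", "callbacks"]))
store = {1: {"core_status": "Collecting", "core_task_id": "abc123"}}
detect_github_repo_move_core(task, "https://github.com/oldorg/oldrepo", 1, store)
print(task.request.chain, store[1]["core_status"])  # None Pending
```

Setting `request.chain = None` works here because detect_move is the first task in the chain, so the downstream signatures are still callbacks inside the current task message rather than messages already on the broker.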

Downstream collection tasks are built with .si(repo_git) — the URL is
baked into their Celery signatures at chain-construction time. Retrying
only the detect_move task with the new URL left all downstream tasks
queued with the old URL.

Fix: on RepoMovedException, set request.chain = None to discard the
stale downstream callbacks (they are not yet on the broker since
detect_move is the first task), then reset core/secondary_status to
PENDING so the scheduler re-collects the repo with the new URL.

Update test to verify the chain-cancel + PENDING-reset behaviour and
remove the now-incorrect retry assertion.

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from 06b70e2 to bce3ca2 Compare March 29, 2026 09:07

Development

Successfully merging this pull request may close these issues.

all Core tasks stop getting scheduled if there are more than 40 repos that need renaming

2 participants