[detect_move] fix: reset collection status on repo move to prevent scheduler stall#3802
[detect_move] fix: reset collection status on repo move to prevent scheduler stall#3802mn-ram wants to merge 6 commits intoaugurlabs:mainfrom
Conversation
…rying When detect_github_repo_move_core found a 301-redirected repo, it raised Retry() which left core_status stuck in COLLECTING. augur_collection_monitor counts COLLECTING rows against max_repo (40). Once all 40 slots were occupied by pending retries, the scheduler dispatched no new work until each retry eventually failed and on_failure reset the status to Error. Fix ping_github_for_repo_move to reset the hook's status to Pending (no prior collection) or Success (prior data exists) and clear the task_id before raising RepoMovedException. Change both detect_github_repo_move_core and detect_github_repo_move_secondary to raise Reject instead of Retry so the slot is freed immediately and the next scheduler cycle picks up the repo under its updated URL without constraint violations. Fixes augurlabs#3667 Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
7ee8e1d to
5193257
Compare
| # Reset status so the scheduler re-queues the repo under the new URL. | ||
| status_field = f"{collection_hook}_status" | ||
| task_id_field = f"{collection_hook}_task_id" | ||
| last_collected_field = f"{collection_hook}_data_last_collected" | ||
|
|
||
| statusQuery = session.query(CollectionStatus).filter(CollectionStatus.repo_id == repo.repo_id) | ||
| collectionRecord = execute_session_query(statusQuery, 'one') | ||
| setattr(collectionRecord, task_id_field, None) | ||
| if getattr(collectionRecord, last_collected_field) is not None: | ||
| setattr(collectionRecord, status_field, CollectionState.SUCCESS.value) | ||
| else: | ||
| setattr(collectionRecord, status_field, CollectionState.PENDING.value) | ||
| session.commit() | ||
|
|
There was a problem hiding this comment.
I really dont think we should be changing the task status things as a solution to this problem.
the whole reason for calling retry is to cause the task to start over with the updated repo name so that all the downstream tasks are run on the correct/newly updated repository name
When a 301 redirect is detected, the repo URL is updated in the database. Downstream collection tasks can continue under the old URL since GitHub will redirect remaining API requests, and any new collection requests will use the updated URL directly. Removing the retry eliminates the pile-up of COLLECTING slots that blocked augur_collection_monitor from dispatching new work once all max_repo slots were occupied by pending retries. Fixes augurlabs#3667 Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
When a repo returns 301, retry the task immediately with the updated repo_git so all downstream tasks run against the correct URL. No collection-status manipulation; on_success handles state cleanup. Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
c66b91c to
3c75770
Compare
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
745c83f to
035fda2
Compare
|
Hi @MoralCode, I've addressed your feedback:
Could you take another look when you get a chance? Thanks! |
|
If the task resterts, does the entire rest of the chain get the new URL? |
|
Yes, the rest of the chain gets the new URL. |
I'm fairly sure that many of the tasks would already have been scheduled in celery, potentially before the move detection ever runs. Was there particular documentation you found that led you to your conclusion? |
|
You're right — I investigated further and confirmed the issue. All downstream collection tasks are built with Fix applied: Instead of
Updated the test to assert |
Downstream collection tasks are built with .si(repo_git) — the URL is baked into their Celery signatures at chain-construction time. Retrying only the detect_move task with the new URL left all downstream tasks queued with the old URL. Fix: on RepoMovedException, set request.chain = None to discard the stale downstream callbacks (they are not yet on the broker since detect_move is the first task), then reset core/secondary_status to PENDING so the scheduler re-collects the repo with the new URL. Update test to verify the chain-cancel + PENDING-reset behaviour and remove the now-incorrect retry assertion. Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
06b70e2 to
bce3ca2
Compare
Description
When
detect_github_repo_move_coreordetect_github_repo_move_secondarydetects a 301-redirected repository, the task previously raisedRetry()with no arguments. Celery does not forward positional args on a bareRetry, so the task would restart with the original (now-stale)repo_git, crash, and leavecore_statusstuck inCOLLECTING. Once all 40 scheduler slots were occupied by these stalled tasks,augur_collection_monitorstopped dispatching any new Core collection work — requiring manual container restarts to recover.This PR fixes the root cause by using
self.retry(args=[new_url], countdown=0, max_retries=1)so the task restarts immediately with the correct, updatedrepo_git. All downstream tasks in the chain then run against the new URL. No collection-status columns are modified; the existingon_successhandler manages state cleanup as normal.Fixes #3667
Changes
augur/tasks/github/detect_move/tasks.py:bind=Trueto both task decorators soselfis availablepass/Retry()withself.retry(args=[e.new_url], countdown=0, max_retries=1)onRepoMovedExceptionRejectifnew_urlis not presentNotes for Reviewers
self.retryraisescelery.exceptions.Retryinternally, so theraiseis the standard Celery pattern.countdown=0means the retry is enqueued immediately with no delay.max_retries=1prevents an infinite loop if the new URL also returns 301.Signed commits