Fix SageMakerTransformOperator succeeding on a failed deferred job#69042
Open
steveahnahn wants to merge 1 commit into
Open
Fix SageMakerTransformOperator succeeding on a failed deferred job#69042steveahnahn wants to merge 1 commit into
steveahnahn wants to merge 1 commit into
Conversation
A deferred SageMaker transform job that fails during the wait was reported as a
successful task: execute_complete did not check the trigger event status, so the
failed job's description was pushed as the result and downstream tasks ran against
missing or invalid transform output with no error surfaced. Since the trigger was
migrated to AwsBaseWaiterTrigger it yields a {"status": "error"} event on failure
instead of raising, so the status guard that every sibling SageMaker operator has
became load-bearing here too.
SameerMesiah97
left a comment
Contributor
There was a problem hiding this comment.
Approved pending green CI.
I think this PR exposes an underlying issue in AwsBaseWaiterTrigger.run(); the state machine for the SageMaker transform job has been simpliefied to just 2 possible states i.e "success" and "error", when I would imagine there can be more states for failure, cancellation, timeout etc. However, fixing that would involve overriding the parent run method with a customized run method in SageMakerTrigger, which is beyond the scope of this PR. A follow-up PR exploring this may be justified.
SameerMesiah97
approved these changes
Jun 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
SageMakerTransformOperatorwithdeferrable=True(which is the default whenever a deployment sets[operators] default_deferrable=True) can mark a task successful even though the SageMaker batch-transform job failed. The failed job's description is pushed as the XCom result, so downstream tasks run against missing/invalid transform output with no error surfaced.Root cause
The operator defers while the job is still
InProgress, so the job can fail during the deferred wait. The trigger isSageMakerTrigger, which since #68927 inheritsAwsBaseWaiterTrigger.run(). On waiter failure that base yieldsTriggerEvent({"status": "error", ...})instead of raising — soexecute_completeis resumed with an error event.SageMakerTransformOperator.execute_completenever checked the status: it logged"SageMaker job ... completed."and returnedserialize_result(...), so the task succeeded.Every other SageMaker operator's
execute_completealready guards this (if status != "success": raise). Transform was the only one missing it. Before #68927 the previous trigger raised on failure, so the missing guard was dormant; migrating to the yielding base class turned it into a live silent-success.Fix
Reject a non-success status in
SageMakerTransformOperator.execute_complete, matching the sibling operators, and add a regression test (a deferred transform that fails during the wait now fails the task instead of reporting success).Note on the exception type: the sibling operators raise
AirflowException, but thecheck-no-new-airflow-exceptionshook forbids newraise AirflowExceptionusages, so this raisesRuntimeError(identical task-failure/retry semantics, and already used elsewhere in this provider, e.g.dms.py).Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.8) following the guidelines