Skip to content

Fix SageMakerTransformOperator succeeding on a failed deferred job#69042

Open
steveahnahn wants to merge 1 commit into
apache:mainfrom
steveahnahn:fix-sagemaker-transform-deferrable-silent-success
Open

Fix SageMakerTransformOperator succeeding on a failed deferred job#69042
steveahnahn wants to merge 1 commit into
apache:mainfrom
steveahnahn:fix-sagemaker-transform-deferrable-silent-success

Conversation

@steveahnahn

Copy link
Copy Markdown
Contributor

Problem

SageMakerTransformOperator with deferrable=True (which is the default whenever a deployment sets [operators] default_deferrable=True) can mark a task successful even though the SageMaker batch-transform job failed. The failed job's description is pushed as the XCom result, so downstream tasks run against missing/invalid transform output with no error surfaced.

Root cause

The operator defers while the job is still InProgress, so the job can fail during the deferred wait. The trigger is SageMakerTrigger, which since #68927 inherits AwsBaseWaiterTrigger.run(). On waiter failure that base yields TriggerEvent({"status": "error", ...}) instead of raising — so execute_complete is resumed with an error event. SageMakerTransformOperator.execute_complete never checked the status: it logged "SageMaker job ... completed." and returned serialize_result(...), so the task succeeded.

Every other SageMaker operator's execute_complete already guards this (if status != "success": raise). Transform was the only one missing it. Before #68927 the previous trigger raised on failure, so the missing guard was dormant; migrating to the yielding base class turned it into a live silent-success.

Fix

Reject a non-success status in SageMakerTransformOperator.execute_complete, matching the sibling operators, and add a regression test (a deferred transform that fails during the wait now fails the task instead of reporting success).

Note on the exception type: the sibling operators raise AirflowException, but the check-no-new-airflow-exceptions hook forbids new raise AirflowException usages, so this raises RuntimeError (identical task-failure/retry semantics, and already used elsewhere in this provider, e.g. dms.py).


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.8)

Generated-by: Claude Code (Opus 4.8) following the guidelines

A deferred SageMaker transform job that fails during the wait was reported as a
successful task: execute_complete did not check the trigger event status, so the
failed job's description was pushed as the result and downstream tasks ran against
missing or invalid transform output with no error surfaced. Since the trigger was
migrated to AwsBaseWaiterTrigger it yields a {"status": "error"} event on failure
instead of raising, so the status guard that every sibling SageMaker operator has
became load-bearing here too.
@boring-cyborg boring-cyborg Bot added area:providers provider:amazon AWS/Amazon - related issues labels Jun 26, 2026
@steveahnahn steveahnahn marked this pull request as ready for review June 26, 2026 17:35
@steveahnahn steveahnahn requested a review from o-nikolas as a code owner June 26, 2026 17:35

@SameerMesiah97 SameerMesiah97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved pending green CI.

I think this PR exposes an underlying issue in AwsBaseWaiterTrigger.run(); the state machine for the SageMaker transform job has been simpliefied to just 2 possible states i.e "success" and "error", when I would imagine there can be more states for failure, cancellation, timeout etc. However, fixing that would involve overriding the parent run method with a customized run method in SageMakerTrigger, which is beyond the scope of this PR. A follow-up PR exploring this may be justified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:amazon AWS/Amazon - related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants