Skip to content

Add sagemaker_ftp tool to find available SageMaker training plan offe…#1134

Open
aravneelaws wants to merge 1 commit into
mainfrom
sagemaker_find_ftp
Open

Add sagemaker_ftp tool to find available SageMaker training plan offe…#1134
aravneelaws wants to merge 1 commit into
mainfrom
sagemaker_find_ftp

Conversation

@aravneelaws

@aravneelaws aravneelaws commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Purpose

Adds a small Bash helper that wraps aws sagemaker search-training-plan-offerings
to make it easy to discover available Flexible Training Plan (FTP) capacity in
a specific Availability Zone, for either HyperPod clusters (Slurm or EKS) or
SageMaker training jobs.

The SearchTrainingPlanOfferings API filters strictly on --duration-hours
and only returns offerings of that exact duration, with no "show me everything"
mode. To answer the common customer question "is there capacity for X
instances of Y in AZ Z?", users today have to manually call the API once per
duration and merge the results. This tool automates that sweep, filters by
AZ, ranks the matches by effective $/instance/hour, and prints a
ready-to-run create-training-plan command for the cheapest option.

Changes

  • New directory: 2.ami_and_containers/tools/sagemaker_ftp/
  • New file: find-ftp.sh — Bash script with MIT-0 SPDX header
    • Required flags: --az, --instance-type, --count, --region
    • Optional flags: --start-after, --profile, --target (hyperpod-cluster default, or training-job), --durations, --json, -h/--help
    • Sweeps default durations 24, 48, 72, 168, 336, 720 hours (overridable)
    • Filters returned offerings to the requested AZ
    • Computes $/instance/hour = UpfrontFee / (DurationHours * InstanceCount)
    • Prints a table (text mode) or structured JSON (--json)
    • Recommends the lowest $/instance/hour offering and prints, but never executes, the corresponding create-training-plan command
    • Validates inputs and surfaces clear errors for missing args, invalid --count, invalid --target, missing aws/jq
  • New file: README.md — usage, prerequisites, examples (default target, training-job target, custom durations, JSON), interpretation guide, troubleshooting table, and references to the relevant AWS docs

Test Plan

Environment:

  • AWS Service: SageMaker (SearchTrainingPlanOfferings API)
  • Instance type: ml.p5.48xlarge, ml.p5en.48xlarge, ml.p6-b200.48xlarge, ml.p6-b300.48xlarge
  • Number of nodes: 1–64 (search-only; no resources reserved)
  • Region: us-west-2

Test commands:

# Help text
./find-ftp.sh --help

# Default target (hyperpod-cluster)
./find-ftp.sh --az us-west-2b --instance-type ml.p5.48xlarge \
  --count 2 --region us-west-2 --start-after 2026-06-15T00:00:00Z

# training-job target
./find-ftp.sh --az us-west-2b --instance-type ml.p5.48xlarge \
  --count 2 --region us-west-2 --start-after 2026-06-15T00:00:00Z \
  --target training-job

# Custom durations
./find-ftp.sh --az us-west-2b --instance-type ml.p5.48xlarge \
  --count 2 --region us-west-2 --durations 24,168

# JSON output
./find-ftp.sh --az us-west-2b --instance-type ml.p5.48xlarge \
  --count 2 --region us-west-2 --json | jq '.recommendation'

# Validation: missing required arg
./find-ftp.sh --az us-west-2b --instance-type ml.p5.48xlarge

# Validation: invalid --target
./find-ftp.sh --az us-west-2b --instance-type ml.p5.48xlarge \
  --count 2 --region us-west-2 --target nonsense

Test Results

Scenario Outcome
--help Prints full usage with all flags including --target, --durations, --json
Default target, ml.p5.48xlarge x2, sweep Returned 9 offerings in us-west-2b; cheapest at $38.14/inst/hr (24 h plan)
--target training-job, same parameters Returned a different offering (different tpo-... ID and price), confirming the flag actually switches API target rather than reusing HyperPod results
--durations 24,168 Sweep limited to those two durations; 3 matches returned
--json Valid JSON to stdout, status messages on stderr; shape: {query, count, offerings, recommendation}
Missing required arg Exit 1 with ERROR: missing required argument(s): ... and --help hint
Invalid --target nonsense Exit 1 with ERROR: --target must be 'hyperpod-cluster' or 'training-job'
Per-duration ResourceLimitExceeded (count above account quota) Silently skipped for that duration (matches the "no offering at this size" semantics)

The script is pure search/read — it never calls CreateTrainingPlan. Purchases require the user to copy and run the printed command themselves.

Directory Structure

This PR is a tooling addition under 2.ami_and_containers/tools/, not a test
case under 3.test_cases/, so the test-case directory layout in the template
does not apply here. Layout added:

└── tools/
    └── sagemaker_ftp/
        ├── find-ftp.sh
        └── README.md

Checklist

  • I have read the contributing guidelines.
  • I am working against the latest main branch.
  • I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • The contribution is self-contained with documentation and scripts.
  • External dependencies are pinned to a specific version or tag (no latest). (not applicable)
  • A README is included or updated with prerequisites, instructions, and known issues.
  • New test cases follow the expected directory structure. (N/A — this is a tool, not a test case)

@KeitaW KeitaW left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 1/3 — Correctness (runtime behavior)

This is a well-built, self-contained helper: set -euo pipefail, the MIT-0 header in the right place, thorough argument validation, prerequisite checks, and a genuinely careful README. The findings below are mostly about runtime behavior under the script's own stated prerequisites — the first one is a likely hard failure in the default output mode.

2 inline comments below.

Comment on lines +329 to +330
(.start | strftime("%Y-%m-%d %H:%M")),
(.end | strftime("%Y-%m-%d %H:%M"))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jq strftime will error on AWS CLI v2's ISO-8601 timestamps, breaking the default text output

Observation: The text-mode table (lines 329–330) and the recommendation block (line 346) format times with .start | strftime(...) / .end | strftime(...), where .start/.end come straight from ReservedCapacityOfferings[].StartTime/.EndTime.
Impact: SearchTrainingPlanOfferings returns those fields as timestamps (API referenceStartTime/EndTime: number), and AWS CLI v2 — which the README requires — reformats all timestamps to ISO-8601 strings by default (cli_timestamp_format=iso8601; CLI v2 migration changes). jq's strftime needs a broken-down time, not an ISO string: echo '"2026-06-15T11:30:00Z"' | jq 'strftime("%Y-%m-%d")'jq: error: strftime/1 requires parsed datetime inputs (verified live with jq 1.6, 2026-06-13). Because this runs in the main flow under set -euo pipefail, the default text mode aborts whenever offerings are found — and only then, so a "no offerings" smoke test passes and hides it. (--json mode is unaffected; it passes the timestamps through untouched.)
Suggestion: Normalize before formatting so it works whether the CLI emits ISO-8601 (v2 default) or epoch (wire). For the table (lines 329–330):

Suggested change
(.start | strftime("%Y-%m-%d %H:%M")),
(.end | strftime("%Y-%m-%d %H:%M"))
(((.start | fromdateiso8601?) // (.start | tonumber)) | strftime("%Y-%m-%d %H:%M")),
(((.end | fromdateiso8601?) // (.end | tonumber)) | strftime("%Y-%m-%d %H:%M"))

The same ((. | fromdateiso8601?) // (. | tonumber)) | strftime(...) wrapper is needed at line 346 (the recommendation Window: line) — not one-click-applyable from here since it's a separate range. I wasn't able to run the live API (no credentials in this environment) — could you confirm against a real call with default CLI v2 config? This is the one finding I'd most want verified before merge.

duration_h: .DurationHours,
upfront: (.UpfrontFee | tonumber),
currency: .CurrencyCode,
az: .ReservedCapacityOfferings[0].AvailabilityZone,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AZ / window can be read from a non-matching reserved-capacity offering

Observation: The filter keeps an offering if any(.ReservedCapacityOfferings[]; .AvailabilityZone == $az) (lines 254–255), but then reads az/start/end from .ReservedCapacityOfferings[0] unconditionally (lines 259–261).
Impact: ReservedCapacityOfferings is an array (API reference). When an offering has more than one reserved-capacity entry and [0] is not the AZ that matched, the displayed AZ and start/end window belong to a different entry than the one the user filtered for — a quietly wrong result, not an error.
Suggestion: Select the matching entry first, e.g. (.ReservedCapacityOfferings[] | select(.AvailabilityZone == $az)) (take the first match) and read az/start/end from it. UNVERIFIED whether multi-entry offerings occur in practice for single-AZ FTP — but the code shouldn't depend on [0] being the match.

@KeitaW KeitaW left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 2/3 — Robustness & UX

2 inline comment(s) below.

--target-resources "$TARGET_RESOURCE" \
--output json 2>/dev/null); then
# API errored (e.g. ResourceLimitExceeded for that duration). Skip silently.
continue

@KeitaW KeitaW Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All AWS errors are swallowed, not just ResourceLimitExceeded

Observation: The per-duration call is if ! RESP=$(aws ... 2>/dev/null); then continue; fi (lines 541–551), and the README documents this as the quota-skip path.
Impact: 2>/dev/null + continue swallows every failure identically — expired credentials, throttling (ThrottlingException), an invalid region, a service outage — all surface to the user as the benign "No offerings found", which then steers them toward the wrong fixes (try another AZ/count/region). The AccessDeniedException row in the README's own troubleshooting table can't actually be reached, because the error text is discarded.
Suggestion: Capture stderr and distinguish the expected skip from real errors — e.g. keep continue only when the message matches ResourceLimitExceeded, otherwise surface it (a log line, or fail). Keeps the script's fail-loud behavior consistent.

" Duration: " + (.duration_h | tostring) + " h",
" AZ: " + .az,
" Upfront fee: " + (.upfront | tostring) + " " + .currency,
" Effective: $" + (((.per_inst_hr * 100 | floor) / 100) | tostring) + " / instance / hour",

@KeitaW KeitaW Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Recommendation" ranks $/instance/hour across durations, which can favor a far larger total commitment

Observation: The recommended offering is the global minimum $/instance/hour across all probed durations (line 345 and the sort_by(.per_inst_hr) | .[0] logic).
Impact: Per-hour rate often falls as duration rises, so the "recommendation" can be a 720 h / 30-day plan when the user probed durations because they wanted ~24 h — a much larger non-refundable upfront fee for a lower headline rate. Given the prominent (correct) non-refundable warnings, the single-line "Recommendation" framing undersells the total-cost difference.
Suggestion: Either surface upfront/total alongside the per-hour rate in the recommendation, or note that it optimizes per-hour rate only and the user should weigh total commitment. A normalized headline number shouldn't oversell. Not blocking.

@KeitaW KeitaW left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 3/3 — Nits, Sources & What Looks Great

1 inline comment(s) below.

Sources

  • SearchTrainingPlanOfferings response shape (StartTime/EndTime type number; UpfrontFee type string — the script's | tonumber on it is correct; ReservedCapacityOfferings is an array): https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_SearchTrainingPlanOfferings.html (verified 2026-06-13)
  • AWS CLI v2 defaults timestamps to ISO-8601 (cli_timestamp_format): https://docs.aws.amazon.com/cli/latest/userguide/cliv2-migration-changes.html (verified 2026-06-13)
  • jq 1.6 strftime errors on an ISO-8601 string, accepts a raw epoch number: verified live, 2026-06-13.
  • Directory placement 2.ami_and_containers/tools/sagemaker_ftp/ is consistent with the existing tools/ec2md/ sibling — verified live against origin/main, 2026-06-13 (so no directory-layout finding).
  • MIT-0 header: the .sh correctly carries it; sibling repo READMEs (1.architectures/**/README.md) do not carry MIT-0 headers, so the missing README header is consistent with repo practice and is not flagged (verified live, 2026-06-13).

Things That Look Great

  • Disciplined shell. set -euo pipefail, MIT-0 header with the shebang correctly placed before it, positive-integer validation on --count and --durations, an explicit --target allow-list, and command -v prerequisite checks for aws and jq. This is exactly the repo's shell convention done right.
  • --json keeps stdout clean. Status messages go to stderr via the log() helper, so ... --json | jq composes correctly — a genuinely thoughtful touch.
  • Honest, prominent non-refundable warnings. The script prints the create-training-plan command rather than running it, and says so in three places. Right call for an irreversible, money-spending API.
  • UpfrontFee | tonumber correctly handles that the API returns the fee as a string — easy to miss.
  • The README is unusually complete — prerequisites, IAM actions, an empty-results playbook, AZ-name-vs-AZ-ID caveat, and a troubleshooting table. Closes most of the questions a first-time user would file.

Summary: One likely hard failure (Batch 1, the strftime/CLI-v2 interaction — please confirm against a live call), two should-fix robustness items, and two nits. The structure and documentation are strong; with the timestamp handling fixed this is close to merge-ready.

.id,
(.duration_h | tostring),
(.upfront | tostring),
((.per_inst_hr * 100 | floor) / 100 | tostring),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Price display truncates instead of rounding

Observation: ((.per_inst_hr * 100 | floor) / 100) (lines 328 and 345) truncates toward zero.
Impact: $38.149/hr displays as $38.14, not $38.15 — cosmetic, but a price the user reads when comparing close offerings. Suggestion: round instead of floor (jq 1.6+ has round):

Suggested change
((.per_inst_hr * 100 | floor) / 100 | tostring),
((.per_inst_hr * 100 | round) / 100 | tostring),

The same floorround applies at line 345 (the recommendation Effective: line). Nit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants