Add sagemaker_ftp tool to find available SageMaker training plan offe…#1134
Add sagemaker_ftp tool to find available SageMaker training plan offe…#1134aravneelaws wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Review Batch 1/3 — Correctness (runtime behavior)
This is a well-built, self-contained helper: set -euo pipefail, the MIT-0 header in the right place, thorough argument validation, prerequisite checks, and a genuinely careful README. The findings below are mostly about runtime behavior under the script's own stated prerequisites — the first one is a likely hard failure in the default output mode.
2 inline comments below.
| (.start | strftime("%Y-%m-%d %H:%M")), | ||
| (.end | strftime("%Y-%m-%d %H:%M")) |
There was a problem hiding this comment.
jq strftime will error on AWS CLI v2's ISO-8601 timestamps, breaking the default text output
Observation: The text-mode table (lines 329–330) and the recommendation block (line 346) format times with .start | strftime(...) / .end | strftime(...), where .start/.end come straight from ReservedCapacityOfferings[].StartTime/.EndTime.
Impact: SearchTrainingPlanOfferings returns those fields as timestamps (API reference — StartTime/EndTime: number), and AWS CLI v2 — which the README requires — reformats all timestamps to ISO-8601 strings by default (cli_timestamp_format=iso8601; CLI v2 migration changes). jq's strftime needs a broken-down time, not an ISO string: echo '"2026-06-15T11:30:00Z"' | jq 'strftime("%Y-%m-%d")' → jq: error: strftime/1 requires parsed datetime inputs (verified live with jq 1.6, 2026-06-13). Because this runs in the main flow under set -euo pipefail, the default text mode aborts whenever offerings are found — and only then, so a "no offerings" smoke test passes and hides it. (--json mode is unaffected; it passes the timestamps through untouched.)
Suggestion: Normalize before formatting so it works whether the CLI emits ISO-8601 (v2 default) or epoch (wire). For the table (lines 329–330):
| (.start | strftime("%Y-%m-%d %H:%M")), | |
| (.end | strftime("%Y-%m-%d %H:%M")) | |
| (((.start | fromdateiso8601?) // (.start | tonumber)) | strftime("%Y-%m-%d %H:%M")), | |
| (((.end | fromdateiso8601?) // (.end | tonumber)) | strftime("%Y-%m-%d %H:%M")) |
The same ((. | fromdateiso8601?) // (. | tonumber)) | strftime(...) wrapper is needed at line 346 (the recommendation Window: line) — not one-click-applyable from here since it's a separate range. I wasn't able to run the live API (no credentials in this environment) — could you confirm against a real call with default CLI v2 config? This is the one finding I'd most want verified before merge.
| duration_h: .DurationHours, | ||
| upfront: (.UpfrontFee | tonumber), | ||
| currency: .CurrencyCode, | ||
| az: .ReservedCapacityOfferings[0].AvailabilityZone, |
There was a problem hiding this comment.
AZ / window can be read from a non-matching reserved-capacity offering
Observation: The filter keeps an offering if any(.ReservedCapacityOfferings[]; .AvailabilityZone == $az) (lines 254–255), but then reads az/start/end from .ReservedCapacityOfferings[0] unconditionally (lines 259–261).
Impact: ReservedCapacityOfferings is an array (API reference). When an offering has more than one reserved-capacity entry and [0] is not the AZ that matched, the displayed AZ and start/end window belong to a different entry than the one the user filtered for — a quietly wrong result, not an error.
Suggestion: Select the matching entry first, e.g. (.ReservedCapacityOfferings[] | select(.AvailabilityZone == $az)) (take the first match) and read az/start/end from it. UNVERIFIED whether multi-entry offerings occur in practice for single-AZ FTP — but the code shouldn't depend on [0] being the match.
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 2/3 — Robustness & UX
2 inline comment(s) below.
| --target-resources "$TARGET_RESOURCE" \ | ||
| --output json 2>/dev/null); then | ||
| # API errored (e.g. ResourceLimitExceeded for that duration). Skip silently. | ||
| continue |
There was a problem hiding this comment.
All AWS errors are swallowed, not just ResourceLimitExceeded
Observation: The per-duration call is if ! RESP=$(aws ... 2>/dev/null); then continue; fi (lines 541–551), and the README documents this as the quota-skip path.
Impact: 2>/dev/null + continue swallows every failure identically — expired credentials, throttling (ThrottlingException), an invalid region, a service outage — all surface to the user as the benign "No offerings found", which then steers them toward the wrong fixes (try another AZ/count/region). The AccessDeniedException row in the README's own troubleshooting table can't actually be reached, because the error text is discarded.
Suggestion: Capture stderr and distinguish the expected skip from real errors — e.g. keep continue only when the message matches ResourceLimitExceeded, otherwise surface it (a log line, or fail). Keeps the script's fail-loud behavior consistent.
| " Duration: " + (.duration_h | tostring) + " h", | ||
| " AZ: " + .az, | ||
| " Upfront fee: " + (.upfront | tostring) + " " + .currency, | ||
| " Effective: $" + (((.per_inst_hr * 100 | floor) / 100) | tostring) + " / instance / hour", |
There was a problem hiding this comment.
"Recommendation" ranks $/instance/hour across durations, which can favor a far larger total commitment
Observation: The recommended offering is the global minimum $/instance/hour across all probed durations (line 345 and the sort_by(.per_inst_hr) | .[0] logic).
Impact: Per-hour rate often falls as duration rises, so the "recommendation" can be a 720 h / 30-day plan when the user probed durations because they wanted ~24 h — a much larger non-refundable upfront fee for a lower headline rate. Given the prominent (correct) non-refundable warnings, the single-line "Recommendation" framing undersells the total-cost difference.
Suggestion: Either surface upfront/total alongside the per-hour rate in the recommendation, or note that it optimizes per-hour rate only and the user should weigh total commitment. A normalized headline number shouldn't oversell. Not blocking.
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 3/3 — Nits, Sources & What Looks Great
1 inline comment(s) below.
Sources
SearchTrainingPlanOfferingsresponse shape (StartTime/EndTimetypenumber;UpfrontFeetypestring— the script's| tonumberon it is correct;ReservedCapacityOfferingsis an array): https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_SearchTrainingPlanOfferings.html (verified 2026-06-13)- AWS CLI v2 defaults timestamps to ISO-8601 (
cli_timestamp_format): https://docs.aws.amazon.com/cli/latest/userguide/cliv2-migration-changes.html (verified 2026-06-13) - jq 1.6
strftimeerrors on an ISO-8601 string, accepts a raw epoch number: verified live, 2026-06-13. - Directory placement
2.ami_and_containers/tools/sagemaker_ftp/is consistent with the existingtools/ec2md/sibling — verified live againstorigin/main, 2026-06-13 (so no directory-layout finding). - MIT-0 header: the
.shcorrectly carries it; sibling repo READMEs (1.architectures/**/README.md) do not carry MIT-0 headers, so the missing README header is consistent with repo practice and is not flagged (verified live, 2026-06-13).
Things That Look Great
- Disciplined shell.
set -euo pipefail, MIT-0 header with the shebang correctly placed before it, positive-integer validation on--countand--durations, an explicit--targetallow-list, andcommand -vprerequisite checks forawsandjq. This is exactly the repo's shell convention done right. --jsonkeeps stdout clean. Status messages go to stderr via thelog()helper, so... --json | jqcomposes correctly — a genuinely thoughtful touch.- Honest, prominent non-refundable warnings. The script prints the
create-training-plancommand rather than running it, and says so in three places. Right call for an irreversible, money-spending API. UpfrontFee | tonumbercorrectly handles that the API returns the fee as a string — easy to miss.- The README is unusually complete — prerequisites, IAM actions, an empty-results playbook, AZ-name-vs-AZ-ID caveat, and a troubleshooting table. Closes most of the questions a first-time user would file.
Summary: One likely hard failure (Batch 1, the strftime/CLI-v2 interaction — please confirm against a live call), two should-fix robustness items, and two nits. The structure and documentation are strong; with the timestamp handling fixed this is close to merge-ready.
| .id, | ||
| (.duration_h | tostring), | ||
| (.upfront | tostring), | ||
| ((.per_inst_hr * 100 | floor) / 100 | tostring), |
There was a problem hiding this comment.
Price display truncates instead of rounding
Observation: ((.per_inst_hr * 100 | floor) / 100) (lines 328 and 345) truncates toward zero.
Impact: $38.149/hr displays as $38.14, not $38.15 — cosmetic, but a price the user reads when comparing close offerings. Suggestion: round instead of floor (jq 1.6+ has round):
| ((.per_inst_hr * 100 | floor) / 100 | tostring), | |
| ((.per_inst_hr * 100 | round) / 100 | tostring), |
The same floor→round applies at line 345 (the recommendation Effective: line). Nit.
Purpose
Adds a small Bash helper that wraps
aws sagemaker search-training-plan-offeringsto make it easy to discover available Flexible Training Plan (FTP) capacity in
a specific Availability Zone, for either HyperPod clusters (Slurm or EKS) or
SageMaker training jobs.
The
SearchTrainingPlanOfferingsAPI filters strictly on--duration-hoursand only returns offerings of that exact duration, with no "show me everything"
mode. To answer the common customer question "is there capacity for X
instances of Y in AZ Z?", users today have to manually call the API once per
duration and merge the results. This tool automates that sweep, filters by
AZ, ranks the matches by effective
$/instance/hour, and prints aready-to-run
create-training-plancommand for the cheapest option.Changes
2.ami_and_containers/tools/sagemaker_ftp/find-ftp.sh— Bash script with MIT-0 SPDX header--az,--instance-type,--count,--region--start-after,--profile,--target(hyperpod-clusterdefault, ortraining-job),--durations,--json,-h/--help24, 48, 72, 168, 336, 720hours (overridable)$/instance/hour = UpfrontFee / (DurationHours * InstanceCount)--json)$/instance/houroffering and prints, but never executes, the correspondingcreate-training-plancommand--count, invalid--target, missingaws/jqREADME.md— usage, prerequisites, examples (default target, training-job target, custom durations, JSON), interpretation guide, troubleshooting table, and references to the relevant AWS docsTest Plan
Environment:
SearchTrainingPlanOfferingsAPI)ml.p5.48xlarge,ml.p5en.48xlarge,ml.p6-b200.48xlarge,ml.p6-b300.48xlargeus-west-2Test commands:
Test Results
--help--target,--durations,--jsonml.p5.48xlarge x2, sweepus-west-2b; cheapest at$38.14/inst/hr(24 h plan)--target training-job, same parameterstpo-...ID and price), confirming the flag actually switches API target rather than reusing HyperPod results--durations 24,168--json{query, count, offerings, recommendation}ERROR: missing required argument(s): ...and--helphint--target nonsenseERROR: --target must be 'hyperpod-cluster' or 'training-job'ResourceLimitExceeded(count above account quota)The script is pure search/read — it never calls CreateTrainingPlan. Purchases require the user to copy and run the printed command themselves.
Directory Structure
This PR is a tooling addition under 2.ami_and_containers/tools/, not a test
case under 3.test_cases/, so the test-case directory layout in the template
does not apply here. Layout added:
Checklist
mainbranch.latest). (not applicable)