PyTorch2FlyDSL benchmark robustness + cost estimation + model bump #268

Open
peyron-amd wants to merge 5 commits into main from feat/translation-bench-median


Summary

Robustness + accounting improvements to the PyTorch→FlyDSL translation harness, plus a model default bump. Scope is limited to translate.py and its config (4 commits, ~209/49 lines).

  • Median-over-N latency: replace single-shot CUDA timing with a median over bench_iters (30) timed passes, run after bench_warmup (10) warmup passes, to cut measurement noise.
  • Configurable PyTorch reference mode: reference_mode = eager | compile | compile_fallback (default compile_fallback), so the speedup baseline is PyTorch at its best.
  • Cost persistence: derive token usage from the agent trajectory and write translation_cost_usd into translation_result.json using configurable per-Mtok rates (defaults = public Opus pricing).
  • Always record PyTorch reference latency, even when the candidate fails correctness (candidate latency/speedup still gated on a passing candidate).
  • Config hardening: pop bench/reference keys before constructing the agent so they don't reach TranslationAgentConfig.__init__ (fixes an unexpected-kwarg crash).
  • Default model → claude-opus-4.8.

All settings are driven from mini_kernel_pytorch_to_flydsl.yaml (no env vars).

…ce mode

Replace the translation harness's single timed forward (after 3 warmups) with
a median over N timed passes using CUDA events (no Triton), to remove the
run-to-run speedup noise. Configured via the existing translation YAML
agent: section (bench_warmup=10, bench_iters=30, reference_mode), with no new
env vars; bench_iters defaults to the shared optimization constant
DEFAULT_EVAL_BENCHMARK_ITERATIONS when omitted so the two stages can't drift.
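
A minimal sketch of the median-over-N timing described above, not the code in translate.py. The names `median_latency_ms` and `_time_once_ms` are hypothetical, and the wall-clock fallback is an addition so the sketch also runs on machines without torch or a CUDA device; the harness itself is described as using CUDA events.

```python
import statistics
import time


def _time_once_ms(fn):
    """Time a single call of fn in milliseconds.

    Uses CUDA events when torch and a CUDA device are available;
    otherwise falls back to time.perf_counter (assumption, for portability).
    """
    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:
        use_cuda = False

    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()  # wait until both events have completed
        return start.elapsed_time(end)  # elapsed_time reports milliseconds

    t0 = time.perf_counter()
    fn()
    return (time.perf_counter() - t0) * 1e3


def median_latency_ms(fn, warmup=10, iters=30):
    """Run `warmup` untimed passes, then return the median of `iters` timed passes."""
    for _ in range(warmup):
        fn()
    samples = [_time_once_ms(fn) for _ in range(iters)]
    return statistics.median(samples)
```

The median is less sensitive than a single sample (or the mean) to an occasional slow pass, which is the run-to-run noise this commit targets.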

reference_mode (reference only; candidate unchanged): compile_fallback (default,
torch.compile then fall back to eager on failure - PyTorch at its best),
compile, or eager (reproduces historical numbers). Print/parse/speedup
contracts preserved.
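
The three reference modes could be sketched roughly as follows. This is an illustration under assumptions, not the harness code: `build_reference`, `example_inputs`, and the injectable `compile_fn` are hypothetical names, and the trial call is included because torch.compile compiles lazily, so a failure may only surface on the first forward.

```python
VALID_MODES = ("eager", "compile", "compile_fallback")


def build_reference(model, example_inputs, mode="compile_fallback", compile_fn=None):
    """Return the callable to use as the PyTorch reference for the given mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown reference_mode: {mode!r}")
    if mode == "eager":
        return model  # reproduces the historical eager baseline

    if compile_fn is None:
        import torch  # imported only when compilation is actually requested
        compile_fn = torch.compile

    try:
        compiled = compile_fn(model)
        compiled(*example_inputs)  # trigger compilation now, not during timing
        return compiled
    except Exception:
        if mode == "compile":
            raise  # strict mode: surface the compile failure
        return model  # compile_fallback: fall back to eager
```

Only the reference goes through this selection; the candidate path is unchanged, as the commit message notes.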
translation_result.json now records the reference latency, spend, and token usage regardless of outcome:
- translation_pytorch_latency_ms is always set when the harness prints it,
  even when the candidate fails correctness (parsed before the success/fail
  branch; candidate latency + speedup stay success-only since they're
  meaningless for an incorrect kernel).
- translation_cost_usd / translation_tokens / translation_model_calls /
  translation_cost_rates_per_mtok aggregated from the round trajectories
  (input/output/cache read+write), priced with configurable per-Mtok rates
  (model: cost_per_mtok_*, default public Claude Opus rates).
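
The cost aggregation above amounts to summing token counts across rounds and multiplying by per-Mtok rates. A hedged sketch follows; the token-kind keys, the function names, and the rates in the usage example are all illustrative assumptions, not the trajectory schema or the actual Opus pricing.

```python
def aggregate_tokens(rounds):
    """Sum per-round token counts (dicts keyed by token kind) into one total."""
    totals = {}
    for usage in rounds:
        for kind, count in usage.items():
            totals[kind] = totals.get(kind, 0) + count
    return totals


def translation_cost_usd(tokens, rates_per_mtok):
    """Price token totals with per-million-token rates, in USD."""
    return sum(
        tokens.get(kind, 0) * rate for kind, rate in rates_per_mtok.items()
    ) / 1_000_000
```

For example, with made-up rates:

```python
rounds = [
    {"input": 120_000, "output": 30_000, "cache_read": 600_000, "cache_write": 100_000},
    {"input": 80_000, "output": 20_000, "cache_read": 400_000},
]
rates = {"input": 10.0, "output": 50.0, "cache_read": 1.0, "cache_write": 12.5}  # illustrative only
tokens = aggregate_tokens(rounds)
cost = translation_cost_usd(tokens, rates)  # 6.75 with these numbers
```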
…agent

bench_warmup/bench_iters/reference_mode live in the agent: YAML section but are
translation-harness settings, not agent fields. run_translation_agent splats
agent_config into TranslationAgentConfig(**kwargs), so reading them with .get()
left them in the dict and crashed every round with
"TranslationAgentConfig.__init__() got an unexpected keyword argument". Use
.pop() to consume them before the agent config is built.
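
The failure mode is easy to reproduce in isolation. In this sketch the dataclass is a stand-in for the real TranslationAgentConfig (its fields are hypothetical), `split_harness_settings` is a hypothetical helper, and the default of 30 for bench_iters stands in for DEFAULT_EVAL_BENCHMARK_ITERATIONS.

```python
from dataclasses import dataclass


@dataclass
class TranslationAgentConfig:  # stand-in; the real class has other fields
    model_name: str = "claude-opus-4.8"
    max_rounds: int = 3  # hypothetical field


# Harness-only keys and illustrative defaults.
HARNESS_DEFAULTS = {
    "bench_warmup": 10,
    "bench_iters": 30,
    "reference_mode": "compile_fallback",
}


def split_harness_settings(agent_config):
    """Pop harness-only keys out of agent_config (mutating it) and return them."""
    return {
        key: agent_config.pop(key, default)
        for key, default in HARNESS_DEFAULTS.items()
    }
```

With `.get()` the harness keys would stay in the dict and `TranslationAgentConfig(**agent_config)` would raise TypeError; after popping, only agent fields remain:

```python
agent_config = {"model_name": "claude-opus-4.8", "bench_warmup": 10, "reference_mode": "eager"}
harness = split_harness_settings(agent_config)
agent = TranslationAgentConfig(**agent_config)  # no unexpected-kwarg error
```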
All translation-bench arms now run on claude-opus-4.8 by default
(verified accepted by the amd_llm gateway via cond48/med48 runs).

run_translation parsed the PyTorch reference latency with re.search but
re was never imported in that scope (the file uses function-local
imports), so the always-record-reference-latency path raised NameError
at runtime and ruff flagged F821. Add a local import and apply ruff
format to the cost helper + bench log line.
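
A sketch of the fixed parsing path, with the import placed inside the function as the commit describes. The helper name and the log line format (`PyTorch latency: <n> ms`) are assumptions for illustration; the harness's actual print contract is not shown in this PR description.

```python
def parse_reference_latency_ms(stdout):
    """Return the PyTorch reference latency in ms, or None if it was not printed."""
    import re  # function-local import; omitting it is what raised NameError

    match = re.search(r"PyTorch latency:\s*([0-9]+(?:\.[0-9]+)?)\s*ms", stdout)
    return float(match.group(1)) if match else None
```

Because this runs before the success/fail branch, the reference latency can be recorded even when the candidate fails correctness.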