Skip to content

Feat/zero overhead latency#22

Open
sinarafati-amd wants to merge 1 commit into
mainfrom
feat/zero-overhead-latency
Open

Feat/zero overhead latency#22
sinarafati-amd wants to merge 1 commit into
mainfrom
feat/zero-overhead-latency

Conversation

@sinarafati-amd

Copy link
Copy Markdown
Collaborator
  • New Latency stage: wall-clock via CUDA/HIP-graph subprocess + kernel-only via rocprofv3 (HIP reuses existing pmc_perf.csv for free).
  • New magpie.bench (Python) + magpie_bench.hpp (C++) helpers; subprocess harness isolates Python/JIT/dispatch overhead, seeded for reproducibility.
  • Unified JSON report adds wall_median_ms, kernel_median_ms, dispatch_overhead_us, crosscheck_vs_rocprof_ratio.
  • Backward-compatible (opt-out via latency: {enabled: false});

@sinarafati-amd sinarafati-amd requested a review from haofrank April 29, 2026 06:09
@sinarafati-amd sinarafati-amd marked this pull request as ready for review April 30, 2026 14:25
@irvineoy

irvineoy commented May 7, 2026

Copy link
Copy Markdown
Collaborator

I found a few issues worth addressing before merge (excluding the current merge conflicts):

  1. primary_metric=kernel_median_ms can silently report wall-clock latency.

LatencyResult.get_primary_value() falls back to wall_stats when kernel_stats is missing, and _run_both() can return success=True when only the wall-clock path succeeds. That means a config asking for primary_metric: kernel_median_ms may still get a non-null primary_value_ms backed by wall latency if the kernel-trace side fails or produces no stats.

This is risky for compare/autotuning, because the result looks like a successful kernel-only ranking metric. I think this should either fail the latency result when the requested primary metric is unavailable, or report primary_value_ms: null with a clear warning instead of falling back to wall latency.

  1. Latency appears to be enabled by default, which can break existing configs.

LatencyConfig.enabled defaults to True, so existing PyTorch/Triton/CUDA workflows that do not define a bench_target and do not emit a MAGPIE_LATENCY_JSON marker can now end up with latency_state=FAILED even though they never opted into the new latency stage.

For backward compatibility, it would be safer to default this to disabled, or auto-skip latency unless the user explicitly configures a latency target/method.

  1. The prebuilt HIP example binary should not be committed.

examples/simple_hip_test/vector_add_bench is a checked-in prebuilt binary. This is not portable across systems/architectures and can be stale relative to vector_add_bench.hip. In testing, the committed binary failed on the MI300 machine, while rebuilding from the .hip source worked.

Can we remove the binary from the PR and rely on compile_command / docs to build it locally instead?

  1. The relative pythonpath in the Triton examples is resolved from the runner working directory.

For example, examples/simple_triton_test/analyze_triton_latency.yaml sets:

working_dir: "examples/simple_triton_test"
latency:
  pythonpath:
    - "examples/simple_triton_test"

But the latency runner subprocess uses cwd=kernel_cfg.working_dir, so that relative PYTHONPATH becomes effectively examples/simple_triton_test/examples/simple_triton_test and the import target may not be found. This should either resolve relative pythonpath entries against the repo/config-file directory before passing them to the subprocess, or the example should use . for this case.

  1. The HIP helper header is not included in the wheel/package data.

Magpie/bench/include/magpie_bench.hpp is added and referenced by the docs/examples, but pyproject.toml package-data currently only includes YAML/JSON files:

[tool.setuptools.package-data]
Magpie = ["*.yaml", "*.yaml.example", "*.json"]

A built wheel does not contain magpie_bench.hpp, so installed-package users will not have the documented include path. Please add the header path to package data, e.g. bench/include/*.hpp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants