Feat/zero overhead latency#22
Conversation
|
I found a few issues worth addressing before merge (excluding the current merge conflicts):
This is risky for compare/autotuning, because the result looks like a successful kernel-only ranking metric. I think this should either fail the latency result when the requested primary metric is unavailable, or report
For backward compatibility, it would be safer to default this to disabled, or auto-skip latency unless the user explicitly configures a latency target/method.
Can we remove the binary from the PR and rely on
For example, working_dir: "examples/simple_triton_test"
latency:
pythonpath:
- "examples/simple_triton_test"But the latency runner subprocess uses
[tool.setuptools.package-data]
Magpie = ["*.yaml", "*.yaml.example", "*.json"]A built wheel does not contain |
Latencystage: wall-clock via CUDA/HIP-graph subprocess + kernel-only viarocprofv3(HIP reuses existingpmc_perf.csvfor free).magpie.bench(Python) +magpie_bench.hpp(C++) helpers; subprocess harness isolates Python/JIT/dispatch overhead, seeded for reproducibility.wall_median_ms,kernel_median_ms,dispatch_overhead_us,crosscheck_vs_rocprof_ratio.latency: {enabled: false});