[Feature] Make build_output runnable through simpler without manual SceneTestCase wrapping #1327

@zhangqi-chen

Summary

When debugging a kernel produced by ir.compile, I sometimes want to skip pypto's Python front-end entirely and re-run the existing build_output/<jit_dir>/ contents (kernel cpp/so + orchestration cpp/so + kernel_config.py) directly through simpler's SceneTestCase / scene_test machinery — for example to bisect a runtime stall against a different simpler revision, or to iterate on the orchestration cpp by hand-editing it.

Today this requires a substantial rewrite of the auto-generated kernel_config.py. The generated file only contains KERNELS (list), ORCHESTRATION (dict), and RUNTIME_CONFIG — none of which simpler can consume directly. To run via simpler I had to:

  1. Add from simpler.task_interface import ArgDirection as D and re-derive every kernel's signature (the D.IN / D.OUT / D.INOUT list) by hand-reading the orchestration cpp (add_input / add_output / add_inout calls per params_t<i>).
  2. Re-derive the orchestration's external-tensor signature from the from_tensor_arg(orch_args.tensor(N)) lines, marking as outputs those tensors that are written via add_output somewhere.
  3. Convert the flat KERNELS + ORCHESTRATION dicts into simpler's nested CALLABLE = {"orchestration": {..., "signature": [...]}, "incores": [{..., "signature": [...]}, ...]} shape.
  4. Wrap the result in a @scene_test(level=2, runtime=...) SceneTestCase subclass, write CASES, and implement generate_args + compute_golden (importing build_tensor_specs / golden helpers from the original pypto-lib source module via a sys.path hack into the repo root, since build_output/<jit_dir>/ isn't on the import path).
  5. Add if __name__ == "__main__": SceneTestCase.run_module(__name__).

Steps 1–3 are pure mechanical translation of information pypto already has at compile time. The only step that genuinely needs user input is the test inputs / golden — i.e. step 4's generate_args + compute_golden.
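Step 3 in particular is mechanical enough to sketch. The snippet below is a minimal illustration, not simpler's or pypto's actual code: build_callable is a hypothetical helper, D is a local stand-in for simpler.task_interface.ArgDirection (so the snippet runs without simpler installed), and the signatures are still supplied by hand, as they are today.

```python
from enum import Enum


class D(Enum):
    """Local stand-in for simpler.task_interface.ArgDirection (assumption)."""
    IN = "in"
    OUT = "out"
    INOUT = "inout"


def build_callable(orchestration, kernels, orch_signature, kernel_signatures):
    """Nest flat ORCHESTRATION + KERNELS into the CALLABLE shape.

    kernel_signatures maps func_id -> list of directions. The inputs are
    copied, not mutated, so the original dicts stay as codegen wrote them.
    """
    missing = [k["func_id"] for k in kernels if k["func_id"] not in kernel_signatures]
    if missing:
        raise KeyError(f"no signature for func_id(s): {missing}")
    return {
        "orchestration": {**orchestration, "signature": list(orch_signature)},
        "incores": [
            {**k, "signature": list(kernel_signatures[k["func_id"]])}
            for k in kernels
        ],
    }


# Trimmed-down inputs; real entries carry more fields than shown here.
ORCHESTRATION = {"function_name": "aicpu_orchestration_entry"}
KERNELS = [{"func_id": 0, "name": "x_local_q"}]

CALLABLE = build_callable(
    ORCHESTRATION,
    KERNELS,
    orch_signature=[D.IN, D.OUT],  # shortened placeholder, not the real list
    kernel_signatures={0: [D.IN, D.OUT, D.OUT]},
)
```

The hand-written part that remains is exactly the two signature arguments, which is the information the proposal below asks codegen to emit.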

Motivation / Use Case

Concrete debugging scenario that triggered this:

I have a build_output/_jit_moe_expert_test_<ts>/ from models/deepseek/v4/deepseek_v4_decode_moe_expert.py. The dispatch ran into a runtime stall (10 AIVs busy, 0 AICs, scheduler in_flight=0, Thread N: PTO2 timeout after 800001 idle iterations). To narrow it down I want to:

  • Re-run the same build_output against a different simpler / pto-isa pin without re-running pypto's IR compile (which is slow and would invalidate the artifact under test).
  • Edit orchestration/moe_expert_test.cpp by hand and re-run, since the suspected root cause is a missing ringbuffer notify that originates from codegen — re-emitting via pl.jit would just regenerate the buggy version.
  • Run via pytest kernel_config.py -p a2a3 so I get the full SceneTestCase infra (per-case dispatch, swimlane, PMU, --dump-tensor, etc.).

None of this requires any IR-level work. It just requires that the generated artifact be re-runnable. Today the boilerplate above blocks that.

Proposed API / Behavior

Two reasonable directions; I'd be happy with either, but my preference is Option A because it puts the metadata where it semantically belongs.

Option A — Enrich kernel_config.py codegen so simpler can consume it directly.

At compile_program time, pypto already knows each task's add_input / add_output / add_inout direction (it's emitting the orchestration cpp from the IR). Emit that as a signature field on each KERNELS entry and on ORCHESTRATION, and additionally emit a sibling CALLABLE dict in the simpler-expected shape:

# kernel_config.py (additions marked "# new")
from simpler.task_interface import ArgDirection as D  # new
ORCHESTRATION = {
    "source": ...,
    "function_name": "aicpu_orchestration_entry",
    "signature": [D.IN, D.IN, ..., D.OUT, D.OUT],  # new; 18 entries
}
KERNELS = [
    {"func_id": 0, "name": "x_local_q", ..., "signature": [D.IN, D.OUT, D.OUT]},  # "signature" is new
    ...
]
CALLABLE = {"orchestration": ORCHESTRATION, "incores": KERNELS}  # new; alias for simpler

With that, the user-side wrapper collapses to a thin SceneTestCase whose CALLABLE = ... line just imports from kernel_config. Steps 1–3 above disappear; only the test-input / golden step remains, which is genuinely user-supplied.
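On the codegen side of Option A, the emission itself could be as small as the sketch below. This is only an illustration: render_signature_field is a hypothetical helper, and it assumes codegen can hand over each argument's direction as one of the strings "IN", "OUT", "INOUT". How pypto represents directions internally is not something this issue specifies.

```python
VALID_DIRECTIONS = ("IN", "OUT", "INOUT")


def render_signature_field(directions):
    """Render the '"signature": [...]' fragment for one kernel_config.py entry.

    `directions` is a list of direction names in argument order
    (assumed representation, see the note above).
    """
    unknown = [d for d in directions if d not in VALID_DIRECTIONS]
    if unknown:
        raise ValueError(f"unknown direction(s): {unknown}")
    return '"signature": [' + ", ".join(f"D.{d}" for d in directions) + "]"


fragment = render_signature_field(["IN", "OUT", "OUT"])
print(fragment)  # "signature": [D.IN, D.OUT, D.OUT]
```

The emitted text only resolves if kernel_config.py also imports ArgDirection as D, so that import would need to be emitted once at the top of the file.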

(Optionally, the codegen could also drop a kernel_config_scene_test.py template alongside that contains the @scene_test class skeleton with TODO markers for generate_args / compute_golden. Not required, but nice.)

Option B — Add a pypto.runtime.execute_compiled debug entrypoint that takes simpler-style input.

Add a function, e.g. replay_compiled(work_dir, case_name=..., platform=..., device_id=..., simpler_kwargs=...) (name TBD), alongside the existing execute_compiled, that:

  • Loads kernel_config.py.
  • Skips KernelCompiler / _patch_orchestration_headers / pto-isa cloning when the user passes --skip-compile (i.e. trust the existing .so files under kernels/<core>/ and orchestration/).
  • Builds the simpler ChipCallable directly from those .so's + signatures (signatures still need to be present, so this depends on the same codegen change as Option A, just hidden behind a different surface).
  • Lets the user pass torch tensors / scalars for the orch args, the same way execute_compiled does today, but routes through simpler's runtime test harness so swimlane / PMU / --dump-tensor / -c pinning all work.

Option B is strictly more work because it duplicates simpler's SceneTestCase.run_module CLI; that's why I prefer Option A.
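Either option starts by loading kernel_config.py from a directory that is not on the import path and checking that the signatures are actually there. A minimal sketch of that first step, using only the standard library; load_kernel_config and missing_signatures are hypothetical names, not existing pypto or simpler functions:

```python
import importlib.util
from pathlib import Path


def load_kernel_config(work_dir):
    """Import <work_dir>/kernel_config.py by file path, without a sys.path hack."""
    path = Path(work_dir) / "kernel_config.py"
    spec = importlib.util.spec_from_file_location("kernel_config", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file; its own imports must resolve
    return module


def missing_signatures(cfg):
    """Return labels for the entries that lack a "signature" field."""
    missing = []
    if "signature" not in cfg.ORCHESTRATION:
        missing.append("ORCHESTRATION")
    for kernel in cfg.KERNELS:
        if "signature" not in kernel:
            missing.append(f"KERNELS[func_id={kernel.get('func_id')}]")
    return missing
```

A replay entrypoint could call missing_signatures right after loading and fail early with a clear message when it is pointed at a build_output generated before the codegen change.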

Alternatives Considered

  • Hand-write the SceneTestCase once and check it in. That's what I did this session. It works but duplicates information already in the orchestration cpp, drifts as soon as the kernel signature changes, and has to be redone for every new build_output.
  • Re-run via python <kernel>.py -p ... (the normal flow). Doesn't solve the use case — that re-runs pypto's IR compile, which regenerates the artifact under test and so cannot reproduce a bug specific to a particular build_output/<jit_dir>/.
  • Manually invoke pypto.runtime.execute_compiled(work_dir, args, ...) from a small driver script. Works for one-shot reproduction but loses the SceneTestCase infra (multi-case dispatch, swimlane, PMU, -c pin, dump-tensor, parallel scheduler).

Additional Context

  • For reference, the wrapper I had to write by hand for the moe_expert build_output is ~150 lines. About 100 of those lines (signatures + CALLABLE shape) are pure mechanical translation that pypto could emit; the remaining ~50 (CASES + generate_args + compute_golden) genuinely need user input.
  • Git Commit ID: 4b69a87696a3528dfc50d4cfbeccbe98574378e1
  • Host Platform: Linux (aarch64)
  • NPU Kind: N/A (not hardware-specific — this is a codegen / API ergonomics issue)
