[Feature] Make build_output runnable through simpler without manual SceneTestCase wrapping #1327

### Summary

When debugging a kernel produced by `ir.compile`, I sometimes want to skip pypto's Python front-end entirely and re-run the existing `build_output/<jit_dir>/` contents (kernel cpp/so + orchestration cpp/so + `kernel_config.py`) directly through simpler's `SceneTestCase` / `scene_test` machinery — for example to bisect a runtime stall against a different simpler revision, or to iterate on the orchestration cpp by hand-editing it.

Today this requires a substantial rewrite of the auto-generated `kernel_config.py`. The generated file only contains `KERNELS` (list), `ORCHESTRATION` (dict), and `RUNTIME_CONFIG` — none of which simpler can consume directly. To run via simpler I had to:

1. Add `from simpler.task_interface import ArgDirection as D` and re-derive every kernel's `signature` (the `D.IN` / `D.OUT` / `D.INOUT` list) by hand-reading the orchestration cpp (`add_input` / `add_output` / `add_inout` calls per `params_t<i>`).
2. Re-derive the orchestration's external-tensor signature from the `from_tensor_arg(orch_args.tensor(N))` lines, treating as outputs the tensors that are passed to `add_output` somewhere.
3. Convert the flat `KERNELS` + `ORCHESTRATION` dicts into simpler's nested `CALLABLE = {"orchestration": {..., "signature": [...]}, "incores": [{..., "signature": [...]}, ...]}` shape.
4. Wrap the result in a `@scene_test(level=2, runtime=...)` `SceneTestCase` subclass, write `CASES`, and implement `generate_args` + `compute_golden` (importing `build_tensor_specs` / golden helpers from the original pypto-lib source module via a `sys.path` hack into the repo root, since `build_output/<jit_dir>/` isn't on the import path).
5. Add `if __name__ == "__main__": SceneTestCase.run_module(__name__)`.

Steps 1–3 are pure mechanical translation of information pypto already has at compile time. The only step that genuinely needs user input is the test inputs / golden — i.e. step 4's `generate_args` + `compute_golden`.
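
To illustrate how mechanical step 1 is, here is a minimal sketch that recovers per-task argument directions from orchestration source text. It assumes the generated calls look like `params_t0.add_input(...)`, which is a guess at the emitted layout. The regex, the helper name `derive_signatures`, and the plain-string directions (standing in for `D.IN` / `D.OUT` / `D.INOUT`) are illustrative only and not part of pypto or simpler.

```python
import re
from collections import defaultdict

# Assumed call shape: `params_t<i>.add_input(...)`, `.add_output(...)`, `.add_inout(...)`.
# The real orchestration cpp may differ; adjust the pattern if so.
_CALL = re.compile(r"\bparams_t(\d+)\s*\.\s*add_(input|output|inout)\s*\(")
_DIRECTION = {"input": "IN", "output": "OUT", "inout": "INOUT"}


def derive_signatures(orch_cpp):
    """Map each params_t<i> index to its ordered list of argument directions."""
    signatures = defaultdict(list)
    for match in _CALL.finditer(orch_cpp):
        index = int(match.group(1))
        signatures[index].append(_DIRECTION[match.group(2)])
    return dict(signatures)


if __name__ == "__main__":
    sample = """
    params_t0.add_input(a);
    params_t0.add_output(b);
    params_t1.add_inout(c);
    """
    print(derive_signatures(sample))
```

The sketch does not attempt to map a `params_t<i>` index back to a kernel `func_id`; that association lives in the orchestration cpp and in pypto's IR, which is another reason the compiler is the better place to emit it.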

### Motivation / Use Case

Concrete debugging scenario that triggered this:

I have a `build_output/_jit_moe_expert_test_<ts>/` from `models/deepseek/v4/deepseek_v4_decode_moe_expert.py`. The dispatch ran into a runtime stall (10 AIVs busy, 0 AICs, scheduler `in_flight=0`, `Thread N: PTO2 timeout after 800001 idle iterations`). To narrow it down I want to:

- Re-run the same build_output against a different simpler / pto-isa pin without re-running pypto's IR compile (which is slow and would invalidate the artifact under test).
- Edit `orchestration/moe_expert_test.cpp` by hand and re-run, since the suspected root cause is a missing ringbuffer notify that originates from codegen — re-emitting via `pl.jit` would just regenerate the buggy version.
- Run via `pytest kernel_config.py -p a2a3` so I get the full `SceneTestCase` infra (per-case dispatch, swimlane, PMU, `--dump-tensor`, etc.).

None of this requires any IR-level work. It just requires that the generated artifact be re-runnable. Today the boilerplate above blocks that.

### Proposed API / Behavior

Two reasonable directions; I'd be happy with either, but my preference is **Option A** because it puts the metadata where it semantically belongs.

**Option A — Enrich `kernel_config.py` codegen so simpler can consume it directly.**

At `compile_program` time, pypto already knows each task's `add_input` / `add_output` / `add_inout` direction (it's emitting the orchestration cpp from the IR). Emit that as a `signature` field on each `KERNELS` entry and on `ORCHESTRATION`, and additionally emit a sibling `CALLABLE` dict in the simpler-expected shape:

```python
# kernel_config.py (additions marked "# new")
from simpler.task_interface import ArgDirection as D  # new

ORCHESTRATION = {
    "source": ...,
    "function_name": "aicpu_orchestration_entry",
    "signature": [D.IN, D.IN, ..., D.OUT, D.OUT],  # new; 18 entries
}
KERNELS = [
    # "signature" is new on every entry; other fields are elided with "...".
    {"func_id": 0, "name": "x_local_q", ..., "signature": [D.IN, D.OUT, D.OUT]},
    ...
]
CALLABLE = {"orchestration": ORCHESTRATION, "incores": KERNELS}  # new; alias for simpler
```

With that, the user-side wrapper collapses to a thin `SceneTestCase` whose `CALLABLE = ...` line just imports from `kernel_config`. Steps 1–3 above disappear; only the test-input / golden step remains, which is genuinely user-supplied.
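
As a rough sketch of what the thin wrapper could do under Option A, the snippet below loads `kernel_config.py` by file path (avoiding any `sys.path` changes) and returns `CALLABLE`. The helper names `load_kernel_config` and `callable_from_config` are hypothetical, and the fallback branch assumes the `signature` fields proposed above have been emitted; only standard-library `importlib` calls are used.

```python
import importlib.util
from pathlib import Path


def load_kernel_config(jit_dir):
    """Import <jit_dir>/kernel_config.py by path, without touching sys.path."""
    path = Path(jit_dir) / "kernel_config.py"
    spec = importlib.util.spec_from_file_location("kernel_config", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def callable_from_config(module):
    """Return CALLABLE if codegen emitted it, else assemble it from the flat dicts."""
    existing = getattr(module, "CALLABLE", None)
    if existing is not None:
        return existing
    # Without signatures the nested shape cannot be built mechanically.
    missing = [k.get("name") for k in module.KERNELS if "signature" not in k]
    if "signature" not in module.ORCHESTRATION:
        missing.append("<orchestration>")
    if missing:
        raise ValueError(f"no signature for: {missing}")
    return {"orchestration": module.ORCHESTRATION, "incores": module.KERNELS}
```

A `SceneTestCase` subclass could then set `CALLABLE = callable_from_config(load_kernel_config(<jit_dir>))`, leaving only the test inputs and golden to write by hand.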

(Optionally, the codegen could also drop a `kernel_config_scene_test.py` template alongside that contains the `@scene_test` class skeleton with TODO markers for `generate_args` / `compute_golden`. Not required, but nice.)

**Option B — Add a `pypto.runtime` debug entrypoint, alongside `execute_compiled`, that takes simpler-style input.**

Add a `replay_compiled(work_dir, case_name=..., platform=..., device_id=..., simpler_kwargs=...)` (name TBD) that:

- Loads `kernel_config.py`.
- Skips `KernelCompiler` / `_patch_orchestration_headers` / pto-isa cloning when the user passes `--skip-compile` (i.e. trust the existing `.so` files under `kernels/<core>/` and `orchestration/`).
- Builds the simpler `ChipCallable` directly from those `.so`'s + signatures (signatures still need to be present, so this depends on the same codegen change as Option A, just hidden behind a different surface).
- Lets the user pass torch tensors / scalars for the orch args, the same way `execute_compiled` does today, but routes through simpler's runtime test harness so swimlane / PMU / `--dump-tensor` / `-c` pinning all work.

Option B is strictly more work because it duplicates simpler's `SceneTestCase.run_module` CLI; that's why I prefer Option A.
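
For the "trust the existing `.so` files" step, a minimal sketch of the discovery logic might look like the following. It assumes only the layout described above (`kernels/<core>/` and `orchestration/` under the work dir); the function name `find_prebuilt_artifacts` is hypothetical and nothing here calls into pypto or simpler.

```python
from pathlib import Path


def find_prebuilt_artifacts(work_dir):
    """Collect existing .so files so a replay can skip recompilation."""
    root = Path(work_dir)
    kernels = sorted(root.glob("kernels/*/*.so"))
    orchestration = sorted(root.glob("orchestration/*.so"))
    if not kernels:
        raise FileNotFoundError(f"no kernel .so files under {root / 'kernels'}")
    if not orchestration:
        raise FileNotFoundError(f"no orchestration .so under {root / 'orchestration'}")
    return {"kernels": kernels, "orchestration": orchestration}
```

Failing early when either set is empty keeps a skip-compile replay from silently falling back to a rebuild, which would replace the artifact under test.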

### Alternatives Considered

- **Hand-write the `SceneTestCase` once and check it in.** That's what I did this session. It works but duplicates information already in the orchestration cpp, drifts as soon as the kernel signature changes, and has to be redone for every new build_output.
- **Re-run via `python <kernel>.py -p ...` (the normal flow).** Doesn't solve the use case — that re-runs pypto's IR compile, which regenerates the artifact under test and so cannot reproduce a bug specific to a particular `build_output/<jit_dir>/`.
- **Manually invoke `pypto.runtime.execute_compiled(work_dir, args, ...)` from a small driver script.** Works for one-shot reproduction but loses the `SceneTestCase` infra (multi-case dispatch, swimlane, PMU, `-c` pin, dump-tensor, parallel scheduler).

### Additional Context

- For reference, the wrapper I had to write by hand for the moe_expert build_output is ~150 lines. About 100 of those lines (signatures + CALLABLE shape) are pure mechanical translation that pypto could emit; the remaining ~50 (CASES + generate_args + compute_golden) genuinely need user input.
- Git Commit ID: `4b69a87696a3528dfc50d4cfbeccbe98574378e1`
- Host Platform: Linux (aarch64)
- NPU Kind: N/A (not hardware-specific — this is a codegen / API ergonomics issue)
