Summary
When debugging a kernel produced by ir.compile, I sometimes want to skip pypto's Python front-end entirely and re-run the existing build_output/<jit_dir>/ contents (kernel cpp/so + orchestration cpp/so + kernel_config.py) directly through simpler's SceneTestCase / scene_test machinery — for example to bisect a runtime stall against a different simpler revision, or to iterate on the orchestration cpp by hand-editing it.
Today this requires a substantial rewrite of the auto-generated kernel_config.py. The generated file only contains KERNELS (list), ORCHESTRATION (dict), and RUNTIME_CONFIG — none of which simpler can consume directly. To run via simpler I had to:
1. Add from simpler.task_interface import ArgDirection as D and re-derive every kernel's signature (the D.IN / D.OUT / D.INOUT list) by hand-reading the orchestration cpp (add_input / add_output / add_inout calls per params_t<i>).
2. Re-derive the orchestration's external-tensor signature from the from_tensor_arg(orch_args.tensor(N)) lines and which of those are written via add_output somewhere.
3. Convert the flat KERNELS + ORCHESTRATION dicts into simpler's nested CALLABLE = {"orchestration": {..., "signature": [...]}, "incores": [{..., "signature": [...]}, ...]} shape.
4. Wrap the result in a @scene_test(level=2, runtime=...) SceneTestCase subclass, write CASES, and implement generate_args + compute_golden (importing build_tensor_specs / golden helpers from the original pypto-lib source module via a sys.path hack into the repo root, since build_output/<jit_dir>/ isn't on the import path).
5. Add if __name__ == "__main__": SceneTestCase.run_module(__name__).
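To illustrate how mechanical the first step is, here is a rough sketch that scans orchestration cpp text for add_input / add_output / add_inout calls and groups them per params_t<i>. It assumes the calls are written as member calls such as params_t0.add_input(...); that call shape, the sample cpp text, the function name derive_kernel_signatures, and the local ArgDirection stand-in (used so the snippet runs without simpler installed) are all hypothetical and would need adjusting to what pypto actually emits.

```python
import re
from enum import Enum


class ArgDirection(Enum):
    """Local stand-in for simpler.task_interface.ArgDirection (assumption)."""
    IN = "in"
    OUT = "out"
    INOUT = "inout"


# Assumed call shape: `params_t<i>.add_input(...)`. Adjust to the real emitted cpp.
_ADD_CALL = re.compile(r"\b(params_t\d+)\s*\.\s*add_(input|output|inout)\s*\(")
_DIRECTION = {
    "input": ArgDirection.IN,
    "output": ArgDirection.OUT,
    "inout": ArgDirection.INOUT,
}


def derive_kernel_signatures(cpp_text: str) -> dict:
    """Collect add_* calls per params_t<i>, preserving their order in the source."""
    signatures = {}
    for match in _ADD_CALL.finditer(cpp_text):
        params_name, kind = match.groups()
        signatures.setdefault(params_name, []).append(_DIRECTION[kind])
    return signatures


# Hypothetical excerpt, not copied from a real build_output.
SAMPLE_CPP = """
params_t0.add_input(x);
params_t0.add_output(q);
params_t1.add_inout(acc);
"""

if __name__ == "__main__":
    for name, sig in derive_kernel_signatures(SAMPLE_CPP).items():
        print(name, [d.name for d in sig])
```

Scraping like this is brittle, which is part of the argument for emitting the signature at codegen time, where the directions are already known.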
Steps 1–3 are pure mechanical translation of information pypto already has at compile time. The only step that genuinely needs user input is the test inputs / golden — i.e. step 4's generate_args + compute_golden.
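As a minimal sketch of the flat-to-nested conversion in step 3: the helper below attaches a signature to each KERNELS entry and to ORCHESTRATION, then nests them into the CALLABLE shape described above. The name build_callable, the idea of keying kernel signatures by func_id, the placeholder dict contents, and the local ArgDirection stand-in are assumptions made for illustration, not pypto or simpler API.

```python
from enum import Enum


class ArgDirection(Enum):
    """Local stand-in for simpler.task_interface.ArgDirection (assumption)."""
    IN = "in"
    OUT = "out"
    INOUT = "inout"


D = ArgDirection


def build_callable(kernels, orchestration, kernel_signatures, orch_signature):
    """Nest flat KERNELS + ORCHESTRATION into the CALLABLE shape.

    kernel_signatures maps func_id -> list of ArgDirection (hypothetical input).
    The input dicts are copied, not mutated.
    """
    incores = []
    for kernel in kernels:
        func_id = kernel["func_id"]
        if func_id not in kernel_signatures:
            raise KeyError(f"no signature for func_id={func_id}")
        incores.append({**kernel, "signature": list(kernel_signatures[func_id])})
    return {
        "orchestration": {**orchestration, "signature": list(orch_signature)},
        "incores": incores,
    }


# Placeholder values; only keys named in this issue are used.
KERNELS = [{"func_id": 0, "name": "x_local_q"}]
ORCHESTRATION = {"function_name": "aicpu_orchestration_entry"}

CALLABLE = build_callable(
    KERNELS,
    ORCHESTRATION,
    kernel_signatures={0: [D.IN, D.OUT, D.OUT]},
    orch_signature=[D.IN, D.OUT],
)
```

Everything this helper needs except the signatures is already in the generated file, so the signatures are the only missing piece.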
Motivation / Use Case
Concrete debugging scenario that triggered this:
I have a build_output/_jit_moe_expert_test_<ts>/ from models/deepseek/v4/deepseek_v4_decode_moe_expert.py. The dispatch ran into a runtime stall (10 AIVs busy, 0 AICs, scheduler in_flight=0, Thread N: PTO2 timeout after 800001 idle iterations). To narrow it down I want to:
- Re-run the same build_output against a different simpler / pto-isa pin without re-running pypto's IR compile (which is slow and would invalidate the artifact under test).
- Edit orchestration/moe_expert_test.cpp by hand and re-run, since the suspected root cause is a missing ringbuffer notify that originates from codegen; re-emitting via pl.jit would just regenerate the buggy version.
- Run via pytest kernel_config.py -p a2a3 so I get the full SceneTestCase infra (per-case dispatch, swimlane, PMU, --dump-tensor, etc.).
None of this requires any IR-level work. It just requires that the generated artifact be re-runnable. Today the boilerplate above blocks that.
Proposed API / Behavior
Two reasonable directions; I'd be happy with either, but my preference is Option A because it puts the metadata where it semantically belongs.
Option A — Enrich kernel_config.py codegen so simpler can consume it directly.
At compile_program time, pypto already knows each task's add_input / add_output / add_inout direction (it's emitting the orchestration cpp from the IR). Emit that as a signature field on each KERNELS entry and on ORCHESTRATION, and additionally emit a sibling CALLABLE dict in the simpler-expected shape:
# kernel_config.py (additions in **bold**)
ORCHESTRATION = {
    "source": ...,
    "function_name": "aicpu_orchestration_entry",
    **"signature": [D.IN, D.IN, ..., D.OUT, D.OUT],** # 18 entries
}
KERNELS = [
    {"func_id": 0, "name": "x_local_q", ..., **"signature": [D.IN, D.OUT, D.OUT]**},
    ...
]
**CALLABLE = {"orchestration": ORCHESTRATION, "incores": KERNELS}** # alias for simpler
With that, the user-side wrapper collapses to a thin SceneTestCase whose CALLABLE = ... line just imports from kernel_config. Steps 1–3 above disappear; only the test-input / golden step remains, which is genuinely user-supplied.
(Optionally, the codegen could also drop a kernel_config_scene_test.py template alongside that contains the @scene_test class skeleton with TODO markers for generate_args / compute_golden. Not required, but nice.)
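Since build_output/<jit_dir>/ isn't on the import path, one way a thin wrapper could pick up CALLABLE under Option A is to load kernel_config.py by file path with the standard library's importlib. The sketch below assumes the generated file defines CALLABLE with a signature on every entry, as proposed; load_callable is a hypothetical helper name, not an existing pypto or simpler function.

```python
import importlib.util
from pathlib import Path


def load_callable(jit_dir):
    """Load kernel_config.py from jit_dir by path and return its CALLABLE.

    Assumes the file defines CALLABLE as proposed in Option A.
    """
    config_path = Path(jit_dir) / "kernel_config.py"
    spec = importlib.util.spec_from_file_location("kernel_config", config_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {config_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    callable_ = getattr(module, "CALLABLE", None)
    if callable_ is None:
        raise AttributeError(f"{config_path} does not define CALLABLE")

    # Fail early if any entry lacks the signature the wrapper depends on.
    entries = [callable_["orchestration"], *callable_["incores"]]
    missing = [
        entry.get("name", entry.get("function_name", "?"))
        for entry in entries
        if "signature" not in entry
    ]
    if missing:
        raise ValueError(f"entries without a signature: {missing}")
    return callable_
```

A wrapper next to the artifact could then set CALLABLE = load_callable(Path(__file__).parent) and only add the test inputs and golden on top.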
Option B — Add a pypto.runtime.execute_compiled debug entrypoint that takes simpler-style input.
Add a replay_compiled(work_dir, case_name=..., platform=..., device_id=..., simpler_kwargs=...) (name TBD) that:
- Loads kernel_config.py.
- Skips KernelCompiler / _patch_orchestration_headers / pto-isa cloning when the user passes --skip-compile (i.e. trust the existing .so files under kernels/<core>/ and orchestration/).
- Builds the simpler ChipCallable directly from those .so files + signatures (signatures still need to be present, so this depends on the same codegen change as Option A, just hidden behind a different surface).
- Lets the user pass torch tensors / scalars for the orch args, the same way execute_compiled does today, but routes through simpler's runtime test harness so swimlane / PMU / --dump-tensor / -c pinning all work.
Option B is strictly more work because it duplicates simpler's SceneTestCase.run_module CLI; that's why I prefer Option A.
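For the --skip-compile path in Option B, the entrypoint would need to confirm that the prebuilt artifacts are actually there before trusting them. A minimal sketch, assuming the .so files sit directly under kernels/<core>/ and orchestration/ inside work_dir; find_prebuilt_artifacts is a hypothetical helper name.

```python
from pathlib import Path


def find_prebuilt_artifacts(work_dir):
    """Return existing .so paths, raising if either group is missing."""
    root = Path(work_dir)
    # Assumed layout: kernels/<core>/*.so and orchestration/*.so.
    kernel_sos = sorted(root.glob("kernels/*/*.so"))
    orch_sos = sorted(root.glob("orchestration/*.so"))
    if not kernel_sos:
        raise FileNotFoundError(f"no kernel .so under {root / 'kernels'}")
    if not orch_sos:
        raise FileNotFoundError(f"no orchestration .so under {root / 'orchestration'}")
    return {"kernels": kernel_sos, "orchestration": orch_sos}
```

Raising here rather than silently recompiling matters for the debugging use case: a fallback to recompilation would replace the artifact under test.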
Alternatives Considered
- Hand-write the SceneTestCase once and check it in. That's what I did this session. It works but duplicates information already in the orchestration cpp, drifts as soon as the kernel signature changes, and has to be redone for every new build_output.
- Re-run via python <kernel>.py -p ... (the normal flow). Doesn't solve the use case: that re-runs pypto's IR compile, which regenerates the artifact under test and so cannot reproduce a bug specific to a particular build_output/<jit_dir>/.
- Manually invoke pypto.runtime.execute_compiled(work_dir, args, ...) from a small driver script. Works for one-shot reproduction but loses the SceneTestCase infra (multi-case dispatch, swimlane, PMU, -c pin, dump-tensor, parallel scheduler).
Additional Context
- For reference, the wrapper I had to write by hand for the moe_expert build_output is ~150 lines. About 100 of those lines (signatures + CALLABLE shape) are pure mechanical translation that pypto could emit; the remaining ~50 (CASES + generate_args + compute_golden) genuinely need user input.
- Git Commit ID: 4b69a87696a3528dfc50d4cfbeccbe98574378e1
- Host Platform: Linux (aarch64)
- NPU Kind: N/A (not hardware-specific — this is a codegen / API ergonomics issue)