PyPTO Serving is a small local inference stack for running Qwen3-14B generation with PyPTO kernels on Ascend NPUs. It includes a reusable Python runtime, Qwen3-14B example kernels, CLI entry points, and tests for batching and config handling.
python/
  cli/                 pypto-serving CLI implementation
  core/                engine, scheduler, KV cache, model loading
examples/
  pypto-serving        executable CLI wrapper
  model/qwen3_14b/
    cpu_generate.py    CPU reference generation example
    npu_generate.py    NPU generation/profiling example
    npu_serving.json   sample interactive serving config
    runner/            Qwen3 executors and runner glue
    src/               PyPTO kernel/program builders
tests/                 CLI and batching tests

Run the unit tests:
python -m pytest tests/test_cli.py tests/test_batching.py

Show CLI help:
./examples/pypto-serving --help
python -m python.cli --help

One-shot generation, non-L3 path:
task-submit --device auto --max-time 0 --run \
"PTO2_RING_HEAP=536870912 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072 \
python examples/model/qwen3_14b/npu_generate.py \
--model-dir /data/linyifan/models/Qwen3-14B \
--prompt 'Huawei is' \
--platform a2a3 \
--max-seq-len 512 \
--max-new-tokens 5"

One-shot generation, L3 path:
task-submit --device auto --max-time 0 --run \
"PTO2_RING_HEAP=536870912 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072 \
python examples/model/qwen3_14b/npu_generate.py \
--model-dir /data/linyifan/models/Qwen3-14B \
--prompt 'Huawei is' \
--platform a2a3 \
--max-seq-len 512 \
--max-new-tokens 5 \
--l3"

Interactive generation:
task-submit --run -i \
"./examples/pypto-serving \
--config examples/model/qwen3_14b/npu_serving.json \
--device 0 \
--interactive"

At the [user] prompt, enter a prompt such as 'Huawei is'; use /exit or /quit to leave the interactive session.
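
The interactive behavior described above (read a prompt, generate, stop on /exit or /quit) can be illustrated with a minimal Python sketch. This is not the project's implementation: run_session and echo_generate are hypothetical names, and echo_generate is a stand-in for whatever the serving engine actually calls to produce text.

```python
from typing import Callable, Iterable, List

# Commands that end the session, as listed in the text above.
EXIT_COMMANDS = {"/exit", "/quit"}


def run_session(lines: Iterable[str], generate: Callable[[str], str]) -> List[str]:
    """Feed each input line to `generate` until an exit command is seen.

    `generate` is a hypothetical placeholder for the real generation call.
    """
    replies = []
    for raw in lines:
        prompt = raw.strip()
        if prompt in EXIT_COMMANDS:
            break  # leave the session without generating
        if not prompt:
            continue  # skip empty input (an assumption of this sketch)
        replies.append(generate(prompt))
    return replies


def echo_generate(prompt: str) -> str:
    """Stand-in generator so the sketch runs without a model or NPU."""
    return f"[echo] {prompt}"


replies = run_session(["Huawei is", "", "/quit", "ignored"], echo_generate)
print(replies)  # ['[echo] Huawei is']
```

Passing the input lines as an iterable keeps the loop testable without a terminal; a real session would read from standard input instead.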
- The sample config points at /data/linyifan/models/Qwen3-14B; edit examples/model/qwen3_14b/npu_serving.json or pass another config if your model path differs.
- ./examples/pypto-serving --device <id> overrides npu.device_id from the JSON config for both one-shot and interactive serving.
- Generated kernel artifacts are written under build_output/ and are ignored by git.
- This repository expects PyPTO, CANN, torch, safetensors, transformers, and the local Ascend runtime environment to be available in the active Python environment.
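
The --device override noted above can be pictured as a small merge step over the loaded JSON config. The following Python sketch only illustrates that idea and is not taken from the CLI source: apply_device_override is a hypothetical helper, and the sample config contains only the npu.device_id key named in the notes, so the real npu_serving.json will have more fields.

```python
import copy
import json
from typing import Optional


def apply_device_override(config: dict, device: Optional[int]) -> dict:
    """Return a copy of `config` with npu.device_id replaced when `device` is set.

    Hypothetical helper; the actual CLI may merge its arguments differently.
    """
    merged = copy.deepcopy(config)
    if device is not None:  # device 0 is a valid id, so compare against None
        merged.setdefault("npu", {})["device_id"] = device
    return merged


# Illustrative config; only npu.device_id comes from the notes above.
sample = json.loads('{"npu": {"device_id": 3}}')

overridden = apply_device_override(sample, 0)
unchanged = apply_device_override(sample, None)

print(overridden["npu"]["device_id"])  # 0
print(unchanged["npu"]["device_id"])   # 3
```

Working on a deep copy leaves the loaded config untouched, so the same file can back several runs on different devices.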