This repository is a minimal LLM inference framework built around a simple generate interface and PyPTO-backed model execution. The current implementation is aimed at local, single-request inference for Qwen3-14B using bundled single-layer kernels under model/, with the model weights loaded from a user-provided local directory.
The current pipeline is:
- Tokenize the prompt.
- Lookup token embeddings.
- Run model prefill.
- Run the decode loop.
- Sample the next token.
- Detokenize generated tokens.
- Return text and generated token IDs.
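The steps above can be sketched as one function. This is a toy illustration, not the repository's code: the five-entry vocabulary, `toy_forward`, and `sample_greedy` are hypothetical stand-ins, and embedding lookup is folded into `toy_forward`. The "model" simply favours the token ID after the last one it saw.

```python
from typing import List, Tuple

VOCAB = ["<eos>", "a", "b", "c", "d"]  # toy vocabulary; ID 0 ends generation


def tokenize(prompt: str) -> List[int]:
    return [VOCAB.index(ch) for ch in prompt]


def detokenize(token_ids: List[int]) -> str:
    return "".join(VOCAB[i] for i in token_ids)


def toy_forward(token_ids: List[int]) -> List[float]:
    """Hypothetical stand-in for embedding lookup plus model execution."""
    target = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == target else 0.0 for i in range(len(VOCAB))]


def sample_greedy(logits: List[float]) -> int:
    return max(range(len(logits)), key=lambda i: logits[i])


def generate(prompt: str, max_new_tokens: int) -> Tuple[str, List[int]]:
    prompt_ids = tokenize(prompt)
    logits = toy_forward(prompt_ids)  # prefill: the whole prompt at once
    generated: List[int] = []
    for _ in range(max_new_tokens):  # decode loop: one token per step
        next_id = sample_greedy(logits)
        if next_id == 0:  # stop on end-of-sequence
            break
        generated.append(next_id)
        logits = toy_forward(prompt_ids + generated)
    return detokenize(generated), generated
```

Calling `generate("ab", 5)` walks through every stage and stops early when the toy model emits the end-of-sequence ID.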
The framework is designed so the frontend API stays simple while the backend can be swapped between a reference executor and a device executor.
- llm/core/: Core runtime code.
- llm/core/engine.py: High-level entry point. Exposes model initialization and generation.
- llm/core/model_loader.py: General model loading abstraction. The current concrete adapter loads a local Hugging Face-style directory, but the loader interface is not restricted to Hugging Face.
- llm/core/types.py: Shared config and runtime data structures.
- llm/core/kv_cache.py: Paged KV cache manager.
- llm/core/executor.py: Reference PyTorch executor.
- llm/core/pypto_executor.py: PyPTO-backed executor that composes the full model from the provided single-layer kernels.
- model/: Kernel implementations. Prefill was extended to accept paged-KV inputs.
- examples/: Example entry points. qwen3_14b_local_generate.py is the main smoke-test example.
- design/: Notes on the implementation direction and architecture.
The main external interface is LLMEngine:
engine.init_model(...)
text = engine.generate(model_id, prompt)
result = engine.generate_result(model_id, prompt)
generate_result() is the better interface for debugging because it returns:
- generated text
- generated token IDs
- finish reason
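One plausible shape for such a result is a small dataclass. The class name `GenerationResult`, its field names, and the `"length"` finish reason below are assumptions for illustration; the actual type lives in the repository and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenerationResult:  # hypothetical name and fields
    text: str
    token_ids: List[int]
    finish_reason: str


def summarize(result: GenerationResult) -> str:
    """Format the three debugging fields on one line."""
    return (
        f"{len(result.token_ids)} token(s), "
        f"finish_reason={result.finish_reason}: {result.text!r}"
    )
```

Having token IDs next to the text makes it possible to tell a tokenizer problem from a model problem when the output looks wrong.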
The loader is intentionally abstracted so that the repo can later support:
- local Hugging Face snapshots
- custom weight layouts
- converted offline formats
- future model registries
At the moment, the implemented path expects a local directory with config, tokenizer files, and weight shards for Qwen3-14B.
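A quick pre-flight check on the model directory can catch a bad path before any weights are read. The sketch below assumes the usual Hugging Face naming (`config.json`, files starting with `tokenizer`, `*.safetensors` shards); the helper name `missing_model_files` is hypothetical and the repository's loader may check different files.

```python
from pathlib import Path
from typing import List


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected file patterns that are absent from model_dir."""
    root = Path(model_dir)
    missing: List[str] = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    # Assumed patterns for tokenizer files and weight shards.
    for pattern in ("tokenizer*", "*.safetensors"):
        if not any(root.glob(pattern)):
            missing.append(pattern)
    return missing
```

An empty return value means all three kinds of file were found; it says nothing about whether their contents are valid.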
Two distinct config layers are used:
- ModelConfig: Static model shape and tokenizer-related metadata.
- RuntimeConfig: Execution-time controls such as page size, max sequence length, KV dtype, and device placement.
This separation is important because the user-provided model directory defines the model shape, while the runtime environment defines how the model is executed.
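To illustrate the split, here is a minimal sketch of the two layers. The class names come from this README, but every field name is an assumption and the values used in the usage note are toy numbers, not Qwen3-14B's real shape. Quantities such as KV memory per token need both layers, which is why they are kept apart rather than merged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:  # fixed by the model directory (field names are assumed)
    num_layers: int
    num_kv_heads: int
    head_dim: int
    vocab_size: int


@dataclass(frozen=True)
class RuntimeConfig:  # chosen per deployment (field names are assumed)
    page_size: int
    max_seq_len: int
    kv_dtype_bytes: int
    device: str


def kv_bytes_per_token(model: ModelConfig, runtime: RuntimeConfig) -> int:
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * model.num_layers * model.num_kv_heads * model.head_dim * runtime.kv_dtype_bytes


def max_pages_per_request(runtime: RuntimeConfig) -> int:
    # Ceiling division: a partially filled page still occupies a whole page.
    return -(-runtime.max_seq_len // runtime.page_size)
```

With a toy `ModelConfig(2, 4, 8, 100)` and `RuntimeConfig(16, 100, 2, "cpu")`, each token needs 256 bytes of KV storage and a request can span at most 7 pages.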
The bundled kernels only implement one transformer layer, so the framework builds the full model by:
- loading all layer weights from the model directory
- iterating through every layer in prefill
- iterating through every layer again during decode
- maintaining KV state across the full request
The current PyPTO executor validates that the loaded model matches the Qwen3-14B layer shape expected by the bundled kernels.
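The composition described above can be sketched with a toy kernel. Nothing here is the PyPTO API: `toy_layer_kernel`, `check_layer_shapes`, and `run_all_layers` are hypothetical, and the arithmetic is a placeholder chosen only so the data flow is easy to follow. The point is that one kernel is reused for every layer, each call receiving that layer's weights and that layer's KV state.

```python
from typing import List


def toy_layer_kernel(hidden: List[float], weight: List[float],
                     layer_kv: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for one single-layer kernel call."""
    layer_kv.append(list(hidden))  # this layer's KV state grows by one entry
    return [h * w + len(layer_kv) for h, w in zip(hidden, weight)]


def check_layer_shapes(layer_weights: List[List[float]], hidden_size: int) -> None:
    """Fail early if any loaded layer does not match the kernel's expected shape."""
    for idx, weight in enumerate(layer_weights):
        if len(weight) != hidden_size:
            raise ValueError(f"layer {idx}: expected size {hidden_size}, got {len(weight)}")


def run_all_layers(hidden: List[float], layer_weights: List[List[float]],
                   kv_state: List[List[List[float]]]) -> List[float]:
    """Apply the same kernel once per layer, threading the hidden state through."""
    for layer_idx, weight in enumerate(layer_weights):
        hidden = toy_layer_kernel(hidden, weight, kv_state[layer_idx])
    return hidden


layer_weights = [[2.0, 2.0], [3.0, 1.0]]
check_layer_shapes(layer_weights, hidden_size=2)
kv_state: List[List[List[float]]] = [[] for _ in layer_weights]
prefill_out = run_all_layers([1.0, 2.0], layer_weights, kv_state)  # first pass
decode_out = run_all_layers([1.0, 2.0], layer_weights, kv_state)   # reuses KV state
```

The second pass produces a different result from the first with the same input because each layer sees the KV entries left behind by the earlier pass.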
KV cache is managed centrally in KvCacheManager.
Responsibilities:
- allocate pages for a request
- extend capacity during decode
- map logical token positions to physical KV slots
- materialize cache views for device kernels
- free pages when the request finishes
The prefill kernel was updated to accept a paged block_table contract so prefill and decode now use the same paging model.
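For readers unfamiliar with paged KV caches, the sketch below shows the bookkeeping in miniature. It is not `KvCacheManager`: the class `ToyPagedKvCache` and its method names are hypothetical, and it tracks only page indices, not tensors. A request's block table lists the physical pages it owns, and a logical token position maps to `page * page_size + offset`.

```python
from typing import Dict, List


class ToyPagedKvCache:
    """Hypothetical page bookkeeping; stores indices only, no KV data."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size
        self.free_pages: List[int] = list(range(num_pages))
        self.block_tables: Dict[str, List[int]] = {}

    def reserve(self, request_id: str, num_tokens: int) -> List[int]:
        """Allocate pages, or grow an existing table, to cover num_tokens."""
        table = self.block_tables.get(request_id, [])
        needed = -(-num_tokens // self.page_size)  # ceiling division
        missing = max(needed - len(table), 0)
        if missing > len(self.free_pages):
            raise MemoryError("out of KV pages")
        for _ in range(missing):
            table.append(self.free_pages.pop(0))
        self.block_tables[request_id] = table
        return table

    def slot(self, request_id: str, position: int) -> int:
        """Map a logical token position to a physical KV slot."""
        page = self.block_tables[request_id][position // self.page_size]
        return page * self.page_size + position % self.page_size

    def free(self, request_id: str) -> None:
        """Return all of a finished request's pages to the free list."""
        self.free_pages.extend(self.block_tables.pop(request_id))
```

Because prefill and decode both go through the same block table, a page allocated during prefill can be extended during decode without copying earlier entries.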
python -m compileall llm examples
The user requested that device execution use task-submit. The example below follows that contract:
task-submit --device auto --run "python examples/qwen3_14b_local_generate.py --model-dir /data/linyifan/models/Qwen3-14B --prompt 'Hello' --platform a2a3 --pypto-root /data/liuxu/pypto --max-new-tokens 1"
The example prints:
- text
- token_ids
- finish_reason
What is working:
- model loading from a local directory
- tokenizer integration
- full prefill and decode control flow
- paged KV cache management
- PyPTO kernel compilation and execution
- NPU end-to-end smoke execution through task-submit
What is not finished:
- numerical correctness of the hardware path
The current hardware run can complete end-to-end, but the logits are not yet numerically stable. The sampler currently sanitizes non-finite logits to prevent crashes during smoke tests. That makes the control flow testable, but the generated tokens should not yet be treated as correct model output.
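A sanitizing fallback of this kind might look like the sketch below. The replacement policy (mapping NaN and both infinities to a large negative floor) and the function names are assumptions, not the repository's sampler. Returning the number of replaced values keeps the fallback visible, so a smoke test can pass while still reporting that the logits were bad.

```python
import math
from typing import List, Tuple


def sanitize_logits(logits: List[float], floor: float = -1e9) -> Tuple[List[float], int]:
    """Replace NaN and +/-inf with a floor value; also report how many were replaced."""
    cleaned = [x if math.isfinite(x) else floor for x in logits]
    num_replaced = sum(1 for x in logits if not math.isfinite(x))
    return cleaned, num_replaced


def greedy_sample(logits: List[float]) -> Tuple[int, int]:
    """Pick the highest finite logit; the second value counts sanitized entries."""
    cleaned, num_replaced = sanitize_logits(logits)
    token_id = max(range(len(cleaned)), key=cleaned.__getitem__)
    return token_id, num_replaced
```

If every logit is non-finite, this sketch returns token ID 0, which shows why sampled tokens carry no meaning while the fallback is active.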
- Add per-layer validation between ModelExecutor and PyptoQwen14BExecutor.
- Compare prefill outputs layer by layer and identify the first divergence.
- Do the same for decode with a fixed prompt and fixed token input.
- Verify weight layout, transpose direction, and RMSNorm handling against the kernel contract.
- Once logits are finite and aligned, remove the sampler fallback from the critical path.
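The layer-by-layer comparison could start from a helper like the one below. It is a sketch under assumptions: per-layer outputs are represented as flat lists of floats, the tolerance `atol=1e-3` is arbitrary, and the function names are hypothetical. Any NaN or infinite difference counts as a divergence.

```python
import math
from typing import Optional, Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise difference; non-finite differences count as infinite."""
    if len(a) != len(b):
        raise ValueError("outputs have different lengths")
    diffs = (abs(x - y) for x, y in zip(a, b))
    return max((d if math.isfinite(d) else math.inf for d in diffs), default=0.0)


def first_divergent_layer(reference: Sequence[Sequence[float]],
                          candidate: Sequence[Sequence[float]],
                          atol: float = 1e-3) -> Optional[int]:
    """Index of the first layer whose outputs differ by more than atol, else None."""
    for layer_idx, (ref, cand) in enumerate(zip(reference, candidate)):
        if max_abs_diff(ref, cand) > atol:
            return layer_idx
    return None
```

Reporting only the first divergent layer is deliberate: once one layer is wrong, every later layer inherits the error, so later mismatches add no information.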
- Keep new code under llm/.
- Keep kernel code under model/.
- Put runnable demos in examples/.
- Use task-submit --device auto --run "python ..." for NPU execution.
- Prefer generate_result() over generate() when debugging token-level behavior.