95 changes: 95 additions & 0 deletions Cargo.lock

1 change: 1 addition & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
@@ -119,6 +119,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| `subsystems/kernels/pegainfer-kernels-boundary.md` | Architecture decision: pegainfer should use reusable frontend/runtime/data-plane layers plus per-model engines; kernels become first-class assets through a ledger, simulator, and request tracing. |
| `subsystems/kernels/kernel-op-reports.md` | Qwen3 kernel/report tooling is feature-gated: `qwen3_kernel_report` covers per-op kernel reports, and `qwen3_model_report` emits runtime-traced eager-DAG decode operator rollups with TensorSpec `KernelCall`s, latency stats, tables, and Graphviz DOT; measured FA2 `CTA_TILE_Q=64` prefill default in place. |
| `subsystems/kernels/typed-forward-pipeline.md` | Reusable typed tensor pipeline macro in `pegainfer-kernels` so model crates can express common `typed_ops` chains without model-specific wrapper macros. |
| `subsystems/kernels/tvm-ffi-mvp.md` | Required TVM FFI dependency in `pegainfer-kernels` plus a packed TVM wrapper for the Qwen3.5 GDR solve Triton AOT CUBIN launcher. |

## playbooks

71 changes: 71 additions & 0 deletions docs/subsystems/kernels/tvm-ffi-mvp.md
@@ -0,0 +1,71 @@
# TVM FFI Triton CUBIN Wrapper

> **TL;DR:** `pegainfer-kernels` now takes `tvm-ffi` as a required dependency, exposes the Qwen3.5 GDR solve Triton AOT CUBIN launcher through packed TVM FFI, and has unit coverage for the wrapper registry plus packed-ABI diagnostics.
>
> **Last touched:** 2026-06

## Preparation

- **Read**:
  - `docs/index.md` - routed this task to the kernels subsystem.
  - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - confirmed DSL/kernel integration belongs at the kernels boundary rather than in model runtimes.
  - `docs/subsystems/kernels/kernel-op-reports.md` - confirmed Triton/CuTe tooling is already feature-scoped in kernel infrastructure.
  - `pegainfer-kernels/tools/triton/README.md` - described the current Triton AOT CUBIN generation and validation path.
  - `pegainfer-kernels/build.rs` - showed generated Triton AOT C stubs and wrapper symbols.
  - `pegainfer-kernels/src/ffi/qwen35.rs` and `pegainfer-kernels/src/ffi/shared.rs` - showed the existing C ABI launch symbols used by Rust model code.
  - Local `tvm-ffi` crate source - confirmed typed callbacks cover at most 8 arguments, so Triton launchers need packed TVM FFI wrappers.
- **Relevant history**:
  - GitHub issue `#191` proposed TVM FFI as the DSL interface direction.
  - Draft PR `#202` kept TVM FFI optional/test-only; the revised request makes it a required `pegainfer-kernels` dependency and focuses it on Triton CUBIN launch wrappers.
- **Plan**:
  1. Add `tvm-ffi` as a required dependency of `pegainfer-kernels`.
  2. Add a `triton_cubin` module that exposes a current Qwen3.5 Triton AOT CUBIN launcher as a packed TVM FFI function.
  3. Keep existing C ABI and Rust call sites available; the TVM FFI layer is an additional DSL boundary, not a production scheduler/model migration.
  4. Add a small example that registers the wrapper and prints the function contract.
  5. Validate formatting and the strongest local build/test checks available.
- **Risks / open questions**:
  - Making `tvm-ffi` required means `tvm-ffi-config` and `libtvm_ffi` become build prerequisites for `pegainfer-kernels`.
  - The current wrapper accepts raw device pointers and stream handles as TVM integers or opaque pointers; a future DLPack/tensor-handle wrapper can sit on top once the DSL artifact contract is stable.

## Execution Log

### Step 1: Required dependency and wrapper surface
- Added required `tvm-ffi = "0.1.0-alpha.0"` to `pegainfer-kernels`.
- Added `pegainfer_kernels::triton_cubin`, which exposes metadata plus a packed TVM FFI callback for the generated Qwen3.5 GDR solve Triton AOT launcher.
- Kept existing CUDA C ABI symbols and model call sites unchanged.

### Step 2: Small example
- Added `pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs` to register the TVM FFI global function and print the launch contract.
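
To make the "launch contract" idea concrete, the following is a minimal standalone sketch of a static wrapper table with name lookup. It is an illustration only, not the code in `pegainfer_kernels::triton_cubin`: the field names `name`, `ffi_symbol`, and `arg_names` are the ones the example reads, while the struct name `CubinFunctionSpec`, the helpers `find_spec` and `describe`, and every string value are hypothetical placeholders. It does not touch the `tvm-ffi` crate at all.

```rust
/// Hypothetical stand-in for the wrapper metadata. Only the field names
/// (`name`, `ffi_symbol`, `arg_names`) are taken from the example; the
/// struct name and everything else here is an assumption.
#[derive(Debug, PartialEq)]
pub struct CubinFunctionSpec {
    pub name: &'static str,
    pub ffi_symbol: &'static str,
    pub arg_names: &'static [&'static str],
}

/// Placeholder entry: the strings below are invented for illustration and
/// are not the real Qwen3.5 GDR solve contract.
pub const GDR_CHUNK_SOLVE: CubinFunctionSpec = CubinFunctionSpec {
    name: "example.gdr_chunk_solve",
    ffi_symbol: "example_gdr_chunk_solve_launch",
    arg_names: &["input_ptr", "output_ptr", "num_rows", "stream"],
};

/// Static table of every wrapper the module would expose.
pub const FUNCTIONS: &[CubinFunctionSpec] = &[GDR_CHUNK_SOLVE];

/// Known/unknown wrapper lookup by registered name.
pub fn find_spec(name: &str) -> Option<&'static CubinFunctionSpec> {
    FUNCTIONS.iter().find(|spec| spec.name == name)
}

/// Render the contract as one human-readable line.
pub fn describe(spec: &CubinFunctionSpec) -> String {
    format!(
        "{} -> {} ({})",
        spec.name,
        spec.ffi_symbol,
        spec.arg_names.join(", ")
    )
}
```

Keeping the contract in a `const` table means the argument order can be printed, looked up, and checked without linking against CUDA or TVM FFI.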

### Step 3: Unit test coverage
- Added wrapper unit tests for:
  - known/unknown wrapper lookup;
  - global TVM FFI registry round-trip;
  - accepted raw handle encodings (`u64` and opaque pointer);
  - missing-argument diagnostics before CUDA launch;
  - handle and scalar type diagnostics before CUDA launch.
- Kept tests on pre-launch validation paths so they do not require valid device memory or actually launch the Triton CUBIN.
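
The pre-launch checks listed above can be sketched without CUDA or TVM FFI. The block below is a hedged illustration, not the wrapper's actual code: `PackedArg`, `ArgError`, `LaunchArgs`, and `decode_launch_args` are hypothetical names, the four-argument layout is invented, and rejecting negative integers as handles is an assumption. It shows handles accepted either as an integer or as an opaque pointer, with missing or mistyped arguments reported before anything could be launched.

```rust
use std::ffi::c_void;

/// Hypothetical packed-argument value. This models the idea of a dynamically
/// typed argument list; it is NOT the `tvm-ffi` crate's own type.
#[derive(Debug, Clone, Copy)]
pub enum PackedArg {
    Int(i64),
    Float(f64),
    OpaquePtr(*mut c_void),
}

/// Diagnostics produced before any launch is attempted.
#[derive(Debug, PartialEq)]
pub enum ArgError {
    Missing { index: usize, name: &'static str },
    BadHandle { index: usize, name: &'static str },
    BadScalar { index: usize, name: &'static str },
}

/// Decoded arguments for an invented four-argument launcher.
#[derive(Debug, PartialEq)]
pub struct LaunchArgs {
    pub input: u64,
    pub output: u64,
    pub num_rows: i32,
    pub stream: u64,
}

/// Accept a raw handle as a non-negative integer or as an opaque pointer.
/// Rejecting negative integers is an assumption made for this sketch.
fn handle_arg(args: &[PackedArg], index: usize, name: &'static str) -> Result<u64, ArgError> {
    match args.get(index) {
        None => Err(ArgError::Missing { index, name }),
        Some(PackedArg::Int(v)) if *v >= 0 => Ok(*v as u64),
        Some(PackedArg::OpaquePtr(p)) => Ok(*p as usize as u64),
        Some(_) => Err(ArgError::BadHandle { index, name }),
    }
}

/// Accept a scalar only if it is an integer that fits in `i32`.
fn i32_arg(args: &[PackedArg], index: usize, name: &'static str) -> Result<i32, ArgError> {
    match args.get(index) {
        None => Err(ArgError::Missing { index, name }),
        Some(PackedArg::Int(v)) => {
            i32::try_from(*v).map_err(|_| ArgError::BadScalar { index, name })
        }
        Some(_) => Err(ArgError::BadScalar { index, name }),
    }
}

/// Validate every argument in order; the first failure is returned and no
/// launch would happen.
pub fn decode_launch_args(args: &[PackedArg]) -> Result<LaunchArgs, ArgError> {
    Ok(LaunchArgs {
        input: handle_arg(args, 0, "input_ptr")?,
        output: handle_arg(args, 1, "output_ptr")?,
        num_rows: i32_arg(args, 2, "num_rows")?,
        stream: handle_arg(args, 3, "stream")?,
    })
}
```

Because decoding is a pure function over the argument slice, tests of this shape can exercise every diagnostic path without valid device memory, which is the property the unit tests in this step rely on.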

### Step 4: Validation
- `cargo fmt --all --check` passed.
- `cargo check --release -p pegainfer-kernels` failed before codegen because `tvm-ffi-config` was not on `PATH`; this is expected now that `tvm-ffi` is required.
- Retried with the discovered local TVM FFI install on `PATH`:
  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo check --release -p pegainfer-kernels`
  - `tvm-ffi` built successfully.
  - The build then failed in the existing CUDA build at `pegainfer-kernels/csrc/shared/flashinfer_top1.cu` because the dirty `pegainfer-kernels/third_party/flashinfer` submodule commit changes the `TopKDispatch` API. This is unrelated to the TVM FFI wrapper and was left untouched.
- After adding tests:
  - `cargo fmt --all --check` passed.
  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo test --release -p pegainfer-kernels triton_cubin --lib` built `tvm-ffi` successfully, then hit the same existing `flashinfer_top1.cu` `TopKDispatch` API mismatch before Rust unit tests could run.

## Debrief

- **Outcome**: Added required TVM FFI dependency wiring plus a real Triton CUBIN wrapper MVP for the Qwen3.5 GDR solve launcher, with unit tests covering wrapper discovery, registry registration, packed handle conversion, and pre-launch diagnostics.
- **Pitfalls encountered**:
  - `apply_patch` and normal shell commands were blocked by a sandbox namespace failure, so edits were applied with scoped `git apply` patches.
  - TVM FFI is now a real build prerequisite; hosts need `tvm-ffi-config` on `PATH`.
  - Local full kernel-crate validation is currently blocked by the pre-existing dirty FlashInfer submodule, not by the TVM FFI code.
- **Lessons learned**:
  - TVM FFI typed callbacks currently cover at most 8 arguments, while Triton/CUDA launchers can exceed that, so the wrapper should use packed TVM FFI callbacks for launch surfaces.
- **Follow-ups**:
  - Add packed TVM FFI wrappers for the remaining generated Triton AOT launchers once the FlashInfer submodule is back at the expected API or the CUDA call site is updated.
  - Consider a higher-level DLPack/tensor-handle wrapper above the raw pointer/stream packed ABI once the DSL artifact contract is stable.
1 change: 1 addition & 0 deletions pegainfer-kernels/Cargo.toml
@@ -8,6 +8,7 @@ anyhow = { workspace = true }
cudarc = { workspace = true }
half = { workspace = true }
serde = { workspace = true }
tvm-ffi = "0.1.0-alpha.0"

[build-dependencies]
cc = { workspace = true }
20 changes: 20 additions & 0 deletions pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs
@@ -0,0 +1,20 @@
use pegainfer_kernels::triton_cubin::{self, QWEN35_GDR_CHUNK_SOLVE};

fn main() -> tvm_ffi::Result<()> {
    triton_cubin::register_global_functions()?;

    println!("registered Triton CUBIN TVM FFI functions:");
    for spec in triton_cubin::TRITON_CUBIN_FUNCTIONS {
        println!("  {} -> {}", spec.name, spec.ffi_symbol);
    }

    let solve = triton_cubin::get_global_or_register(QWEN35_GDR_CHUNK_SOLVE.name)?;
    println!(
        "{} is ready; call it with packed args: {}",
        QWEN35_GDR_CHUNK_SOLVE.name,
        QWEN35_GDR_CHUNK_SOLVE.arg_names.join(", ")
    );

    drop(solve);
    Ok(())
}
1 change: 1 addition & 0 deletions pegainfer-kernels/src/lib.rs
@@ -7,4 +7,5 @@ pub mod gpu_buffers;
pub mod ops;
pub mod paged_kv;
pub mod tensor;
pub mod triton_cubin;
pub mod typed_ops;