openinfer-project · Ma1oneZhang · Jun 9, 2026
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/docs/index.md b/docs/index.md
@@ -119,6 +119,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
 | `subsystems/kernels/pegainfer-kernels-boundary.md` | Architecture decision: pegainfer should use reusable frontend/runtime/data-plane layers plus per-model engines; kernels become first-class assets through a ledger, simulator, and request tracing. |
 | `subsystems/kernels/kernel-op-reports.md` | Qwen3 kernel/report tooling is feature-gated: `qwen3_kernel_report` covers per-op kernel reports, and `qwen3_model_report` emits runtime-traced eager-DAG decode operator rollups with TensorSpec `KernelCall`s, latency stats, tables, and Graphviz DOT; measured FA2 `CTA_TILE_Q=64` prefill default in place. |
 | `subsystems/kernels/typed-forward-pipeline.md` | Reusable typed tensor pipeline macro in `pegainfer-kernels` so model crates can express common `typed_ops` chains without model-specific wrapper macros. |
+| `subsystems/kernels/tvm-ffi-mvp.md` | Required TVM FFI dependency in `pegainfer-kernels` plus a packed TVM FFI wrapper for the Qwen3.5 GDR solve Triton AOT CUBIN launcher. |
 
 ## playbooks
 

diff --git a/docs/subsystems/kernels/tvm-ffi-mvp.md b/docs/subsystems/kernels/tvm-ffi-mvp.md
@@ -0,0 +1,71 @@
+# TVM FFI Triton CUBIN Wrapper
+
+> **TL;DR:** `pegainfer-kernels` now takes `tvm-ffi` as a required dependency, exposes the Qwen3.5 GDR solve Triton AOT CUBIN launcher through packed TVM FFI, and has unit coverage for the wrapper registry plus packed-ABI diagnostics.
+>
+> **Last touched:** 2026-06
+
+## Preparation
+
+- **Read**:
+  - `docs/index.md` - routed this task to the kernels subsystem.
+  - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - confirmed DSL/kernel integration belongs at the kernels boundary rather than in model runtimes.
+  - `docs/subsystems/kernels/kernel-op-reports.md` - confirmed Triton/CuTe tooling is already feature-scoped in kernel infrastructure.
+  - `pegainfer-kernels/tools/triton/README.md` - described the current Triton AOT CUBIN generation and validation path.
+  - `pegainfer-kernels/build.rs` - showed generated Triton AOT C stubs and wrapper symbols.
+  - `pegainfer-kernels/src/ffi/qwen35.rs` and `pegainfer-kernels/src/ffi/shared.rs` - showed the existing C ABI launch symbols used by Rust model code.
+  - Local `tvm-ffi` crate source - confirmed typed callbacks accept at most 8 arguments, so Triton launchers with longer argument lists need packed TVM FFI wrappers.
+- **Relevant history**:
+  - GitHub issue `#191` proposed TVM FFI as the DSL interface direction.
+  - Draft PR `#202` kept TVM FFI optional/test-only; the revised request makes it a required `pegainfer-kernels` dependency and focuses it on Triton CUBIN launch wrappers.
+- **Plan**:
+  1. Add `tvm-ffi` as a required dependency of `pegainfer-kernels`.
+  2. Add a `triton_cubin` module that exposes a current Qwen3.5 Triton AOT CUBIN launcher as a packed TVM FFI function.
+  3. Keep existing C ABI and Rust call sites available; the TVM FFI layer is an additional DSL boundary, not a production scheduler/model migration.
+  4. Add a small example that registers the wrapper and prints the function contract.
+  5. Validate formatting and the strongest local build/test checks available.
+- **Risks / open questions**:
+  - Required `tvm-ffi` means `tvm-ffi-config` and `libtvm_ffi` become build prerequisites for `pegainfer-kernels`.
+  - The current wrapper accepts raw device-pointer and stream handles as TVM integers or opaque pointers; a future DLPack/tensor-handle wrapper can sit on top once the DSL artifact contract is stable.
+
+## Execution Log
+
+### Step 1: Required dependency and wrapper surface
+- Added required `tvm-ffi = "0.1.0-alpha.0"` to `pegainfer-kernels`.
+- Added `pegainfer_kernels::triton_cubin`, which exposes metadata plus a packed TVM FFI callback for the generated Qwen3.5 GDR solve Triton AOT launcher.
+- Kept existing CUDA C ABI symbols and model call sites unchanged.
+
+### Step 2: Small example
+- Added `pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs` to register the TVM FFI global function and print the launch contract.
+
+### Step 3: Unit test coverage
+- Added wrapper unit tests for:
+  - known/unknown wrapper lookup;
+  - global TVM FFI registry round-trip;
+  - accepted raw handle encodings (`u64` and opaque pointer);
+  - missing-argument diagnostics before CUDA launch;
+  - handle and scalar type diagnostics before CUDA launch.
+- Kept tests on pre-launch validation paths so they neither require valid device memory nor launch the Triton CUBIN.
+
+### Step 4: Validation
+- `cargo fmt --all --check` passed.
+- `cargo check --release -p pegainfer-kernels` failed before codegen because `tvm-ffi-config` was not on `PATH`; this is expected now that `tvm-ffi` is required.
+- Retried with the discovered local TVM FFI install on `PATH`:
+  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo check --release -p pegainfer-kernels`
+  - `tvm-ffi` built successfully.
+  - The build then failed in the existing CUDA build at `pegainfer-kernels/csrc/shared/flashinfer_top1.cu` because the dirty `pegainfer-kernels/third_party/flashinfer` submodule commit changes the `TopKDispatch` API. This is unrelated to the TVM FFI wrapper and was left untouched.
+- After adding tests:
+  - `cargo fmt --all --check` passed.
+  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo test --release -p pegainfer-kernels triton_cubin --lib` built `tvm-ffi` successfully, then hit the same existing `flashinfer_top1.cu` `TopKDispatch` API mismatch before Rust unit tests could run.
+
+## Debrief
+
+- **Outcome**: Added required TVM FFI dependency wiring plus a real Triton CUBIN wrapper MVP for the Qwen3.5 GDR solve launcher, with unit tests covering wrapper discovery, registry registration, packed handle conversion, and pre-launch diagnostics.
+- **Pitfalls encountered**:
+  - `apply_patch` and normal shell commands were blocked by a sandbox namespace failure, so edits were applied with scoped `git apply` patches.
+  - TVM FFI is now a real build prerequisite; hosts need `tvm-ffi-config` on `PATH`.
+  - Local full kernel-crate validation is currently blocked by the pre-existing dirty FlashInfer submodule, not by the TVM FFI code.
+- **Lessons learned**:
+  - TVM FFI typed callbacks currently accept at most 8 arguments, while Triton/CUDA launchers can take more, so launch surfaces should be wrapped with packed TVM FFI callbacks.
+- **Follow-ups**:
+  - Add packed TVM FFI wrappers for the remaining generated Triton AOT launchers once the FlashInfer submodule is back at the expected API or the CUDA call site is updated.
+  - Consider a higher-level DLPack/tensor-handle wrapper above the raw pointer/stream packed ABI once the DSL artifact contract is stable.
diff --git a/pegainfer-kernels/Cargo.toml b/pegainfer-kernels/Cargo.toml
@@ -8,6 +8,7 @@ anyhow = { workspace = true }
 cudarc = { workspace = true }
 half = { workspace = true }
 serde = { workspace = true }
+tvm-ffi = "0.1.0-alpha.0"
 
 [build-dependencies]
 cc = { workspace = true }

diff --git a/pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs b/pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs
@@ -0,0 +1,20 @@
+use pegainfer_kernels::triton_cubin::{self, QWEN35_GDR_CHUNK_SOLVE};
+
+fn main() -> tvm_ffi::Result<()> {
+    triton_cubin::register_global_functions()?;
+
+    println!("registered Triton CUBIN TVM FFI functions:");
+    for spec in triton_cubin::TRITON_CUBIN_FUNCTIONS {
+        println!("  {} -> {}", spec.name, spec.ffi_symbol);
+    }
+
+    let solve = triton_cubin::get_global_or_register(QWEN35_GDR_CHUNK_SOLVE.name)?;
+    println!(
+        "{} is ready; call it with packed args: {}",
+        QWEN35_GDR_CHUNK_SOLVE.name,
+        QWEN35_GDR_CHUNK_SOLVE.arg_names.join(", ")
+    );
+
+    drop(solve); // The handle only demonstrates lookup; this example launches nothing.
+    Ok(())
+}
diff --git a/pegainfer-kernels/src/lib.rs b/pegainfer-kernels/src/lib.rs
@@ -7,4 +7,5 @@ pub mod gpu_buffers;
 pub mod ops;
 pub mod paged_kv;
 pub mod tensor;
+pub mod triton_cubin;
 pub mod typed_ops;
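
Note on the pre-launch validation described in the worklog (accepted integer and opaque-pointer handle encodings, missing-argument diagnostics, handle and scalar type diagnostics): the diff shown here does not include `triton_cubin.rs`, so the following is only a minimal sketch of how such packed-argument decoding could look. It deliberately avoids the `tvm-ffi` crate; `PackedArg`, `ArgError`, `LaunchArgs`, `decode_launch_args`, and the argument names are all hypothetical stand-ins, not the wrapper's actual types or launch contract.

```rust
use std::ffi::c_void;
use std::fmt;

/// Hypothetical stand-in for one packed FFI argument (not a `tvm-ffi` type).
#[derive(Debug, Clone, Copy)]
enum PackedArg {
    Int(i64),
    Float(f64),
    OpaquePtr(*mut c_void),
}

/// Diagnostics produced before any CUDA launch is attempted.
#[derive(Debug, PartialEq)]
enum ArgError {
    Missing { index: usize, name: &'static str },
    WrongType { index: usize, name: &'static str, expected: &'static str },
}

impl fmt::Display for ArgError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ArgError::Missing { index, name } => {
                write!(f, "missing argument {index} (`{name}`)")
            }
            ArgError::WrongType { index, name, expected } => {
                write!(f, "argument {index} (`{name}`) must be {expected}")
            }
        }
    }
}

/// Decode a raw device-pointer or stream handle from either accepted encoding.
fn handle_arg(args: &[PackedArg], index: usize, name: &'static str) -> Result<u64, ArgError> {
    match args.get(index) {
        None => Err(ArgError::Missing { index, name }),
        // Integer encoding: reinterpret the 64-bit payload as an address.
        Some(PackedArg::Int(v)) => Ok(*v as u64),
        // Opaque-pointer encoding: take the pointer's address value.
        Some(PackedArg::OpaquePtr(p)) => Ok(*p as usize as u64),
        Some(_) => Err(ArgError::WrongType {
            index,
            name,
            expected: "an integer or opaque pointer handle",
        }),
    }
}

/// Decode a scalar launch parameter that must fit in `i32`.
fn i32_arg(args: &[PackedArg], index: usize, name: &'static str) -> Result<i32, ArgError> {
    match args.get(index) {
        None => Err(ArgError::Missing { index, name }),
        Some(PackedArg::Int(v)) => i32::try_from(*v).map_err(|_| ArgError::WrongType {
            index,
            name,
            expected: "an integer that fits in i32",
        }),
        Some(_) => Err(ArgError::WrongType { index, name, expected: "an integer" }),
    }
}

/// Hypothetical launch contract: two device pointers, one scalar, one stream.
#[derive(Debug, PartialEq)]
struct LaunchArgs {
    input: u64,
    output: u64,
    num_chunks: i32,
    stream: u64,
}

/// Validate every packed argument up front; a real wrapper would launch only on `Ok`.
fn decode_launch_args(args: &[PackedArg]) -> Result<LaunchArgs, ArgError> {
    Ok(LaunchArgs {
        input: handle_arg(args, 0, "input")?,
        output: handle_arg(args, 1, "output")?,
        num_chunks: i32_arg(args, 2, "num_chunks")?,
        stream: handle_arg(args, 3, "stream")?,
    })
}
```

Because every argument is decoded before anything touches the device, tests in this style can pass arbitrary address values and still exercise the diagnostics, which matches the worklog's stated reason for keeping tests on pre-launch paths.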