openinfer-project · Ma1oneZhang · Jun 9, 2026
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/docs/index.md b/docs/index.md
@@ -119,6 +119,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
 | `subsystems/kernels/pegainfer-kernels-boundary.md` | Architecture decision: pegainfer should use reusable frontend/runtime/data-plane layers plus per-model engines; kernels become first-class assets through a ledger, simulator, and request tracing. |
 | `subsystems/kernels/kernel-op-reports.md` | Qwen3 kernel/report tooling is feature-gated: `qwen3_kernel_report` covers per-op kernel reports, and `qwen3_model_report` emits runtime-traced eager-DAG decode operator rollups with TensorSpec `KernelCall`s, latency stats, tables, and Graphviz DOT; measured FA2 `CTA_TILE_Q=64` prefill default in place. |
 | `subsystems/kernels/typed-forward-pipeline.md` | Reusable typed tensor pipeline macro in `pegainfer-kernels` so model crates can express common `typed_ops` chains without model-specific wrapper macros. |
+| `subsystems/kernels/tvm-ffi-mvp.md` | Optional `tvm-ffi-triton-cubin` bridge in `pegainfer-kernels` plus a packed TVM wrapper for the Qwen3.5 GDR solve Triton AOT CUBIN launcher. |
 
 ## playbooks
 

diff --git a/docs/subsystems/kernels/tvm-ffi-mvp.md b/docs/subsystems/kernels/tvm-ffi-mvp.md
@@ -0,0 +1,85 @@
+# TVM FFI Triton CUBIN Wrapper
+
+> **TL;DR:** `pegainfer-kernels` now has an optional `tvm-ffi-triton-cubin` bridge for the Qwen3.5 GDR solve Triton AOT CUBIN launcher, with unit coverage for wrapper registration and packed-ABI diagnostics.
+>
+> **Last touched:** 2026-06
+
+## Preparation
+
+- **Read**:
+  - `docs/index.md` - routed this task to the kernels subsystem.
+  - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - confirmed DSL/kernel integration belongs at the kernels boundary rather than in model runtimes.
+  - `docs/subsystems/kernels/kernel-op-reports.md` - confirmed Triton/CuTe tooling is already feature-scoped in kernel infrastructure.
+  - `pegainfer-kernels/tools/triton/README.md` - described the current Triton AOT CUBIN generation and validation path.
+  - `pegainfer-kernels/build.rs` - showed generated Triton AOT C stubs and wrapper symbols.
+  - `pegainfer-kernels/src/ffi/qwen35.rs` and `pegainfer-kernels/src/ffi/shared.rs` - showed the existing C ABI launch symbols used by Rust model code.
+  - Local `tvm-ffi` crate source - confirmed typed callbacks only cover up to 8 arguments, so Triton launchers need packed TVM FFI wrappers.
+- **Relevant history**:
+  - GitHub issue `#191` proposed TVM FFI as the DSL interface direction.
+  - Draft PR `#202` kept TVM FFI optional/test-only; PR `#315` now keeps the bridge optional behind `tvm-ffi-triton-cubin` while focusing it on Triton CUBIN launch wrappers.
+- **Plan**:
+  1. Add `tvm-ffi` as an optional dependency of `pegainfer-kernels` behind `tvm-ffi-triton-cubin`.
+  2. Add a `triton_cubin` module that exposes an existing Qwen3.5 Triton AOT CUBIN launcher as a packed TVM FFI function.
+  3. Keep existing C ABI and Rust call sites available; the TVM FFI layer is an additional DSL boundary, not a production scheduler/model migration.
+  4. Add a small example that registers the wrapper and prints the function contract.
+  5. Validate formatting and the strongest local build/test checks available.
+- **Risks / open questions**:
+  - The `tvm-ffi-triton-cubin` feature means `tvm-ffi-config` and `libtvm_ffi` are build prerequisites only for the optional bridge path.
+  - The current wrapper accepts raw device pointer and stream handles as TVM integers or opaque pointers; a future DLPack/tensor-handle wrapper can sit on top once the DSL artifact contract is stable.
+
+## Execution Log
+
+### Step 1: Optional dependency and wrapper surface
+- Added optional `tvm-ffi = "0.1.0-alpha.0"` to `pegainfer-kernels` behind `tvm-ffi-triton-cubin`.
+- Added `pegainfer_kernels::triton_cubin`, which exposes metadata plus a packed TVM FFI callback for the generated Qwen3.5 GDR solve Triton AOT launcher.
+- Kept existing CUDA C ABI symbols and model call sites unchanged.
+
+### Step 2: Small example
+- Added `pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs` to register the TVM FFI global function and print the launch contract.
+
+### Step 3: Unit test coverage
+- Added wrapper unit tests for:
+  - known/unknown wrapper lookup;
+  - global TVM FFI registry round-trip;
+  - accepted raw handle encodings (`u64` and opaque pointer);
+  - missing-argument diagnostics before CUDA launch;
+  - handle and scalar type diagnostics before CUDA launch.
+- Kept tests on pre-launch validation paths so they do not require valid device memory or actually launch the Triton CUBIN.
+
+### Step 4: Validation
+- `cargo fmt --all --check` passed.
+- `cargo check --release -p pegainfer-kernels` no longer requires `tvm-ffi-config`; the TVM FFI bridge is feature-gated.
+- Retried with the discovered local TVM FFI install on `PATH`:
+  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo check --release -p pegainfer-kernels --features tvm-ffi-triton-cubin`
+  - `tvm-ffi` built successfully.
+  - The build then failed in the existing CUDA build at `pegainfer-kernels/csrc/shared/flashinfer_top1.cu` because the dirty `pegainfer-kernels/third_party/flashinfer` submodule commit changes the `TopKDispatch` API. This is unrelated to the TVM FFI wrapper and was left untouched.
+- After adding tests:
+  - `cargo fmt --all --check` passed.
+  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo test --release -p pegainfer-kernels --features tvm-ffi-triton-cubin triton_cubin --lib` builds the optional bridge, then currently hits the existing `flashinfer_top1.cu` `TopKDispatch` API mismatch before Rust unit tests can run in this dirty submodule checkout.
+
+### Step 5: Review fixes
+- Addressed xiaguan's requested changes on PR `#315`:
+  - made `tvm-ffi` optional behind `tvm-ffi-triton-cubin` so normal `pegainfer-kernels` builds do not require `tvm-ffi-config` / `libtvm_ffi`;
+  - replaced `expect_err(...)` in tests with explicit `Result` matching because `tvm_ffi::Any` does not implement `Debug`;
+  - updated the example and docs to require/pass the feature.
+- Also addressed automated inline feedback by accepting TVM FFI packed integers as `i64` for pointer handles and scalar launch dimensions, with range checks before casting.
+- Review-fix validation:
+  - `cargo fmt --all --check` passed.
+  - `cargo metadata --no-deps --format-version 1` passed.
+  - `cargo tree -p pegainfer-kernels -e normal --no-default-features --depth 1` shows normal dependencies only (`anyhow`, `cudarc`, `half`, `serde`), no `tvm-ffi`.
+  - `cargo tree -p pegainfer-kernels -e normal --features tvm-ffi-triton-cubin --depth 1` shows `tvm-ffi` only with the bridge feature enabled.
+  - `cargo check --release -p pegainfer-kernels` no longer needs `tvm-ffi-config`; it now gets further and stops at the existing dirty-FlashInfer `flashinfer_top1.cu` `TopKDispatch` mismatch.
+  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo test --release -p pegainfer-kernels --features tvm-ffi-triton-cubin triton_cubin --lib -- --nocapture` also stops at the same CUDA build-script mismatch before Rust tests run in this checkout.
+
+## Debrief
+
+- **Outcome**: Added optional TVM FFI dependency wiring plus a real Triton CUBIN wrapper MVP for the Qwen3.5 GDR solve launcher, with unit tests covering wrapper discovery, registry registration, packed handle conversion, and pre-launch diagnostics.
+- **Pitfalls encountered**:
+  - `apply_patch` and normal shell commands were blocked by a sandbox namespace failure, so edits were applied with scoped `git apply` patches.
+  - TVM FFI is now a real build prerequisite only when `tvm-ffi-triton-cubin` is enabled; hosts using that feature need `tvm-ffi-config` on `PATH`.
+  - Local full kernel-crate validation is currently blocked by the pre-existing dirty FlashInfer submodule, not by the TVM FFI code.
+- **Lessons learned**:
+  - TVM FFI typed callbacks currently cover only up to 8 arguments, while Triton/CUDA launchers can exceed that, so the wrapper should use packed TVM FFI callbacks for launch surfaces.
+- **Follow-ups**:
+  - Add packed TVM FFI wrappers for the remaining generated Triton AOT launchers once the FlashInfer submodule is back at the expected API or the CUDA call site is updated.
+  - Consider a higher-level DLPack/tensor-handle wrapper above the raw pointer/stream packed ABI once the DSL artifact contract is stable.
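The work log above describes accepting packed integers as `i64` for pointer handles and scalar launch dimensions, range-checking them before casting, and reporting missing or mistyped arguments before any CUDA launch. A minimal sketch of that validation pattern follows. It does not use the `tvm-ffi` crate: `PackedArg`, `ArgError`, `device_handle`, and `launch_dim` are hypothetical names for illustration, and the `u32` width for launch dimensions is an assumption, not something the PR states.

```rust
/// Hypothetical stand-in for one packed FFI argument.
/// The real bridge would use the value types provided by `tvm-ffi`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PackedArg {
    Int(i64),
    OpaquePtr(usize),
    Float(f64),
}

/// Hypothetical pre-launch diagnostics; each variant names the offending argument.
#[derive(Debug, PartialEq, Eq)]
pub enum ArgError {
    Missing { index: usize, name: &'static str },
    WrongType { index: usize, name: &'static str },
    OutOfRange { index: usize, name: &'static str, value: i64 },
}

/// Fetch argument `index`, or report it as missing.
fn get(args: &[PackedArg], index: usize, name: &'static str) -> Result<PackedArg, ArgError> {
    args.get(index).copied().ok_or(ArgError::Missing { index, name })
}

/// Decode a raw device pointer or stream handle.
/// Accepts a non-negative packed integer or an opaque pointer.
pub fn device_handle(args: &[PackedArg], index: usize, name: &'static str) -> Result<u64, ArgError> {
    match get(args, index, name)? {
        // Range check before casting: negative integers cannot be addresses.
        PackedArg::Int(value) => {
            u64::try_from(value).map_err(|_| ArgError::OutOfRange { index, name, value })
        }
        PackedArg::OpaquePtr(ptr) => Ok(ptr as u64),
        PackedArg::Float(_) => Err(ArgError::WrongType { index, name }),
    }
}

/// Decode a scalar launch dimension (assumed here to fit in `u32`).
pub fn launch_dim(args: &[PackedArg], index: usize, name: &'static str) -> Result<u32, ArgError> {
    match get(args, index, name)? {
        PackedArg::Int(value) => {
            u32::try_from(value).map_err(|_| ArgError::OutOfRange { index, name, value })
        }
        _ => Err(ArgError::WrongType { index, name }),
    }
}

fn main() {
    // Argument names below are placeholders, not the real launcher contract.
    let args = [PackedArg::Int(4096), PackedArg::OpaquePtr(0), PackedArg::Int(-1)];
    println!("{:?}", device_handle(&args, 0, "input_ptr"));
    println!("{:?}", device_handle(&args, 1, "stream"));
    println!("{:?}", launch_dim(&args, 2, "num_rows"));
    println!("{:?}", launch_dim(&args, 3, "num_cols"));
}
```

Because every check returns an error before any launch call is made, tests in this style can run without valid device memory, which matches the testing approach the work log describes.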
diff --git a/pegainfer-kernels/Cargo.toml b/pegainfer-kernels/Cargo.toml
@@ -8,15 +8,21 @@ anyhow = { workspace = true }
 cudarc = { workspace = true }
 half = { workspace = true }
 serde = { workspace = true }
+tvm-ffi = { version = "0.1.0-alpha.0", optional = true }
 
 [build-dependencies]
 cc = { workspace = true }
 
 [features]
 default = []
+tvm-ffi-triton-cubin = ["dep:tvm-ffi"]
 deepseek-v4 = []
 deepseek-v4-cutedsl-diagnostic = ["deepseek-v4"]
 kimi-k2 = []
 
+[[example]]
+name = "triton_cubin_tvm_ffi"
+required-features = ["tvm-ffi-triton-cubin"]
+
 [lints]
 workspace = true
diff --git a/pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs b/pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs
@@ -0,0 +1,20 @@
+use pegainfer_kernels::triton_cubin::{self, QWEN35_GDR_CHUNK_SOLVE};
+
+fn main() -> tvm_ffi::Result<()> {
+    triton_cubin::register_global_functions()?;
+
+    println!("registered Triton CUBIN TVM FFI functions:");
+    for spec in triton_cubin::TRITON_CUBIN_FUNCTIONS {
+        println!("  {} -> {}", spec.name, spec.ffi_symbol);
+    }
+
+    let solve = triton_cubin::get_global_or_register(QWEN35_GDR_CHUNK_SOLVE.name)?;
+    println!(
+        "{} is ready; call it with packed args: {}",
+        QWEN35_GDR_CHUNK_SOLVE.name,
+        QWEN35_GDR_CHUNK_SOLVE.arg_names.join(", ")
+    );
+
+    drop(solve);
+    Ok(())
+}
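The example above reads `name`, `ffi_symbol`, and `arg_names` from each entry of `TRITON_CUBIN_FUNCTIONS`, and the unit tests cover known and unknown wrapper lookup. The `triton_cubin` module itself is not shown in this diff, so the following is only a sketch of what such a metadata table and lookup might look like. The struct name `TritonCubinFunctionSpec`, the `find_spec` helper, the `EXAMPLE_SOLVE` entry, and all string values are hypothetical placeholders.

```rust
/// Hypothetical metadata for one wrapped launcher.
/// Field names follow the fields the example program reads.
#[derive(Debug)]
pub struct TritonCubinFunctionSpec {
    /// Name under which the packed function is registered.
    pub name: &'static str,
    /// C symbol of the generated launcher that the wrapper calls.
    pub ffi_symbol: &'static str,
    /// Positional packed-argument names, in call order.
    pub arg_names: &'static [&'static str],
}

/// Placeholder entry; these values are not the real Qwen3.5 GDR solve contract.
pub const EXAMPLE_SOLVE: TritonCubinFunctionSpec = TritonCubinFunctionSpec {
    name: "example.triton_cubin.solve",
    ffi_symbol: "example_triton_solve_cuda",
    arg_names: &["input_ptr", "output_ptr", "num_rows", "stream"],
};

/// Static table of every wrapper the module knows about.
pub const TRITON_CUBIN_FUNCTIONS: &[TritonCubinFunctionSpec] = &[EXAMPLE_SOLVE];

/// Look up a wrapper by registered name; `None` for unknown names.
pub fn find_spec(name: &str) -> Option<&'static TritonCubinFunctionSpec> {
    TRITON_CUBIN_FUNCTIONS.iter().find(|spec| spec.name == name)
}

fn main() {
    for spec in TRITON_CUBIN_FUNCTIONS {
        println!("  {} -> {}", spec.name, spec.ffi_symbol);
    }
    match find_spec("example.triton_cubin.solve") {
        Some(spec) => println!("packed args: {}", spec.arg_names.join(", ")),
        None => println!("not registered"),
    }
    assert!(find_spec("no.such.function").is_none());
}
```

Keeping the argument names in a static table lets a caller print the launch contract, as the example does, without touching the TVM FFI registry or the GPU.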
diff --git a/pegainfer-kernels/src/lib.rs b/pegainfer-kernels/src/lib.rs
@@ -7,4 +7,6 @@ pub mod gpu_buffers;
 pub mod ops;
 pub mod paged_kv;
 pub mod tensor;
+#[cfg(feature = "tvm-ffi-triton-cubin")]
+pub mod triton_cubin;
 pub mod typed_ops;