95 changes: 95 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions docs/index.md
@@ -119,6 +119,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| `subsystems/kernels/pegainfer-kernels-boundary.md` | Architecture decision: pegainfer should use reusable frontend/runtime/data-plane layers plus per-model engines; kernels become first-class assets through a ledger, simulator, and request tracing. |
| `subsystems/kernels/kernel-op-reports.md` | Qwen3 kernel/report tooling is feature-gated: `qwen3_kernel_report` covers per-op kernel reports, and `qwen3_model_report` emits runtime-traced eager-DAG decode operator rollups with TensorSpec `KernelCall`s, latency stats, tables, and Graphviz DOT; measured FA2 `CTA_TILE_Q=64` prefill default in place. |
| `subsystems/kernels/typed-forward-pipeline.md` | Reusable typed tensor pipeline macro in `pegainfer-kernels` so model crates can express common `typed_ops` chains without model-specific wrapper macros. |
| `subsystems/kernels/tvm-ffi-mvp.md` | Optional `tvm-ffi-triton-cubin` bridge in `pegainfer-kernels` plus a packed TVM wrapper for the Qwen3.5 GDR solve Triton AOT CUBIN launcher. |

## playbooks

85 changes: 85 additions & 0 deletions docs/subsystems/kernels/tvm-ffi-mvp.md
@@ -0,0 +1,85 @@
# TVM FFI Triton CUBIN Wrapper

> **TL;DR:** `pegainfer-kernels` now has an optional `tvm-ffi-triton-cubin` bridge for the Qwen3.5 GDR solve Triton AOT CUBIN launcher, with unit coverage for wrapper registration and packed-ABI diagnostics.
>
> **Last touched:** 2026-06

## Preparation

- **Read**:
  - `docs/index.md` - routed this task to the kernels subsystem.
  - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - confirmed DSL/kernel integration belongs at the kernels boundary rather than in model runtimes.
  - `docs/subsystems/kernels/kernel-op-reports.md` - confirmed Triton/CuTe tooling is already feature-scoped in kernel infrastructure.
  - `pegainfer-kernels/tools/triton/README.md` - described the current Triton AOT CUBIN generation and validation path.
  - `pegainfer-kernels/build.rs` - showed generated Triton AOT C stubs and wrapper symbols.
  - `pegainfer-kernels/src/ffi/qwen35.rs` and `pegainfer-kernels/src/ffi/shared.rs` - showed the existing C ABI launch symbols used by Rust model code.
  - Local `tvm-ffi` crate source - confirmed typed callbacks only cover up to 8 arguments, so Triton launchers need packed TVM FFI wrappers.
- **Relevant history**:
  - GitHub issue `#191` proposed TVM FFI as the DSL interface direction.
  - Draft PR `#202` kept TVM FFI optional/test-only; PR `#315` now keeps the bridge optional behind `tvm-ffi-triton-cubin` while focusing it on Triton CUBIN launch wrappers.
- **Plan**:
  1. Add `tvm-ffi` as an optional dependency of `pegainfer-kernels` behind `tvm-ffi-triton-cubin`.
  2. Add a `triton_cubin` module that exposes a current Qwen3.5 Triton AOT CUBIN launcher as a packed TVM FFI function.
  3. Keep existing C ABI and Rust call sites available; the TVM FFI layer is an additional DSL boundary, not a production scheduler/model migration.
  4. Add a small example that registers the wrapper and prints the function contract.
  5. Validate formatting and the strongest local build/test checks available.
- **Risks / open questions**:
  - The `tvm-ffi-triton-cubin` feature means `tvm-ffi-config` and `libtvm_ffi` are build prerequisites only for the optional bridge path.
  - The current wrapper accepts raw device pointer and stream handles as TVM integers or opaque pointers; a future DLPack/tensor-handle wrapper can sit on top once the DSL artifact contract is stable.

## Execution Log

### Step 1: Optional dependency and wrapper surface
- Added optional `tvm-ffi = "0.1.0-alpha.0"` to `pegainfer-kernels` behind `tvm-ffi-triton-cubin`.
- Added `pegainfer_kernels::triton_cubin`, which exposes metadata plus a packed TVM FFI callback for the generated Qwen3.5 GDR solve Triton AOT launcher.
- Kept existing CUDA C ABI symbols and model call sites unchanged.
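
The metadata-plus-lookup shape described in this step can be sketched as follows. This is an illustration only, not the module's actual code: the field names `name`, `ffi_symbol`, and `arg_names` are taken from the example program, while `CubinFunctionSpec`, `DEMO_SOLVE`, `DEMO_FUNCTIONS`, `find_spec`, and all string values are hypothetical.

```rust
/// Hypothetical mirror of the wrapper metadata. Only the three field names
/// come from the example program; the struct name and layout are assumed.
#[derive(Debug, PartialEq)]
pub struct CubinFunctionSpec {
    /// Name the wrapper would be registered under in a global registry.
    pub name: &'static str,
    /// C symbol of the generated launcher that the wrapper forwards to.
    pub ffi_symbol: &'static str,
    /// Packed-argument names, in call order.
    pub arg_names: &'static [&'static str],
}

/// Placeholder entry; the strings are invented for illustration.
pub const DEMO_SOLVE: CubinFunctionSpec = CubinFunctionSpec {
    name: "demo.gdr_chunk_solve",
    ffi_symbol: "demo_gdr_chunk_solve_cuda",
    arg_names: &["a_ptr", "out_ptr", "seq_len", "stream"],
};

/// Static table of every wrapper the (hypothetical) module knows about.
pub static DEMO_FUNCTIONS: &[CubinFunctionSpec] = &[DEMO_SOLVE];

/// Looks a wrapper up by its registered name; unknown names yield `None`.
pub fn find_spec(name: &str) -> Option<&'static CubinFunctionSpec> {
    DEMO_FUNCTIONS.iter().find(|spec| spec.name == name)
}
```

Keeping the contract in a static table would let both the registration path and the diagnostics read argument names from one place, which is presumably why the example can print `arg_names` without touching CUDA.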

### Step 2: Small example
- Added `pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs` to register the TVM FFI global function and print the launch contract.

### Step 3: Unit test coverage
- Added wrapper unit tests for:
  - known/unknown wrapper lookup;
  - global TVM FFI registry round-trip;
  - accepted raw handle encodings (`u64` and opaque pointer);
  - missing-argument diagnostics before CUDA launch;
  - handle and scalar type diagnostics before CUDA launch.
- Kept tests on pre-launch validation paths so they do not require valid device memory or actually launch the Triton CUBIN.
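
A minimal sketch of what pre-launch validation of a packed handle argument might look like. It deliberately does not use the `tvm-ffi` crate: `PackedArg` and `handle_arg` are hypothetical stand-ins, and the error wording is invented.

```rust
/// Hypothetical stand-in for one packed FFI argument; not a `tvm-ffi` type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PackedArg {
    Int(i64),
    OpaquePtr(usize),
    Float(f64),
}

/// Reads argument `index` as a raw device-pointer or stream handle.
/// Every failure is returned as a message, so a caller can bail out
/// before any kernel launch is attempted.
pub fn handle_arg(args: &[PackedArg], index: usize, name: &str) -> Result<u64, String> {
    match args.get(index) {
        // Too few arguments were packed.
        None => Err(format!(
            "missing argument {index} (`{name}`): got {} args",
            args.len()
        )),
        // Integer encoding: reject values that cannot be an address.
        Some(PackedArg::Int(v)) => u64::try_from(*v)
            .map_err(|_| format!("argument {index} (`{name}`) is negative: {v}")),
        // Opaque-pointer encoding: pass the address through.
        Some(PackedArg::OpaquePtr(p)) => Ok(*p as u64),
        // Anything else is a type error.
        Some(other) => Err(format!(
            "argument {index} (`{name}`) must be a handle, got {other:?}"
        )),
    }
}
```

Because the function only inspects the argument slice, tests in this style can exercise every diagnostic with made-up addresses and never need a GPU.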

### Step 4: Validation
- `cargo fmt --all --check` passed.
- `cargo check --release -p pegainfer-kernels` no longer requires `tvm-ffi-config`; the TVM FFI bridge is feature-gated.
- Retried with the discovered local TVM FFI install on `PATH`:
  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo check --release -p pegainfer-kernels --features tvm-ffi-triton-cubin`
  - `tvm-ffi` built successfully.
  - The build then failed in the existing CUDA build at `pegainfer-kernels/csrc/shared/flashinfer_top1.cu` because the dirty `pegainfer-kernels/third_party/flashinfer` submodule commit changes the `TopKDispatch` API. This is unrelated to the TVM FFI wrapper and was left untouched.
- After adding tests:
  - `cargo fmt --all --check` passed.
  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo test --release -p pegainfer-kernels --features tvm-ffi-triton-cubin triton_cubin --lib` builds the optional bridge, then currently hits the existing `flashinfer_top1.cu` `TopKDispatch` API mismatch before Rust unit tests can run in this dirty submodule checkout.

### Step 5: Review fixes
- Addressed xiaguan's requested changes on PR `#315`:
  - made `tvm-ffi` optional behind `tvm-ffi-triton-cubin` so normal `pegainfer-kernels` builds do not require `tvm-ffi-config` / `libtvm_ffi`;
  - replaced `expect_err(...)` in tests with explicit `Result` matching because `tvm_ffi::Any` does not implement `Debug`;
  - updated the example and docs to require/pass the feature.
- Also addressed automated inline feedback by accepting TVM FFI packed integers as `i64` for pointer handles and scalar launch dimensions, with range checks before casting.
- Review-fix validation:
  - `cargo fmt --all --check` passed.
  - `cargo metadata --no-deps --format-version 1` passed.
  - `cargo tree -p pegainfer-kernels -e normal --no-default-features --depth 1` shows normal dependencies only (`anyhow`, `cudarc`, `half`, `serde`), no `tvm-ffi`.
  - `cargo tree -p pegainfer-kernels -e normal --features tvm-ffi-triton-cubin --depth 1` shows `tvm-ffi` only with the bridge feature enabled.
  - `cargo check --release -p pegainfer-kernels` no longer needs `tvm-ffi-config`, then stops at the existing dirty-FlashInfer `flashinfer_top1.cu` `TopKDispatch` mismatch.
  - `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo test --release -p pegainfer-kernels --features tvm-ffi-triton-cubin triton_cubin --lib -- --nocapture` also stops at the same CUDA build-script mismatch before Rust tests run in this checkout.
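
The range checks before casting mentioned in the review fixes could look roughly like the sketch below. The helper names are hypothetical, and the target types (`u64` for pointers, non-negative `i32` for launch dimensions) are assumptions rather than the wrapper's confirmed signature.

```rust
/// Converts a packed `i64` into a pointer-sized handle, rejecting negative
/// values instead of letting an `as` cast silently wrap them.
/// Assumption: handles are treated as unsigned 64-bit addresses.
pub fn pointer_from_i64(value: i64, name: &str) -> Result<u64, String> {
    u64::try_from(value).map_err(|_| format!("`{name}` must be non-negative, got {value}"))
}

/// Converts a packed `i64` into a scalar launch dimension.
/// Assumption: the launcher takes dimensions as non-negative `i32` values.
pub fn dim_from_i64(value: i64, name: &str) -> Result<i32, String> {
    match i32::try_from(value) {
        Ok(dim) if dim >= 0 => Ok(dim),
        _ => Err(format!(
            "`{name}` must be in 0..={}, got {value}",
            i32::MAX
        )),
    }
}
```

Using `try_from` rather than `as` means an out-of-range value surfaces as a diagnostic instead of a truncated launch parameter.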

## Debrief

- **Outcome**: Added optional TVM FFI dependency wiring plus a real Triton CUBIN wrapper MVP for the Qwen3.5 GDR solve launcher, with unit tests covering wrapper discovery, registry registration, packed handle conversion, and pre-launch diagnostics.
- **Pitfalls encountered**:
  - `apply_patch` and normal shell commands were blocked by the sandbox namespace failure, so edits were applied with scoped `git apply` patches.
  - TVM FFI is now a real build prerequisite only when `tvm-ffi-triton-cubin` is enabled; hosts using that feature need `tvm-ffi-config` on `PATH`.
  - Local full kernel-crate validation is currently blocked by the pre-existing dirty FlashInfer submodule, not by the TVM FFI code.
- **Lessons learned**:
  - TVM FFI typed callbacks currently cover only up to 8 arguments, while Triton/CUDA launchers can exceed that, so the wrapper should use packed TVM FFI callbacks for launch surfaces.
- **Follow-ups**:
  - Add packed TVM FFI wrappers for the remaining generated Triton AOT launchers once the FlashInfer submodule is back at the expected API or the CUDA call site is updated.
  - Consider a higher-level DLPack/tensor-handle wrapper above the raw pointer/stream packed ABI once the DSL artifact contract is stable.
6 changes: 6 additions & 0 deletions pegainfer-kernels/Cargo.toml
@@ -8,15 +8,21 @@ anyhow = { workspace = true }
cudarc = { workspace = true }
half = { workspace = true }
serde = { workspace = true }
tvm-ffi = { version = "0.1.0-alpha.0", optional = true }

[build-dependencies]
cc = { workspace = true }

[features]
default = []
tvm-ffi-triton-cubin = ["dep:tvm-ffi"]
deepseek-v4 = []
deepseek-v4-cutedsl-diagnostic = ["deepseek-v4"]
kimi-k2 = []

[[example]]
name = "triton_cubin_tvm_ffi"
required-features = ["tvm-ffi-triton-cubin"]

[lints]
workspace = true
20 changes: 20 additions & 0 deletions pegainfer-kernels/examples/triton_cubin_tvm_ffi.rs
@@ -0,0 +1,20 @@
use pegainfer_kernels::triton_cubin::{self, QWEN35_GDR_CHUNK_SOLVE};

fn main() -> tvm_ffi::Result<()> {
    triton_cubin::register_global_functions()?;

    println!("registered Triton CUBIN TVM FFI functions:");
    for spec in triton_cubin::TRITON_CUBIN_FUNCTIONS {
        println!(" {} -> {}", spec.name, spec.ffi_symbol);
    }

    let solve = triton_cubin::get_global_or_register(QWEN35_GDR_CHUNK_SOLVE.name)?;
    println!(
        "{} is ready; call it with packed args: {}",
        QWEN35_GDR_CHUNK_SOLVE.name,
        QWEN35_GDR_CHUNK_SOLVE.arg_names.join(", ")
    );

    drop(solve);
    Ok(())
}
2 changes: 2 additions & 0 deletions pegainfer-kernels/src/lib.rs
@@ -7,4 +7,6 @@ pub mod gpu_buffers;
pub mod ops;
pub mod paged_kv;
pub mod tensor;
#[cfg(feature = "tvm-ffi-triton-cubin")]
pub mod triton_cubin;
pub mod typed_ops;