Releases: openinfer-project/openinfer
OpenInfer 0.1.0
OpenInfer 0.1.0 is the first source release of OpenInfer: a pure Rust + CUDA LLM inference engine with an OpenAI-compatible /v1/completions API.
This release marks the point where the Qwen3-4B/8B line is usable as a production-oriented small-model serving stack, while the large MoE model lines remain active work in progress.
What Is Included
- OpenAI-compatible completions serving through the vLLM Rust frontend.
- A Rust-native engine contract built around EngineHandle, request submission, and streamed TokenEvents.
- Per-model engines instead of one universal model abstraction.
- CUDA Graph decode path for stable steady-state serving.
- Continuous batching scheduler for Qwen3.
- Prefix cache backed by Dynamo KV block-management logic.
- Pegaflow KV offload integration for HBM to host memory recovery paths.
- Feature-gated model support:
  - Qwen3-4B / Qwen3-8B
  - Qwen3.5-4B
  - DeepSeek-V4
  - DeepSeek-V2-Lite
  - Kimi-K2
- Kernel layer using a mix of handwritten CUDA, cuBLAS, FlashInfer, Triton AOT, TileLang, and CuTe DSL.
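The engine contract in the list above (a handle, request submission, and streamed token events) can be pictured with a small standard-library sketch. Only the names EngineHandle and TokenEvent come from these notes; the fields, the `spawn` and `submit` methods, and the placeholder "generation" loop are assumptions made for illustration and are not the project's actual API.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;

/// Hypothetical event shape; the release notes do not list the real fields.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenEvent {
    Token { request_id: u64, token_id: u32 },
    Finished { request_id: u64 },
}

/// Hypothetical request: a prompt plus the channel its events are streamed on.
struct Request {
    request_id: u64,
    prompt_tokens: Vec<u32>,
    max_new_tokens: usize,
    events: Sender<TokenEvent>,
}

/// Hypothetical handle: a cloneable front door to a background engine thread.
#[derive(Clone)]
pub struct EngineHandle {
    submit_tx: Sender<Request>,
}

impl EngineHandle {
    /// Starts a toy engine loop. It does no inference: each "generated" token
    /// id is just a counter starting at the prompt length.
    pub fn spawn() -> Self {
        let (submit_tx, submit_rx) = mpsc::channel::<Request>();
        thread::spawn(move || {
            // The loop ends once every EngineHandle clone has been dropped.
            for req in submit_rx {
                let base = req.prompt_tokens.len() as u32;
                for i in 0..req.max_new_tokens as u32 {
                    let event = TokenEvent::Token {
                        request_id: req.request_id,
                        token_id: base + i,
                    };
                    // Stop early if the caller dropped its receiver.
                    if req.events.send(event).is_err() {
                        break;
                    }
                }
                let _ = req.events.send(TokenEvent::Finished {
                    request_id: req.request_id,
                });
            }
        });
        Self { submit_tx }
    }

    /// Submits a request and returns the stream of events for it.
    pub fn submit(
        &self,
        request_id: u64,
        prompt_tokens: Vec<u32>,
        max_new_tokens: usize,
    ) -> Receiver<TokenEvent> {
        let (events, rx) = mpsc::channel();
        self.submit_tx
            .send(Request { request_id, prompt_tokens, max_new_tokens, events })
            .expect("engine thread has exited");
        rx
    }
}
```

A caller would create the handle once with `EngineHandle::spawn()`, call `submit` per request, and iterate the returned receiver until the `Finished` event arrives. Returning a per-request receiver keeps streaming state owned by the caller, which is one way to get the explicit ownership boundaries these notes emphasize.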
Qwen3 Status
The Qwen3-4B/8B path is the most mature line in this release.
It includes:
- BF16 weight loading
- full-attention prefill and decode
- paged KV cache
- prefix cache
- continuous batching
- CUDA Graph decode
- greedy and sampling support
- LoRA routing support
- integration and golden-logit accuracy gates
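Of the items above, continuous batching is the one most easily shown in a few lines. The sketch below is a toy model of the general technique, not the OpenInfer scheduler: the `Scheduler` and `Seq` types and the `max_batch` parameter are hypothetical, and a real scheduler would also account for prefill, KV block availability, and preemption.

```rust
use std::collections::VecDeque;

/// Hypothetical sequence state: an id and how many tokens it still has to generate.
#[derive(Debug, Clone)]
pub struct Seq {
    pub id: u64,
    pub remaining: usize,
}

/// Toy continuous-batching scheduler with a fixed number of batch slots.
pub struct Scheduler {
    max_batch: usize,
    waiting: VecDeque<Seq>,
    running: Vec<Seq>,
}

impl Scheduler {
    pub fn new(max_batch: usize) -> Self {
        Self { max_batch, waiting: VecDeque::new(), running: Vec::new() }
    }

    /// Queues a request that wants `tokens_to_generate` output tokens.
    pub fn enqueue(&mut self, id: u64, tokens_to_generate: usize) {
        self.waiting.push_back(Seq { id, remaining: tokens_to_generate });
    }

    /// Runs one decode iteration and returns the ids that shared the batch.
    pub fn step(&mut self) -> Vec<u64> {
        // Admit waiting sequences into free slots before every iteration,
        // instead of waiting for the whole batch to drain.
        while self.running.len() < self.max_batch {
            match self.waiting.pop_front() {
                Some(seq) => self.running.push(seq),
                None => break,
            }
        }
        let batch: Vec<u64> = self.running.iter().map(|s| s.id).collect();
        // Each running sequence produces one token in this iteration.
        for seq in &mut self.running {
            seq.remaining = seq.remaining.saturating_sub(1);
        }
        // Finished sequences release their slots immediately.
        self.running.retain(|s| s.remaining > 0);
        batch
    }

    pub fn is_idle(&self) -> bool {
        self.waiting.is_empty() && self.running.is_empty()
    }
}
```

With two slots and three queued requests, the third request joins the batch as soon as the shortest one finishes, rather than waiting for both running requests to complete. That per-iteration admission is the property that distinguishes continuous batching from static batching.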
On a single RTX 5090, Qwen3-4B is broadly competitive with vLLM 0.22.1 under vllm bench serve using the same client, same request stream, and same seed.
At low load, OpenInfer and vLLM are close on TTFT and TPOT. At medium load, vLLM is still ahead on decode TPOT. Under overload, OpenInfer reaches comparable or slightly higher saturated output throughput in the current sweep, but both engines are well past the healthy latency regime at that point.
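For readers less familiar with the two metrics, the sketch below shows how they are commonly derived from per-token arrival times: TTFT is the delay until the first output token, and TPOT is the remaining generation time divided by the number of tokens after the first. The `latency_stats` function and `LatencyStats` type are hypothetical helpers written for this example and are not part of OpenInfer or the benchmark tooling.

```rust
use std::time::Duration;

/// Hypothetical per-request latency summary.
#[derive(Debug, PartialEq)]
pub struct LatencyStats {
    /// Time to first token.
    pub ttft: Duration,
    /// Mean time per output token after the first; None for single-token outputs.
    pub tpot: Option<Duration>,
}

/// `token_times` holds the arrival time of each output token, measured from
/// the moment the request was sent. Returns None if no token arrived.
pub fn latency_stats(token_times: &[Duration]) -> Option<LatencyStats> {
    let first = *token_times.first()?;
    let last = *token_times.last()?;
    let tpot = if token_times.len() > 1 {
        // (end-to-end latency - TTFT) / (output tokens - 1)
        let decode_time = last.saturating_sub(first);
        Some(decode_time / (token_times.len() as u32 - 1))
    } else {
        None
    };
    Some(LatencyStats { ttft: first, tpot })
}
```

For example, tokens arriving at 50 ms, 60 ms, 70 ms, and 80 ms give a TTFT of 50 ms and a TPOT of 10 ms. Separating the two is what lets the comparison above say the engines are close on TTFT while differing on decode TPOT.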
Prefix Cache And KV Offload
OpenInfer is designed around a readable KV data plane.
For GPU prefix-cache hits, warm TTFT stays low as context grows. In the Qwen3-4B RTX 5090 sweep, OpenInfer stays ahead of vLLM on warm TTFT across tested prompt lengths up to 16k tokens.
Pegaflow integration provides a host-memory KV recovery path. In CPU-memory warm-hit mode, long prompts avoid full prefill and recover KV blocks from host memory, giving larger speedups as prompt length grows.
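The reason a warm hit avoids prefill can be illustrated with a block-level prefix cache. The sketch below uses chained block hashes, a common approach in paged KV caches; it is an assumption for illustration and does not reproduce the Dynamo block-management logic or the Pegaflow recovery path. The `PrefixCache` type, its methods, and the block size are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hash of a block chained to the hash of everything before it, so equal
/// keys imply an equal prefix (ignoring hash collisions, which a real
/// system would need to handle).
fn chain_hash(parent: u64, block: &[u32]) -> u64 {
    let mut hasher = DefaultHasher::new();
    parent.hash(&mut hasher);
    block.hash(&mut hasher);
    hasher.finish()
}

/// Toy prefix cache mapping chained block hashes to hypothetical KV block ids.
pub struct PrefixCache {
    block_size: usize,
    blocks: HashMap<u64, usize>,
}

impl PrefixCache {
    pub fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        Self { block_size, blocks: HashMap::new() }
    }

    /// Records every full block of `tokens`; a trailing partial block is skipped.
    pub fn insert(&mut self, tokens: &[u32]) {
        let mut parent = 0u64;
        for block in tokens.chunks_exact(self.block_size) {
            parent = chain_hash(parent, block);
            let next_id = self.blocks.len();
            self.blocks.entry(parent).or_insert(next_id);
        }
    }

    /// Returns how many leading tokens of `tokens` are already cached.
    /// Only the tokens after this point would need prefill.
    pub fn matched_tokens(&self, tokens: &[u32]) -> usize {
        let mut parent = 0u64;
        let mut matched = 0;
        for block in tokens.chunks_exact(self.block_size) {
            parent = chain_hash(parent, block);
            if !self.blocks.contains_key(&parent) {
                break;
            }
            matched += self.block_size;
        }
        matched
    }
}
```

With a block size of 4, inserting a 10-token prompt caches its first 8 tokens, and a later prompt sharing those 8 tokens only needs prefill for the remainder. The same lookup applies whether the matched blocks live in GPU memory or are recovered from host memory; only the cost of fetching them differs.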
Why Rust
The goal of OpenInfer is not to rebuild every piece of the inference stack from scratch.
Instead, the project uses Rust to assemble components with explicit ownership boundaries:
- vLLM Rust frontend for protocol, tokenizer, streaming, and OpenAI compatibility.
- Dynamo KV block-management logic for prefix matching and KV lifecycle.
- Pegaflow for KV offload.
- Per-model engines for model-specific scheduling, state layout, and kernel DAGs.
- A kernel crate that keeps CUDA/Triton/TileLang/CuTe DSL assets feature-gated by model.
This keeps the serving path inspectable, avoids a large Python runtime in the hot service, and makes it easier to co-design routing, KV cache, scheduler, and kernels.
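As a rough picture of what feature-gating by model can look like in Rust, the sketch below reports which model lines were compiled in. The feature names are the ones used in the build commands in these notes; the `compiled_model_lines` function is hypothetical, and treating Qwen3 as always present is an assumption based on it being the default build target.

```rust
/// Lists the model lines compiled into this binary.
/// Hypothetical helper: the real crate layout is not described in the notes.
pub fn compiled_model_lines() -> Vec<&'static str> {
    // Assumption: the default build always includes the Qwen3 line.
    let mut lines = vec!["qwen3"];
    // `cfg!` is evaluated at compile time, so a disabled feature
    // contributes nothing to the list.
    if cfg!(feature = "qwen35-4b") {
        lines.push("qwen35-4b");
    }
    if cfg!(feature = "deepseek-v4") {
        lines.push("deepseek-v4");
    }
    if cfg!(feature = "kimi-k2") {
        lines.push("kimi-k2");
    }
    lines
}
```

In a real crate the gated items would more likely be whole modules behind `#[cfg(feature = "...")]`, so that a model's kernels and build dependencies are not compiled at all unless its feature is enabled.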
Known Limitations
- Qwen3 is the mature path in this release.
- DeepSeek-V4 and Kimi-K2 support exists, but serving performance and production polish are still catching up to vLLM.
- Some model lines require feature-specific build dependencies, such as Triton, TileLang, CuTe DSL, NCCL, or RDMA-related libraries.
- This is a source release, not a prebuilt binary distribution.
- No formal release workflow exists yet; this tag is intended as the first public 0.1.0 checkpoint.
Build
Default build targets Qwen3-4B:
cargo run --release -- --model-path /path/to/Qwen3-4B
Other model lines are feature-gated:
cargo run --release --features qwen35-4b -- --model-path /path/to/Qwen3.5-4B
cargo run --release --features deepseek-v4 -- --model-path /path/to/DeepSeek-V4
cargo run --release --features kimi-k2 -- --model-path /path/to/Kimi-K2
Closing
OpenInfer 0.1.0 is both a working inference engine and a statement of direction: inference systems should be built from smaller, readable, replaceable Rust components.
The long-term goal is for pieces such as the frontend, KV cache, offload layer, scheduler, and kernel tooling to become composable building blocks that can be reused outside this repository.