OpenInfer 0.1.0

OpenInfer 0.1.0 is the first source release of OpenInfer: a pure Rust + CUDA LLM inference engine with an OpenAI-compatible /v1/completions API.

This release marks the point where the Qwen3-4B/8B line is usable as a production-oriented small-model serving stack, while the large MoE model lines remain active work in progress.

What Is Included

- OpenAI-compatible completions serving through the vLLM Rust frontend.
- A Rust-native engine contract built around EngineHandle, request submission, and streamed TokenEvents.
- Per-model engines instead of one universal model abstraction.
- CUDA Graph decode path for stable steady-state serving.
- Continuous batching scheduler for Qwen3.
- Prefix cache backed by Dynamo KV block-management logic.
- Pegaflow KV offload integration for HBM to host memory recovery paths.
- Feature-gated model support:
  - Qwen3-4B / Qwen3-8B
  - Qwen3.5-4B
  - DeepSeek-V4
  - DeepSeek-V2-Lite
  - Kimi-K2
- Kernel layer using a mix of handwritten CUDA, cuBLAS, FlashInfer, Triton AOT, TileLang, and CuTe DSL.
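
To make the engine contract from the list above more concrete, here is a minimal sketch of what a handle with request submission and streamed token events could look like. Only the names EngineHandle and TokenEvent come from these notes; the fields, the method signatures, the use of std channels, and the stand-in "decode" loop are all assumptions made for illustration and are not OpenInfer's actual API.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;

/// One streamed event for a request (hypothetical shape).
#[derive(Debug, PartialEq)]
enum TokenEvent {
    Token(u32),
    Finished,
}

/// A submitted request: prompt token ids plus the channel to stream results on.
/// (Hypothetical; not taken from the OpenInfer sources.)
struct Request {
    prompt: Vec<u32>,
    max_tokens: usize,
    events: Sender<TokenEvent>,
}

/// Cloneable handle that callers use to submit work to the engine thread.
#[derive(Clone)]
struct EngineHandle {
    submit_tx: Sender<Request>,
}

impl EngineHandle {
    /// Spawn a stand-in engine loop. The "model" only increments the last
    /// prompt token, so the sketch runs without a GPU.
    fn spawn() -> Self {
        let (submit_tx, submit_rx) = mpsc::channel::<Request>();
        thread::spawn(move || {
            for req in submit_rx {
                let mut last = req.prompt.last().copied().unwrap_or(0);
                for _ in 0..req.max_tokens {
                    last += 1; // placeholder for a real decode step
                    if req.events.send(TokenEvent::Token(last)).is_err() {
                        break; // receiver dropped: the client went away
                    }
                }
                let _ = req.events.send(TokenEvent::Finished);
            }
        });
        EngineHandle { submit_tx }
    }

    /// Submit a request and get back a receiver that yields its token events.
    fn submit(&self, prompt: Vec<u32>, max_tokens: usize) -> Receiver<TokenEvent> {
        let (events, rx) = mpsc::channel();
        self.submit_tx
            .send(Request { prompt, max_tokens, events })
            .expect("engine thread has stopped");
        rx
    }
}
```

A caller would create the handle once with `EngineHandle::spawn()`, call `submit(prompt, max_tokens)` per request, and iterate the returned receiver until it yields `TokenEvent::Finished`. Keeping the contract this narrow is what lets a frontend stay independent of any particular per-model engine.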

Qwen3 Status

The Qwen3-4B/8B path is the most mature line in this release.

It includes:

- BF16 weight loading
- full-attention prefill and decode
- paged KV cache
- prefix cache
- continuous batching
- CUDA Graph decode
- greedy and sampling support
- LoRA routing support
- integration and golden-logit accuracy gates
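
For readers new to the continuous batching item in the list above, the idea can be sketched as a scheduler that admits waiting requests into free batch slots before every decode iteration and retires finished ones immediately. This is a generic toy illustration of the technique, not the OpenInfer scheduler; the type names, fields, and the fixed slot count are assumptions.

```rust
use std::collections::VecDeque;

/// A request being decoded (hypothetical fields).
#[derive(Debug)]
struct Sequence {
    id: u32,
    remaining: usize, // tokens still to generate
}

/// Toy continuous-batching scheduler with a fixed number of batch slots.
struct Scheduler {
    max_batch: usize,
    waiting: VecDeque<Sequence>,
    running: Vec<Sequence>,
}

impl Scheduler {
    fn new(max_batch: usize) -> Self {
        Scheduler { max_batch, waiting: VecDeque::new(), running: Vec::new() }
    }

    fn enqueue(&mut self, id: u32, tokens_to_generate: usize) {
        self.waiting.push_back(Sequence { id, remaining: tokens_to_generate });
    }

    /// Run one decode iteration and return the ids that finished in it.
    fn step(&mut self) -> Vec<u32> {
        // Admit waiting requests into free slots before every iteration,
        // instead of waiting for the whole batch to drain.
        while self.running.len() < self.max_batch {
            match self.waiting.pop_front() {
                Some(seq) => self.running.push(seq),
                None => break,
            }
        }
        // One decode step produces one token for every running sequence.
        for seq in &mut self.running {
            seq.remaining = seq.remaining.saturating_sub(1);
        }
        // Retire finished sequences so their slots free up immediately.
        let mut finished = Vec::new();
        self.running.retain(|seq| {
            if seq.remaining == 0 {
                finished.push(seq.id);
                false
            } else {
                true
            }
        });
        finished
    }

    fn is_idle(&self) -> bool {
        self.waiting.is_empty() && self.running.is_empty()
    }
}
```

With two slots and three requests needing 1, 3, and 2 tokens, the third request joins the batch on the second iteration, as soon as the first one finishes, and all three complete in three iterations. A static batch would hold the third request back until both earlier ones had drained.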

On a single RTX 5090, Qwen3-4B is broadly competitive with vLLM 0.22.1 under vllm bench serve using the same client, same request stream, and same seed.

At low load, OpenInfer and vLLM are close on time to first token (TTFT) and time per output token (TPOT). At medium load, vLLM is still ahead on decode TPOT. Under overload, OpenInfer reaches comparable or slightly higher saturated output throughput in the current sweep, but both engines are well past the healthy latency regime at that point.
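
For reference, the two latency metrics in the comparison above are usually derived per request as follows: TTFT is the time from sending the request to receiving the first output token, and TPOT is the remaining time divided by the number of output tokens after the first. The sketch below assumes that common definition; the struct and field names are illustrative and are not taken from either engine or from the benchmark client.

```rust
use std::time::Duration;

/// Timing for one streamed request, measured on the client side.
struct RequestTiming {
    /// Time from sending the request to receiving the first output token.
    ttft: Duration,
    /// Time from sending the request to receiving the last output token.
    end_to_end: Duration,
    /// Number of output tokens received.
    output_tokens: u32,
}

impl RequestTiming {
    /// Mean time per output token after the first one.
    /// Returns None when fewer than two tokens were produced, or when the
    /// timings are inconsistent (end_to_end shorter than ttft).
    fn tpot(&self) -> Option<Duration> {
        if self.output_tokens < 2 {
            return None;
        }
        let decode_time = self.end_to_end.checked_sub(self.ttft)?;
        Some(decode_time / (self.output_tokens - 1))
    }
}
```

For example, a request with a 50 ms TTFT that finishes 101 output tokens after 1050 ms has a TPOT of 10 ms. TTFT mostly reflects queueing plus prefill, while TPOT reflects steady-state decode, which is why the two can move in different directions as load increases.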

Prefix Cache And KV Offload

OpenInfer is designed around a readable KV data plane.

For GPU prefix-cache hits, warm TTFT stays low as context grows. In the Qwen3-4B RTX 5090 sweep, OpenInfer stays ahead of vLLM on warm TTFT across tested prompt lengths up to 16k tokens.
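
One common way to implement this kind of prefix matching is to split the prompt into fixed-size token blocks and give each block a hash that also covers everything before it, so the longest cached prefix can be found by walking block hashes from the start. The sketch below illustrates that general technique only. It is not the Dynamo block manager or OpenInfer's code, and the block size, hash function, and type names are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Tokens per KV block. Illustrative; real block sizes are typically larger.
const BLOCK_TOKENS: usize = 4;

/// Hash each full block of tokens, chaining in the previous block's hash so
/// that a block's hash identifies the whole prefix up to and including it.
fn block_hashes(tokens: &[u32]) -> Vec<u64> {
    let mut hashes = Vec::new();
    let mut parent: u64 = 0;
    for block in tokens.chunks_exact(BLOCK_TOKENS) {
        let mut hasher = DefaultHasher::new();
        parent.hash(&mut hasher);
        block.hash(&mut hasher);
        parent = hasher.finish();
        hashes.push(parent);
    }
    hashes
}

/// Toy prefix cache: the set of block hashes whose KV is resident.
#[derive(Default)]
struct PrefixCache {
    resident: HashSet<u64>,
}

impl PrefixCache {
    /// Record every full block of a processed prompt as cached.
    fn insert(&mut self, tokens: &[u32]) {
        self.resident.extend(block_hashes(tokens));
    }

    /// Number of leading prompt tokens whose KV can be reused.
    fn matched_tokens(&self, tokens: &[u32]) -> usize {
        block_hashes(tokens)
            .iter()
            .take_while(|h| self.resident.contains(*h))
            .count()
            * BLOCK_TOKENS
    }
}
```

After inserting an 8-token prompt, a 13-token prompt that starts with the same 8 tokens matches 8 tokens, so only the remaining 5 need prefill. A prompt that differs in its first block matches nothing, even if later blocks contain identical tokens, because each hash depends on the full prefix.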

Pegaflow integration provides a host-memory KV recovery path. In CPU-memory warm-hit mode, long prompts avoid full prefill and recover KV blocks from host memory, giving larger speedups as prompt length grows.

Why Rust

The goal of OpenInfer is not to rebuild every piece of the inference stack from scratch.

Instead, the project uses Rust to assemble components with explicit ownership boundaries:

- vLLM Rust frontend for protocol, tokenizer, streaming, and OpenAI compatibility.
- Dynamo KV block-management logic for prefix matching and KV lifecycle.
- Pegaflow for KV offload.
- Per-model engines for model-specific scheduling, state layout, and kernel DAGs.
- A kernel crate that keeps CUDA/Triton/TileLang/CuTe DSL assets feature-gated by model.

This keeps the serving path inspectable, avoids a large Python runtime in the hot service, and makes it easier to co-design routing, KV cache, scheduler, and kernels.

Known Limitations

- Qwen3 is the mature path in this release.
- DeepSeek-V4 and Kimi-K2 support exists, but serving performance and production polish are still catching up to vLLM.
- Some model lines require feature-specific build dependencies, such as Triton, TileLang, CuTe DSL, NCCL, or RDMA-related libraries.
- This is a source release, not a prebuilt binary distribution.
- No formal release workflow exists yet; this tag is intended as the first public 0.1.0 checkpoint.

Build

Default build targets Qwen3-4B:

cargo run --release -- --model-path /path/to/Qwen3-4B

Other model lines are feature-gated:

cargo run --release --features qwen35-4b -- --model-path /path/to/Qwen3.5-4B
cargo run --release --features deepseek-v4 -- --model-path /path/to/DeepSeek-V4
cargo run --release --features kimi-k2 -- --model-path /path/to/Kimi-K2
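
Once a server is running, clients talk to it through the OpenAI-compatible /v1/completions endpoint. As a rough sketch, the JSON body of such a request can be built as shown below. The fields model, prompt, max_tokens, and stream are standard OpenAI completions fields; the helper names are hypothetical, and the listen address, the model name the server expects, and which optional fields this release honors are not stated in these notes and should be treated as assumptions.

```rust
/// Escape a string for embedding in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the JSON body for a POST to /v1/completions (hypothetical helper).
fn completions_body(model: &str, prompt: &str, max_tokens: u32, stream: bool) -> String {
    format!(
        "{{\"model\":\"{}\",\"prompt\":\"{}\",\"max_tokens\":{},\"stream\":{}}}",
        json_escape(model),
        json_escape(prompt),
        max_tokens,
        stream
    )
}
```

The resulting string would be sent with a Content-Type of application/json by any HTTP client. In a real client a JSON library is the safer choice; the hand-written escaping here only keeps the sketch dependency-free.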

Closing

OpenInfer 0.1.0 is both a working inference engine and a statement of direction: inference systems should be built from smaller, readable, replaceable Rust components.

The long-term goal is for pieces such as the frontend, KV cache, offload layer, scheduler, and kernel tooling to become composable building blocks that can be reused outside this repository.

Releases: openinfer-project/openinfer
