ARLE
Pure-Rust runtime for serving, local agents, training, and evaluation. infer is the OpenAI-compatible serving binary; arle is the unified front door.


Quick Start · HTTP API · Support Matrix · Architecture · Roadmap · Changelog · Contributing

English · 简体中文


Quick Start

1. Install

Apple Silicon — Homebrew (recommended):

brew install cklxx/tap/arle
arle --doctor

Apple Silicon or Linux x86_64 — one-line installer:

curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh

The script downloads the matching tarball from the latest GitHub Release, verifies its SHA256 checksum, and installs the binaries into ~/.local/bin (override with INSTALL_DIR=...). See docs/install.md for the full support matrix, env-var overrides, and uninstall steps.
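
A sketch of the env-var override, assuming the installer reads INSTALL_DIR from its environment as described in docs/install.md (the target directory here is illustrative):

# Install into ~/bin instead of the default ~/.local/bin.
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | INSTALL_DIR="$HOME/bin" sh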

Linux + NVIDIA — pull the published Docker image, no compile:

docker run --rm --gpus all -p 8000:8000 \
  -v /path/to/Qwen3-4B:/model:ro \
  ghcr.io/cklxx/arle:latest \
  serve --backend cuda --model-path /model --port 8000

The :latest tag tracks the newest non-prerelease image. Tagged releases are published as ghcr.io/cklxx/arle:X.Y.Z (no v prefix; the docker metadata-action strips it); the current release is ghcr.io/cklxx/arle:0.1.5.
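
To pin a deployment to an exact release instead of :latest, use the versioned tag in the same command (0.1.5 shown; substitute the release you want):

docker run --rm --gpus all -p 8000:8000 -v /path/to/Qwen3-4B:/model:ro \
  ghcr.io/cklxx/arle:0.1.5 serve --backend cuda --model-path /model --port 8000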

From source (any backend; required for the cpu backend, CUDA/TileLang builds, or local hacking):

git clone https://github.com/cklxx/arle && cd arle
# Apple Silicon:
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle
# Linux + NVIDIA:
cargo build --release --features cuda --bin arle

2. Serve a model

arle serve --backend metal \
  --model-path mlx-community/Qwen3-0.6B-4bit --port 8000   # Apple Silicon
arle serve --backend cuda \
  --model-path /path/to/Qwen3-4B --port 8000               # Linux + NVIDIA

3. Talk to it

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)

Or with curl: see examples/curl_chat.sh. More copy-paste paths: examples/.
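
For reference, a minimal request of the same shape as a plain curl call (a sketch against the OpenAI-compatible endpoint, not the repo script verbatim):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-4b",
        "messages": [{"role": "user", "content": "Hello from ARLE"}]
      }'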

4. Run the local agent

arle                                                       # interactive REPL with built-in tools
arle --model-path /path/to/Qwen3-4B run --prompt "Summarize this repo"   # one-shot
arle --doctor --json                                       # self-check, machine-readable

CPU-only smoke build (no GPU required, source build):

cargo build --release --no-default-features --features cpu,no-cuda,cli --bin arle
./target/release/arle --doctor

Status at a glance

| Backend | Platform | Status | Notes |
| --- | --- | --- | --- |
| CUDA | Linux + NVIDIA | Stable | Continuous batching, paged KV, radix-backed reuse, TileLang BF16 attention, custom CUDA quantized decode, CUDA Graph decode, packed paged-prefill for Qwen3 / Qwen3.5. L4 / Qwen3-4B BF16 + FP8 paged KV (auto): 197 tok/s @ c=16 / 4096-in, peak_active=16 saturated. |
| Metal | Apple Silicon | Beta | Live scheduler-backed serving, chunked prefill, replay-backed prefix reuse. Qwen3.5-0.8B MLX 4bit single-request step-driver reaches 305.5 tok/s on M4 Pro 20c; GGUF Q4_K_M exact default is 202.1 tok/s direct, with an opt-in native-q4 Metal load path at 236.7 tok/s direct / 239.8 tok/s step-driver on the matched 1024/256 profile. |
| Metal DFlash | Apple Silicon | Beta (default-on) | Speculative decode for Qwen3 / Qwen3.5. Qwen3-4B bf16 achieves 5.9× decode speedup, Qwen3.5-4B-4bit maintains bit-identical parity, validated for c=1..8. |
| CPU | Portable | Dev-only | Smoke tests and request-path validation. DeepSeek V4 has a slow Rust reference path for 1B init correctness / HTTP smoke; not a serving-performance target. |

Models: Qwen3 (0.6B – 72B) and the Qwen3.5 family (including 0.8B GGUF Q4_K_M and 4B hybrid linear + full attention) are supported on CUDA and Metal according to the current matrix. Qwen3.6 / Qwen3.5-MoE has a narrow Metal Beta path; CUDA remains stubbed. Next-model priority queue: DeepSeek V4 (#1, V4-only substrate + CPU reference smoke landed), then Qwen3.6 (#2, planned); see ROADMAP.md §Next-Model Priority Order. DeepSeek V2/V3/R1 support paths are intentionally not carried in the current runtime.

Authoritative matrix (HTTP API tiers, quantization, agent / train / eval surfaces): docs/support-matrix.md. Stability tiers: docs/stability-policy.md.


Why ARLE

In agent and RL workloads, every turn pays a prefill tax: system prompt + history + tool results must be re-processed. As context grows, prefill dominates latency. ARLE treats this as the core problem in both serving and agent / RL loops:

  • Multi-turn KV reuse. Slot-sticky reuse keeps prior-turn KV hot for the next turn. CUDA also includes a radix-backed tiered-KV path (T0 GPU → T1 host pinned → T2 local disk → T3 cluster-shared) for full-block reuse and staged readmission, so only the new user message requires prefill each turn when the prefix stays reusable; a client-side sketch follows this list.
  • Paged KV pool. Main CUDA KV formats use page_size=16 with direct GPU page attach and tail-page CoW on shared prefixes — predictable accounting, reusable full blocks, cheaper prefix sharing.
  • Shared runtime authority. infer, arle, and the in-tree train / eval jobs resolve models and reuse the same Rust runtime / model contracts. Serving, local agent work, and RL tooling stay on one code path instead of drifting across separate stacks.
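
A sketch of that multi-turn pattern from the client side, assuming the OpenAI-compatible endpoint from the Quick Start (payloads and model name are illustrative; the reuse itself happens server-side and is transparent to the client): turn 2 resends the system prompt and turn-1 history, so only the new user message needs fresh prefill when the prefix stays reusable.

# Turn 1: system prompt + first user message.
curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen3-4b",
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize the repo layout."}
  ]
}'

# Turn 2: same system prompt and turn-1 history, plus one new user message;
# the shared prefix can be served from reused KV, so only the last message is new work.
curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen3-4b",
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize the repo layout."},
    {"role": "assistant", "content": "<turn-1 reply>"},
    {"role": "user", "content": "Now list the serving entry points."}
  ]
}'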

Architecture deep-dive: docs/architecture.md · docs/codebase-map.md. Latest benchmark snapshots (per change, dated): docs/experience/wins/ · run your own with scripts/bench_guidellm.sh.


Entry surfaces

arle is the single binary users interact with:

| Command | What it does |
| --- | --- |
| arle (no args) | Interactive agent REPL with built-in python and shell tools (sandboxed). |
| arle run --prompt "…" / --stdin --json | Script-friendly one-shot agent prompt. Use --no-tools to disable tool execution. |
| arle serve --backend {cuda,metal,cpu} --model-path … | Launch the OpenAI-compatible HTTP server through an ARLE-native backend. |
| arle train {pretrain,sft,grpo,multi-turn,eval} | In-tree training and RL workflows on the same runtime. |
| arle data {download,convert} | Dataset utilities. |
| arle --doctor [--json] [--strict] | Self-check: backend, hardware, HF cache, model resolution. CI-friendly. |
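
Two script-oriented sketches built from the flags in the table (assumptions: --stdin reads the prompt from standard input, and --strict makes a failed self-check exit non-zero; file names are illustrative):

# One-shot agent call: prompt from a file, machine-readable output, tools disabled.
arle --model-path /path/to/Qwen3-4B run --stdin --json --no-tools < notes.md > result.json

# CI gate: keep the self-check report as an artifact and fail the job on problems.
arle --doctor --json --strict > doctor-report.json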

The REPL persists line history at ~/.arle-history and exposes slash commands: /help, /reset, /clear, /tools, /model, /stats, /models, /save, /load, /export.

Operators who want only the native serving binary can use infer directly (cargo build -p infer --release --features cuda on Linux, --features metal,no-cuda on Apple Silicon) — same HTTP contract, without the agent / train / data surface.
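
The same build, as copy-paste commands taken directly from the sentence above:

# Serving-only binary: same HTTP contract, without the agent / train / data surface.
cargo build -p infer --release --features cuda             # Linux + NVIDIA
cargo build -p infer --release --features metal,no-cuda    # Apple Silicon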


📰 Latest Updates

  • 2026-05-10 — 🎉 W4-hybrid prefill graph capture closes the 4k/c=4 SGLang +76.6% gap via the Path B.2 bucketed allocation key (a56b7a9/c44788f). Engine-side TTFT p50 drops from 2000 ms to 150 ms (-92.5%) on an RTX 4070 Ti SUPER 16GB (server-side /v1/stats engine_ttft_us is the ground truth; client-side guidellm 0.6.0 readings are broken, isolated to a bench-tool bug). Throughput improves +632% in a 60 s window. Bucketed page_indices_len (64-entry) and prefix_token_rows_len (128-row) cut capture-key churn from 388 unique keys to 7, with 98.5% LRU reuse of the dominant key. Codex's "second-order bucketing" insight (captured scalar launch parameters use the bucket capacity, not the exact dim from the first capture) was load-bearing and is now a named anti-pattern in the skill v1.7.0 catalog. Opt-in via INFER_PREFILL_GRAPH=1 + INFER_HYBRID_W4A8_PREFILL=1. Also lands RoPE scaling support (YARN / Linear / NtkAware) wired through qwen3-spec, qwen35-spec, and precompute_rope_with_scaling. Evidence: docs/experience/wins/2026-05-10-bench-40-pathB2-tier1-strong-proceed.md, docs/experience/wins/2026-05-10-m-rope-yarn-scaling-phase1-phase2-landed.md.
  • 2026-04-28 — CUDA L4 Qwen3-4B BF16, c=16 / 4096-in increased from 120 → 197 tok/s (+64%) after enabling automatic HBM-tier chunked_prefill_size and FP8 paged KV defaulting on L4-class GPUs. peak_active saturates at 16/16; achieves +42% vs SGLang reference on the same workload. Evidence: docs/experience/wins/2026-04-28-bench-guidellm-cuda-l4-kv-fp8-auto.md.

Full history: CHANGELOG.md. Next up: ROADMAP.md.


Documentation map


License

MIT

About

Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.
