ARLE
Pure-Rust runtime for serving, local agents, training, and evaluation. infer is the OpenAI-compatible serving binary; arle is the unified front door.
Quick Start · HTTP API · Support Matrix · Architecture · Roadmap · Changelog · Contributing
English · 简体中文
Apple Silicon — Homebrew (recommended):
```sh
brew install cklxx/tap/arle
arle --doctor
```

Apple Silicon or Linux x86_64 — one-line installer:

```sh
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh
```

The script grabs the matching tarball from the latest GitHub Release, SHA256-verifies it, and drops the binaries into ~/.local/bin (override with INSTALL_DIR=...). See docs/install.md for the full matrix, env-var overrides, and uninstall steps.
Linux + NVIDIA — pull the published Docker image, no compile:
```sh
docker run --rm --gpus all -p 8000:8000 \
  -v /path/to/Qwen3-4B:/model:ro \
  ghcr.io/cklxx/arle:latest \
  serve --backend cuda --model-path /model --port 8000
```

The :latest tag tracks the newest non-prerelease release image. Tagged releases are published as ghcr.io/cklxx/arle:X.Y.Z (note: no v prefix; the docker metadata-action strips it). For the current release: ghcr.io/cklxx/arle:0.1.5.
From source (any backend; needed for cpu, CUDA/TileLang, or local hacking):
```sh
git clone https://github.com/cklxx/arle && cd arle
# Apple Silicon:
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle
# Linux + NVIDIA:
cargo build --release --features cuda --bin arle
```

Serve a model:

```sh
arle serve --backend metal \
  --model-path mlx-community/Qwen3-0.6B-4bit --port 8000   # Apple Silicon
arle serve --backend cuda \
  --model-path /path/to/Qwen3-4B --port 8000               # Linux + NVIDIA
```

Query the OpenAI-compatible endpoint:

```python
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)
```

Or with curl: see examples/curl_chat.sh. More copy-paste paths: examples/.
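The same client can also stream tokens. The snippet below is a minimal sketch that assumes the server follows the standard OpenAI streaming contract; the authoritative streaming behavior is specified in docs/http-api.md.

```python
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Request incremental output; each chunk carries a delta with the next piece of text.
stream = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Stream a haiku about Rust"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```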
```sh
arle                                                                      # interactive REPL with built-in tools
arle --model-path /path/to/Qwen3-4B run --prompt "Summarize this repo"    # one-shot
arle --doctor --json                                                      # self-check, machine-readable
```

CPU-only smoke build (no GPU required, source build):

```sh
cargo build --release --no-default-features --features cpu,no-cuda,cli --bin arle
./target/release/arle --doctor
```

| Backend | Platform | Status | Notes |
|---|---|---|---|
| CUDA | Linux + NVIDIA | Stable | Continuous batching, paged KV, radix-backed reuse, TileLang BF16 attention, custom CUDA quantized decode, CUDA Graph decode, packed paged-prefill for Qwen3 / Qwen3.5. L4 / Qwen3-4B BF16 + FP8 paged KV (auto): 197 tok/s @ c=16 / 4096-in, peak_active=16 saturated. |
| Metal | Apple Silicon | Beta | Live scheduler-backed serving, chunked prefill, replay-backed prefix reuse. Qwen3.5-0.8B MLX 4bit single-request step-driver reaches 305.5 tok/s on M4 Pro 20c; GGUF Q4_K_M exact default is 202.1 tok/s direct, with an opt-in native-q4 Metal load path at 236.7 tok/s direct / 239.8 tok/s step-driver on the matched 1024/256 profile. |
| Metal DFlash | Apple Silicon | Beta — default-on | Speculative decode for Qwen3 / Qwen3.5. Qwen3-4B bf16 achieves 5.9× decode speedup, Qwen3.5-4B-4bit maintains bit-identical parity, validated for c=1..8. |
| CPU | Portable | Dev-only | Smoke tests and request-path validation. DeepSeek V4 has a slow Rust reference path for 1B init correctness / HTTP smoke; not a serving-performance target. |
Models: Qwen3 (0.6B – 72B) and the Qwen3.5 family (including 0.8B GGUF Q4_K_M and 4B hybrid linear + full attention) are supported on CUDA and Metal according to the current matrix. Qwen3.6 / Qwen3.5-MoE has a narrow Metal Beta path; CUDA remains stubbed. Next-model priority queue: DeepSeek V4 (#1, V4-only substrate + CPU reference smoke landed) then Qwen 3.6 (#2, planned); see ROADMAP.md §Next-Model Priority Order. DeepSeek V2/V3/R1 support paths are intentionally not carried in the current runtime.
Authoritative matrix (HTTP API tiers, quantization, agent / train / eval surfaces): docs/support-matrix.md. Stability tiers: docs/stability-policy.md.
In agent and RL workloads every turn pays a prefill tax: system prompt + history + tool results must be re-processed. As context grows, prefill dominates latency. ARLE treats this as the core problem in both serving and agent / RL loops:
- Multi-turn KV reuse. Slot-sticky reuse keeps prior-turn KV hot for the next turn. CUDA also includes a radix-backed tiered-KV path (T0 GPU → T1 host pinned → T2 local disk → T3 cluster-shared) for full-block reuse and staged readmission, so only the new user message requires prefill each turn when the prefix stays reusable.
- Paged KV pool. Main CUDA KV formats use `page_size=16` with direct GPU page attach and tail-page CoW on shared prefixes — predictable accounting, reusable full blocks, cheaper prefix sharing.
- Shared runtime authority. `infer`, `arle`, and the in-tree train / eval jobs resolve models and reuse the same Rust runtime / model contracts. Serving, local agent work, and RL tooling stay on one code path instead of drifting across separate stacks.
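To make the reuse path above concrete, here is a minimal client-side sketch of a multi-turn loop against the OpenAI-compatible endpoint. The loop only ever appends to the message list, so every request shares the previous turns as an identical prefix, which is exactly the shape that slot-sticky and radix-backed reuse exploit; the client code is illustrative, and the reuse itself happens server-side.

```python
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# A fixed system prompt plus the growing history form the reusable prefix.
messages = [{"role": "system", "content": "You are a concise coding assistant."}]

for user_turn in ["What does `arle serve` do?", "How do I point it at a local model path?"]:
    messages.append({"role": "user", "content": user_turn})
    reply = client.chat.completions.create(model="qwen3-4b", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    # Only `user_turn` is new text this round; everything before it was already
    # processed last turn, so a prefix-reusing server only prefills the new suffix.
    print(answer)
```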
Architecture deep-dive: docs/architecture.md · docs/codebase-map.md.
Latest benchmark snapshots (per change, dated): docs/experience/wins/ · run your own with scripts/bench_guidellm.sh.
arle is the single binary users interact with:
| Command | What it does |
|---|---|
| `arle` (no args) | Interactive agent REPL with built-in python and shell tools (sandboxed). |
| `arle run --prompt "…"` / `--stdin --json` | Script-friendly one-shot agent prompt. Use `--no-tools` to disable tool execution. |
| `arle serve --backend {cuda,metal,cpu} --model-path …` | Launch the OpenAI-compatible HTTP server through an ARLE-native backend. |
| `arle train {pretrain,sft,grpo,multi-turn,eval}` | In-tree training and RL workflows on the same runtime. |
| `arle data {download,convert}` | Dataset utilities. |
| `arle --doctor [--json] [--strict]` | Self-check: backend, hardware, HF cache, model resolution. CI-friendly. |
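For CI wiring, a minimal sketch that wraps the machine-readable self-check from the table above. It assumes only that `--doctor --json` emits JSON on stdout and signals failure through a non-zero exit code; the exact report schema lives with the CLI, not here.

```python
import json
import subprocess

# Run the self-check; --strict is listed above (see the CLI help for its exact semantics).
proc = subprocess.run(
    ["arle", "--doctor", "--json", "--strict"],
    capture_output=True,
    text=True,
)
report = json.loads(proc.stdout)      # schema is backend-defined; just echo it into the CI log
print(json.dumps(report, indent=2))
raise SystemExit(proc.returncode)     # propagate failure to the CI job
```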
The REPL persists line history at ~/.arle-history and exposes slash commands: /help, /reset, /clear, /tools, /model, /stats, /models, /save, /load, /export.
Operators who want only the native serving binary can use `infer` directly (`cargo build -p infer --release --features cuda` on Linux, `--features metal,no-cuda` on Apple Silicon) — same HTTP contract, without the agent / train / data surface.
- 2026-05-10 — 🎉 W4-hybrid prefill graph capture closes the 4k/c=4 SGLang +76.6% gap via the Path B.2 bucketed allocation key (`a56b7a9` / `c44788f`). Engine-side TTFT p50 2000ms → 150ms (-92.5%) on RTX 4070 Ti SUPER 16GB (server-side `/v1/stats` `engine_ttft_us` ground truth; client-side guidellm 0.6.0 broken — bench tool bug isolated). Throughput +632% in a 60s window. Bucketed `page_indices_len` (64-entry) + `prefix_token_rows_len` (128-row) reduce capture-key churn from 388 unique keys → 7, with 98.5% LRU dominant-key reuse. Codex's "second-order bucketing" insight (captured scalar launch parameters use bucket capacity, not the exact dim from the first capture) was load-bearing; added as a new anti-pattern in the skill v1.7.0 catalog. Opt-in via `INFER_PREFILL_GRAPH=1` + `INFER_HYBRID_W4A8_PREFILL=1`. Plus RoPE scaling support (YARN / Linear / NtkAware) wired through qwen3-spec + qwen35-spec + `precompute_rope_with_scaling`. Evidence: docs/experience/wins/2026-05-10-bench-40-pathB2-tier1-strong-proceed.md, docs/experience/wins/2026-05-10-m-rope-yarn-scaling-phase1-phase2-landed.md.
- 2026-04-28 — CUDA L4 Qwen3-4B BF16, c=16 / 4096-in increased from 120 → 197 tok/s (+64%) after enabling automatic HBM-tier `chunked_prefill_size` and FP8 paged KV defaulting on L4-class GPUs. `peak_active` saturates at 16/16; +42% vs the SGLang reference on the same workload. Evidence: docs/experience/wins/2026-04-28-bench-guidellm-cuda-l4-kv-fp8-auto.md.
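The 2026-05-10 entry treats the server-side /v1/stats route as ground truth for engine TTFT. Below is a hedged sketch of reading it: the route and the `engine_ttft_us` field name come from the entry above, but the payload layout is not documented here, so inspect the response before keying into it.

```python
import json
import urllib.request

# Server-side stats are the ground-truth TTFT source in the benchmark notes above.
with urllib.request.urlopen("http://localhost:8000/v1/stats") as resp:
    stats = json.load(resp)

print(json.dumps(stats, indent=2))   # inspect the payload first
# Then pull out the engine_ttft_us figure once you know where it lives, e.g.:
# print(stats["engine_ttft_us"])     # hypothetical flat layout
```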
Full history: CHANGELOG.md. Next up: ROADMAP.md.
- docs/http-api.md — HTTP route contract, streaming behavior, boundary guarantees
- docs/support-matrix.md — backend / model / quant / API support tiers
- docs/stability-policy.md — stability levels and compatibility posture
- docs/architecture.md — package boundaries and dependency direction
- docs/codebase-map.md — workspace layout and main execution paths
- docs/environment.md — environment variables and runtime knobs
- docs/troubleshooting.md — common build / runtime errors and fixes
- docs/comparison.md — how ARLE compares to vLLM / SGLang / mistral.rs / llama.cpp
- docs/release-checklist.md · docs/perf-and-correctness-gates.md
- CONTRIBUTING.md — contributor setup, validation, release expectations
- SECURITY.md — vulnerability reporting policy
- examples/ — copy-paste smoke paths (curl, OpenAI SDK, Docker, Metal, train fixtures)
- docs/index.md — maintainer-facing PARA index, plans, and experience logs