ARLE
Pure-Rust runtime for serving, local agents, training, and evaluation. infer is the OpenAI-compatible serving binary; arle is the unified front door.


Quick Start · HTTP API · Support Matrix · Architecture · Roadmap · Changelog · Contributing

English · 简体中文


Quick Start

1. Install

Apple Silicon — Homebrew (recommended):

brew install cklxx/tap/arle
arle --doctor

Apple Silicon or Linux x86_64 — one-line installer:

curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh

The script downloads the matching tarball from the latest GitHub Release, verifies its SHA256 checksum, and installs the binaries into ~/.local/bin (override with INSTALL_DIR=...). See docs/install.md for the full support matrix, env-var overrides, and uninstall steps.
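
A sketch of the env-var override, assuming the installer reads INSTALL_DIR from its environment as described in docs/install.md (the target directory here is illustrative):

# Install into ~/bin instead of the default ~/.local/bin.
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | INSTALL_DIR="$HOME/bin" sh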

Linux + NVIDIA — pull the published Docker image, no compile:

docker run --rm --gpus all -p 8000:8000 \
  -v /path/to/Qwen3-4B:/model:ro \
  ghcr.io/cklxx/arle:latest \
  serve --backend cuda --model-path /model --port 8000

The :latest tag tracks the newest non-prerelease image. Tagged releases are published as ghcr.io/cklxx/arle:X.Y.Z (no v prefix; the docker metadata-action strips it); the current release is ghcr.io/cklxx/arle:0.1.5.
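
To pin a deployment to an exact release instead of :latest, use the versioned tag in the same command (0.1.5 shown; substitute the release you want):

docker run --rm --gpus all -p 8000:8000 -v /path/to/Qwen3-4B:/model:ro \
  ghcr.io/cklxx/arle:0.1.5 serve --backend cuda --model-path /model --port 8000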

From source (any backend; required for the cpu backend, CUDA/TileLang builds, or local hacking):

git clone https://github.com/cklxx/arle && cd arle
# Apple Silicon:
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle
# Linux + NVIDIA:
cargo build --release --features cuda --bin arle

2. Serve a model

arle serve --backend metal \
  --model-path mlx-community/Qwen3-0.6B-4bit --port 8000   # Apple Silicon
arle serve --backend cuda \
  --model-path /path/to/Qwen3-4B --port 8000               # Linux + NVIDIA

3. Talk to it

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)

Or with curl: see examples/curl_chat.sh. More copy-paste paths: examples/.
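
For reference, a minimal request of the same shape as a plain curl call (a sketch against the OpenAI-compatible endpoint, not the repo script verbatim):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-4b",
        "messages": [{"role": "user", "content": "Hello from ARLE"}]
      }'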

4. Run the local agent

arle                                                       # interactive REPL with built-in tools
arle --model-path /path/to/Qwen3-4B run --prompt "Summarize this repo"   # one-shot
arle --doctor --json                                       # self-check, machine-readable

CPU-only smoke build (no GPU required, source build):

cargo build --release --no-default-features --features cpu,no-cuda,cli --bin arle
./target/release/arle --doctor

Status at a glance

| Backend | Platform | Status | Notes |
| --- | --- | --- | --- |
| CUDA | Linux + NVIDIA | Stable | Continuous batching, paged KV, radix-backed reuse, TileLang BF16 attention, custom CUDA quantized decode, CUDA Graph decode, packed paged-prefill for Qwen3 / Qwen3.5. L4 / Qwen3-4B BF16 + FP8 paged KV (auto): 197 tok/s @ c=16 / 4096-in, peak_active=16 saturated. |
| Metal | Apple Silicon | Beta | Live scheduler-backed serving, chunked prefill, replay-backed prefix reuse. Qwen3.5-0.8B MLX 4bit single-request step-driver reaches 305.5 tok/s on M4 Pro 20c; GGUF Q4_K_M exact default is 202.1 tok/s direct, with an opt-in native-q4 Metal load path at 236.7 tok/s direct / 239.8 tok/s step-driver on the matched 1024/256 profile. |
| Metal DFlash | Apple Silicon | Beta (default-on) | Speculative decode for Qwen3 / Qwen3.5. Qwen3-4B bf16 achieves 5.9× decode speedup, Qwen3.5-4B-4bit maintains bit-identical parity, validated for c=1..8. |
| CPU | Portable | Dev-only | Smoke tests and request-path validation. DeepSeek V4 has a slow Rust reference path for 1B init correctness / HTTP smoke; not a serving-performance target. |

Models: Qwen3 (0.6B – 72B) and the Qwen3.5 family (including 0.8B GGUF Q4_K_M and 4B hybrid linear + full attention) are supported on CUDA and Metal according to the current matrix. Qwen3.6 / Qwen3.5-MoE has a narrow Metal Beta path; CUDA remains stubbed. Next-model priority queue: DeepSeek V4 (#1, V4-only substrate + CPU reference smoke landed), then Qwen3.6 (#2, planned); see ROADMAP.md §Next-Model Priority Order. DeepSeek V2/V3/R1 support paths are intentionally not carried in the current runtime.

Authoritative matrix (HTTP API tiers, quantization, agent / train / eval surfaces): docs/support-matrix.md. Stability tiers: docs/stability-policy.md.


Why ARLE

In agent and RL workloads, every turn pays a prefill tax: system prompt + history + tool results must be re-processed. As context grows, prefill dominates latency. ARLE treats this as the core problem in both serving and agent / RL loops:

  • Multi-turn KV reuse. Slot-sticky reuse keeps prior-turn KV hot for the next turn. CUDA also includes a radix-backed tiered-KV path (T0 GPU → T1 host pinned → T2 local disk → T3 cluster-shared) for full-block reuse and staged readmission, so only the new user message requires prefill each turn when the prefix stays reusable; a client-side sketch follows this list.
  • Paged KV pool. Main CUDA KV formats use page_size=16 with direct GPU page attach and tail-page CoW on shared prefixes — predictable accounting, reusable full blocks, cheaper prefix sharing.
  • Shared runtime authority. infer, arle, and the in-tree train / eval jobs resolve models and reuse the same Rust runtime / model contracts. Serving, local agent work, and RL tooling stay on one code path instead of drifting across separate stacks.
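
A sketch of that multi-turn pattern from the client side, assuming the OpenAI-compatible endpoint from the Quick Start (payloads and model name are illustrative; the reuse itself happens server-side and is transparent to the client): turn 2 resends the system prompt and turn-1 history, so only the new user message needs fresh prefill when the prefix stays reusable.

# Turn 1: system prompt + first user message.
curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen3-4b",
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize the repo layout."}
  ]
}'

# Turn 2: same system prompt and turn-1 history, plus one new user message;
# the shared prefix can be served from reused KV, so only the last message is new work.
curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen3-4b",
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize the repo layout."},
    {"role": "assistant", "content": "<turn-1 reply>"},
    {"role": "user", "content": "Now list the serving entry points."}
  ]
}'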

Architecture deep-dive: docs/architecture.md · docs/codebase-map.md. Latest benchmark snapshots (per change, dated): docs/experience/wins/ · run your own with scripts/bench_guidellm.sh.


Entry surfaces

arle is the single binary users interact with:

| Command | What it does |
| --- | --- |
| arle (no args) | Interactive agent REPL with built-in python and shell tools (sandboxed). |
| arle run --prompt "…" / --stdin --json | Script-friendly one-shot agent prompt. Use --no-tools to disable tool execution. |
| arle serve --backend {cuda,metal,cpu} --model-path … | Launch the OpenAI-compatible HTTP server through an ARLE-native backend. |
| arle train {pretrain,sft,grpo,multi-turn,eval} | In-tree training and RL workflows on the same runtime. |
| arle data {download,convert} | Dataset utilities. |
| arle --doctor [--json] [--strict] | Self-check: backend, hardware, HF cache, model resolution. CI-friendly. |
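
Two script-oriented sketches built from the flags in the table (assumptions: --stdin reads the prompt from standard input, and --strict makes a failed self-check exit non-zero; file names are illustrative):

# One-shot agent call: prompt from a file, machine-readable output, tools disabled.
arle --model-path /path/to/Qwen3-4B run --stdin --json --no-tools < notes.md > result.json

# CI gate: keep the self-check report as an artifact and fail the job on problems.
arle --doctor --json --strict > doctor-report.json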

The REPL persists line history at ~/.arle-history and exposes slash commands: /help, /reset, /clear, /tools, /model, /stats, /models, /save, /load, /export.

Operators who want only the native serving binary can use infer directly (cargo build -p infer --release --features cuda on Linux, --features metal,no-cuda on Apple Silicon) — same HTTP contract, without the agent / train / data surface.
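
The same build, as copy-paste commands taken directly from the sentence above:

# Serving-only binary: same HTTP contract, without the agent / train / data surface.
cargo build -p infer --release --features cuda             # Linux + NVIDIA
cargo build -p infer --release --features metal,no-cuda    # Apple Silicon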


📰 Latest Updates

  • 2026-05-10 — 🎉 W4-hybrid prefill graph capture closes the 4k/c=4 SGLang +76.6% gap via the Path B.2 bucketed allocation key (a56b7a9/c44788f). Engine-side TTFT p50 drops from 2000 ms to 150 ms (-92.5%) on an RTX 4070 Ti SUPER 16GB (server-side /v1/stats engine_ttft_us is the ground truth; client-side guidellm 0.6.0 readings are broken, isolated to a bench-tool bug). Throughput improves +632% in a 60 s window. Bucketed page_indices_len (64-entry) and prefix_token_rows_len (128-row) cut capture-key churn from 388 unique keys to 7, with 98.5% LRU reuse of the dominant key. Codex's "second-order bucketing" insight (captured scalar launch parameters use the bucket capacity, not the exact dim from the first capture) was load-bearing and is now a named anti-pattern in the skill v1.7.0 catalog. Opt-in via INFER_PREFILL_GRAPH=1 + INFER_HYBRID_W4A8_PREFILL=1. Also lands RoPE scaling support (YARN / Linear / NtkAware) wired through qwen3-spec, qwen35-spec, and precompute_rope_with_scaling. Evidence: docs/experience/wins/2026-05-10-bench-40-pathB2-tier1-strong-proceed.md, docs/experience/wins/2026-05-10-m-rope-yarn-scaling-phase1-phase2-landed.md.
  • 2026-04-28 — CUDA L4 Qwen3-4B BF16, c=16 / 4096-in increased from 120 → 197 tok/s (+64%) after enabling automatic HBM-tier chunked_prefill_size and FP8 paged KV defaulting on L4-class GPUs. peak_active saturates at 16/16; achieves +42% vs SGLang reference on the same workload. Evidence: docs/experience/wins/2026-04-28-bench-guidellm-cuda-l4-kv-fp8-auto.md.

Full history: CHANGELOG.md. Next up: ROADMAP.md.


Documentation map


License

MIT

About

Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.
