| name | mlx-local-inference |
|---|---|
| description | Use when calling local AI on this Mac — text generation, embeddings, speech-to-text, OCR, or image understanding. LLM/VLM via the oMLX gateway at localhost:8000/v1. Embedding/ASR/OCR via Python libraries (mlx-lm, mlx-vlm, mlx-audio). Works offline. Use instead of cloud APIs for privacy or low latency. |
| metadata | |
Local AI inference on Apple Silicon. oMLX handles LLM/VLM with continuous batching.
Python libraries handle Embedding/ASR/OCR directly via uv.
```
┌─────────────────────────────────────┐
│ oMLX (localhost:8000/v1)            │
│ - LLM (Qwen3-14B, etc.)             │
│ - VLM (vision-language models)      │
│ - Continuous batching + SSD cache   │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Python Libraries (via uv run)       │
│ - mlx-lm: Embedding                 │
│ - mlx-vlm: OCR (PaddleOCR-VL)       │
│ - mlx-audio: ASR (Qwen3-ASR)        │
└─────────────────────────────────────┘
```
| Capability | Implementation | Model | Size |
|---|---|---|---|
| 💬 LLM | oMLX API | Qwen3-14B-4bit | ~8 GB |
| 👁️ VLM | oMLX API | Any mlx-vlm model | varies |
| 📐 Embed | mlx-lm (uv) | Qwen3-Embedding-0.6B-4bit-DWQ | ~1 GB |
| 🎤 ASR | mlx-audio (uv) | Qwen3-ASR-1.7B-8bit | ~1.5 GB |
| 👁️ OCR | mlx-vlm (uv) | PaddleOCR-VL-1.5-6bit | ~3.3 GB |
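The VLM capability goes through the same OpenAI-compatible endpoint as the LLM. A minimal sketch of building a vision request, assuming oMLX accepts OpenAI-style `image_url` content parts (the helper function and data-URL approach here are ours, not part of oMLX):

```python
# Sketch: build an OpenAI-style vision message for the oMLX gateway.
# Assumption: the gateway accepts image_url content parts with data URLs.
import base64

def build_vision_messages(image_path: str, question: str) -> list:
    # Inline the image as a base64 data URL so no file hosting is needed
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]
```

The returned list can be passed as `messages` to `client.chat.completions.create(...)` against any VLM the gateway is serving.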
Choose the model tier that matches your Mac's unified memory:
| Tier | Mac Memory | Think/Vision Model | Load Strategy |
|---|---|---|---|
| Flagship (M4) | 32GB+ | Qwen3.5-35B-A3B-4bit | Always On |
| Performance | 16GB | Gemma-3-12B-it-4bit | On-demand |
| Standard | 8GB | Qwen3-7B-4bit | Aggressive Unload |
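The tier choice can be automated from detected memory; a minimal sketch using the thresholds and model names from the table (the `pick_model` helper itself is hypothetical; on macOS, physical memory in bytes is available via `sysctl -n hw.memsize`):

```python
# Sketch: map unified-memory size (GiB) to the model tiers in the table.
# Thresholds and model names mirror the tier table; pick_model is illustrative.
def pick_model(mem_gib: float) -> str:
    if mem_gib >= 32:
        return "Qwen3.5-35B-A3B-4bit"   # Flagship: keep always on
    if mem_gib >= 16:
        return "Gemma-3-12B-it-4bit"    # Performance: load on demand
    return "Qwen3-7B-4bit"              # Standard: unload aggressively

# On macOS, e.g.:
# mem_gib = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"])) / 2**30
```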
LLM/VLM chat completions via the oMLX API:

```python
from openai import OpenAI
import os

# Avoid crashes when multiple copies of the OpenMP runtime are loaded
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# Flagship recommendation for M4 32GB: Qwen 3.5 MoE
resp = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(resp.choices[0].message.content)
```

Embedding via mlx-lm:

```bash
uv run --python 3.11 --with mlx-lm python -c "
from mlx_lm import load
model, tokenizer = load('~/models/Qwen3-Embedding-0.6B-4bit-DWQ')
text = 'text to embed'
inputs = tokenizer(text, return_tensors='np')
embeddings = model(**inputs).last_hidden_state.mean(axis=1)
print(embeddings.shape)
"
```

**Important:** Must run with `--python 3.11` to avoid OpenMP threading issues (SIGSEGV).
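Downstream use of these pooled embedding vectors is typically similarity search; a minimal cosine-similarity sketch in plain Python (no MLX dependency, vectors assumed equal length):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```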
ASR via mlx-audio:

```bash
uv run --python 3.11 --with mlx-audio python -m mlx_audio.stt.generate \
  --model ~/models/Qwen3-ASR-1.7B-8bit \
  --audio "audio.wav" \
  --output-path /tmp/asr_result \
  --format txt \
  --language zh \
  --verbose
```

**Important:** The mlx-vlm `generate` function's parameter order must be `(model, processor, prompt, image)`.
OCR via mlx-vlm (PaddleOCR-VL):

```bash
cat << 'PY_EOF' > run_ocr.py
import os
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit")
model, processor = load(model_path)
prompt = apply_chat_template(processor, config=model.config, prompt="OCR:", num_images=1)
output = generate(model, processor, prompt, "document.jpg", max_tokens=512, temp=0.0)
print(output.text)
PY_EOF

uv run --python 3.11 --with mlx-vlm python run_ocr.py
```

```bash
# Check running models
curl http://localhost:8000/v1/models

# Restart oMLX
launchctl kickstart -k gui/$(id -u)/com.omlx-server
```

All models are stored in `~/models/` using the oMLX-compatible structure:
```
~/models/
├── Qwen3-Embedding-0.6B-4bit-DWQ/
├── Qwen3-ASR-1.7B-8bit/
├── PaddleOCR-VL-1.5-6bit/
└── Qwen3-14B-4bit/
```
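Before sending requests from code, it can help to probe the gateway programmatically; a sketch mirroring the `curl /v1/models` check above (the function name is ours, not part of oMLX):

```python
import json
import urllib.request

def gateway_models(base="http://localhost:8000/v1", timeout=2.0):
    """Return model IDs reported by the gateway, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base}/models", timeout=timeout) as r:
            # OpenAI-compatible /models responses wrap the list in a "data" key
            return [m["id"] for m in json.load(r)["data"]]
    except OSError:
        return []  # gateway down: restart it (launchctl) or fall back
```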
- Apple Silicon Mac (M1/M2/M3/M4)
- `uv` installed (`curl -LsSf https://astral.sh/uv/install.sh | sh`)