🧠 MLX Local Inference Stack

Give your Apple Silicon Mac the power to hear, see, read, speak, think — all locally.


中文 · English


Installation

```bash
# Clone repository
git clone https://github.com/bendusy/mlx-local-inference.git
cd mlx-local-inference

# Install Python libraries
pip install mlx-lm mlx-vlm mlx-whisper huggingface_hub

# Download models to ~/models/ (oMLX-compatible structure)
python3 -c "
from huggingface_hub import snapshot_download
import os

models = [
    'mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ',
    'mlx-community/Qwen3-ASR-1.7B-8bit',
    'mlx-community/Qwen3.5-35B-A3B-4bit'
]

for repo_id in models:
    model_name = repo_id.split('/')[-1]
    local_dir = os.path.expanduser(f'~/models/{model_name}')
    print(f'Downloading {model_name}...')
    snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir
    )
"

# Install oMLX (for LLM/VLM) via Homebrew
brew tap omlx-ai/tap
brew install omlx

# Start oMLX server
omlx serve --model-dir ~/models --port 8000
```

Note: If you don't have Homebrew, install it first:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
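After the download step, a quick sanity check can confirm that every model directory landed where oMLX expects it. This is a minimal sketch: the folder names come from the download list above, and `~/models` is the assumed location.

```python
import os

# Model directories expected under ~/models after the download step above
EXPECTED = [
    "Qwen3-Embedding-0.6B-4bit-DWQ",
    "Qwen3-ASR-1.7B-8bit",
    "Qwen3.5-35B-A3B-4bit",
]

def missing_models(models_dir: str, expected=EXPECTED) -> list[str]:
    """Return the expected model folders that are absent from models_dir."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(models_dir, name))]

if __name__ == "__main__":
    gone = missing_models(os.path.expanduser("~/models"))
    print("all models present" if not gone else f"missing: {', '.join(gone)}")
```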

Why This Exists

Your M-series Mac has powerful unified memory — yet most AI workflows still send every request to the cloud. MLX Local Inference Stack turns your Mac into a fully self-contained AI workstation, with a memory-efficient design that works on 16 GB machines.

Model Selection Strategy (Unified Memory)

Choose the tier that matches your hardware. This stack prioritizes ASR (Speech-to-Text) to ensure seamless interaction via IM channels (Feishu, Discord).

🟢 32GB RAM Tier

  • Think/Vision: Qwen3.5-35B-A3B-4bit (MoE)
  • ASR (Critical): Qwen3-ASR-1.7B-8bit (Always On)
  • Strategy: Uses MoE for high-speed reasoning while keeping ASR resident for instant voice-to-agent communication.

🟡 16GB RAM Tier

  • Think: Gemma-3-12B-it-4bit
  • ASR (Critical): Qwen3-ASR-1.7B-8bit (Always On)
  • Strategy: Balanced for stability. Prioritizes ASR residency over LLM size.

⚪ 8GB RAM Tier

  • Think: Qwen3-7B-4bit
  • ASR: Qwen3-ASR-1.7B-4bit (On-demand)
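The tiers above reduce to a tiny selection helper. This is illustrative only; the RAM thresholds and model names are taken directly from the three tiers listed above.

```python
# Map installed RAM (GB) to the recommended "Think" model from the tiers above
def pick_think_model(ram_gb: int) -> str:
    if ram_gb >= 32:
        return "Qwen3.5-35B-A3B-4bit"   # 32 GB tier: MoE for fast reasoning
    if ram_gb >= 16:
        return "Gemma-3-12B-it-4bit"    # 16 GB tier: balanced for stability
    return "Qwen3-7B-4bit"              # 8 GB tier
```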

🛠️ Portable Execution (via uv)

To ensure maximum compatibility and avoid dependency clutter, run all components via uv.

👂 Hear — Instant ASR (High Priority)

```bash
# Optimized for IM interaction (Feishu/Discord)
uv run --python 3.11 --with mlx-audio python -m mlx_audio.stt.generate \
  --model ~/models/Qwen3-ASR-1.7B-8bit \
  --audio "voice_message.ogg" \
  --output-path /tmp/asr_result \
  --language zh
```
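When driving ASR from a bot (Feishu/Discord), it helps to build that invocation programmatically. A hypothetical helper, sketched under the assumption that the flags mirror the shell command above exactly:

```python
import os

# Hypothetical helper: assemble the uv ASR invocation above as an argv list,
# suitable for subprocess.run(). Defaults mirror the shell command.
def build_asr_cmd(audio_path: str,
                  model_dir: str = "~/models/Qwen3-ASR-1.7B-8bit",
                  output_path: str = "/tmp/asr_result",
                  language: str = "zh") -> list[str]:
    return [
        "uv", "run", "--python", "3.11", "--with", "mlx-audio",
        "python", "-m", "mlx_audio.stt.generate",
        "--model", os.path.expanduser(model_dir),
        "--audio", audio_path,
        "--output-path", output_path,
        "--language", language,
    ]

# Usage: subprocess.run(build_asr_cmd("voice_message.ogg"), check=True)
```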

🧠 Think — Local LLM

```bash
# Run via oMLX or direct uv
uv run --with mlx-lm python -m mlx_lm.generate \
  --model ~/models/Qwen3.5-35B-A3B-4bit \
  --prompt "Analyze this request..."
```

📊 Performance Benchmark Results (M4 32GB)

Based on recent stress tests, the stack follows these optimization rules:

1. The 8k Token Wall

  • Observation: Beyond 8,000 tokens, inference speed (TPS) on M4 chips experiences significant bandwidth throttling due to KV Cache size.
  • Optimization: Use --kv-bits 4 for Qwen 3.5 MoE to maintain ~15 TPS even at 16k context.

2. MoE vs Dense Architecture

  • Winner: Qwen3.5-35B-A3B (MoE) consistently outperforms Gemma-3-12B (Dense) in both throughput and reasoning depth on Apple Silicon.
  • Throughput: Qwen 3.5 (~50 t/s) vs Gemma 3 (~15 t/s).

3. Tool Calling Precision

  • Result: Qwen 3.5 retains 100% logic consistency in complex tool-calling scenarios, even when context is pushed to 32k limits (though speed drops to ~1 t/s).

4. System Stability Fixes

  • Environment: Always set KMP_DUPLICATE_LIB_OK=TRUE to prevent OpenMP initialization crashes.
  • Library Sync: Use mlx-lm >= 0.31.1 for native Qwen 3.5 MoE support.
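The two stability fixes above can be applied at the top of any launcher script. A minimal sketch: the environment variable must be set before any MLX import, and the version floor is the one stated in item 4 (the helper names here are my own).

```python
import os

# Fix 4a: set BEFORE any MLX import to prevent OpenMP double-initialization crashes
os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")

def version_tuple(v: str) -> tuple[int, ...]:
    """'0.31.1' -> (0, 31, 1), for a simple minimum-version comparison."""
    return tuple(int(part) for part in v.split("."))

def check_mlx_lm(min_version: str = "0.31.1") -> None:
    """Fix 4b: fail fast if mlx-lm predates native Qwen 3.5 MoE support."""
    from importlib.metadata import version
    installed = version("mlx-lm")
    if version_tuple(installed) < version_tuple(min_version):
        raise RuntimeError(
            f"mlx-lm {installed} < {min_version}; upgrade for Qwen 3.5 MoE support"
        )
```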

Architecture

Hybrid approach: oMLX for LLM/VLM (high performance), Python libraries for Embedding/ASR/OCR (simplicity).

```
┌─────────────────────────────────────┐
│  oMLX (localhost:8000/v1)           │
│  - LLM (Qwen3-14B, etc.)            │
│  - VLM (vision-language models)     │
│  - Continuous batching + SSD cache  │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│  Python Libraries (direct call)     │
│  - mlx-lm: Embedding                │
│  - mlx-vlm: OCR (PaddleOCR-VL)      │
│  - mlx-whisper: ASR (Qwen3-ASR)     │
└─────────────────────────────────────┘
```
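The split above can be expressed as a small routing table, which is how a dispatcher for this stack might decide where each request goes. Illustrative only; the task names and backend assignments come from the diagram above.

```python
# Task -> backend routing for the hybrid architecture above: chat/vision go
# through the oMLX HTTP API, everything else calls a Python library directly.
BACKENDS = {
    "chat":   ("omlx",        "http://localhost:8000/v1"),
    "vision": ("omlx",        "http://localhost:8000/v1"),
    "embed":  ("mlx-lm",      "direct Python call"),
    "asr":    ("mlx-whisper", "direct Python call"),
    "ocr":    ("mlx-vlm",     "direct Python call"),
}

def backend_for(task: str) -> str:
    """Return the backend name that handles the given task."""
    name, _endpoint = BACKENDS[task]
    return name
```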

What Your Mac Gains

| Ability | Implementation | Model | Memory |
| --- | --- | --- | --- |
| 💬 Think | oMLX API | Qwen3-14B-4bit | ~8 GB |
| 👁️ See (VLM) | oMLX API | Any mlx-vlm model | varies |
| 📐 Embed | mlx-lm (Python) | Qwen3-Embedding-0.6B-4bit-DWQ | ~1 GB |
| 👂 Hear | mlx-whisper (Python) | Qwen3-ASR-1.7B-8bit | ~1.5 GB |
| 👁️ Read (OCR) | mlx-vlm (Python) | PaddleOCR-VL-1.5-6bit | ~3.3 GB |

Usage

💬 LLM — Text Generation (via oMLX API)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

response = client.chat.completions.create(
    model="Qwen3-14B-4bit",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
```

📐 Embed — Text Vectorization (via mlx-lm)

```python
import os
import mlx.core as mx
from mlx_lm import load

# Load from ~/models/ (oMLX-compatible path); expand ~ so mlx-lm
# treats it as a local directory rather than a Hub repo id
model, tokenizer = load(os.path.expanduser("~/models/Qwen3-Embedding-0.6B-4bit-DWQ"))

tokens = mx.array([tokenizer.encode("text to embed")])
# Run the transformer trunk (model.model, skipping the LM head) to get
# hidden states, then mean-pool over the sequence for a sentence vector
hidden = model.model(tokens)   # shape: (1, seq_len, hidden_dim)
embedding = hidden.mean(axis=1)
```
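Once you have sentence vectors, plain cosine similarity ranks semantic closeness. A dependency-free sketch; the two vectors are assumed to come from the embedding call above:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```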

👂 Hear — Speech Recognition (via mlx-whisper)

```python
import os
import mlx_whisper

# Load from ~/models/ (oMLX-compatible path); expand ~ to an absolute path
result = mlx_whisper.transcribe(
    "audio.wav",
    path_or_hf_repo=os.path.expanduser("~/models/Qwen3-ASR-1.7B-8bit")
)
print(result["text"])
```

👁️ Read — OCR (via mlx-vlm)

```python
import os
from mlx_vlm import load, generate
from mlx_vlm.utils import load_image

# Load from ~/models/ (oMLX-compatible path); expand ~ to an absolute path
model, processor = load(os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit"))
image = load_image("document.jpg")

output = generate(model, processor, image, "OCR:", max_tokens=512, temp=0.0)
print(output)
```

Service Management (oMLX)

```bash
# List discovered models
curl http://localhost:8000/v1/models

# Restart service
launchctl kickstart -k gui/$(id -u)/com.omlx-server

# Logs
tail -f /tmp/omlx-server.log
```
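The same model listing can be done from Python for automated health checks. This sketch assumes the server from the setup above is on port 8000 and returns the OpenAI-style shape `{"data": [{"id": "..."}]}`:

```python
import json
import urllib.request

def list_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]

def fetch_models(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Query the oMLX /v1/models endpoint and return the discovered model ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return list_model_ids(json.load(resp))
```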

Notes

  • All models stored in ~/models/ using oMLX-compatible structure (e.g., ~/models/Qwen3-14B-4bit/)
  • oMLX is used only for LLM/VLM (chat/completions)
  • Embedding/ASR/OCR are handled by Python libraries (mlx-lm, mlx-whisper, mlx-vlm)
  • Future-proof: When oMLX adds support for Embedding/ASR, we can switch instantly without re-downloading models

Project Structure

```
mlx-local-inference/
├── SKILL.md
├── README.md
├── README_CN.md
├── references/
└── ...
```

License

MIT

About

Full local AI inference stack on Apple Silicon via MLX — LLM, ASR, Embedding, OCR, TTS, Transcription
