Give your Apple Silicon Mac the power to hear, see, read, speak, think — all locally.
中文 · English
```bash
# Clone repository
git clone https://github.com/bendusy/mlx-local-inference.git
cd mlx-local-inference

# Install Python libraries
pip install mlx-lm mlx-vlm mlx-whisper huggingface_hub

# Download models to ~/models/ (oMLX-compatible structure)
python3 -c "
from huggingface_hub import snapshot_download
import os

models = [
    'mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ',
    'mlx-community/Qwen3-ASR-1.7B-8bit',
    'mlx-community/Qwen3.5-35B-A3B-4bit'
]
for repo_id in models:
    model_name = repo_id.split('/')[-1]
    local_dir = os.path.expanduser(f'~/models/{model_name}')
    print(f'Downloading {model_name}...')
    snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        local_dir_use_symlinks=False
    )
"

# Install oMLX (for LLM/VLM) via Homebrew
brew tap omlx-ai/tap
brew install omlx

# Start oMLX server
omlx serve --model-dir ~/models --port 8000
```

Note: If you don't have Homebrew, install it first:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

Your M-series Mac has powerful unified memory, yet most AI workflows still send every request to the cloud. MLX Local Inference Stack turns your Mac into a fully self-contained AI workstation, with a memory-efficient design that works on 16 GB machines.
Choose the tier that matches your hardware. This stack prioritizes ASR (Speech-to-Text) to ensure seamless interaction via IM channels (Feishu, Discord).
- **Think/Vision:** `Qwen3.5-35B-A3B-4bit` (MoE)
  **ASR (critical):** `Qwen3-ASR-1.7B-8bit` (always on)
  **Strategy:** uses MoE for high-speed reasoning while keeping ASR resident for instant voice-to-agent communication.
- **Think:** `Gemma-3-12B-it-4bit`
  **ASR (critical):** `Qwen3-ASR-1.7B-8bit` (always on)
  **Strategy:** balanced for stability; prioritizes ASR residency over LLM size.
- **Think:** `Qwen3-7B-4bit`
  **ASR:** `Qwen3-ASR-1.7B-4bit` (on-demand)
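The tier choice above can be scripted. A minimal sketch: the `pick_tier` helper and the 32 GB / 24 GB thresholds are illustrative assumptions (the README does not pin exact memory cutoffs), so adjust them to your machine and workload:

```shell
# pick_tier: map unified memory (GB) to a suggested model pairing.
# The 32/24 GB thresholds are assumptions; tune for your own workload.
pick_tier() {
  if [ "$1" -ge 32 ]; then
    echo "Qwen3.5-35B-A3B-4bit + Qwen3-ASR-1.7B-8bit"
  elif [ "$1" -ge 24 ]; then
    echo "Gemma-3-12B-it-4bit + Qwen3-ASR-1.7B-8bit"
  else
    echo "Qwen3-7B-4bit + Qwen3-ASR-1.7B-4bit"
  fi
}

# On macOS, read the unified memory size (falls back to 16 GB if sysctl
# is unavailable, e.g. when testing this snippet off-device).
MEM_GB=$(( $(sysctl -n hw.memsize 2>/dev/null || echo 17179869184) / 1024 / 1024 / 1024 ))
pick_tier "$MEM_GB"
```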
To ensure maximum compatibility and avoid dependency conflicts, all components should be run via `uv`.
```bash
# Optimized for IM interaction (Feishu/Discord)
uv run --python 3.11 --with mlx-audio python -m mlx_audio.stt.generate \
  --model ~/models/Qwen3-ASR-1.7B-8bit \
  --audio "voice_message.ogg" \
  --output-path /tmp/asr_result \
  --language zh
```

```bash
# Run via oMLX or direct uv
uv run --with mlx-lm python -m mlx_lm.generate \
  --model ~/models/Qwen3.5-35B-A3B-4bit \
  --prompt "Analyze this request..."
```

Based on recent stress tests, the stack follows these optimization rules:
- **Observation:** Beyond 8,000 tokens, inference speed (TPS) on M4 chips experiences significant bandwidth throttling due to KV cache size.
- **Optimization:** Use `--kv-bits 4` for Qwen 3.5 MoE to maintain ~15 TPS even at 16k context.
- **Winner:** `Qwen3.5-35B-A3B` (MoE) consistently outperforms `Gemma-3-12B` (dense) in both throughput and reasoning depth on Apple Silicon.
- **Throughput:** Qwen 3.5 (~50 t/s) vs. Gemma 3 (~15 t/s).
- **Result:** Qwen 3.5 retains 100% logic consistency in complex tool-calling scenarios, even when context is pushed to the 32k limit (though speed drops to ~1 t/s).
- **Environment:** Always set `KMP_DUPLICATE_LIB_OK=TRUE` to prevent OpenMP initialization crashes.
- **Library sync:** Use `mlx-lm >= 0.31.1` for native Qwen 3.5 MoE support.
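Putting those rules together, here is a sketch of a tuned launch command. It is assembled as a string and printed rather than executed, so it can be inspected or copied into your own scripts; whether `--kv-bits` is accepted depends on your mlx-lm version:

```shell
# Prevent OpenMP double-initialization crashes before any MLX process starts
export KMP_DUPLICATE_LIB_OK=TRUE

# Tuned launch command; --kv-bits 4 quantizes the KV cache to hold
# throughput at long contexts (per the stress-test notes above)
CMD="uv run --with 'mlx-lm>=0.31.1' python -m mlx_lm.generate \
  --model ~/models/Qwen3.5-35B-A3B-4bit \
  --max-tokens 512 \
  --kv-bits 4 \
  --prompt 'Analyze this request...'"
echo "$CMD"
```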
Hybrid approach: oMLX for LLM/VLM (high performance), Python libraries for Embedding/ASR/OCR (simplicity).
```
┌─────────────────────────────────────┐
│ oMLX (localhost:8000/v1)            │
│  - LLM (Qwen3-14B, etc.)            │
│  - VLM (vision-language models)     │
│  - Continuous batching + SSD cache  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Python Libraries (direct call)      │
│  - mlx-lm: Embedding                │
│  - mlx-vlm: OCR (PaddleOCR-VL)      │
│  - mlx-whisper: ASR (Qwen3-ASR)     │
└─────────────────────────────────────┘
```
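The split above can be read as a dispatch table. A shell sketch (the `route` helper is illustrative, not part of the stack; the endpoint and library names mirror the diagram):

```shell
# Map each ability to the component that serves it (mirrors the diagram above)
route() {
  case "$1" in
    chat|vision) echo "oMLX: http://localhost:8000/v1/chat/completions" ;;
    embed)       echo "python: mlx-lm (direct call)" ;;
    asr)         echo "python: mlx-whisper (direct call)" ;;
    ocr)         echo "python: mlx-vlm (direct call)" ;;
    *)           echo "unknown ability: $1" >&2; return 1 ;;
  esac
}

route chat
route asr
```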
| Ability | Implementation | Model | Memory |
|---|---|---|---|
| 💬 Think | oMLX API | `Qwen3-14B-4bit` | ~8 GB |
| 👁️ See (VLM) | oMLX API | any mlx-vlm model | varies |
| 📐 Embed | mlx-lm (Python) | `Qwen3-Embedding-0.6B-4bit-DWQ` | ~1 GB |
| 👂 Hear | mlx-whisper (Python) | `Qwen3-ASR-1.7B-8bit` | ~1.5 GB |
| 👁️ Read (OCR) | mlx-vlm (Python) | `PaddleOCR-VL-1.5-6bit` | ~3.3 GB |
**Think (oMLX API):**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
response = client.chat.completions.create(
    model="Qwen3-14B-4bit",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
```

**Embed (mlx-lm):**

```python
from mlx_lm import load

# Load from ~/models/ (oMLX-compatible path)
model, tokenizer = load("~/models/Qwen3-Embedding-0.6B-4bit-DWQ")
inputs = tokenizer("text to embed", return_tensors="np")
embeddings = model(**inputs).last_hidden_state.mean(axis=1)
```

**Hear (mlx-whisper):**

```python
import mlx_whisper

# Load from ~/models/ (oMLX-compatible path)
result = mlx_whisper.transcribe(
    "audio.wav",
    path_or_hf_repo="~/models/Qwen3-ASR-1.7B-8bit"
)
print(result["text"])
```

**Read (mlx-vlm OCR):**

```python
from mlx_vlm import load, generate
from mlx_vlm.utils import load_image

# Load from ~/models/ (oMLX-compatible path)
model, processor = load("~/models/PaddleOCR-VL-1.5-6bit")
image = load_image("document.jpg")
output = generate(model, processor, image, "OCR:", max_tokens=512, temp=0.0)
print(output)
```

```bash
# List discovered models
curl http://localhost:8000/v1/models

# Restart service
launchctl kickstart -k gui/$(id -u)/com.omlx-server

# Logs
tail -f /tmp/omlx-server.log
```

- All models are stored in `~/models/` using an oMLX-compatible structure (e.g., `~/models/Qwen3-14B-4bit/`)
- oMLX is used only for LLM/VLM (chat/completions)
- Embedding/ASR/OCR are handled by Python libraries (mlx-lm, mlx-whisper, mlx-vlm)
- Future-proof: when oMLX adds support for Embedding/ASR, you can switch instantly without re-downloading models
```
mlx-local-inference/
├── SKILL.md
├── README.md
├── README_CN.md
├── references/
└── ...
```