🧠 MLX Local Inference Stack

Give your Apple Silicon Mac the power to hear, see, read, speak, think — all locally.


中文 · English


Installation

```bash
# Clone repository
git clone https://github.com/bendusy/mlx-local-inference.git
cd mlx-local-inference

# Install Python libraries
pip install mlx-lm mlx-vlm mlx-whisper huggingface_hub

# Download models to ~/models/ (oMLX-compatible structure)
python3 -c "
from huggingface_hub import snapshot_download
import os

models = [
    'mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ',
    'mlx-community/Qwen3-ASR-1.7B-8bit',
    'mlx-community/Qwen3.5-35B-A3B-4bit'
]

for repo_id in models:
    model_name = repo_id.split('/')[-1]
    local_dir = os.path.expanduser(f'~/models/{model_name}')
    print(f'Downloading {model_name}...')
    snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir
    )
"

# Install oMLX (for LLM/VLM) via Homebrew
brew tap omlx-ai/tap
brew install omlx

# Start oMLX server
omlx serve --model-dir ~/models --port 8000
```

Note: If you don't have Homebrew, install it first:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
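After the download step, a quick sanity check can confirm that every model directory landed where oMLX expects it. This is a minimal sketch: the folder names come from the download list above, and `~/models` is the assumed location.

```python
import os

# Model directories expected under ~/models after the download step above
EXPECTED = [
    "Qwen3-Embedding-0.6B-4bit-DWQ",
    "Qwen3-ASR-1.7B-8bit",
    "Qwen3.5-35B-A3B-4bit",
]

def missing_models(models_dir: str, expected=EXPECTED) -> list[str]:
    """Return the expected model folders that are absent from models_dir."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(models_dir, name))]

if __name__ == "__main__":
    gone = missing_models(os.path.expanduser("~/models"))
    print("all models present" if not gone else f"missing: {', '.join(gone)}")
```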

Why This Exists

Your M-series Mac has powerful unified memory — yet most AI workflows still send every request to the cloud. MLX Local Inference Stack turns your Mac into a fully self-contained AI workstation, with a memory-efficient design that works on 16 GB machines.

Model Selection Strategy (Unified Memory)

Choose the tier that matches your hardware. This stack prioritizes ASR (Speech-to-Text) to ensure seamless interaction via IM channels (Feishu, Discord).

🟢 32GB RAM Tier

  • Think/Vision: Qwen3.5-35B-A3B-4bit (MoE)
  • ASR (Critical): Qwen3-ASR-1.7B-8bit (Always On)
  • Strategy: Uses MoE for high-speed reasoning while keeping ASR resident for instant voice-to-agent communication.

🟡 16GB RAM Tier

  • Think: Gemma-3-12B-it-4bit
  • ASR (Critical): Qwen3-ASR-1.7B-8bit (Always On)
  • Strategy: Balanced for stability. Prioritizes ASR residency over LLM size.

⚪ 8GB RAM Tier

  • Think: Qwen3-7B-4bit
  • ASR: Qwen3-ASR-1.7B-4bit (On-demand)
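The tiers above reduce to a tiny selection helper. This is illustrative only; the RAM thresholds and model names are taken directly from the three tiers listed above.

```python
# Map installed RAM (GB) to the recommended "Think" model from the tiers above
def pick_think_model(ram_gb: int) -> str:
    if ram_gb >= 32:
        return "Qwen3.5-35B-A3B-4bit"   # 32 GB tier: MoE for fast reasoning
    if ram_gb >= 16:
        return "Gemma-3-12B-it-4bit"    # 16 GB tier: balanced for stability
    return "Qwen3-7B-4bit"              # 8 GB tier
```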

🛠️ Portable Execution (via uv)

To ensure maximum compatibility and avoid dependency clutter, run all components via uv.

👂 Hear — Instant ASR (High Priority)

```bash
# Optimized for IM interaction (Feishu/Discord)
uv run --python 3.11 --with mlx-audio python -m mlx_audio.stt.generate \
  --model ~/models/Qwen3-ASR-1.7B-8bit \
  --audio "voice_message.ogg" \
  --output-path /tmp/asr_result \
  --language zh
```
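When driving ASR from a bot (Feishu/Discord), it helps to build that invocation programmatically. A hypothetical helper, sketched under the assumption that the flags mirror the shell command above exactly:

```python
import os

# Hypothetical helper: assemble the uv ASR invocation above as an argv list,
# suitable for subprocess.run(). Defaults mirror the shell command.
def build_asr_cmd(audio_path: str,
                  model_dir: str = "~/models/Qwen3-ASR-1.7B-8bit",
                  output_path: str = "/tmp/asr_result",
                  language: str = "zh") -> list[str]:
    return [
        "uv", "run", "--python", "3.11", "--with", "mlx-audio",
        "python", "-m", "mlx_audio.stt.generate",
        "--model", os.path.expanduser(model_dir),
        "--audio", audio_path,
        "--output-path", output_path,
        "--language", language,
    ]

# Usage: subprocess.run(build_asr_cmd("voice_message.ogg"), check=True)
```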

🧠 Think — Local LLM

```bash
# Run via oMLX or direct uv
uv run --with mlx-lm python -m mlx_lm.generate \
  --model ~/models/Qwen3.5-35B-A3B-4bit \
  --prompt "Analyze this request..."
```

📊 Performance Benchmark Results (M4 32GB)

Based on recent stress tests, the stack follows these optimization rules:

1. The 8k Token Wall

  • Observation: Beyond 8,000 tokens, inference speed (TPS) on M4 chips experiences significant bandwidth throttling due to KV Cache size.
  • Optimization: Use --kv-bits 4 for Qwen 3.5 MoE to maintain ~15 TPS even at 16k context.

2. MoE vs Dense Architecture

  • Winner: Qwen3.5-35B-A3B (MoE) consistently outperforms Gemma-3-12B (Dense) in both throughput and reasoning depth on Apple Silicon.
  • Throughput: Qwen 3.5 (~50 t/s) vs Gemma 3 (~15 t/s).

3. Tool Calling Precision

  • Result: Qwen 3.5 retains 100% logic consistency in complex tool-calling scenarios, even when context is pushed to 32k limits (though speed drops to ~1 t/s).

4. System Stability Fixes

  • Environment: Always set KMP_DUPLICATE_LIB_OK=TRUE to prevent OpenMP initialization crashes.
  • Library Sync: Use mlx-lm >= 0.31.1 for native Qwen 3.5 MoE support.
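The two stability fixes above can be applied at the top of any launcher script. A minimal sketch: the environment variable must be set before any MLX import, and the version floor is the one stated in item 4 (the helper names here are my own).

```python
import os

# Fix 4a: set BEFORE any MLX import to prevent OpenMP double-initialization crashes
os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")

def version_tuple(v: str) -> tuple[int, ...]:
    """'0.31.1' -> (0, 31, 1), for a simple minimum-version comparison."""
    return tuple(int(part) for part in v.split("."))

def check_mlx_lm(min_version: str = "0.31.1") -> None:
    """Fix 4b: fail fast if mlx-lm predates native Qwen 3.5 MoE support."""
    from importlib.metadata import version
    installed = version("mlx-lm")
    if version_tuple(installed) < version_tuple(min_version):
        raise RuntimeError(
            f"mlx-lm {installed} < {min_version}; upgrade for Qwen 3.5 MoE support"
        )
```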

Architecture

Hybrid approach: oMLX for LLM/VLM (high performance), Python libraries for Embedding/ASR/OCR (simplicity).

```
┌─────────────────────────────────────┐
│  oMLX (localhost:8000/v1)           │
│  - LLM (Qwen3-14B, etc.)            │
│  - VLM (vision-language models)     │
│  - Continuous batching + SSD cache  │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│  Python Libraries (direct call)     │
│  - mlx-lm: Embedding                │
│  - mlx-vlm: OCR (PaddleOCR-VL)      │
│  - mlx-whisper: ASR (Qwen3-ASR)     │
└─────────────────────────────────────┘
```
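The split above can be expressed as a small routing table, which is how a dispatcher for this stack might decide where each request goes. Illustrative only; the task names and backend assignments come from the diagram above.

```python
# Task -> backend routing for the hybrid architecture above: chat/vision go
# through the oMLX HTTP API, everything else calls a Python library directly.
BACKENDS = {
    "chat":   ("omlx",        "http://localhost:8000/v1"),
    "vision": ("omlx",        "http://localhost:8000/v1"),
    "embed":  ("mlx-lm",      "direct Python call"),
    "asr":    ("mlx-whisper", "direct Python call"),
    "ocr":    ("mlx-vlm",     "direct Python call"),
}

def backend_for(task: str) -> str:
    """Return the backend name that handles the given task."""
    name, _endpoint = BACKENDS[task]
    return name
```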

What Your Mac Gains

| Ability | Implementation | Model | Memory |
| --- | --- | --- | --- |
| 💬 Think | oMLX API | Qwen3-14B-4bit | ~8 GB |
| 👁️ See (VLM) | oMLX API | Any mlx-vlm model | varies |
| 📐 Embed | mlx-lm (Python) | Qwen3-Embedding-0.6B-4bit-DWQ | ~1 GB |
| 👂 Hear | mlx-whisper (Python) | Qwen3-ASR-1.7B-8bit | ~1.5 GB |
| 👁️ Read (OCR) | mlx-vlm (Python) | PaddleOCR-VL-1.5-6bit | ~3.3 GB |

Usage

💬 LLM — Text Generation (via oMLX API)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

response = client.chat.completions.create(
    model="Qwen3-14B-4bit",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
```

📐 Embed — Text Vectorization (via mlx-lm)

```python
import os
import mlx.core as mx
from mlx_lm import load

# Load from ~/models/ (oMLX-compatible path); expand ~ so mlx-lm
# treats it as a local directory rather than a Hub repo id
model, tokenizer = load(os.path.expanduser("~/models/Qwen3-Embedding-0.6B-4bit-DWQ"))

tokens = mx.array([tokenizer.encode("text to embed")])
# Run the transformer trunk (model.model, skipping the LM head) to get
# hidden states, then mean-pool over the sequence for a sentence vector
hidden = model.model(tokens)   # shape: (1, seq_len, hidden_dim)
embedding = hidden.mean(axis=1)
```
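Once you have sentence vectors, plain cosine similarity ranks semantic closeness. A dependency-free sketch; the two vectors are assumed to come from the embedding call above:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```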

👂 Hear — Speech Recognition (via mlx-whisper)

```python
import os
import mlx_whisper

# Load from ~/models/ (oMLX-compatible path); expand ~ to an absolute path
result = mlx_whisper.transcribe(
    "audio.wav",
    path_or_hf_repo=os.path.expanduser("~/models/Qwen3-ASR-1.7B-8bit")
)
print(result["text"])
```

👁️ Read — OCR (via mlx-vlm)

```python
import os
from mlx_vlm import load, generate
from mlx_vlm.utils import load_image

# Load from ~/models/ (oMLX-compatible path); expand ~ to an absolute path
model, processor = load(os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit"))
image = load_image("document.jpg")

output = generate(model, processor, image, "OCR:", max_tokens=512, temp=0.0)
print(output)
```

Service Management (oMLX)

```bash
# List discovered models
curl http://localhost:8000/v1/models

# Restart service
launchctl kickstart -k gui/$(id -u)/com.omlx-server

# Logs
tail -f /tmp/omlx-server.log
```
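The same model listing can be done from Python for automated health checks. This sketch assumes the server from the setup above is on port 8000 and returns the OpenAI-style shape `{"data": [{"id": "..."}]}`:

```python
import json
import urllib.request

def list_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]

def fetch_models(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Query the oMLX /v1/models endpoint and return the discovered model ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return list_model_ids(json.load(resp))
```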

Notes

  • All models stored in ~/models/ using oMLX-compatible structure (e.g., ~/models/Qwen3-14B-4bit/)
  • oMLX is used only for LLM/VLM (chat/completions)
  • Embedding/ASR/OCR are handled by Python libraries (mlx-lm, mlx-whisper, mlx-vlm)
  • Future-proof: When oMLX adds support for Embedding/ASR, we can switch instantly without re-downloading models

Project Structure

```
mlx-local-inference/
├── SKILL.md
├── README.md
├── README_CN.md
├── references/
└── ...
```

License

MIT

About

Full local AI inference stack on Apple Silicon via MLX — LLM, ASR, Embedding, OCR, TTS, Transcription
