AppleNeuralEngine-Kit is a comprehensive toolkit for running Large Language Models directly on Apple Silicon using the Neural Engine. It provides optimized conversion, efficient inference, and user-friendly interfaces for working with LLMs on macOS and iOS.
- Architecture-Aware Optimization: Automatically detects and optimizes models based on architecture (Llama, Qwen, Mistral, etc.)
- Interactive UI: Elegant SwiftUI chat interface with conversation history
- Visual Model Conversion: Convert models with a user-friendly macOS interface
- Real-Time Progress Tracking: Detailed conversion progress with ETA estimates
- Advanced Memory Management: Optimized multi-function chunks reduce memory usage by ~50%
- KV Cache Optimization: Specialized prefill models for fast token generation
- Python & Swift Integration: Seamless integration between conversion and inference
- Performance Analytics: Real-time metrics for token generation
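The prefill/decode split behind the KV cache optimization can be illustrated with a toy sketch: the prefill model populates the KV cache for the whole prompt in one pass, and the decode model then extends it one token at a time. This is a pure-Python stand-in for intuition only (the class and function names are hypothetical), not the actual Core ML models shipped by the toolkit.

```python
# Toy illustration of prefill vs. per-token decode against a KV cache.
class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def prefill(cache, prompt_tokens):
    # One batched pass over the prompt populates the cache up front.
    for tok in prompt_tokens:
        cache.append(("k", tok), ("v", tok))

def decode_step(cache, last_token):
    # Each generated token attends over everything cached so far,
    # then appends its own key/value pair.
    context_len = len(cache)
    cache.append(("k", last_token), ("v", last_token))
    return context_len

cache = KVCache()
prefill(cache, [1, 2, 3, 4])
print(len(cache))             # 4 entries after prefill
print(decode_step(cache, 5))  # decode attends over 4 cached tokens
print(len(cache))             # 5
```

Doing the prompt in a single prefill pass is what keeps time-to-first-token low; per-token decode then only pays for one position per step.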
```bash
# Clone the repository
git clone https://github.com/antmikinka/AppleNeuralEngine-Kit.git
cd AppleNeuralEngine-Kit

# Build the project
swift build

# Install Python dependencies for model conversion (optional)
cd scripts
pip install -r requirements.txt
```

Launch the chat UI:

```bash
swift run ANEChat
```

Run inference from the command line:

```bash
swift run ANEToolCLI --repo-id meta-llama/Llama-3.2-1B --input-text "Tell me about neural networks"
```

Convert a model:

```bash
# Using the Swift CLI
swift run ANEModelConverter convert-hf --model-id meta-llama/Llama-3.2-1B --output-dir ./models

# Using the Python script with detailed progress
python scripts/convert_hf_to_coreml.py --model_path meta-llama/Llama-3.2-1B --output_path ./models --verbose
```

AppleNeuralEngine-Kit uses a model conversion process that:
- Analyzes Model Architecture: Detects model type and optimizes accordingly
- Splits into Specialized Components: Separates embeddings, FFN, and LM head
- Optimizes for ANE: Applies architecture-specific optimizations for Apple Neural Engine
- Creates Multi-Function Chunks: Combines components to minimize memory usage
- Applies Quantization: Uses 4-6 bit LUT quantization for optimal size/quality balance
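The five steps above can be sketched end to end. This is an illustrative outline only, with hypothetical names; the real pipeline lives in `scripts/convert_hf_to_coreml.py` and uses coremltools for the actual Core ML export.

```python
# Sketch of the conversion plan: detect architecture, split components,
# group layers into multi-function chunks, pick a quantization scheme.
ARCH_KEYWORDS = {"llama": "llama", "qwen": "qwen", "mistral": "mistral",
                 "phi": "phi", "gemma": "gemma"}

def detect_architecture(model_id: str) -> str:
    """Step 1: guess the architecture family from the model id."""
    lowered = model_id.lower()
    for key, arch in ARCH_KEYWORDS.items():
        if key in lowered:
            return arch
    return "generic"

def plan_conversion(model_id: str, num_layers: int, chunk_size: int = 8) -> dict:
    """Steps 2-5: embeddings / FFN chunks / LM head, where each FFN chunk
    is a multi-function model (prefill + decode sharing one set of weights)."""
    ffn_chunks = [list(range(i, min(i + chunk_size, num_layers)))
                  for i in range(0, num_layers, chunk_size)]
    return {
        "architecture": detect_architecture(model_id),
        "components": (["embeddings"]
                       + [f"ffn_chunk_{i}" for i in range(len(ffn_chunks))]
                       + ["lm_head"]),
        "functions_per_chunk": ["prefill", "decode"],  # shared weights -> ~50% less memory
        "quantization": "lut4",  # 4-6 bit LUT quantization
    }

plan = plan_conversion("meta-llama/Llama-3.2-1B", num_layers=16)
print(plan["architecture"])  # llama
print(plan["components"])    # ['embeddings', 'ffn_chunk_0', 'ffn_chunk_1', 'lm_head']
```

Packing prefill and decode into one multi-function chunk is what avoids loading two copies of the weights, which is where the memory savings come from.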
| Model | Tokens/Sec | Memory Usage | Size |
|---|---|---|---|
| Llama-3.2-1B (M1) | 7.0 | ~1.2 GB | 600MB |
| Llama-3.2-1B (M3) | 13.9 | ~1.2 GB | 600MB |
| Llama-3.2-3B (M3) | 5.2 | ~3.5 GB | 1.8GB |
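A quick sanity check on the table: dividing on-disk size by parameter count gives the effective bits per weight, which should land in the 4-6 bit LUT range quoted above. The parameter counts below (1.24B and 3.21B) are Meta's published figures, not taken from this repo.

```python
# Effective bits-per-weight from on-disk size and parameter count.
def bits_per_weight(size_mb: float, params_billion: float) -> float:
    bits = size_mb * 1024 * 1024 * 8          # size in bits
    return bits / (params_billion * 1e9)       # bits per parameter

print(round(bits_per_weight(600, 1.24), 1))    # Llama-3.2-1B at 600MB -> ~4.1
print(round(bits_per_weight(1800, 3.21), 1))   # Llama-3.2-3B at 1.8GB -> ~4.7
```

Both figures fall inside the 4-6 bit window, consistent with the quantization scheme described in the conversion steps.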
- Usage Guide - Detailed usage instructions
- Architecture - System design overview
- Model Conversion - Converting models for ANE
- ANE Model Architecture - Technical details
- iOS Implementation - iOS deployment
- Contributing - Guidelines for contributors
- Changelog - Version history
- macOS 14 (Sonoma) or newer
- Apple Silicon Mac (M1/M2/M3 series)
- Swift 5.9 or newer
- Python 3.8+ with transformers and coremltools (for model conversion)
- Xcode Command Line Tools (for CoreML compilation)
- Llama Models (Llama 2, Llama 3, Llama 3.1, Llama 3.2)
- Mistral Models (Mistral 7B, Mixtral 8x7B)
- Qwen Models (Qwen 1.5, Qwen 2)
- QwQ Models (Qwen reasoning models)
- Phi Models (Phi-2, Phi-3)
- Gemma Models (Gemma 2B, 7B)
Contributions are welcome! Please read our Contributing Guidelines before submitting a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
This project builds upon:
- CoreML LLM CLI by Stephen Panaro
- ANEMLL for ANE-optimized conversion techniques
- LitGPT for model optimization techniques
- Apple Silicon 4-bit quantization for efficient model sizing
