Qwen3+BitNet compatibility + NPU dispatch for ternary matmuls #43
Closed
bong-water-water-bong wants to merge 123 commits into lemonade-sdk:main from bong-water-water-bong:main
Changes from 39 commits
Commits
098baf9
Fix model config initializer lifetimes
bong-water-water-bong f1b28f4
Fix dangling config reference causing SIGFPE on all models
bong-water-water-bong 325b9e8
Add BitNet 1.58-bit ternary model support
bong-water-water-bong b42d8fd
Clean up: mark unused out_features param in dequantize_bitnet_weight
bong-water-water-bong 12987b5
Support Bonsai 1-bit Qwen3 loading
bong-water-water-bong 1be3dca
Add BitNet dequantization to Llama loader
bong-water-water-bong f3ea92a
Support all 1.58-bit and 1-bit model variants (Falcon-E, Bonsai)
bong-water-water-bong b04281d
Fix code review: ensure hidden_act defaults to relu2 for BitNet models
bong-water-water-bong 25afb47
Auto-configure ROCm Tensile library paths
bong-water-water-bong ba75d26
Fix Lille-130m weight loading
bong-water-water-bong 16d9eb8
Auto-configure ROCm Tensile library paths + fix lille-130m weight prefix
bong-water-water-bong 4ebbd85
Fix OpenELM: use explicit num_query_heads/ffn_multipliers from config
bong-water-water-bong 44c902d
Fix quantized lm_head/embed_as_linear: use linear_forward in all models
bong-water-water-bong 26aad7e
Fix MXFP4 quantization support (issue #10)
bong-water-water-bong 59e8b78
Fix BitNet chat template capitalize filter and short-name model aliasing
bong-water-water-bong d14e188
BitNet: runtime quantized matmul (repack ternary → 2-bit affine) + gr…
bong-water-water-bong dba1381
BitNet: runtime quantized matmul — final improvements
bong-water-water-bong d0d33ad
BitNet: fall back to dequantize-at-load for correctness
bong-water-water-bong ef551f8
BitNet: dequantize-at-load with thorough analysis of quantized path
bong-water-water-bong 9bd0848
BitNet: fix 2-bit runtime repack layout
bong-water-water-bong 7b0c42a
Falcon-E: support inverse-scale BitLinear checkpoints
bong-water-water-bong fa6fc89
docs: universal HF loading path design spec
bong-water-water-bong 90f61a6
Universal HuggingFace loading path phase 1-3
bong-water-water-bong 72acd40
Universal HF loading: fix review findings
bong-water-water-bong a1445d1
Universal HF loading: auto-quantize, quantization_config, GGUF skeleton
bong-water-water-bong 9ab50ae
GGUF integration + auto-quantize verified
bong-water-water-bong b08a19c
Server + ModelManager: --auto-quantize and GGUF flags
bong-water-water-bong 20370ee
Server --auto-quantize + generic HF weight remapping
bong-water-water-bong 560c622
GGUF: full quant format support (Q4_0..Q6_K, K-quants)
bong-water-water-bong 049d031
PyTorch .bin → safetensors converter
bong-water-water-bong ec6896b
1-bit model support: sub-norm detection + key remapping
bong-water-water-bong 3bca870
Generic Llama fallback for unknown model types
bong-water-water-bong d03f974
1-bit activation quantization + weight pre-quantization
bong-water-water-bong a24022b
Architecture registration system + PyTorch trust_remote_code
bong-water-water-bong a9cd8f9
Edge case hardening: clear error messages for bad paths
bong-water-water-bong 7b0208b
Add NPU backend: IRON JIT GEMM on AMD XDNA NPU
bong-water-water-bong 77d3675
Qwen3+BitNet: per-projection norms + U8 ternary dequant
bong-water-water-bong 7aa38fb
Qwen3+BitNet: robustness fixes and comprehensive edge-case tests
bong-water-water-bong 20c386a
NPU dispatch: ternary GEMV backend for AMD XDNA NPU
bong-water-water-bong 62cc827
Merge branch 'feat/bitnet-support'
bong-water-water-bong 7cad8b5
Model registry expansion: 12 new models + 1-bit detection fixes
bong-water-water-bong 6b0fd28
Fix BitNetModel weight remapping: universal key mapping for all BitNe…
bong-water-water-bong a8e6588
Gemma 4 model implementation: architecture detection + weight loading
bong-water-water-bong 21d2437
Universal 1-bit model support: Gemma 4 fixed, all 13 models verified
bong-water-water-bong 6f37349
AQLM 1-bit support + universal 1-bit model registry
bong-water-water-bong 0739acd
Fix all remaining gaps: chat template, OLMo config, model registry
bong-water-water-bong 7af84b7
NPU opt-in via NPU_ENABLE=1 + Gemma 4 Unified alias
bong-water-water-bong a4ad7cd
MLX community architecture expansion: 8 new model types
bong-water-water-bong dfd01b0
Top 25 MLX community LLMs: all architectures covered
bong-water-water-bong fa4c892
Add build artifacts to .gitignore
bong-water-water-bong ae08166
Fix auto-quantize: skip embed/norm/lm_head weights
bong-water-water-bong ef74b86
Fix dangling config reference causing SIGFPE on all models
bong-water-water-bong 851bc2e
Add BitNet 1.58-bit ternary model support
bong-water-water-bong 9fe133d
Clean up: mark unused out_features param in dequantize_bitnet_weight
bong-water-water-bong e8e849e
Support Bonsai 1-bit Qwen3 loading
bong-water-water-bong 2dd9e2c
Add BitNet dequantization to Llama loader
bong-water-water-bong 5a453fa
Support all 1.58-bit and 1-bit model variants (Falcon-E, Bonsai)
bong-water-water-bong 8f8ed59
Fix code review: ensure hidden_act defaults to relu2 for BitNet models
bong-water-water-bong 4737232
Auto-configure ROCm Tensile library paths
bong-water-water-bong f408942
Fix Lille-130m weight loading
bong-water-water-bong 85ecaa6
Auto-configure ROCm Tensile library paths + fix lille-130m weight prefix
bong-water-water-bong 89172c9
Fix OpenELM: use explicit num_query_heads/ffn_multipliers from config
bong-water-water-bong d7c3f26
Fix quantized lm_head/embed_as_linear: use linear_forward in all models
bong-water-water-bong edc07f0
Fix MXFP4 quantization support (issue #10)
bong-water-water-bong ed54179
Fix BitNet chat template capitalize filter and short-name model aliasing
bong-water-water-bong 336eff1
BitNet: runtime quantized matmul (repack ternary → 2-bit affine) + gr…
bong-water-water-bong 8de196d
BitNet: runtime quantized matmul — final improvements
bong-water-water-bong dae526a
BitNet: fall back to dequantize-at-load for correctness
bong-water-water-bong a8dc753
BitNet: dequantize-at-load with thorough analysis of quantized path
bong-water-water-bong 859fe9d
BitNet: fix 2-bit runtime repack layout
bong-water-water-bong 6d059ba
Falcon-E: support inverse-scale BitLinear checkpoints
bong-water-water-bong 718e74a
docs: universal HF loading path design spec
bong-water-water-bong d36d9e2
Universal HuggingFace loading path phase 1-3
bong-water-water-bong 5dbcd3d
Universal HF loading: fix review findings
bong-water-water-bong 74295fe
Universal HF loading: auto-quantize, quantization_config, GGUF skeleton
bong-water-water-bong 9d54a3a
GGUF integration + auto-quantize verified
bong-water-water-bong 245f5a5
Server + ModelManager: --auto-quantize and GGUF flags
bong-water-water-bong c6a386d
Server --auto-quantize + generic HF weight remapping
bong-water-water-bong 0ca69e7
GGUF: full quant format support (Q4_0..Q6_K, K-quants)
bong-water-water-bong 33e6a1b
PyTorch .bin → safetensors converter
bong-water-water-bong 3c51336
1-bit model support: sub-norm detection + key remapping
bong-water-water-bong 4afee5b
Generic Llama fallback for unknown model types
bong-water-water-bong 838d684
1-bit activation quantization + weight pre-quantization
bong-water-water-bong e354c54
Architecture registration system + PyTorch trust_remote_code
bong-water-water-bong 80a9909
Edge case hardening: clear error messages for bad paths
bong-water-water-bong e56a0d0
Add NPU backend: IRON JIT GEMM on AMD XDNA NPU
bong-water-water-bong 5007a18
Qwen3+BitNet: per-projection norms + U8 ternary dequant
bong-water-water-bong bbe30ff
Qwen3+BitNet: robustness fixes and comprehensive edge-case tests
bong-water-water-bong fd20090
NPU dispatch: ternary GEMV backend for AMD XDNA NPU
bong-water-water-bong e0c126d
Model registry expansion: 12 new models + 1-bit detection fixes
bong-water-water-bong 4ea3edc
Fix BitNetModel weight remapping: universal key mapping for all BitNe…
bong-water-water-bong 3ed0401
Gemma 4 model implementation: architecture detection + weight loading
bong-water-water-bong e8ef988
Universal 1-bit model support: Gemma 4 fixed, all 13 models verified
bong-water-water-bong fcd3c4b
AQLM 1-bit support + universal 1-bit model registry
bong-water-water-bong 47a7d66
Fix all remaining gaps: chat template, OLMo config, model registry
bong-water-water-bong 84162ed
NPU opt-in via NPU_ENABLE=1 + Gemma 4 Unified alias
bong-water-water-bong bc8ece0
MLX community architecture expansion: 8 new model types
bong-water-water-bong e0fb903
Final polish: OLMo converter, Kimi K2 alias, NPU docs, README
bong-water-water-bong 29e697c
Merge detached branch: final polish
bong-water-water-bong 9fa3fd7
Gemma 4 full_attention: proper global head projections
bong-water-water-bong 52f0a1d
NPU ternary dispatch: wire up uint32→U8 repack + NPU kernel call
bong-water-water-bong 8f5fdf5
Robustness: fix bitnet-2b chat template + jinja file patching
bong-water-water-bong de42815
Complete NPU ternary dispatch: result returns as MLX array
bong-water-water-bong 8cd3f0d
Fix CI: add decode_arena stubs for ROCm build
bong-water-water-bong a04929e
Fix CI: stubs for all missing GPU primitives in NripeshN/mlx fork
bong-water-water-bong bd3b038
Add local CI + PR review to spare maintainer credits
bong-water-water-bong 46b4239
Fix ROCm CI: add gpu_stubs.h header, fix test_arena compile errors
bong-water-water-bong 6898dfd
Point to 1bit-systems/mlx fork (onebit.cpp branding)
bong-water-water-bong e0e2be1
Revert "Point to 1bit-systems/mlx fork (onebit.cpp branding)"
bong-water-water-bong 3b3e75b
ci: add diagnostic startup logging to server
bong-water-water-bong e0bd8cd
Merge PR #43: Qwen3+BitNet compatibility + NPU dispatch for ternary m…
bong-water-water-bong af31712
fix: OpenELM weight prefix mismatch causing NaN/segfault (issue #7)
bong-water-water-bong 6741033
Merge PR #43: Qwen3+BitNet compatibility + NPU dispatch for ternary m…
bong-water-water-bong 1ad1740
fix: disable Qwen3Next T=1 compiled decode path on ROCm (issue #8)
bong-water-water-bong 7dca7ae
docs: document issue #6 (Granite) requires upstream MLX fix
bong-water-water-bong 75bcd99
chore: ignore build directories
bong-water-water-bong 040cd34
fix: Qwen3Next disable T=1 compiled decode on ROCm (#8)
bong-water-water-bong 3c9e98a
fix: Gemma3 weight loading (sanitize discarding converted keys)
bong-water-water-bong 458a62d
fix: upgrade to ROCm 7.13 (TheRock) - fixes Granite strided_scan symbol
bong-water-water-bong 5b1cf20
chore: add PR-agent review + upstream issues watch
bong-water-water-bong dab6fa3
fix: use git diff --cached to detect newly created UPSTREAM_ISSUES.md
bong-water-water-bong 4c2cec0
chore: update upstream issue watch [skip ci]
github-actions[bot] e261a0d
ci: add Qodo merge review workflow
bong-water-water-bong
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,6 @@ | ||
| # Build | ||
| build/ | ||
| build_full/ | ||
| build-npu/ | ||
| cmake-build-*/ | ||
| out/ | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| #!/bin/bash | ||
| # Comprehensive benchmark across all fixed models on Strix Halo (gfx1151) | ||
| set -e | ||
|
|
||
| export ROCm_DIR=/tmp/rocm_sdk_core | ||
| source /tmp/rocm_venv/bin/activate | ||
| export LD_LIBRARY_PATH=$ROCm_DIR/lib:$LD_LIBRARY_PATH | ||
|
|
||
| CHAT=/home/bcloud/lemon-mlx-engine/build/chat | ||
| MAX_TOKENS=100 | ||
| PROMPT="What is the capital of France? Explain in one sentence." | ||
|
|
||
| echo "╔══════════════════════════════════════════════════════════════════════════╗" | ||
| echo "║ BENCHMARK: lemon-mlx-engine on Strix Halo (gfx1151) ║" | ||
| echo "║ Commit 26aad7e — All fixes applied ║" | ||
| echo "╚══════════════════════════════════════════════════════════════════════════╝" | ||
| echo "" | ||
| echo "Prompt: \"$PROMPT\"" | ||
| echo "Max tokens: $MAX_TOKENS, Temperature: 0.0 (greedy)" | ||
| echo "" | ||
|
|
||
| benchmark() { | ||
| local name="$1" | ||
| local model_path="$2" | ||
| shift 2 | ||
| # Keep the remaining flags as an array so each one is forwarded as its own argument | ||
| local extra_args=("$@") | ||
|
|
||
| echo "──────────────────────────────────────────────────────────────────────────" | ||
| echo "▶ $name" | ||
| echo " Path: $model_path" | ||
| [ "${#extra_args[@]}" -gt 0 ] && echo " Args: ${extra_args[*]}" | ||
| echo "" | ||
|
|
||
| local output | ||
| # Cap each run at 120 s; "|| true" keeps "set -e" from aborting the sweep when one model fails | ||
| output=$(echo "$PROMPT" | timeout 120 "$CHAT" "$model_path" --max-tokens "$MAX_TOKENS" --temperature 0.0 "${extra_args[@]}" 2>&1) || true | ||
|
|
||
| # Show only the load, timing, and error lines (first 10 matches) | ||
| echo "$output" | grep -E "(Loading model|bound HIP|Model loaded|Prompt:|Generation:|Assistant:|Error|error|Fatal|Segmentation|Unsupported)" | head -10 | ||
| echo "" | ||
| } | ||
|
|
||
| # 1. BASELINE: Llama-3.2-1B-Instruct-4bit | ||
| benchmark "Llama-3.2-1B-Instruct-4bit (baseline)" /home/bcloud/models/llama-1b | ||
|
|
||
| # 2. BitNet b1.58-2B-4T (1.58-bit ternary) | ||
| benchmark "BitNet b1.58-2B-4T (1.58-bit ternary)" /home/bcloud/models/bitnet-2b | ||
|
|
||
| # 3. Bonsai 1.7B (1-bit affine) | ||
| benchmark "Bonsai 1.7B (1-bit)" /home/bcloud/models/bonsai-1.7b | ||
|
|
||
| # 4. Bonsai 4B (1-bit affine) | ||
| benchmark "Bonsai 4B (1-bit)" /home/bcloud/models/bonsai-4b | ||
|
|
||
| # 5. Bonsai 8B (1-bit affine) — needs more VRAM | ||
| benchmark "Bonsai 8B (1-bit)" /home/bcloud/models/bonsai-8b | ||
|
|
||
| # 6. Qwen3-1.7B MXFP4 (issue #10 fix) | ||
| benchmark "Qwen3-1.7B-MLX-MXFP4 (MXFP4 quant)" /home/bcloud/models/qwen3-1.7b-mxfp4 | ||
|
|
||
| # 7. OpenELM-3B (issue #7 segfault fix) | ||
| benchmark "OpenELM-3B (issue #7 segfault fix)" /home/bcloud/models/openelm-3b --raw | ||
|
|
||
| # 8. Granite-4.0-H-Tiny (issue #6 crash fix) | ||
| benchmark "Granite-4.0-H-Tiny (issue #6 crash fix)" /home/bcloud/models/granite-4.0-h-tiny --raw | ||
|
|
||
| # 9. Lille-130M (issue #9 dequant fix) | ||
| benchmark "Lille-130M (issue #9 dequant fix)" /home/bcloud/models/lille-130m --raw | ||
|
|
||
| # 10. Falcon-E-3B (1.58-bit, inverse-scale BitLinear) | ||
| benchmark "Falcon-E-3B (1.58-bit, inverse-scale BitLinear)" /home/bcloud/models/falcon-e-3b | ||
|
|
||
| echo "════════════════════════════════════════════════════════════════════════════" | ||
| echo "Benchmark complete." |
Defining `MLX_BUILD_NPU` only on the `chat` executable does not affect the already-built `mlx-lm-llm`/`mlx-lm-common` object files where the inline `linear_forward()` calls are compiled from model `.cpp` files. In an NPU build those calls are compiled without the NPU branch, so ternary matmuls never attempt `npu_try_ternary`; propagate the definition/link dependency to the libraries that include `quantized_linear.h`.
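
The effect described in this comment can be reproduced in a single file. The sketch below is only an illustration, not code from the PR: `library_tu`, `chat_tu`, `linear_forward_backend`, `backends_diverge`, and `demo` are hypothetical names, and the two namespaces stand in for translation units compiled with and without `MLX_BUILD_NPU`. It shows that an `#ifdef` inside an inline function is resolved when each copy is compiled, so a definition that reaches only one target leaves every other copy on the default path.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Start from a known state so the demo behaves the same under any compiler flags.
#undef MLX_BUILD_NPU

// Hypothetical stand-in for a model .cpp file built into a library target
// that never received the MLX_BUILD_NPU definition.
namespace library_tu {
inline std::string linear_forward_backend() {
#ifdef MLX_BUILD_NPU
    return "npu";  // the NPU branch would be attempted here
#else
    return "gpu";  // default path: the NPU branch is compiled out
#endif
}
}  // namespace library_tu

// Simulate the definition being applied to the executable target only.
#define MLX_BUILD_NPU 1

// Hypothetical stand-in for code compiled as part of the executable.
namespace chat_tu {
inline std::string linear_forward_backend() {
#ifdef MLX_BUILD_NPU
    return "npu";
#else
    return "gpu";
#endif
}
}  // namespace chat_tu

// True when the two "translation units" picked different branches.
inline bool backends_diverge() {
    return library_tu::linear_forward_backend() != chat_tu::linear_forward_backend();
}

// Prints which branch each copy was compiled with.
inline void demo() {
    std::cout << "library objects: " << library_tu::linear_forward_backend() << "\n";
    std::cout << "chat executable: " << chat_tu::linear_forward_backend() << "\n";
    assert(backends_diverge());
}
```

Calling `demo()` from a `main` function prints `library objects: gpu` followed by `chat executable: npu`. In a real build the two copies would share one function name, so compiling them with different definitions would also break C++'s one-definition rule for inline functions. That is a further reason to apply the definition to every target that includes the header.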