Closed
Changes from 39 commits
123 commits
098baf9
Fix model config initializer lifetimes
bong-water-water-bong Jun 24, 2026
f1b28f4
Fix dangling config reference causing SIGFPE on all models
bong-water-water-bong Jun 24, 2026
325b9e8
Add BitNet 1.58-bit ternary model support
bong-water-water-bong Jun 24, 2026
b42d8fd
Clean up: mark unused out_features param in dequantize_bitnet_weight
bong-water-water-bong Jun 24, 2026
12987b5
Support Bonsai 1-bit Qwen3 loading
bong-water-water-bong Jun 25, 2026
1be3dca
Add BitNet dequantization to Llama loader
bong-water-water-bong Jun 25, 2026
f3ea92a
Support all 1.58-bit and 1-bit model variants (Falcon-E, Bonsai)
bong-water-water-bong Jun 25, 2026
b04281d
Fix code review: ensure hidden_act defaults to relu2 for BitNet models
bong-water-water-bong Jun 25, 2026
25afb47
Auto-configure ROCm Tensile library paths
bong-water-water-bong Jun 25, 2026
ba75d26
Fix Lille-130m weight loading
bong-water-water-bong Jun 25, 2026
16d9eb8
Auto-configure ROCm Tensile library paths + fix lille-130m weight prefix
bong-water-water-bong Jun 25, 2026
4ebbd85
Fix OpenELM: use explicit num_query_heads/ffn_multipliers from config
bong-water-water-bong Jun 25, 2026
44c902d
Fix quantized lm_head/embed_as_linear: use linear_forward in all models
bong-water-water-bong Jun 25, 2026
26aad7e
Fix MXFP4 quantization support (issue #10)
bong-water-water-bong Jun 25, 2026
59e8b78
Fix BitNet chat template capitalize filter and short-name model aliasing
bong-water-water-bong Jun 25, 2026
d14e188
BitNet: runtime quantized matmul (repack ternary → 2-bit affine) + gr…
bong-water-water-bong Jun 25, 2026
dba1381
BitNet: runtime quantized matmul — final improvements
bong-water-water-bong Jun 25, 2026
d0d33ad
BitNet: fall back to dequantize-at-load for correctness
bong-water-water-bong Jun 25, 2026
ef551f8
BitNet: dequantize-at-load with thorough analysis of quantized path
bong-water-water-bong Jun 25, 2026
9bd0848
BitNet: fix 2-bit runtime repack layout
bong-water-water-bong Jun 25, 2026
7b0c42a
Falcon-E: support inverse-scale BitLinear checkpoints
bong-water-water-bong Jun 25, 2026
fa6fc89
docs: universal HF loading path design spec
bong-water-water-bong Jun 26, 2026
90f61a6
Universal HuggingFace loading path phase 1-3
bong-water-water-bong Jun 26, 2026
72acd40
Universal HF loading: fix review findings
bong-water-water-bong Jun 26, 2026
a1445d1
Universal HF loading: auto-quantize, quantization_config, GGUF skeleton
bong-water-water-bong Jun 26, 2026
9ab50ae
GGUF integration + auto-quantize verified
bong-water-water-bong Jun 26, 2026
b08a19c
Server + ModelManager: --auto-quantize and GGUF flags
bong-water-water-bong Jun 26, 2026
20370ee
Server --auto-quantize + generic HF weight remapping
bong-water-water-bong Jun 26, 2026
560c622
GGUF: full quant format support (Q4_0..Q6_K, K-quants)
bong-water-water-bong Jun 26, 2026
049d031
PyTorch .bin → safetensors converter
bong-water-water-bong Jun 26, 2026
ec6896b
1-bit model support: sub-norm detection + key remapping
bong-water-water-bong Jun 26, 2026
3bca870
Generic Llama fallback for unknown model types
bong-water-water-bong Jun 26, 2026
d03f974
1-bit activation quantization + weight pre-quantization
bong-water-water-bong Jun 26, 2026
a24022b
Architecture registration system + PyTorch trust_remote_code
bong-water-water-bong Jun 26, 2026
a9cd8f9
Edge case hardening: clear error messages for bad paths
bong-water-water-bong Jun 26, 2026
7b0208b
Add NPU backend: IRON JIT GEMM on AMD XDNA NPU
bong-water-water-bong Jun 26, 2026
77d3675
Qwen3+BitNet: per-projection norms + U8 ternary dequant
bong-water-water-bong Jun 26, 2026
7aa38fb
Qwen3+BitNet: robustness fixes and comprehensive edge-case tests
bong-water-water-bong Jun 26, 2026
20c386a
NPU dispatch: ternary GEMV backend for AMD XDNA NPU
bong-water-water-bong Jun 26, 2026
62cc827
Merge branch 'feat/bitnet-support'
bong-water-water-bong Jun 26, 2026
7cad8b5
Model registry expansion: 12 new models + 1-bit detection fixes
bong-water-water-bong Jun 26, 2026
6b0fd28
Fix BitNetModel weight remapping: universal key mapping for all BitNe…
bong-water-water-bong Jun 26, 2026
a8e6588
Gemma 4 model implementation: architecture detection + weight loading
bong-water-water-bong Jun 26, 2026
21d2437
Universal 1-bit model support: Gemma 4 fixed, all 13 models verified
bong-water-water-bong Jun 26, 2026
6f37349
AQLM 1-bit support + universal 1-bit model registry
bong-water-water-bong Jun 26, 2026
0739acd
Fix all remaining gaps: chat template, OLMo config, model registry
bong-water-water-bong Jun 26, 2026
7af84b7
NPU opt-in via NPU_ENABLE=1 + Gemma 4 Unified alias
bong-water-water-bong Jun 26, 2026
a4ad7cd
MLX community architecture expansion: 8 new model types
bong-water-water-bong Jun 26, 2026
dfd01b0
Top 25 MLX community LLMs: all architectures covered
bong-water-water-bong Jun 26, 2026
fa4c892
Add build artifacts to .gitignore
bong-water-water-bong Jun 26, 2026
ae08166
Fix auto-quantize: skip embed/norm/lm_head weights
bong-water-water-bong Jun 26, 2026
ef74b86
Fix dangling config reference causing SIGFPE on all models
bong-water-water-bong Jun 24, 2026
851bc2e
Add BitNet 1.58-bit ternary model support
bong-water-water-bong Jun 24, 2026
9fe133d
Clean up: mark unused out_features param in dequantize_bitnet_weight
bong-water-water-bong Jun 24, 2026
e8e849e
Support Bonsai 1-bit Qwen3 loading
bong-water-water-bong Jun 25, 2026
2dd9e2c
Add BitNet dequantization to Llama loader
bong-water-water-bong Jun 25, 2026
5a453fa
Support all 1.58-bit and 1-bit model variants (Falcon-E, Bonsai)
bong-water-water-bong Jun 25, 2026
8f8ed59
Fix code review: ensure hidden_act defaults to relu2 for BitNet models
bong-water-water-bong Jun 25, 2026
4737232
Auto-configure ROCm Tensile library paths
bong-water-water-bong Jun 25, 2026
f408942
Fix Lille-130m weight loading
bong-water-water-bong Jun 25, 2026
85ecaa6
Auto-configure ROCm Tensile library paths + fix lille-130m weight prefix
bong-water-water-bong Jun 25, 2026
89172c9
Fix OpenELM: use explicit num_query_heads/ffn_multipliers from config
bong-water-water-bong Jun 25, 2026
d7c3f26
Fix quantized lm_head/embed_as_linear: use linear_forward in all models
bong-water-water-bong Jun 25, 2026
edc07f0
Fix MXFP4 quantization support (issue #10)
bong-water-water-bong Jun 25, 2026
ed54179
Fix BitNet chat template capitalize filter and short-name model aliasing
bong-water-water-bong Jun 25, 2026
336eff1
BitNet: runtime quantized matmul (repack ternary → 2-bit affine) + gr…
bong-water-water-bong Jun 25, 2026
8de196d
BitNet: runtime quantized matmul — final improvements
bong-water-water-bong Jun 25, 2026
dae526a
BitNet: fall back to dequantize-at-load for correctness
bong-water-water-bong Jun 25, 2026
a8dc753
BitNet: dequantize-at-load with thorough analysis of quantized path
bong-water-water-bong Jun 25, 2026
859fe9d
BitNet: fix 2-bit runtime repack layout
bong-water-water-bong Jun 25, 2026
6d059ba
Falcon-E: support inverse-scale BitLinear checkpoints
bong-water-water-bong Jun 25, 2026
718e74a
docs: universal HF loading path design spec
bong-water-water-bong Jun 26, 2026
d36d9e2
Universal HuggingFace loading path phase 1-3
bong-water-water-bong Jun 26, 2026
5dbcd3d
Universal HF loading: fix review findings
bong-water-water-bong Jun 26, 2026
74295fe
Universal HF loading: auto-quantize, quantization_config, GGUF skeleton
bong-water-water-bong Jun 26, 2026
9d54a3a
GGUF integration + auto-quantize verified
bong-water-water-bong Jun 26, 2026
245f5a5
Server + ModelManager: --auto-quantize and GGUF flags
bong-water-water-bong Jun 26, 2026
c6a386d
Server --auto-quantize + generic HF weight remapping
bong-water-water-bong Jun 26, 2026
0ca69e7
GGUF: full quant format support (Q4_0..Q6_K, K-quants)
bong-water-water-bong Jun 26, 2026
33e6a1b
PyTorch .bin → safetensors converter
bong-water-water-bong Jun 26, 2026
3c51336
1-bit model support: sub-norm detection + key remapping
bong-water-water-bong Jun 26, 2026
4afee5b
Generic Llama fallback for unknown model types
bong-water-water-bong Jun 26, 2026
838d684
1-bit activation quantization + weight pre-quantization
bong-water-water-bong Jun 26, 2026
e354c54
Architecture registration system + PyTorch trust_remote_code
bong-water-water-bong Jun 26, 2026
80a9909
Edge case hardening: clear error messages for bad paths
bong-water-water-bong Jun 26, 2026
e56a0d0
Add NPU backend: IRON JIT GEMM on AMD XDNA NPU
bong-water-water-bong Jun 26, 2026
5007a18
Qwen3+BitNet: per-projection norms + U8 ternary dequant
bong-water-water-bong Jun 26, 2026
bbe30ff
Qwen3+BitNet: robustness fixes and comprehensive edge-case tests
bong-water-water-bong Jun 26, 2026
fd20090
NPU dispatch: ternary GEMV backend for AMD XDNA NPU
bong-water-water-bong Jun 26, 2026
e0c126d
Model registry expansion: 12 new models + 1-bit detection fixes
bong-water-water-bong Jun 26, 2026
4ea3edc
Fix BitNetModel weight remapping: universal key mapping for all BitNe…
bong-water-water-bong Jun 26, 2026
3ed0401
Gemma 4 model implementation: architecture detection + weight loading
bong-water-water-bong Jun 26, 2026
e8ef988
Universal 1-bit model support: Gemma 4 fixed, all 13 models verified
bong-water-water-bong Jun 26, 2026
fcd3c4b
AQLM 1-bit support + universal 1-bit model registry
bong-water-water-bong Jun 26, 2026
47a7d66
Fix all remaining gaps: chat template, OLMo config, model registry
bong-water-water-bong Jun 26, 2026
84162ed
NPU opt-in via NPU_ENABLE=1 + Gemma 4 Unified alias
bong-water-water-bong Jun 26, 2026
bc8ece0
MLX community architecture expansion: 8 new model types
bong-water-water-bong Jun 26, 2026
e0fb903
Final polish: OLMo converter, Kimi K2 alias, NPU docs, README
bong-water-water-bong Jun 26, 2026
29e697c
Merge detached branch: final polish
bong-water-water-bong Jun 26, 2026
9fa3fd7
Gemma 4 full_attention: proper global head projections
bong-water-water-bong Jun 26, 2026
52f0a1d
NPU ternary dispatch: wire up uint32→U8 repack + NPU kernel call
bong-water-water-bong Jun 26, 2026
8f5fdf5
Robustness: fix bitnet-2b chat template + jinja file patching
bong-water-water-bong Jun 26, 2026
de42815
Complete NPU ternary dispatch: result returns as MLX array
bong-water-water-bong Jun 26, 2026
8cd3f0d
Fix CI: add decode_arena stubs for ROCm build
bong-water-water-bong Jun 26, 2026
a04929e
Fix CI: stubs for all missing GPU primitives in NripeshN/mlx fork
bong-water-water-bong Jun 26, 2026
bd3b038
Add local CI + PR review to spare maintainer credits
bong-water-water-bong Jun 26, 2026
46b4239
Fix ROCm CI: add gpu_stubs.h header, fix test_arena compile errors
bong-water-water-bong Jun 26, 2026
6898dfd
Point to 1bit-systems/mlx fork (onebit.cpp branding)
bong-water-water-bong Jun 26, 2026
e0e2be1
Revert "Point to 1bit-systems/mlx fork (onebit.cpp branding)"
bong-water-water-bong Jun 26, 2026
3b3e75b
ci: add diagnostic startup logging to server
bong-water-water-bong Jun 27, 2026
e0bd8cd
Merge PR #43: Qwen3+BitNet compatibility + NPU dispatch for ternary m…
bong-water-water-bong Jun 27, 2026
af31712
fix: OpenELM weight prefix mismatch causing NaN/segfault (issue #7)
bong-water-water-bong Jun 27, 2026
6741033
Merge PR #43: Qwen3+BitNet compatibility + NPU dispatch for ternary m…
bong-water-water-bong Jun 27, 2026
1ad1740
fix: disable Qwen3Next T=1 compiled decode path on ROCm (issue #8)
bong-water-water-bong Jun 27, 2026
7dca7ae
docs: document issue #6 (Granite) requires upstream MLX fix
bong-water-water-bong Jun 27, 2026
75bcd99
chore: ignore build directories
bong-water-water-bong Jun 27, 2026
040cd34
fix: Qwen3Next disable T=1 compiled decode on ROCm (#8)
bong-water-water-bong Jun 27, 2026
3c9e98a
fix: Gemma3 weight loading (sanitize discarding converted keys)
bong-water-water-bong Jun 27, 2026
458a62d
fix: upgrade to ROCm 7.13 (TheRock) - fixes Granite strided_scan symbol
bong-water-water-bong Jun 27, 2026
5b1cf20
chore: add PR-agent review + upstream issues watch
bong-water-water-bong Jun 27, 2026
dab6fa3
fix: use git diff --cached to detect newly created UPSTREAM_ISSUES.md
bong-water-water-bong Jun 27, 2026
4c2cec0
chore: update upstream issue watch [skip ci]
github-actions[bot] Jun 27, 2026
e261a0d
ci: add Qodo merge review workflow
bong-water-water-bong Jun 27, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
# Build
build/
build_full/
build-npu/
cmake-build-*/
out/
123 changes: 121 additions & 2 deletions CMakeLists.txt
@@ -34,10 +34,55 @@ FetchContent_Declare(
mlx
# Repo + branch — always build against the latest ROCm backend work.
GIT_REPOSITORY https://github.com/NripeshN/mlx.git
-GIT_TAG rocm-support
+GIT_TAG 6abf0b7e # rocm-support (pinned working ExecUpdate commit)
GIT_SHALLOW FALSE
)
-FetchContent_MakeAvailable(mlx)
# Fetch MLX, apply local patches, then add it. Patching must happen before
# add_subdirectory()/FetchContent_MakeAvailable so CMakeLists.txt changes (for
# example removing unsupported ROCm clang flags) affect generated build files.
FetchContent_GetProperties(mlx)
if(NOT mlx_POPULATED)
FetchContent_Populate(mlx)
endif()
set(MLX_SOURCE_DIR "${mlx_SOURCE_DIR}")

if(MLX_BUILD_ROCM AND MLX_SOURCE_DIR AND EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/patches/mlx-rocm-build.patch")
execute_process(
COMMAND git apply --check "${CMAKE_CURRENT_SOURCE_DIR}/patches/mlx-rocm-build.patch"
WORKING_DIRECTORY "${MLX_SOURCE_DIR}"
RESULT_VARIABLE PATCH_CHECK_RESULT
ERROR_QUIET
OUTPUT_QUIET
)
if(PATCH_CHECK_RESULT EQUAL 0)
message(STATUS "Applying mlx-rocm-build.patch...")
execute_process(
COMMAND git apply "${CMAKE_CURRENT_SOURCE_DIR}/patches/mlx-rocm-build.patch"
WORKING_DIRECTORY "${MLX_SOURCE_DIR}"
RESULT_VARIABLE PATCH_RESULT
)
if(PATCH_RESULT EQUAL 0)
message(STATUS "Patch applied successfully")
else()
message(FATAL_ERROR "Failed to apply mlx-rocm-build.patch")
endif()
else()
execute_process(
COMMAND git apply --reverse --check "${CMAKE_CURRENT_SOURCE_DIR}/patches/mlx-rocm-build.patch"
WORKING_DIRECTORY "${MLX_SOURCE_DIR}"
RESULT_VARIABLE PATCH_REVERSE_CHECK_RESULT
ERROR_QUIET
OUTPUT_QUIET
)
if(PATCH_REVERSE_CHECK_RESULT EQUAL 0)
message(STATUS "mlx-rocm-build.patch already applied, skipping")
else()
message(FATAL_ERROR "mlx-rocm-build.patch does not apply to fetched MLX source")
endif()
endif()
endif()

add_subdirectory("${mlx_SOURCE_DIR}" "${mlx_BINARY_DIR}")

# nlohmann/json (MLX may already provide this)
if(NOT TARGET nlohmann_json::nlohmann_json)
@@ -113,6 +158,8 @@ add_library(mlx-lm-common
src/common/base_config.cpp
src/common/hub_api.cpp
src/common/safetensors.cpp
src/common/gguf_loader.cpp
src/common/registry.cpp
src/common/switch_layers.cpp
src/common/ssm_utils.cpp
src/common/rope_utils.cpp
@@ -136,6 +183,11 @@ target_link_libraries(mlx-lm-common PUBLIC
tokenizers_cpp
)
target_include_directories(mlx-lm-common PUBLIC ${minja_SOURCE_DIR}/include)
# Patched minja headers (capitalize filter, etc.) take precedence over the
# upstream minja version fetched by FetchContent.
target_include_directories(mlx-lm-common BEFORE PRIVATE
${CMAKE_CURRENT_SOURCE_DIR}/src/common/patched
)

# Propagate ROCm flag as compile definition so C++ code can use #if defined(MLX_BUILD_ROCM)
if(MLX_BUILD_ROCM)
@@ -189,6 +241,7 @@ add_library(mlx-lm-llm
src/llm/models/lfm2.cpp
src/llm/models/nemotron_h.cpp
src/llm/models/granite_moe_hybrid.cpp
src/llm/models/bitnet.cpp
)
target_link_libraries(mlx-lm-llm PUBLIC mlx-lm-common)

@@ -224,6 +277,63 @@ target_link_libraries(mlx-lm-vlm PUBLIC mlx-lm-common)
# stb include path (header-only)
target_include_directories(mlx-lm-common PUBLIC ${stb_SOURCE_DIR})

# NPU backend (optional, requires XRT)
# NPU backend (optional, requires IRON Python stack + XRT)
if(MLX_LM_BUILD_NPU)
# The NPU backend uses the IRON JIT via Python subprocess.
# Install IRON: pip install mlir-aie

# MLIR-AIE venv path for IRON JIT
set(NPU_VENV_DIR "${CMAKE_SOURCE_DIR}/../mlir-aie/.venv")

# Copy JIT helpers to build directory
configure_file(
src/npu/kernels/ternary_gemv.py
${CMAKE_BINARY_DIR}/bin/ternary_gemv.py
COPYONLY
)

# Find LLVM-AIE compiler
find_program(AIE2_CLANG clang++
PATHS "${NPU_VENV_DIR}/lib/python3.14/site-packages/llvm-aie/bin"
NO_DEFAULT_PATH
)
if(NOT AIE2_CLANG)
message(STATUS "NPU: LLVM-AIE clang not found, kernel will be JIT-compiled at runtime")
else()
# Compile the AIE kernel at build time
set(AIE_KERNEL_SRC "${CMAKE_SOURCE_DIR}/src/npu/kernels/ternary_gemv_aie.cpp")
set(AIE_KERNEL_OBJ "${CMAKE_BINARY_DIR}/kernels/ternary_gemv_aie.o")
add_custom_command(
OUTPUT ${AIE_KERNEL_OBJ}
COMMAND ${CMAKE_COMMAND} -E make_directory "${CMAKE_BINARY_DIR}/kernels"
COMMAND ${AIE2_CLANG} --target=aie2-none-unknown-elf -O2 -std=c++20
-c "${AIE_KERNEL_SRC}" -o "${AIE_KERNEL_OBJ}"
DEPENDS ${AIE_KERNEL_SRC}
COMMENT "Compiling AIE2 kernel: ternary_gemv_aie"
)
add_custom_target(aie_kernels ALL DEPENDS ${AIE_KERNEL_OBJ})
endif()

add_library(mlx-lm-npu STATIC
src/npu/npu_backend.cpp
)
target_include_directories(mlx-lm-npu PUBLIC
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
)
target_compile_definitions(mlx-lm-npu PUBLIC
MLX_BUILD_NPU
NPU_INSTALL_DIR="${CMAKE_BINARY_DIR}"
)
if(AIE2_CLANG)
add_dependencies(mlx-lm-npu aie_kernels)
endif()
message(STATUS "NPU backend enabled (JIT path)")
if(AIE2_CLANG)
message(STATUS " AIE2 compiler: ${AIE2_CLANG}")
endif()
endif()

if(MLX_LM_BUILD_EXAMPLES)
add_executable(chat examples/chat.cpp)
target_link_libraries(chat PRIVATE mlx-lm-llm mlx-lm-common mlx-lm-core)
@@ -232,6 +342,10 @@ if(MLX_LM_BUILD_EXAMPLES)
target_compile_definitions(chat PRIVATE MLX_BUILD_ROCM)
target_link_libraries(chat PRIVATE hip::host)
endif()
if(MLX_LM_BUILD_NPU AND TARGET mlx-lm-npu)
target_link_libraries(chat PRIVATE mlx-lm-npu)
target_compile_definitions(chat PRIVATE MLX_BUILD_NPU)
Comment on lines +347 to +349

P1: Compile model libraries with NPU dispatch enabled

Defining MLX_BUILD_NPU only on the chat executable does not affect the already-built mlx-lm-llm/mlx-lm-common object files where the inline linear_forward() calls are compiled from model .cpp files. In an NPU build those calls are compiled without the NPU branch, so ternary matmuls never attempt npu_try_ternary; propagate the definition/link dependency to the libraries that include quantized_linear.h.
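
One possible way to do this, as a sketch rather than a tested change: it assumes `mlx-lm-npu` keeps `MLX_BUILD_NPU` as a `PUBLIC` compile definition (as the NPU block in this diff declares) and that `mlx-lm-llm` and `mlx-lm-vlm` keep linking `mlx-lm-common` with `PUBLIC`. The placement after the NPU block is a suggestion and is not part of the PR.

```cmake
# Sketch only: place after the `if(MLX_LM_BUILD_NPU) ... endif()` block so the
# TARGET check below can see mlx-lm-npu.
if(MLX_LM_BUILD_NPU AND TARGET mlx-lm-npu)
  # mlx-lm-npu declares MLX_BUILD_NPU as a PUBLIC compile definition, so a
  # PUBLIC link adds it to mlx-lm-common's usage requirements. Targets that
  # link mlx-lm-common PUBLIC (mlx-lm-llm, mlx-lm-vlm) then inherit both the
  # definition and the link dependency when their model .cpp files compile.
  target_link_libraries(mlx-lm-common PUBLIC mlx-lm-npu)
endif()
```

With a link like this in place, the per-executable `target_compile_definitions(chat PRIVATE MLX_BUILD_NPU)` line would become redundant, since `chat` already links `mlx-lm-common`.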


endif()

add_executable(diagnose examples/diagnose.cpp)
target_link_libraries(diagnose PRIVATE mlx-lm-llm mlx-lm-common mlx-lm-core)
@@ -268,6 +382,11 @@ if(MLX_LM_BUILD_EXAMPLES)
add_executable(test_sdpa_ref examples/test_sdpa_ref.cpp)
target_link_libraries(test_sdpa_ref PRIVATE mlx)

if(MLX_LM_BUILD_NPU AND TARGET mlx-lm-npu)
add_executable(test_npu examples/test_npu.cpp)
target_link_libraries(test_npu PRIVATE mlx-lm-npu)
endif()

add_executable(server
examples/server.cpp
src/common/server.cpp
72 changes: 72 additions & 0 deletions benchmark_all.sh
@@ -0,0 +1,72 @@
#!/bin/bash
# Comprehensive benchmark across all fixed models on Strix Halo (gfx1151)
set -e

export ROCm_DIR=/tmp/rocm_sdk_core
source /tmp/rocm_venv/bin/activate
export LD_LIBRARY_PATH=$ROCm_DIR/lib:$LD_LIBRARY_PATH

CHAT=/home/bcloud/lemon-mlx-engine/build/chat
MAX_TOKENS=100
PROMPT="What is the capital of France? Explain in one sentence."

echo "╔══════════════════════════════════════════════════════════════════════════╗"
echo "║ BENCHMARK: lemon-mlx-engine on Strix Halo (gfx1151) ║"
echo "║ Commit 26aad7e — All fixes applied ║"
echo "╚══════════════════════════════════════════════════════════════════════════╝"
echo ""
echo "Prompt: \"$PROMPT\""
echo "Max tokens: $MAX_TOKENS, Temperature: 0.0 (greedy)"
echo ""

benchmark() {
local name="$1"
local model_path="$2"
shift 2
local extra_args="$*"

echo "──────────────────────────────────────────────────────────────────────────"
echo "▶ $name"
echo " Path: $model_path"
[ -n "$extra_args" ] && echo " Args: $extra_args"
echo ""

local output
output=$(echo "$PROMPT" | timeout 120 $CHAT "$model_path" --max-tokens $MAX_TOKENS --temperature 0.0 $extra_args 2>&1) || true

echo "$output" | grep -E "(Loading model|bound HIP|Model loaded|Prompt:|Generation:|Assistant:|Error|error|Fatal|Segmentation|Unsupported)" | head -10
echo ""
}

# 1. BASELINE: Llama-3.2-1B-Instruct-4bit
benchmark "Llama-3.2-1B-Instruct-4bit (baseline)" /home/bcloud/models/llama-1b

# 2. BitNet b1.58-2B-4T (1.58-bit ternary)
benchmark "BitNet b1.58-2B-4T (1.58-bit ternary)" /home/bcloud/models/bitnet-2b

# 3. Bonsai 1.7B (1-bit affine)
benchmark "Bonsai 1.7B (1-bit)" /home/bcloud/models/bonsai-1.7b

# 4. Bonsai 4B (1-bit affine)
benchmark "Bonsai 4B (1-bit)" /home/bcloud/models/bonsai-4b

# 5. Bonsai 8B (1-bit affine) — needs more VRAM
benchmark "Bonsai 8B (1-bit)" /home/bcloud/models/bonsai-8b

# 6. Qwen3-1.7B MXFP4 (issue #10 fix)
benchmark "Qwen3-1.7B-MLX-MXFP4 (MXFP4 quant)" /home/bcloud/models/qwen3-1.7b-mxfp4

# 7. OpenELM-3B (issue #7 segfault fix)
benchmark "OpenELM-3B (issue #7 segfault fix)" /home/bcloud/models/openelm-3b --raw

# 8. Granite-4.0-H-Tiny (issue #6 crash fix)
benchmark "Granite-4.0-H-Tiny (issue #6 crash fix)" /home/bcloud/models/granite-4.0-h-tiny --raw

# 9. Lille-130M (issue #9 dequant fix)
benchmark "Lille-130M (issue #9 dequant fix)" /home/bcloud/models/lille-130m --raw

# 10. Falcon-E-3B (1.58-bit, inverse-scale BitLinear)
benchmark "Falcon-E-3B (1.58-bit, inverse-scale BitLinear)" /home/bcloud/models/falcon-e-3b

echo "════════════════════════════════════════════════════════════════════════════"
echo "Benchmark complete."