
bump mlx-lm to 0.31.3 and mlx-vlm to latest main#675

Open
Chedrian07 wants to merge 1 commit into jundot:main from Chedrian07:chore/bump-mlx-lm-vlm

Conversation

@Chedrian07

Summary

  • Bump mlx-lm pin from dcbf6e3 to d9c63ff (v0.31.3 patch bump, #1124)
  • Bump mlx-vlm pin from 23e1dff to 3472132 (5 upstream fixes, non-breaking)
  • Keep pyproject.toml dependencies, [tool.uv] override-dependencies, and packaging/venvstacks.toml in lockstep
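As a sketch of what "in lockstep" means here, the same two git pins would appear verbatim in each location (commit hashes are the ones from this PR; the repository URLs and exact table layout are assumptions, not taken from the diff):

```toml
# pyproject.toml — illustrative; both pin locations must carry identical commits
[project]
dependencies = [
    "mlx-lm @ git+https://github.com/ml-explore/mlx-lm@d9c63ff",
    "mlx-vlm @ git+https://github.com/Blaizzy/mlx-vlm@3472132",
]

[tool.uv]
override-dependencies = [
    "mlx-lm @ git+https://github.com/ml-explore/mlx-lm@d9c63ff",
    "mlx-vlm @ git+https://github.com/Blaizzy/mlx-vlm@3472132",
]
```

packaging/venvstacks.toml would repeat the same pins; drift between any of the three is exactly what this PR's lockstep rule prevents.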


Both compare ranges (dcbf6e3..d9c63ff and 23e1dff..3472132) contain only bug fixes and a patch version bump. No public API, signature, or required-argument changes — so existing oMLX adapters (VLMModelAdapter, BatchedEngine, scheduler integration with BatchGenerator, Gemma 4 tool-call paths) should be fully compatible.

Rationale for picking HEAD of both main branches:

  • mlx-vlm@3472132 includes the Gemma 4 hyphenated-tool-name fix, which improves OpenAI-spec compliance for oMLX's Gemma 4 tool-calling path.
  • mlx-vlm@3472132 also fixes a TurboQuant kernel race that can affect quantized VLMs under oMLX continuous batching.
  • mlx-lm@d9c63ff is just a patch-version bump; staying in lockstep keeps the override free of surprises.

Test plan

  • pip install -e . resolves against the new git pins in a clean venv
  • pytest -m "not slow" passes
  • Smoke test: load a Gemma 4 VLM + hyphenated tool name → parsed correctly
  • Smoke test: load a quantized VLM under --max-concurrent-requests 8 → no TurboQuant race
  • packaging/build.py --skip-venv still builds the app bundle without resolver errors
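The hyphenated-tool-name smoke test targets a common parser bug class: a tool-name pattern like `\w+` silently truncates names such as `get-weather`, while the OpenAI spec allows `-` in function names. A minimal standalone illustration of that failure mode (not the actual mlx-vlm parser code):

```python
import re

# Naive pattern: \w matches [A-Za-z0-9_] only, so a hyphen ends the match.
NAIVE = re.compile(r"^\w+")
# Hyphen-aware pattern, in line with OpenAI's allowed name charset [a-zA-Z0-9_-].
FIXED = re.compile(r"^[A-Za-z0-9_-]+")

def parse_tool_name(raw: str, pattern: re.Pattern) -> str:
    """Extract the leading tool name from a raw tool-call string."""
    m = pattern.match(raw)
    return m.group(0) if m else ""

print(parse_tool_name("get-weather(city='Seoul')", NAIVE))  # -> get
print(parse_tool_name("get-weather(city='Seoul')", FIXED))  # -> get-weather
```

The smoke test in the plan above checks the end-to-end version of this: a Gemma 4 tool call with a hyphenated name must round-trip through the parser without truncation.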

mlx-lm: dcbf6e3 -> d9c63ff (v0.31.3 patch bump, #1124)

mlx-vlm: 23e1dff -> 3472132
  - Fix Gemma 4 tool parser to accept hyphenated function names (#963)
  - Fix Gemma 4 audio: mel preprocessing, weight loading, feature extractor (#931)
  - Fix race condition in TurboQuant fused fast-quantize kernels (#967)
  - Fix Gemma 4 quantized per-layer projection loading (#935)
  - Snapshot cache.offset to prevent alias mutation under batched caches (#966)

Both ranges contain only bug fixes and the patch version bump; no API
or signature changes. Updates pyproject.toml dependencies, the uv
override-dependencies, and packaging/venvstacks.toml in lockstep.
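The cache.offset snapshot fix (#966) guards against a classic aliasing pitfall: reading a mutable cache's field later, instead of copying its value at capture time. A simplified standalone illustration (the real mlx-vlm cache objects are more involved; this class is hypothetical):

```python
class KVCache:
    """Toy stand-in for a per-request KV cache (illustration only)."""
    def __init__(self) -> None:
        self.offset = 0

cache = KVCache()

# Buggy pattern: keep a reference to the cache and read .offset later —
# any batch step that advances the shared cache mutates what we observe.
aliased = cache

# Fixed pattern: snapshot the integer value at capture time.
snapshot = cache.offset

cache.offset += 8  # another request in the batch advances the shared cache

assert snapshot == 0        # the snapshot is unaffected by the mutation
assert aliased.offset == 8  # the alias observes the mutation
```

Snapshotting turns a read that races with batched updates into a stable value, which is the behaviour the fix restores under batched caches.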
@kyr0

kyr0 commented Apr 10, 2026

I'm encountering issues with thread-local vs. thread-global behaviour from [email protected] onward with Gemma 4. Is this PR well-tested? Do we have e2e integration tests?

I'm not going into details here because I'm working on custom mlx-lm and mlx forks. I'm having a hard time pinpointing the exact root cause, but mlx 0.31.2 seems to introduce changes to thread locality that oMLX isn't prepared for.

Ref tracking: ml-explore/mlx#3078

