Add support for LiquidAI's LFM2.5-VL vision-language model #9729
Conversation
LFM2.5-VL requires transformers>=4.58.0 or a specific commit (3c2517727ce28a30f5044e01663ee204deb1cdbe) because it depends on the new TokenizersBackend class, which is not available in transformers 4.57.1. This PR adds a version check in patcher.py that raises an informative error with installation instructions when the model is loaded with an incompatible transformers version.
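For illustration, such a guard could be written as below. The helper name is hypothetical; `require_version` is the standard transformers utility, and the exact message in patcher.py may differ.

```python
# Hypothetical sketch of the version guard described above. The helper name is
# illustrative; require_version raises an ImportError with the hint appended
# when the installed version does not satisfy the requirement.
from transformers.utils.versions import require_version


def _check_lfm2_vl_version() -> None:
    require_version(
        "transformers>=4.58.0",
        "LFM2.5-VL requires the TokenizersBackend class. To fix: "
        "pip install 'transformers>=4.58.0', or install the pinned commit: "
        "pip install git+https://github.com/huggingface/transformers.git@3c2517727ce28a30f5044e01663ee204deb1cdbe",
    )
```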
Fix an infinite-loop bug in LFMVLPlugin.process_messages() that occurred when expanding image tokens. Because both IMAGE_PLACEHOLDER and self.image_token were `<image>`, each replacement inserted tokens that the loop then matched again as fresh placeholders to expand.
Solution: use a two-phase replacement pattern (matching Qwen2VLPlugin), as sketched below:
1. First replace `<image>` → `{{image}}` × N (intermediate placeholder)
2. After the loop, replace `{{image}}` → `<image>` (actual token)
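A minimal sketch of the pattern with the surrounding plugin machinery stripped away (function and variable names are illustrative, not the plugin's actual code):

```python
# Minimal sketch of the two-phase expansion. IMAGE_PLACEHOLDER equals the real
# image token here, which is exactly what made single-phase replacement loop.
IMAGE_PLACEHOLDER = "<image>"


def expand_image_tokens(content: str, image_seqlens: list[int], image_token: str = "<image>") -> str:
    for seqlen in image_seqlens:
        # Phase 1: expand one placeholder at a time into an intermediate marker
        # that later iterations cannot mistake for an unexpanded placeholder.
        content = content.replace(IMAGE_PLACEHOLDER, "{{image}}" * seqlen, 1)

    # Phase 2: swap the intermediate marker back to the actual image token.
    return content.replace("{{image}}", image_token)


# e.g. expand_image_tokens("Describe: <image>", [3])
# -> "Describe: <image><image><image>"
```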
Also adds a proper _get_mm_inputs override that runs images through the LFM2.5-VL image processor and retrieves the spatial_shapes tensor used to compute a dynamic token count per image (see the sketch below).
Token calculation: (spatial_h × spatial_w) / (downsample_factor²)
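Under that description, the per-image counts could be derived roughly as follows; the attribute names (spatial_shapes, downsample_factor) follow the PR text and may not match the real processor API exactly:

```python
# Hedged sketch of the dynamic token count per image.
import torch


def image_seqlens(spatial_shapes: torch.Tensor, downsample_factor: int) -> list[int]:
    # spatial_shapes: (num_images, 2) tensor of per-image (height, width) patch grids.
    return [int(h * w) // downsample_factor**2 for h, w in spatial_shapes.tolist()]
```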
Rename the template and plugin from `lfm_vl` to `lfm2_vl` to match the model's config.model_type ("lfm2_vl"), following the same pattern as qwen2_vl; a registration sketch follows the file list below.
Files updated:
- mm_plugin.py: Plugin registration
- template.py: Template name and mm_plugin reference
- constants.py: Model group template reference
- test_mm_plugin.py: Test function and variable names
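For illustration, the renamed registration might look roughly like this. The names follow LLaMA-Factory's template.py and mm_plugin.py conventions, and the real _register_template call carries many more formatting arguments, elided here:

```python
# Hedged, abbreviated sketch of the renamed registration; not the PR's exact code.
from llamafactory.data.mm_plugin import get_mm_plugin
from llamafactory.data.template import _register_template

_register_template(
    name="lfm2_vl",  # was "lfm_vl"; now matches config.model_type
    mm_plugin=get_mm_plugin(name="lfm2_vl", image_token="<image>"),
)
```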
Summary of Changes
Hello @vovanphuc, I'm Gemini Code Assist! Here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates LiquidAI's LFM2.5-VL vision-language model into LLaMA-Factory, significantly enhancing its multimodal capabilities. The changes introduce dynamic handling of image tokens based on image resolution, alongside a tailored chat template and the necessary system configuration, allowing the framework to process and generate responses that combine text and visual information.
hiyouga left a comment
LGTM
Code Review
This pull request adds support for LiquidAI's LFM2.5-VL vision-language model. The changes include a new LFMVLPlugin for dynamic image token expansion, a corresponding chat template, model registration, and a transformers version check. The implementation is mostly correct, but I've found a critical issue in the LFMVLPlugin regarding image batching that will affect training with batch sizes greater than one. I've also suggested an improvement to the unit test to cover the new plugin's core logic.
Summary
Add multimodal (vision-language) support for LiquidAI's LFM2.5-VL to LLaMA-Factory.
Changes
- `LFMVLPlugin` class with dynamic image token expansion based on spatial shapes
- `lfm2_vl` chat template with multimodal plugin support
- Model registered with `multimodal=True`

Supported Models
- LiquidAI/LFM2.5-VL-1.6B

Key Features
- Dynamic per-image token count: (spatial_h × spatial_w) / downsample_factor² (worked example below)
- `<image>` token (ID 396) with SigLIP2 NaFlex vision encoder
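For instance, applying the formula above (the grid size and downsample factor here are illustrative values, not taken from the model config):

```python
# Illustrative arithmetic only: a 16 x 16 patch grid from the vision encoder
# with downsample_factor = 2 (both values assumed).
spatial_h, spatial_w, downsample_factor = 16, 16, 2
num_image_tokens = (spatial_h * spatial_w) // downsample_factor**2
print(num_image_tokens)  # 64 -> the prompt gets 64 <image> tokens for this image
```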
Test plan

- Tested with `trust_remote_code=True`

References