Basic pixtral support, paving the way for vision models 🖼️ #153
Conversation
Recent changes (mostly #171) broke a few things here, so I grabbed the changes and created a clean commit from there. Also, "model-level rope" might not be the best term here, because we have different kinds of rope for the vision encoder and the text decoder. So for now, the vision encoder rope is defined within the encoder code, while the "model-level rope" is the text decoder one.
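The split can be sketched roughly as follows: the text decoder uses a standard 1-D rotary embedding over token positions, while the vision encoder applies a 2-D variant over patch (row, column) coordinates. The function names and the half/half dimension split below are illustrative, not eole's actual implementation:

```python
import math

def rope_1d(x, pos, theta=10000.0):
    """Standard 1-D RoPE: rotate consecutive feature pairs of x
    by angles that depend on the token position `pos`."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x[i] * c - x[i + 1] * s,
                    x[i] * s + x[i + 1] * c])
    return out

def rope_2d(x, row, col, theta=10000.0):
    """Illustrative 2-D RoPE for vision patches: the first half of the
    feature dimensions encodes the patch row, the second half the column."""
    half = len(x) // 2
    return rope_1d(x[:half], row, theta) + rope_1d(x[half:], col, theta)
```

Since both variants are pure rotations, they preserve the norm of each feature pair, which is one reason they can be applied at different places (encoder vs. model level) without changing activation scales.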
The initial goal was to extend this PR with more llava-like vision architectures. This could have facilitated fine-tuning experiments with smaller models, for instance.
Conversion
Conversion works with the `mistral-community` models, and has so far only been tested with `mistral-community/pixtral-12b`. The official mistralai models lack quite a lot of the information needed to be usable in our context. (HF checkpoints lack some information as well, but it's more manageable.)
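At its core, the conversion boils down to renaming and regrouping checkpoint tensors from the HF layout into eole's. A minimal sketch of the mechanism, with a purely hypothetical name mapping (the actual tables live in eole's converter):

```python
import re

# Hypothetical mapping from HF-style pixtral parameter names to
# eole-style names; only illustrates the mechanism, not the real table.
KEY_MAP = [
    (r"^vision_tower\.", "encoder."),
    (r"^language_model\.model\.", "decoder."),
    (r"^multi_modal_projector\.", "adapter."),
]

def remap_key(hf_key):
    """Return the eole-style name for an HF parameter name."""
    for pattern, repl in KEY_MAP:
        if re.match(pattern, hf_key):
            return re.sub(pattern, repl, hf_key, count=1)
    return hf_key  # leave unmatched keys untouched

def convert_state_dict(hf_state):
    """Rename every tensor in an HF state dict; values are unchanged."""
    return {remap_key(k): v for k, v in hf_state.items()}
```

The mistral-community checkpoints are workable precisely because their parameter names and config carry enough structure for this kind of mechanical remapping.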
What works
The provided `test_inference.py` script in the `pixtral` recipe allows running inference on a few examples (grabbed from the Pixtral blog post). The configuration uses bitsandbytes quantization by default, to allow running on a 24 GB VRAM GPU (tested on a 3090).
Some differences in a few methods (notably rope) lead to slight numerical differences from the HF implementation (as with most of our models anyway).
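When chasing such discrepancies, it helps to compare intermediate outputs with a relative rather than absolute tolerance, layer by layer. A tiny helper along these lines (the threshold is illustrative, not a measured value):

```python
def max_rel_diff(a, b, eps=1e-8):
    """Largest elementwise relative difference between two flat
    sequences of floats; `eps` guards against division by zero."""
    assert len(a) == len(b)
    return max(abs(x - y) / max(abs(x), abs(y), eps)
               for x, y in zip(a, b))

def check_close(name, ours, theirs, tol=1e-3):
    """Flag layers whose outputs drift beyond `tol` (a common ballpark
    when comparing a reference run against a quantized one)."""
    diff = max_rel_diff(ours, theirs)
    if diff > tol:
        print(f"{name}: max relative diff {diff:.2e} exceeds {tol:.0e}")
    return diff
```

Running this on each block's output usually pinpoints whether a drift originates in rope, normalization, or the quantized linear layers.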
What does not work (yet)
What might need to be improved (future work)
`eole.models.model`? allow for different `adapter` classes?)
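On the adapter question: one way to keep llava-like projectors pluggable is a minimal common interface, where each variant only has to map the vision encoder's hidden size to the decoder's. Everything below (class names, the pure-Python stand-in for a linear layer) is a hypothetical sketch, not eole code:

```python
class Adapter:
    """Interface sketch: maps vision-encoder features (in_dim)
    to the text-decoder embedding space (out_dim)."""
    def __call__(self, patch_features):
        raise NotImplementedError

class LinearAdapter(Adapter):
    """Single projection; a pure-Python stand-in for an nn.Linear."""
    def __init__(self, in_dim, out_dim):
        # deterministic toy weights; a real adapter would learn these
        self.w = [[(i + j) % 3 * 0.1 for j in range(in_dim)]
                  for i in range(out_dim)]

    def __call__(self, patch_features):
        # one projected vector of size out_dim per input patch
        return [[sum(w_ij * x_j for w_ij, x_j in zip(row, x))
                 for row in self.w]
                for x in patch_features]
```

With such an interface, swapping a single projection for a two-layer MLP (as some llava variants use) would only require registering another `Adapter` subclass, without touching the model-level wiring.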