[Prototype] LoRA #180
Conversation
hey @jlamypoirier, thanks for kicking off LoRA!
Right now, the goal is to get the minimal implementation merged quickly. Please focus on the essential pieces, like ensuring transformer weights are excluded from gradient buffers and optimizer states. And please avoid any interaction with #168. Also, please link this PR to the feature issue so it properly tracks progress. Thanks.
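For reference, the "frozen weights get no optimizer state" requirement boils down to something like the following PyTorch sketch. This is illustrative only: the helper names are made up, and the actual work in this PR lives in Fast-LLM's own gradient-buffer and optimizer-state machinery rather than plain `requires_grad` flags.

```python
import torch
from torch import nn

def freeze_base_weights(model: nn.Module) -> None:
    # Freeze every pre-trained transformer weight so no gradients are computed
    # (and no gradient buffers need to be allocated) for them.
    for param in model.parameters():
        param.requires_grad_(False)

def trainable_parameters(model: nn.Module):
    # Only adapter weights added later with requires_grad=True are returned, so
    # the optimizer never allocates momentum/variance state for frozen weights.
    return (p for p in model.parameters() if p.requires_grad)

# freeze_base_weights(model)
# ... attach LoRA adapters (their parameters default to requires_grad=True) ...
# optimizer = torch.optim.AdamW(trainable_parameters(model), lr=1e-4)
```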
Why are we excluding MLP? There is no real memory gain to be expected from leaving it out... Or do you mean freezing the whole MLP? Note that applying LoRA on Q and V only is a lot harder than applying it on all layers including MLP, which I'm almost done with either way. TP and kernel optimization are not a priority, see the PR description.
LoRA by default applies only to Q and V because this provides the best tradeoff between memory, compute, and fine-tuning performance. The plan in #149 was to follow this standard, which is why MLP wasn't included. Applying LoRA to MLP negates the efficiency gains that make LoRA appealing in the first place. While PEFT allows flexibility, this is rarely used in practice, for good reason...
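For concreteness, here is what that default looks like with HF PEFT; a minimal sketch, assuming a Llama-style checkpoint whose attention projections are named `q_proj` and `v_proj` (the checkpoint name and hyperparameters are just illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Standard LoRA: low-rank adapters on the query and value projections only;
# every other weight in the transformer (including the MLP) stays frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # adapters are a tiny fraction of the model
```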
If TP and kernel optimizations aren't a priority, why do they appear in the PR checklist? Can we please keep this scoped to the minimal implementation first? Thanks.
I'm not following here. I thought LoRA was largely about reducing memory usage? There is very little difference between QV only and all linear layers...
See the "optional" mention.
LoRA does not work the way you're assuming. In standard LoRA fine-tuning, the MLP remains frozen (like everything else in the transformer). Only the LoRA weights on Q and V are trained. This is what enables LoRA to be memory-efficient while maintaining compute efficiency. People don't use LoRA with MLPs because it's not that useful. The plan in #149 was to follow the standard, minimal LoRA implementation. If there's a strong reason to deviate, let's discuss before implementing it. Otherwise, please keep this scoped as agreed. Thanks.
As per my previous comment, I'm not following your reasoning. The massive memory reduction is obtained independently of whether MLP is included or not, and the computational difference is marginal. See https://pytorch.org/torchtune/0.3/tutorials/lora_finetune.html#trading-off-memory-and-model-performance-with-lora, which clearly shows little difference in memory usage, staying around 16 GiB for Llama-7B. Without LoRA, the memory usage would be ~150 GiB.
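To put rough numbers on this argument, a back-of-the-envelope sketch assuming Llama-7B-like dimensions and rank 8 (illustrative estimates, not measurements from this PR):

```python
# Approximate LoRA adapter parameter counts for a Llama-7B-like model:
# hidden size 4096, MLP intermediate size 11008, 32 layers, rank r=8.
hidden, intermediate, layers, r = 4096, 11008, 32, 8

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # Each adapted linear layer gains two matrices: (d_in x rank) and (rank x d_out).
    return rank * (d_in + d_out)

qv_only = layers * 2 * lora_params(hidden, hidden, r)        # ~4.2M adapter params
all_linear = layers * (
    4 * lora_params(hidden, hidden, r)                       # Q, K, V, O
    + 2 * lora_params(hidden, intermediate, r)               # MLP gate/up
    + lora_params(intermediate, hidden, r)                   # MLP down
)                                                            # ~20M adapter params

# Both are a tiny fraction of the ~6.7B frozen base parameters, so the bulk of
# the savings comes from dropping gradients and optimizer states for the frozen
# weights, not from which modules receive adapters.
print(f"Q/V only: {qv_only / 1e6:.1f}M, all linear layers: {all_linear / 1e6:.1f}M")
```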
We do not need LoRA for MLP. |
Splitting the key and value will be difficult because the two are stored as the same tensor. @tscholak can you confirm that this is exactly what we want, and that we really don't want to add the key tensor?
It sounds like some design choices in Fast-LLM are making LoRA harder to implement. Can you clarify whether keeping the tensor merged is actually necessary for performance, or whether we should split that? People normally use LoRA for Q and V only (exclusively so), so if the implementation is blocking that, we need to decide whether to fix the implementation or compromise on LoRA's expected behaviour. |
K and V are merged for performance; we can't split them in general without degrading performance and causing backward compatibility issues. What we can do is split them for the purpose of LoRA, but that will be difficult to implement and add 1-3 days of work.
Why do you need to split K and V again? If they're stored as one tensor (which remains frozen for PEFT), can't you just extract V and apply LoRA there? That would avoid unnecessary complexity. Also, I see you added MLP and a config class for selecting where LoRA is used, despite me explicitly asking you to postpone that. Please keep this scoped to the minimal implementation we agreed on. Remember that #149 also asks for a converter that stores the LoRA weights in a way that is compatible with HF's PEFT library, and all of this will make that harder now.
That's the goal.
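For illustration, a minimal sketch of that approach: keep the fused key-value projection frozen and add the low-rank update only to the value half of its output. The module and parameter names here are hypothetical, not Fast-LLM's actual code.

```python
import torch
from torch import nn

class LoRAOnValueOfFusedKV(nn.Module):
    """Wraps a frozen fused key-value projection and applies a LoRA update
    to the value half of its output only."""

    def __init__(self, kv_proj: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        assert kv_proj.out_features % 2 == 0, "expects output = [K; V] halves"
        self.kv_proj = kv_proj
        for param in self.kv_proj.parameters():
            param.requires_grad_(False)  # base projection stays frozen
        self.v_dim = kv_proj.out_features // 2
        self.scaling = alpha / rank
        # Low-rank factors for the value projection only.
        self.lora_a = nn.Linear(kv_proj.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, self.v_dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op, as in standard LoRA

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        kv = self.kv_proj(x)
        k, v = kv.split(self.v_dim, dim=-1)
        v = v + self.scaling * self.lora_b(self.lora_a(x))
        return torch.cat([k, v], dim=-1)
```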
As I already said, MLP was already implemented prior to the above directive. I committed and pushed existing code.
I am not following. The present implementation follows PEFT as closely as possible. Can you please elaborate on what is wrong here?
The issue here is that Fast-LLM now has its own way of specifying where LoRA is applied, which is different from how PEFT does it (see PEFT's LoraConfig). Had we stuck to the original scope (LoRA only on Q and V), this would have worked out of the box, with no additional translation needed. Now that there's extra configurability, we need to convert Fast-LLM's settings into PEFT's format (and vice versa), which adds complexity. And this is exactly what I wanted to avoid when I wrote the issue description for #149. Did you read it?
The translation is trivial.
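If the extra configurability does stay, the mapping could be as small as something like the sketch below; the Fast-LLM-side layer names here are hypothetical placeholders, and only the PEFT side (`LoraConfig`, `target_modules`) is the real API.

```python
from peft import LoraConfig

# Hypothetical Fast-LLM layer names mapped onto the module names PEFT expects
# for a Llama-style export; the real config class in this PR may differ.
FASTLLM_TO_PEFT_MODULES = {
    "query": "q_proj",
    "key": "k_proj",
    "value": "v_proj",
    "dense": "o_proj",
    "mlp_layer_1": "up_proj",
    "mlp_layer_2": "down_proj",
}

def to_peft_config(fastllm_layers: list[str], rank: int, alpha: int) -> LoraConfig:
    # Translate the Fast-LLM-side selection into a PEFT LoraConfig so exported
    # adapters can be loaded with HF's PEFT library.
    return LoraConfig(
        r=rank,
        lora_alpha=alpha,
        target_modules=[FASTLLM_TO_PEFT_MODULES[name] for name in fastllm_layers],
        task_type="CAUSAL_LM",
    )

# Example: to_peft_config(["query", "value"], rank=8, alpha=16)
```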
I am following #149; I thought this was obvious. And as we agreed, I exercise my own judgment on implementation details unless there is a noticeable impact on the workload.
✨ Description
Fix: #149
LoRA support
🔍 Type of change
Select all that apply: