[Prototype] LoRA #180

Draft · wants to merge 4 commits into main

Conversation

@jlamypoirier (Collaborator) commented Mar 7, 2025

✨ Description

Fix: #149

LoRA support

  • Basic LoRA wrapper (see the sketch after this list)
  • Basic LoRA Config
  • Add LoRA support in attention
  • Add LoRA support in MLP
  • Allow selecting which sub-layers LoRA applies to
  • Allow LoRA for value but not key
  • (Hard!) Exclude original weights from gradient buffers and optimizer states.
  • Add tests
  • Make sure it works
  • (Optional) TP support
  • (Optional) Activation recomputation support.
  • (Optional) Optimize kernels
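
For concreteness, a minimal sketch of what the "basic LoRA wrapper" item could look like in plain PyTorch (illustrative only; `LoRALinear` and its field names are not Fast-LLM code):

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # original weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # A is initialized randomly, B to zero, so training starts from the base model.
        self.lora_a = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```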

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@tscholak (Collaborator) commented Mar 10, 2025

hey @jlamypoirier, thanks for kicking off LoRA!
However, the list of features above doesn't align with the scoped execution plan of #149.
Specifically:

  • we only need LoRA on Wq and Wv for now, not MLP or selective sub-layer support.
  • we are not prioritizing TP support or any kernel optimizations.

Right now, the goal is to get the minimal implementation merged quickly. Please focus on the essential pieces, like ensuring transformer weights are excluded from gradient buffers and optimizer states. And please avoid any interaction with #168.
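
A plain-PyTorch sketch of the "excluded from gradient buffers and optimizer states" piece: frozen weights get `requires_grad=False` and never reach the optimizer, so no gradient buffers or Adam moments are allocated for them (the `lora_` naming convention below is an assumption; Fast-LLM's own gradient-buffer machinery is not shown):

```python
import torch


def collect_lora_parameters(model: torch.nn.Module) -> list[torch.nn.Parameter]:
    """Freeze everything except LoRA parameters and return the trainable ones."""
    trainable = []
    for name, param in model.named_parameters():
        if "lora_" in name:  # assumed naming convention for adapter weights
            param.requires_grad_(True)
            trainable.append(param)
        else:
            param.requires_grad_(False)  # frozen: no gradients, no optimizer state
    return trainable


# Optimizer state (e.g. AdamW moments) is then only allocated for the LoRA weights:
# optimizer = torch.optim.AdamW(collect_lora_parameters(model), lr=1e-4)
```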

Also, please link this PR to the feature issue so it properly tracks progress.

Thanks.

@tscholak mentioned this pull request Mar 10, 2025
@jlamypoirier (Collaborator, Author) commented Mar 10, 2025

Why are we excluding MLP? There is no real memory gain to be expected without it... Or do you mean freezing the whole MLP?

Note that applying LoRA on Q and V only is a lot harder than applying it on all layers including MLP, which I'm almost done with either way.

TP and kernel optimization are not a priority; see the PR description.

@tscholak (Collaborator)

LoRA by default applies only to Q and V because this provides the best tradeoff between efficiency, compute, and fine-tuning performance. The plan in #149 was to follow this standard, which is why MLP wasn’t included. Applying LoRA to MLP negates the efficiency gains that make LoRA appealing in the first place. While PEFT allows flexibility, this is rarely used in practice for good reason...

> TP and kernel optimization are not a priority; see the PR description.

If TP and kernel optimizations aren't a priority, why do they appear in the PR checklist?

Can we please keep this scoped to the minimal implementation first? Thanks.

@jlamypoirier (Collaborator, Author)

> LoRA by default applies only to Q and V because this provides the best tradeoff between efficiency, compute, and fine-tuning performance. The plan in #149 was to follow this standard, which is why MLP wasn’t included. Applying LoRA to MLP negates the efficiency gains that make LoRA appealing in the first place. While PEFT allows flexibility, this is rarely used in practice for good reason...

I'm not following here. I thought LoRA was largely about reducing memory usage? There is very little difference between Q/V only and all linear layers...

> If TP and kernel optimizations aren't a priority, why do they appear in the PR checklist?

See the "optional" mention.

@tscholak (Collaborator)

LoRA does not work the way you're assuming. In standard LoRA fine-tuning, the MLP remains frozen (like everything else in the transformer). Only the LoRA weights on Q and V are trained. This is what enables LoRA to be memory-efficient while maintaining compute efficiency. People don't use LoRA with MLPs because it's not that useful.

The plan in #149 was to follow the standard, minimal LoRA implementation. If there's a strong reason to deviate, let's discuss before implementing it. Otherwise, please keep this scoped as agreed. Thanks.

@jlamypoirier (Collaborator, Author)

As per my previous comment, I'm not following your reasoning. The massive memory reduction is obtained independently of whether MLP is included or not, and the computational difference is marginal. See https://pytorch.org/torchtune/0.3/tutorials/lora_finetune.html#trading-off-memory-and-model-performance-with-lora, which clearly shows little difference in memory usage, staying around 16 GiB for Llama-7B. Without LoRA, the memory usage would be ~150 GiB.
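
A rough back-of-envelope, assuming a Llama-2-7B-like shape (hidden size 4096, 32 layers, MLP width 11008) and rank 8, shows why the adapter footprint is negligible either way; the big savings come from the ~6.7B frozen parameters needing no gradients or optimizer state:

```python
# Back-of-envelope LoRA parameter counts (illustrative Llama-2-7B-like shapes).
hidden, mlp, layers, rank = 4096, 11008, 32, 8


def lora_params(shapes, r=rank):
    # A rank-r adapter on a (d_in, d_out) linear layer adds r * (d_in + d_out) parameters.
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)


qv_only = layers * lora_params([(hidden, hidden)] * 2)              # Q and V
all_linear = layers * lora_params(
    [(hidden, hidden)] * 4                                          # Q, K, V, O
    + [(hidden, mlp), (hidden, mlp), (mlp, hidden)]                 # gate, up, down
)
print(f"Q/V only:   {qv_only / 1e6:.1f}M trainable parameters")     # ~4.2M
print(f"all linear: {all_linear / 1e6:.1f}M trainable parameters")  # ~20.0M
# Both are well under 1% of the ~6.7B frozen parameters that dominate memory.
```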

@tscholak (Collaborator)

We do not need LoRA for MLP.

@jlamypoirier (Collaborator, Author)

Splitting the key and value will be difficult because the two are stored as the same tensor. @tscholak, can you confirm that this is exactly what we want, and that we really don't want LoRA on the key tensor as well?

@tscholak (Collaborator)

It sounds like some design choices in Fast-LLM are making LoRA harder to implement. Can you clarify whether keeping the tensor merged is actually necessary for performance, or whether we should split it? People normally use LoRA for Q and V only (exclusively so), so if the implementation is blocking that, we need to decide whether to fix the implementation or compromise on LoRA's expected behaviour.

@jlamypoirier (Collaborator, Author)

K and V are merged for performance; we can't split them in general without degrading performance and breaking backward compatibility. What we can do is split them just for the purpose of LoRA, but that will be difficult to implement and add 1-3 days of work.

@tscholak (Collaborator)

Why do you need to split K and V again? If they're stored as one tensor (which remains frozen for PEFT), can't you just extract V and apply LoRA there? That would avoid unnecessary complexity.

Also, I see you added MLP and a config class for selecting where LoRA is used despite me explicitly asking you to postpone that. Please keep this scoped to the minimal implementation we agreed on. Remember that #149 also asks for a converter that stores the LoRA weights in a way that is compatible with HF's PEFT library, and all this will make that harder now.

@jlamypoirier (Collaborator, Author) commented Mar 10, 2025

> Why do you need to split K and V again? If they're stored as one tensor (which remains frozen for PEFT), can't you just extract V and apply LoRA there? That would avoid unnecessary complexity.

That's the goal.
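
A sketch of that approach, assuming the frozen projection produces a concatenated [K; V] output that can be sliced after the matmul (illustrative names, not Fast-LLM's actual attention code):

```python
import torch
import torch.nn as nn


class MergedKVWithValueLoRA(nn.Module):
    """Keep the merged, frozen K/V projection intact and add a low-rank
    update only to the value half of its output."""

    def __init__(self, kv_proj: nn.Linear, kv_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.kv_proj = kv_proj  # frozen merged projection: hidden -> 2 * kv_dim
        self.kv_proj.weight.requires_grad_(False)
        self.kv_dim = kv_dim
        self.lora_a = nn.Parameter(torch.randn(rank, kv_proj.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(kv_dim, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        key, value = self.kv_proj(x).split(self.kv_dim, dim=-1)
        value = value + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
        return torch.cat([key, value], dim=-1)  # same layout as before; only V is adapted
```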

> Also, I see you added MLP and a config class for selecting where LoRA is used despite me explicitly asking you to postpone that. Please keep this scoped to the minimal implementation we agreed on.

As I already said, the MLP support was implemented prior to the above directive; I committed and pushed existing code.

> Remember that #149 also asks for a converter that stores the LoRA weights in a way that is compatible with HF's PEFT library, and all this will make that harder now.

I am not following. The present implementation follows PEFT as closely as possible. Can you please elaborate on what is wrong here?

@tscholak (Collaborator)

The issue here is that Fast-LLM now has its own way of specifying where LoRA is applied, which is different from how PEFT does it (see PEFT's LoraConfig).

Had we stuck to the original scope (LoRA only on Q and V), this would have worked out of the box. No additional translation needed. We could have just ignored LoRA's target_modules setting.
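
For reference, the PEFT side of that scope is a single LoraConfig; the module names below are the usual Hugging Face Llama projection names, which is an assumption about the exported checkpoint layout:

```python
from peft import LoraConfig

# Q and V only, matching the scope in #149; "q_proj"/"v_proj" are the
# standard HF Llama module names (assumed here for illustration).
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```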

Now that there's extra configurability, we need to convert Fast-LLM's settings into PEFT's format (and vice versa), which adds complexity. And this is exactly what I wanted to avoid when I wrote down the issue description for #149. Did you read it?

@jlamypoirier (Collaborator, Author)

> The issue here is that Fast-LLM now has its own way of specifying where LoRA is applied, which is different from how PEFT does it (see PEFT's LoraConfig).

> Had we stuck to the original scope (LoRA only on Q and V), this would have worked out of the box. No additional translation needed. We could have just ignored LoRA's target_modules setting.

The translation is trivial.

> Now that there's extra configurability, we need to convert Fast-LLM's settings into PEFT's format (and vice versa), which adds complexity. And this is exactly what I wanted to avoid when I wrote down the issue description for #149. Did you read it?

I am following #149; I thought this was obvious. And as we agreed, I exercise my own judgment on implementation details unless there is a noticeable impact on the workload.

Successfully merging this pull request may close these issues: [feat] LoRA