[Prototype] LoRA #180
Conversation
hey @jlamypoirier, thanks for kicking off LoRA!
Right now, the goal is to get the minimal implementation merged quickly. Please focus on the essential pieces, like ensuring transformer weights are excluded from gradient buffers and optimizer states. And please avoid any interaction with #168. Also, please link this PR to the feature issue so it properly tracks progress. Thanks.
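For reference, the "frozen weights get no optimizer state" requirement boils down to something like the following PyTorch sketch. This is illustrative only: the helper names are made up, and the actual work in this PR lives in Fast-LLM's own gradient-buffer and optimizer-state machinery rather than plain `requires_grad` flags.

```python
import torch
from torch import nn

def freeze_base_weights(model: nn.Module) -> None:
    # Freeze every pre-trained transformer weight so no gradients are computed
    # (and no gradient buffers need to be allocated) for them.
    for param in model.parameters():
        param.requires_grad_(False)

def trainable_parameters(model: nn.Module):
    # Only adapter weights added later with requires_grad=True are returned, so
    # the optimizer never allocates momentum/variance state for frozen weights.
    return (p for p in model.parameters() if p.requires_grad)

# freeze_base_weights(model)
# ... attach LoRA adapters (their parameters default to requires_grad=True) ...
# optimizer = torch.optim.AdamW(trainable_parameters(model), lr=1e-4)
```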
Why are we excluding MLP? There is no real memory gain to be expected from leaving it out... Or do you mean freezing the whole MLP? Note that applying LoRA on Q and V only is a lot harder than applying it on all layers including MLP, which I'm almost done with either way. TP and kernel optimization are not a priority, see the PR description.
LoRA by default applies only to Q and V because this provides the best tradeoff between memory, compute, and fine-tuning performance. The plan in #149 was to follow this standard, which is why MLP wasn't included. Applying LoRA to MLP negates the efficiency gains that make LoRA appealing in the first place. While PEFT allows flexibility, this is rarely used in practice, for good reason...
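For concreteness, here is what that default looks like with HF PEFT; a minimal sketch, assuming a Llama-style checkpoint whose attention projections are named `q_proj` and `v_proj` (the checkpoint name and hyperparameters are just illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Standard LoRA: low-rank adapters on the query and value projections only;
# every other weight in the transformer (including the MLP) stays frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # adapters are a tiny fraction of the model
```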
If TP and kernel optimizations aren't a priority, why do they appear in the PR checklist? Can we please keep this scoped to the minimal implementation first? Thanks.
I'm not following here. I thought LoRA was largely about reducing memory usage? There is very little difference between QV only and all linear layers...
See the "optional" mention.
LoRA does not work the way you're assuming. In standard LoRA fine-tuning, the MLP remains frozen (like everything else in the transformer). Only the LoRA weights on Q and V are trained. This is what enables LoRA to be memory-efficient while maintaining compute efficiency. People don't use LoRA with MLPs because it's not that useful. The plan in #149 was to follow the standard, minimal LoRA implementation. If there's a strong reason to deviate, let's discuss before implementing it. Otherwise, please keep this scoped as agreed. Thanks.
As per my previous comment, I'm not following your reasoning. The massive memory reduction is obtained independently of whether MLP is included or not, and the computational difference is marginal. See https://pytorch.org/torchtune/0.3/tutorials/lora_finetune.html#trading-off-memory-and-model-performance-with-lora, which clearly shows little difference in memory usage, staying around 16 GiB for Llama-7B. Without LoRA, the memory usage would be ~150 GiB.
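To put rough numbers on this argument, a back-of-the-envelope sketch assuming Llama-7B-like dimensions and rank 8 (illustrative estimates, not measurements from this PR):

```python
# Approximate LoRA adapter parameter counts for a Llama-7B-like model:
# hidden size 4096, MLP intermediate size 11008, 32 layers, rank r=8.
hidden, intermediate, layers, r = 4096, 11008, 32, 8

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # Each adapted linear layer gains two matrices: (d_in x rank) and (rank x d_out).
    return rank * (d_in + d_out)

qv_only = layers * 2 * lora_params(hidden, hidden, r)        # ~4.2M adapter params
all_linear = layers * (
    4 * lora_params(hidden, hidden, r)                       # Q, K, V, O
    + 2 * lora_params(hidden, intermediate, r)               # MLP gate/up
    + lora_params(intermediate, hidden, r)                   # MLP down
)                                                            # ~20M adapter params

# Both are a tiny fraction of the ~6.7B frozen base parameters, so the bulk of
# the savings comes from dropping gradients and optimizer states for the frozen
# weights, not from which modules receive adapters.
print(f"Q/V only: {qv_only / 1e6:.1f}M, all linear layers: {all_linear / 1e6:.1f}M")
```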
We do not need LoRA for MLP. |
Splitting the key and value will be difficult because the two are stored as the same tensor. @tscholak can you confirm that this is exactly what we want, and that we really don't want to add the key tensor?
It sounds like some design choices in Fast-LLM are making LoRA harder to implement. Can you clarify whether keeping the tensor merged is actually necessary for performance, or whether we should split that? People normally use LoRA for Q and V only (exclusively so), so if the implementation is blocking that, we need to decide whether to fix the implementation or compromise on LoRA's expected behaviour. |
K and V are merged for performance; we can't split them in general without degrading performance and causing backward compatibility issues. What we can do is split them for the purpose of LoRA, but that will be difficult to implement and add 1-3 days of work.
Why do you need to split K and V again? If they're stored as one tensor (which remains frozen for PEFT), can't you just extract V and apply LoRA there? That would avoid unnecessary complexity. Also, I see you added MLP and a config class for selecting where LoRA is used, despite me explicitly asking you to postpone that. Please keep this scoped to the minimal implementation we agreed on. Remember that #149 also asks for a converter that stores the LoRA weights in a way that is compatible with HF's PEFT library, and all of this will make that harder now.
That's the goal.
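For illustration, a minimal sketch of that approach: keep the fused key-value projection frozen and add the low-rank update only to the value half of its output. The module and parameter names here are hypothetical, not Fast-LLM's actual code.

```python
import torch
from torch import nn

class LoRAOnValueOfFusedKV(nn.Module):
    """Wraps a frozen fused key-value projection and applies a LoRA update
    to the value half of its output only."""

    def __init__(self, kv_proj: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        assert kv_proj.out_features % 2 == 0, "expects output = [K; V] halves"
        self.kv_proj = kv_proj
        for param in self.kv_proj.parameters():
            param.requires_grad_(False)  # base projection stays frozen
        self.v_dim = kv_proj.out_features // 2
        self.scaling = alpha / rank
        # Low-rank factors for the value projection only.
        self.lora_a = nn.Linear(kv_proj.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, self.v_dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op, as in standard LoRA

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        kv = self.kv_proj(x)
        k, v = kv.split(self.v_dim, dim=-1)
        v = v + self.scaling * self.lora_b(self.lora_a(x))
        return torch.cat([k, v], dim=-1)
```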
As I already said, MLP was already implemented prior to the above directive. I committed and pushed existing code.
I am not following. The present implementation follows PEFT as closely as possible. Can you please elaborate on what is wrong here?
The issue here is that Fast-LLM now has its own way of specifying where LoRA is applied, which is different from how PEFT does it (see PEFT's LoraConfig). Had we stuck to the original scope (LoRA only on Q and V), this would have worked out of the box, with no additional translation needed. Now that there's extra configurability, we need to convert Fast-LLM's settings into PEFT's format (and vice versa), which adds complexity. And this is exactly what I wanted to avoid when I wrote the issue description for #149. Did you read it?
The translation is trivial.
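If the extra configurability does stay, the mapping could be as small as something like the sketch below; the Fast-LLM-side layer names here are hypothetical placeholders, and only the PEFT side (`LoraConfig`, `target_modules`) is the real API.

```python
from peft import LoraConfig

# Hypothetical Fast-LLM layer names mapped onto the module names PEFT expects
# for a Llama-style export; the real config class in this PR may differ.
FASTLLM_TO_PEFT_MODULES = {
    "query": "q_proj",
    "key": "k_proj",
    "value": "v_proj",
    "dense": "o_proj",
    "mlp_layer_1": "up_proj",
    "mlp_layer_2": "down_proj",
}

def to_peft_config(fastllm_layers: list[str], rank: int, alpha: int) -> LoraConfig:
    # Translate the Fast-LLM-side selection into a PEFT LoraConfig so exported
    # adapters can be loaded with HF's PEFT library.
    return LoraConfig(
        r=rank,
        lora_alpha=alpha,
        target_modules=[FASTLLM_TO_PEFT_MODULES[name] for name in fastllm_layers],
        task_type="CAUSAL_LM",
    )

# Example: to_peft_config(["query", "value"], rank=8, alpha=16)
```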
I am following #149; I thought this was obvious. And as we agreed, I exercise my own judgment on implementation details unless there is a noticeable impact on the workload.
✨ Description
Fix: #149
LoRA support
🔍 Type of change
Select all that apply: