Support for max_window_layers #157

Merged · 4 commits into main from max_window_layers · Feb 21, 2025

Conversation

bigximik (Contributor)

✨ Description

Closes #147
Also added an assert that window_size is not used without flash attention.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Adds support for max_window_layers, a threshold controlling which layers use sliding window attention (a sketch follows this list).
  2. Adds an assert that window_size is not used without flash attention.
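
A minimal sketch of the per-layer behavior, assuming a Qwen2-style convention in which the first max_window_layers layers fall back to full attention; AttentionConfig and get_window_size are illustrative stand-ins, not the actual Fast-LLM code:

from dataclasses import dataclass
from typing import Optional

@dataclass
class AttentionConfig:
    # Illustrative stand-ins for the real config fields.
    window_size: Optional[int] = None        # sliding-window width; None means full attention
    max_window_layers: Optional[int] = None  # layers below this index keep full attention
    use_flash_attention: bool = True

def get_window_size(config: AttentionConfig, layer_index: int) -> Optional[int]:
    # Mirror the assert described above: sliding windows only on the flash attention path.
    if config.window_size is not None:
        assert config.use_flash_attention, "window_size requires flash attention"
    window_size = config.window_size
    if config.max_window_layers is not None and layer_index < config.max_window_layers:
        # The first max_window_layers layers (0-based) keep full attention.
        window_size = None
    return window_size

For example, with window_size=4096 and max_window_layers=2, layers 0 and 1 would run full attention and every later layer would use the 4096-token window.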

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes. (tested only new test cases)
  • 📝 I have updated the documentation if needed. (not applicable)
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes. (tested only new test cases)
  • 🚦 I have tested these changes on GPUs and verified training stability. (not applicable)
  • 🏋️ I have tested the changes on realistic training workloads, if applicable. (not applicable)

@tscholak (Collaborator) left a comment

Looks super clean, thank you @bigximik!

@bigximik merged commit 947c1b6 into main on Feb 21, 2025
4 checks passed
@bigximik deleted the max_window_layers branch on February 21, 2025 at 15:28
@jlamypoirier (Collaborator)

@bigximik Why did you drop support for non-flash windowed attention? It should be supported.
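
For context, a sliding window can in principle be expressed in a plain (non-flash) attention path as an explicit mask. A minimal illustrative sketch, not the project's actual non-flash attention code:

import torch

def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    # True marks key positions a query may attend to: causal, and at most
    # window_size tokens back (including the current token).
    positions = torch.arange(seq_len)
    offset = positions[:, None] - positions[None, :]  # query index minus key index
    return (offset >= 0) & (offset < window_size)

# Example: with window_size=3, query position 5 attends to keys 3, 4, and 5 only.
mask = sliding_window_causal_mask(seq_len=6, window_size=3)

Whether the non-flash path should build such a mask is the question raised here; the review comment below quotes the PR's layer-gating check.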

window_size = self._config.window_size
if (
self._config.max_window_layers is not None
and self._layer_index < self._config.max_window_layers

Collaborator (review comment on the diff above):

I think this is incorrect because layer index starts at 1 for some reason.
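
To illustrate the off-by-one, assuming the intent is that the first max_window_layers layers skip the sliding window and that the comparison above is applied to a 1-based layer index:

max_window_layers = 2

# 0-based layer indices: the comparison exempts layers 0 and 1, as intended.
print([i for i in range(4) if i < max_window_layers])         # [0, 1]

# 1-based layer indices with the same comparison: only layer 1 is exempted.
print([i for i in range(1, 5) if i < max_window_layers])      # [1]

# One possible correction for 1-based indices (an assumption, not the actual fix):
print([i for i in range(1, 5) if i - 1 < max_window_layers])  # [1, 2]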

@tscholak (Collaborator)

Hi @jlamypoirier, I strongly prefer that we don't merge unfinished, experimental features just to "play with them", especially when they're complex and introduce long-term maintenance overhead.

Let me reiterate that:

  1. This feature is not urgent. The immediate priority is LoRA, and we shouldn't add distractions. LoRA as it is scoped right now also doesn't depend on this, contrary to what [Prototype] LoRA #180 suggests. I just commented on that.
  2. The fact that this needs an experimental flag tells me it's not ready to be merged. If it's still evolving, it should stay in this branch until it's ready.
  3. The max_window_layers discussion from Support for max_window_layers #157 isn't urgent either. Qwen2 defines it but published models don't actually use it, so we don't need to rush a general solution.
  4. There will be a real need for this feature when we experiment with SSM-transformer hybrid configurations, but that is weeks away.

Please, let's keep our focus on the roadmap and merge features when they're fully ready and actually needed. Thanks.
