Support for max_window_layers #157

Merged · 4 commits into main from max_window_layers · Feb 21, 2025

Conversation

bigximik (Contributor)

✨ Description

Closes #147
Also added an assert that window_size is not used without flash attention.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Adds support for max_window_layers, a threshold controlling which layers use sliding window attention (a sketch follows this list).
  2. Adds an assert that window_size is not used without flash attention.
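
A minimal sketch of the per-layer behavior, assuming a Qwen2-style convention in which the first max_window_layers layers fall back to full attention; AttentionConfig and get_window_size are illustrative stand-ins, not the actual Fast-LLM code:

from dataclasses import dataclass
from typing import Optional

@dataclass
class AttentionConfig:
    # Illustrative stand-ins for the real config fields.
    window_size: Optional[int] = None        # sliding-window width; None means full attention
    max_window_layers: Optional[int] = None  # layers below this index keep full attention
    use_flash_attention: bool = True

def get_window_size(config: AttentionConfig, layer_index: int) -> Optional[int]:
    # Mirror the assert described above: sliding windows only on the flash attention path.
    if config.window_size is not None:
        assert config.use_flash_attention, "window_size requires flash attention"
    window_size = config.window_size
    if config.max_window_layers is not None and layer_index < config.max_window_layers:
        # The first max_window_layers layers (0-based) keep full attention.
        window_size = None
    return window_size

For example, with window_size=4096 and max_window_layers=2, layers 0 and 1 would run full attention and every later layer would use the 4096-token window.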

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes. (tested only new test cases)
  • 📝 I have updated the documentation if needed. (not applicable)
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes. (tested only new test cases)
  • 🚦 I have tested these changes on GPUs and verified training stability. (not applicable)
  • 🏋️ I have tested the changes on realistic training workloads, if applicable. (not applicable)

@tscholak (Collaborator) left a comment

Looks super clean, thank you @bigximik!

@bigximik merged commit 947c1b6 into main on Feb 21, 2025
4 checks passed
@bigximik deleted the max_window_layers branch on February 21, 2025 at 15:28
@jlamypoirier (Collaborator)

@bigximik Why did you drop support for non-flash windowed attention? It should be supported.
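
For context, a sliding window can in principle be expressed in a plain (non-flash) attention path as an explicit mask. A minimal illustrative sketch, not the project's actual non-flash attention code:

import torch

def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    # True marks key positions a query may attend to: causal, and at most
    # window_size tokens back (including the current token).
    positions = torch.arange(seq_len)
    offset = positions[:, None] - positions[None, :]  # query index minus key index
    return (offset >= 0) & (offset < window_size)

# Example: with window_size=3, query position 5 attends to keys 3, 4, and 5 only.
mask = sliding_window_causal_mask(seq_len=6, window_size=3)

Whether the non-flash path should build such a mask is the question raised here; the review comment below quotes the PR's layer-gating check.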

window_size = self._config.window_size
if (
self._config.max_window_layers is not None
and self._layer_index < self._config.max_window_layers

Collaborator (review comment on the diff above):

I think this is incorrect because layer index starts at 1 for some reason.
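
To illustrate the off-by-one, assuming the intent is that the first max_window_layers layers skip the sliding window and that the comparison above is applied to a 1-based layer index:

max_window_layers = 2

# 0-based layer indices: the comparison exempts layers 0 and 1, as intended.
print([i for i in range(4) if i < max_window_layers])         # [0, 1]

# 1-based layer indices with the same comparison: only layer 1 is exempted.
print([i for i in range(1, 5) if i < max_window_layers])      # [1]

# One possible correction for 1-based indices (an assumption, not the actual fix):
print([i for i in range(1, 5) if i - 1 < max_window_layers])  # [1, 2]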

@tscholak (Collaborator)

Hi @jlamypoirier, I strongly prefer that we don't merge unfinished, experimental features just to "play with them", especially when they're complex and introduce long-term maintenance overhead.

Let me reiterate that:

  1. This feature is not urgent. The immediate priority is LoRA, and we shouldn't add distractions. LoRA as it is scoped right now also doesn't depend on this, contrary to what [Prototype] LoRA #180 suggests. I just commented on that.
  2. The fact that this needs an experimental flag tells me it's not ready to be merged. If it's still evolving, it should stay in this branch until it's ready.
  3. The max_window_layers discussion from Support for max_window_layers #157 isn't urgent either. Qwen2 defines it but published models don't actually use it, so we don't need to rush a general solution.
  4. There will be a real need for this feature when we experiment with SSM-transformer hybrid configurations, but that is weeks away.

Please, let's keep our focus on the roadmap and merge features when they're fully ready and actually needed. Thanks.
