
[fix] dont use ep enabled variant of clip gradnorm #2144

Open
Jackmin801 wants to merge 2 commits into main from fix-ep-sft

Conversation

@Jackmin801 (Member) commented Mar 31, 2026

The EP-enabled variant does not seem to actually work, but training is fine without it.


Note (Low Risk): a small change limited to the SFT training loop's gradient clipping call; the main risk is behavior differences in gradient-norm computation/clipping under expert-parallel configurations.

Overview
Removes use of the ep_enabled argument when calling clip_grad_norm_ in the SFT training loop, always using the default gradient-norm clipping path.

Written by Cursor Bugbot for commit 501b237.
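For context on what the removed flag is for: with expert parallelism, expert weights are sharded across ranks, so each rank only sees a partial gradient norm, and the partial squared norms have to be reduced across the EP group before clipping. A minimal sketch of such an EP-aware clip, assuming a hypothetical `ep_group` argument (this is not the repo's actual implementation):

```python
import torch

def clip_grad_norm_(parameters, max_norm, ep_group=None):
    """Clip gradients to a global L2 norm, optionally reducing over an EP group.

    ep_group is an illustrative stand-in for an expert-parallel process group;
    when given, partial squared norms are summed across ranks before clipping.
    """
    grads = [p.grad for p in parameters if p.grad is not None]
    # Local (possibly partial, under EP sharding) sum of squared gradient norms
    total_sq = torch.stack([g.detach().float().pow(2).sum() for g in grads]).sum()
    if ep_group is not None:
        # Combine partial squared norms from all expert-parallel ranks
        torch.distributed.all_reduce(total_sq, group=ep_group)
    total_norm = total_sq.sqrt()
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for g in grads:
            g.mul_(clip_coef)
    return total_norm
```

On a single process (`ep_group=None`) this reduces to ordinary global-norm clipping, which matches the default path the PR falls back to.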


@cursor bot left a comment:

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Incomplete fix: RL trainer still uses broken ep_enabled
    • Removed ep_enabled parameter from clip_grad_norm_ call in RL trainer to match the fix already applied to SFT trainer.


Or push these changes by commenting:

@cursor push 7e3946985b
Preview (7e3946985b)
```diff
diff --git a/src/prime_rl/trainer/rl/train.py b/src/prime_rl/trainer/rl/train.py
--- a/src/prime_rl/trainer/rl/train.py
+++ b/src/prime_rl/trainer/rl/train.py
@@ -462,9 +462,7 @@

         # Optionally, clip the gradients

-        grad_norm = clip_grad_norm_(
-            model.parameters(), max_norm=config.optim.max_norm, ep_enabled=parallel_dims.ep_enabled
-        )
+        grad_norm = clip_grad_norm_(model.parameters(), max_norm=config.optim.max_norm)
         if grad_norm.device.type == "cpu":
             grad_norm = grad_norm.to(torch.device("cuda"))
         zero_grad_ratio = get_zero_gradient_ratio(model.parameters(), parallel_dims.dp_replicate)
```



Incomplete fix: RL trainer still uses broken ep_enabled

Medium Severity

The ep_enabled=parallel_dims.ep_enabled parameter was removed from clip_grad_norm_ in the SFT trainer because it "doesn't actually work," but the identical call in src/prime_rl/trainer/rl/train.py still passes ep_enabled=parallel_dims.ep_enabled. If the EP-enabled variant of clip_grad_norm_ is broken, the RL trainer has the same problem and likely needs the same fix.


```diff
-grad_norm = clip_grad_norm_(
-    model.parameters(), max_norm=config.optim.max_norm, ep_enabled=parallel_dims.ep_enabled
-)
+grad_norm = clip_grad_norm_(model.parameters(), max_norm=config.optim.max_norm)
```
Collaborator commented:

IMO this will be broken with standard EP, so it should just be `ep_enabled=use_deepep`, and the same should be added to RL? Though I'd swear it worked before with DeepEP.
