
Conversation


@baberabb baberabb commented Nov 14, 2025

Couple of bug fixes to do with bias:

  • The gradients were contracted over both the batch and sequence dimensions (dim=(0,1)) rather than just the sequence dimension (dim=1); see the sketch after this list.
  • Normalize the weights with Adam before concatenating the bias, to avoid a shape mismatch ([N, O, I+1] / [O, I] division error). The biases are currently concatenated raw, as I wasn't sure of the best way to handle them. More in the comments.
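
A minimal sketch of the intended per-example contraction, with illustrative shapes and names (not the repo's actual API):

import torch

# Hypothetical shapes: grad_out is the per-token output gradient, x the layer input.
batch, seq, d_in, d_out = 2, 5, 3, 4
grad_out = torch.randn(batch, seq, d_out)
x = torch.randn(batch, seq, d_in)

# Per-example weight gradient: contract over the sequence dimension only,
# keeping the batch dimension -> [batch, d_out, d_in].
weight_grad = torch.einsum("bso,bsi->boi", grad_out, x)

# Per-example bias gradient: sum over the sequence dimension only (dim=1),
# not over batch and sequence (dim=(0,1)) -> [batch, d_out].
bias_grad = grad_out.sum(dim=1)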

update:

  • Added a bias_avg_sq field to AdafactorNormalizer and AdamNormalizer to keep track of the bias second moments, so bias normalization can be handled separately from the weight gradients in AdafactorNormalizer.normalize_() (sketched after this list):
    • Normalize the bias from the raw gradient G before weight processing
    • Sum bias gradients over the sequence dimension
    • Append the normalized bias as an extra column when include_bias=True
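
A minimal sketch of that bias path, assuming the shapes noted in the comments; the real AdafactorNormalizer.normalize_() signature and weight handling may differ:

import torch

def append_normalized_bias(weight_grad, grad_out, bias_avg_sq, eps=1e-30):
    # weight_grad: per-example weight gradient, shape [batch, d_out, d_in]
    # grad_out:    per-token output gradient,   shape [batch, seq, d_out]
    # bias_avg_sq: running second moment of the bias gradient, shape [d_out]
    bias_grad = grad_out.sum(dim=1)                            # sum over the sequence dim only
    bias_grad = bias_grad / bias_avg_sq.clamp_min(eps).sqrt()  # normalize from the raw gradient
    # Append the normalized bias as an extra input column -> [batch, d_out, d_in + 1]
    return torch.cat([weight_grad, bias_grad.unsqueeze(-1)], dim=-1)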

Modified GradientCollectorCallback (with help from Claude):

  • Extracted bias second moments from both the Adam and Adafactor optimizers (see the sketch after this list)
  • Added a scale_by_lr(lr) method to AdafactorNormalizer (this also fixes a bug where optimizer state tensors were being modified in-place)
  • Added test_optimizer_state_extraction
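
A rough sketch of the state extraction; the exact state keys are assumptions here (torch.optim.Adam stores the second moment under "exp_avg_sq", and Adafactor implementations typically use "exp_avg_sq" for 1-D parameters such as biases, but this may differ):

import torch

def bias_second_moment(optimizer: torch.optim.Optimizer, bias: torch.nn.Parameter):
    # Look up the per-parameter state; returns None if the optimizer hasn't stepped yet.
    state = optimizer.state.get(bias, {})
    avg_sq = state.get("exp_avg_sq")
    # Clone so later scaling (e.g. scale_by_lr) never mutates the optimizer's own tensors.
    return avg_sq.clone() if avg_sq is not None else None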

Also added some unit tests. #75 should probably be merged before this.

Someone better at linear algebra than me should probably have a look at this as well.

@luciaquirke
Collaborator

This is fabulous, thank you!! 🙏 Interested to hear what Nora thinks but I reckon exposing second moments for bias through the normalizer would be great

@luciaquirke
Collaborator

Running

pip install -e ".[dev]"
pre-commit install

Should add formatting on commit, let me know if that doesn't work for some reason

@baberabb
Author

> Running
>
> pip install -e ".[dev]"
> pre-commit install
>
> Should add formatting on commit, let me know if that doesn't work for some reason

Oh yeah, it was a problem with the ruff linter: it doesn't fix line-length errors (it leaves those to the formatter). Will add black back.

@baberabb baberabb force-pushed the bias branch 2 times, most recently from 0a1cdb2 to 31e5008 on November 18, 2025 at 00:30
@luciaquirke
Collaborator

@LouisYRYJ if we merge this in the next few days, will it interfere with your big PR?

@baberabb we currently can't merge this because it breaks the build

@baberabb
Author

baberabb commented Dec 16, 2025

Removed the workflow files! Do you want me to rebase this on the other PR branch?

@luciaquirke
Collaborator

luciaquirke commented Dec 16, 2025

Louis's PR just merged!! If you can rebase this on main we should be able to merge it too 🙏 🙏 🚀 TODO for me: do another once-over to remember where we're at with the normalizers too.

bias_avg_sq=self.bias_avg_sq, # Preserve bias second moments
)

def scale_by_lr(self, lr: float | Tensor) -> None:
Author


I added this method, but let me know if it's not necessary. Also, it does in-place ops, mirroring normalize, but maybe new tensors would be better?

Contributor


I don't see the advantage of new tensors. If they are in-place operations I would consider just calling the function, i.e.
self.row.mul_(lr_sqrt)

instead of doing
self.row = self.row.mul_(lr_sqrt)

Otherwise one might think these are not in-place.

Author


I was waffling between the two, hence this cursed syntax. Went with new tensors: these fields are initially references to the optimizer's state tensors, and it's easy to forget to replace them with copies before calling this.
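
For what it's worth, a sketch of the non-mutating variant, mirroring the self.row.mul_(lr_sqrt) line quoted above (field names and scaling here are assumptions, not the repo's actual code):

from dataclasses import dataclass

from torch import Tensor

@dataclass
class FactoredState:
    # Stand-in for the normalizer's factored second-moment fields.
    row: Tensor
    col: Tensor

    def scale_by_lr(self, lr: float | Tensor) -> None:
        lr_sqrt = lr**0.5 if isinstance(lr, float) else lr.sqrt()
        # Plain multiplication allocates fresh tensors, so the optimizer state
        # tensors that row and col initially reference are never mutated.
        self.row = self.row * lr_sqrt
        self.col = self.col * lr_sqrt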

Contributor


Can we maybe choose only one then?

i = i + 1
setattr(module, LayerAdapter.in_attr(module), i)
if p is not None:
    # Only project if no bias (bias requires full gradient to be materialized)
Contributor


I am confused by this. You can project with bias if you have no normalizer, right?

Author


Yup, added it back in.

if isinstance(normalizer, AdafactorNormalizer):
bias_grad = None

match normalizer:
Author

@baberabb baberabb Dec 19, 2025


These are quite repetitive, but I thought it would be more readable and less error-prone if the logic of each case is kept separate.

Also, what are your thoughts on upstreaming this to HookCollectorBase? All the other classes seem to be using the same logic, and people can always override it if they need to do something custom.

Contributor


I agree that it is quite repetitive. What exactly do you want to upstream? The forward and backward methods? I will be adding instances of HookBaseCollector in the near future that will not be of this form, but maybe we could have a generic GradientCollector class that is then further inherited?
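
One possible shape for that generic base, purely illustrative (hypothetical class and method names; hook registration and the actual collector API are omitted, and a 3-D [batch, seq, dim] activation is assumed):

import torch
from torch import nn

class GradientCollectorBase:
    """Shared forward/backward hook logic; subclasses decide what to do with the grads."""

    def forward_hook(self, module: nn.Module, args: tuple, output):
        # Cache the input activations needed to reconstruct per-example gradients.
        self._inputs = args[0].detach()

    def backward_hook(self, module: nn.Module, grad_input: tuple, grad_output: tuple):
        grad_out = grad_output[0]
        # Contract over the sequence dimension only, keeping per-example gradients.
        weight_grad = torch.einsum("bso,bsi->boi", grad_out, self._inputs)
        self.process_grads(module, weight_grad, grad_out.sum(dim=1))

    def process_grads(self, module: nn.Module, weight_grad, bias_grad):
        # Normalizer- or projection-specific handling lives in subclasses.
        raise NotImplementedError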
