
Conversation

@luciaquirke (Collaborator) commented Dec 19, 2025

A more VRAM-efficient variant in which preconditioners can be spread across an arbitrary number of nodes to compute large outer products. This is useful because a preconditioner is often applied to a query which is then run across a large dataset, so slow but VRAM-efficient preconditioner computation and usage is a scalable pattern.

The gradients computed from each data point on one device need to be sent to all the other devices for the preconditioners to be updated, so this is not a drop-in replacement for our regular gradient collector.
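The communication pattern described above can be sketched as follows. This is a single-process numpy simulation, not the actual `MultiNodeGradientCollector` implementation: the function name, shapes, and the use of numpy in place of `torch.distributed` are all illustrative assumptions. Each rank's gradients are gathered by every other rank (the all-gather step the comment refers to), and each rank then accumulates only its own row shard of the outer-product preconditioner, so no single device materializes the full matrix.

```python
import numpy as np

def update_sharded_preconditioners(per_rank_grads, world_size):
    """Simulate the all-gather + sharded outer-product update on one process.

    per_rank_grads: list of gradient batches, one per rank, each (n_i, d).
    Returns one shard per rank; rank r holds the row block of
    P = G^T G covering dims [r*d/world_size, (r+1)*d/world_size).
    (Hypothetical sketch; not the Bergson API.)
    """
    # Step 1: every rank sends its gradients to all others (all-gather).
    all_grads = np.concatenate(per_rank_grads, axis=0)  # (N, d)
    d = all_grads.shape[1]
    rows = d // world_size
    # Step 2: each rank accumulates only its shard of the outer product,
    # so VRAM per device scales with d*d/world_size, not d*d.
    shards = []
    for rank in range(world_size):
        block = all_grads[:, rank * rows:(rank + 1) * rows]  # (N, rows)
        shards.append(block.T @ all_grads)                   # (rows, d)
    return shards

rng = np.random.default_rng(0)
grads = [rng.standard_normal((4, 8)) for _ in range(2)]
shards = update_sharded_preconditioners(grads, world_size=2)
full = np.concatenate(grads, axis=0)
# Stacking the shards recovers the full preconditioner G^T G.
assert np.allclose(np.vstack(shards), full.T @ full)
```

In the real collector the per-step all-gather is what makes this slower than a single-rank collector, which is the trade-off discussed later in this thread.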

@luciaquirke luciaquirke changed the title [Option] Parallelize preconditioners across ranks #94 [Option] Parallelize preconditioners across ranks Dec 19, 2025
@luciaquirke luciaquirke force-pushed the multi-node branch 2 times, most recently from a9d1531 to 4061982 Compare December 21, 2025 01:12
@luciaquirke luciaquirke changed the title [Option] Parallelize preconditioners across ranks [Option] Parallelize preconditioners across ranks; multi-node FSDP Dec 21, 2025


@dataclass(kw_only=True)
class MultiNodeGradientCollector(HookCollectorBase):
Contributor commented:
Is this going to be a replacement for GradientCollector? It seems like we don't need it if we have this one.

Collaborator Author replied:
Yes, I will merge this as a separate class for dogfooding and then replace the GradientCollector when we're convinced it's stable

Collaborator Author @luciaquirke replied Jan 9, 2026:

Also, this does a distributed operation with the data every step so that all the preconditioners get all the data, so it will probably be too slow to be our main collector. It's mostly aimed at collecting big preconditioners where you only need to process a small amount of data to get a reasonable estimate. I guess it will be equally fast if you skip the preconditioners, but slower in a scenario where you could fit all the preconditioners on the same rank.


def build_worker(
rank: int,
local_rank: int,
Contributor commented:
add to doc what this does

luciaquirke and others added 2 commits January 9, 2026 01:13
Automatically generated by python-semantic-release
@luciaquirke (Collaborator Author) commented Jan 9, 2026

@norabelrose do you like this pattern where we have a distributed config dataclass that holds the rank information as properties, which return different values after the local_rank env variables are set? I was thinking of removing the local_rank parameters everywhere and always accessing them via the config object.

Or is it important to only initialize and pass in the rank parameters once they're set, so users can't access potentially invalid variables except through os.environ?

https://github.com/EleutherAI/bergson/pull/100/changes#diff-e191f18aceff7de00f46cfdefddbff2d410ea97bf30d8c3bdf669eaa52c6b626

@luciaquirke luciaquirke force-pushed the multi-node branch 3 times, most recently from e1c1da9 to 8a9e3d7 Compare January 9, 2026 01:50
@luciaquirke luciaquirke changed the title [Option] Parallelize preconditioners across ranks; multi-node FSDP VRAM-efficient multi-GPU and/or multi-node preconditioner computation Jan 9, 2026
@luciaquirke (Collaborator Author) commented Jan 9, 2026

@LouisYRYJ I extracted the multi-node args into a config object and updated some names for clarity; going to merge for dogfooding today.

@luciaquirke luciaquirke force-pushed the multi-node branch 3 times, most recently from f933ace to ed20a46 Compare January 9, 2026 01:58
@luciaquirke luciaquirke merged commit aaca464 into main Jan 9, 2026
5 checks passed