moe_act target and better docs #2161

Open
faresobeid wants to merge 2 commits into main from moe_act
Contributor

@faresobeid faresobeid commented Apr 1, 2026

Adds moe_act selective AC target and adds some docs on selective AC tuning


Note

Medium Risk
Changes the trainer’s default activation-checkpointing behavior for custom models (now implicitly enables selective AC), which can affect memory/throughput characteristics and recomputation. Adds a new selective AC hook in MoE expert code paths, so correctness/perf should be validated on representative MoE workloads.

Overview
Selective activation checkpointing is expanded and made easier to use for custom models. When trainer.model.ac is unset, the trainer now implicitly enables selective activation checkpointing (mode="selective", targets=["norm"]) for the custom implementation, while HF models still default to AC disabled; explicitly setting [trainer.model.ac] (or --model.ac) continues to mean full-layer checkpointing.
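The defaulting rule described above can be sketched as a small resolution function. This is only an illustration under stated assumptions: `ACConfig`, `resolve_ac`, and the `impl` values `"custom"` and `"hf"` are hypothetical names chosen for the sketch, not the trainer's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ACConfig:
    # Hypothetical config shape: an explicitly provided section with no
    # further options means full-layer checkpointing.
    mode: str = "full"
    targets: list[str] = field(default_factory=list)


def resolve_ac(ac: Optional[ACConfig], impl: str) -> Optional[ACConfig]:
    """Return the effective AC config for a model implementation."""
    if ac is not None:
        # The user set the section explicitly: respect it unchanged.
        return ac
    if impl == "custom":
        # Unset + custom implementation: implicit selective AC on "norm".
        return ACConfig(mode="selective", targets=["norm"])
    # Unset + HF implementation: AC stays disabled.
    return None
```

Under this sketch, `resolve_ac(None, "custom")` yields a selective config, while `resolve_ac(None, "hf")` yields `None`, which is the implicit behavior the reviewers discuss further down.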

Adds a new selective target moe_act for MoE layers. The selective AC system can now checkpoint only the routed expert activation function, and it is automatically skipped when routed_experts is also enabled to avoid nested/double checkpointing; MoE expert implementations were refactored to expose moe_act as a hookable method.
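The subsumption rule (skip `moe_act` when `routed_experts` is also selected) might look roughly like the following. The function name `effective_targets` is hypothetical, and the sketch assumes targets are plain strings; the PR's real patching code is not shown here.

```python
def effective_targets(targets: list[str]) -> list[str]:
    """Drop targets already covered by a broader target."""
    # De-duplicate while preserving the order the user wrote.
    selected = list(dict.fromkeys(targets))
    # Assumption: checkpointing the routed experts already recomputes the
    # expert activation, so checkpointing moe_act too would nest checkpoints.
    if "routed_experts" in selected and "moe_act" in selected:
        selected.remove("moe_act")
    return selected
```

With this sketch, `["norm", "moe_act", "routed_experts"]` reduces to `["norm", "routed_experts"]`, while `["moe_act"]` on its own is left untouched.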

Docs and changelog were updated to describe the new defaulting behavior and provide selective-AC tuning guidance, and unit tests were added/updated to cover the new defaults and moe_act patching/subsumption behavior.

Written by Cursor Bugbot for commit 2b0abaa. This will update automatically on new commits.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Activation checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass, trading compute for memory.
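As a toy illustration of that trade-off (not the trainer's implementation), the sketch below differentiates a chain of scalar functions two ways: once keeping every intermediate input, and once keeping only the segment input and recomputing the rest during the backward pass. The layer list and function names are made up for the example.

```python
import math

# Each "layer" is a (forward, derivative) pair of scalar functions.
LAYERS = [
    (math.sin, math.cos),
    (lambda x: x * x, lambda x: 2.0 * x),
    (math.exp, math.exp),
]


def _backward(inputs):
    """Chain rule over the saved layer inputs, last layer first."""
    grad = 1.0
    for (_, deriv), inp in zip(reversed(LAYERS), reversed(inputs)):
        grad *= deriv(inp)
    return grad


def grad_stored(x):
    """Baseline: the forward pass keeps every layer input."""
    inputs = []
    for fwd, _ in LAYERS:
        inputs.append(x)
        x = fwd(x)
    # Returns the gradient and how many values survived the forward pass.
    return _backward(inputs), len(inputs)


def grad_checkpointed(x):
    """Checkpointed: the forward pass keeps only the segment input."""
    saved = x
    out = x
    for fwd, _ in LAYERS:
        out = fwd(out)  # intermediates are discarded here
    # Backward pass: rerun the forward from the saved input to rebuild them.
    inputs = []
    x = saved
    for fwd, _ in LAYERS:
        inputs.append(x)
        x = fwd(x)
    return _backward(inputs), 1
```

Both functions return the same gradient; the checkpointed version holds one value between forward and backward instead of three, at the cost of running the forward twice.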

To enable it, use:
If `trainer.model.ac` is unset, supported custom implementations default to selective AC on the cheapest target:
Member

Huh? Where is it defined? I don't like this, tbh.

Member

Would agree here. An unset config should not default to doing it anyway.

Member

We can ofc default to selective AC if that is reasonable, but it should be explicit in the configs, imo (e.g. an agent should see ac.mode = selective instead of ac.mode = None).
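One way to read this suggestion is to put the default into the config schema itself, so that a dumped config shows the effective value. The sketch below is hypothetical: `ExplicitACConfig` and `ModelConfig` are illustrative dataclasses, not the project's config classes, and it ignores the separate HF default mentioned in the overview.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ExplicitACConfig:
    # The default lives in the schema, so it is visible when inspected.
    mode: str = "selective"
    targets: list[str] = field(default_factory=lambda: ["norm"])


@dataclass
class ModelConfig:
    # No Optional/None here: an untouched config still reports its AC mode.
    ac: ExplicitACConfig = field(default_factory=ExplicitACConfig)


def dump_config(cfg: ModelConfig) -> dict:
    """Return the config as a plain dict, as a user or agent would see it."""
    return asdict(cfg)
```

Here `dump_config(ModelConfig())` reports `mode` as `"selective"` rather than leaving `ac` as `None` and resolving it later inside the trainer.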

Member

@samsja samsja left a comment

I don't like the magic logic that auto-selects AC even when the CLI doesn't enable it. Let's just enable AC by default instead?

3 participants