fix(checkpointer): use dp_reshardable sharding type for megatron-core >=0.11 #1344

Open
theNefelibata wants to merge 3 commits into areal-project:main from theNefelibata:fix/checkpointer-dp-reshardable

Conversation

theNefelibata commented May 15, 2026

Summary

  • megatron-core >=0.11 removed flattened_range support in ShardedTensor.validate_metadata_integrity(), but the default sharding type (fully_sharded_model_space) still sets flattened_range, causing checkpoint save/load to fail
  • Switch to dp_reshardable sharding type which does not rely on flattened_range

Closes #1343

Test plan

  • Verify checkpoint save/load works with megatron-core >=0.11
  • Verify backward compatibility with older megatron-core versions

… >=0.11

megatron-core >=0.11 removed flattened_range support in
ShardedTensor.validate_metadata_integrity(), but the default sharding
type (fully_sharded_model_space) still sets flattened_range, causing
save/load to fail. Switch to dp_reshardable which does not rely on
flattened_range.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread areal/engine/megatron_utils/checkpointer.py Outdated
theNefelibata and others added 2 commits May 18, 2026 19:50
Move the hard-coded 'dp_reshardable' sharding type into
MegatronEngineConfig so users with legacy checkpoints saved
under 'fully_sharded_model_space' can load them by flipping the
config instead of patching the source.

Key changes:

- Add distrib_optim_sharding_type field to MegatronEngineConfig
  (default 'dp_reshardable')
- Plumb the value through MegatronCheckpointManager and use it in
  generate_state_dict instead of the hard-coded string
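
For readers following along, the plumbing described above might look roughly like the sketch below. This is not the code from this PR. MegatronEngineConfigSketch, CheckpointManagerSketch, and FakeOptimizer are hypothetical stand-ins that keep the example runnable without megatron-core installed. Only the field name distrib_optim_sharding_type, its default, and the two sharding-type strings come from the commit message. The shape of the optimizer call is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MegatronEngineConfigSketch:
    """Hypothetical stand-in for MegatronEngineConfig; only the new field is shown."""

    # Default from the commit message. Set to "fully_sharded_model_space"
    # to load legacy checkpoints saved under the old sharding type.
    distrib_optim_sharding_type: str = "dp_reshardable"


class FakeOptimizer:
    """Stand-in for a distributed optimizer; it only records the sharding type."""

    def sharded_state_dict(self, model_state, sharding_type):
        # Assumed call shape: the real optimizer would build sharded tensors here.
        return {"sharding_type": sharding_type, "param_names": sorted(model_state)}


class CheckpointManagerSketch:
    """Hypothetical stand-in for MegatronCheckpointManager."""

    def __init__(self, optimizer, sharding_type):
        self.optimizer = optimizer
        # Value plumbed from the config instead of a hard-coded string.
        self.sharding_type = sharding_type

    def generate_state_dict(self, model_state):
        return {
            "model": model_state,
            "optimizer": self.optimizer.sharded_state_dict(
                model_state, sharding_type=self.sharding_type
            ),
        }


def build_manager(config):
    """Wire the config value into the manager (illustrative helper)."""
    return CheckpointManagerSketch(FakeOptimizer(), config.distrib_optim_sharding_type)


if __name__ == "__main__":
    new_style = build_manager(MegatronEngineConfigSketch())
    legacy = build_manager(
        MegatronEngineConfigSketch(distrib_optim_sharding_type="fully_sharded_model_space")
    )
    print(new_style.generate_state_dict({"w": 0})["optimizer"]["sharding_type"])
    print(legacy.generate_state_dict({"w": 0})["optimizer"]["sharding_type"])
```

With this layout, loading a legacy checkpoint becomes a one-field config change, which is the stated goal of the second commit.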

Development

Successfully merging this pull request may close these issues.

[Feature] trajectory dump/replay for offline training debugging

2 participants