MagellaX

Summary

Adds a lightweight validation layer and a configuration summary printed at startup, inspired by GPT-NeoX, resolving Issue #124.

Key features

  • megatron/arguments.py
    • _validate_and_summarize_args(args) — runs sanity checks:
      • hidden_size % num_attention_heads == 0
      • global_batch_size % data_parallel_size == 0
      • pad_vocab_size_to (if set) divisible by the tensor-parallel (TP) size
      • fp16 / bf16 mutual exclusion enforced
    • Builds a rank-0 console table summarising world-size layout, model dims, batch sizes, precision, and passes.
    • Raises ValueError if any rule fails, aborting early before costly init.

Why it matters

Early mis-configs (e.g., mismatched hidden/head sizes or bad batch divisibility) now surface instantly, saving hours of debugging and wasted GPU time.

Testing

  • pytest -q tests — all existing tests pass.
  • Launched pretrain_gpt_tiny.sh in 1-GPU and 4-GPU runs; the summary appears once, on rank 0.
  • Introduced an invalid hidden_size (not divisible by num_attention_heads); the run aborts immediately with a clear error.
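The invalid-hidden_size case can be reproduced in isolation; the numbers below are illustrative (any hidden_size not divisible by the head count behaves the same):

```python
# Illustrative fail-fast check mirroring the invalid-hidden_size test above;
# this is the divisibility rule in isolation, not the PR's exact code.
def check_divisibility(hidden_size, num_attention_heads):
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) is not divisible by "
            f"num_attention_heads ({num_attention_heads})")


check_divisibility(768, 12)       # 768 / 12 = 64 per head: passes

try:
    check_divisibility(1000, 12)  # 1000 / 12 is not integral: aborts
except ValueError as exc:
    print(f"aborted early: {exc}")
```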

Backward compatibility

Purely additive logging/validation. No impact on training logic or performance.


Fixes #124
