
Reorg: de-number dirs + restructure examples (Option B) [DO NOT MERGE — HyperPod blocker]#1119

Draft: KeitaW wants to merge 13 commits into main from worktree-repo-reorg


KeitaW (Collaborator) commented Jun 3, 2026
Summary

Implements the repository reorganization from #1056 (Option B was selected in
the RFC). Two parts, both via git mv to preserve history:

  1. Remove numeric prefixes from all directories.
  2. Restructure examples/ into framework-centric (training/, inference/)
    and use-case-centric (use-cases/) trees.

The awsome-inference merge is not in this PR (deferred to a follow-up).

Caution

DO NOT MERGE until the HyperPod service blocker is cleared (see below).

Part 1 — de-numbering

| Current | New |
| --- | --- |
| 0.docs/ | assets/ (renamed from the RFC's docs/; it holds only diagram images + editable sources, no prose) |
| 1.architectures/ | architectures/ (merged into the existing bare dir from #1109) |
| 2.ami_and_containers/ | ami_and_containers/ |
| 3.test_cases/ | examples/ |
| 4.validation_and_observability/ | validation_and_observability/ |
| micro-benchmarks/ | unchanged |

architectures/ subdirs de-numbered; 5.sagemaker-hyperpod →
sagemaker-hyperpod-slurm, 7.sagemaker-hyperpod-eks → sagemaker-hyperpod-eks.
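The Part 1 mapping can be sketched as a small path function. This is an illustration only, not the tooling used in the PR (the renames were done with git mv). The helper names `strip_prefix` and `denumber` are hypothetical, and the sketch assumes only the top-level segment and the first segment under architectures/ are touched, as in the table above.

```python
import re

# Top-level renames that are more than a plain prefix strip (from the table above).
TOP_LEVEL_SPECIAL = {
    "0.docs": "assets",
    "3.test_cases": "examples",
}
# architectures/ subdirectory that gained an explicit orchestrator suffix.
ARCH_SPECIAL = {
    "5.sagemaker-hyperpod": "sagemaker-hyperpod-slurm",
}
# One or more digits followed by a dot at the start of a path segment.
NUMERIC_PREFIX = re.compile(r"^\d+\.")


def strip_prefix(segment: str) -> str:
    """Drop a leading 'N.' prefix; segments without one are returned unchanged."""
    return NUMERIC_PREFIX.sub("", segment)


def denumber(path: str) -> str:
    """Map an old numbered path to its de-numbered location (hypothetical helper)."""
    parts = path.split("/")
    parts[0] = TOP_LEVEL_SPECIAL.get(parts[0], strip_prefix(parts[0]))
    # Only the first level under architectures/ is de-numbered in this sketch.
    if len(parts) > 1 and parts[0] == "architectures":
        parts[1] = ARCH_SPECIAL.get(parts[1], strip_prefix(parts[1]))
    return "/".join(parts)


print(denumber("1.architectures/5.sagemaker-hyperpod/LifecycleScripts"))
```

Because the pattern requires a dot after the digits, a name such as 1click-dashboards-deployment passes through unchanged.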

Part 2 — examples/ restructure (Option B)

Flattened the pytorch/ and megatron/ accelerator axis. Placement follows the RFC
rule (framework-with-swappable-models → training/; single model/task is the
subject → use-cases/).

  • examples/training/ — fsdp, deepspeed, torchtitan, trl, picotron, verl,
    openrlhf, mosaicml-composer, neuronx-distributed, optimum-neuron, nvrx, ddp,
    megatron-lm, nemo, nemo-rl, nemo1.0, bionemo, jax
  • examples/use-cases/ — detr-finetune, nanovlm, isaac-lab, vjepa2,
    vjepa2.1, llm-distillation, esm2-hyperpod
  • examples/inference/ — placeholder (stub README) for the future
    awsome-inference merge

Names normalized to lowercase-kebab. bionemo placed under training/ (it's a
NeMo-based framework) per maintainer decision.
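As a rough illustration of the placement rule, the lookup below encodes the two lists from this description and returns the destination directory for a given example. The function names `place` and `is_lowercase_kebab` are hypothetical. The naming pattern is an assumption that allows dots so that versioned names such as nemo1.0 and vjepa2.1 pass.

```python
import re

# Framework-centric examples (swappable models), as listed in this PR.
TRAINING = {
    "fsdp", "deepspeed", "torchtitan", "trl", "picotron", "verl",
    "openrlhf", "mosaicml-composer", "neuronx-distributed", "optimum-neuron",
    "nvrx", "ddp", "megatron-lm", "nemo", "nemo-rl", "nemo1.0", "bionemo", "jax",
}
# Examples where a single model or task is the subject.
USE_CASES = {
    "detr-finetune", "nanovlm", "isaac-lab", "vjepa2", "vjepa2.1",
    "llm-distillation", "esm2-hyperpod",
}
# Assumed naming pattern: lowercase alphanumerics joined by '-' or '.'.
KEBAB = re.compile(r"[a-z0-9]+(?:[-.][a-z0-9]+)*")


def is_lowercase_kebab(name: str) -> bool:
    """Check a directory name against the assumed lowercase-kebab pattern."""
    return KEBAB.fullmatch(name) is not None


def place(name: str) -> str:
    """Return the Option B destination for a listed example (hypothetical helper)."""
    if name in TRAINING:
        return f"examples/training/{name}"
    if name in USE_CASES:
        return f"examples/use-cases/{name}"
    raise KeyError(f"{name!r} is not listed in this PR; placement needs a maintainer call")


print(place("bionemo"))        # examples/training/bionemo
print(place("detr-finetune"))  # examples/use-cases/detr-finetune
```

Raising on unknown names rather than guessing mirrors how bionemo was handled here: its placement was decided by a maintainer, not derived from the name.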

Commits (move/edit split kept throughout)

  1. de-numbering git mv (1084 files, 0 content changes)
  2. de-numbering reference updates
  3. Option B git mv (475 renames)
  4. Option B reference updates + README "Examples" section rewrite

Moves are committed separately from edits so git log --follow rename detection
stays intact.

⚠️ Merge blocker (RFC hard gate)

The SageMaker HyperPod console reads lifecycle scripts directly from
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/... on main. This PR
changes that path. Per #1056 it must not land until:

  1. A pre-reorg release tag (e.g. v2.0.0-pre-reorg) is cut. Done: v2.0.0-pre-reorg snapshots main (commit 0edaf15, numbered structure intact).
  2. ⬜ The HyperPod service team pins the console CloudFormation to that tag. (remaining blocker)

Verification

  • All moves register as R (rename), zero delete+add.
  • Zero surviving references to renamed directories repo-wide (only external URLs
    like kubeflow's examples/pytorch/... remain, correctly untouched).
  • CI path filters repointed (examples/training/fsdp/**, examples/training/megatron-lm/**, …).
  • Relative-link resolver run across all moved READMEs. The only ../-depth
    breakage caused by this PR (detr-finetune, which changed depth) is fixed.
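The first check above can be approximated by parsing the output of `git diff --name-status -M` for a move-only commit. The sketch below is an assumption about how such a check might look, not the script used for this PR. The function name `non_rename_entries` and the sample paths are hypothetical. It treats anything other than R100 (a rename with identical content) as an offender.

```python
from typing import List


def non_rename_entries(name_status: str) -> List[str]:
    """Return lines of `git diff --name-status -M` output that are not pure renames.

    Git reports a rename as 'R<score>' followed by the old and new paths,
    separated by tabs; a score of 100 means the content is unchanged.
    """
    offenders = []
    for line in name_status.splitlines():
        if not line.strip():
            continue
        status = line.split("\t", 1)[0]
        if status != "R100":
            offenders.append(line)
    return offenders


# Illustrative output only; these paths are made up for the example.
sample = (
    "R100\told/dir/README.md\tnew/dir/README.md\n"
    "D\told/dir/removed.txt\n"
    "A\tnew/dir/added.txt\n"
    "R087\told/dir/run.sh\tnew/dir/run.sh\n"
)
for entry in non_rename_entries(sample):
    print(entry)
```

An empty result for the move commits is consistent with the "0 content changes" claim. The follow-up edit commits are expected to show M entries and would be checked separately.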

Note: pre-existing broken links (NOT touched)

The reorg surfaces ~20 relative links that were already broken on main:
e.g. many examples/.../README.md use ../../architectures where the depth
requires ../../../architectures, and nemo1.0's 1.bmk-pretrain-*.sh targets
don't exist. These are left untouched (out of scope / surgical changes). Happy to
fix them in a follow-up now that paths are stable; flagging for a maintainer call.
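The depth arithmetic behind these links can be checked with the standard library. The sketch below assumes repo-relative POSIX paths. The helper names `link_to` and `resolve` are hypothetical and are not the resolver that was run for this PR.

```python
import posixpath


def link_to(target: str, readme_dir: str) -> str:
    """Relative link a README in readme_dir needs in order to reach target."""
    return posixpath.relpath(target, start=readme_dir)


def resolve(readme_dir: str, link: str) -> str:
    """Repo-relative path that a relative link actually points at."""
    return posixpath.normpath(posixpath.join(readme_dir, link))


# A README three levels deep needs three '../' segments to reach the repo root.
print(link_to("architectures", "examples/training/fsdp"))

# Two segments stop one level short, so the link lands inside examples/.
print(resolve("examples/training/fsdp", "../../architectures"))

# detr-finetune moved from depth 4 to depth 3, so its links lose one '../'.
print(link_to("architectures", "examples/pytorch/ddp/detr-finetune"))
print(link_to("architectures", "examples/use-cases/detr-finetune"))
```

Comparing `resolve` output against the set of paths that exist in the tree is one way to separate links with the wrong depth from links whose targets are missing.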

Refs #1056

KeitaW added 4 commits June 3, 2026 05:35
De-number all top-level and architectures/ subdirectories via git mv to
preserve history. No content changes in this commit.

Top-level:
  0.docs/                        -> docs/
  1.architectures/               -> architectures/ (merged into existing dir)
  2.ami_and_containers/          -> ami_and_containers/
  3.test_cases/                  -> examples/
  4.validation_and_observability/ -> validation_and_observability/
  micro-benchmarks/              (unchanged)

architectures/ subdirs:
  0.common, 1.vpc_network, 2.aws-parallelcluster, 3.aws-batch,
  4.amazon-eks, 6.ldap_server, 8.accounting-database -> de-numbered
  5.sagemaker-hyperpod    -> sagemaker-hyperpod-slurm (explicit orchestrator)
  7.sagemaker-hyperpod-eks -> sagemaker-hyperpod-eks (de-numbered)

Refs #1056
Follow-up to the git mv commit. Updates README links, CI path filters
and workflow scripts, CODEOWNERS, PR template, Makefiles, shell/sbatch
launch scripts, and docs to point at the new de-numbered directory paths
(e.g. 3.test_cases/ -> examples/, 5.sagemaker-hyperpod -> sagemaker-hyperpod-slurm).

Committed separately from the renames so git rename detection (--follow)
stays intact on the moved files.

Refs #1056
Flatten the pytorch/megatron accelerator axis and reorganize examples by
the RFC #1056 Option B rule (git mv only; no content changes):

- Framework-centric demos -> examples/training/<framework>/
  fsdp, deepspeed, torchtitan, trl, picotron, verl, openrlhf,
  mosaicml-composer, neuronx-distributed, optimum-neuron, nvrx, ddp,
  megatron-lm, nemo, nemo-rl, nemo1.0, bionemo, jax
- Model/task-centric demos -> examples/use-cases/<name>/
  detr-finetune, nanovlm, isaac-lab, vjepa2, vjepa2.1,
  llm-distillation, esm2-hyperpod
- examples/inference/ added as a placeholder for the future
  awsome-inference merge.

Names normalized to lowercase-kebab. bionemo placed under training/
(it is a NeMo-based framework) per maintainer decision.

Refs #1056
Follow-up to the Option B git mv commit. Per-path reference updates
(not a token swap, since paths drop a segment and change case):

- README links, CI path filters and workflow scripts, shell/sbatch
  launch scripts, and docs repointed:
  examples/pytorch/<fw>  -> examples/training/<fw>
  examples/megatron/<fw> -> examples/training/<fw>
  examples/jax           -> examples/training/jax
  examples/pytorch/ddp/detr-finetune -> examples/use-cases/detr-finetune
  examples/23.SMHP-esm2  -> examples/use-cases/esm2-hyperpod  (etc.)
- Rewrote the root README 'Examples' section to describe the new
  training / inference / use-cases axes (prose, not a rename).
- Fixed relative ../architectures links in detr-finetune, whose
  directory depth changed (moved from pytorch/ddp/detr-finetune to
  use-cases/detr-finetune).

Pre-existing broken links on main (e.g. FSDP's ../../architectures,
nemo1.0's missing 1.bmk-*.sh targets) are left untouched — out of scope.

Refs #1056
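Why a token swap is not enough can be seen in a small longest-prefix rewrite. The sketch below is illustrative, not the procedure used for these commits. It covers only the mappings named in the commit message above, leaves out case normalization, and uses a hypothetical `rewrite` helper. The more specific detr-finetune rule has to win over the generic examples/pytorch rule, and matches are anchored at path-segment boundaries.

```python
# (old prefix, new prefix) pairs taken from the commit message above.
REWRITES = [
    ("examples/pytorch", "examples/training"),
    ("examples/megatron", "examples/training"),
    ("examples/jax", "examples/training/jax"),
    ("examples/pytorch/ddp/detr-finetune", "examples/use-cases/detr-finetune"),
    ("examples/23.SMHP-esm2", "examples/use-cases/esm2-hyperpod"),
]


def rewrite(path: str) -> str:
    """Apply the longest matching prefix rule; unmatched paths are returned as-is."""
    for old, new in sorted(REWRITES, key=lambda rule: len(rule[0]), reverse=True):
        # Match whole path segments only, so 'examples/jaxlib' is not caught by 'examples/jax'.
        if path == old or path.startswith(old + "/"):
            return new + path[len(old):]
    return path


print(rewrite("examples/pytorch/ddp/detr-finetune/README.md"))
print(rewrite("examples/pytorch/ddp"))
print(rewrite("examples/jax"))
```

Sorting by prefix length keeps the rule table order-independent, so adding another special case does not require placing it carefully in the list.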
KeitaW changed the title from "Reorg (1/N): remove numeric directory prefixes [DO NOT MERGE — HyperPod blocker]" to "Reorg: de-number dirs + restructure examples (Option B) [DO NOT MERGE — HyperPod blocker]" on Jun 3, 2026
KeitaW added 9 commits June 3, 2026 07:32
These links were already broken on main (independent of the reorg) and are
now corrected:

- 16 links with the wrong number of ../ to reach repo-root architectures/
  (and one ami_and_containers/) — e.g. examples/training/fsdp/README.md used
  ../../architectures where the directory depth requires ../../../architectures.
- 2 links in examples/training/nemo/PERFORMANCE.md pointing at ../Dockerfile and
  ../slurm/README.md (one level too high) — the targets are in the same dir.
- 1 malformed URL in architectures/amazon-eks/README.md: [eksctl](eksctl.io)
  -> [eksctl](https://eksctl.io) (line 11 already used the correct form).

Each corrected target was verified to resolve to an existing file/dir.

Not fixed (genuinely missing targets / content gaps, need author input):
nemo1.0 README's 1.bmk-pretrain-gpt3-{5b,40b,175b}.sh, verl observability
img/ray-dashboard.png, and a few missing files/dirs/placeholders under
architectures/.

Refs #1056
…servability/ (git mv only)

Extends the de-numbering to the internal subdirectories of these two
top-level dirs (previously left numbered as out-of-RFC-scope; now
requested). git mv only, no content changes:

  ami_and_containers/1.amazon_machine_image -> amazon_machine_image
  ami_and_containers/3.pcluster_create_dlami -> pcluster_create_dlami
  validation_and_observability/1.pytorch-env-validation -> pytorch-env-validation
  validation_and_observability/2.gpu-cluster-healthcheck -> gpu-cluster-healthcheck
  validation_and_observability/3.efa-node-exporter -> efa-node-exporter
  validation_and_observability/4.prometheus-grafana -> prometheus-grafana
  validation_and_observability/5.nsight -> nsight

(prometheus-grafana/1click-dashboards-deployment is unchanged — '1click'
is a word, not a numeric prefix.)

Refs #1056
Follow-up to the git mv commit. Repoints README links/tables, .gitignore,
buildspec, and cross-references to the de-numbered subdirectory paths:

  ami_and_containers/{1.amazon_machine_image,3.pcluster_create_dlami}
  validation_and_observability/{1.pytorch-env-validation,
    2.gpu-cluster-healthcheck,3.efa-node-exporter,4.prometheus-grafana,5.nsight}

Includes the nemo1.0 README link to ami_and_containers/amazon_machine_image.

Refs #1056
The directory contains only diagram images and editable diagram sources
(.png, .graffle, .pptx, .drawio) referenced by architecture READMEs -- no
prose documentation. 'assets/' describes the contents accurately (covers
both rendered images and editable sources). git mv only.

Refs #1056
Updates the 5 architecture READMEs that embed diagrams to the new
assets/ path (../../docs/<img> -> ../../assets/<img>).

Refs #1056
# Conflicts:
#	architectures/sagemaker-hyperpod-slurm/LifecycleScripts/base-config/utils/create_users.sh
Follow-up to the origin/main merge. PR #1110 (DeepSeek-V3 disaggregated
inference with vLLM/UCCL-EP/NIXL) merged to main after this branch was cut,
landing at the old numbered path 3.test_cases/pytorch/vllm/. Per Option B it
is inference, so it now lives at examples/inference/vllm/dsv3-uccl-nixl/.

- Fixed a stale 'cd 3.test_cases/pytorch/vllm/...' path in the example README.
- Replaced the examples/inference/ placeholder stub with an index listing the
  vLLM example.
- Updated the root README examples tree (inference no longer a placeholder).

Refs #1056
…ibuted-ai

The repo was renamed from awsome-distributed-training to awsome-distributed-ai.
Updates 156 stale references across 75 files: github.com/raw.githubusercontent
URLs, git clone URLs, and 'cd'/filesystem path references to the cloned repo.

IMPORTANT: deliberately preserves the 14 'awsome-distributed-training.s3.amazonaws.com'
references. That is an S3 *bucket* name (hosts the CloudFormation templates for the
1-click deploy buttons), not a repo reference. Verified via HTTP that the legacy
templates (Vpc.yaml, FSxLustre.yaml, parallelcluster-prerequisites.yaml, etc.) exist
ONLY in the old bucket (200) and return 403 in the new awsome-distributed-ai bucket --
renaming them would break the deploy links. The newer PCS templates already correctly
use the awsome-distributed-ai bucket.

Refs #1056
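One way to express the repo-versus-bucket distinction is a regular expression with a negative lookahead. This is a sketch under stated assumptions, not the method used for the 156 updates. The function name `rename_repo_refs` and the example strings are hypothetical. It only protects the virtual-hosted form `<bucket>.s3.amazonaws.com` quoted in the commit message.

```python
import re

OLD_NAME = "awsome-distributed-training"
NEW_NAME = "awsome-distributed-ai"

# Match the old repo name unless it is immediately followed by the S3 host suffix,
# in which case it is a bucket name and must stay as-is.
REPO_REF = re.compile(re.escape(OLD_NAME) + r"(?!\.s3\.amazonaws\.com)")


def rename_repo_refs(text: str) -> str:
    """Rewrite repo references to the new name while leaving S3 bucket hosts alone."""
    return REPO_REF.sub(NEW_NAME, text)


# Illustrative strings; the URL path below is made up for the example.
print(rename_repo_refs("cd awsome-distributed-training/examples"))
print(rename_repo_refs("https://awsome-distributed-training.s3.amazonaws.com/Vpc.yaml"))
```

Other bucket reference styles (for example a path-style URL with the bucket name after the host) are not covered by this pattern and would need their own exclusion.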
…S3 bucket

Uploaded the latest repo versions of 7 templates to the renamed
awsome-distributed-ai S3 bucket (account 159553542841) with public-read /
text/yaml to match the existing PCS templates, and repointed their 1-click
deploy links from the old awsome-distributed-training bucket:

  0.aws-batch-distributed-training.yaml, 0.private-bucket.yaml,
  parallelcluster-prerequisites{,-p1}.yaml, studio-slurm.yaml,
  cluster-observability{,-os-grafana}.yaml

All 7 verified to return HTTP 200 from the new bucket.

Vpc.yaml and FSxLustre.yaml are intentionally left pointing at the old
awsome-distributed-training bucket: their repo source has diverged
substantially with no publish manifest, and the canonical source is unconfirmed.
They stay on the proven-working old-bucket objects until that is resolved.

Refs #1056