Reorg: de-number dirs + restructure examples (Option B) [DO NOT MERGE — HyperPod blocker]#1119
Draft
KeitaW wants to merge 13 commits into
De-number all top-level and architectures/ subdirectories via git mv to
preserve history. No content changes in this commit.

Top-level:
  0.docs/ -> docs/
  1.architectures/ -> architectures/ (merged into existing dir)
  2.ami_and_containers/ -> ami_and_containers/
  3.test_cases/ -> examples/
  4.validation_and_observability/ -> validation_and_observability/
  micro-benchmarks/ (unchanged)

architectures/ subdirs:
  0.common, 1.vpc_network, 2.aws-parallelcluster, 3.aws-batch, 4.amazon-eks,
  6.ldap_server, 8.accounting-database -> de-numbered
  5.sagemaker-hyperpod -> sagemaker-hyperpod-slurm (explicit orchestrator)
  7.sagemaker-hyperpod-eks -> sagemaker-hyperpod-eks (de-numbered)

Refs #1056
Follow-up to the git mv commit. Updates README links, CI path filters and workflow scripts, CODEOWNERS, PR template, Makefiles, shell/sbatch launch scripts, and docs to point at the new de-numbered directory paths (e.g. 3.test_cases/ -> examples/, 5.sagemaker-hyperpod -> sagemaker-hyperpod-slurm). Committed separately from the renames so git rename detection (--follow) stays intact on the moved files. Refs #1056
Flatten the pytorch/megatron accelerator axis and reorganize examples by the
RFC #1056 Option B rule (git mv only; no content changes):
- Framework-centric demos -> examples/training/<framework>/
  fsdp, deepspeed, torchtitan, trl, picotron, verl, openrlhf,
  mosaicml-composer, neuronx-distributed, optimum-neuron, nvrx, ddp,
  megatron-lm, nemo, nemo-rl, nemo1.0, bionemo, jax
- Model/task-centric demos -> examples/use-cases/<name>/
  detr-finetune, nanovlm, isaac-lab, vjepa2, vjepa2.1, llm-distillation,
  esm2-hyperpod
- examples/inference/ added as a placeholder for the future awsome-inference
  merge.

Names normalized to lowercase-kebab. bionemo placed under training/ (it is a
NeMo-based framework) per maintainer decision.

Refs #1056
Follow-up to the Option B git mv commit. Per-path reference updates (not a
token swap, since paths drop a segment and change case):
- README links, CI path filters and workflow scripts, shell/sbatch launch
  scripts, and docs repointed:
    examples/pytorch/<fw> -> examples/training/<fw>
    examples/megatron/<fw> -> examples/training/<fw>
    examples/jax -> examples/training/jax
    examples/pytorch/ddp/detr-finetune -> examples/use-cases/detr-finetune
    examples/23.SMHP-esm2 -> examples/use-cases/esm2-hyperpod
    (etc.)
- Rewrote the root README 'Examples' section to describe the new
  training / inference / use-cases axes (prose, not a rename).
- Fixed relative ../architectures links in detr-finetune, whose directory
  depth changed (moved from pytorch/ddp/detr-finetune to
  use-cases/detr-finetune).

Pre-existing broken links on main (e.g. FSDP's ../../architectures, nemo1.0's
missing 1.bmk-*.sh targets) are left untouched; they are out of scope.

Refs #1056
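The per-path mapping described in this commit message can be sketched as an ordered prefix table. This is only an illustration, not the tooling used for the PR: the names `RULES` and `remap` are hypothetical, only the mappings spelled out in the message are listed (the "(etc.)" entries are not filled in), and case normalization is not handled.

```python
# Hypothetical sketch of a per-path remap. Rules are ordered most-specific
# first, so the detr-finetune move wins over the generic pytorch/ rule.
RULES = [
    ("examples/pytorch/ddp/detr-finetune", "examples/use-cases/detr-finetune"),
    ("examples/23.SMHP-esm2", "examples/use-cases/esm2-hyperpod"),
    ("examples/pytorch/", "examples/training/"),
    ("examples/megatron/", "examples/training/"),
    ("examples/jax", "examples/training/jax"),
]


def remap(path: str) -> str:
    """Return the post-reorg path for a repo-root-relative path."""
    for old, new in RULES:
        # Match whole path segments only, so "examples/jaxlib" is not
        # treated as being under "examples/jax".
        prefix = old if old.endswith("/") else old + "/"
        if path == old or path.startswith(prefix):
            return new + path[len(old):]
    return path  # no rule applies: leave the path alone


print(remap("examples/pytorch/fsdp/README.md"))
print(remap("examples/pytorch/ddp/detr-finetune/README.md"))
```

Ordering the table by specificity is what makes this more than a token swap: the same `examples/pytorch/` prefix lands in `training/` for most frameworks but in `use-cases/` for detr-finetune.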
These links were already broken on main (independent of the reorg) and are
now corrected:
- 16 links with the wrong number of ../ to reach repo-root architectures/
  (and one ami_and_containers/). For example,
  examples/training/fsdp/README.md used ../../architectures where the
  directory depth requires ../../../architectures.
- 2 links in examples/training/nemo/PERFORMANCE.md pointing at ../Dockerfile
  and ../slurm/README.md (one level too high); the targets are in the same
  dir.
- 1 malformed URL in architectures/amazon-eks/README.md:
  [eksctl](eksctl.io) -> [eksctl](https://eksctl.io) (line 11 already used
  the correct form).

Each corrected target was verified to resolve to an existing file/dir.

Not fixed (genuinely missing targets / content gaps, need author input):
nemo1.0 README's 1.bmk-pretrain-gpt3-{5b,40b,175b}.sh, verl observability
img/ray-dashboard.png, and a few missing files/dirs/placeholders under
architectures/.

Refs #1056
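The `../` depth for a link like the FSDP one can be computed rather than counted by hand. A minimal sketch, assuming both paths are given relative to the repo root; the helper name `relative_link` is hypothetical and not part of the repository.

```python
import posixpath


def relative_link(readme_path: str, target: str) -> str:
    """Relative link from the directory holding readme_path to target.

    Both arguments are repo-root-relative POSIX paths.
    """
    start = posixpath.dirname(readme_path)
    return posixpath.relpath(target, start=start)


# The FSDP README sits three directories below the repo root.
print(relative_link("examples/training/fsdp/README.md", "architectures"))
# A target in the same directory needs no "../" at all.
print(relative_link("examples/training/nemo/PERFORMANCE.md",
                    "examples/training/nemo/Dockerfile"))
```

This only computes the expected link text; checking that the target actually exists in the working tree is a separate step.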
…servability/ (git mv only)

Extends the de-numbering to the internal subdirectories of these two
top-level dirs (previously left numbered as out-of-RFC-scope; now requested).
git mv only, no content changes:
  ami_and_containers/1.amazon_machine_image -> amazon_machine_image
  ami_and_containers/3.pcluster_create_dlami -> pcluster_create_dlami
  validation_and_observability/1.pytorch-env-validation -> pytorch-env-validation
  validation_and_observability/2.gpu-cluster-healthcheck -> gpu-cluster-healthcheck
  validation_and_observability/3.efa-node-exporter -> efa-node-exporter
  validation_and_observability/4.prometheus-grafana -> prometheus-grafana
  validation_and_observability/5.nsight -> nsight

(prometheus-grafana/1click-dashboards-deployment is unchanged: '1click' is a
word, not a numeric prefix.)

Refs #1056
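The prefix rule behind these renames can be sketched in Python. This is an illustration under stated assumptions, not the script used for the PR: `denumber` and `OVERRIDES` are hypothetical names, and a "numeric prefix" is assumed to mean one or more digits followed by a dot, which is why `1click-dashboards-deployment` is left alone.

```python
import re

# Hypothetical override table for renames from the commit messages that are
# more than a plain prefix strip.
OVERRIDES = {
    "3.test_cases": "examples",
    "5.sagemaker-hyperpod": "sagemaker-hyperpod-slurm",
}

# Assumption: a numeric prefix is digits followed by a literal dot.
_NUMERIC_PREFIX = re.compile(r"^\d+\.")


def denumber(name: str) -> str:
    """Return the de-numbered name for a single directory segment."""
    if name in OVERRIDES:
        return OVERRIDES[name]
    return _NUMERIC_PREFIX.sub("", name)


for segment in ("4.prometheus-grafana", "1click-dashboards-deployment"):
    print(segment, "->", denumber(segment))
```

Anchoring the pattern on the dot keeps names that merely start with a digit out of scope.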
Follow-up to the git mv commit. Repoints README links/tables, .gitignore,
buildspec, and cross-references to the de-numbered subdirectory paths:
ami_and_containers/{1.amazon_machine_image,3.pcluster_create_dlami}
validation_and_observability/{1.pytorch-env-validation,
2.gpu-cluster-healthcheck,3.efa-node-exporter,4.prometheus-grafana,5.nsight}
Includes the nemo1.0 README link to ami_and_containers/amazon_machine_image.
Refs #1056
The directory contains only diagram images and editable diagram sources (.png, .graffle, .pptx, .drawio) referenced by architecture READMEs -- no prose documentation. 'assets/' describes the contents accurately (covers both rendered images and editable sources). git mv only. Refs #1056
Updates the 5 architecture READMEs that embed diagrams to the new assets/ path (../../docs/<img> -> ../../assets/<img>). Refs #1056
# Conflicts:
#   architectures/sagemaker-hyperpod-slurm/LifecycleScripts/base-config/utils/create_users.sh
Follow-up to the origin/main merge. PR #1110 (DeepSeek-V3 disaggregated
inference with vLLM/UCCL-EP/NIXL) merged to main after this branch was cut,
landing at the old numbered path 3.test_cases/pytorch/vllm/. Per Option B it
is inference, so it now lives at examples/inference/vllm/dsv3-uccl-nixl/.
- Fixed a stale 'cd 3.test_cases/pytorch/vllm/...' path in the example README.
- Replaced the examples/inference/ placeholder stub with an index listing the
  vLLM example.
- Updated the root README examples tree (inference no longer a placeholder).

Refs #1056
…ibuted-ai

The repo was renamed from awsome-distributed-training to
awsome-distributed-ai. Updates 156 stale references across 75 files:
github.com/raw.githubusercontent URLs, git clone URLs, and 'cd'/filesystem
path references to the cloned repo.

IMPORTANT: deliberately preserves the 14
'awsome-distributed-training.s3.amazonaws.com' references. That is an S3
*bucket* name (hosts the CloudFormation templates for the 1-click deploy
buttons), not a repo reference. Verified via HTTP that the legacy templates
(Vpc.yaml, FSxLustre.yaml, parallelcluster-prerequisites.yaml, etc.) exist
ONLY in the old bucket (200) and return 403 in the new awsome-distributed-ai
bucket -- renaming them would break the deploy links. The newer PCS templates
already correctly use the awsome-distributed-ai bucket.

Refs #1056
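One way to rename repo references while leaving the S3 bucket host untouched is a regular expression with a negative lookahead. This is a sketch of the idea only; the commit does not say how the 156 references were actually rewritten, and `rename_repo_refs` is a hypothetical helper. It assumes the bucket always appears in the `<bucket>.s3.amazonaws.com` form quoted above, so a regional or path-style S3 URL would need an extra rule.

```python
import re

OLD = "awsome-distributed-training"
NEW = "awsome-distributed-ai"

# Negative lookahead: skip the old name when it is immediately followed by
# the S3 host suffix, i.e. when it is a bucket name rather than a repo name.
_REPO_REF = re.compile(re.escape(OLD) + r"(?!\.s3\.amazonaws\.com)")


def rename_repo_refs(text: str) -> str:
    """Rewrite repo references, preserving the legacy S3 bucket host."""
    return _REPO_REF.sub(NEW, text)


print(rename_repo_refs("cd awsome-distributed-training/examples"))
print(rename_repo_refs("awsome-distributed-training.s3.amazonaws.com"))
```

A plain string replace would have rewritten the bucket host as well, which is the breakage the commit message warns about.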
…S3 bucket
Uploaded the latest repo versions of 7 templates to the renamed
awsome-distributed-ai S3 bucket (account 159553542841) with public-read /
text/yaml to match the existing PCS templates, and repointed their 1-click
deploy links from the old awsome-distributed-training bucket:
0.aws-batch-distributed-training.yaml, 0.private-bucket.yaml,
parallelcluster-prerequisites{,-p1}.yaml, studio-slurm.yaml,
cluster-observability{,-os-grafana}.yaml
All 7 verified to return HTTP 200 from the new bucket.
Vpc.yaml and FSxLustre.yaml are intentionally left pointing at the old
awsome-distributed-training bucket: their repo source has diverged
substantially with no publish manifest, and the canonical source is unconfirmed.
They stay on the proven-working old-bucket objects until that is resolved.
Refs #1056
Summary
Implements the repository reorganization from #1056 (Option B was selected in the RFC). Two parts, both via `git mv` to preserve history:
1. De-numbering the top-level and `architectures/` directories (Part 1).
2. Restructuring `examples/` into framework-centric (`training/`, `inference/`) and use-case-centric (`use-cases/`) trees (Part 2).

The `awsome-inference` merge is not in this PR (deferred to a follow-up).

Caution: DO NOT MERGE until the HyperPod service blocker is cleared (see below).
Part 1: de-numbering

| Before | After |
| --- | --- |
| `0.docs/` | `assets/` (renamed from the RFC's `docs/`; it holds only diagram images + editable sources, no prose) |
| `1.architectures/` | `architectures/` (merged into the existing bare dir from #1109) |
| `2.ami_and_containers/` | `ami_and_containers/` |
| `3.test_cases/` | `examples/` |
| `4.validation_and_observability/` | `validation_and_observability/` |
| `micro-benchmarks/` | (unchanged) |

`architectures/` subdirs de-numbered; `5.sagemaker-hyperpod` → `sagemaker-hyperpod-slurm`, `7.sagemaker-hyperpod-eks` → `sagemaker-hyperpod-eks`.

Part 2: examples/ restructure (Option B)
Flattened the `pytorch/` / `megatron/` accelerator axis. Placement follows the RFC rule (framework-with-swappable-models → `training/`; single model/task is the subject → `use-cases/`).
- `examples/training/`: fsdp, deepspeed, torchtitan, trl, picotron, verl, openrlhf, mosaicml-composer, neuronx-distributed, optimum-neuron, nvrx, ddp, megatron-lm, nemo, nemo-rl, nemo1.0, bionemo, jax
- `examples/use-cases/`: detr-finetune, nanovlm, isaac-lab, vjepa2, vjepa2.1, llm-distillation, esm2-hyperpod
- `examples/inference/`: placeholder (stub README) for the future `awsome-inference` merge

Names normalized to lowercase-kebab. `bionemo` placed under `training/` (it's a NeMo-based framework) per maintainer decision.
Commits (move/edit split kept throughout)
- `git mv` (1084 files, 0 content changes)
- `git mv` (475 renames)

Moves are committed separately from edits so `git log --follow` rename detection stays intact.
The SageMaker HyperPod console reads lifecycle scripts directly from `1.architectures/5.sagemaker-hyperpod/LifecycleScripts/...` on `main`. This PR changes that path. Per #1056 it must not land until:
- A pre-reorg release tag (e.g. `v2.0.0-pre-reorg`) is cut. ✅ Done: `v2.0.0-pre-reorg` snapshots `main` (commit `0edaf15`, numbered structure intact).

Verification
- `R` (rename), zero delete+add.
- like kubeflow's `examples/pytorch/...` remain, correctly untouched).
- `examples/training/fsdp/**`, `examples/training/megatron-lm/**`, …).
- `../`-depth breakage caused by this PR (detr-finetune, which changed depth) is fixed.
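The rename-only check (status `R`, zero delete+add) could be automated by parsing the output of `git diff --name-status -M` for a move commit. A minimal sketch, assuming that output has already been captured as text; `non_rename_entries` is a hypothetical helper, not something shipped in this PR.

```python
def non_rename_entries(name_status: str) -> list[str]:
    """Return lines of `git diff --name-status` output that are not renames."""
    offenders = []
    for line in name_status.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status = line.split("\t", 1)[0]
        # Rename entries look like "R100<TAB>old<TAB>new": the letter R
        # followed by a similarity score.
        if not status.startswith("R"):
            offenders.append(line)
    return offenders


# Sample input in the tab-separated shape git emits (paths from this PR).
sample = (
    "R100\t3.test_cases/pytorch/FSDP/README.md\texamples/training/fsdp/README.md\n"
    "R100\t0.docs/diagram.png\tassets/diagram.png\n"
)
print(non_rename_entries(sample))
```

An empty result means every entry was detected as a rename; any `A`, `D`, or `M` line would show up in the returned list for inspection.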
Note: pre-existing broken links (NOT touched)
The reorg surfaces ~20 relative links that were already broken on `main`. For example, many `examples/.../README.md` files use `../../architectures` where the depth requires `../../../architectures`, and `nemo1.0`'s `1.bmk-pretrain-*.sh` targets don't exist. These are left untouched (out of scope / surgical changes). Happy to fix them in a follow-up now that paths are stable; flagging for a maintainer call.
Refs #1056