Motivation
The current awsome-distributed-training repository (now awsome-distributed-ai) has grown to cover a broad set of distributed ML workloads on AWS, but the directory layout carries legacy decisions that make it harder to navigate and contribute to:
- Numbered prefixes impose a rigid ordering that no longer reflects how users discover content. Directories like 1.architectures/, 2.ami_and_containers/, 3.test_cases/ suggest a sequential workflow, but most users land directly on a specific test case or architecture via search or a shared link. The numbering also creates friction when adding new top-level directories (what number should it get?).
- Training and inference are split across two separate repositories. The awsome-inference repo covers inference workloads (SGLang, TRT-LLM, NIMs, Ray Serve, Dynamo, etc.) with its own infrastructure and project structure. Maintaining two repos leads to duplicated IaC (both have VPC/EKS setup), divergent conventions, and a fragmented contributor/user experience. Users doing both training and inference on the same cluster shouldn't need two repos.
- The test_cases/ directory has two kinds of content mixed together. Some examples are framework-centric: the demo's subject is the training/inference framework (FSDP, Megatron-LM, NeMo), and model variants underneath illustrate it (pytorch/FSDP/llama3, megatron/nemo/qwen3). Others are use-case-centric: the demo's subject is a specific model or use case, and the framework is incidental (pytorch/ddp/detr-finetune is really a DETR-on-COCO fine-tuning example, not a DDP demo). Forcing both into a single framework-first hierarchy files the use-case-centric examples under the wrong axis and makes them hard to discover.
Proposed Change
1. Remove numeric prefixes from all directories
| Current | Proposed |
| --- | --- |
| 0.docs/ | docs/ |
| 1.architectures/ | architectures/ |
| 2.ami_and_containers/ | ami_and_containers/ |
| 3.test_cases/ | examples/ |
| 4.validation_and_observability/ | validation_and_observability/ |
| micro-benchmarks/ | micro-benchmarks/ (unchanged) |
Also remove numbering from subdirectories within architectures/:
| Current | Proposed |
| --- | --- |
| architectures/0.common/ | architectures/common/ |
| architectures/1.vpc_network/ | architectures/vpc_network/ |
| architectures/2.aws-parallelcluster/ | architectures/aws-parallelcluster/ |
| architectures/5.sagemaker-hyperpod/ | architectures/sagemaker-hyperpod-slurm/ |
| ... | ... |
The HyperPod directory is renamed sagemaker-hyperpod-slurm to be explicit about the orchestrator (vs sagemaker-hyperpod-eks).
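As a rough illustration of how the two rename tables could be turned into commands, the sketch below builds a list of `git mv` invocations from the mappings. The function name `plan_git_moves` and the choice to rename subdirectories before their parent are assumptions made for this example, not part of the proposal; rows elided with "..." in the table are left out rather than guessed.

```python
# Rename mappings copied from the two tables above. Rows that the proposal
# elides with "..." are intentionally omitted instead of being guessed.
TOP_LEVEL_RENAMES = {
    "0.docs": "docs",
    "1.architectures": "architectures",
    "2.ami_and_containers": "ami_and_containers",
    "3.test_cases": "examples",
    "4.validation_and_observability": "validation_and_observability",
}

ARCHITECTURE_RENAMES = {
    "0.common": "common",
    "1.vpc_network": "vpc_network",
    "2.aws-parallelcluster": "aws-parallelcluster",
    "5.sagemaker-hyperpod": "sagemaker-hyperpod-slurm",
}


def plan_git_moves(top_level, architectures):
    """Return `git mv` argument lists (hypothetical helper).

    Subdirectories are renamed while their parent still carries its old
    name, so every source path exists when its command runs.
    """
    moves = []
    for old, new in architectures.items():
        moves.append(
            ["git", "mv", f"1.architectures/{old}", f"1.architectures/{new}"]
        )
    for old, new in top_level.items():
        moves.append(["git", "mv", old, new])
    return moves


if __name__ == "__main__":
    # Print the plan for review; nothing is executed here.
    for command in plan_git_moves(TOP_LEVEL_RENAMES, ARCHITECTURE_RENAMES):
        print(" ".join(command))
```

Printing the plan instead of running it keeps the sketch safe to try in any checkout; the commands could be reviewed and then run by hand or passed to `subprocess.run`.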
2. Merge awsome-inference into this repository
Content from aws-samples/awsome-inference would be absorbed:
| awsome-inference source | Destination |
| --- | --- |
| 1.infrastructure/ | architectures/ (merge with existing infra; deduplicate VPC/EKS) |
| 2.projects/*, 3.use-cases/* | examples/inference/<engine>/ or examples/use-cases/<name>/ per the placement rule below |
After the merge, the awsome-inference repo would be archived, with its README pointing to this repository.
3. Replace test_cases/ with a unified examples/ tree
The renamed examples/ directory will house all training and inference workload examples. The directory layout under examples/ has two viable shapes that I'd like community input on. Both options drop numbering, unify fine-tuning into training/ (fine-tuning libraries like PEFT, TRL, NeMo-RL live at the same abstraction level as pre-training libraries like FSDP, DeepSpeed, Megatron-LM — both are training-engine showcases), and absorb awsome-inference content under inference/.
Option A — framework-centric only (closer to today's structure)
Organize all examples by their training/inference framework. Model- or task-anchored demos (DETR fine-tune, RAG, instruct-tuning) live under the framework subdirectory that happens to power them.
examples/
├── training/
│   ├── fsdp/
│   ├── deepspeed/
│   ├── torchtitan/
│   ├── megatron-lm/
│   ├── nemo/
│   ├── maxtext/
│   ├── peft/
│   ├── trl/
│   └── nemo-rl/
└── inference/
    ├── vllm/
    ├── sglang/
    ├── trtllm/
    ├── nim/
    ├── dynamo/
    └── ray-serve/
Pros: simpler taxonomy; single placement rule (find the framework); preserves framework-team ownership cleanly; mirrors the current test_cases/ structure for muscle memory.
Cons: use-case-anchored demos (DETR fine-tune, RAG, instruct-tuning) sit awkwardly under whichever framework happens to power them; discoverability suffers for customer-intent readers searching for "DETR" or "RAG"; multi-framework end-to-end recipes (e.g. fine-tune with TRL → serve with vLLM) have no obvious home.
Option B — framework-centric + use-case-centric
Two organizing principles sit as siblings under examples/:
- examples/{training,inference}/<framework>/<model>/ is framework-centric. The training/inference engine is the demo subject; model variants underneath illustrate it. Example: examples/training/fsdp/llama3/.
- examples/use-cases/<name>/ is use-case-centric. The model or task is the demo subject; the framework is incidental. Example: examples/use-cases/detr-finetune/.
examples/
├── training/                  # framework-centric: training/fine-tuning engines
│   ├── fsdp/
│   ├── deepspeed/
│   ├── torchtitan/
│   ├── megatron-lm/
│   ├── nemo/
│   ├── maxtext/
│   ├── peft/
│   ├── trl/
│   └── nemo-rl/
├── inference/                 # framework-centric: inference engines
│   ├── vllm/
│   ├── sglang/
│   ├── trtllm/
│   ├── nim/
│   ├── dynamo/
│   └── ray-serve/
└── use-cases/                 # use-case-centric: end-to-end
    ├── detr-finetune/
    ├── qwen3-rag/
    ├── llama3-instruct-tuning/
    └── stable-diffusion-serving/
Pros: customer-intent demos have a discoverable home; multi-framework end-to-end recipes get a clean landing spot; the two axes match the two kinds of content that already exist in the repo.
Cons: requires a placement rule for contributors (when does something belong under a framework vs use-cases/?); risks use-cases/ becoming a dumping ground for anything that doesn't obviously fit elsewhere.
Placement rule (Option B only)
- Framework is what changes between sibling directories → examples/{training,inference}/<framework>/<model>/. The model is the vehicle; swapping it gives "the same FSDP example with a different model."
- Model or task is what changes between sibling directories → examples/use-cases/<name>/. The framework is incidental; swapping it would still leave a recognizable DETR fine-tune or RAG demo.
Default to use-cases/ when ambiguous — that path is more discoverable for the customer-intent reader, and burying a use-case demo under the wrong framework directory is the more costly mistake.
Worked examples:
- A LoRA fine-tune demonstrating PEFT against three model sizes → examples/training/peft/{llama3-8b,llama3-70b,qwen3-32b}/
- A DETR object-detection fine-tune that happens to use PyTorch DDP → examples/use-cases/detr-finetune/
- A vLLM serving demo with three quantization variants → examples/inference/vllm/{llama3-fp8,llama3-int4,llama3-bf16}/
- A retrieval-augmented Qwen3 chat demo → examples/use-cases/qwen3-rag/
The three-level maximum from examples/ to any leaf (examples/training/fsdp/llama3/) also satisfies the "no deeply nested paths" rule the Frameworks team style guide calls for.
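The placement rule and the depth limit can be sketched as a small helper that a contributor (or a lint step) might use. This is only an illustration under stated assumptions: the function names `place_example` and `depth_below_examples`, and the `subject` labels "framework", "use-case", and "ambiguous", are hypothetical and not defined by the proposal. Deciding which subject applies remains a human judgment.

```python
def place_example(name, subject, kind=None, framework=None):
    """Return the proposed directory for an example under Option B.

    subject:   "framework", "use-case", or "ambiguous" (contributor's call).
    kind:      "training" or "inference"; used only for framework-centric demos.
    framework: engine directory name such as "fsdp" or "vllm".
    """
    if subject in ("use-case", "ambiguous"):
        # Ambiguous cases default to use-cases/, as the rule above states.
        return f"examples/use-cases/{name}/"
    if subject != "framework":
        raise ValueError(f"unknown subject: {subject!r}")
    if kind not in ("training", "inference") or not framework:
        raise ValueError("framework-centric demos need kind and framework")
    return f"examples/{kind}/{framework}/{name}/"


def depth_below_examples(path):
    """Count directory levels below examples/ (the rule allows at most three)."""
    parts = [part for part in path.strip("/").split("/") if part]
    return len(parts) - 1


if __name__ == "__main__":
    print(place_example("llama3-8b", "framework", "training", "peft"))
    print(place_example("detr-finetune", "use-case"))
    print(depth_below_examples("examples/training/fsdp/llama3/"))
```

Framework-centric leaves land exactly three levels below examples/, and use-case leaves two, so both stay within the three-level maximum.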
Proposed final top-level structure
The top-level layout is the same in both options; only the contents of examples/ differ (per section 3 above):
awsome-distributed-ai/
├── architectures/
│   ├── common/
│   ├── vpc_network/
│   ├── aws-parallelcluster/
│   ├── amazon-eks/
│   ├── sagemaker-hyperpod-slurm/
│   └── sagemaker-hyperpod-eks/
├── ami_and_containers/
├── examples/                  # Option A or Option B layout (see section 3)
├── validation_and_observability/
├── micro-benchmarks/
├── deprecated/
├── docs/
│   ├── CONTRIBUTING.md
│   ├── DEPRECATION_POLICY.md
│   └── architecture-decisions/
├── .github/
├── LICENSE
└── README.md
Impact & Scope
- Affected areas: Every directory, all internal links, README references, CI workflows, external documentation, blog posts, and workshop materials that reference current paths.
- Breaking changes: Yes — all directory paths change. This is a one-time, coordinated change.
- Migration needed: Yes.
- GitHub redirects do not work for path renames within a repo; only repo-level transfers get auto-redirects.
- All cross-references in READMEs, workshop guides, and external docs need updating.
- CI workflows (.github/workflows/) that use path filters will need path updates.
- git mv preserves blame history; we should avoid re-creating files from scratch.
- Consider adding a compatibility script or symlinks in a transitional release (though symlinks don't render on GitHub).
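One way the link and path-filter updates listed above might be automated is a prefix rewrite over README and workflow text. The sketch below is a minimal example under assumptions: `PATH_RENAMES` and `rewrite_paths` are hypothetical names, only the renames listed in section 1 are covered, and the deeper restructuring inside examples/ (section 3) is not handled.

```python
import re

# Old-to-new prefixes from section 1. Longer prefixes come first so that a
# nested path is rewritten in one step before its parent prefix is handled.
PATH_RENAMES = [
    ("1.architectures/5.sagemaker-hyperpod", "architectures/sagemaker-hyperpod-slurm"),
    ("1.architectures/2.aws-parallelcluster", "architectures/aws-parallelcluster"),
    ("1.architectures/1.vpc_network", "architectures/vpc_network"),
    ("1.architectures/0.common", "architectures/common"),
    ("0.docs", "docs"),
    ("1.architectures", "architectures"),
    ("2.ami_and_containers", "ami_and_containers"),
    ("3.test_cases", "examples"),
    ("4.validation_and_observability", "validation_and_observability"),
]


def rewrite_paths(text):
    """Replace old directory prefixes in text; return (new_text, count)."""
    total = 0
    for old, new in PATH_RENAMES:
        # Match whole path segments only: not preceded by a word character
        # or dot, and not followed by a word character or hyphen.
        pattern = re.compile(r"(?<![\w.])" + re.escape(old) + r"(?![\w-])")
        text, replaced = pattern.subn(new, text)
        total += replaced
    return text, total


if __name__ == "__main__":
    sample = "paths: ['3.test_cases/**', '1.architectures/0.common/**']"
    print(rewrite_paths(sample))
```

Returning the replacement count makes it easy to report which files still reference old paths before anything is written back.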
HyperPod Console dependency (hard blocker)
The SageMaker HyperPod console (more specifically, its CloudFormation templates, here and here) reads lifecycle scripts directly from this repository using the current directory paths (e.g., 1.architectures/5.sagemaker-hyperpod/...). Renaming these directories would break the production HyperPod service's ability to locate and execute lifecycle scripts.
The proposed approach uses releases as stable reference points to decouple the service from main branch paths during the transition:
| Step | Action | Outcome |
| --- | --- | --- |
| 1. Create a release | Tag the current state of the repo (e.g., v2.0.0-pre-reorg) before any renames | Provides a permanent, immutable snapshot with the old directory structure that the service can pin to |
| 2. Update service → pinned release | HyperPod team updates the console CloudFormation templates to reference lifecycle scripts from the release tag instead of main | Service is decoupled from main; renames on main no longer affect production |
| 3. Reorganize the repository | Perform all renames, merges, and restructuring on main | main now has the new directory structure; the pinned release is unaffected |
| 4. Update service → new paths | HyperPod team updates the console to reference the new paths on main (or a post-reorg release tag) | Service is back on main with the new structure |
This approach ensures zero downtime for HyperPod users — at no point are the paths the service references invalid. The release tag serves as a safety net: even if step 4 is delayed, the service continues to work against the pinned release.
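The ordering argument can be checked with a toy model of the four steps. The sketch below is purely illustrative: it represents each git ref as a set of paths and the service as a (ref, path) pair, which is an assumption about how to model the dependency, not a description of how the console actually resolves scripts. `run_transition` and its `pin_before_rename` flag are hypothetical names.

```python
OLD_PATH = "1.architectures/5.sagemaker-hyperpod"
NEW_PATH = "architectures/sagemaker-hyperpod-slurm"
PRE_REORG_TAG = "v2.0.0-pre-reorg"


def run_transition(pin_before_rename=True):
    """Walk the four steps; record whether the service's reference resolves.

    Returns a list of (step, ref, resolves) tuples.
    """
    refs = {"main": {OLD_PATH}}   # ref name -> set of paths present at that ref
    service = ("main", OLD_PATH)  # (ref, path) the service reads from
    history = []

    def check(step):
        ref, path = service
        history.append((step, ref, path in refs[ref]))

    # Step 1: tag the current state; the tag is an immutable snapshot.
    refs[PRE_REORG_TAG] = frozenset(refs["main"])
    check(1)

    # Step 2: the service pins to the tag (skipped when the flag is False).
    if pin_before_rename:
        service = (PRE_REORG_TAG, OLD_PATH)
    check(2)

    # Step 3: renames happen on main only; the tag is untouched.
    refs["main"] = {NEW_PATH}
    check(3)

    # Step 4: the service moves to the new path on main.
    service = ("main", NEW_PATH)
    check(4)
    return history


if __name__ == "__main__":
    print(run_transition())
    print(run_transition(pin_before_rename=False))
```

With the pin in place every step resolves; skipping step 2 leaves the service pointing at a path that no longer exists on main after step 3, which is why the pin is treated as a hard blocker.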
Alternative: Migrate lifecycle scripts to sagemaker-hyperpod-cluster-setup
A cleaner long-term option is to move the lifecycle scripts out of this repository entirely and into aws/sagemaker-hyperpod-cluster-setup, which is the official repo for HyperPod cluster setup and already contains the CloudFormation templates that reference them. This would:
- Eliminate the cross-repo dependency — the scripts and the templates that reference them would live in the same repo, owned by the same team.
- Decouple this repo from production service concerns: awsome-distributed-ai becomes purely a collection of reference examples and best practices, not a dependency of a production AWS service.
- Make future reorganizations safe — no need to coordinate with the HyperPod team for future directory changes in this repo.
If this approach is chosen, the migration sequence becomes: migrate lifecycle scripts to sagemaker-hyperpod-cluster-setup → update CloudFormation templates to reference the new location → reorganize this repo freely.
Migration plan (high level)
1. Create a tracking issue for every external document/workshop that references current paths.
2. Create a release (e.g., v2.0.0-pre-reorg) to snapshot the current directory structure.
3. Coordinate with the HyperPod service team to update console CloudFormation templates to reference the release tag instead of main (hard blocker: steps 4–6 cannot proceed until this is deployed).
4. Perform all renames via git mv in a single PR to preserve history.
5. Merge awsome-inference content in a follow-up PR (or the same PR if manageable).
6. Update all internal README links, CI path filters, and Makefile references.
7. Coordinate with the HyperPod service team to update console references from the release tag to the new paths on main.
8. Archive awsome-inference with a pointer to this repo.
9. Update external docs/workshops.
Alternatives Considered
- Pure use-case-centric hierarchy (examples/<model>/{training,inference,fine-tuning}/<framework>/): Offers a better landing experience for the "I want to run Qwen3" intent, and is the orientation proposed in the Frameworks team style guide. However, framework-level changes (PyTorch upgrades, FSDP API changes) sweep N model directories instead of one subtree; framework contributors lose a codebase home; and framework-agnostic content (micro-benchmarks, NCCL/EFA validation) has no natural location. The Qwen3-scatter concern this orientation solves is better addressed by Option B's placement rule (framework-centric when the framework is the subject, use-case-centric when the model is the subject) than by inverting the whole hierarchy.
- Keep numbering, just merge inference: Solves fragmentation but retains the awkward numbering convention. Since we're already doing a disruptive rename for the merge, removing numbers at the same time minimizes total disruption.
- Merge both repos into a new repo with a new name: Avoids breaking existing links to this repo, but loses the GitHub star count, issue history, and contributor graph. The awsome-distributed-ai name is well-known enough to keep.
- Do nothing: Maintain two repos, accept the divergence. This gets worse over time as more inference content is added and infrastructure is duplicated.
Feedback Period
4 weeks
CC List