[RFC]: Repository reorganization — remove numbering, merge awsome-inference #1056

@KeitaW

Motivation

The current awsome-distributed-training repository (now awsome-distributed-ai) has grown to cover a broad set of distributed ML workloads on AWS, but the directory layout carries legacy decisions that make it harder to navigate and contribute to:

  1. Numbered prefixes impose a rigid ordering that no longer reflects how users discover content. Directories like 1.architectures/, 2.ami_and_containers/, 3.test_cases/ suggest a sequential workflow, but most users land directly on a specific test case or architecture via search or a shared link. The numbering also creates friction when adding new top-level directories (what number should it get?).

  2. Training and inference are split across two separate repositories. The awsome-inference repo covers inference workloads (SGLang, TRT-LLM, NIMs, Ray Serve, Dynamo, etc.) with its own infrastructure and project structure. Maintaining two repos leads to duplicated IaC (both have VPC/EKS setup), divergent conventions, and a fragmented contributor/user experience. Users doing both training and inference on the same cluster shouldn't need two repos.

  3. The test_cases/ directory has two kinds of content mixed together. Some examples are framework-centric — the demo's subject is the training/inference framework (FSDP, Megatron-LM, NeMo), and model variants underneath illustrate it (pytorch/FSDP/llama3, megatron/nemo/qwen3). Others are use-case-centric — the demo's subject is a specific model or use case, and the framework is incidental (pytorch/ddp/detr-finetune is really a DETR-on-COCO fine-tuning example, not a DDP demo). Forcing both into a single framework-first hierarchy buries the latter under the wrong axis and makes them undiscoverable.

Proposed Change

1. Remove numeric prefixes from all directories

| Current | Proposed |
| --- | --- |
| 0.docs/ | docs/ |
| 1.architectures/ | architectures/ |
| 2.ami_and_containers/ | ami_and_containers/ |
| 3.test_cases/ | examples/ |
| 4.validation_and_observability/ | validation_and_observability/ |
| micro-benchmarks/ | micro-benchmarks/ (unchanged) |

Also remove numbering from subdirectories within architectures/:

| Current | Proposed |
| --- | --- |
| architectures/0.common/ | architectures/common/ |
| architectures/1.vpc_network/ | architectures/vpc_network/ |
| architectures/2.aws-parallelcluster/ | architectures/aws-parallelcluster/ |
| architectures/5.sagemaker-hyperpod/ | architectures/sagemaker-hyperpod-slurm/ |
| ... | ... |

The HyperPod directory is renamed sagemaker-hyperpod-slurm to be explicit about the orchestrator (vs sagemaker-hyperpod-eks).
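The renames in the two tables above are mechanical enough to script. The following is a minimal sketch, not the project's actual tooling: the helper names (`proposed_path`, `git_mv_commands`) and the `LEAF_OVERRIDES` table are hypothetical, and only the rows listed in the tables are encoded. It strips a leading `<digits>.` from each path component, applies the two renames that do more than drop a number, and emits `git mv` commands deepest-first so that a parent is renamed only after its children.

```python
import re
import shlex

# Renames that do more than drop the numeric prefix (taken from the tables
# above). Keys are current paths; values are the new leaf directory name.
# The table name and its shape are assumptions made for this sketch.
LEAF_OVERRIDES = {
    "3.test_cases": "examples",
    "1.architectures/5.sagemaker-hyperpod": "sagemaker-hyperpod-slurm",
}

_NUMERIC_PREFIX = re.compile(r"^\d+\.")


def proposed_path(current: str) -> str:
    """Map a current relative directory path to its proposed path."""
    parts = current.strip("/").split("/")
    renamed = []
    for i, part in enumerate(parts):
        prefix = "/".join(parts[: i + 1])
        # Use the explicit override if one exists, else drop '<digits>.'.
        renamed.append(LEAF_OVERRIDES.get(prefix, _NUMERIC_PREFIX.sub("", part)))
    return "/".join(renamed)


def git_mv_commands(paths):
    """Return 'git mv' command strings, deepest directories first.

    Each command renames only the last path component, so it stays valid
    while the parent directory still carries its old name.
    """
    commands = []
    for cur in sorted(paths, key=lambda p: p.count("/"), reverse=True):
        cur = cur.strip("/")
        parent = cur.rpartition("/")[0]
        new_leaf = proposed_path(cur).rpartition("/")[2]
        dst = f"{parent}/{new_leaf}" if parent else new_leaf
        if dst != cur:
            commands.append(f"git mv {shlex.quote(cur)} {shlex.quote(dst)}")
    return commands


if __name__ == "__main__":
    for command in git_mv_commands(
        ["1.architectures", "1.architectures/0.common", "3.test_cases"]
    ):
        print(command)
```

The script only prints commands rather than running them, so the plan can be reviewed before anything is moved. Directories elided with "..." in the table are deliberately not guessed.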

2. Merge awsome-inference into this repository

Content from aws-samples/awsome-inference would be absorbed:

| awsome-inference source | Destination |
| --- | --- |
| 1.infrastructure/ | architectures/ (merge with existing infra; deduplicate VPC/EKS) |
| 2.projects/*, 3.use-cases/* | examples/inference/<engine>/ or examples/use-cases/<name>/ per the placement rule below |

After the merge, the awsome-inference repo would be archived, with its README pointing here.

3. Replace test_cases/ with a unified examples/ tree

The renamed examples/ directory will house all training and inference workload examples. The directory layout under examples/ has two viable shapes that I'd like community input on. Both options drop numbering, unify fine-tuning into training/ (fine-tuning libraries like PEFT, TRL, NeMo-RL live at the same abstraction level as pre-training libraries like FSDP, DeepSpeed, Megatron-LM — both are training-engine showcases), and absorb awsome-inference content under inference/.

Option A — framework-centric only (closer to today's structure)

Organize all examples by their training/inference framework. Model- or task-anchored demos (DETR fine-tune, RAG, instruct-tuning) live under the framework subdirectory that happens to power them.

examples/
├── training/
│   ├── fsdp/
│   ├── deepspeed/
│   ├── torchtitan/
│   ├── megatron-lm/
│   ├── nemo/
│   ├── maxtext/
│   ├── peft/
│   ├── trl/
│   └── nemo-rl/
└── inference/
    ├── vllm/
    ├── sglang/
    ├── trtllm/
    ├── nim/
    ├── dynamo/
    └── ray-serve/

Pros: simpler taxonomy; single placement rule (find the framework); preserves framework-team ownership cleanly; mirrors the current test_cases/ structure for muscle memory.

Cons: use-case-anchored demos (DETR fine-tune, RAG, instruct-tuning) sit awkwardly under whichever framework happens to power them; discoverability suffers for customer-intent readers searching for "DETR" or "RAG"; multi-framework end-to-end recipes (e.g. fine-tune with TRL → serve with vLLM) have no obvious home.

Option B — framework-centric + use-case-centric

Two organizing principles sit as siblings under examples/:

  • examples/{training,inference}/<framework>/<model>/ (framework-centric): the training/inference engine is the demo subject; model variants underneath illustrate it. Example: examples/training/fsdp/llama3/.
  • examples/use-cases/<name>/ (use-case-centric): the model or task is the demo subject; the framework is incidental. Example: examples/use-cases/detr-finetune/.

examples/
├── training/                    # framework-centric: training/fine-tuning engines
│   ├── fsdp/
│   ├── deepspeed/
│   ├── torchtitan/
│   ├── megatron-lm/
│   ├── nemo/
│   ├── maxtext/
│   ├── peft/
│   ├── trl/
│   └── nemo-rl/
├── inference/                   # framework-centric: inference engines
│   ├── vllm/
│   ├── sglang/
│   ├── trtllm/
│   ├── nim/
│   ├── dynamo/
│   └── ray-serve/
└── use-cases/                   # use-case-centric: end-to-end
    ├── detr-finetune/
    ├── qwen3-rag/
    ├── llama3-instruct-tuning/
    └── stable-diffusion-serving/

Pros: customer-intent demos have a discoverable home; multi-framework end-to-end recipes get a clean landing spot; the two axes match the two kinds of content that already exist in the repo.

Cons: requires a placement rule for contributors (when does something belong under a framework vs use-cases/?); risks use-cases/ becoming a dumping ground for anything that doesn't obviously fit elsewhere.

Placement rule (Option B only)

  • Framework is what changes between sibling directories → examples/{training,inference}/<framework>/<model>/. The model is the vehicle; swapping it gives "the same FSDP example with a different model."
  • Model or task is what changes between sibling directories → examples/use-cases/<name>/. The framework is incidental; swapping it would still leave a recognizable DETR fine-tune or RAG demo.

Default to use-cases/ when ambiguous — that path is more discoverable for the customer-intent reader, and burying a use-case demo under the wrong framework directory is the more costly mistake.

Worked examples:

  • A LoRA fine-tune demonstrating PEFT against three model sizes → examples/training/peft/{llama3-8b,llama3-70b,qwen3-32b}/
  • A DETR object-detection fine-tune that happens to use PyTorch DDP → examples/use-cases/detr-finetune/
  • A vLLM serving demo with three quantization variants → examples/inference/vllm/{llama3-fp8,llama3-int4,llama3-bf16}/
  • A retrieval-augmented Qwen3 chat demo → examples/use-cases/qwen3-rag/
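
The placement rule and the worked examples above can be expressed as a small function. This is an illustrative sketch only: the function name `placement`, its parameters, and the three `subject` labels are hypothetical and are not part of the RFC. Deciding what the subject of a demo is remains a human judgment; the function only encodes where each answer leads, including the default to use-cases/ when ambiguous.

```python
def placement(subject, *, name, kind=None, framework=None):
    """Return the Option B destination directory for an example.

    subject:   "framework" if the engine is what the demo is about,
               "use-case" if the model or task is, or "ambiguous".
    name:      model directory name (framework-centric) or use-case name.
    kind:      "training" or "inference"; required when subject is "framework".
    framework: engine directory name, e.g. "fsdp"; required when subject
               is "framework".
    """
    if subject == "framework":
        if kind not in ("training", "inference"):
            raise ValueError("kind must be 'training' or 'inference'")
        if not framework:
            raise ValueError("framework is required for framework-centric demos")
        return f"examples/{kind}/{framework}/{name}/"
    if subject in ("use-case", "ambiguous"):
        # Ambiguous demos default to use-cases/, per the rule above.
        return f"examples/use-cases/{name}/"
    raise ValueError(f"unknown subject: {subject!r}")


if __name__ == "__main__":
    print(placement("framework", name="llama3-8b", kind="training", framework="peft"))
    print(placement("use-case", name="detr-finetune"))
    print(placement("ambiguous", name="qwen3-rag"))
```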

The three-level maximum from examples/ to any leaf (examples/training/fsdp/llama3/) also satisfies the "no deeply nested paths" rule the Frameworks team style guide calls for.
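
If the three-level limit is adopted, it could be enforced with a small lint step. The sketch below assumes the check runs over a list of example directory paths relative to the repository root; the constant `MAX_DEPTH` and both function names are hypothetical, and how the list of directories is collected is left out.

```python
from pathlib import PurePosixPath

# Maximum number of directory levels below examples/, as stated above.
MAX_DEPTH = 3


def depth_below_examples(path: str) -> int:
    """Count directory levels below examples/ for a relative path."""
    parts = PurePosixPath(path).parts
    if not parts or parts[0] != "examples":
        raise ValueError(f"not under examples/: {path!r}")
    return len(parts) - 1


def too_deep(paths):
    """Return the paths that exceed MAX_DEPTH levels below examples/."""
    return [p for p in paths if depth_below_examples(p) > MAX_DEPTH]
```

For instance, `too_deep` would flag a hypothetical `examples/training/fsdp/llama3/8b/` while accepting `examples/training/fsdp/llama3/` and `examples/use-cases/detr-finetune/`.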

Proposed final top-level structure

The top-level layout is the same in both options; only the contents of examples/ differ (per section 3 above):

awsome-distributed-ai/
├── architectures/
│   ├── common/
│   ├── vpc_network/
│   ├── aws-parallelcluster/
│   ├── amazon-eks/
│   ├── sagemaker-hyperpod-slurm/
│   └── sagemaker-hyperpod-eks/
├── ami_and_containers/
├── examples/                    # Option A or Option B layout — see section 3
├── validation_and_observability/
├── micro-benchmarks/
├── deprecated/
├── docs/
│   ├── CONTRIBUTING.md
│   ├── DEPRECATION_POLICY.md
│   └── architecture-decisions/
├── .github/
├── LICENSE
└── README.md

Impact & Scope

  • Affected areas: Every directory, all internal links, README references, CI workflows, external documentation, blog posts, and workshop materials that reference current paths.
  • Breaking changes: Yes — all directory paths change. This is a one-time, coordinated change.
  • Migration needed: Yes.
    • GitHub redirects do not work for path renames within a repo; only repo-level transfers get auto-redirects.
    • All cross-references in READMEs, workshop guides, and external docs need updating.
    • CI workflows (.github/workflows/) that use path filters will need path updates.
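    • git mv keeps each rename detectable by Git, so git log --follow and git blame can still trace history across the move; we should avoid re-creating files from scratch or heavily rewriting their contents in the same commit as the move.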
    • Consider adding a compatibility script or symlinks in a transitional release (though symlinks don't render on GitHub).
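
Updating cross-references in READMEs and CI path filters is largely a search-and-replace over old path prefixes. The sketch below shows one way to do that on a string of text. It assumes the mapping is limited to the rows listed under "Proposed Change"; the names `RENAMES` and `rewrite_paths` are hypothetical. Paths under 3.test_cases/ are only moved to examples/ here; their final location inside examples/ depends on the chosen option and needs manual placement.

```python
import re

# Old path prefix -> new path prefix, limited to rows listed in the RFC.
RENAMES = {
    "0.docs": "docs",
    "1.architectures": "architectures",
    "1.architectures/0.common": "architectures/common",
    "1.architectures/1.vpc_network": "architectures/vpc_network",
    "1.architectures/2.aws-parallelcluster": "architectures/aws-parallelcluster",
    "1.architectures/5.sagemaker-hyperpod": "architectures/sagemaker-hyperpod-slurm",
    "2.ami_and_containers": "ami_and_containers",
    "3.test_cases": "examples",
    "4.validation_and_observability": "validation_and_observability",
}

# Longest prefixes first, so the most specific rename wins.
_ALTERNATIVES = "|".join(
    re.escape(old) for old in sorted(RENAMES, key=len, reverse=True)
)
# The lookbehind and lookahead keep the match on whole path components,
# so "10.docs" or "0.docsify" are left alone.
_OLD_PATH = re.compile(rf"(?<![\w.-])(?:{_ALTERNATIVES})(?![\w-])")


def rewrite_paths(text: str):
    """Return (new_text, number_of_replacements) for one document."""
    return _OLD_PATH.subn(lambda match: RENAMES[match.group(0)], text)


if __name__ == "__main__":
    sample = "See 1.architectures/1.vpc_network/README.md and 3.test_cases/."
    print(rewrite_paths(sample))
```

Running the same function in a report-only mode (ignoring the new text and looking at the count) would also give a quick inventory of files that still reference old paths.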

HyperPod Console dependency (hard blocker)

The SageMaker HyperPod console (more specifically, its CloudFormation templates, here and here) reads lifecycle scripts directly from this repository using the current directory paths (e.g., 1.architectures/5.sagemaker-hyperpod/...). Renaming these directories would break the production HyperPod service's ability to locate and execute lifecycle scripts.

The proposed approach uses releases as stable reference points to decouple the service from main branch paths during the transition:

| Step | Action | Outcome |
| --- | --- | --- |
| 1. Create a release | Tag the current state of the repo (e.g., v2.0.0-pre-reorg) before any renames | Provides a permanent, immutable snapshot with the old directory structure that the service can pin to |
| 2. Update service → pinned release | HyperPod team updates the console CloudFormation templates to reference lifecycle scripts from the release tag instead of main | Service is decoupled from main; renames on main no longer affect production |
| 3. Reorganize the repository | Perform all renames, merges, and restructuring on main | main now has the new directory structure; the pinned release is unaffected |
| 4. Update service → new paths | HyperPod team updates the console to reference the new paths on main (or a post-reorg release tag) | Service is back on main with the new structure |

This approach ensures zero downtime for HyperPod users — at no point are the paths the service references invalid. The release tag serves as a safety net: even if step 4 is delayed, the service continues to work against the pinned release.
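
The ordering argument can be checked with a toy model. The sketch below is purely illustrative and makes several assumptions: a ref (main or a tag) is modeled as a set of paths, the service's reference is modeled as a (ref, path) pair, and all names (`run_transition`, `reference_is_valid`, the constants) are hypothetical. It does not describe how the HyperPod console actually fetches scripts. It walks the four steps and records, after each one, whether the path the service references exists at the ref it points to.

```python
OLD_PATH = "1.architectures/5.sagemaker-hyperpod"
NEW_PATH = "architectures/sagemaker-hyperpod-slurm"
TAG = "v2.0.0-pre-reorg"  # example tag name used in the RFC


def reference_is_valid(refs, service):
    """True if the path the service reads exists at the ref it points to."""
    ref, path = service
    return path in refs.get(ref, set())


def run_transition(pin_before_rename=True):
    """Walk the four steps; return [(step, service_reference_valid), ...]."""
    refs = {"main": {OLD_PATH}}   # ref name -> paths present at that ref
    service = ("main", OLD_PATH)  # what the console templates reference
    log = []

    # Step 1: tag the current state; the tag keeps the old layout.
    refs[TAG] = frozenset(refs["main"])
    log.append(("1. create release", reference_is_valid(refs, service)))

    # Step 2: point the service at the tag instead of main.
    if pin_before_rename:
        service = (TAG, OLD_PATH)
    log.append(("2. pin service to release", reference_is_valid(refs, service)))

    # Step 3: rename on main; the tag is unaffected.
    refs["main"] = {NEW_PATH}
    log.append(("3. reorganize main", reference_is_valid(refs, service)))

    # Step 4: point the service at the new path on main.
    service = ("main", NEW_PATH)
    log.append(("4. update service to new paths", reference_is_valid(refs, service)))
    return log


if __name__ == "__main__":
    for step, ok in run_transition():
        print(step, "valid" if ok else "BROKEN")
```

Calling `run_transition(pin_before_rename=False)` models skipping step 2: the reference is then invalid after step 3, which is why the pinning step is treated as a hard blocker.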

Alternative: Migrate lifecycle scripts to sagemaker-hyperpod-cluster-setup

A cleaner long-term option is to move the lifecycle scripts out of this repository entirely and into aws/sagemaker-hyperpod-cluster-setup, which is the official repo for HyperPod cluster setup and already contains the CloudFormation templates that reference them. This would:

  • Eliminate the cross-repo dependency — the scripts and the templates that reference them would live in the same repo, owned by the same team.
  • Decouple this repo from production service concerns: awsome-distributed-ai becomes purely a collection of reference examples and best practices, not a dependency of a production AWS service.
  • Make future reorganizations safe — no need to coordinate with the HyperPod team for future directory changes in this repo.

If this approach is chosen, the migration sequence becomes: migrate lifecycle scripts to sagemaker-hyperpod-cluster-setup → update CloudFormation templates to reference the new location → reorganize this repo freely.

Migration plan (high level)

  1. Create a tracking issue for every external document/workshop that references current paths.
  2. Create a release (e.g., v2.0.0-pre-reorg) to snapshot the current directory structure.
  3. Coordinate with HyperPod service team to update console CloudFormation templates to reference the release tag instead of main (hard blocker — steps 4–6 cannot proceed until this is deployed).
  4. Perform all renames via git mv in a single PR to preserve history.
  5. Merge awsome-inference content in a follow-up PR (or same PR if manageable).
  6. Update all internal README links, CI path filters, and Makefile references.
  7. Coordinate with HyperPod service team to update console references from the release tag to the new paths on main.
  8. Archive awsome-inference with a pointer to this repo.
  9. Update external docs/workshops.

Alternatives Considered

  1. Pure use-case-centric hierarchy (examples/<model>/{training,inference,fine-tuning}/<framework>/): Gives a better landing spot for customers with "I want to run Qwen3" intent, and it is the orientation proposed in the Frameworks team style guide. However, framework-level changes (PyTorch upgrades, FSDP API changes) would sweep N model directories instead of one subtree; framework contributors would lose a codebase home; and framework-agnostic content (micro-benchmarks, NCCL/EFA validation) would have no natural location. The Qwen3-scatter concern this orientation solves is better addressed by Option B's placement rule (framework-centric when the framework is the subject, use-case-centric when the model is the subject) than by inverting the whole hierarchy.

  2. Keep numbering, just merge inference: Solves fragmentation but retains the awkward numbering convention. Since we're already doing a disruptive rename for the merge, removing numbers at the same time minimizes total disruption.

  3. Merge both repos into a new repo with a new name: Avoids breaking existing links to this repo, but loses the GitHub star count, issue history, and contributor graph. The awsome-distributed-ai name is well-known enough to keep.

  4. Do nothing: Maintain two repos, accept the divergence. This gets worse over time as more inference content is added and infrastructure is duplicated.

Feedback Period

4 weeks

CC List
