[RFC]: Repository reorganization — remove numbering, merge awsome-inference

## Motivation

The current `awsome-distributed-training` repository (now `awsome-distributed-ai`) has grown to cover a broad set of distributed ML workloads on AWS, but the directory layout carries legacy decisions that make it harder to navigate and contribute to:

1. **Numbered prefixes impose a rigid ordering** that no longer reflects how users discover content. Directories like `1.architectures/`, `2.ami_and_containers/`, `3.test_cases/` suggest a sequential workflow, but most users land directly on a specific test case or architecture via search or a shared link. The numbering also creates friction when adding new top-level directories (what number should it get?).

2. **Training and inference are split across two separate repositories.** The [`awsome-inference`](https://github.com/aws-samples/awsome-inference) repo covers inference workloads (SGLang, TRT-LLM, NIMs, Ray Serve, Dynamo, etc.) with its own infrastructure and project structure. Maintaining two repos leads to duplicated IaC (both have VPC/EKS setup), divergent conventions, and a fragmented contributor/user experience. Users doing both training *and* inference on the same cluster shouldn't need two repos.

3. **The `test_cases/` directory has two kinds of content mixed together.** Some examples are *framework-centric* — the demo's subject is the training/inference framework (FSDP, Megatron-LM, NeMo), and model variants underneath illustrate it (`pytorch/FSDP/llama3`, `megatron/nemo/qwen3`). Others are *use-case-centric* — the demo's subject is a specific model or use case, and the framework is incidental (`pytorch/ddp/detr-finetune` is really a DETR-on-COCO fine-tuning example, not a DDP demo). Forcing both into a single framework-first hierarchy buries the latter under the wrong axis and makes them undiscoverable.

## Proposed Change

### 1. Remove numeric prefixes from all directories

| Current | Proposed |
|---|---|
| `0.docs/` | `docs/` |
| `1.architectures/` | `architectures/` |
| `2.ami_and_containers/` | `ami_and_containers/` |
| `3.test_cases/` | `examples/` |
| `4.validation_and_observability/` | `validation_and_observability/` |
| `micro-benchmarks/` | `micro-benchmarks/` (unchanged) |

Also remove numbering from subdirectories within `architectures/`:

| Current | Proposed |
|---|---|
| `architectures/0.common/` | `architectures/common/` |
| `architectures/1.vpc_network/` | `architectures/vpc_network/` |
| `architectures/2.aws-parallelcluster/` | `architectures/aws-parallelcluster/` |
| `architectures/5.sagemaker-hyperpod/` | `architectures/sagemaker-hyperpod-slurm/` |
| ... | ... |

The HyperPod directory is renamed `sagemaker-hyperpod-slurm` to be explicit about the orchestrator (vs `sagemaker-hyperpod-eks`).
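
The two rename tables can be expressed as data and turned into `git mv` commands. The following is a minimal sketch, not a finished migration script: the function name `git_mv_commands` is hypothetical, and only the rows listed explicitly above are included (the elided `...` rows are deliberately not guessed).

```python
from pathlib import PurePosixPath

# Old -> new top-level directories, copied from the first table above.
TOP_LEVEL_RENAMES = {
    "0.docs": "docs",
    "1.architectures": "architectures",
    "2.ami_and_containers": "ami_and_containers",
    "3.test_cases": "examples",
    "4.validation_and_observability": "validation_and_observability",
}

# Renames inside architectures/, applied after the top-level pass.
# Only the rows the table spells out; the "..." rows are left out on purpose.
ARCHITECTURE_RENAMES = {
    "0.common": "common",
    "1.vpc_network": "vpc_network",
    "2.aws-parallelcluster": "aws-parallelcluster",
    "5.sagemaker-hyperpod": "sagemaker-hyperpod-slurm",
}


def git_mv_commands() -> list[str]:
    """Return the `git mv` commands implied by the two tables, in order."""
    cmds = [f"git mv {old} {new}" for old, new in TOP_LEVEL_RENAMES.items()]
    base = PurePosixPath("architectures")
    cmds += [
        f"git mv {base / old} {base / new}"
        for old, new in ARCHITECTURE_RENAMES.items()
    ]
    return cmds


if __name__ == "__main__":
    print("\n".join(git_mv_commands()))
```

Keeping the mapping as data means the same tables can drive the link-rewriting and CI path-filter updates listed under Impact & Scope.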

### 2. Merge `awsome-inference` into this repository

Content from [`aws-samples/awsome-inference`](https://github.com/aws-samples/awsome-inference) would be absorbed:

| awsome-inference source | Destination |
|---|---|
| `1.infrastructure/` | `architectures/` (merge with existing infra; deduplicate VPC/EKS) |
| `2.projects/*`, `3.use-cases/*` | `examples/inference/<engine>/` or `examples/use-cases/<name>/` per the placement rule below |

The `awsome-inference` repo would be archived after the merge, with its README pointing here.

### 3. Replace `test_cases/` with a unified `examples/` tree

The renamed `examples/` directory will house all training and inference workload examples. The directory layout *under* `examples/` has two viable shapes that I'd like community input on. Both options drop numbering, unify fine-tuning into `training/` (fine-tuning libraries like PEFT, TRL, NeMo-RL live at the same abstraction level as pre-training libraries like FSDP, DeepSpeed, Megatron-LM — both are training-engine showcases), and absorb `awsome-inference` content under `inference/`.

#### Option A — framework-centric only (closer to today's structure)

Organize all examples by their training/inference framework. Model- or task-anchored demos (DETR fine-tune, RAG, instruct-tuning) live under the framework subdirectory that happens to power them.

```
examples/
├── training/
│   ├── fsdp/
│   ├── deepspeed/
│   ├── torchtitan/
│   ├── megatron-lm/
│   ├── nemo/
│   ├── maxtext/
│   ├── peft/
│   ├── trl/
│   └── nemo-rl/
└── inference/
    ├── vllm/
    ├── sglang/
    ├── trtllm/
    ├── nim/
    ├── dynamo/
    └── ray-serve/
```

**Pros:** simpler taxonomy; single placement rule (find the framework); preserves framework-team ownership cleanly; mirrors the current `test_cases/` structure for muscle memory.

**Cons:** use-case-anchored demos (DETR fine-tune, RAG, instruct-tuning) sit awkwardly under whichever framework happens to power them; discoverability suffers for customer-intent readers searching for "DETR" or "RAG"; multi-framework end-to-end recipes (e.g. fine-tune with TRL → serve with vLLM) have no obvious home.

#### Option B — framework-centric + use-case-centric

Two organizing principles sit as siblings under `examples/`:

- **`examples/{training,inference}/<framework>/<model>/`** — *framework-centric*. The training/inference engine is the demo subject; model variants underneath illustrate it. Example: `examples/training/fsdp/llama3/`.
- **`examples/use-cases/<name>/`** — *use-case-centric*. The model or task is the demo subject; the framework is incidental. Example: `examples/use-cases/detr-finetune/`.

```
examples/
├── training/                    # framework-centric: training/fine-tuning engines
│   ├── fsdp/
│   ├── deepspeed/
│   ├── torchtitan/
│   ├── megatron-lm/
│   ├── nemo/
│   ├── maxtext/
│   ├── peft/
│   ├── trl/
│   └── nemo-rl/
├── inference/                   # framework-centric: inference engines
│   ├── vllm/
│   ├── sglang/
│   ├── trtllm/
│   ├── nim/
│   ├── dynamo/
│   └── ray-serve/
└── use-cases/                   # use-case-centric: end-to-end
    ├── detr-finetune/
    ├── qwen3-rag/
    ├── llama3-instruct-tuning/
    └── stable-diffusion-serving/
```

**Pros:** customer-intent demos have a discoverable home; multi-framework end-to-end recipes get a clean landing spot; the two axes match the two kinds of content that already exist in the repo.

**Cons:** requires a placement rule for contributors (when does something belong under a framework vs `use-cases/`?); risks `use-cases/` becoming a dumping ground for anything that doesn't obviously fit elsewhere.

### Placement rule (Option B only)

- **Framework is what changes between sibling directories** → `examples/{training,inference}/<framework>/<model>/`. The model is the vehicle; swapping it gives "the same FSDP example with a different model."
- **Model or task is what changes between sibling directories** → `examples/use-cases/<name>/`. The framework is incidental; swapping it would still leave a recognizable DETR fine-tune or RAG demo.

Default to `use-cases/` when ambiguous — that path is more discoverable for the customer-intent reader, and burying a use-case demo under the wrong framework directory is the more costly mistake.

Worked examples:
- A LoRA fine-tune demonstrating PEFT against three model sizes → `examples/training/peft/{llama3-8b,llama3-70b,qwen3-32b}/`
- A DETR object-detection fine-tune that happens to use PyTorch DDP → `examples/use-cases/detr-finetune/`
- A vLLM serving demo with three quantization variants → `examples/inference/vllm/{llama3-fp8,llama3-int4,llama3-bf16}/`
- A retrieval-augmented Qwen3 chat demo → `examples/use-cases/qwen3-rag/`

The three-level maximum from `examples/` to any leaf (`examples/training/fsdp/llama3/`) also satisfies the "no deeply nested paths" rule the Frameworks team style guide calls for.
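
The two Option B shapes and the depth limit could be checked mechanically, for example in CI. The sketch below is an assumption about how such a check might look. `classify_example` is a hypothetical helper, and it validates example *directory* paths only, not files inside them.

```python
from pathlib import PurePosixPath

FRAMEWORK_ROOTS = {"training", "inference"}
USE_CASE_ROOT = "use-cases"


def classify_example(path: str) -> str:
    """Return which Option B axis an example directory belongs to.

    Raises ValueError if the path matches neither
    examples/{training,inference}/<framework>/<model>/ nor
    examples/use-cases/<name>/.
    """
    parts = PurePosixPath(path).parts  # trailing slashes are dropped
    if not parts or parts[0] != "examples":
        raise ValueError(f"{path!r} is not under examples/")
    rest = parts[1:]
    if len(rest) == 3 and rest[0] in FRAMEWORK_ROOTS:
        return "framework-centric"
    if len(rest) == 2 and rest[0] == USE_CASE_ROOT:
        return "use-case-centric"
    raise ValueError(f"{path!r} does not match either Option B shape")


if __name__ == "__main__":
    for p in (
        "examples/training/peft/llama3-8b/",
        "examples/use-cases/detr-finetune/",
    ):
        print(p, "->", classify_example(p))
```

A check like this only enforces the shape of the path. Deciding whether the framework or the model is the subject of a demo remains a reviewer judgment.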

### Proposed final top-level structure

The top-level layout is the same in both options; only the contents of `examples/` differ (per section 3 above):

```
awsome-distributed-ai/
├── architectures/
│   ├── common/
│   ├── vpc_network/
│   ├── aws-parallelcluster/
│   ├── amazon-eks/
│   ├── sagemaker-hyperpod-slurm/
│   └── sagemaker-hyperpod-eks/
├── ami_and_containers/
├── examples/                    # Option A or Option B layout — see section 3
├── validation_and_observability/
├── micro-benchmarks/
├── deprecated/
├── docs/
│   ├── CONTRIBUTING.md
│   ├── DEPRECATION_POLICY.md
│   └── architecture-decisions/
├── .github/
├── LICENSE
└── README.md
```

## Impact & Scope

- **Affected areas**: Every directory, all internal links, README references, CI workflows, external documentation, blog posts, and workshop materials that reference current paths.
- **Breaking changes**: Yes — all directory paths change. This is a one-time, coordinated change.
- **Migration needed**: Yes.
  - GitHub redirects do **not** work for path renames within a repo; only repo-level renames and transfers get automatic redirects.
  - All cross-references in READMEs, workshop guides, and external docs need updating.
  - CI workflows (`.github/workflows/`) that use path filters will need path updates.
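  - Git detects renames by content similarity rather than recording them, so moving files with `git mv` and leaving their contents unchanged in the same commit keeps `git log --follow` and blame usable; we should avoid re-creating files from scratch.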
  - Consider adding a compatibility script or symlinks in a transitional release (though symlinks don't render on GitHub).
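
Finding and rewriting stale path references in READMEs and CI path filters can be partly automated. The following is a rough sketch under stated assumptions: `rewrite_paths` and `stale_references` are hypothetical helpers, the mapping covers only the renames listed explicitly in this RFC, and the deeper reshuffle inside `examples/` is not handled.

```python
import re

# Longest prefixes first, so a nested rename wins over its parent directory.
PATH_RENAMES = [
    ("1.architectures/5.sagemaker-hyperpod/", "architectures/sagemaker-hyperpod-slurm/"),
    ("1.architectures/2.aws-parallelcluster/", "architectures/aws-parallelcluster/"),
    ("1.architectures/1.vpc_network/", "architectures/vpc_network/"),
    ("1.architectures/0.common/", "architectures/common/"),
    ("0.docs/", "docs/"),
    ("1.architectures/", "architectures/"),
    ("2.ami_and_containers/", "ami_and_containers/"),
    ("3.test_cases/", "examples/"),
    ("4.validation_and_observability/", "validation_and_observability/"),
]

_LOOKUP = dict(PATH_RENAMES)

# The lookbehind keeps names such as "10.docs/" from matching "0.docs/".
_PATTERN = re.compile(
    r"(?<![\w.-])(" + "|".join(re.escape(old) for old, _ in PATH_RENAMES) + ")"
)


def stale_references(text: str) -> list[str]:
    """List every old-style path prefix found in the text."""
    return [m.group(1) for m in _PATTERN.finditer(text)]


def rewrite_paths(text: str) -> str:
    """Replace old-style path prefixes with their new names in one pass."""
    return _PATTERN.sub(lambda m: _LOOKUP[m.group(1)], text)


if __name__ == "__main__":
    sample = "See 1.architectures/1.vpc_network/ and 0.docs/ for details."
    print(stale_references(sample))
    print(rewrite_paths(sample))
```

Running `stale_references` over each Markdown and workflow file before and after the rename PR gives a simple count of what is left to fix.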

### HyperPod Console dependency (hard blocker)

The **SageMaker HyperPod console reads lifecycle scripts directly from this repository using the current directory paths** (e.g., `1.architectures/5.sagemaker-hyperpod/...`). More specifically, the references live in its CloudFormation templates, [here](https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/main/eks/cloudformation/lifecycle-script-template.yaml#L14) and [here](https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/main/slurm/cloudformation/slurm-lifecycle-script-template.yaml#L28). Renaming these directories would break the production HyperPod service's ability to locate and execute lifecycle scripts.

The proposed approach uses **releases as stable reference points** to decouple the service from `main` branch paths during the transition:

| Step | Action | Outcome |
|---|---|---|
| **1. Create a release** | Tag the current state of the repo (e.g., `v2.0.0-pre-reorg`) before any renames | Provides a **permanent, immutable snapshot** with the old directory structure that the service can pin to |
| **2. Update service → pinned release** | HyperPod team updates the console CloudFormation templates to reference lifecycle scripts from the release tag instead of `main` | Service is **decoupled from `main`** — renames on `main` no longer affect production |
| **3. Reorganize the repository** | Perform all renames, merges, and restructuring on `main` | `main` now has the new directory structure; the pinned release is unaffected |
| **4. Update service → new paths** | HyperPod team updates the console to reference the new paths on `main` (or a post-reorg release tag) | Service is back on `main` with the new structure |

This approach ensures **zero downtime** for HyperPod users — at no point are the paths the service references invalid. The release tag serves as a safety net: even if step 4 is delayed, the service continues to work against the pinned release.
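
The ordering argument can be checked with a toy model. The sketch below is purely illustrative and makes simplifying assumptions: each git ref is modelled as a set of directories, the service as a `(ref, path)` pair, and all function names are hypothetical. It does not reflect how the HyperPod templates actually resolve paths.

```python
OLD_PATH = "1.architectures/5.sagemaker-hyperpod"
NEW_PATH = "architectures/sagemaker-hyperpod-slurm"
TAG = "v2.0.0-pre-reorg"


def create_release(state):
    # Step 1: the tag is an immutable copy of main's current layout.
    state["refs"][TAG] = set(state["refs"]["main"])


def pin_service_to_release(state):
    # Step 2: the service reads the old path from the tag instead of main.
    state["service"] = (TAG, OLD_PATH)


def reorganize_main(state):
    # Step 3: main gets the new layout; the tag is untouched.
    state["refs"]["main"] = {NEW_PATH}


def point_service_at_new_paths(state):
    # Step 4: the service moves back to main with the new path.
    state["service"] = ("main", NEW_PATH)


PROPOSED_ORDER = [
    create_release,
    pin_service_to_release,
    reorganize_main,
    point_service_at_new_paths,
]


def simulate(order):
    """Run the steps and report, after each one, whether the service path resolves."""
    state = {"refs": {"main": {OLD_PATH}}, "service": ("main", OLD_PATH)}
    results = []
    for step in order:
        step(state)
        ref, path = state["service"]
        results.append((step.__name__, path in state["refs"].get(ref, set())))
    return results


if __name__ == "__main__":
    for name, ok in simulate(PROPOSED_ORDER):
        print(f"{name}: {'ok' if ok else 'BROKEN'}")
```

Swapping steps 2 and 3 in the list makes `reorganize_main` report a broken reference, which is why the reorganization is blocked on the service being pinned first.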

**Alternative: Migrate lifecycle scripts to `sagemaker-hyperpod-cluster-setup`**

A cleaner long-term option is to move the lifecycle scripts out of this repository entirely and into [`aws/sagemaker-hyperpod-cluster-setup`](https://github.com/aws/sagemaker-hyperpod-cluster-setup/tree/main), which is the official repo for HyperPod cluster setup and already contains the CloudFormation templates that reference them. This would:

- **Eliminate the cross-repo dependency** — the scripts and the templates that reference them would live in the same repo, owned by the same team.
- **Decouple this repo from production service concerns** — `awsome-distributed-ai` becomes purely a collection of reference examples and best practices, not a dependency of a production AWS service.
- **Make future reorganizations safe** — no need to coordinate with the HyperPod team for future directory changes in this repo.

If this approach is chosen, the migration sequence becomes: migrate lifecycle scripts to `sagemaker-hyperpod-cluster-setup` → update CloudFormation templates to reference the new location → reorganize this repo freely.

### Migration plan (high level)

1. Create a tracking issue for every external document/workshop that references current paths.
2. **Create a release** (e.g., `v2.0.0-pre-reorg`) to snapshot the current directory structure.
3. **Coordinate with HyperPod service team** to update console CloudFormation templates to reference the release tag instead of `main` (hard blocker — steps 4–6 cannot proceed until this is deployed).
4. Perform all renames via `git mv` in a single PR to preserve history.
5. Merge `awsome-inference` content in a follow-up PR (or same PR if manageable).
6. Update all internal README links, CI path filters, and Makefile references.
7. **Coordinate with HyperPod service team** to update console references from the release tag to the new paths on `main`.
8. Archive `awsome-inference` with a pointer to this repo.
9. Update external docs/workshops.

## Alternatives Considered

1. **Pure use-case-centric hierarchy** (`examples/<model>/{training,inference,fine-tuning}/<framework>/`): Gives customers with "I want to run Qwen3" intent a better landing spot, and it is the orientation proposed in the Frameworks team style guide. However, framework-level changes (PyTorch upgrades, FSDP API changes) sweep N model directories instead of one subtree; framework contributors lose a codebase home; and framework-agnostic content (micro-benchmarks, NCCL/EFA validation) has no natural location. The concern this orientation solves, that examples for a single model such as Qwen3 end up scattered across framework directories, is better addressed by Option B's placement rule (framework-centric when the framework is the subject, use-case-centric when the model is the subject) than by inverting the whole hierarchy.

2. **Keep numbering, just merge inference**: Solves fragmentation but retains the awkward numbering convention. Since we're already doing a disruptive rename for the merge, removing numbers at the same time minimizes total disruption.

3. **Merge both repos into a new repo with a new name**: Avoids breaking existing links to *this* repo, but loses the GitHub star count, issue history, and contributor graph. The `awsome-distributed-ai` name is well-known enough to keep.

4. **Do nothing**: Maintain two repos, accept the divergence. This gets worse over time as more inference content is added and infrastructure is duplicated.

## Feedback Period

4 weeks

## CC List
