CI: build & publish hpc-cloud images via GitHub Actions (nccl-tests first) #1113

Draft
KeitaW wants to merge 1 commit into main from add-image-build-ci
@KeitaW KeitaW commented May 29, 2026

Collaborator

What

Adds a reusable, manifest-driven GitHub Actions pipeline that builds and publishes the hpc-cloud
container images, wired up for nccl-tests as the first image.

  • .github/image-manifest.yaml — per-image build args + tag inputs (source of truth for nccl-tests).
  • .github/workflows/build-image.yml — reusable (workflow_call): prepare → build (arch matrix) → merge.
    Heavy builds run on CodeBuild-hosted GitHub Actions runners (native x86 + Graviton fleets,
    privileged), each arch pushes by digest, then a merge job assembles the multi-arch manifest list
    and tags :<TAG> + :latest.
  • .github/workflows/release-images.yml — caller: PR / push-to-main (path-filtered) / workflow_dispatch /
    release triggers, plus a manifest-buildspec-parity gate that fails the run if the six versions drift
    between the manifest and micro-benchmarks/nccl-tests/buildspec.yaml.
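
As a rough illustration of the merge step described above, the sketch below assembles a `docker buildx imagetools create` invocation that combines per-arch digests into one manifest list tagged `:<TAG>` and `:latest`. The helper name `imagetools_create_cmd` and the placeholder digests are hypothetical, and the actual merge job in build-image.yml may use a different mechanism.

```python
def imagetools_create_cmd(repo: str, tag: str, digests: list[str]) -> list[str]:
    """Build the argv for a multi-arch manifest list from per-arch digests.

    Hypothetical helper: only constructs the command, does not run it.
    """
    cmd = ["docker", "buildx", "imagetools", "create"]
    # Tag the manifest list with both the versioned tag and :latest.
    for t in (tag, "latest"):
        cmd += ["-t", f"{repo}:{t}"]
    # Each source is an image that one arch pushed by digest.
    cmd += [f"{repo}@{digest}" for digest in digests]
    return cmd


# Placeholder digests (not real), one per architecture.
amd64_digest = "sha256:" + "a" * 64
arm64_digest = "sha256:" + "b" * 64

cmd = imagetools_create_cmd(
    "public.ecr.aws/hpc-cloud/nccl-tests",
    "cuda13.0.2-efa1.48.0-ofiv1.19.0-ncclv2.30.4-1-testsv2.18.3",
    [amd64_digest, arm64_digest],
)
print(" ".join(cmd))
```

Pushing by digest first means neither arch job touches a shared tag, so a failed arch build cannot leave `:latest` pointing at a single-arch image.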

For nccl-tests this publishes
public.ecr.aws/hpc-cloud/nccl-tests:cuda13.0.2-efa1.48.0-ofiv1.19.0-ncclv2.30.4-1-testsv2.18.3 (+ :latest),
multi-arch amd64 + arm64.
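
A minimal sketch of how such a tag could be derived from the manifest's version inputs. The key names below (`CUDA_VERSION`, `EFA_INSTALLER_VERSION`, and so on) and the function `derive_tag` are assumptions for illustration; the real keys live in .github/image-manifest.yaml and buildspec.yaml. Only the version values are taken from the tag above.

```python
def derive_tag(v: dict[str, str]) -> str:
    """Join the version inputs into the image tag in a fixed order."""
    return (
        f"cuda{v['CUDA_VERSION']}"
        f"-efa{v['EFA_INSTALLER_VERSION']}"
        f"-ofi{v['AWS_OFI_NCCL_VERSION']}"
        f"-nccl{v['NCCL_VERSION']}"
        f"-tests{v['NCCL_TESTS_VERSION']}"
    )


# Key names are hypothetical; values are read off the published tag.
versions = {
    "CUDA_VERSION": "13.0.2",
    "EFA_INSTALLER_VERSION": "1.48.0",
    "AWS_OFI_NCCL_VERSION": "v1.19.0",
    "NCCL_VERSION": "v2.30.4-1",
    "NCCL_TESTS_VERSION": "v2.18.3",
}

tag = derive_tag(versions)
print(tag)  # cuda13.0.2-efa1.48.0-ofiv1.19.0-ncclv2.30.4-1-testsv2.18.3
```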

Why

The per-image CodePipelines that previously published these images pulled source via GitHub v1 OAuth
source actions, which broke when this repo was transferred/renamed
(aws-samples/awsome-distributed-training → awslabs/awsome-distributed-ai). The last successful
nccl-tests publish was 2026-02-20; the CUDA-13 image was never built. Moving the pipeline to GitHub
Actions makes build status visible on the PR and removes the deprecated OAuth integration.

⚠️ Merge order / dependencies (why this is a draft)

The build jobs cannot succeed until the supporting AWS infrastructure is in place:

  1. The CodeBuild runner projects/fleets (adai-image-builder-{x86,arm}), their WORKFLOW_JOB_QUEUED
    webhooks, and the ECR-Public push policy — provisioned by Terraform in the internal infra repo.
  2. The repo variable ECR_PUBLISH_ROLE_ARN set to
    arn:aws:iam::159553542841:role/awslabs-AOSH-GitHubActionsRole.

The manifest-buildspec-parity job runs on ubuntu-latest and is safe to run immediately. Mark this PR
ready (gh pr ready) once the infra is applied and validated on nccl-tests.

Validation

  • build-image.yml / release-images.yml / image-manifest.yaml are YAML-valid; tag derivation matches
    buildspec.yaml byte-for-byte; the parity gate covers all six versions (incl. GDRCOPY_VERSION).

Reusable, manifest-driven workflow that builds multi-arch images on CodeBuild-hosted GitHub Actions runners and publishes to public.ecr.aws/hpc-cloud. Wires up nccl-tests first; other images onboard via a manifest entry. Replaces the per-image CodePipelines whose GitHub v1 OAuth source broke at the awsome-distributed-training -> awsome-distributed-ai rename.

- .github/image-manifest.yaml: per-image build args + tag inputs (nccl-tests)

- .github/workflows/build-image.yml: reusable prepare -> build(matrix) -> merge (multi-arch manifest)

- .github/workflows/release-images.yml: caller + manifest/buildspec version-parity gate

Requires the CodeBuild runner infra (Terraform, separate repo) applied and the ECR_PUBLISH_ROLE_ARN repo variable set before build jobs can succeed.