CI: build & publish hpc-cloud images via GitHub Actions (nccl-tests first) #1113
Draft
KeitaW wants to merge 1 commit into
Reusable, manifest-driven workflow that builds multi-arch images on CodeBuild-hosted GitHub Actions runners and publishes to public.ecr.aws/hpc-cloud. Wires up nccl-tests first; other images onboard via a manifest entry. Replaces the per-image CodePipelines whose GitHub v1 OAuth source broke at the awsome-distributed-training -> awsome-distributed-ai rename.
- .github/image-manifest.yaml: per-image build args + tag inputs (nccl-tests)
- .github/workflows/build-image.yml: reusable prepare -> build(matrix) -> merge (multi-arch manifest)
- .github/workflows/release-images.yml: caller + manifest/buildspec version-parity gate
Requires the CodeBuild runner infra (Terraform, separate repo) applied and the ECR_PUBLISH_ROLE_ARN repo variable set before build jobs can succeed.
What
Adds a reusable, manifest-driven GitHub Actions pipeline that builds and publishes the hpc-cloud container images, wired up for nccl-tests as the first image.
- .github/image-manifest.yaml: per-image build args + tag inputs (source of truth for nccl-tests).
- .github/workflows/build-image.yml: reusable workflow (workflow_call) with three stages, prepare → build (arch matrix) → merge. Heavy builds run on CodeBuild-hosted GitHub Actions runners (native x86 + Graviton fleets, privileged), each arch pushes by digest, then a merge job assembles the multi-arch manifest list and tags :<TAG> + :latest.
- .github/workflows/release-images.yml: caller with PR / push-to-main (path-filtered) / workflow_dispatch / release triggers, plus a manifest-buildspec-parity gate that fails the run if the six versions drift between the manifest and micro-benchmarks/nccl-tests/buildspec.yaml.
For nccl-tests this publishes public.ecr.aws/hpc-cloud/nccl-tests:cuda13.0.2-efa1.48.0-ofiv1.19.0-ncclv2.30.4-1-testsv2.18.3 (+ :latest), multi-arch amd64 + arm64.
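To make the tag scheme concrete, here is a minimal Python sketch of how the published tag could be composed from the manifest's tag inputs. It is an illustration only: the dictionary keys, the helper names derive_tag and image_refs, and the choice to keep the "v" prefix inside the version values are assumptions, since the real manifest schema and buildspec logic are not shown in this description. Only the repository and the resulting tag string come from the text above.

```python
# Illustrative sketch only. The key names below are hypothetical; the real
# schema of .github/image-manifest.yaml is not shown in this PR description.
NCCL_TESTS_VERSIONS = {
    "cuda": "13.0.2",
    "efa": "1.48.0",
    "ofi": "v1.19.0",     # assumption: the "v" prefix is part of the version value
    "nccl": "v2.30.4-1",
    "tests": "v2.18.3",
}

REPOSITORY = "public.ecr.aws/hpc-cloud/nccl-tests"


def derive_tag(versions):
    """Join the five tag inputs into a single image tag."""
    return "cuda{cuda}-efa{efa}-ofi{ofi}-nccl{nccl}-tests{tests}".format(**versions)


def image_refs(repository, versions):
    """Return the two references the merge job is described as tagging."""
    tag = derive_tag(versions)
    return [f"{repository}:{tag}", f"{repository}:latest"]


if __name__ == "__main__":
    for ref in image_refs(REPOSITORY, NCCL_TESTS_VERSIONS):
        print(ref)
```

Keeping the derivation in one small function makes it easy to check the result against whatever buildspec.yaml produces for the same inputs.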
Why
The per-image CodePipelines that previously published these images pulled source via GitHub v1 OAuth source actions, which broke when this repo was transferred/renamed (aws-samples/awsome-distributed-training → awslabs/awsome-distributed-ai). The last successful nccl-tests publish was 2026-02-20; the CUDA-13 image was never built. Moving the pipeline to GitHub Actions makes build status visible on the PR and removes the deprecated OAuth integration.
The build jobs cannot succeed until the supporting AWS infrastructure is in place:
- The CodeBuild runners (adai-image-builder-{x86,arm}), their WORKFLOW_JOB_QUEUED webhooks, and the ECR-Public push policy, provisioned by Terraform in the internal infra repo.
- The repo variable ECR_PUBLISH_ROLE_ARN set to arn:aws:iam::159553542841:role/awslabs-AOSH-GitHubActionsRole.
The manifest-buildspec-parity job runs on ubuntu-latest and is safe to run immediately. Mark this PR ready (gh pr ready) once the infra is applied and validated on nccl-tests.
Validation
build-image.yml / release-images.yml / image-manifest.yaml are YAML-valid; tag derivation matches buildspec.yaml byte-for-byte; the parity gate covers all six versions (incl. GDRCOPY_VERSION).
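The parity gate can be pictured as a simple key-by-key comparison. The following Python sketch is not the workflow's actual implementation: apart from GDRCOPY_VERSION, the key names are hypothetical, the GDRCOPY value is a placeholder, and the inputs are plain dictionaries standing in for values that would be parsed from the manifest and from buildspec.yaml.

```python
import sys

# Only GDRCOPY_VERSION is named in this PR description; the other five key
# names are hypothetical stand-ins for the remaining versions.
PARITY_KEYS = [
    "CUDA_VERSION",
    "EFA_VERSION",
    "OFI_VERSION",
    "NCCL_VERSION",
    "NCCL_TESTS_VERSION",
    "GDRCOPY_VERSION",
]


def find_drift(manifest_versions, buildspec_versions, keys=PARITY_KEYS):
    """Return {key: (manifest_value, buildspec_value)} for every mismatch.

    A key missing on either side is reported as drift (value None).
    """
    drift = {}
    for key in keys:
        manifest_value = manifest_versions.get(key)
        buildspec_value = buildspec_versions.get(key)
        if manifest_value != buildspec_value:
            drift[key] = (manifest_value, buildspec_value)
    return drift


def main():
    manifest = {
        "CUDA_VERSION": "13.0.2",
        "EFA_VERSION": "1.48.0",
        "OFI_VERSION": "v1.19.0",
        "NCCL_VERSION": "v2.30.4-1",
        "NCCL_TESTS_VERSION": "v2.18.3",
        "GDRCOPY_VERSION": "placeholder",  # real value not given in the source
    }
    buildspec = dict(manifest)  # identical copy, so no drift is expected
    drift = find_drift(manifest, buildspec)
    for key, (m, b) in drift.items():
        print(f"{key}: manifest={m!r} buildspec={b!r}")
    # A non-zero exit status is what would fail a CI job.
    sys.exit(1 if drift else 0)


if __name__ == "__main__":
    main()
```

Treating a missing key as drift is a deliberate choice in this sketch: it means a version dropped from one file cannot pass the gate silently.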