RFC: GitHub Actions CI/CD with protected deployment environment and approval gates #73

RFC: Automated Deployment Pipeline with Protected Environments

Status: Draft (revised per feedback)

Author: @scottschreckengaust

Related: #70 (context stack names), #72 (ephemeral cleanup)


Summary

Establish a GitHub Actions deployment pipeline that:

  1. Builds and synthesizes CDK once per compute_type in build.yml (always all registered types)
  2. Stores cdk-<compute_type>.out as immutable deployment artifacts (synth once, deploy exact artifact)
  3. Gates all deployments behind a protected GitHub environment (deploy) requiring manual approval — triggered by deploy label (with optional type qualifiers)
  4. Deploys to AWS using OIDC federation assuming CDK bootstrap roles (no long-lived credentials)
  5. Names stacks main-<compute_type>-prd for production and uses ephemeral names for PRs and branches
  6. On successful deployment, creates a GitHub Release (drafted, then published) that tags main and attaches the cdk-*.out artifacts
  7. Cleans up stacks identified by their github:* tags (a stack qualifies when its github:sha tag is present and not none); cleanup is gated behind approval and uses cancel-in-progress concurrency

Decisions (from discussion)

| Question | Decision |
| --- | --- |
| PR deployments | Opt-in via deploy label (with optional type qualifiers) |
| Synth strategy | Once in build.yml for ALL registered compute_types; deploy the exact artifact, no re-synth |
| Cleanup approval | Always manually gated; later runs cancel prior pending requests |
| Cost gate | No; resource review in approval is sufficient |
| Permissions boundary | Yes; use CDK bootstrap roles (deploy, lookup, file-publishing, image-publishing) |
| main deploy approval | Always required; never skipped, even after PR merge |
| Deploy selection | Label-driven: deploy = all registered types, deploy:<type> = only that type |
| Baselines | Per compute_type against main-<compute_type>-prd; stored as release artifacts |

Design

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│  GitHub Actions                                                      │
│                                                                      │
│  build.yml (CI) — every push/PR                                      │
│  ├─ steps: install → compile → test → lint → synth (per compute_type)│
│  ├─ matrix: ALL registered compute_types (static list, always built) │
│  ├─ artifact: cdk-<compute_type>.out (immutable, uploaded per leg)   │
│  └─ output: stack_name, is_protected, compute_type                   │
│                                                                      │
│  deploy.yml (CD) — on `deploy` label OR main merge                   │
│  ├─ trigger: label added + build success, OR push to main            │
│  ├─ environment: "deploy" (ALWAYS requires approval, no bypass)      │
│  ├─ matrix: filtered by labels (deploy=all, deploy:<type>=one)       │
│  ├─ steps:                                                           │
│  │   ├─ download cdk-<compute_type>.out artifact (exact build output)│
│  │   ├─ configure-aws-credentials (OIDC → CDK bootstrap roles)       │
│  │   ├─ baseline-diff (compare vs last release baseline)             │
│  │   ├─ post diff summary to deployment log                          │
│  │   ├─ cdk deploy --app cdk-<compute_type>.out                      │
│  │   │    --require-approval never                                   │
│  │   └─ on success: draft release → tag → attach artifacts → publish │
│  └─ concurrency: one deploy at a time per stack                      │
│                                                                      │
│  cleanup.yml                                                         │
│  ├─ trigger: schedule (every 4h) + workflow_dispatch                 │
│  ├─ environment: "deploy" (ALWAYS requires approval)                 │
│  ├─ concurrency: cancel-in-progress (later runs cancel prior)        │
│  └─ steps: find stacks with github:* tags → force-detach ENIs → del  │
└─────────────────────────────────────────────────────────────────────┘
         │
         │ OIDC (aws-actions/configure-aws-credentials)
         │ role-to-assume: CDK deploy role
         ▼
┌─────────────────────────────────────────────────────────────────────┐
│  AWS Account                                                         │
│  ├─ IAM OIDC Provider (token.actions.githubusercontent.com)          │
│  ├─ CDK Bootstrap Roles (permissions boundary):                      │
│  │   ├─ cdk-hnb659fds-deploy-role-*                                  │
│  │   ├─ cdk-hnb659fds-lookup-role-*                                  │
│  │   ├─ cdk-hnb659fds-file-publishing-role-*                         │
│  │   └─ cdk-hnb659fds-image-publishing-role-*                        │
│  ├─ CloudFormation Stacks (tagged: github:sha != 'none')             │
│  │   ├─ main-agentcore-prd (protected, terminationProtection=true)   │
│  │   ├─ main-ecs-prd (protected, terminationProtection=true)         │
│  │   ├─ pr-42-abc1234-agentcore (ephemeral, tagged)                  │
│  │   └─ commit-abc1234-ecs (ephemeral, tagged)                       │
│  └─ CDK Bootstrap (cdk-toolkit stack)                                │
└─────────────────────────────────────────────────────────────────────┘

Label-Driven Deploy Selection

Key principle: Build ALL, deploy selectively

build.yml always synthesizes all registered compute_types (today: [agentcore]). Labels only control what deploy.yml deploys.

Labels

| Label | Types deployed | Use case |
| --- | --- | --- |
| deploy | All registered types | Standard full deployment |
| deploy:agentcore | agentcore only | Deploy only agentcore |
| deploy:ecs | ecs only | Deploy only ECS (when available) |
| deploy:* | All (same as deploy) | Explicit "all" synonym |
| No deploy* label | Nothing deployed | Default (CI only) |

Resolution logic (in deploy.yml)

- name: Resolve deploy targets from labels
  id: targets
  env:
    # Passed via env, not inline interpolation, so label text is never parsed as shell
    LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
    # All registered compute_types (must match build.yml matrix)
    ALL_TYPES: '["agentcore"]'
  run: |
    if echo "$LABELS" | jq -e 'index("deploy:*")' > /dev/null; then
      # deploy:* = all (explicit synonym)
      echo "matrix=$ALL_TYPES" >> "$GITHUB_OUTPUT"
    elif echo "$LABELS" | jq -e '[.[] | select(startswith("deploy:"))] | length > 0' > /dev/null; then
      # Specific type labels: deploy only those. Unregistered types are dropped,
      # since build.yml produced no artifact for them.
      # -c keeps the JSON on one line, which a key=value output requires.
      TYPES=$(echo "$LABELS" | jq -c --argjson all "$ALL_TYPES" \
        '[.[] | select(startswith("deploy:")) | ltrimstr("deploy:")] as $req | $all - ($all - $req)')
      echo "matrix=$TYPES" >> "$GITHUB_OUTPUT"
    elif echo "$LABELS" | jq -e 'index("deploy")' > /dev/null; then
      # Plain "deploy" = all registered types
      echo "matrix=$ALL_TYPES" >> "$GITHUB_OUTPUT"
    else
      echo 'matrix=[]' >> "$GITHUB_OUTPUT"
    fi
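
The precedence rules in the label table can also be written as a pure function, which makes them easy to unit-test outside a workflow. The following is a minimal sketch, not the RFC's implementation: the function name resolveDeployTargets and the choice to drop unregistered types are assumptions made for illustration.

```typescript
// Hypothetical helper; the RFC itself resolves targets in a deploy.yml shell step.
const REGISTERED_TYPES: readonly string[] = ["agentcore"]; // must match the build.yml matrix

function resolveDeployTargets(
  labels: readonly string[],
  registered: readonly string[] = REGISTERED_TYPES,
): string[] {
  // "deploy:*" is the explicit synonym for "all registered types".
  if (labels.indexOf("deploy:*") !== -1) return registered.slice();

  // Specific "deploy:<type>" labels win over a plain "deploy" label.
  const prefix = "deploy:";
  const requested = labels
    .filter((label) => label.indexOf(prefix) === 0)
    .map((label) => label.slice(prefix.length));
  if (requested.length > 0) {
    // Assumption: types without a build artifact (unregistered) are dropped.
    return registered.filter((type) => requested.indexOf(type) !== -1);
  }

  // Plain "deploy" selects everything; no deploy* label selects nothing.
  return labels.indexOf("deploy") !== -1 ? registered.slice() : [];
}
```

With a registry of ["agentcore", "ecs"], the labels ["deploy:ecs", "bug"] resolve to ["ecs"], and an unlabeled PR resolves to an empty matrix.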

Release Flow

Successful deployments from main produce GitHub Releases:

main merge
  → build.yml (synth ALL registered compute_types in matrix)
  → upload artifacts: cdk-agentcore.out, (cdk-ecs.out when available, ...)
  → deploy.yml (approval gate; downloads the exact artifacts; labels filter which types deploy)
  → successful deployment
  → Draft Release created:
      Tag: v<date>-<short-sha> (e.g. v2026.05.11-abc1234)
      Assets:
        - cdk-agentcore.out.tar.gz
        - (cdk-ecs.out.tar.gz when available)
        - agentcore.resource-types.json (baseline)
        - (ecs.resource-types.json when available)
  → Publish Release
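
The tag format v<date>-<short-sha> can be sketched as a small function. This is an illustration only: the name releaseTag, the use of UTC for the date, and the 7-character short SHA (inferred from the abc1234 example) are assumptions.

```typescript
// Hypothetical helper that builds a release tag such as v2026.05.11-abc1234.
function releaseTag(deployedAt: Date, commitSha: string): string {
  // Zero-pad month and day to two digits.
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  const date = [
    String(deployedAt.getUTCFullYear()),
    pad(deployedAt.getUTCMonth() + 1), // getUTCMonth() is zero-based
    pad(deployedAt.getUTCDate()),
  ].join(".");
  // Assumption: the short SHA is the first 7 characters of the commit SHA.
  return "v" + date + "-" + commitSha.slice(0, 7);
}
```

Using UTC keeps the tag independent of the runner's time zone, so two deployments of the same commit on the same UTC day produce the same tag.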

Baselines live in releases, not in the repo. The diff step downloads the baseline from the latest published release for that compute_type:

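- name: Download baseline from latest release
  env:
    GH_TOKEN: ${{ github.token }}  # gh needs a token inside Actions
  run: |
    # With no tag argument, gh release view returns the latest release
    LATEST=$(gh release view --json tagName -q .tagName 2>/dev/null || echo "")
    if [[ -n "$LATEST" ]]; then
      # A missing asset is tolerated: the diff then treats everything as new
      gh release download "$LATEST" \
        --pattern "${{ matrix.compute_type }}.resource-types.json" \
        --dir /tmp/baseline/ || true
    fi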

This means:

  • No baseline commits polluting the repo history
  • Baselines are immutable (tied to a release tag)
  • First deploy (no prior release) has no baseline → everything shows as "new" (correct)
  • Rollback = re-deploy from a prior release's cdk-*.out artifact

Synth-Once, Deploy-Exact Artifact

cdk.out is synthesized exactly once per compute_type, in build.yml. deploy.yml never re-synthesizes; it downloads and deploys that exact artifact:

# build.yml — always synths ALL registered types
strategy:
  matrix:
    compute_type: [agentcore]  # extend when new types are ready

# Context is generated into cdk/cdk.context.json before build
- name: Generate CDK context
  run: |
    jq -n \
      --arg compute_type "${{ matrix.compute_type }}" \
      --arg stackName "backgroundagent-dev" \
      --arg sha "$TAG_SHA" \
      ... \
      '{ "compute_type": $compute_type, "stackName": $stackName, "github:sha": $sha, ... }' \
      > cdk/cdk.context.json

- uses: actions/upload-artifact@v4
  with:
    name: cdk-${{ matrix.compute_type }}-out
    path: |
      cdk/cdk.out/
      cdk/cdk.context.json

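# deploy.yml (no synth; uses the exact artifact from build)
# Artifacts from a different workflow run also require the action's run-id
# and github-token inputs (values omitted here).
- uses: actions/download-artifact@v4
  with:
    name: cdk-${{ matrix.compute_type }}-out
    path: cdk-${{ matrix.compute_type }}.out/

# The artifact root is cdk/ (the common ancestor of the two uploaded paths),
# so the cloud assembly lands in the cdk.out/ subdirectory.
- name: Deploy
  run: npx cdk deploy --app cdk-${{ matrix.compute_type }}.out/cdk.out --all --require-approval never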

This guarantees what was tested in CI is exactly what gets deployed — no new Date() drift, no env var differences, no CDK version skew.


Permissions: CDK Bootstrap Role Assumption

The GitHub OIDC role needs permission only to assume the CDK bootstrap roles, plus the EC2 and CloudFormation actions used by cleanup. Delegating deployment to the bootstrap roles is the CDK security best practice:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": [
        "arn:aws:iam::ACCOUNT:role/cdk-hnb659fds-deploy-role-*",
        "arn:aws:iam::ACCOUNT:role/cdk-hnb659fds-lookup-role-*",
        "arn:aws:iam::ACCOUNT:role/cdk-hnb659fds-file-publishing-role-*",
        "arn:aws:iam::ACCOUNT:role/cdk-hnb659fds-image-publishing-role-*"
      ]
    },
    {
      "Sid": "CleanupENIs",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeNetworkInterfaces",
        "ec2:DetachNetworkInterface",
        "ec2:DeleteNetworkInterface",
        "cloudformation:ListStacks",
        "cloudformation:DescribeStacks",
        "cloudformation:DeleteStack",
        "cloudformation:ListStackResources"
      ],
      "Resource": "*"
    }
  ]
}

Trust policy (OIDC):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:aws-samples/sample-autonomous-cloud-coding-agents:*"
      }
    }
  }]
}
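
To see which workflow jobs the StringLike condition on sub admits, the sketch below re-implements the wildcard matching and applies it to the subject formats GitHub documents for environment, branch, and pull_request jobs. The stringLike function is a simplified stand-in for IAM's matcher, and the narrower environment-only pattern is a hypothetical alternative, not something the RFC specifies.

```typescript
// Simplified stand-in for IAM StringLike matching:
// "*" matches any run of characters, "?" matches exactly one.
function stringLike(pattern: string, value: string): boolean {
  // Escape regex metacharacters other than the two wildcards.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const source = "^" + escaped.replace(/\*/g, "[\\s\\S]*").replace(/\?/g, "[\\s\\S]") + "$";
  return new RegExp(source).test(value);
}

const repo = "repo:aws-samples/sample-autonomous-cloud-coding-agents";

// Subject claim formats for different job contexts.
const subjects: string[] = [
  repo + ":environment:deploy",  // job that declares `environment: deploy`
  repo + ":ref:refs/heads/main", // branch job without an environment
  repo + ":pull_request",        // pull_request job without an environment
];

// The trust policy's pattern accepts all three subjects.
const acceptedByWildcard = subjects.filter((s) => stringLike(repo + ":*", s));

// Hypothetical narrower pattern: only jobs bound to the deploy environment.
const acceptedByEnvironment = subjects.filter((s) =>
  stringLike(repo + ":environment:deploy", s),
);
```

Since every deploy and cleanup job in this design runs in the deploy environment, pinning sub to the environment form may be worth considering if jobs outside the approval gate should not be able to assume the role.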

Stack Naming and Tagging

| Git ref | Label | Stack name | Protected |
| --- | --- | --- | --- |
| main | (auto) | main-agentcore-prd | true |
| main | deploy:ecs | main-ecs-prd | true |
| PR #42 | deploy | pr-42-abc1234-agentcore | false |
| PR #42 | deploy:ecs | pr-42-abc1234-ecs | false |
| Branch push | deploy | commit-abc1234-agentcore | false |
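
The naming rules in the table above can be sketched as a function from git ref and compute_type to stack name and protection flag. The GitRef type and the stackTarget name are hypothetical, and the 7-character short SHA is inferred from the abc1234 examples.

```typescript
// Hypothetical model of the three ref kinds in the naming table.
type GitRef =
  | { kind: "main" }
  | { kind: "pr"; number: number; sha: string }
  | { kind: "branch"; sha: string };

interface StackTarget {
  stackName: string;
  isProtected: boolean;
}

function stackTarget(ref: GitRef, computeType: string): StackTarget {
  switch (ref.kind) {
    case "main":
      // Production stacks are the only protected ones.
      return { stackName: `main-${computeType}-prd`, isProtected: true };
    case "pr":
      return {
        stackName: `pr-${ref.number}-${ref.sha.slice(0, 7)}-${computeType}`,
        isProtected: false,
      };
    case "branch":
      return {
        stackName: `commit-${ref.sha.slice(0, 7)}-${computeType}`,
        isProtected: false,
      };
  }
}
```

For example, stackTarget({ kind: "pr", number: 42, sha: "abc1234def" }, "ecs") yields the stack name pr-42-abc1234-ecs with isProtected set to false.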

All stacks deployed via this pipeline are identified by the 13 github:* tags applied via CDK context (PR #91, #93). Cleanup recognizes CI-deployed stacks by a github:sha tag whose value is not none. Each stack additionally carries a compute_type tag:

Tags.of(stack).add('compute_type', computeType);

The compute_type tag enables per-type baseline queries and cost attribution.


GitHub Environment: deploy

| Setting | Value | Rationale |
| --- | --- | --- |
| Required reviewers | ≥1 reviewer, NOT the actor who triggered | Prevents self-approval |
| Wait timer | 0 | Manual approval is the gate |
| Deployment branches | All branches | Allow PR deploys via label |
| Allow administrators to bypass | No | No bypass for anyone |
| Prevent self-review | Yes | Enforce separation of duties |

Environment secrets:

| Secret | Value |
| --- | --- |
| AWS_ROLE_ARN | arn:aws:iam::ACCOUNT:role/GitHubActionsCDKRole |
| AWS_REGION | us-east-1 |

Cleanup Workflow

name: Cleanup Ephemeral Stacks
on:
  schedule:
    - cron: '0 */4 * * *'
  workflow_dispatch:
    inputs:
      max_age_hours:
        description: 'Max age in hours (0 = all non-protected)'
        default: '0'
      dry_run:
        description: 'Dry run mode'
        type: boolean
        default: true

concurrency:
  group: cleanup-ephemeral
  cancel-in-progress: true  # later runs cancel prior pending requests

jobs:
  cleanup:
    runs-on: ubuntu-latest
    environment: deploy  # ALWAYS requires approval
    permissions:
      id-token: write
      contents: read

    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ secrets.AWS_REGION }}

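      - name: Run cleanup
        env:
          MAX_AGE_HOURS: ${{ inputs.max_age_hours || '0' }}
          # Inputs are empty on scheduled runs. Assumption: the script reads
          # DRY_RUN from the environment, as it does MAX_AGE_HOURS.
          DRY_RUN: ${{ inputs.dry_run || 'false' }}
        run: ./scripts/cleanup-ephemeral-stacks.sh --tag-key github:sha --tag-value-not none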
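
The selection rule the cleanup script is expected to apply can be sketched as a predicate over stack metadata. This is an illustration under stated assumptions, not the contents of cleanup-ephemeral-stacks.sh: the StackInfo shape and the isCleanupCandidate name are hypothetical, and age is assumed to be measured from stack creation time.

```typescript
// Hypothetical, simplified view of a CloudFormation stack.
interface StackInfo {
  name: string;
  tags: { [key: string]: string | undefined };
  terminationProtection: boolean;
  createdAt: Date;
}

function isCleanupCandidate(stack: StackInfo, now: Date, maxAgeHours: number): boolean {
  // Only pipeline-deployed stacks carry a github:sha tag with a real value.
  const sha = stack.tags["github:sha"];
  if (sha === undefined || sha === "none") return false;

  // Protected production stacks (main-*-prd) are never candidates.
  if (stack.terminationProtection) return false;

  // max_age_hours = 0 means "all non-protected", as in the workflow input.
  if (maxAgeHours === 0) return true;

  // Assumption: age is counted from creation, not from the last update.
  const ageHours = (now.getTime() - stack.createdAt.getTime()) / (60 * 60 * 1000);
  return ageHours >= maxAgeHours;
}
```

Checking the tag before the protection flag means stacks that were never deployed by this pipeline are skipped regardless of their other settings.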

Resource Baseline and Diff (via Releases)

Diff output example (shown to approver in Step Summary)

## ⚠️ New AWS Resource Types (agentcore)

The following resource types are NEW compared to latest release v2026.05.10-fa647ca:

  + AWS::EKS::Cluster
  + AWS::EKS::Nodegroup
  + AWS::IAM::OpenIDConnectProvider

Approver action: Verify cost model, quotas, security posture, and cleanup behavior.

## Resource count: 47 → 50 (+3)
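
A summary like the one above could be produced by comparing the set of resource types in the synthesized template against the baseline. The sketch below assumes the baseline file (for example agentcore.resource-types.json) is a JSON array of CloudFormation type strings; that format, and the names resourceTypesOf and diffResourceTypes, are assumptions for illustration.

```typescript
// Minimal shape of a synthesized CloudFormation template.
interface CfnTemplate {
  Resources?: { [logicalId: string]: { Type: string } };
}

interface ResourceTypeDiff {
  added: string[];
  removed: string[];
}

// Collect the distinct resource types of a template, sorted for stable output.
function resourceTypesOf(template: CfnTemplate): string[] {
  const resources = template.Resources || {};
  const types = Object.keys(resources).map((id) => resources[id].Type);
  return types.filter((t, i) => types.indexOf(t) === i).sort();
}

// Compare current types against the baseline from the latest release.
function diffResourceTypes(
  baseline: readonly string[],
  current: readonly string[],
): ResourceTypeDiff {
  return {
    added: current.filter((t) => baseline.indexOf(t) === -1),
    removed: baseline.filter((t) => current.indexOf(t) === -1),
  };
}
```

With an empty baseline (first deploy, no prior release), every current type lands in added, which matches the "everything shows as new" behavior described earlier.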

Approval Gate: What Reviewers Should Check

The deployment summary provides:

  1. Resource type diff from baseline (new/removed services)
  2. Full cdk diff (property-level changes from the synthesized artifact)
  3. Compute type and stack name being deployed
  4. Labels that triggered the deployment

Per new resource type, verify:

| Check | How |
| --- | --- |
| Cost model | AWS Pricing / awspricing MCP |
| Service quotas | aws service-quotas list-service-quotas --service-code <code> |
| Security posture | Public endpoints? VPC-only? Encryption at rest? |
| IAM blast radius | What * permissions does CDK grant? |
| Cleanup behavior | RemovalPolicy.DESTROY? Orphan risk? |
| Regional availability | Available in target region? |

Implementation Plan

Phase 1: Foundation

  • Create GitHub environment deploy (no self-approval, no bypass, prevent self-review)
  • Set up AWS OIDC provider
  • Create GitHub Actions role with sts:AssumeRole to CDK bootstrap roles
  • CDK bootstrap the target account

Phase 2: Build pipeline

Phase 3: Deploy pipeline

Phase 4: Cleanup

  • Add ENI cleanup permissions to the GitHub-assumed role, if necessary
  • Update cleanup-ephemeral-stacks.sh to target by github:sha tag presence
  • Create cleanup.yml with approval gate and cancel-in-progress
  • Schedule every 4h

Phase 5: Observability

  • CloudWatch alarms (stack count, ENI leaks, cost)
  • Document approval checklist in CONTRIBUTING.md

Security Considerations

  • No long-lived credentials: OIDC only → assumes CDK bootstrap roles
  • Permissions boundary: the GitHub role can ONLY assume the 4 CDK bootstrap roles and call the ENI and CloudFormation cleanup actions
  • No self-approval: Enforced at GitHub environment level
  • No admin bypass: Even org owners must get approval
  • Audit trail: GitHub deployment history + CloudTrail
  • Tag-based targeting: Cleanup identifies stacks by github:sha tag (applied to all CI-deployed stacks)
  • Termination protection: main-*-prd stacks cannot be accidentally deleted
  • Artifact integrity: What CI tested is exactly what gets deployed (no re-synth)
