Skip to content

Commit f57b931

Browse files
ayushtr-awsbgagentisadekskrokoko
authored
feat(cdk): Phase-0 deploy-then-verify integ tests (#236) (#295)
* feat(cdk): add Phase-0 deploy-then-verify integ tests (#236) Introduce automated deploy-then-verify integration testing via @aws-cdk/integ-tests-alpha + integ-runner — the first synth→deploy→ assert→destroy coverage against real AWS (ADR-013 Tier 3 / Phase 0). - cdk/test/integ/integ.task-api-smoke.ts: a trimmed TaskApiSmokeStack (TaskApi + task/events tables; omits orchestrator + AgentCore Runtime/Memory to avoid colliding with the live backgroundagent-dev stack and to keep the task at SUBMITTED). Asserts the create-and- persist happy path: Cognito adminCreateUser → adminSetUserPassword → initiateAuth → POST /tasks → GET /tasks/{id} (200, SUBMITTED) → DynamoDB getItem (user_id persisted). Forced teardown on success and failure. - .github/workflows/integ.yml: workflow_dispatch + nightly schedule (never per-PR — it deploys billable resources to a shared account); OIDC creds, concurrency guard, pinned action SHAs, and an if: always() force-destroy safety net. - cdk/mise.toml: mise //cdk:integ task. - cdk/package.json: pin @aws-cdk/integ-runner 2.199.0 + @aws-cdk/integ-tests-alpha 2.257.0-alpha.0 (devDeps); exclude test/integ/ from jest test + coverage paths. - .gitignore: ignore regenerated *.snapshot/ and cdk-integ.out.* dirs. - docs(ROADMAP, ADR-013): record Phase 0 landed (+ Starlight mirrors). * fix(integ): scrub account ID, randomize integ password, harden teardown Address security-review findings on the Phase-0 integ test (#236): - Remove the hardcoded AWS account ID from three comments (integ.yml header, integ.task-api-smoke.ts constructor, cdk/mise.toml task desc). The runner deploys into whatever account the active credentials / secrets.AWS_ROLE_TO_ASSUME resolve to, so naming the account added no value and leaked it into source. - Generate the throwaway Cognito password per-synth via randomBytes instead of a hardcoded literal, so no credential-shaped string lives in the tree. Still satisfies the default Cognito policy by construction. - Fix the workflow teardown safety-net: `cdk destroy backgroundagent-integ` synthesizes src/main.ts, which does not contain the integ stack, so it was a no-op. Delete the stack by its literal CloudFormation name via the AWS CLI (idempotent) instead. - Add a gitleaks rule to catch bare 12-digit AWS account IDs going forward, with allowlists for AWS-published placeholders and lockfiles/snapshots. * fix(integ): gate dispatch to main, default region, drop dead snapshot flag Address pre-merge review on the Phase-0 integ harness (#236): - Add `if: github.ref == 'refs/heads/main'` to the integ job. workflow_dispatch can target any branch and the job assumes the privileged integ role, so without this guard a feature-branch edit to test/integ/*.ts could be dispatched against that role. Complements the OIDC trust-policy restriction. - Default aws-region to us-east-1 when vars.AWS_REGION is unset, so the credentials action never runs region-less. - Drop --update-on-failed from the integ-runner invocation: .snapshot/ is gitignored, so there is no committed snapshot to diff against or update. --force still re-runs the deploy-then-verify unconditionally. * fix(integ): run integ job in `integ` GitHub Environment The job assumes AWS_ROLE_TO_ASSUME, which is an environment-scoped secret (deploy.yml's diff/deploy jobs each declare an environment), so a job with no environment resolves it to empty and configure-aws-credentials fails. The GitHub role's OIDC trust also gates on `sub = repo:...:environment:<name>`, so the integ workflow must run in a named environment whose name is added to the trust policy. Declare `environment: integ`. Account-side setup (separate, owner has IAM access) creates the `integ` GitHub Environment, scopes the role/region secrets to it, and adds `...:environment:integ` to the role's trust `sub`. * feat(integ): drive integ smoke per-PR via workflow_run, admin-gated Rework the integ workflow to mirror the deploy.yml model so a reviewer sees a merge-blocking check on the PR instead of manually dispatching a job. build.yml completes -> workflow_run resolves whether the diff touches cdk/** or agent/** -> the `integ` environment's required reviewer approves -> deploy/assert/destroy runs against the shared account -> an `integ-smoke` commit status is posted back to the PR head. Docs/cli-only PRs get an immediate green skip so the required check never deadlocks. Fork PRs run behind the same approval gate (the approver authorizes fork-authored test code to run with the privileged role). Nightly schedule dropped; workflow_dispatch retained (main only). Refs #236. * fix(integ): address review nits — error handling, test quality, fork-PR safety - integ.yml: drop unused actions:read; add EXIT-trap error status, explicit /files failure handling (no false-green skip), and a safe-to-test fork label gate in resolve; guard report job on resolve success; region-pin and fail-loud the teardown wait. - .gitleaks.toml: report aws-account-id capture group 1 (bare 12-digit ID). - integ smoke test: add unauthenticated POST assertion (expect 401); clarify the user_id assertion comment (identity binding proven transitively by the authenticated GET). - ADR-013: document residual-risk acceptance for fork-PR integ execution. --------- Co-authored-by: bgagent <bgagent@noreply.github.com> Co-authored-by: Sphia Sadek <isadeks@gmail.com> Co-authored-by: Alain Krok <alkrok@amazon.com>
1 parent 8c0f5e3 commit f57b931

11 files changed

Lines changed: 629 additions & 2 deletions

File tree

.github/workflows/integ.yml

Lines changed: 275 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,275 @@
1+
name: integ
2+
3+
# Phase-0 deploy-then-verify integration tests (issue #236). integ-runner
4+
# deploys a trimmed Task API stack into the shared account behind
5+
# secrets.AWS_ROLE_TO_ASSUME, runs the create-and-persist smoke assertions, then
6+
# tears the stack down.
7+
#
8+
# Trigger model mirrors deploy.yml: build.yml completes -> workflow_run picks it
9+
# up in the trusted base-repo context (secrets/OIDC available even for fork PRs)
10+
# -> we resolve whether the PR touches cdk/** or agent/** -> an admin approves
11+
# the `integ` environment gate -> deploy/assert/destroy runs against the shared
12+
# account -> a commit status `integ-smoke` is posted back to the PR head so it
13+
# shows up as a (required) check that blocks merge.
14+
#
15+
# Local dev path is unchanged: run `mise //cdk:integ` with your own AWS creds.
16+
#
17+
# Nightly schedule was intentionally dropped (previously 07:00 UTC) — the per-PR
18+
# path plus manual dispatch is the agreed coverage; this is not an oversight.
19+
on:
20+
# zizmor: ignore[dangerous-triggers] — intentional; workflow_run is required so
21+
# fork PRs can run against the shared account (a fork `pull_request` job gets no
22+
# secrets/OIDC). Mitigations: build-success guard, path-filter, `integ`
23+
# environment approval gate (admin reviews fork test code before it runs with
24+
# the privileged role), least-privilege role, status-only tokens per job.
25+
workflow_run:
26+
workflows: [build]
27+
types: [completed]
28+
workflow_dispatch: {}
29+
30+
# Only one integ run at a time against the shared account — overlapping deploys
31+
# would collide on the single hardcoded `backgroundagent-integ` stack name.
32+
concurrency:
33+
group: cdk-integ
34+
cancel-in-progress: false
35+
36+
permissions: {}
37+
38+
jobs:
39+
# Decides whether this PR needs the integ run (touches cdk/** or agent/**) and
40+
# posts the gating `integ-smoke` status. Always runs on a successful build so
41+
# docs/cli-only PRs get an immediate green (skipped) status and never deadlock
42+
# the required check.
43+
resolve:
44+
# Manual dispatch is restricted to main (defence in depth — the `integ`
45+
# environment approval is the primary gate). PR runs come via workflow_run.
46+
if: >-
47+
(github.event_name == 'workflow_dispatch' && github.ref == 'refs/heads/main') ||
48+
(github.event_name == 'workflow_run' && github.event.workflow_run.conclusion == 'success')
49+
runs-on: ubuntu-latest
50+
permissions:
51+
statuses: write
52+
pull-requests: read
53+
outputs:
54+
applicable: ${{ steps.decide.outputs.applicable }}
55+
head_sha: ${{ steps.decide.outputs.head_sha }}
56+
head_repo: ${{ steps.decide.outputs.head_repo }}
57+
steps:
58+
- name: Resolve applicability and post pending status
59+
id: decide
60+
env:
61+
GH_TOKEN: ${{ github.token }}
62+
REPO: ${{ github.repository }}
63+
EVENT_NAME: ${{ github.event_name }}
64+
# Empty for workflow_dispatch.
65+
PR_NUMBER_FROM_EVENT: ${{ github.event.workflow_run.pull_requests[0].number }}
66+
WF_HEAD_SHA: ${{ github.event.workflow_run.head_sha }}
67+
WF_HEAD_REPO: ${{ github.event.workflow_run.head_repository.full_name }}
68+
run: |
69+
set -euo pipefail
70+
71+
# workflow_dispatch: no PR context — run against the dispatched ref
72+
# (the job's own checkout defaults). Mark applicable, skip status post.
73+
if [[ "$EVENT_NAME" == "workflow_dispatch" ]]; then
74+
echo "applicable=true" >> "$GITHUB_OUTPUT"
75+
echo "head_sha=${GITHUB_SHA}" >> "$GITHUB_OUTPUT"
76+
echo "head_repo=${REPO}" >> "$GITHUB_OUTPUT"
77+
echo "Manual dispatch — running integ against ${GITHUB_SHA}."
78+
exit 0
79+
fi
80+
81+
HEAD_SHA="$WF_HEAD_SHA"
82+
echo "head_sha=${HEAD_SHA}" >> "$GITHUB_OUTPUT"
83+
echo "head_repo=${WF_HEAD_REPO}" >> "$GITHUB_OUTPUT"
84+
85+
# Track whether we've posted a terminal integ-smoke status. If the job
86+
# dies (failed API call, runner crash) before reaching one, the EXIT
87+
# trap posts `error` so the required check resolves instead of hanging
88+
# pending forever and silently blocking merge.
89+
STATUS_POSTED=""
90+
91+
post_status() {
92+
# $1=state $2=description
93+
gh api -X POST "repos/$REPO/statuses/$HEAD_SHA" \
94+
-f context=integ-smoke \
95+
-f state="$1" \
96+
-f description="$2" \
97+
-f target_url="${{ github.server_url }}/$REPO/actions/runs/${{ github.run_id }}" \
98+
>/dev/null
99+
STATUS_POSTED="yes"
100+
}
101+
102+
on_exit() {
103+
rc=$?
104+
# Only meaningful in the workflow_run (PR) context and only if we have
105+
# a SHA to post against and haven't already posted a terminal status.
106+
if [[ $rc -ne 0 && -z "$STATUS_POSTED" && -n "${HEAD_SHA:-}" ]]; then
107+
gh api -X POST "repos/$REPO/statuses/$HEAD_SHA" \
108+
-f context=integ-smoke \
109+
-f state=error \
110+
-f description="resolve step failed before gating" \
111+
-f target_url="${{ github.server_url }}/$REPO/actions/runs/${{ github.run_id }}" \
112+
>/dev/null 2>&1 || true
113+
fi
114+
}
115+
trap on_exit EXIT
116+
117+
resolve_pr_number() {
118+
if [[ -n "$PR_NUMBER_FROM_EVENT" ]]; then
119+
echo "$PR_NUMBER_FROM_EVENT"
120+
return
121+
fi
122+
gh api "repos/$REPO/commits/$HEAD_SHA/pulls" --jq '.[0].number // empty' 2>/dev/null || true
123+
}
124+
PR_NUMBER=$(resolve_pr_number)
125+
126+
if [[ -z "$PR_NUMBER" ]]; then
127+
echo "::warning::No PR resolved for $HEAD_SHA — nothing to gate; skipping."
128+
echo "applicable=false" >> "$GITHUB_OUTPUT"
129+
exit 0
130+
fi
131+
132+
# Fork-PR safety: only run fork-authored code after a maintainer has
133+
# applied the `safe-to-test` label (defence in depth on top of the
134+
# `integ` environment approval). If it's absent, leave the status
135+
# pending and don't run — re-trigger once the label is added.
136+
if [[ "$WF_HEAD_REPO" != "$REPO" ]]; then
137+
if ! LABELS=$(gh api "repos/$REPO/issues/$PR_NUMBER/labels" --jq '.[].name'); then
138+
echo "::error::Failed to read labels for PR #$PR_NUMBER."
139+
exit 1
140+
fi
141+
if ! echo "$LABELS" | grep -qx 'safe-to-test'; then
142+
post_status pending "awaiting safe-to-test label on fork PR"
143+
echo "applicable=false" >> "$GITHUB_OUTPUT"
144+
echo "Fork PR #$PR_NUMBER lacks safe-to-test label — not running."
145+
exit 0
146+
fi
147+
fi
148+
149+
# Path-filter must happen here (not on.pull_request.paths) because the
150+
# trigger is workflow_run. Fail loud on API error: a failed or truncated
151+
# /files response must NOT fall through to a false-green skip. With
152+
# `set -e`, an assignment inside an `if !` condition does not trip
153+
# errexit, so we handle the failure explicitly and let the EXIT trap
154+
# post `error`.
155+
if ! CHANGED=$(gh api "repos/$REPO/pulls/$PR_NUMBER/files" --paginate --jq '.[].filename'); then
156+
echo "::error::Failed to list changed files for PR #$PR_NUMBER."
157+
exit 1
158+
fi
159+
if echo "$CHANGED" | grep -Eq '^(cdk|agent)/'; then
160+
post_status pending "awaiting admin approval / running"
161+
echo "applicable=true" >> "$GITHUB_OUTPUT"
162+
echo "PR #$PR_NUMBER touches cdk/** or agent/** — integ applies."
163+
else
164+
post_status success "skipped — no cdk/** or agent/** changes"
165+
echo "applicable=false" >> "$GITHUB_OUTPUT"
166+
echo "PR #$PR_NUMBER has no cdk/** or agent/** changes — integ skipped (green)."
167+
fi
168+
169+
# The admin-gated deploy -> assert -> destroy. The `integ` environment's
170+
# required reviewer is the approval gate; while it waits, the integ-smoke
171+
# status stays pending and merge stays blocked.
172+
integ:
173+
needs: resolve
174+
if: needs.resolve.outputs.applicable == 'true'
175+
name: CDK integ smoke (Task API)
176+
runs-on: ubuntu-latest
177+
environment: integ
178+
timeout-minutes: 45
179+
permissions:
180+
id-token: write
181+
contents: read
182+
env:
183+
CI: "true"
184+
MISE_EXPERIMENTAL: "1"
185+
steps:
186+
- name: Checkout PR head (incl. forks)
187+
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
188+
with:
189+
# Approving the `integ` environment authorizes this fork-authored test
190+
# code to run with the privileged role — the approver MUST review
191+
# cdk/test/integ/** changes before approving.
192+
repository: ${{ needs.resolve.outputs.head_repo }}
193+
ref: ${{ needs.resolve.outputs.head_sha }}
194+
persist-credentials: false
195+
196+
- name: Configure AWS credentials
197+
uses: aws-actions/configure-aws-credentials@7474bc4690e29a8392af63c5b98e7449536d5c3a # v4.3.1
198+
with:
199+
role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
200+
# Fall back to us-east-1 if the repo variable is unset, so the action
201+
# never runs region-less (which would fail credential resolution).
202+
aws-region: ${{ vars.AWS_REGION || 'us-east-1' }}
203+
204+
- name: Install mise
205+
uses: jdx/mise-action@1648a7812b9aeae629881980618f079932869151 # v4.0.1
206+
with:
207+
cache: true
208+
209+
- name: Setup Node.js
210+
uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0
211+
with:
212+
node-version: 22.x
213+
214+
- name: Install dependencies
215+
run: yarn install --immutable
216+
217+
- name: Run integ tests (deploy → assert → destroy)
218+
run: mise //cdk:integ
219+
220+
# Safety net: integ-runner forces teardown on success and failure, but if
221+
# the run is cancelled or crashes mid-deploy the stack can be stranded in
222+
# the shared account. Delete it directly via CloudFormation so we never
223+
# leak billable resources.
224+
#
225+
# NOTE: `cdk destroy backgroundagent-integ` would NOT work here — it
226+
# synthesizes the main app (src/main.ts), which does not contain the integ
227+
# stack, so it exits 0 having deleted nothing. Target the stack by its
228+
# literal CloudFormation name instead. delete-stack is idempotent (no-op if
229+
# already gone), so `|| true` only guards transient API errors.
230+
- name: Ensure stack torn down
231+
if: always()
232+
env:
233+
AWS_REGION: ${{ vars.AWS_REGION || 'us-east-1' }}
234+
AWS_DEFAULT_REGION: ${{ vars.AWS_REGION || 'us-east-1' }}
235+
run: |
236+
set -euo pipefail
237+
aws cloudformation delete-stack --stack-name backgroundagent-integ || true
238+
# No `|| true` on the wait: a DELETE_FAILED must surface loudly so we
239+
# never silently leak billable resources in the shared account.
240+
aws cloudformation wait stack-delete-complete --stack-name backgroundagent-integ
241+
242+
# Post the final integ-smoke status back to the PR head so the check flips from
243+
# pending to success/failure. Skipped for workflow_dispatch (no PR to gate).
244+
report:
245+
needs: [resolve, integ]
246+
if: >-
247+
always() &&
248+
needs.resolve.result == 'success' &&
249+
needs.resolve.outputs.applicable == 'true' &&
250+
github.event_name == 'workflow_run'
251+
runs-on: ubuntu-latest
252+
permissions:
253+
statuses: write
254+
steps:
255+
- name: Post final integ-smoke status
256+
env:
257+
GH_TOKEN: ${{ github.token }}
258+
REPO: ${{ github.repository }}
259+
HEAD_SHA: ${{ needs.resolve.outputs.head_sha }}
260+
INTEG_RESULT: ${{ needs.integ.result }}
261+
run: |
262+
set -euo pipefail
263+
if [[ "$INTEG_RESULT" == "success" ]]; then
264+
STATE=success
265+
DESC="deploy → assert → destroy passed"
266+
else
267+
STATE=failure
268+
DESC="integ run ${INTEG_RESULT}"
269+
fi
270+
gh api -X POST "repos/$REPO/statuses/$HEAD_SHA" \
271+
-f context=integ-smoke \
272+
-f state="$STATE" \
273+
-f description="$DESC" \
274+
-f target_url="${{ github.server_url }}/$REPO/actions/runs/${{ github.run_id }}" \
275+
>/dev/null

.gitignore

Lines changed: 7 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

.gitleaks.toml

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,3 +18,44 @@ stopwords = ["wat-opaque-123"]
1818
[[allowlists]]
1919
description = "Test fixture signing secret in Slack verification unit test (not a real credential)."
2020
stopwords = ["test-signing-secret-abc123"]
21+
22+
# Catch bare 12-digit AWS account IDs. The default ruleset does not flag these,
23+
# which is how a real account ID reached a committed comment in the #236 integ
24+
# work. RE2 (Go) has no lookarounds, so the non-digit neighbours are captured in
25+
# non-capturing groups; the account ID itself is group 1.
26+
[[rules]]
27+
id = "aws-account-id"
28+
description = "AWS account ID (12-digit, bare or in ARN/role context)"
29+
regex = '''(?:^|[^0-9])([0-9]{12})(?:[^0-9]|$)'''
30+
# Report/redact capture group 1 (the bare 12-digit ID) rather than the whole
31+
# match, which would include the non-digit delimiter neighbours.
32+
secretGroup = 1
33+
keywords = [
34+
"account",
35+
"aws",
36+
"arn:aws",
37+
"iam::",
38+
"role/",
39+
"assumed-role",
40+
"AWS_ACCOUNT",
41+
"AccountId",
42+
"account_id",
43+
]
44+
45+
[[rules.allowlists]]
46+
description = "AWS-published placeholder IDs, localstack default, and lockfiles/snapshots."
47+
regexes = [
48+
'''123456789012''',
49+
'''000000000000''',
50+
'''111122223333''',
51+
'''444455556666''',
52+
]
53+
paths = [
54+
'''(.*/)?test/.*\.snapshot/.*''',
55+
'''(.*/)?CHANGELOG\.md$''',
56+
'''(.*/)?yarn\.lock$''',
57+
'''(.*/)?package-lock\.json$''',
58+
'''(.*/)?uv\.lock$''',
59+
]
60+
# Per-line opt-out: append `# gitleaks:allow account-id` on a legitimate line.
61+
stopwords = ["gitleaks:allow account-id"]

cdk/mise.toml

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,17 @@ run = "npx cdk destroy"
5858
description = "cdk diff"
5959
run = "npx cdk diff"
6060

61+
[tasks.integ]
62+
description = "CDK deploy-then-verify integration tests (integ-runner). Needs AWS creds for the shared integ account."
63+
depends = [":compile"]
64+
run = [
65+
"mkdir -p $TMPDIR",
66+
# No --update-on-failed: .snapshot/ is gitignored, so there is no committed
67+
# snapshot to diff against or update. --force re-runs the deploy-then-verify
68+
# unconditionally, which is what we want in CI.
69+
"npx integ-runner --language typescript --directory test/integ --force",
70+
]
71+
6172
[tasks.bundle]
6273
description = "No-op bundle step for cdk/cdk.json (NodejsFunction bundles at synth time)"
6374
run = "node -e \"process.exit(0)\""

cdk/package.json

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,8 @@
3737
"ulid": "^3.0.2"
3838
},
3939
"devDependencies": {
40+
"@aws-cdk/integ-runner": "2.199.0",
41+
"@aws-cdk/integ-tests-alpha": "2.257.0-alpha.0",
4042
"@cdklabs/eslint-plugin": "^2",
4143
"@stylistic/eslint-plugin": "^2",
4244
"@types/aws-lambda": "^8.10.161",
@@ -90,10 +92,12 @@
9092
}
9193
},
9294
"coveragePathIgnorePatterns": [
93-
"/node_modules/"
95+
"/node_modules/",
96+
"<rootDir>/test/integ/"
9497
],
9598
"testPathIgnorePatterns": [
96-
"/node_modules/"
99+
"/node_modules/",
100+
"<rootDir>/test/integ/"
97101
],
98102
"watchPathIgnorePatterns": [
99103
"/node_modules/"

0 commit comments

Comments
 (0)