Review agent should flag floating-URL + pinned-checksum as high-severity correctness issue

## What happened

On [PR #1506](https://github.com/fullsend-ai/fullsend/pull/1506), the code agent added acli to the sandbox Containerfile using `https://acli.atlassian.com/linux/latest/…` (a floating URL) with SHA256 checksums per architecture. The review agent's first pass correctly flagged no-checksum as high severity. After checksums were added, the second pass downgraded the floating URL finding to **low** severity under the `supply-chain` category and **approved** the PR.

The human reviewer ([ralphbean's comment](https://github.com/fullsend-ai/fullsend/pull/1506#discussion_r2112822858)) pointed out this pattern will break builds on `main` whenever Atlassian releases a new CLI version — the checksum won't match and `sha256sum -c` will fail. This is not a theoretical risk; it's a deterministic future breakage.

The review agent's approval decision followed the documented logic correctly: low-severity-only findings → approve. The problem is in the severity classification, not the decision tree.

## What could go better

The review agent evaluated the floating URL exclusively through the **supply-chain** dimension, where checksums genuinely mitigate the binary substitution risk. It missed a separate finding under the **correctness** dimension: a floating URL combined with a pinned checksum creates a deterministic build failure when the upstream artifact changes. These are two distinct risks from the same code pattern.

I am confident this is a gap in the severity guidance rather than a model reasoning failure, because:
1. The agent's reasoning was internally consistent — it correctly noted checksums mitigate supply-chain risk.
2. The correctness dimension guidance in the code-review skill focuses on logic errors, edge cases, and test adequacy. It doesn't mention build reproducibility or CI reliability patterns.
3. The Containerfile context makes this pattern especially impactful — a broken image build blocks all agent work across the org.

Uncertainty: I haven't verified whether adding explicit guidance will reliably change the model's classification. The model might still under-weight build reproducibility even with guidance, especially when the supply-chain dimension already covers part of the finding. However, the existing pattern of dimension-specific severity anchors (e.g., permission expansions → always HIGH) shows that explicit guidance does influence classification in practice.

## Proposed change

Add a build reproducibility severity anchor to the **correctness** dimension guidance in `internal/scaffold/fullsend-repo/skills/code-review/SKILL.md`.

In the section that describes the correctness dimension's evaluation criteria, add a subsection analogous to the existing permission-manifest guidance under platform security:

```
**Build reproducibility (Dockerfiles, CI pipelines, install scripts):**

When a build artifact is downloaded from a URL that does not include an explicit version identifier (floating tags like `latest`, version-less paths, branch references):

- If the download has NO integrity verification: **high** — supply chain risk AND non-reproducible builds.
- If the download has integrity verification (checksums, signature validation) but the URL still floats: **high** — the checksum will fail deterministically when the upstream artifact changes, breaking CI on the default branch. This is a correctness issue distinct from supply-chain risk.
- If the download uses a versioned URL WITH integrity verification: no finding (follows existing patterns).

This applies to Dockerfiles, GitHub Actions, Makefiles, and any build script that fetches external artifacts.
```

This guidance ensures the agent produces **two separate findings** when both supply-chain and correctness risks are present, rather than collapsing them into a single supply-chain finding where checksums appear to mitigate the issue.

## Validation criteria

1. On the next PR that adds a floating-URL download with pinned checksums to a Dockerfile or CI pipeline, the review agent should produce a **high-severity correctness finding** about build reproducibility, separate from any supply-chain finding.
2. The review outcome should be `request-changes` (not `approve`) when a floating-URL + pinned-checksum pattern is present.
3. Verify across 3 review agent runs on PRs modifying Dockerfiles or CI configs that the agent consistently distinguishes supply-chain risk from build reproducibility risk as separate findings.

---
_Generated by retro agent from https://github.com/fullsend-ai/fullsend/pull/1506_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Review agent should flag floating-URL + pinned-checksum as high-severity correctness issue #1515

What happened

What could go better

Proposed change

Validation criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Review agent should flag floating-URL + pinned-checksum as high-severity correctness issue #1515

Description

What happened

What could go better

Proposed change

Validation criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions