Skip to content

Review agent should flag floating-URL + pinned-checksum as high-severity correctness issue #1515

@fullsend-ai-retro

Description

@fullsend-ai-retro

What happened

On PR #1506, the code agent added acli to the sandbox Containerfile using https://acli.atlassian.com/linux/latest/… (a floating URL) with SHA256 checksums per architecture. The review agent's first pass correctly flagged no-checksum as high severity. After checksums were added, the second pass downgraded the floating URL finding to low severity under the supply-chain category and approved the PR.

The human reviewer (ralphbean's comment) pointed out this pattern will break builds on main whenever Atlassian releases a new CLI version — the checksum won't match and sha256sum -c will fail. This is not a theoretical risk; it's a deterministic future breakage.

The review agent's approval decision followed the documented logic correctly: low-severity-only findings → approve. The problem is in the severity classification, not the decision tree.

What could go better

The review agent evaluated the floating URL exclusively through the supply-chain dimension, where checksums genuinely mitigate the binary substitution risk. It missed a separate finding under the correctness dimension: a floating URL combined with a pinned checksum creates a deterministic build failure when the upstream artifact changes. These are two distinct risks from the same code pattern.

I am confident this is a gap in the severity guidance rather than a model reasoning failure, because:

  1. The agent's reasoning was internally consistent — it correctly noted checksums mitigate supply-chain risk.
  2. The correctness dimension guidance in the code-review skill focuses on logic errors, edge cases, and test adequacy. It doesn't mention build reproducibility or CI reliability patterns.
  3. The Containerfile context makes this pattern especially impactful — a broken image build blocks all agent work across the org.

Uncertainty: I haven't verified whether adding explicit guidance will reliably change the model's classification. The model might still under-weight build reproducibility even with guidance, especially when the supply-chain dimension already covers part of the finding. However, the existing pattern of dimension-specific severity anchors (e.g., permission expansions → always HIGH) shows that explicit guidance does influence classification in practice.

Proposed change

Add a build reproducibility severity anchor to the correctness dimension guidance in internal/scaffold/fullsend-repo/skills/code-review/SKILL.md.

In the section that describes the correctness dimension's evaluation criteria, add a subsection analogous to the existing permission-manifest guidance under platform security:

**Build reproducibility (Dockerfiles, CI pipelines, install scripts):**

When a build artifact is downloaded from a URL that does not include an explicit version identifier (floating tags like `latest`, version-less paths, branch references):

- If the download has NO integrity verification: **high** — supply chain risk AND non-reproducible builds.
- If the download has integrity verification (checksums, signature validation) but the URL still floats: **high** — the checksum will fail deterministically when the upstream artifact changes, breaking CI on the default branch. This is a correctness issue distinct from supply-chain risk.
- If the download uses a versioned URL WITH integrity verification: no finding (follows existing patterns).

This applies to Dockerfiles, GitHub Actions, Makefiles, and any build script that fetches external artifacts.

This guidance ensures the agent produces two separate findings when both supply-chain and correctness risks are present, rather than collapsing them into a single supply-chain finding where checksums appear to mitigate the issue.

Validation criteria

  1. On the next PR that adds a floating-URL download with pinned checksums to a Dockerfile or CI pipeline, the review agent should produce a high-severity correctness finding about build reproducibility, separate from any supply-chain finding.
  2. The review outcome should be request-changes (not approve) when a floating-URL + pinned-checksum pattern is present.
  3. Verify across 3 review agent runs on PRs modifying Dockerfiles or CI configs that the agent consistently distinguishes supply-chain risk from build reproducibility risk as separate findings.

Generated by retro agent from #1506

Metadata

Metadata

Assignees

No one assigned

    Labels

    agent/reviewReview agentcomponent/harnessAgent harness, config, and skills loadingfeatureFeature-category issue awaiting human prioritizationgood-fullsend-issueSuitable for fullsend to tackle autonomously once self-hostedpriority/highSignificant impact, address soontriagedTriaged but awaiting human prioritization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status

    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions