refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan and stage hints#340

Open
Aryansharma28 wants to merge 2 commits into feat/red-team-dynamic-techniques from feat/red-team-goat-paper-fidelity

@Aryansharma28
Contributor

Summary

Stacks on #306 (GOAT strategy). Moves GOAT from ~60% to ~85% faithful to Meta's ICML 2025 GOAT paper by stripping two Crescendo-derived artifacts that aren't in the paper:

  1. No pre-generated attack plan. The paper's attacker reasons turn-by-turn from the technique catalogue, the objective, and the conversation history; it does not pre-generate a plan via a separate metaprompt LLM call. Our implementation did, as a Crescendo carryover. Plan generation is now gated behind a new `needs_metaprompt_plan` / `needsMetapromptPlan` property on the strategy: GoatStrategy returns `False`, so `_generate_attack_plan` is skipped entirely for GOAT (saves one LLM call on turn 1).
  2. No stage hints in the attacker's system prompt. The paper has no early/mid/late concept — adaptation is driven entirely by the score/hint feedback in the attacker's private conversation history. Removed the `_STAGES` array, the `_get_stage` method, and the `Stage: EARLY — …` block from `build_system_prompt`. `get_phase_name` still returns `early`/`mid`/`late` for telemetry dashboards, but the label is no longer surfaced to the attacker.
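
A minimal sketch of the gating described in item 1, under simplifying assumptions. Only the `needs_metaprompt_plan` property and the `_generate_attack_plan` method name come from this PR; the `Orchestrator` class, the `plan_calls` counter, and the plan caching are hypothetical stand-ins for the real `call()` path.

```python
class RedTeamStrategy:
    """Illustrative base class; the real strategy interface has more members."""

    @property
    def needs_metaprompt_plan(self) -> bool:
        # Default True so existing strategies keep their pre-generated plan.
        return True


class CrescendoStrategy(RedTeamStrategy):
    """Inherits the default: a plan is generated on turn 1."""


class GoatStrategy(RedTeamStrategy):
    @property
    def needs_metaprompt_plan(self) -> bool:
        # GOAT reasons turn-by-turn, so no plan is requested.
        return False


class Orchestrator:
    """Hypothetical stand-in for the agent's call() path."""

    def __init__(self, strategy: RedTeamStrategy) -> None:
        self.strategy = strategy
        self.attack_plan = None
        self.plan_calls = 0  # counts simulated metaprompt LLM calls

    def _generate_attack_plan(self) -> str:
        self.plan_calls += 1  # stands in for one metaprompt LLM call
        return "placeholder plan"

    def call(self) -> None:
        # Consult the flag before spending an LLM call on a plan.
        if self.strategy.needs_metaprompt_plan and self.attack_plan is None:
            self.attack_plan = self._generate_attack_plan()
```

With this shape, a strategy opts out by overriding one property and the orchestrator needs no strategy-specific branches.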

What this changes for users

  • `RedTeamAgent.goat(...)` skips the metaprompt LLM call on turn 1 → ~33% cheaper startup, and the description-keyed stale-plan failure mode (#327, "red team: _attack_plan cache survives across scenarios with different descriptions") disappears for GOAT.
  • GOAT attacker prompt is smaller and more focused — no plan, no stage text. The attacker reasons off catalogue + score feedback alone, matching the paper's methodology.
  • `GOAT_METAPROMPT_TEMPLATE` is dropped from the public API (it's no longer rendered anywhere). Removed from `scenario.__init__.py` and `javascript/src/agents/red-team/index.ts`.
  • `goat()` / `redTeamGoat` factories simplified — no more `setdefault` / object-spread template dance, which also removes the foot-gun fixed in 81f9ef4.
  • Crescendo behaviour unchanged.

Paper gap remaining (not in this PR)

The biggest remaining gap is structured chain-of-thought output (`observation` / `strategy` / `reply` JSON fields). Tracked in langwatch/langwatch#2142 and #330. That's the next follow-up (stacked branch `feat/red-team-structured-output`).

Test plan

  • Python: 152 red-team tests pass (stage boundary tests rewritten as progress-bucket tests; prompt-contains-X assertions updated to assert ATTACK PLAN and Stage: are absent from GOAT prompts)
  • TypeScript: 103 red-team tests pass (same treatment)
  • Smoke test: `GoatStrategy.needs_metaprompt_plan is False`, `CrescendoStrategy.needs_metaprompt_plan is True`, GOAT prompt contains technique catalogue + turn number but no plan/stage blocks, `scenario.GOAT_METAPROMPT_TEMPLATE` no longer exists.
  • Manual: run a 10-turn GOAT attack on bank-demo to confirm behaviour is reasonable (requires API keys).

Related

Part of EPIC: langwatch/langwatch#1713 (Scenarios Red Teaming)

Makes progress toward — but does not close — the following:

Does not address:

🤖 Generated with Claude Code

… plan + stage hints

Meta's GOAT paper (ICML 2025) does not pre-generate an attack plan via a
metaprompt LLM call, and the attacker's system prompt has no early/mid/late
stage hints. Adaptation is driven entirely by the per-turn score/hint
feedback that lives in the attacker's private conversation history
(H_attacker).

Changes:
  - Add `needs_metaprompt_plan` (Python) / `needsMetapromptPlan` (TS)
    property on the strategy interface, default True. GoatStrategy overrides
    to False; CrescendoStrategy keeps True.
  - Orchestrator (`call()`) consults the flag and skips `_generateAttackPlan`
    when False, saving one LLM call on turn 1 and eliminating the
    description-keyed stale-plan bug entirely for GOAT.
  - Remove `_STAGES` / `STAGES` array and `_get_stage` / `getStage` methods
    from GoatStrategy. `build_system_prompt` no longer renders a "Stage:" line
    or an ATTACK PLAN section.
  - Keep `get_phase_name` returning a coarse progress bucket
    (`early`/`mid`/`late`) for telemetry dashboards only — this label is no
    longer surfaced to the attacker.
  - Drop `GOAT_METAPROMPT_TEMPLATE` constant and its public export (Python
    `scenario.__init__.py` + JS `index.ts`). GOAT never renders a template.
  - Simplify `goat()` / `redTeamGoat` factories: no more `setdefault` /
    object-spread template injection dance (and no accompanying foot-gun).
  - Update tests: rewrite GoatStrategy stage tests as progress-bucket tests;
    add explicit assertions that GOAT prompts contain no ATTACK PLAN section
    and no stage hints; drop obsolete template-rendering tests.
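
A sketch of what a telemetry-only progress bucket could look like. The one-third boundaries and the `(current_turn, total_turns)` signature are assumptions for illustration; the commit only states that `get_phase_name` returns `early`/`mid`/`late` and is no longer shown to the attacker.

```python
def get_phase_name(current_turn: int, total_turns: int) -> str:
    """Map a turn number to a coarse bucket for dashboards only."""
    if total_turns <= 0:
        raise ValueError("total_turns must be positive")
    # Integer comparisons avoid floating-point edge cases at the boundaries.
    if 3 * current_turn < total_turns:
        return "early"
    if 3 * current_turn < 2 * total_turns:
        return "mid"
    return "late"
```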

Net effect: GOAT behaviour moves from ~60% to ~85% faithful to the paper.
The remaining gap is structured output (observation/strategy/reply JSON),
tracked in #2142 / #330.

Tests: 152 Python + 103 JS passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@Aryansharma28 added the `firefighting` label (Urgent fix that bypasses approval check) on Apr 14, 2026
@Aryansharma28 Aryansharma28 deleted the branch feat/red-team-dynamic-techniques April 14, 2026 13:06
Aryansharma28 added a commit that referenced this pull request Apr 14, 2026
This reverts commit e62c292.

Reason: Claude squash-merged #306 without realizing it would auto-close #340
(whose base was #306's branch). Reverting so #306, #340, #341 can land in the
proper stacked order.
@Aryansharma28 removed the `firefighting` label (Urgent fix that bypasses approval check) on Apr 14, 2026
@Aryansharma28 Aryansharma28 reopened this Apr 14, 2026
@github-actions bot added the `low-risk-change` label (PR qualifies as low-risk per policy and can be merged without manual review) on Apr 14, 2026
Aryansharma28 added a commit that referenced this pull request Apr 14, 2026
…#345)

This reverts commit e62c292.

Reason: Claude squash-merged #306 without realizing it would auto-close #340
(whose base was #306's branch). Reverting so #306, #340, #341 can land in the
proper stacked order.
@Aryansharma28 Aryansharma28 reopened this Apr 14, 2026
@github-actions bot removed the `low-risk-change` label (PR qualifies as low-risk per policy and can be merged without manual review) on Apr 14, 2026
@github-actions bot added the `low-risk-change` label (PR qualifies as low-risk per policy and can be merged without manual review) on Apr 14, 2026
…ly JSON (#341)

* feat(red-team): structured attacker output — observation / strategy / reply JSON

Meta's GOAT paper (ICML 2025) has the attacker emit a structured chain of
thought at every turn: observation of the target's last response, strategy
(which technique it will use and why), then the actual reply. We were
letting the attacker emit free text — losing the reasoning signal, making
per-turn technique selection invisible to telemetry, and leaving no gate
against the attacker skipping the "Thought" step entirely.

This commit implements the paper's contract:

- New `JSON_OUTPUT_CONTRACT` module constant (Python `_red_team/base.py`,
  TypeScript `red-team-strategy.ts`) appended to every attacker system
  prompt. Instructs the attacker to emit
      {"observation": "...", "strategy": "...", "reply": "..."}
  and nothing else.

- Added to both `CrescendoStrategy` and `GoatStrategy` system prompts so
  both strategies benefit.

- `RedTeamAgent._parse_attacker_output` (Python) / `parseAttackerOutput`
  (TS) parses the attacker's raw output into `(reply, observation, strategy)`.
  Strips markdown fences, handles malformed JSON, coerces non-string
  fields. Fallback on parse failure: the whole raw response becomes the
  `reply` and a WARN-level log fires — the scenario keeps running.

- Only `reply` reaches the target. `observation` and `strategy` are emitted
  as OpenTelemetry span attributes (`red_team.reasoning.observation`,
  `red_team.reasoning.strategy`, `red_team.reasoning.parse_failed`) so
  dashboards can answer "which technique works against which target?" —
  the paper's core selling point.

- The raw JSON output is kept in H_attacker so the attacker sees its own
  format on subsequent turns (keeps the output shape consistent with the
  system prompt's directive).
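
The parsing behaviour listed above can be sketched as follows. This is an illustrative approximation, not the code from the commit: the regular expression, the `_as_text` helper, and the logger name are assumptions, and the real `_parse_attacker_output` may differ in detail.

```python
import json
import logging
import re

logger = logging.getLogger("red_team_sketch")  # hypothetical logger name

# Matches output wrapped in a ```json ... ``` or bare ``` ... ``` fence.
_FENCE = re.compile(r"^```(?:json)?\s*\n?(.*?)\n?```$", re.DOTALL)


def _as_text(value) -> str:
    """Coerce a JSON field to a trimmed string; None becomes ''."""
    if value is None:
        return ""
    return value.strip() if isinstance(value, str) else str(value)


def parse_attacker_output(raw: str):
    """Return (reply, observation, strategy).

    On any parse failure the whole raw response becomes the reply,
    so the scenario keeps running.
    """
    text = raw.strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1).strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    # Non-object JSON (lists, strings, numbers) counts as a failure too.
    reply = _as_text(data.get("reply")) if isinstance(data, dict) else ""
    if not reply:
        logger.warning("attacker output was not structured JSON; using raw text")
        return raw, "", ""
    return reply, _as_text(data.get("observation")), _as_text(data.get("strategy"))
```

Treating a missing or empty `reply` as a failure means the target always receives some text, at the cost of occasionally forwarding raw JSON.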

Paper fidelity: moves GOAT from ~85% to ~95% faithful.

Tests: 163 Python (+11) and 115 JS (+12) passing.
  - Parser: well-formed JSON, code-fence stripping (```json and ```),
    malformed JSON fallback, missing/empty reply fallback, non-object JSON
    fallback, non-string field coercion, whitespace trimming.
  - Contract: OUTPUT FORMAT section present in both GOAT and Crescendo
    system prompts.
  - Existing tests (which mock attacker with plain strings) continue to
    pass via the graceful fallback path.

Closes langwatch/scenario#2142 (structured attacker output).
Closes #330 (GOAT technique telemetry) once consumers
wire the span attributes to dashboards.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor(red-team): scope JSON output contract to GOAT only

Crescendo does not emit structured output in Microsoft's Crescendo paper —
applying the JSON contract to both strategies was scope creep for a
GOAT-focused PR stack. Roll it back on Crescendo behind a new per-strategy
flag so the parser only runs when the attacker is actually instructed to
emit JSON.

Changes:
  - Add `emits_structured_output` (Python) / `emitsStructuredOutput` (TS)
    property on RedTeamStrategy. Default False/falsy. GoatStrategy overrides
    to True.
  - Remove `JSON_OUTPUT_CONTRACT` import and interpolation from
    CrescendoStrategy in both Python and TS.
  - Gate the parser in `call()`: run it only when the strategy's
    `emits_structured_output` flag is set. Otherwise use raw attacker
    output as the reply with no parsing and no telemetry spam.
  - Update tests: GoatStrategy asserts OUTPUT FORMAT present + flag true;
    CrescendoStrategy asserts OUTPUT FORMAT absent + flag falsy.
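
A rough sketch of the gate described in the changes above. The flag name and the span attribute keys are taken from the commit messages; `extract_reply`, `_toy_parse`, and the use of a plain class attribute instead of a property are simplifications assumed here for illustration.

```python
import json


class RedTeamStrategy:
    # Default falsy: strategies emit free text unless they opt in.
    emits_structured_output = False


class GoatStrategy(RedTeamStrategy):
    emits_structured_output = True


class CrescendoStrategy(RedTeamStrategy):
    """Keeps the default; its output is never parsed."""


def _toy_parse(raw: str):
    """Hypothetical parser stand-in returning (reply, observation, strategy)."""
    data = json.loads(raw)
    return data["reply"], data.get("observation", ""), data.get("strategy", "")


def extract_reply(strategy, raw: str, parse=_toy_parse):
    """Return (reply, telemetry_attributes) for one attacker turn."""
    if not getattr(strategy, "emits_structured_output", False):
        # Raw output goes to the target untouched: no parsing, no telemetry.
        return raw, {}
    reply, observation, strategy_note = parse(raw)
    return reply, {
        "red_team.reasoning.observation": observation,
        "red_team.reasoning.strategy": strategy_note,
    }
```

Only the returned `reply` would be sent to the target; the attribute dictionary stands in for what the real code attaches to an OpenTelemetry span.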

Tests: 164 Python (+1) / 116 JS (+1) passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(red-team): type-check Crescendo's optional emitsStructuredOutput via interface

CI typecheck in vitest-examples failed on:
  expect(new CrescendoStrategy().emitsStructuredOutput).toBeUndefined();
because `emitsStructuredOutput` is declared on the RedTeamStrategy interface
as optional but never added to the Crescendo class — so direct access on the
concrete type is a TS2339 error under strict mode.

Access via the interface type so the optional property is visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
@github-actions bot removed the `low-risk-change` label (PR qualifies as low-risk per policy and can be merged without manual review) on Apr 14, 2026
@github-actions
Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR changes runtime behavior and public API surface: it removes the pre-generated metaprompt plan, eliminates stage hints from attacker prompts, adds structured JSON parsing for attacker output, and skips an LLM call for GOAT. These are functional and API changes (GOAT_METAPROMPT_TEMPLATE removed) rather than purely documentation/tests, so they fall outside the low-risk categories and should receive a normal review.

This PR requires a manual review before merging.
