refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan and stage hints by Aryansharma28 · Pull Request #340 · langwatch/scenario

Aryansharma28 · 2026-04-14T11:57:47Z

Summary

Stacks on #306 (GOAT strategy). Moves GOAT from ~60% to ~85% faithful to Meta's ICML 2025 GOAT paper by stripping two Crescendo-derived artifacts that aren't in the paper:

No pre-generated attack plan. The paper's attacker reasons turn-by-turn from the technique catalogue + objective + conversation history. It doesn't pre-generate a plan via a separate metaprompt LLM call. We were, as a Crescendo carryover. Now gated behind a new `needs_metaprompt_plan` / `needsMetapromptPlan` property on the strategy — GoatStrategy returns `False`, so `_generate_attack_plan` is skipped entirely for GOAT (saves one LLM call on turn 1).
No stage hints in the attacker's system prompt. The paper has no early/mid/late concept — adaptation is driven entirely by the score/hint feedback in the attacker's private conversation history. Removed the `_STAGES` array, the `_get_stage` method, and the `Stage: EARLY — …` block from `build_system_prompt`. `get_phase_name` still returns `early`/`mid`/`late` for telemetry dashboards, but the label is no longer surfaced to the attacker.
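A minimal sketch of the gating described above, assuming a property-based flag on the strategy. The `Orchestrator` class, its constructor, and the `plan_calls` counter are hypothetical stand-ins for illustration; only `needs_metaprompt_plan`, `_generate_attack_plan`, `GoatStrategy`, and `CrescendoStrategy` are names taken from this PR, and the real classes may be shaped differently.

```python
class RedTeamStrategy:
    """Hypothetical stand-in for the strategy interface."""

    @property
    def needs_metaprompt_plan(self) -> bool:
        # Strategies plan up front unless they opt out.
        return True


class CrescendoStrategy(RedTeamStrategy):
    """Inherits the default, so the plan is still generated."""


class GoatStrategy(RedTeamStrategy):
    @property
    def needs_metaprompt_plan(self) -> bool:
        # GOAT reasons turn-by-turn, so no pre-generated plan.
        return False


class Orchestrator:
    """Hypothetical orchestrator; plan_generator stands in for the LLM call."""

    def __init__(self, strategy, plan_generator):
        self.strategy = strategy
        self._plan_generator = plan_generator
        self.attack_plan = None
        self.plan_calls = 0  # counts simulated metaprompt LLM calls

    def _generate_attack_plan(self, description):
        self.plan_calls += 1
        return self._plan_generator(description)

    def call(self, description):
        # Consult the flag; skip plan generation entirely when it is False.
        if self.strategy.needs_metaprompt_plan and self.attack_plan is None:
            self.attack_plan = self._generate_attack_plan(description)
        return self.attack_plan
```

With this shape, a GOAT run never touches the plan generator, which is where the saved LLM call on turn 1 would come from.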

What this changes for users

`RedTeamAgent.goat(...)` skips the metaprompt LLM call on turn 1 → ~33% cheaper startup, and the description-keyed stale-plan failure mode (red team: _attack_plan cache survives across scenarios with different descriptions #327) disappears for GOAT.
GOAT attacker prompt is smaller and more focused — no plan, no stage text. The attacker reasons off catalogue + score feedback alone, matching the paper's methodology.
`GOAT_METAPROMPT_TEMPLATE` is dropped from the public API (it's no longer rendered anywhere). Removed from `scenario.__init__.py` and `javascript/src/agents/red-team/index.ts`.
`goat()` / `redTeamGoat` factories simplified — no more `setdefault` / object-spread template dance, which also removes the foot-gun fixed in 81f9ef4.
Crescendo behaviour unchanged.

Paper gap remaining (not in this PR)

The biggest remaining gap is structured chain-of-thought output (`observation` / `strategy` / `reply` JSON fields). Tracked in langwatch/langwatch#2142 and #330. That's the next follow-up (stacked branch `feat/red-team-structured-output`).

Test plan

Python: 152 red-team tests pass (stage boundary tests rewritten as progress-bucket tests; prompt-contains-X assertions updated to assert ATTACK PLAN and Stage: are absent from GOAT prompts)
TypeScript: 103 red-team tests pass (same treatment)
Smoke test: `GoatStrategy.needs_metaprompt_plan is False`, `CrescendoStrategy.needs_metaprompt_plan is True`, GOAT prompt contains technique catalogue + turn number but no plan/stage blocks, `scenario.GOAT_METAPROMPT_TEMPLATE` no longer exists.
Manual: run a 10-turn GOAT attack on bank-demo to confirm behaviour is reasonable (requires API keys).

Related

#327 (red team: `_attack_plan` cache survives across scenarios with different descriptions): the bug is no longer reachable via the GOAT path (GOAT doesn't generate a plan); still live for Crescendo.
#328 (red team: strategies should own their default metaprompt template): sidestepped for GOAT by making the template irrelevant entirely; proper refactor still pending for Crescendo.
#333 (red team: edge cases: empty techniques list, empty `metaprompt_plan`, scorer failure cascade): the malformed prompt from an empty `metaprompt_plan` is no longer reachable for GOAT; still live for Crescendo.

Does not address:

#326 (red team: `injection_probability` desyncs H_attacker from what the target saw): still present, still documented as a caveat.
#330 (red team: capture which GOAT technique the attacker chose per turn, telemetry) and langwatch/scenario#2142 (structured output): next PR.

🤖 Generated with Claude Code

… plan + stage hints

Meta's GOAT paper (ICML 2025) does not pre-generate an attack plan via a metaprompt LLM call, and the attacker's system prompt has no early/mid/late stage hints. Adaptation is driven entirely by the per-turn score/hint feedback that lives in the attacker's private conversation history (H_attacker).

Changes:
- Add `needs_metaprompt_plan` (Python) / `needsMetapromptPlan` (TS) property on the strategy interface, default True. GoatStrategy overrides to False; CrescendoStrategy keeps True.
- Orchestrator (`call()`) consults the flag and skips `_generateAttackPlan` when False, saving one LLM call on turn 1 and eliminating the description-keyed stale-plan bug entirely for GOAT.
- Remove `_STAGES` / `STAGES` array and `_get_stage` / `getStage` methods from GoatStrategy. `build_system_prompt` no longer renders a "Stage:" line or an ATTACK PLAN section.
- Keep `get_phase_name` returning a coarse progress bucket (`early`/`mid`/`late`) for telemetry dashboards only; this label is no longer surfaced to the attacker.
- Drop `GOAT_METAPROMPT_TEMPLATE` constant and its public export (Python `scenario.__init__.py` + JS `index.ts`). GOAT never renders a template.
- Simplify `goat()` / `redTeamGoat` factories: no more `setdefault` / object-spread template injection dance (and no accompanying foot-gun).
- Update tests: rewrite GoatStrategy stage tests as progress-bucket tests; add explicit assertions that GOAT prompts contain no ATTACK PLAN section and no stage hints; drop obsolete template-rendering tests.

Net effect: GOAT behaviour moves from ~60% to ~85% faithful to the paper. The remaining gap is structured output (observation/strategy/reply JSON), tracked in #2142 / #330.

Tests: 152 Python + 103 JS passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
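To illustrate the split between a telemetry-only progress bucket and an attacker prompt with no plan or stage text, here is a rough sketch. The equal-thirds thresholds, the function signatures, and the prompt wording are all assumptions made for illustration; the commit does not state the real bucket boundaries or prompt layout.

```python
from typing import List


def get_phase_name(current_turn: int, total_turns: int) -> str:
    """Coarse progress bucket for telemetry only (assumed equal thirds)."""
    if total_turns <= 0:
        raise ValueError("total_turns must be positive")
    # Integer comparisons avoid floating-point edge cases at the boundaries.
    if 3 * current_turn <= total_turns:
        return "early"
    if 3 * current_turn <= 2 * total_turns:
        return "mid"
    return "late"


def build_system_prompt(
    objective: str, techniques: List[str], current_turn: int, total_turns: int
) -> str:
    """Hypothetical prompt builder: catalogue + objective + turn number only."""
    lines = [f"OBJECTIVE: {objective}", "TECHNIQUES:"]
    lines += [f"- {name}" for name in techniques]
    lines.append(f"Turn {current_turn} of {total_turns}")
    # Deliberately no plan section and no stage line: the bucket above
    # is never interpolated into the attacker-facing text.
    return "\n".join(lines)
```

The point of the design is that the bucket label can still feed dashboards while the attacker adapts only from the feedback in its own conversation history.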

This reverts commit e62c292. Reason: Claude squash-merged #306 without realizing it would auto-close #340 (whose base was #306's branch). Reverting so #306, #340, #341 can land in the proper stacked order.

…ly JSON (#341)

* feat(red-team): structured attacker output: observation / strategy / reply JSON

Meta's GOAT paper (ICML 2025) has the attacker emit a structured chain of thought at every turn: observation of the target's last response, strategy (which technique it will use and why), then the actual reply. We were letting the attacker emit free text, which lost the reasoning signal, made per-turn technique selection invisible to telemetry, and left no gate against the attacker skipping the "Thought" step entirely.

This commit implements the paper's contract:
- New `JSON_OUTPUT_CONTRACT` module constant (Python `_red_team/base.py`, TypeScript `red-team-strategy.ts`) appended to every attacker system prompt. Instructs the attacker to emit {"observation": "...", "strategy": "...", "reply": "..."} and nothing else.
- Added to both `CrescendoStrategy` and `GoatStrategy` system prompts so both strategies benefit.
- `RedTeamAgent._parse_attacker_output` (Python) / `parseAttackerOutput` (TS) parses the attacker's raw output into `(reply, observation, strategy)`. Strips markdown fences, handles malformed JSON, coerces non-string fields. Fallback on parse failure: the whole raw response becomes the `reply` and a WARN-level log fires, so the scenario keeps running.
- Only `reply` reaches the target. `observation` and `strategy` are emitted as OpenTelemetry span attributes (`red_team.reasoning.observation`, `red_team.reasoning.strategy`, `red_team.reasoning.parse_failed`) so dashboards can answer "which technique works against which target?", the paper's core selling point.
- The raw JSON output is kept in H_attacker so the attacker sees its own format on subsequent turns (keeps the output shape consistent with the system prompt's directive).

Paper fidelity: moves GOAT from ~85% to ~95% faithful.

Tests: 163 Python (+11) and 115 JS (+12) passing.
- Parser: well-formed JSON, code-fence stripping (```json and ```), malformed JSON fallback, missing/empty reply fallback, non-object JSON fallback, non-string field coercion, whitespace trimming.
- Contract: OUTPUT FORMAT section present in both GOAT and Crescendo system prompts.
- Existing tests (which mock attacker with plain strings) continue to pass via the graceful fallback path.

Closes langwatch/scenario#2142 (structured attacker output). Closes #330 (GOAT technique telemetry) once consumers wire the span attributes to dashboards.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor(red-team): scope JSON output contract to GOAT only

Crescendo does not emit structured output in Microsoft's Crescendo paper, so applying the JSON contract to both strategies was scope creep for a GOAT-focused PR stack. Roll it back on Crescendo behind a new per-strategy flag so the parser only runs when the attacker is actually instructed to emit JSON.

Changes:
- Add `emits_structured_output` (Python) / `emitsStructuredOutput` (TS) property on RedTeamStrategy. Default False/falsy. GoatStrategy overrides to True.
- Remove `JSON_OUTPUT_CONTRACT` import and interpolation from CrescendoStrategy in both Python and TS.
- Gate the parser in `call()`: run it only when the strategy's `emits_structured_output` flag is set. Otherwise use raw attacker output as the reply with no parsing and no telemetry spam.
- Update tests: GoatStrategy asserts OUTPUT FORMAT present + flag true; CrescendoStrategy asserts OUTPUT FORMAT absent + flag falsy.

Tests: 164 Python (+1) / 116 JS (+1) passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(red-team): type-check Crescendo's optional emitsStructuredOutput via interface

CI typecheck in vitest-examples failed on:

expect(new CrescendoStrategy().emitsStructuredOutput).toBeUndefined();

because `emitsStructuredOutput` is declared on the RedTeamStrategy interface as optional but never added to the Crescendo class, so direct access on the concrete type is a TS2339 error under strict mode. Access via the interface type so the optional property is visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
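The parsing behaviour described in these commits (fence stripping, malformed-JSON fallback, non-string coercion, whitespace trimming) could look roughly like the sketch below. This is not the code from `_parse_attacker_output`: the function name, the regex, the logger name, and the extra `parse_failed` element in the returned tuple are assumptions made for illustration, with the flag mirroring the `red_team.reasoning.parse_failed` span attribute.

```python
import json
import logging
import re
from typing import Optional, Tuple

logger = logging.getLogger("red_team_sketch")  # hypothetical logger name

# Matches a whole response wrapped in a ``` or ```json fence.
_FENCE_RE = re.compile(r"^```(?:json)?\s*\n?(.*?)\n?\s*```$", re.DOTALL)


def _as_text(value) -> Optional[str]:
    """Coerce a JSON field to a trimmed string; None if missing or empty."""
    if value is None:
        return None
    text = value if isinstance(value, str) else json.dumps(value)
    return text.strip() or None


def parse_attacker_output(
    raw: str,
) -> Tuple[str, Optional[str], Optional[str], bool]:
    """Return (reply, observation, strategy, parse_failed)."""
    text = raw.strip()
    fenced = _FENCE_RE.match(text)
    if fenced:
        text = fenced.group(1).strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    # Non-object JSON (lists, numbers) and missing/empty replies also fail.
    reply = _as_text(data.get("reply")) if isinstance(data, dict) else None
    if reply is None:
        logger.warning("attacker output not parseable; using raw text as reply")
        return raw, None, None, True
    observation = _as_text(data.get("observation"))
    strategy = _as_text(data.get("strategy"))
    return reply, observation, strategy, False
```

Under the per-strategy gating from the second commit, a caller would presumably invoke a parser like this only when the strategy's `emits_structured_output` flag is set, and otherwise pass the raw attacker text straight through as the reply.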

github-actions · 2026-04-14T13:36:36Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR changes runtime behavior and public API surface: it removes the pre-generated metaprompt plan, eliminates stage hints from attacker prompts, adds structured JSON parsing for attacker output, and skips an LLM call for GOAT. These are functional and API changes (GOAT_METAPROMPT_TEMPLATE removed) rather than purely documentation/tests, so they fall outside the low-risk categories and should receive a normal review.

This PR requires a manual review before merging.

Aryansharma28 added the "firefighting" label (Urgent fix that bypasses approval check) Apr 14, 2026

Aryansharma28 deleted the branch feat/red-team-dynamic-techniques April 14, 2026 13:06

Aryansharma28 closed this Apr 14, 2026

Aryansharma28 mentioned this pull request Apr 14, 2026

revert: premature squash merge of #306 to restore stacked PR workflow #345

Merged

Aryansharma28 removed the "firefighting" label (Urgent fix that bypasses approval check) Apr 14, 2026

Aryansharma28 reopened this Apr 14, 2026

github-actions bot added the "low-risk-change" label (PR qualifies as low-risk per policy and can be merged without manual review) Apr 14, 2026

Aryansharma28 closed this in #345 Apr 14, 2026

Aryansharma28 reopened this Apr 14, 2026

github-actions bot removed the "low-risk-change" label (PR qualifies as low-risk per policy and can be merged without manual review) Apr 14, 2026

Aryansharma28 mentioned this pull request Apr 14, 2026

feat: add GOAT strategy with dynamic technique selection for RedTeamAgent #346

Open

github-actions bot added the "low-risk-change" label (PR qualifies as low-risk per policy and can be merged without manual review) Apr 14, 2026

github-actions bot removed the "low-risk-change" label (PR qualifies as low-risk per policy and can be merged without manual review) Apr 14, 2026

Aryansharma28 wants to merge 2 commits into feat/red-team-dynamic-techniques from feat/red-team-goat-paper-fidelity
