refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan and stage hints #340
Open
Aryansharma28 wants to merge 2 commits into `feat/red-team-dynamic-techniques`
… plan + stage hints
Meta's GOAT paper (ICML 2025) does not pre-generate an attack plan via a
metaprompt LLM call, and the attacker's system prompt has no early/mid/late
stage hints. Adaptation is driven entirely by the per-turn score/hint
feedback that lives in the attacker's private conversation history
(H_attacker).
Changes:
- Add `needs_metaprompt_plan` (Python) / `needsMetapromptPlan` (TS)
property on the strategy interface, default True. GoatStrategy overrides
to False; CrescendoStrategy keeps True.
- Orchestrator (`call()`) consults the flag and skips `_generateAttackPlan`
when False, saving one LLM call on turn 1 and eliminating the
description-keyed stale-plan bug entirely for GOAT.
- Remove `_STAGES` / `STAGES` array and `_get_stage` / `getStage` methods
from GoatStrategy. `build_system_prompt` no longer renders a "Stage:" line
or an ATTACK PLAN section.
- Keep `get_phase_name` returning a coarse progress bucket
(`early`/`mid`/`late`) for telemetry dashboards only — this label is no
longer surfaced to the attacker.
- Drop `GOAT_METAPROMPT_TEMPLATE` constant and its public export (Python
`scenario.__init__.py` + JS `index.ts`). GOAT never renders a template.
- Simplify `goat()` / `redTeamGoat` factories: no more `setdefault` /
object-spread template injection dance (and no accompanying foot-gun).
- Update tests: rewrite GoatStrategy stage tests as progress-bucket tests;
add explicit assertions that GOAT prompts contain no ATTACK PLAN section
and no stage hints; drop obsolete template-rendering tests.
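
The flag-and-gate mechanism described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `needs_metaprompt_plan`, `GoatStrategy`, and `CrescendoStrategy` come from the change description and `RedTeamStrategy` from the follow-up commit, while `maybe_generate_plan` and the `generate_plan` callable are hypothetical stand-ins for the plan step inside the orchestrator's `call()`.

```python
from typing import Callable, List, Optional


class RedTeamStrategy:
    """Base strategy; by default the orchestrator pre-generates an attack plan."""

    @property
    def needs_metaprompt_plan(self) -> bool:
        return True


class CrescendoStrategy(RedTeamStrategy):
    """Keeps the default: a plan is generated on turn 1."""


class GoatStrategy(RedTeamStrategy):
    """Opts out: the GOAT attacker adapts from per-turn feedback only."""

    @property
    def needs_metaprompt_plan(self) -> bool:
        return False


def maybe_generate_plan(
    strategy: RedTeamStrategy,
    generate_plan: Callable[[], str],
) -> Optional[str]:
    """Hypothetical stand-in for the plan step in the orchestrator."""
    if not strategy.needs_metaprompt_plan:
        return None  # skip the metaprompt LLM call entirely
    return generate_plan()


# Count how many times the (fake) metaprompt LLM call is made.
calls: List[str] = []


def fake_plan() -> str:
    calls.append("llm")
    return "PLAN"


goat_plan = maybe_generate_plan(GoatStrategy(), fake_plan)
crescendo_plan = maybe_generate_plan(CrescendoStrategy(), fake_plan)
```

Because no plan is ever produced for GOAT in this sketch, there is nothing to cache or key by description, which is the property the stale-plan bullet relies on.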
Net effect: GOAT behaviour moves from ~60% to ~85% faithful to the paper.
The remaining gap is structured output (observation/strategy/reply JSON),
tracked in #2142 / #330.
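
For reference, the telemetry-only progress bucket that `get_phase_name` keeps could look something like the sketch below. The thirds-based thresholds, the 1-based turn numbering, and the standalone function name `progress_bucket` are assumptions made for illustration; the commit only states that the labels are `early`/`mid`/`late` and that they are no longer shown to the attacker.

```python
def progress_bucket(turn: int, total_turns: int) -> str:
    """Map a 1-based turn number to a coarse label for dashboards.

    The cut-offs (first third, second third, remainder) are assumed,
    not taken from the PR.
    """
    if total_turns <= 0:
        raise ValueError("total_turns must be positive")
    # Integer arithmetic avoids float rounding at the bucket boundaries.
    if 3 * turn <= total_turns:
        return "early"
    if 3 * turn <= 2 * total_turns:
        return "mid"
    return "late"


labels = [progress_bucket(t, 9) for t in range(1, 10)]
```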
Tests: 152 Python + 103 JS passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
This was referenced Apr 14, 2026
Open
…ly JSON (#341)

* feat(red-team): structured attacker output — observation / strategy / reply JSON

Meta's GOAT paper (ICML 2025) has the attacker emit a structured chain of thought at every turn: observation of the target's last response, strategy (which technique it will use and why), then the actual reply. We were letting the attacker emit free text, which lost the reasoning signal, made per-turn technique selection invisible to telemetry, and left no gate against the attacker skipping the "Thought" step entirely.

This commit implements the paper's contract:
- New `JSON_OUTPUT_CONTRACT` module constant (Python `_red_team/base.py`, TypeScript `red-team-strategy.ts`) appended to every attacker system prompt. Instructs the attacker to emit `{"observation": "...", "strategy": "...", "reply": "..."}` and nothing else.
- Added to both `CrescendoStrategy` and `GoatStrategy` system prompts so both strategies benefit.
- `RedTeamAgent._parse_attacker_output` (Python) / `parseAttackerOutput` (TS) parses the attacker's raw output into `(reply, observation, strategy)`. Strips markdown fences, handles malformed JSON, coerces non-string fields. Fallback on parse failure: the whole raw response becomes the `reply` and a WARN-level log fires; the scenario keeps running.
- Only `reply` reaches the target. `observation` and `strategy` are emitted as OpenTelemetry span attributes (`red_team.reasoning.observation`, `red_team.reasoning.strategy`, `red_team.reasoning.parse_failed`) so dashboards can answer "which technique works against which target?", the paper's core selling point.
- The raw JSON output is kept in H_attacker so the attacker sees its own format on subsequent turns (keeps the output shape consistent with the system prompt's directive).

Paper fidelity: moves GOAT from ~85% to ~95% faithful.

Tests: 163 Python (+11) and 115 JS (+12) passing.
- Parser: well-formed JSON, code-fence stripping (```json and ```), malformed JSON fallback, missing/empty reply fallback, non-object JSON fallback, non-string field coercion, whitespace trimming.
- Contract: OUTPUT FORMAT section present in both GOAT and Crescendo system prompts.
- Existing tests (which mock attacker with plain strings) continue to pass via the graceful fallback path.

Closes langwatch/scenario#2142 (structured attacker output).
Closes #330 (GOAT technique telemetry) once consumers wire the span attributes to dashboards.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor(red-team): scope JSON output contract to GOAT only

Crescendo does not emit structured output in Microsoft's Crescendo paper; applying the JSON contract to both strategies was scope creep for a GOAT-focused PR stack. Roll it back on Crescendo behind a new per-strategy flag so the parser only runs when the attacker is actually instructed to emit JSON.

Changes:
- Add `emits_structured_output` (Python) / `emitsStructuredOutput` (TS) property on RedTeamStrategy. Default False/falsy. GoatStrategy overrides to True.
- Remove `JSON_OUTPUT_CONTRACT` import and interpolation from CrescendoStrategy in both Python and TS.
- Gate the parser in `call()`: run it only when the strategy's `emits_structured_output` flag is set. Otherwise use raw attacker output as the reply with no parsing and no telemetry spam.
- Update tests: GoatStrategy asserts OUTPUT FORMAT present + flag true; CrescendoStrategy asserts OUTPUT FORMAT absent + flag falsy.

Tests: 164 Python (+1) / 116 JS (+1) passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(red-team): type-check Crescendo's optional emitsStructuredOutput via interface

CI typecheck in vitest-examples failed on:

    expect(new CrescendoStrategy().emitsStructuredOutput).toBeUndefined();

because `emitsStructuredOutput` is declared on the RedTeamStrategy interface as optional but never added to the Crescendo class, so direct access on the concrete type is a TS2339 error under strict mode. Access via the interface type so the optional property is visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
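
The parsing contract described in the first commit above (strip fences, tolerate malformed JSON, coerce non-string fields, fall back to the raw text) can be sketched as below. This is an illustration under assumptions rather than the method in the referenced PR: the standalone function name `parse_attacker_output`, the `red_team` logger name, the regex, and the choice to return `None` for missing fields are all hypothetical.

```python
import json
import logging
import re
from typing import Any, Optional, Tuple

logger = logging.getLogger("red_team")  # logger name is an assumption

# Matches a whole response wrapped in ``` or ```json fences.
_FENCE_RE = re.compile(r"^```(?:json)?\s*\n?(.*?)\n?```$", re.DOTALL)


def _as_text(value: Any) -> Optional[str]:
    """Coerce a JSON field to a trimmed string; None stays None."""
    if value is None:
        return None
    return value.strip() if isinstance(value, str) else str(value)


def parse_attacker_output(raw: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (reply, observation, strategy) from the attacker's raw output."""
    text = raw.strip()
    match = _FENCE_RE.match(text)
    if match:
        text = match.group(1).strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    # Non-object JSON (lists, numbers) and missing/empty replies also fall back.
    reply = _as_text(data.get("reply")) if isinstance(data, dict) else None
    if not reply:
        logger.warning("attacker output did not match the JSON contract; using raw text")
        return raw, None, None
    return reply, _as_text(data.get("observation")), _as_text(data.get("strategy"))


good = parse_attacker_output('{"observation": "o", "strategy": "s", "reply": " hi "}')
fenced = parse_attacker_output('```json\n{"reply": "hi"}\n```')
plain = parse_attacker_output("just free text")
```

In a gated setup like the one in the second commit, a caller would invoke this only when the strategy's `emits_structured_output` flag is set and otherwise pass the raw text through untouched.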
Automated low-risk assessment
This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging.
Summary
Stacks on #306 (GOAT strategy). Moves GOAT from ~60% to ~85% faithful to Meta's ICML 2025 GOAT paper by stripping two Crescendo-derived artifacts that aren't in the paper:
What this changes for users
Paper gap remaining (not in this PR)
The biggest remaining gap is structured chain-of-thought output (`observation` / `strategy` / `reply` JSON fields). Tracked in langwatch/langwatch#2142 and #330. That's the next follow-up (stacked branch `feat/red-team-structured-output`).
Test plan
Related
Part of EPIC: langwatch/langwatch#1713 (Scenarios Red Teaming)
Makes progress toward — but does not close — the following:
Does not address:
🤖 Generated with Claude Code