feat(red-team): structured attacker output — observation/strategy/reply JSON#341
Merged
Aryansharma28 merged 3 commits intofeat/red-team-goat-paper-fidelityfrom Apr 14, 2026
Conversation
… reply JSON
Meta's GOAT paper (ICML 2025) has the attacker emit a structured chain of
thought at every turn: observation of the target's last response, strategy
(which technique it will use and why), then the actual reply. We were
letting the attacker emit free text — losing the reasoning signal, making
per-turn technique selection invisible to telemetry, and leaving no gate
against the attacker skipping the "Thought" step entirely.
This commit implements the paper's contract:
- New `JSON_OUTPUT_CONTRACT` module constant (Python `_red_team/base.py`,
TypeScript `red-team-strategy.ts`) appended to every attacker system
prompt. Instructs the attacker to emit
{"observation": "...", "strategy": "...", "reply": "..."}
and nothing else.
- Added to both `CrescendoStrategy` and `GoatStrategy` system prompts so
both strategies benefit.
- `RedTeamAgent._parse_attacker_output` (Python) / `parseAttackerOutput`
(TS) parses the attacker's raw output into `(reply, observation, strategy)`.
Strips markdown fences, handles malformed JSON, coerces non-string
fields. Fallback on parse failure: the whole raw response becomes the
`reply` and a WARN-level log fires — the scenario keeps running.
- Only `reply` reaches the target. `observation` and `strategy` are emitted
as OpenTelemetry span attributes (`red_team.reasoning.observation`,
`red_team.reasoning.strategy`, `red_team.reasoning.parse_failed`) so
dashboards can answer "which technique works against which target?" —
the paper's core selling point.
- The raw JSON output is kept in H_attacker so the attacker sees its own
format on subsequent turns (keeps the output shape consistent with the
system prompt's directive).
Paper fidelity: moves GOAT from ~85% to ~95% faithful.
Tests: 163 Python (+11) and 115 JS (+12) passing.
- Parser: well-formed JSON, code-fence stripping (```json and ```),
malformed JSON fallback, missing/empty reply fallback, non-object JSON
fallback, non-string field coercion, whitespace trimming.
- Contract: OUTPUT FORMAT section present in both GOAT and Crescendo
system prompts.
- Existing tests (which mock attacker with plain strings) continue to
pass via the graceful fallback path.
Closes langwatch/scenario#2142 (structured attacker output).
Closes #330 (GOAT technique telemetry) once consumers
wire the span attributes to dashboards.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
This was referenced Apr 14, 2026
Crescendo does not emit structured output in Microsoft's Crescendo paper —
applying the JSON contract to both strategies was scope creep for a
GOAT-focused PR stack. Roll it back on Crescendo behind a new per-strategy
flag so the parser only runs when the attacker is actually instructed to
emit JSON.
Changes:
- Add `emits_structured_output` (Python) / `emitsStructuredOutput` (TS)
property on RedTeamStrategy. Default False/falsy. GoatStrategy overrides
to True.
- Remove `JSON_OUTPUT_CONTRACT` import and interpolation from
CrescendoStrategy in both Python and TS.
- Gate the parser in `call()`: run it only when the strategy's
`emits_structured_output` flag is set. Otherwise use raw attacker
output as the reply with no parsing and no telemetry spam.
- Update tests: GoatStrategy asserts OUTPUT FORMAT present + flag true;
CrescendoStrategy asserts OUTPUT FORMAT absent + flag falsy.
Tests: 164 Python (+1) / 116 JS (+1) passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Aryansharma28
added a commit
that referenced
this pull request
Apr 14, 2026
Aryansharma28
added a commit
that referenced
this pull request
Apr 14, 2026
…via interface CI typecheck in vitest-examples failed on: expect(new CrescendoStrategy().emitsStructuredOutput).toBeUndefined(); because `emitsStructuredOutput` is declared on the RedTeamStrategy interface as optional but never added to the Crescendo class — so direct access on the concrete type is a TS2339 error under strict mode. Access via the interface type so the optional property is visible. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Contributor
|
Automated low-risk assessment This PR was evaluated against the repository's Low-Risk Pull Requests procedure.
This classification allows merging without manual review once all required CI checks are passing and branch protection rules are satisfied. |
93ba535
into
feat/red-team-goat-paper-fidelity
7 checks passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stacks on #340 (paper-fidelity refactor) and, transitively, #306 (GOAT strategy).
Implements the core innovation from Meta's GOAT paper (ICML 2025): the attacker emits a structured chain of thought at every turn —
observation(what the target just revealed),strategy(which technique it will use and why), then the actualreply. We were previously letting the attacker emit free text, which:Contract
New
JSON_OUTPUT_CONTRACTmodule constant is appended to every attacker system prompt. Tells the attacker to emit exactly:{"observation": "one sentence on target's last response", "strategy": "technique name + why (reference the catalogue)", "reply": "the actual message the target will see"}Added to BOTH
GoatStrategyandCrescendoStrategy— both strategies benefit from forced chain-of-thought and technique telemetry.Parser
RedTeamAgent._parse_attacker_output(Python) /parseAttackerOutput(TS)```/```jsonmarkdown fences (attacker sometimes wraps despite the contract)(reply, observation, strategy, parseFailed)reply→ the raw response becomes thereplyand a WARN-level log fires. The scenario keeps running. No attacker turn is wasted on a parse error.Data flow
replyreaches the target (goes into H_target).observationandstrategyare emitted as OpenTelemetry span attributes:red_team.reasoning.observationred_team.reasoning.strategyred_team.reasoning.parse_failedDashboards can now answer "which technique works against which target?" — the paper's core analytical question.
Paper fidelity
Before this PR (after #306 + #340): ~85%. After this PR: ~95%. Remaining gap is benchmark alignment (default turn count, binary judge mode, JailbreakBench harness) tracked separately.
Test plan
callAttackerLLMwith plain strings) continue to pass — verified the graceful fallback path preserves previous behaviourRelated
Part of EPIC: langwatch/langwatch#1713 (Scenarios Red Teaming)
Closes:
Stacks on:
🤖 Generated with Claude Code