feat(red-team): structured attacker output — observation/strategy/reply JSON by Aryansharma28 · Pull Request #341 · langwatch/scenario

Aryansharma28 · 2026-04-14T12:08:10Z

Summary

Stacks on #340 (paper-fidelity refactor) and, transitively, #306 (GOAT strategy).

Implements the core innovation from Meta's GOAT paper (ICML 2025): the attacker emits a structured chain of thought at every turn — observation (what the target just revealed), strategy (which technique it will use and why), then the actual reply. We were previously letting the attacker emit free text, which:

Lost the reasoning signal — the "Thought" step was optional and often skipped
Made per-turn technique selection invisible to telemetry (the biggest gap, per issue #330, "red team: capture which GOAT technique the attacker chose per turn (telemetry)")
Prevented the kind of per-target analytics the paper measures

Contract

A new JSON_OUTPUT_CONTRACT module constant is appended to every attacker system prompt. It tells the attacker to emit exactly:

{"observation": "one sentence on target's last response",
 "strategy": "technique name + why (reference the catalogue)",
 "reply": "the actual message the target will see"}

Added to BOTH GoatStrategy and CrescendoStrategy — both strategies benefit from forced chain-of-thought and technique telemetry.
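
A minimal sketch of how a contract like this could be appended to a strategy's system prompt. The wording of the constant and the `build_system_prompt` helper are assumptions made for illustration; only the three field names and the "OUTPUT FORMAT" section name come from this PR.

```python
# Illustrative wording only: the PR's real JSON_OUTPUT_CONTRACT text is not reproduced here.
JSON_OUTPUT_CONTRACT = """
OUTPUT FORMAT
Respond with a single JSON object and nothing else:
{"observation": "one sentence on target's last response",
 "strategy": "technique name + why (reference the catalogue)",
 "reply": "the actual message the target will see"}
"""


def build_system_prompt(base_prompt: str) -> str:
    """Hypothetical helper: append the output contract to a strategy's base prompt."""
    return base_prompt.rstrip() + "\n" + JSON_OUTPUT_CONTRACT


prompt = build_system_prompt("You are a red-team attacker probing a target model.")
print(prompt)
```

Keeping the contract in one module-level constant means every strategy that opts in shares the same wording, so the parser only has to handle one format.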

Parser

RedTeamAgent._parse_attacker_output (Python) / parseAttackerOutput (TS)
Strips ``` / ```json markdown fences (attacker sometimes wraps despite the contract)
Returns (reply, observation, strategy, parseFailed)
Graceful degradation: malformed JSON or missing reply → the raw response becomes the reply and a WARN-level log fires. The scenario keeps running. No attacker turn is wasted on a parse error.
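
The parsing rules above can be sketched as follows. This is not the PR's implementation: the function name, the regex, and the logger name are assumptions, and only the behaviour (fence stripping, field coercion, whitespace trimming, fallback to the raw response with a warning) follows the description.

```python
import json
import logging
import re
from typing import Optional, Tuple

logger = logging.getLogger("red_team")  # logger name is an assumption

# Matches an optional markdown code fence (with or without a json tag)
# wrapped around the whole payload.
_FENCE_RE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL | re.IGNORECASE)


def _as_text(value: object) -> Optional[str]:
    """Coerce a JSON field to a trimmed string; None stays None."""
    if value is None:
        return None
    return value.strip() if isinstance(value, str) else str(value)


def parse_attacker_output(raw: str) -> Tuple[str, Optional[str], Optional[str], bool]:
    """Return (reply, observation, strategy, parse_failed)."""
    text = raw.strip()
    fenced = _FENCE_RE.match(text)
    if fenced:
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    # Non-object JSON (lists, numbers, strings) is treated like malformed JSON.
    reply = _as_text(data.get("reply")) if isinstance(data, dict) else None
    if not reply:
        # Graceful degradation: the raw response becomes the reply.
        logger.warning("attacker output was not valid structured JSON; using raw text")
        return raw, None, None, True
    return reply, _as_text(data.get("observation")), _as_text(data.get("strategy")), False
```

Returning a `parse_failed` flag instead of raising keeps the turn loop simple: the caller always gets a usable reply and can still record how often the attacker ignored the contract.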

Data flow

Only reply reaches the target (goes into H_target).
The full raw JSON stays in H_attacker so the attacker sees its own format on the next turn (consistent with the directive).
observation and strategy are emitted as OpenTelemetry span attributes:
- red_team.reasoning.observation
- red_team.reasoning.strategy
- red_team.reasoning.parse_failed

Dashboards can now answer "which technique works against which target?" — the paper's core analytical question.
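
A rough sketch of the routing described above, using only the standard library. The `TurnState` container and `record_turn` function are hypothetical names; the attribute keys are the ones listed in this PR. In a real run the attributes would be set on the active span (OpenTelemetry's `span.set_attribute(key, value)`), which this sketch replaces with a plain dict so it runs without extra dependencies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

AttrValue = Union[str, bool]


@dataclass
class TurnState:
    """Hypothetical container for the two histories and the per-turn attributes."""
    h_target: List[str] = field(default_factory=list)    # what the target sees
    h_attacker: List[str] = field(default_factory=list)  # what the attacker sees
    span_attributes: Dict[str, AttrValue] = field(default_factory=dict)


def record_turn(state: TurnState, raw: str, reply: str,
                observation: Optional[str], strategy: Optional[str],
                parse_failed: bool) -> None:
    state.h_target.append(reply)  # only the reply reaches the target
    state.h_attacker.append(raw)  # the full raw JSON stays with the attacker
    attrs: Dict[str, AttrValue] = {"red_team.reasoning.parse_failed": parse_failed}
    if observation is not None:
        attrs["red_team.reasoning.observation"] = observation
    if strategy is not None:
        attrs["red_team.reasoning.strategy"] = strategy
    # Stand-in for calling span.set_attribute(key, value) on each pair.
    state.span_attributes = attrs


state = TurnState()
# The observation and strategy strings below are made-up placeholder values.
raw = '{"observation": "target refused", "strategy": "example technique", "reply": "Hi"}'
record_turn(state, raw, "Hi", "target refused", "example technique", False)
```

Skipping `None` values avoids writing empty attributes on the fallback path, where only `parse_failed` is meaningful.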

Paper fidelity

Before this PR (after #306 + #340): ~85%. After this PR: ~95%. Remaining gap is benchmark alignment (default turn count, binary judge mode, JailbreakBench harness) tracked separately.

Test plan

Python: 163 red-team tests passing (+11 new: parser well-formed, code-fence stripping, malformed fallback, missing/empty reply fallback, non-object JSON fallback, non-string field coercion, whitespace trimming, OUTPUT FORMAT section presence in both prompts)
TypeScript: 115 red-team tests passing (+12 new — same coverage mirrored)
Existing tests (mock callAttackerLLM with plain strings) continue to pass — verified the graceful fallback path preserves previous behaviour
Manual: verify in a live run that real attacker LLMs emit the JSON format correctly (requires API keys)

… reply JSON

Meta's GOAT paper (ICML 2025) has the attacker emit a structured chain of thought at every turn: observation of the target's last response, strategy (which technique it will use and why), then the actual reply. We were letting the attacker emit free text, losing the reasoning signal, making per-turn technique selection invisible to telemetry, and leaving no gate against the attacker skipping the "Thought" step entirely.

This commit implements the paper's contract:

- New `JSON_OUTPUT_CONTRACT` module constant (Python `_red_team/base.py`, TypeScript `red-team-strategy.ts`) appended to every attacker system prompt. Instructs the attacker to emit {"observation": "...", "strategy": "...", "reply": "..."} and nothing else.
- Added to both `CrescendoStrategy` and `GoatStrategy` system prompts so both strategies benefit.
- `RedTeamAgent._parse_attacker_output` (Python) / `parseAttackerOutput` (TS) parses the attacker's raw output into `(reply, observation, strategy)`. Strips markdown fences, handles malformed JSON, coerces non-string fields. Fallback on parse failure: the whole raw response becomes the `reply` and a WARN-level log fires; the scenario keeps running.
- Only `reply` reaches the target. `observation` and `strategy` are emitted as OpenTelemetry span attributes (`red_team.reasoning.observation`, `red_team.reasoning.strategy`, `red_team.reasoning.parse_failed`) so dashboards can answer "which technique works against which target?", the paper's core selling point.
- The raw JSON output is kept in H_attacker so the attacker sees its own format on subsequent turns (keeps the output shape consistent with the system prompt's directive).

Paper fidelity: moves GOAT from ~85% to ~95% faithful.

Tests: 163 Python (+11) and 115 JS (+12) passing.

- Parser: well-formed JSON, code-fence stripping (```json and ```), malformed JSON fallback, missing/empty reply fallback, non-object JSON fallback, non-string field coercion, whitespace trimming.
- Contract: OUTPUT FORMAT section present in both GOAT and Crescendo system prompts.
- Existing tests (which mock attacker with plain strings) continue to pass via the graceful fallback path.

Closes langwatch/scenario#2142 (structured attacker output).
Closes #330 (GOAT technique telemetry) once consumers wire the span attributes to dashboards.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Crescendo does not emit structured output in Microsoft's Crescendo paper, so applying the JSON contract to both strategies was scope creep for a GOAT-focused PR stack. Roll it back on Crescendo behind a new per-strategy flag so the parser only runs when the attacker is actually instructed to emit JSON.

Changes:

- Add `emits_structured_output` (Python) / `emitsStructuredOutput` (TS) property on RedTeamStrategy. Default False/falsy. GoatStrategy overrides to True.
- Remove `JSON_OUTPUT_CONTRACT` import and interpolation from CrescendoStrategy in both Python and TS.
- Gate the parser in `call()`: run it only when the strategy's `emits_structured_output` flag is set. Otherwise use raw attacker output as the reply with no parsing and no telemetry spam.
- Update tests: GoatStrategy asserts OUTPUT FORMAT present + flag true; CrescendoStrategy asserts OUTPUT FORMAT absent + flag falsy.

Tests: 164 Python (+1) / 116 JS (+1) passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
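
The per-strategy gate from this commit might look roughly like the sketch below. The class bodies are stand-ins, the flag is modelled as a plain class attribute rather than a property for brevity, and `extract_reply` is a hypothetical name with a deliberately simplified inline parser; only the flag name and the gating behaviour come from the commit message.

```python
import json
from typing import Optional, Tuple


class RedTeamStrategy:
    """Minimal stand-in for the base strategy; the default is no structured output."""
    emits_structured_output: bool = False


class GoatStrategy(RedTeamStrategy):
    emits_structured_output = True


class CrescendoStrategy(RedTeamStrategy):
    pass  # inherits False: raw attacker text is used as-is


def extract_reply(strategy: RedTeamStrategy, raw: str) -> Tuple[str, Optional[bool]]:
    """Return (reply, parse_failed); parse_failed is None when no parsing ran."""
    if not strategy.emits_structured_output:
        return raw, None  # no parsing, so nothing to report to telemetry
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return raw, True
    if isinstance(data, dict) and isinstance(data.get("reply"), str) and data["reply"]:
        return data["reply"], False
    return raw, True
```

Distinguishing `None` (parser never ran) from `True` (parser ran and failed) is one way to keep Crescendo runs from reporting parse failures for output that was never meant to be JSON.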

This reverts commit e62c292. Reason: Claude squash-merged #306 without realizing it would auto-close #340 (whose base was #306's branch). Reverting so #306, #340, #341 can land in the proper stacked order.

…#345) This reverts commit e62c292. Reason: Claude squash-merged #306 without realizing it would auto-close #340 (whose base was #306's branch). Reverting so #306, #340, #341 can land in the proper stacked order.

…via interface

CI typecheck in vitest-examples failed on:

    expect(new CrescendoStrategy().emitsStructuredOutput).toBeUndefined();

because `emitsStructuredOutput` is declared on the RedTeamStrategy interface as optional but never added to the Crescendo class, so direct access on the concrete type is a TS2339 error under strict mode. Access via the interface type so the optional property is visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

github-actions · 2026-04-14T13:27:17Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure.

Scope: Adds a JSON_OUTPUT_CONTRACT to GOAT system prompts, implements parseAttackerOutput/_parse_attacker_output in JS and Python, updates GoatStrategy and RedTeamAgent to parse structured attacker output and emit red_team.reasoning.* telemetry attributes, and adds corresponding unit tests (JS + Python).
Exclusions confirmed: no changes to auth, security settings, database schema, business-critical logic, or external integrations.
Classification: low-risk-change under the documented policy.

The change is limited to red-team attacker prompts, parsing of attacker LLM output, and telemetry (adds a JSON output contract, parsers in TS/Python, and span attributes). It does not touch authentication/authorization, secrets, encryption, database schemas/migrations, business-critical financial logic, or external integrations, and includes graceful fallbacks and tests to preserve prior behavior on parse failures.

This classification allows merging without manual review once all required CI checks are passing and branch protection rules are satisfied.

This was referenced Apr 14, 2026

epic: scenarios red teaming langwatch/langwatch#1713

Open

red team: capture which GOAT technique the attacker chose per turn (telemetry) #330

Open

Aryansharma28 added the firefighting Urgent fix that bypasses approval check label Apr 14, 2026

Aryansharma28 mentioned this pull request Apr 14, 2026

revert: premature squash merge of #306 to restore stacked PR workflow #345

Merged

Aryansharma28 removed the firefighting Urgent fix that bypasses approval check label Apr 14, 2026

Aryansharma28 mentioned this pull request Apr 14, 2026

feat: add GOAT strategy with dynamic technique selection for RedTeamAgent #346

Open

github-actions bot added the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label Apr 14, 2026

Aryansharma28 added the firefighting Urgent fix that bypasses approval check label Apr 14, 2026

Aryansharma28 merged commit 93ba535 into feat/red-team-goat-paper-fidelity Apr 14, 2026
7 checks passed

Aryansharma28 deleted the feat/red-team-structured-output branch April 14, 2026 13:36

Aryansharma28 merged 3 commits into feat/red-team-goat-paper-fidelity from feat/red-team-structured-output
