Skip to content

feat(red-team): structured attacker output — observation/strategy/reply JSON#341

Merged
Aryansharma28 merged 3 commits intofeat/red-team-goat-paper-fidelityfrom
feat/red-team-structured-output
Apr 14, 2026
Merged

feat(red-team): structured attacker output — observation/strategy/reply JSON#341
Aryansharma28 merged 3 commits intofeat/red-team-goat-paper-fidelityfrom
feat/red-team-structured-output

Conversation

@Aryansharma28
Copy link
Copy Markdown
Contributor

Summary

Stacks on #340 (paper-fidelity refactor) and, transitively, #306 (GOAT strategy).

Implements the core innovation from Meta's GOAT paper (ICML 2025): the attacker emits a structured chain of thought at every turn — observation (what the target just revealed), strategy (which technique it will use and why), then the actual reply. We were previously letting the attacker emit free text, which:

Contract

New JSON_OUTPUT_CONTRACT module constant is appended to every attacker system prompt. Tells the attacker to emit exactly:

{"observation": "one sentence on target's last response",
 "strategy": "technique name + why (reference the catalogue)",
 "reply": "the actual message the target will see"}

Added to BOTH GoatStrategy and CrescendoStrategy — both strategies benefit from forced chain-of-thought and technique telemetry.

Parser

  • RedTeamAgent._parse_attacker_output (Python) / parseAttackerOutput (TS)
  • Strips ``` / ```json markdown fences (attacker sometimes wraps despite the contract)
  • Returns (reply, observation, strategy, parseFailed)
  • Graceful degradation: malformed JSON or missing reply → the raw response becomes the reply and a WARN-level log fires. The scenario keeps running. No attacker turn is wasted on a parse error.

Data flow

  • Only reply reaches the target (goes into H_target).
  • The full raw JSON stays in H_attacker so the attacker sees its own format on the next turn (consistent with the directive).
  • observation and strategy are emitted as OpenTelemetry span attributes:
    • red_team.reasoning.observation
    • red_team.reasoning.strategy
    • red_team.reasoning.parse_failed

Dashboards can now answer "which technique works against which target?" — the paper's core analytical question.

Paper fidelity

Before this PR (after #306 + #340): ~85%. After this PR: ~95%. Remaining gap is benchmark alignment (default turn count, binary judge mode, JailbreakBench harness) tracked separately.

Test plan

  • Python: 163 red-team tests passing (+11 new: parser well-formed, code-fence stripping, malformed fallback, missing/empty reply fallback, non-object JSON fallback, non-string field coercion, whitespace trimming, OUTPUT FORMAT section presence in both prompts)
  • TypeScript: 115 red-team tests passing (+12 new — same coverage mirrored)
  • Existing tests (mock callAttackerLLM with plain strings) continue to pass — verified the graceful fallback path preserves previous behaviour
  • Manual: verify in a live run that real attacker LLMs emit the JSON format correctly (requires API keys)

Related

Part of EPIC: langwatch/langwatch#1713 (Scenarios Red Teaming)

Closes:

Stacks on:

🤖 Generated with Claude Code

… reply JSON

Meta's GOAT paper (ICML 2025) has the attacker emit a structured chain of
thought at every turn: observation of the target's last response, strategy
(which technique it will use and why), then the actual reply. We were
letting the attacker emit free text — losing the reasoning signal, making
per-turn technique selection invisible to telemetry, and leaving no gate
against the attacker skipping the "Thought" step entirely.

This commit implements the paper's contract:

- New `JSON_OUTPUT_CONTRACT` module constant (Python `_red_team/base.py`,
  TypeScript `red-team-strategy.ts`) appended to every attacker system
  prompt. Instructs the attacker to emit
      {"observation": "...", "strategy": "...", "reply": "..."}
  and nothing else.

- Added to both `CrescendoStrategy` and `GoatStrategy` system prompts so
  both strategies benefit.

- `RedTeamAgent._parse_attacker_output` (Python) / `parseAttackerOutput`
  (TS) parses the attacker's raw output into `(reply, observation, strategy)`.
  Strips markdown fences, handles malformed JSON, coerces non-string
  fields. Fallback on parse failure: the whole raw response becomes the
  `reply` and a WARN-level log fires — the scenario keeps running.

- Only `reply` reaches the target. `observation` and `strategy` are emitted
  as OpenTelemetry span attributes (`red_team.reasoning.observation`,
  `red_team.reasoning.strategy`, `red_team.reasoning.parse_failed`) so
  dashboards can answer "which technique works against which target?" —
  the paper's core selling point.

- The raw JSON output is kept in H_attacker so the attacker sees its own
  format on subsequent turns (keeps the output shape consistent with the
  system prompt's directive).

Paper fidelity: moves GOAT from ~85% to ~95% faithful.

Tests: 163 Python (+11) and 115 JS (+12) passing.
  - Parser: well-formed JSON, code-fence stripping (```json and ```),
    malformed JSON fallback, missing/empty reply fallback, non-object JSON
    fallback, non-string field coercion, whitespace trimming.
  - Contract: OUTPUT FORMAT section present in both GOAT and Crescendo
    system prompts.
  - Existing tests (which mock attacker with plain strings) continue to
    pass via the graceful fallback path.

Closes langwatch/scenario#2142 (structured attacker output).
Closes #330 (GOAT technique telemetry) once consumers
wire the span attributes to dashboards.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Crescendo does not emit structured output in Microsoft's Crescendo paper —
applying the JSON contract to both strategies was scope creep for a
GOAT-focused PR stack. Roll it back on Crescendo behind a new per-strategy
flag so the parser only runs when the attacker is actually instructed to
emit JSON.

Changes:
  - Add `emits_structured_output` (Python) / `emitsStructuredOutput` (TS)
    property on RedTeamStrategy. Default False/falsy. GoatStrategy overrides
    to True.
  - Remove `JSON_OUTPUT_CONTRACT` import and interpolation from
    CrescendoStrategy in both Python and TS.
  - Gate the parser in `call()`: run it only when the strategy's
    `emits_structured_output` flag is set. Otherwise use raw attacker
    output as the reply with no parsing and no telemetry spam.
  - Update tests: GoatStrategy asserts OUTPUT FORMAT present + flag true;
    CrescendoStrategy asserts OUTPUT FORMAT absent + flag falsy.

Tests: 164 Python (+1) / 116 JS (+1) passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@Aryansharma28 Aryansharma28 added the firefighting Urgent fix that bypasses approval check label Apr 14, 2026
Aryansharma28 added a commit that referenced this pull request Apr 14, 2026
This reverts commit e62c292.

Reason: Claude squash-merged #306 without realizing it would auto-close #340
(whose base was #306's branch). Reverting so #306, #340, #341 can land in the
proper stacked order.
@Aryansharma28 Aryansharma28 removed the firefighting Urgent fix that bypasses approval check label Apr 14, 2026
Aryansharma28 added a commit that referenced this pull request Apr 14, 2026
…#345)

This reverts commit e62c292.

Reason: Claude squash-merged #306 without realizing it would auto-close #340
(whose base was #306's branch). Reverting so #306, #340, #341 can land in the
proper stacked order.
…via interface

CI typecheck in vitest-examples failed on:
  expect(new CrescendoStrategy().emitsStructuredOutput).toBeUndefined();
because `emitsStructuredOutput` is declared on the RedTeamStrategy interface
as optional but never added to the Crescendo class — so direct access on the
concrete type is a TS2339 error under strict mode.

Access via the interface type so the optional property is visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@github-actions github-actions bot added the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label Apr 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure.

  • Scope: Adds a JSON_OUTPUT_CONTRACT to GOAT system prompts, implements parseAttackerOutput/_parse_attacker_output in JS and Python, updates GoatStrategy and RedTeamAgent to parse structured attacker output and emit red_team.reasoning.* telemetry attributes, and adds corresponding unit tests (JS + Python).
  • Exclusions confirmed: no changes to auth, security settings, database schema, business-critical logic, or external integrations.
  • Classification: low-risk-change under the documented policy.

The change is limited to red-team attacker prompts, parsing of attacker LLM output, and telemetry (adds a JSON output contract, parsers in TS/Python, and span attributes). It does not touch authentication/authorization, secrets, encryption, database schemas/migrations, business-critical financial logic, or external integrations, and includes graceful fallbacks and tests to preserve prior behavior on parse failures.

This classification allows merging without manual review once all required CI checks are passing and branch protection rules are satisfied.

@Aryansharma28 Aryansharma28 added the firefighting Urgent fix that bypasses approval check label Apr 14, 2026
@Aryansharma28 Aryansharma28 merged commit 93ba535 into feat/red-team-goat-paper-fidelity Apr 14, 2026
7 checks passed
@Aryansharma28 Aryansharma28 deleted the feat/red-team-structured-output branch April 14, 2026 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

firefighting Urgent fix that bypasses approval check low-risk-change PR qualifies as low-risk per policy and can be merged without manual review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant