feat: structured attacker output (JSON with rationale + summary) by Aryansharma28 · Pull Request #292 · langwatch/scenario

Aryansharma28 · 2026-03-20T11:21:12Z

Summary

Implements #2142 — structured JSON output from the attacker LLM with question, rationale, and response_summary fields.

question: The actual message sent to the target (extracted from JSON)
rationale: Why this technique was chosen — stored in H_attacker for debugging/observability
response_summary: Compressed summary of the target's last response — enables context compression over long conversations
Retry-once on JSON parse failure with correction nudge, fallback to raw text
Prerequisite for #2143 (dynamic technique selection) and #2145 (scan-wide memory)

Research backing

PyRIT (Microsoft): Uses exact same 3-field schema (generated_question, rationale_behind_jailbreak, last_response_summary) with up to 10 retries
DeepTeam: Same 3-field Pydantic schema, validated at parse time
GOAT (Meta, 97% ASR): Uses labeled-section format (Observation/Thought/Strategy/Reply) for chain-of-attack reasoning
X-Teaming (98.1% ASR): XML-tagged structured output with separate planner/attacker/verifier agents

Test plan

Unit tests for JSON parsing (valid, malformed, missing keys)
Retry logic test (parse failure → correction → success)
Fallback test (double parse failure → raw text used)
Integration test: rationale + summary stored in H_attacker, only question sent to target
Context compression verification over multi-turn conversations
Existing tests still pass (backward compatibility)

🤖 Generated with Claude Code

Part of EPIC: #1713

Tracks #2142 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

github-actions · 2026-03-20T11:21:30Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The change modifies core attacker-agent behavior: it alters how LLM outputs are parsed (retry/fallback), stores rationale in H_attacker, and changes what is sent to external targets (only the question field). These touch integration/security-sensitive logic and application behavior rather than simple UI/docs/test config, so they do not meet the low-risk criteria and require a normal review.

This PR requires a manual review before merging.

feat: structured attacker output (JSON with rationale + summary)

173f7e8

Tracks #2142 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Aryansharma28 closed this Apr 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: structured attacker output (JSON with rationale + summary)#292

feat: structured attacker output (JSON with rationale + summary)#292
Aryansharma28 wants to merge 1 commit intomainfrom
feat/red-team-structured-output

Aryansharma28 commented Mar 20, 2026

Uh oh!

github-actions bot commented Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Aryansharma28 commented Mar 20, 2026

Summary

Research backing

Test plan

Uh oh!

github-actions bot commented Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant