Skip to content

feat: structured attacker output (JSON with rationale + summary)#292

Closed
Aryansharma28 wants to merge 1 commit intomainfrom
feat/red-team-structured-output
Closed

feat: structured attacker output (JSON with rationale + summary)#292
Aryansharma28 wants to merge 1 commit intomainfrom
feat/red-team-structured-output

Conversation

@Aryansharma28
Copy link
Copy Markdown
Contributor

Summary

Implements #2142 — structured JSON output from the attacker LLM with question, rationale, and response_summary fields.

  • question: The actual message sent to the target (extracted from JSON)
  • rationale: Why this technique was chosen — stored in H_attacker for debugging/observability
  • response_summary: Compressed summary of the target's last response — enables context compression over long conversations
  • Retry-once on JSON parse failure with correction nudge, fallback to raw text
  • Prerequisite for #2143 (dynamic technique selection) and #2145 (scan-wide memory)

Research backing

  • PyRIT (Microsoft): Uses exact same 3-field schema (generated_question, rationale_behind_jailbreak, last_response_summary) with up to 10 retries
  • DeepTeam: Same 3-field Pydantic schema, validated at parse time
  • GOAT (Meta, 97% ASR): Uses labeled-section format (Observation/Thought/Strategy/Reply) for chain-of-attack reasoning
  • X-Teaming (98.1% ASR): XML-tagged structured output with separate planner/attacker/verifier agents

Test plan

  • Unit tests for JSON parsing (valid, malformed, missing keys)
  • Retry logic test (parse failure → correction → success)
  • Fallback test (double parse failure → raw text used)
  • Integration test: rationale + summary stored in H_attacker, only question sent to target
  • Context compression verification over multi-turn conversations
  • Existing tests still pass (backward compatibility)

🤖 Generated with Claude Code

Part of EPIC: #1713

Tracks #2142

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@github-actions
Copy link
Copy Markdown
Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The change modifies core attacker-agent behavior: it alters how LLM outputs are parsed (retry/fallback), stores rationale in H_attacker, and changes what is sent to external targets (only the question field). These touch integration/security-sensitive logic and application behavior rather than simple UI/docs/test config, so they do not meet the low-risk criteria and require a normal review.

This PR requires a manual review before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant