Commit 91f76d1

Authored by Aryansharma28, claude, github-code-quality[bot], and rogeriochaves
fix: judge off-by-one, auto-run on script exhaustion, assertion criteria, marathon_script cleanup (#289)
* fix: run judge on early scenario failures for structured criteria feedback

  When a scenario failed before the judge step (via assertion, executor.fail(), or an API error), the result had empty criteria (0/0) because the judge never ran. The platform showed no structured feedback — just an error string.

  Now the executor always tries to run the judge before returning a failed result. A `_final_judge_invoked` flag prevents double-invocation when the judge already ran. If the judge itself fails (e.g. same network error), the result is returned unchanged — strictly better, never worse.

  Handles all failure paths:
  - assert / expect (check function throws)
  - executor.fail() (check function sets result directly)
  - API / network errors (outer exception handler)

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: auto-run judge when script exhausts without conclusion

  When RedTeamAgent backtracks heavily (strong defenses), the script can exhaust all steps before reaching the final judge(). The executor treated this as "no conclusion" and failed with a misleading message telling users to add scenario.judge() — when they already had one.

  Now when a script ends without conclusion and a JudgeAgent is registered, the executor auto-runs the judge to evaluate the trace. Exhausting a red team script without a breach is a defense success, not a harness error.

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: prevent double judge invocation on checkpoint failures

  When a checkpoint judge failed, finalJudgeInvoked was not set, causing runJudgeIfNeeded to re-run the judge without criteria — flipping the result from fail to pass. Now checkpoint failures also set the flag. Also adds defensive explicit flag set in the runJudgeIfNeeded helper.
  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* test: update script-exhaustion test for auto-judge behavior

  The test expected script exhaustion to always fail, but now when a JudgeAgent is registered the executor auto-runs it. Split into two tests: one verifying auto-judge fallback, one verifying the error when no judge exists.

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: only auto-run judge on script exhaustion, not on assertion/error failures

  Assertion failures and errors should fail immediately without calling the judge — they represent hard stops. The judge auto-run is only for script exhaustion (all steps consumed without conclusion), where the judge decides if the defense held. Removes _run_judge_if_needed helper since it's no longer needed.

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor: remove turns param from RedTeamAgent.marathon_script()

  total_turns on the RedTeamAgent instance is now the single source of truth for script length. marathon_script() reads self.total_turns instead of requiring a separate turns argument that had to match. Also removes the inflated max_turns=160 global config from bank-demo tests since the script drives duration, not max_turns.
  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor: default total_turns=30, enforce max_turns in scripts, remove standalone marathon_script

  - Change RedTeamAgent default total_turns from 50 to 30
  - Enforce max_turns during script execution (not just proceed loop), so max_turns is always respected regardless of execution mode
  - Remove standalone scenario.marathon_script() from public API — users should use red_team.marathon_script() for red team tests and proceed() for normal long-running tests
  - Inline the marathon_script fallback logic into RedTeamAgent

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: address review items — judge auto-run, TS parity, tests, docs

  - max_turns enforcement auto-runs judge when hit (not hard-fail)
  - Extract _run_judge_or_fail helper to DRY up judge auto-run logic
  - Optimize max_turns check to only fire on turn boundary changes
  - Add tests for max_turns enforcement with/without judge
  - TypeScript parity: remove turns from marathonScript, remove standalone marathonScript export, add maxTurns enforcement with judge auto-run in TS executor
  - Fix stale docstring referencing removed scenario.script.marathon_script
  - Update redteaming.md design doc for new marathon_script signature

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: update red-teaming docs for marathon_script API changes

  - Remove turns parameter from all marathon_script/marathonScript examples
  - Remove standalone scenario.marathon_script from exports section
  - Update default total_turns from 50 to 30 in parameter table
  - Update CI examples to use default turns (no turns= needed)
  - Update callout and recommendation text

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor: total_turns is a hard cap, remove backtrack padding and max_turns from scripts

  - total_turns generates exactly that many iterations — backtracked turns count toward the budget, no padding added
  - Remove max_turns enforcement from script execution — scripts define their own flow, total_turns is the only control for red team
  - Update docs to make it clear: total_turns is the single parameter that controls red team test duration
  - Both Python and TypeScript

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: update stale assertions, docs, and docstrings after padding removal

  - Fix 8 TS test assertions that still expected backtrack padding
  - Remove "pads extra iterations" from docs marathon_script section
  - Fix _run_judge_or_fail docstring referencing max_turns

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* chore: remove stale backtrack padding references from comments

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: escape apostrophe in docs MDX to pass ESLint

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: address PR review comments

  - Update script-exhaustion fallback to suggest adding JudgeAgent (rogeriochaves feedback)
  - Replace empty except with warnings.warn for judge failures (code-quality bot feedback)

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: judge is_last_message off-by-one in proceed() flows

  The judge never saw is_last_message=True because _step() incremented current_turn to max_turns then bailed before the judge ran. Use >= max_turns - 1 so the judge forces a verdict on the actual last turn. Also handle max_turns=None by defaulting to 10.
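The boundary fix described above can be seen in a toy loop. This is a sketch, not the executor's actual code; the helper name is illustrative:

```python
# Sketch of the is_last_message boundary: with the old ">= max_turns" check,
# the loop ends before the condition can ever be true, so the judge never
# sees the last-turn flag. The fixed ">= max_turns - 1" fires on the final turn.
def last_message_flags(max_turns, fixed=True):
    flags = []
    for current_turn in range(max_turns):
        threshold = max_turns - 1 if fixed else max_turns
        flags.append(current_turn >= threshold)
    return flags
```

With `fixed=False` no turn is ever flagged as last, which is exactly why the judge never forced a verdict.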
  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: apply same isLastMessage off-by-one fix to TypeScript judge

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: log judge errors in TS, tighten phase test, remove extra blank line

  - TS catch block now console.warn's the judge error (parity with Python)
  - test_phase_calculation asserts specific phases per turn instead of any
  - Remove stray extra blank line in scenario-execution.ts

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor: simplify _check_failure flow — raise directly instead of falling through

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* test: add comprehensive tests for all coverage gaps

  - TS judge isLastMessage: 5 parametrized boundary tests (parity with Python)
  - AssertionError → structured criteria: Python + TS tests for assertion catch, non-assertion passthrough, and checkpoint accumulation
  - Judge auto-run on script exhaustion: TS tests for auto-run + no double invocation
  - Judge throws during auto-run: Python test verifying warnings.warn fallback
  - _final_judge_invoked flag: Python + TS tests ensuring judge runs exactly once
  - marathon_script success_score=None path: structure verification test

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat: single-turn attack injection for RedTeamAgent (#291)

  * feat: add single-turn attack injection for RedTeamAgent

    Add random per-turn encoding augmentation to the red-teaming system. With configurable probability, the attacker's message is transformed using a deterministic encoding technique (Base64, ROT13, leetspeak, char-split, code-block) before being sent to the target. Default OFF (0.0) for backward compatibility.

    Closes langwatch/langwatch#2482

    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

  * fix: store original text in H_attacker, send encoded to target only

    The encoded (Base64/ROT13/etc.) message was being stored in the attacker's private history (H_attacker), which would corrupt the attacker LLM's reasoning on subsequent turns. Both DeepTeam and Promptfoo keep the attacker history encoding-free — encoding is a transport-level transform for the target only.

    Also adds:
    - Input validation for injection_probability (must be 0.0-1.0)
    - Tests verifying H_attacker stores original text when injection fires
    - Deduplicates makeInput helper in TS tests

    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

  * fix: resolve pyright type errors in attack technique tests

    Use MagicMock(spec=AgentInput) instead of bare type() for test inputs, use .get("content") for optional TypedDict keys, and change List[AttackTechnique] to Sequence[AttackTechnique] for covariance.

    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

  --------

  Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* Potential fix for pull request finding 'Unused import'

  Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* fix(js): use shift() instead of pop() in consumeUntilRole

  consumeUntilRole checks pendingRolesOnTurn[0] (front) but was removing from the back with pop(), draining the array incorrectly. Now uses shift() to match the Python implementation's pop(0). Cherry-picked from #279.

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* review: remove auto-judge-on-exhaustion and duplicated max_turns default

  Scripts must be explicit about judgment — no hidden auto-running of the judge when a script ends without conclusion. The finalJudgeInvoked flag and _run_judge_or_fail helper only existed to manage complexity that shouldn't be in the executor. Red team marathon_script already appends judge() at the end, so auto-judge was unnecessary.
  Also removes the duplicated `|| 10` / `or 10` fallback from the judge's is_last_message computation — max_turns is already defaulted in the config layer (ScenarioConfig in Python, ScenarioConfigFinal in TS).

* test: verify judge runs exactly once at end of marathon with backtracks

  Integration test that proves the full red team flow: 5 turns with scoring, 1 backtrack, and then asserts the judge was called exactly once at the end with the full conversation history visible.

* test(js): verify judge runs exactly once at end of marathon with backtracks

  TypeScript parity for the Python integration test. Runs a 5-turn marathon with mocked attacker LLM, scorer, 1 backtrack on hard refusal, and asserts the judge was called exactly once with full conversation history visible.

* fix: resolve type errors in judge is_last_message computation

  Python: max_turns is Optional[int], so `or 10` fallback is needed for pyright. TS: use the existing DEFAULT_MAX_TURNS constant instead of a magic number. Also fix static imports in TS red-team integration test.

* fix: add extra turn to hungry user test to reduce flakiness

  The agent often uses its first turn to ask a follow-up question, leaving only 1 turn to deliver a complete recipe. Adding a third exchange gives the agent room to ask, respond, and complete.

--------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Rogério Chaves <[email protected]>
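The single-turn attack injection described in the commit message might look roughly like this. This is a hedged sketch with hypothetical names (`send_attack`, `TECHNIQUES`), showing only two of the five encoding techniques:

```python
import base64
import codecs
import random

# Two of the deterministic encoding techniques; the real set also includes
# leetspeak, char-split, and code-block variants.
TECHNIQUES = {
    "base64": lambda text: base64.b64encode(text.encode()).decode(),
    "rot13": lambda text: codecs.encode(text, "rot13"),
}

def send_attack(message, attacker_history, injection_probability=0.0, rng=random):
    if not 0.0 <= injection_probability <= 1.0:
        raise ValueError("injection_probability must be between 0.0 and 1.0")
    # The attacker's private history (H_attacker) always stores the ORIGINAL
    # text, so the attacker LLM's reasoning on later turns is never corrupted.
    attacker_history.append(message)
    if rng.random() < injection_probability:
        technique = rng.choice(sorted(TECHNIQUES))
        return TECHNIQUES[technique](message)  # encoded copy for the target only
    return message  # default probability 0.0: behavior unchanged
```

Encoding is a transport-level transform: the target sees the encoded message, the attacker's history never does.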
1 parent 197f567 commit 91f76d1

24 files changed: 1764 additions & 371 deletions

docs/docs/pages/advanced/red-teaming.mdx: 19 additions & 33 deletions
```diff
@@ -53,7 +53,6 @@ async def test_system_prompt_not_leaked():
         ]),
     ],
     script=attacker.marathon_script(
-        turns=30,
         checks=[check_no_leak],
     ),
 )
@@ -103,7 +102,6 @@ describe("Bank agent security", () => {
       }),
     ],
     script: attacker.marathonScript({
-      turns: 30,
       checks: [checkNoLeak],
     }),
   });
```
```diff
@@ -116,10 +114,10 @@ describe("Bank agent security", () => {
 :::
 
 <Callout type="info">
-Use `attacker.marathon_script()` instead of `scenario.marathon_script()` for red team runs. The instance method pads extra iterations for backtracked turns and wires up early exit.
+`total_turns` is the **only** parameter that controls how long a red team test runs — it is a hard cap. `attacker.marathon_script()` reads it from the agent automatically. No need to set `max_turns` on `scenario.run()`. Backtracked turns count toward the budget, and early exit can end the test sooner if the objective is achieved.
 </Callout>
 
-We recommend **50 turns** for thorough adversarial coverage. Agents that hold at turn 1 often break by turn 20. 30 turns is the minimum for meaningful results — fewer turns miss vulnerabilities that only surface under sustained escalation pressure.
+We recommend **50 turns** (`total_turns=50`) for thorough nightly adversarial coverage. Agents that hold at turn 1 often break by turn 20. The default of 30 turns is a good balance for per-PR checks — fewer turns miss vulnerabilities that only surface under sustained escalation pressure.
 
 ---
 
```
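The hard-cap semantics in the updated callout can be sketched as follows; `run_budget` and its arguments are illustrative, not the agent's real API:

```python
# Sketch: total_turns is a budget for ITERATIONS, so a backtracked (refused)
# turn still consumes one unit of budget. There is no padding to compensate.
def run_budget(total_turns, refused_turns):
    productive = 0
    for turn in range(total_turns):  # exactly total_turns iterations, hard cap
        if turn in refused_turns:
            continue  # backtrack: the retry comes out of the same budget
        productive += 1
    return productive
```

A run with two refusals in a 5-turn budget yields only 3 productive attack turns, which is the trade the "hard cap" design makes for predictable test duration.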
```diff
@@ -223,7 +221,7 @@ const attacker = scenario.redTeamCrescendo({
 | Attack objective | `target` | `target` | *required* | What the attacker tries to achieve. |
 | Attacker model | `model` | `model` | global default | Generates attack messages every turn. |
 | Planner/scorer model | `metaprompt_model` | `metapromptModel` | same as `model` | Plans attack once, scores responses per turn. |
-| Total turns | `total_turns` | `totalTurns` | `50` | Number of attack turns. 50 recommended. |
+| Total turns | `total_turns` | `totalTurns` | `30` | Number of attack turns. This is the single control for test duration — `max_turns` is ignored for scripted red team tests. 50 recommended for nightly. |
 | Per-turn scoring | `score_responses` | `scoreResponses` | `True` | Score responses 0–10 and adapt. |
 | Refusal detection | `fast_refusal_detection` | `detectRefusals` | `True` | Pattern-match refusals, skip scorer. Triggers backtracking. |
 | Early exit score | `success_score` | `successScore` | `9` | Score threshold for early exit. `None`/`undefined` to disable. |
```
````diff
@@ -240,35 +238,26 @@ const attacker = scenario.redTeamCrescendo({
 
 ### `marathon_script()` / `marathonScript()`
 
-Generates a multi-turn script: `[user(), agent(), ...checks] × turns → [...finalChecks, judge()]`.
+Generates a multi-turn script using `total_turns` from the agent: `[user(), agent(), ...checks] × totalTurns → [...finalChecks, judge()]`.
 
-Use the instance method for red team runs — it pads extra iterations for backtracking and wires up early exit.
+`total_turns` is a hard cap — backtracked turns count toward the budget. Early exit can end the test sooner if the objective is achieved.
 
 :::code-group
 
 ```python [python]
-# Instance method (recommended)
 attacker = scenario.RedTeamAgent.crescendo(target="...", total_turns=30)
-script = attacker.marathon_script(turns=30, checks=[fn], final_checks=[fn])
-
-# Standalone (no early exit, no backtrack padding)
-script = scenario.marathon_script(turns=30, checks=[fn], final_checks=[fn])
+script = attacker.marathon_script(checks=[fn], final_checks=[fn])
 ```
 
 ```typescript [typescript]
-// Instance method (recommended)
 const attacker = scenario.redTeamCrescendo({ target: "...", totalTurns: 30 });
-const script = attacker.marathonScript({ turns: 30, checks: [fn], finalChecks: [fn] });
-
-// Standalone (no early exit, no backtrack padding)
-const script = scenario.marathonScript({ turns: 30, checks: [fn], finalChecks: [fn] });
+const script = attacker.marathonScript({ checks: [fn], finalChecks: [fn] });
 ```
 
 :::
 
 | Parameter | Python | TypeScript | Description |
 |-----------|--------|------------|-------------|
-| Turn count | `turns` | `turns` | Number of user/agent exchanges. Match `total_turns`/`totalTurns`. |
 | Per-turn checks | `checks` | `checks` | Called after every agent response. Raise/throw to fail. |
 | End-of-run checks | `final_checks` | `finalChecks` | Called once after all turns, before the judge. |
 
````
```diff
@@ -345,7 +334,6 @@ result = await scenario.run(
         ]),
     ],
     script=attacker.marathon_script(
-        turns=30,
         checks=[check_no_restricted_tools, check_no_pii_leaked],
     ),
 )
@@ -389,7 +377,6 @@ const result = await scenario.run({
       }),
     ],
     script: attacker.marathonScript({
-      turns: 30,
       checks: [checkNoRestrictedTools, checkNoPiiLeaked],
     }),
   });
```
````diff
@@ -567,7 +554,7 @@ Write `target` from the attacker's perspective — what does success look like?
 
 ## CI integration
 
-Run red team tests alongside your functional test suite. We recommend 50 turns for nightly runs and 30 turns minimum for per-PR checks.
+Run red team tests alongside your functional test suite. We recommend 50 turns for nightly runs and 30 turns (the default) for per-PR checks.
 
 ```python
 # pyproject.toml
@@ -580,21 +567,21 @@ markers = [
 :::code-group
 
 ```python [python]
-# Per-PR: scoring off for speed
+# Per-PR: scoring off for speed (default 30 turns)
 @pytest.mark.red_team
 async def test_prompt_leak_fast():
     attacker = scenario.RedTeamAgent.crescendo(
-        target="...", total_turns=30,
+        target="...",
         score_responses=False, fast_refusal_detection=False,
     )
     result = await scenario.run(
         ...,
         agents=[MyAgent(), attacker, scenario.JudgeAgent(criteria=[...])],
-        script=attacker.marathon_script(turns=30),
+        script=attacker.marathon_script(),
     )
     assert result.success
 
-# Nightly: full adaptive scoring, 50 turns recommended
+# Nightly: full adaptive scoring, 50 turns
 @pytest.mark.red_team
 async def test_prompt_leak_full():
     attacker = scenario.RedTeamAgent.crescendo(
````
````diff
@@ -604,27 +591,27 @@ async def test_prompt_leak_full():
     result = await scenario.run(
         ...,
         agents=[MyAgent(), attacker, scenario.JudgeAgent(criteria=[...])],
-        script=attacker.marathon_script(turns=50),
+        script=attacker.marathon_script(),
     )
     assert result.success
 ```
 
 ```typescript [typescript]
-// Per-PR: scoring off for speed
+// Per-PR: scoring off for speed (default 30 turns)
 it("prompt leak (fast)", async () => {
   const attacker = scenario.redTeamCrescendo({
-    target: "...", totalTurns: 30,
+    target: "...",
     scoreResponses: false, detectRefusals: false,
   });
   const result = await scenario.run({
     ...,
     agents: [myAgent, attacker, scenario.judgeAgent({ criteria: [...] })],
-    script: attacker.marathonScript({ turns: 30 }),
+    script: attacker.marathonScript(),
   });
   expect(result.success).toBe(true);
 }, 180_000);
 
-// Nightly: full adaptive scoring, 50 turns recommended
+// Nightly: full adaptive scoring, 50 turns
 it("prompt leak (full)", async () => {
   const attacker = scenario.redTeamCrescendo({
     target: "...", totalTurns: 50,
@@ -633,7 +620,7 @@ it("prompt leak (full)", async () => {
   const result = await scenario.run({
     ...,
     agents: [myAgent, attacker, scenario.judgeAgent({ criteria: [...] })],
-    script: attacker.marathonScript({ turns: 50 }),
+    script: attacker.marathonScript(),
   });
   expect(result.success).toBe(true);
 }, 300_000);
````
````diff
@@ -651,7 +638,6 @@ it("prompt leak (full)", async () => {
 from scenario import RedTeamAgent # main class
 from scenario import RedTeamStrategy # abstract base for custom strategies
 from scenario import CrescendoStrategy # built-in strategy
-from scenario import marathon_script # standalone script helper
 ```
 
 ### TypeScript
@@ -682,7 +668,7 @@ import scenario, {
 
 ## Next steps
 
-- [Scripted Simulations](/basics/scripted-simulations) — how scripts work under `marathon_script()`
+- [Scripted Simulations](/basics/scripted-simulations) — how scripts and script steps work
 - [Judge Agent](/basics/judge-agent) — configure pass/fail criteria
 - [Custom Judge](/advanced/custom-judge) — domain-specific security judge
 - [CI/CD Integration](/basics/ci-cd-integration) — run red team tests in your pipeline
````
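As a footnote to the `fix(js): use shift() instead of pop() in consumeUntilRole` change in the commit message, the front-vs-back queue mismatch is easy to reproduce in miniature. A Python sketch follows (the real fix is in the TypeScript executor's role queue; names here are illustrative):

```python
# Sketch: the loop inspects the FRONT of the queue (index 0), so removal must
# also happen at the front — Python's pop(0), i.e. JS shift(). Removing from
# the back (pop() / JS pop()) drains the wrong end and consumes pending roles
# that should have been preserved.
def consume_until_role(pending_roles, role, buggy=False):
    consumed = []
    while pending_roles and pending_roles[0] != role:
        index = -1 if buggy else 0  # buggy variant pops from the back
        consumed.append(pending_roles.pop(index))
    return consumed
```

With the buggy back-removal, a queue like `["user", "agent", "judge"]` is drained entirely, including the `"judge"` role the loop was supposed to stop at.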
