Commit 91f76d1
fix: judge off-by-one, auto-run on script exhaustion, assertion criteria, marathon_script cleanup (#289)
* fix: run judge on early scenario failures for structured criteria feedback
When a scenario failed before the judge step (via assertion, executor.fail(),
or an API error), the result had empty criteria (0/0) because the judge never
ran. The platform showed no structured feedback — just an error string.
Now the executor always tries to run the judge before returning a failed
result. A `_final_judge_invoked` flag prevents double-invocation when the
judge already ran. If the judge itself fails (e.g. same network error),
the result is returned unchanged — strictly better, never worse.
Handles all failure paths:
- assert / expect (check function throws)
- executor.fail() (check function sets result directly)
- API / network errors (outer exception handler)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* fix: auto-run judge when script exhausts without conclusion
When RedTeamAgent backtracks heavily (strong defenses), the script can
exhaust all steps before reaching the final judge(). The executor treated
this as "no conclusion" and failed with a misleading message telling users
to add scenario.judge() — when they already had one.
Now when a script ends without conclusion and a JudgeAgent is registered,
the executor auto-runs the judge to evaluate the trace. Exhausting a red
team script without a breach is a defense success, not a harness error.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* fix: prevent double judge invocation on checkpoint failures
When a checkpoint judge failed, finalJudgeInvoked was not set, causing
runJudgeIfNeeded to re-run the judge without criteria — flipping the
result from fail to pass. Now checkpoint failures also set the flag.
Also adds defensive explicit flag set in the runJudgeIfNeeded helper.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* test: update script-exhaustion test for auto-judge behavior
The test expected script exhaustion to always fail, but now when a
JudgeAgent is registered the executor auto-runs it. Split into two
tests: one verifying auto-judge fallback, one verifying the error
when no judge exists.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* fix: only auto-run judge on script exhaustion, not on assertion/error failures
Assertion failures and errors should fail immediately without calling
the judge — they represent hard stops. The judge auto-run is only for
script exhaustion (all steps consumed without conclusion), where the
judge decides if the defense held.
Removes _run_judge_if_needed helper since it's no longer needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* refactor: remove turns param from RedTeamAgent.marathon_script()
total_turns on the RedTeamAgent instance is now the single source of
truth for script length. marathon_script() reads self.total_turns
instead of requiring a separate turns argument that had to match.
Also removes the inflated max_turns=160 global config from bank-demo
tests since the script drives duration, not max_turns.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* refactor: default total_turns=30, enforce max_turns in scripts, remove standalone marathon_script
- Change RedTeamAgent default total_turns from 50 to 30
- Enforce max_turns during script execution (not just proceed loop),
so max_turns is always respected regardless of execution mode
- Remove standalone scenario.marathon_script() from public API — users
should use red_team.marathon_script() for red team tests and
proceed() for normal long-running tests
- Inline the marathon_script fallback logic into RedTeamAgent
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* fix: address review items — judge auto-run, TS parity, tests, docs
- max_turns enforcement auto-runs judge when hit (not hard-fail)
- Extract _run_judge_or_fail helper to DRY up judge auto-run logic
- Optimize max_turns check to only fire on turn boundary changes
- Add tests for max_turns enforcement with/without judge
- TypeScript parity: remove turns from marathonScript, remove
standalone marathonScript export, add maxTurns enforcement with
judge auto-run in TS executor
- Fix stale docstring referencing removed scenario.script.marathon_script
- Update redteaming.md design doc for new marathon_script signature
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* docs: update red-teaming docs for marathon_script API changes
- Remove turns parameter from all marathon_script/marathonScript examples
- Remove standalone scenario.marathon_script from exports section
- Update default total_turns from 50 to 30 in parameter table
- Update CI examples to use default turns (no turns= needed)
- Update callout and recommendation text
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* refactor: total_turns is a hard cap, remove backtrack padding and max_turns from scripts
- total_turns generates exactly that many iterations — backtracked
turns count toward the budget, no padding added
- Remove max_turns enforcement from script execution — scripts define
their own flow, total_turns is the only control for red team
- Update docs to make it clear: total_turns is the single parameter
that controls red team test duration
- Both Python and TypeScript
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
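The hard-cap behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `run_marathon` and the `refused` list are hypothetical names used only to show that a backtracked turn spends budget instead of earning a replacement iteration.

```python
from typing import Dict, List


def run_marathon(total_turns: int, refused: List[bool]) -> Dict[str, int]:
    """Simulate a red team script where total_turns is a hard cap.

    refused[i] marks iteration i as a hard refusal that triggers a backtrack
    (hypothetical input, standing in for the scorer's decision).
    """
    iterations = kept = backtracks = 0
    for i in range(total_turns):  # exactly total_turns iterations, no padding
        iterations += 1
        if i < len(refused) and refused[i]:
            backtracks += 1  # the turn is discarded but still spends budget
        else:
            kept += 1
    return {"iterations": iterations, "kept": kept, "backtracks": backtracks}


result = run_marathon(5, [False, True, False, False, False])
print(result)  # {'iterations': 5, 'kept': 4, 'backtracks': 1}
```

With one backtrack in a 5-turn budget, four turns remain in the conversation and the script still ends after five iterations.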
* fix: update stale assertions, docs, and docstrings after padding removal
- Fix 8 TS test assertions that still expected backtrack padding
- Remove "pads extra iterations" from docs marathon_script section
- Fix _run_judge_or_fail docstring referencing max_turns
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* chore: remove stale backtrack padding references from comments
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* fix: escape apostrophe in docs MDX to pass ESLint
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* fix: address PR review comments
- Update script-exhaustion fallback to suggest adding JudgeAgent
(rogeriochaves feedback)
- Replace empty except with warnings.warn for judge failures
(code-quality bot feedback)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
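A minimal sketch of the "replace empty except with warnings.warn" item, assuming the judge is just a callable. `run_judge_safely` is a hypothetical helper name, and the warning text and category are illustrative rather than taken from the codebase.

```python
import warnings
from typing import Callable, Optional


def run_judge_safely(judge: Callable[[], str]) -> Optional[str]:
    """Run the judge; surface a failure as a warning instead of swallowing it."""
    try:
        return judge()
    except Exception as exc:
        # Previously an empty except would hide this error entirely.
        warnings.warn(f"Judge failed, keeping original result: {exc}", RuntimeWarning)
        return None


def broken_judge() -> str:
    raise ConnectionError("network down")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    verdict = run_judge_safely(broken_judge)

print(verdict)      # None
print(len(caught))  # 1
```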
* fix: judge is_last_message off-by-one in proceed() flows
The judge never saw is_last_message=True because _step() incremented
current_turn to max_turns then bailed before the judge ran. Use
>= max_turns - 1 so the judge forces a verdict on the actual last turn.
Also handle max_turns=None by defaulting to 10.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
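The off-by-one can be illustrated with a small sketch. It assumes a zero-based `current_turn` counter and that the pre-fix comparison was `>= max_turns`; both function names are hypothetical and only the `>= max_turns - 1` comparison and the fallback of 10 come from the message above.

```python
from typing import Optional


def is_last_message_old(current_turn: int, max_turns: int) -> bool:
    # Assumed pre-fix check: only true once the counter has reached max_turns,
    # by which point the executor has already stopped and the judge never runs.
    return current_turn >= max_turns


def is_last_message_new(current_turn: int, max_turns: Optional[int]) -> bool:
    # Post-fix check: true on the final turn the judge actually observes.
    # max_turns=None falls back to 10, as described in the commit message.
    return current_turn >= (max_turns or 10) - 1


max_turns = 3
seen_turns = range(max_turns)  # turns the judge gets to see: 0, 1, 2
old_flags = [is_last_message_old(t, max_turns) for t in seen_turns]
new_flags = [is_last_message_new(t, max_turns) for t in seen_turns]
print(old_flags)  # [False, False, False]
print(new_flags)  # [False, False, True]
```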
* fix: apply same isLastMessage off-by-one fix to TypeScript judge
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* fix: log judge errors in TS, tighten phase test, remove extra blank line
- TS catch block now console.warn's the judge error (parity with Python)
- test_phase_calculation asserts specific phases per turn instead of any
- Remove stray extra blank line in scenario-execution.ts
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* refactor: simplify _check_failure flow — raise directly instead of falling through
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* test: add comprehensive tests for all coverage gaps
- TS judge isLastMessage: 5 parametrized boundary tests (parity with Python)
- AssertionError → structured criteria: Python + TS tests for assertion
catch, non-assertion passthrough, and checkpoint accumulation
- Judge auto-run on script exhaustion: TS tests for auto-run + no double invocation
- Judge throws during auto-run: Python test verifying warnings.warn fallback
- _final_judge_invoked flag: Python + TS tests ensuring judge runs exactly once
- marathon_script success_score=None path: structure verification test
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* feat: single-turn attack injection for RedTeamAgent (#291)
* feat: add single-turn attack injection for RedTeamAgent
Add random per-turn encoding augmentation to the red-teaming system.
With configurable probability, the attacker's message is transformed
using a deterministic encoding technique (Base64, ROT13, leetspeak,
char-split, code-block) before being sent to the target. Default OFF
(0.0) for backward compatibility.
Closes langwatch/langwatch#2482
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
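A rough sketch of what per-turn encoding augmentation could look like. The five technique names come from the message above, but the exact output format of each technique, the `TECHNIQUES` table and the `maybe_inject` helper are assumptions made for illustration, not the library's API.

```python
import base64
import codecs
import random
from typing import Callable, Dict

FENCE = "`" * 3  # built programmatically to keep this example easy to embed
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})

# Each technique is a deterministic text-to-text transform.
TECHNIQUES: Dict[str, Callable[[str], str]] = {
    "base64": lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii"),
    "rot13": lambda s: codecs.encode(s, "rot13"),
    "leetspeak": lambda s: s.translate(LEET),
    "char_split": lambda s: " ".join(s),
    "code_block": lambda s: f"{FENCE}\n{s}\n{FENCE}",
}


def maybe_inject(message: str, probability: float, rng: random.Random) -> str:
    """Encode the message with a random technique with the given probability."""
    if rng.random() < probability:
        name = rng.choice(sorted(TECHNIQUES))
        return TECHNIQUES[name](message)
    return message  # probability 0.0 (the default) never transforms


print(TECHNIQUES["rot13"]("hello"))                # uryyb
print(TECHNIQUES["base64"]("hello"))               # aGVsbG8=
print(maybe_inject("hi", 0.0, random.Random(0)))   # hi
```

Only the random choice of technique is non-deterministic; each transform itself always produces the same output for the same input.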
* fix: store original text in H_attacker, send encoded to target only
The encoded (Base64/ROT13/etc.) message was being stored in the
attacker's private history (H_attacker), which would corrupt the
attacker LLM's reasoning on subsequent turns. Both DeepTeam and
Promptfoo keep the attacker history encoding-free — encoding is a
transport-level transform for the target only.
Also adds:
- Input validation for injection_probability (must be 0.0-1.0)
- Tests verifying H_attacker stores original text when injection fires
- Deduplicates makeInput helper in TS tests
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
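The split between attacker history and target transport might be sketched like this. `send_attack` and `validate_injection_probability` are hypothetical helpers, and a plain list stands in for H_attacker; only the 0.0-1.0 range and the "original in history, encoded to target" rule come from the message above.

```python
from typing import Callable, List


def validate_injection_probability(p: float) -> float:
    """Reject probabilities outside the closed interval [0.0, 1.0]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"injection_probability must be between 0.0 and 1.0, got {p}")
    return p


def send_attack(
    original: str,
    encode: Callable[[str], str],
    attacker_history: List[str],
) -> str:
    """Record the plain text for the attacker, return the encoded text for the target."""
    attacker_history.append(original)  # attacker history stays encoding-free
    return encode(original)            # encoding is a transport-level transform


history: List[str] = []
to_target = send_attack("hello", lambda s: s[::-1], history)
print(history)    # ['hello']
print(to_target)  # olleh
```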
* fix: resolve pyright type errors in attack technique tests
Use MagicMock(spec=AgentInput) instead of bare type() for test inputs,
use .get("content") for optional TypedDict keys, and change
List[AttackTechnique] to Sequence[AttackTechnique] for covariance.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
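The `List` to `Sequence` change matters because `List` is invariant while `Sequence` is covariant in its element type. A small sketch, with placeholder class bodies that are not the real `AttackTechnique` definition:

```python
from typing import List, Sequence


class AttackTechnique:
    """Placeholder base class; the real one lives in the red team module."""

    name = "base"


class Rot13Technique(AttackTechnique):
    name = "rot13"


def technique_names(techniques: Sequence[AttackTechnique]) -> List[str]:
    # Sequence[AttackTechnique] accepts List[Rot13Technique] because Sequence
    # is read-only and therefore covariant. With List[AttackTechnique] here,
    # a type checker such as pyright would reject the call below.
    return [t.name for t in techniques]


rot_only: List[Rot13Technique] = [Rot13Technique()]
print(technique_names(rot_only))  # ['rot13']
```

The runtime behaviour is identical either way; the change only affects what the type checker accepts.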
---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
* Potential fix for pull request finding 'Unused import'
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
* fix(js): use shift() instead of pop() in consumeUntilRole
consumeUntilRole checks pendingRolesOnTurn[0] (front) but was removing
from the back with pop(), draining the array incorrectly. Now uses
shift() to match the Python implementation's pop(0).
Cherry-picked from #279.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
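The effect of removing from the wrong end can be reproduced in a few lines of Python. This is a simplified model of the loop, not the actual `consumeUntilRole` code; the `from_front` switch is a hypothetical parameter added only to compare the two behaviours.

```python
from typing import List


def consume_until_role(pending: List[str], role: str, from_front: bool) -> List[str]:
    """Drop queued roles until `role` is at the front; return what was dropped."""
    dropped: List[str] = []
    while pending and pending[0] != role:
        # The loop condition inspects the front, so the removal must match it.
        dropped.append(pending.pop(0) if from_front else pending.pop())
    return dropped


fixed = ["user", "agent", "judge"]
consume_until_role(fixed, "agent", from_front=True)
print(fixed)  # ['agent', 'judge']

buggy = ["user", "agent", "judge"]
consume_until_role(buggy, "agent", from_front=False)
print(buggy)  # []
```

Removing from the back never changes the front element, so the loop keeps going until the whole list, including the role being searched for, is gone.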
* review: remove auto-judge-on-exhaustion and duplicated max_turns default
Scripts must be explicit about judgment — no hidden auto-running of the
judge when a script ends without conclusion. The finalJudgeInvoked flag
and _run_judge_or_fail helper only existed to manage complexity that
shouldn't be in the executor. Red team marathon_script already appends
judge() at the end, so auto-judge was unnecessary.
Also removes the duplicated `|| 10` / `or 10` fallback from the judge's
is_last_message computation — max_turns is already defaulted in the
config layer (ScenarioConfig in Python, ScenarioConfigFinal in TS).
* test: verify judge runs exactly once at end of marathon with backtracks
Integration test that proves the full red team flow: 5 turns with
scoring, 1 backtrack, and then asserts the judge was called exactly
once at the end with the full conversation history visible.
* test(js): verify judge runs exactly once at end of marathon with backtracks
TypeScript parity for the Python integration test. Runs a 5-turn
marathon with mocked attacker LLM, scorer, 1 backtrack on hard refusal,
and asserts the judge was called exactly once with full conversation
history visible.
* fix: resolve type errors in judge is_last_message computation
Python: max_turns is Optional[int], so `or 10` fallback is needed for
pyright. TS: use the existing DEFAULT_MAX_TURNS constant instead of a
magic number. Also fix static imports in TS red-team integration test.
* fix: add extra turn to hungry user test to reduce flakiness
The agent often uses its first turn to ask a follow-up question,
leaving only 1 turn to deliver a complete recipe. Adding a third
exchange gives the agent room to ask, respond, and complete.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Rogério Chaves <[email protected]>
1 parent 197f567 commit 91f76d1
24 files changed
Lines changed: 1764 additions & 371 deletions
File tree
- docs/docs/pages/advanced
- javascript/src
- agents
- __tests__
- judge
- __tests__
- red-team
- __tests__
- execution
- __tests__
- script
- python
- examples
- scenario
- _red_team
- tests