feat: add template red team dashboard CLI #16
claytonlin1110 wants to merge 1 commit into AffineFoundation:main
Conversation
@angosr please check and give me feedback for this
angosr left a comment
Review: PR #16 — feat: add template red team dashboard CLI
Significance Gate: CONDITIONAL PASS
A quick automated probe for template GT correctness is useful as a first pass. However, this tool does NOT replace the CLAUDE.md Red Team Review (6 mandatory checks require human judgment: world knowledge attack, memorization space analysis, etc.). The tool should be positioned as a complement, not a substitute.
BLOCKING: 5 generated report files committed to the repository
redteam/20260326_123231/report.md
redteam/20260326_123324/report.md
redteam/20260326_123403/report.md
redteam/20260326_123651/report.md
redteam/20260326_123828/report.md
Generated output must not be committed to source control. Add redteam/ to .gitignore and remove these files from the PR.
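The ignore rule itself can be a single entry (assuming reports are always written under `redteam/` at the repository root):

```gitignore
# Generated red-team probe reports; never commit these
redteam/
```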
BLOCKING: Unrelated changes bundled
- liveweb_arena/__init__.py: lazy-loading of BrowserEngine/BrowserSession, identical to the change in PR #12 (rejected). This is an infrastructure change that should be its own PR with proper justification.
- liveweb_arena/core/cache.py: Windows fcntl portability fix, unrelated to the red team CLI. Should be a separate fix PR.
These bundled changes make review harder and risk sneaking unrelated modifications through a tooling PR.
CONCERN: _infer_probe_urls is fragile and hard to maintain
The URL inference uses per-plugin heuristics:

```python
if plugin_name == "openlibrary":
    a = vi.get("book_a_query")
    ...
if plugin_name == "arxiv":
    category = vi.get("category")
    ...
if plugin_name == "stooq":
    symbol = vi.get("symbol") or vi.get("symbol_a")
    ...
```

This creates a parallel code path that must be manually kept in sync with template validation_info keys. When a new template changes key names or adds new URL patterns, this function silently breaks. The probe then gives false negatives (GT fails because the URL wasn't inferred) that look like template bugs.
Better approach: let each template declare its probe URLs via a method like get_probe_urls(validation_info) -> List[str], keeping the knowledge co-located with the template code.
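A minimal sketch of that interface (the class names, the Stooq URL shape, and the validation_info keys are assumptions mirroring the heuristic quoted above, not the project's actual code):

```python
from typing import Dict, List


class TaskTemplate:
    """Hypothetical base class; the real template interface may differ."""

    def get_probe_urls(self, validation_info: Dict) -> List[str]:
        """Each template declares the minimal URLs its probe should fetch."""
        raise NotImplementedError


class StooqQuoteTemplate(TaskTemplate):
    """Illustrative template using the same keys as the stooq heuristic."""

    def get_probe_urls(self, validation_info: Dict) -> List[str]:
        symbol = validation_info.get("symbol") or validation_info.get("symbol_a")
        # Returning [] lets the probe report "no URL declared" explicitly
        # instead of silently producing a false-negative GT failure.
        return [f"https://stooq.com/q/?s={symbol}"] if symbol else []
```

With this shape, renaming a validation_info key and updating the probe URL happen in the same file, so the probe cannot drift out of sync with the template.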
CONCERN: Probe bypasses page cache pipeline
The real GT collection uses GTCollector.on_page_visit() which is triggered by the browser's page cache. This probe calls plugin.fetch_api_data(url) directly and feeds it into on_page_visit. But:
- The probe's api_data timestamp differs from what a real browser visit would produce
- The probe doesn't respect GT data priority rules (CLAUDE.md §6: detail page > list page)
- The probe visits URLs in a fixed order, while real agents visit pages in unpredictable order
A probe that passes doesn't guarantee the real eval pipeline works. This should be documented clearly.
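To illustrate why visit order matters, here is a toy comparison of a last-write-wins collector against one that honors a "detail page > list page" rule (all names and priority values are assumptions for illustration, not the real GTCollector):

```python
# Hypothetical page-kind priorities: detail pages outrank list pages.
PRIORITY = {"list": 1, "detail": 2}


class NaiveCollector:
    """Keeps whatever the most recent visit produced (last write wins)."""

    def __init__(self):
        self.gt = {}

    def on_page_visit(self, page_kind: str, data: dict) -> None:
        self.gt.update(data)


class PriorityCollector:
    """Keeps the value from the highest-priority page kind seen so far."""

    def __init__(self):
        self.gt = {}      # key -> value
        self.source = {}  # key -> page_kind that supplied the value

    def on_page_visit(self, page_kind: str, data: dict) -> None:
        for key, value in data.items():
            seen = PRIORITY.get(self.source.get(key, ""), 0)
            if PRIORITY[page_kind] >= seen:
                self.gt[key] = value
                self.source[key] = page_kind
```

If the probe always visits the detail page last, both collectors agree and a priority bug stays hidden; a real agent that visits a stale list page afterwards would expose it.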
Required Actions
- Remove the 5 committed report files; add redteam/ to .gitignore
- Split out the __init__.py lazy-loading and the cache.py fcntl fix into separate PRs
- Document clearly that this tool is a supplement to (not a replacement for) CLAUDE.md Red Team Review and eval.py testing
Summary
This PR adds a Template Red Team Dashboard as a lightweight CLI (python -m liveweb_arena.redteam) that probes templates without running the browser agent. It executes a deterministic “API semantic probe” by calling each plugin’s fetch_api_data() for a minimal set of inferred URLs, feeding those snapshots through the real GTCollector + template get_ground_truth(). It then emits actionable template-quality metrics and artifacts (report.json, report.md) to support red-team review, anti-memorization checks, and quick regressions in CI.
Motivation
Template quality issues are easy to ship unintentionally:
What’s included
Entry point: liveweb_arena/redteam/main.py
Key capabilities:
Artifacts:
Core logic: liveweb_arena/redteam/probe.py
How it works:
Metrics: liveweb_arena/redteam/metrics.py
Computed per template:
Flags (in CLI):
--fail-on-violation (exit code 2 if any violation)
--min-gt-success 0..1
--max-collapse 0..1
--max-baseline 0..1
--min-stability 0..1 (requires --repeat >= 2)
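A sketch of how these flags might map metrics to the documented exit code 2 (the metric names and function shapes are assumptions; only the flag semantics come from the list above):

```python
from typing import Dict, List, Optional


def check_thresholds(
    metrics: Dict[str, float],
    min_gt_success: Optional[float] = None,
    max_collapse: Optional[float] = None,
    max_baseline: Optional[float] = None,
    min_stability: Optional[float] = None,
) -> List[str]:
    """Return a human-readable message for each threshold breach."""
    violations = []
    if min_gt_success is not None and metrics["gt_success"] < min_gt_success:
        violations.append(f"gt_success {metrics['gt_success']:.2f} < {min_gt_success}")
    if max_collapse is not None and metrics["collapse"] > max_collapse:
        violations.append(f"collapse {metrics['collapse']:.2f} > {max_collapse}")
    if max_baseline is not None and metrics["baseline"] > max_baseline:
        violations.append(f"baseline {metrics['baseline']:.2f} > {max_baseline}")
    if min_stability is not None and metrics["stability"] < min_stability:
        violations.append(f"stability {metrics['stability']:.2f} < {min_stability}")
    return violations


def exit_code(violations: List[str], fail_on_violation: bool) -> int:
    # --fail-on-violation: exit 2 if any threshold was breached, else 0.
    return 2 if (fail_on_violation and violations) else 0
```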
Two issues prevented running python -m liveweb_arena.redteam on Windows: