feat: add template red team dashboard CLI#16

Open
claytonlin1110 wants to merge 1 commit into AffineFoundation:main from claytonlin1110:feat/redteam-dashboard

Conversation

@claytonlin1110
Contributor

Summary

This PR adds a Template Red Team Dashboard as a lightweight CLI (python -m liveweb_arena.redteam) that probes templates without running the browser agent. It executes a deterministic “API semantic probe” by calling each plugin’s fetch_api_data() for a minimal set of inferred URLs, feeding those snapshots through the real GTCollector + template get_ground_truth(). It then emits actionable template-quality metrics and artifacts (report.json, report.md) to support red-team review, anti-memorization checks, and quick regressions in CI.

Motivation

Template quality issues are easy to ship unintentionally:

  • Memorizable templates (collapsed parameter space or small answer space)
  • Semantic drift (the question’s meaning doesn’t match what the API actually returns)
  • Instability (GT changes across close repeats due to volatile sources)
  • Solvability/GT binding issues (GT depends on data that isn’t collected via the intended navigation path)

What’s included

  1. New liveweb_arena.redteam CLI
    Entry point: liveweb_arena/redteam/main.py

Key capabilities:

  • Targeted runs via --templates plugin/template[/variant]
  • Bulk runs via --all-templates (auto-resolves registered templates with a known plugin/cache source)
  • Plugin filtering via --plugins coingecko stooq ...
  • Template discovery without probing via --list-templates

Artifacts:

  • Writes report.json and report.md to ./redteam/&lt;timestamp&gt;/ (or --output-dir).
  2. Deterministic API probe pipeline (no browser, no LLM)
    Core logic: liveweb_arena/redteam/probe.py

How it works:

  • Generates tasks through the real TaskManager.generate_composite_task(...) for each (seed, template) pair.
  • For each generated question, infers a minimal set of probe URLs (conservative heuristics per plugin where needed).
  • Calls plugin fetch_api_data(url) for each probe URL and feeds results into GTCollector.on_page_visit(...).
  • Calls GTCollector.fetch_remaining_api_gt() which triggers the template’s real get_ground_truth(validation_info) logic.
  • Captures success/failure, GT values, and probe URLs per sample.
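The pipeline above can be sketched with stand-in objects. The StubPlugin and StubGTCollector classes and the probe_sample helper below are hypothetical simplifications, not the real liveweb_arena interfaces; they only mirror the method names the PR description mentions:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StubPlugin:
    """Hypothetical plugin: maps probe URLs to canned API payloads."""
    name: str
    responses: Dict[str, dict]

    def fetch_api_data(self, url: str) -> dict:
        return self.responses[url]


@dataclass
class StubGTCollector:
    """Hypothetical collector mirroring on_page_visit / fetch_remaining_api_gt."""
    get_ground_truth: Callable[[dict], object]
    validation_info: dict
    snapshots: List[tuple] = field(default_factory=list)

    def on_page_visit(self, url: str, api_data: dict) -> None:
        self.snapshots.append((url, api_data))

    def fetch_remaining_api_gt(self):
        # Merge all collected snapshots, then run the template's GT logic.
        merged: dict = {}
        for _, data in self.snapshots:
            merged.update(data)
        return self.get_ground_truth({**self.validation_info, "api": merged})


def probe_sample(plugin, collector, probe_urls):
    """Feed an API snapshot for each probe URL, then attempt GT collection."""
    for url in probe_urls:
        collector.on_page_visit(url, plugin.fetch_api_data(url))
    try:
        gt = collector.fetch_remaining_api_gt()
        return {"success": gt is not None, "gt": gt, "urls": list(probe_urls)}
    except Exception as exc:  # a GT failure is recorded per sample, not raised
        return {"success": False, "gt": None, "error": str(exc), "urls": list(probe_urls)}
```

Because every step is deterministic given the API responses, the same seed and template should produce the same per-sample record on repeat runs.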
  3. Metrics: collapse, baseline, GT success, stability
    Metrics: liveweb_arena/redteam/metrics.py

Computed per template:

  • GT success rate: fraction of samples where GT could be collected from probed data
  • Unique questions / unique GT values: simple diversity indicators
  • Cross-parameter collapse rate: detects whether distinct validation_info configurations collapse to identical GT outputs
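A minimal sketch of how such metrics might be computed from per-sample records. The field names (config, question, gt, success) are assumptions for illustration, not the actual report schema:

```python
from collections import defaultdict


def compute_metrics(samples):
    """
    samples: list of dicts with keys
      'config'   - hashable fingerprint of the validation_info configuration
      'question' - generated question text
      'gt'       - collected ground-truth value (hashable)
      'success'  - whether GT collection succeeded
    """
    n = len(samples)
    ok = [s for s in samples if s["success"]]
    gt_success_rate = len(ok) / n if n else 0.0

    unique_questions = len({s["question"] for s in samples})
    unique_gt_values = len({s["gt"] for s in ok})

    # Collapse: distinct configs whose GT output is shared with another config.
    configs_by_gt = defaultdict(set)
    for s in ok:
        configs_by_gt[s["gt"]].add(s["config"])
    distinct_configs = {s["config"] for s in ok}
    collapsed = {c for cs in configs_by_gt.values() if len(cs) > 1 for c in cs}
    collapse_rate = len(collapsed) / len(distinct_configs) if distinct_configs else 0.0

    return {
        "gt_success_rate": gt_success_rate,
        "unique_questions": unique_questions,
        "unique_gt_values": unique_gt_values,
        "collapse_rate": collapse_rate,
    }
```

Under this definition, a template where every parameter setting yields the same answer would score a collapse rate of 1.0.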
  4. CI gating (threshold enforcement)
    Flags (in CLI):

  • --fail-on-violation (exit code 2 if any violation)
  • --min-gt-success 0..1
  • --max-collapse 0..1
  • --max-baseline 0..1
  • --min-stability 0..1 (requires --repeat >= 2)
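One plausible shape for the gating logic behind these flags. Function names and the metrics-dict keys are illustrative, not the real CLI internals; only the exit-code-2 behavior and the --repeat >= 2 constraint come from the flags listed above:

```python
def check_thresholds(metrics, min_gt_success=None, max_collapse=None,
                     max_baseline=None, min_stability=None, repeat=1):
    """Return a list of violation strings; an empty list means the gate passes."""
    violations = []
    if min_gt_success is not None and metrics["gt_success_rate"] < min_gt_success:
        violations.append(
            f"gt_success_rate {metrics['gt_success_rate']:.2f} < {min_gt_success}")
    if max_collapse is not None and metrics["collapse_rate"] > max_collapse:
        violations.append(
            f"collapse_rate {metrics['collapse_rate']:.2f} > {max_collapse}")
    if max_baseline is not None and metrics.get("baseline_rate", 0.0) > max_baseline:
        violations.append("baseline_rate above threshold")
    if min_stability is not None:
        if repeat < 2:
            raise ValueError("--min-stability requires --repeat >= 2")
        if metrics.get("stability", 1.0) < min_stability:
            violations.append("stability below threshold")
    return violations


def exit_code(violations, fail_on_violation):
    # Mirrors the documented behavior: exit code 2 if any violation is found.
    return 2 if (violations and fail_on_violation) else 0
```

This keeps threshold evaluation separate from reporting, so CI can both print the violations and fail the job.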

  5. Windows importability fixes (unblocks running tooling on Windows)
    Two issues prevented running python -m liveweb_arena.redteam on Windows:
  • liveweb_arena/__init__.py eagerly imported the browser layer.
  • liveweb_arena/core/cache.py imported POSIX-only fcntl.
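A common shape for the fcntl portability fix, assuming the cache uses simple exclusive file locks. This is a sketch of the usual pattern, not the PR's actual diff:

```python
import os

# POSIX systems have fcntl; Windows does not, so fall back to msvcrt there.
if os.name == "nt":
    import msvcrt

    def lock_file(f) -> None:
        # Lock one byte at the current file position (Windows-style region lock).
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)

    def unlock_file(f) -> None:
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
else:
    import fcntl

    def lock_file(f) -> None:
        # Whole-file exclusive lock, released on unlock or file close.
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)

    def unlock_file(f) -> None:
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```

The import-time branch means fcntl is never imported on Windows, so the module becomes importable there without changing POSIX behavior.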

@claytonlin1110
Contributor Author

@angosr please check and give me feedback for this

Contributor

@angosr left a comment


Review: PR #16 — feat: add template red team dashboard CLI

Significance Gate: CONDITIONAL PASS

A quick automated probe for template GT correctness is useful as a first pass. However, this tool does NOT replace the CLAUDE.md Red Team Review (6 mandatory checks require human judgment: world knowledge attack, memorization space analysis, etc.). The tool should be positioned as a complement, not a substitute.


BLOCKING: 5 generated report files committed to the repository

redteam/20260326_123231/report.md
redteam/20260326_123324/report.md
redteam/20260326_123403/report.md
redteam/20260326_123651/report.md
redteam/20260326_123828/report.md

Generated output must not be committed to source control. Add redteam/ to .gitignore and remove these files from the PR.

BLOCKING: Unrelated changes bundled

  1. liveweb_arena/__init__.py: Lazy-loading BrowserEngine/BrowserSession — identical to the change in PR #12 (rejected). This is an infrastructure change that should be its own PR with proper justification.

  2. liveweb_arena/core/cache.py: Windows fcntl portability fix — unrelated to the red team CLI. Should be a separate fix PR.

These bundled changes make review harder and risk sneaking unrelated modifications through a tooling PR.

CONCERN: _infer_probe_urls is fragile and hard to maintain

The URL inference uses per-plugin heuristics:

if plugin_name == "openlibrary":
    a = vi.get("book_a_query")
    ...
if plugin_name == "arxiv":
    category = vi.get("category")
    ...
if plugin_name == "stooq":
    symbol = vi.get("symbol") or vi.get("symbol_a")
    ...

This creates a parallel code path that must be manually kept in sync with template validation_info keys. When a new template changes key names or adds new URL patterns, this function silently breaks. The probe gives false negatives (GT fails because the URL wasn't inferred) that look like template bugs.

Better approach: let each template declare its probe URLs via a method like get_probe_urls(validation_info) -> List[str], keeping the knowledge co-located with the template code.
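A minimal sketch of that suggestion, with a hypothetical base class and an illustrative Stooq template. The key names symbol and symbol_a are taken from the heuristic snippet above; everything else (class names, the price key, the URL shape) is assumed for illustration:

```python
from abc import ABC, abstractmethod
from typing import List


class Template(ABC):
    """Hypothetical base class; the real one lives elsewhere in liveweb_arena."""

    @abstractmethod
    def get_ground_truth(self, validation_info: dict): ...

    def get_probe_urls(self, validation_info: dict) -> List[str]:
        """Default: no declared probe URLs; templates override as needed."""
        return []


class StooqPriceTemplate(Template):
    """Illustrative template that declares its own probe URLs."""

    def get_ground_truth(self, validation_info: dict):
        return validation_info.get("price")

    def get_probe_urls(self, validation_info: dict) -> List[str]:
        # Knowledge of validation_info key names stays next to the template
        # that defines them, so a renamed key breaks loudly in one place.
        symbol = validation_info.get("symbol") or validation_info.get("symbol_a")
        return [f"https://stooq.com/q/l/?s={symbol}"] if symbol else []
```

The probe CLI would then call template.get_probe_urls(validation_info) instead of maintaining its own per-plugin if-chain.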

CONCERN: Probe bypasses page cache pipeline

The real GT collection uses GTCollector.on_page_visit() which is triggered by the browser's page cache. This probe calls plugin.fetch_api_data(url) directly and feeds it into on_page_visit. But:

  • The probe's api_data timestamp differs from what a real browser visit would produce
  • The probe doesn't respect GT data priority rules (CLAUDE.md §6: detail page > list page)
  • The probe visits URLs in a fixed order, while real agents visit pages in unpredictable order

A probe that passes doesn't guarantee the real eval pipeline works. This should be documented clearly.

Required Actions

  1. Remove the 5 committed report files; add redteam/ to .gitignore
  2. Split out __init__.py lazy-loading and cache.py fcntl fix into separate PRs
  3. Document clearly that this tool is a supplement to (not replacement for) CLAUDE.md Red Team Review and eval.py testing
