Stabilize LiveWeb browser, cache, and reachability handling by wangtong10086 · Pull Request #11 · AffineFoundation/liveweb-arena

wangtong10086 · 2026-03-20T07:17:20Z

Summary

stabilize browser and cache handling for noisy sites including stooq and taostats
improve protocol parsing and llm failure diagnostics for benchmark and rollout stability
add structured attribution for blocked domains and taostats environment failures
disable reasoning for OpenRouter Kimi requests using the provider-compatible payload

Testing

.venv/bin/pytest tests/core/test_llm_client.py -q
.venv/bin/pytest tests/core/test_agent_loop_recovery.py tests/core/test_interceptor.py tests/core/test_reachability_audit.py -q

angosr

Review: Stabilize LiveWeb browser, cache, and reachability handling

Scope

Large PR: +9936/-609 across 41 files. Covers browser lifecycle, cache, interceptor, protocol parsing, LLM client, reachability audit, and batch eval scripts.

Tests: 274 passed, 4 skipped (1 test file has hardcoded path issue).

BLOCKING Issues

1. File Size Violations (CLAUDE.md: max 500 lines)

File	Lines	Over Limit
env.py	1586	3.2x
browser.py	1546	3.1x
cache.py	1235	2.5x
llm_client.py	1202	2.4x
interceptor.py	843	1.7x
agent_loop.py	647	1.3x
agent_protocol.py	607	1.2x
reachability_audit.py	589	1.2x

8 files exceed the 500-line limit. browser.py and env.py are over 3x. These need to be split into focused modules.

2. Excessive Silent Exception Swallowing (CLAUDE.md Rule 9)

browser.py alone has 30+ except Exception blocks, many with pass or silent return False/None. Examples:

Lines 319, 330, 333, 344, 353, 356, 368 — browser lifecycle exceptions silently swallowed
Lines 494, 498, 503, 521, 553 — navigation failures silently ignored
Lines 727, 763, 771, 797, 903, 930, 957, 993 — UI interaction failures silently passed

CLAUDE.md Rule 9: "Never add fallback/default values to mask errors. If data is missing or computation fails, raise the error explicitly so the root cause is exposed."

When browser actions fail silently, the system continues with invalid state, producing incorrect observations and undiagnosable failures.

3. Hardcoded Path in Test

# tests/test_batch_eval_script.py:12
SCRIPT_PATH = Path("/home/xmyf/liveweb-arena/scripts/batch_eval.py")

Must use relative paths or __file__-based resolution.

HIGH Issues

4. uv.lock Committed (1870 lines)

uv.lock is a lockfile that should be in .gitignore for library projects, or if intentionally committed, should be in a separate PR. It adds 1870 lines of noise to this already large PR.

5. Over-Engineering: Excessive Fallback Chains

The browser and cache modules have evolved from "fix root cause" to "catch every exception and continue". This masks real bugs:

# Pattern repeated 30+ times in browser.py:
try:
    await some_action()
except Exception:
    pass  # silently continue

CLAUDE.md: "Fix Root Cause - Never patch over problems, always solve at the root."

6. batch_eval.py and reachability_audit_report.py

New scripts (+361 and +321 lines) are operational tools. While useful, they add significant scope to an already large PR. Consider splitting into a separate PR.

MEDIUM Issues

7. Kimi Reasoning Disable (llm_client.py)

The Kimi/OpenRouter reasoning disable logic adds provider-specific code. Consider making this configurable via environment variables rather than hardcoding model detection:

# Current: hardcoded model detection
if "kimi" in model_name.lower():
    # disable reasoning

8. ReachabilityAudit Complexity

reachability_audit.py (589 lines) introduces a complex audit system. While useful for diagnostics, it adds overhead to every evaluation run. Should be optional/configurable.

Positive Aspects

Browser retry logic addresses real instability issues
Protocol parsing improvements make the system more robust to LLM output variations
Structured failure attribution is valuable for diagnostics
Good test coverage for new code (360 lines of agent loop recovery tests)
Taostats retry and failure handling improvements are solid

Recommendations

Split into 3-4 smaller PRs — This PR touches too many concerns: browser lifecycle, cache, protocol, diagnostics, batch eval, Kimi support
Fix file sizes — Split browser.py, cache.py, llm_client.py, env.py into focused modules
Replace silent exceptions — At minimum, add logging. Ideally, fix root causes instead of catching everything
Remove uv.lock or put in separate commit
Fix hardcoded path in test_batch_eval_script.py
Make reachability audit optional — Add flag to disable for performance-sensitive runs

Verdict: REQUEST CHANGES

The file size violations and pervasive silent exception swallowing are fundamental CLAUDE.md violations that must be addressed before merge. The PR's goals are sound but the implementation needs cleanup.

xmyf added 20 commits March 14, 2026 03:37

feat: stabilize liveweb arena eval execution

ac2d5f0

feat: update .gitignore to include .cache directory

2209b06

fix: bypass proxy for local llm endpoints

5349b48

fix: harden browser lifecycle for rl rollouts

b85717a

fix: harden liveweb protocol and site handling

698516a

fix: retry normalized openlibrary search queries

e80c114

fix: retry transient taostats api fetches

1d6bf13

fix: make batch eval incremental and deterministic

5b81a32

feat: add recoverable liveweb format retries

632d3e4

feat: tune liveweb format recovery sampling

9827b57

feat: audit liveweb reachability and recovery failures

71ae7bf

fix: restore cache manager compatibility helper

69570ad

fix: bound recovery context and refine browser audit

e30da8f

Stabilize browser and cache handling for noisy data sites

99a826a

Improve protocol parsing and LLM failure diagnostics

bef5001

Add structured attribution for blocked domains and taostats failures

76afd0b

Disable Kimi reasoning in OpenRouter requests

171826d

Stabilize taostats cache setup and failure attribution

c24bcf8

Harden taostats list actions and UI target attribution

aef7546

Fail fast on repeated disallowed-domain navigation

17701c5

angosr requested changes Mar 20, 2026

View reviewed changes

xmyf added 3 commits March 22, 2026 14:14

Split strict eval from fast collection runtime

63a2427

Harden LiveWeb recovery and browser proxy flow

c5c3594

Add think ablation experiment tooling

77d3085

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stabilize LiveWeb browser, cache, and reachability handling#11

Stabilize LiveWeb browser, cache, and reachability handling#11
wangtong10086 wants to merge 23 commits intoAffineFoundation:mainfrom
wangtong10086:codex/liveweb-arena-stability-20260314

wangtong10086 commented Mar 20, 2026

Uh oh!

angosr left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wangtong10086 commented Mar 20, 2026

Summary

Testing

Uh oh!

angosr left a comment

Choose a reason for hiding this comment

Review: Stabilize LiveWeb browser, cache, and reachability handling

Scope

BLOCKING Issues

1. File Size Violations (CLAUDE.md: max 500 lines)

2. Excessive Silent Exception Swallowing (CLAUDE.md Rule 9)

3. Hardcoded Path in Test

HIGH Issues

4. uv.lock Committed (1870 lines)

5. Over-Engineering: Excessive Fallback Chains

6. batch_eval.py and reachability_audit_report.py

MEDIUM Issues

7. Kimi Reasoning Disable (llm_client.py)

8. ReachabilityAudit Complexity

Positive Aspects

Recommendations

Verdict: REQUEST CHANGES

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants