Stabilize LiveWeb browser, cache, and reachability handling#11
Stabilize LiveWeb browser, cache, and reachability handling#11wangtong10086 wants to merge 23 commits intoAffineFoundation:mainfrom
Conversation
angosr
left a comment
There was a problem hiding this comment.
Review: Stabilize LiveWeb browser, cache, and reachability handling
Scope
Large PR: +9936/-609 across 41 files. Covers browser lifecycle, cache, interceptor, protocol parsing, LLM client, reachability audit, and batch eval scripts.
Tests: 274 passed, 4 skipped (1 test file has hardcoded path issue).
BLOCKING Issues
1. File Size Violations (CLAUDE.md: max 500 lines)
| File | Lines | Over Limit |
|---|---|---|
| env.py | 1586 | 3.2x |
| browser.py | 1546 | 3.1x |
| cache.py | 1235 | 2.5x |
| llm_client.py | 1202 | 2.4x |
| interceptor.py | 843 | 1.7x |
| agent_loop.py | 647 | 1.3x |
| agent_protocol.py | 607 | 1.2x |
| reachability_audit.py | 589 | 1.2x |
8 files exceed the 500-line limit. browser.py and env.py are over 3x. These need to be split into focused modules.
2. Excessive Silent Exception Swallowing (CLAUDE.md Rule 9)
browser.py alone has 30+ except Exception blocks, many with pass or silent return False/None. Examples:
- Lines 319, 330, 333, 344, 353, 356, 368 — browser lifecycle exceptions silently swallowed
- Lines 494, 498, 503, 521, 553 — navigation failures silently ignored
- Lines 727, 763, 771, 797, 903, 930, 957, 993 — UI interaction failures silently passed
CLAUDE.md Rule 9: "Never add fallback/default values to mask errors. If data is missing or computation fails, raise the error explicitly so the root cause is exposed."
When browser actions fail silently, the system continues with invalid state, producing incorrect observations and undiagnosable failures.
3. Hardcoded Path in Test
# tests/test_batch_eval_script.py:12
SCRIPT_PATH = Path("/home/xmyf/liveweb-arena/scripts/batch_eval.py")Must use relative paths or __file__-based resolution.
HIGH Issues
4. uv.lock Committed (1870 lines)
uv.lock is a lockfile that should be in .gitignore for library projects, or if intentionally committed, should be in a separate PR. It adds 1870 lines of noise to this already large PR.
5. Over-Engineering: Excessive Fallback Chains
The browser and cache modules have evolved from "fix root cause" to "catch every exception and continue". This masks real bugs:
# Pattern repeated 30+ times in browser.py:
try:
await some_action()
except Exception:
pass # silently continueCLAUDE.md: "Fix Root Cause - Never patch over problems, always solve at the root."
6. batch_eval.py and reachability_audit_report.py
New scripts (+361 and +321 lines) are operational tools. While useful, they add significant scope to an already large PR. Consider splitting into a separate PR.
MEDIUM Issues
7. Kimi Reasoning Disable (llm_client.py)
The Kimi/OpenRouter reasoning disable logic adds provider-specific code. Consider making this configurable via environment variables rather than hardcoding model detection:
# Current: hardcoded model detection
if "kimi" in model_name.lower():
# disable reasoning8. ReachabilityAudit Complexity
reachability_audit.py (589 lines) introduces a complex audit system. While useful for diagnostics, it adds overhead to every evaluation run. Should be optional/configurable.
Positive Aspects
- Browser retry logic addresses real instability issues
- Protocol parsing improvements make the system more robust to LLM output variations
- Structured failure attribution is valuable for diagnostics
- Good test coverage for new code (360 lines of agent loop recovery tests)
- Taostats retry and failure handling improvements are solid
Recommendations
- Split into 3-4 smaller PRs — This PR touches too many concerns: browser lifecycle, cache, protocol, diagnostics, batch eval, Kimi support
- Fix file sizes — Split browser.py, cache.py, llm_client.py, env.py into focused modules
- Replace silent exceptions — At minimum, add logging. Ideally, fix root causes instead of catching everything
- Remove uv.lock or put in separate commit
- Fix hardcoded path in test_batch_eval_script.py
- Make reachability audit optional — Add flag to disable for performance-sensitive runs
Verdict: REQUEST CHANGES
The file size violations and pervasive silent exception swallowing are fundamental CLAUDE.md violations that must be addressed before merge. The PR's goals are sound but the implementation needs cleanup.
Summary
Testing
.venv/bin/pytest tests/core/test_llm_client.py -q.venv/bin/pytest tests/core/test_agent_loop_recovery.py tests/core/test_interceptor.py tests/core/test_reachability_audit.py -q