Skip to content

Stabilize LiveWeb browser, cache, and reachability handling#11

Open
wangtong10086 wants to merge 23 commits intoAffineFoundation:mainfrom
wangtong10086:codex/liveweb-arena-stability-20260314
Open

Stabilize LiveWeb browser, cache, and reachability handling#11
wangtong10086 wants to merge 23 commits intoAffineFoundation:mainfrom
wangtong10086:codex/liveweb-arena-stability-20260314

Conversation

@wangtong10086
Copy link

Summary

  • stabilize browser and cache handling for noisy sites including stooq and taostats
  • improve protocol parsing and llm failure diagnostics for benchmark and rollout stability
  • add structured attribution for blocked domains and taostats environment failures
  • disable reasoning for OpenRouter Kimi requests using the provider-compatible payload

Testing

  • .venv/bin/pytest tests/core/test_llm_client.py -q
  • .venv/bin/pytest tests/core/test_agent_loop_recovery.py tests/core/test_interceptor.py tests/core/test_reachability_audit.py -q

Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Stabilize LiveWeb browser, cache, and reachability handling

Scope

Large PR: +9936/-609 across 41 files. Covers browser lifecycle, cache, interceptor, protocol parsing, LLM client, reachability audit, and batch eval scripts.

Tests: 274 passed, 4 skipped (1 test file has hardcoded path issue).


BLOCKING Issues

1. File Size Violations (CLAUDE.md: max 500 lines)

File Lines Over Limit
env.py 1586 3.2x
browser.py 1546 3.1x
cache.py 1235 2.5x
llm_client.py 1202 2.4x
interceptor.py 843 1.7x
agent_loop.py 647 1.3x
agent_protocol.py 607 1.2x
reachability_audit.py 589 1.2x

8 files exceed the 500-line limit. browser.py and env.py are over 3x. These need to be split into focused modules.

2. Excessive Silent Exception Swallowing (CLAUDE.md Rule 9)

browser.py alone has 30+ except Exception blocks, many with pass or silent return False/None. Examples:

  • Lines 319, 330, 333, 344, 353, 356, 368 — browser lifecycle exceptions silently swallowed
  • Lines 494, 498, 503, 521, 553 — navigation failures silently ignored
  • Lines 727, 763, 771, 797, 903, 930, 957, 993 — UI interaction failures silently passed

CLAUDE.md Rule 9: "Never add fallback/default values to mask errors. If data is missing or computation fails, raise the error explicitly so the root cause is exposed."

When browser actions fail silently, the system continues with invalid state, producing incorrect observations and undiagnosable failures.

3. Hardcoded Path in Test

# tests/test_batch_eval_script.py:12
SCRIPT_PATH = Path("/home/xmyf/liveweb-arena/scripts/batch_eval.py")

Must use relative paths or __file__-based resolution.


HIGH Issues

4. uv.lock Committed (1870 lines)

uv.lock is a lockfile that should be in .gitignore for library projects, or if intentionally committed, should be in a separate PR. It adds 1870 lines of noise to this already large PR.

5. Over-Engineering: Excessive Fallback Chains

The browser and cache modules have evolved from "fix root cause" to "catch every exception and continue". This masks real bugs:

# Pattern repeated 30+ times in browser.py:
try:
    await some_action()
except Exception:
    pass  # silently continue

CLAUDE.md: "Fix Root Cause - Never patch over problems, always solve at the root."

6. batch_eval.py and reachability_audit_report.py

New scripts (+361 and +321 lines) are operational tools. While useful, they add significant scope to an already large PR. Consider splitting into a separate PR.


MEDIUM Issues

7. Kimi Reasoning Disable (llm_client.py)

The Kimi/OpenRouter reasoning disable logic adds provider-specific code. Consider making this configurable via environment variables rather than hardcoding model detection:

# Current: hardcoded model detection
if "kimi" in model_name.lower():
    # disable reasoning

8. ReachabilityAudit Complexity

reachability_audit.py (589 lines) introduces a complex audit system. While useful for diagnostics, it adds overhead to every evaluation run. Should be optional/configurable.


Positive Aspects

  • Browser retry logic addresses real instability issues
  • Protocol parsing improvements make the system more robust to LLM output variations
  • Structured failure attribution is valuable for diagnostics
  • Good test coverage for new code (360 lines of agent loop recovery tests)
  • Taostats retry and failure handling improvements are solid

Recommendations

  1. Split into 3-4 smaller PRs — This PR touches too many concerns: browser lifecycle, cache, protocol, diagnostics, batch eval, Kimi support
  2. Fix file sizes — Split browser.py, cache.py, llm_client.py, env.py into focused modules
  3. Replace silent exceptions — At minimum, add logging. Ideally, fix root causes instead of catching everything
  4. Remove uv.lock or put in separate commit
  5. Fix hardcoded path in test_batch_eval_script.py
  6. Make reachability audit optional — Add flag to disable for performance-sensitive runs

Verdict: REQUEST CHANGES

The file size violations and pervasive silent exception swallowing are fundamental CLAUDE.md violations that must be addressed before merge. The PR's goals are sound but the implementation needs cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants