
fix: eliminate stale file race condition and fix error misclassification#611

Merged
anandgupta42 merged 3 commits into main from fix/stale-file-race-and-error-classification
Apr 1, 2026

Contributor

@anandgupta42 anandgupta42 commented Apr 1, 2026

What does this PR do?

Fixes three telemetry issues discovered via AppInsights analysis (4,500 validation errors over 2 days):

  1. Stale file race condition (54% of errors): FileTime.read() recorded new Date() instead of the file's actual mtime, so clock drift between the process and the filesystem (WSL, networked drives) made files look modified right after they were read. One user hit 782 retry loops on a single file. It now records Filesystem.stat(file).mtime, and the staleness tolerance increases from 50ms to 2s.

  2. HTTP 404 misclassified as validation (13%): The keyword "does not exist" in the validation pattern matched WebFetch 404 messages. Moved http_error pattern before validation and added HTTP 4xx status keywords.

  3. File staleness mixed into validation bucket (32%): Split file_stale out as a new error class for "must read file", "has been modified since", "before overwriting" errors, enabling cleaner triage dashboards.
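
To illustrate item 1, here is a minimal sketch of a staleness check that compares filesystem timestamps with each other rather than with the wall clock. The names `recordRead`, `isStale`, `STALE_TOLERANCE_MS`, and the in-memory map are hypothetical and are not taken from the project's code; the `mtimeMs` values are assumed to come from something like `fs.statSync(file).mtimeMs`.

```typescript
// Hypothetical names throughout; this is not the project's FileTime API.
const STALE_TOLERANCE_MS = 2000; // tolerance stated in the PR (previously 50ms)

// Last-read timestamps keyed by file path, kept in the filesystem's clock domain.
const lastRead = new Map<string, number>();

/** Record the file's own mtime at read time instead of Date.now(). */
function recordRead(file: string, mtimeMs: number): void {
  lastRead.set(file, mtimeMs);
}

/**
 * A file counts as stale if it was never read, or if its current mtime is
 * later than the recorded one by more than the tolerance.
 */
function isStale(file: string, currentMtimeMs: number): boolean {
  const readAt = lastRead.get(file);
  if (readAt === undefined) return true; // the file must be read first
  return currentMtimeMs - readAt > STALE_TOLERANCE_MS;
}
```

Because both sides of the comparison come from the same clock, a gap like the reported p50 of 651ms no longer trips the check on its own.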

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Issue for this PR

Closes #610

How did you verify your code works?

  • AppInsights KQL analysis confirmed root causes (p50 mtime gap: 651ms, 12 looping sessions, 782 max retries)
  • All 194 affected tests pass (time.test.ts, telemetry.test.ts, write.test.ts, edit.test.ts)
  • Upstream marker check passes (--markers --base main --strict)
  • TypeScript typecheck passes
  • No changes needed in altimate-core-internal — all bugs are in the TS layer

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • New and existing unit tests pass locally with my changes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Use filesystem modification time for last-read tracking, reduce timing flakiness, and include computed delta in stale-file error messages.
    • Emit non-blocking telemetry when wall-clock vs filesystem mtime drift exceeds threshold.
  • New Features

    • Added a distinct "file_stale" error classification and new telemetry events for filetime drift and stale assertions.
    • Improved HTTP error detection priority for more accurate classification.
  • Tests

    • Updated tests to use deterministic filesystem timestamp control and adjusted expectations for stale/filetime behavior.


@claude claude bot left a comment


Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.


coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 31e2a103-b292-47ca-915e-a5ad4de47266

📥 Commits

Reviewing files that changed from the base of the PR and between 4d6d920 and 86ecad4.

📒 Files selected for processing (5)
  • packages/opencode/src/altimate/telemetry/index.ts
  • packages/opencode/src/file/time.ts
  • packages/opencode/test/file/time.test.ts
  • packages/opencode/test/telemetry/telemetry.test.ts
  • packages/opencode/test/tool/edit.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/opencode/test/file/time.test.ts
  • packages/opencode/src/file/time.ts

📝 Walkthrough


Switched FileTime.read to prefer filesystem mtime over wall-clock, added telemetry events for filetime drift and assert outcomes, introduced a new telemetry error_class file_stale, reordered/expanded ERROR_PATTERNS (HTTP earlier, dedicated file_stale patterns), and updated tests to use deterministic utimes-based timestamps.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Telemetry: schema & classification | packages/opencode/src/altimate/telemetry/index.ts | Added file_stale to Telemetry.Event (core_failure.error_class), introduced filetime_drift and filetime_assert event variants, reordered ERROR_PATTERNS to match HTTP errors earlier, and added a dedicated file_stale pattern while removing those phrases from validation. |
| FileTime logic | packages/opencode/src/file/time.ts | Changed FileTime.read() to record filesystem mtime (fallback to wall-clock when unavailable). Added drift detection that emits filetime_drift telemetry when |
| Tests: telemetry classification | packages/opencode/test/telemetry/telemetry.test.ts | Updated expectations so stale-file messages classify as file_stale (not validation), adjusted HTTP 404/410/429-related expectations to http_error, and expanded HTTP pattern coverage. |
| Tests: FileTime and edit tool | packages/opencode/test/file/time.test.ts, packages/opencode/test/tool/edit.test.ts | Replaced timing-based setTimeout with deterministic fs.utimes() to set atime/mtime (e.g., +5000ms) and added assertions that FileTime.read() stores filesystem mtime; expanded error-message assertions to require Delta: and tolerance: substrings. |
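
The deterministic-timestamp approach described for the tests can be sketched roughly as follows. This is an illustration only: the helper name `bumpMtime` and the temp-file setup are hypothetical, and the project's tests may be structured differently. It relies on Node's `utimesSync`, which sets both atime and mtime.

```typescript
import { mkdtempSync, statSync, utimesSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: push a file's atime/mtime forward by a fixed offset
// instead of sleeping and hoping the clock has advanced far enough.
function bumpMtime(file: string, offsetMs: number): number {
  const before = statSync(file).mtimeMs;
  const future = new Date(before + offsetMs);
  utimesSync(file, future, future); // arguments: path, atime, mtime
  return statSync(file).mtimeMs - before; // observed change in mtime, in ms
}

// Create a scratch file and move its mtime about 5 seconds ahead.
const dir = mkdtempSync(join(tmpdir(), "filetime-"));
const file = join(dir, "example.txt");
writeFileSync(file, "hello");
const observedDeltaMs = bumpMtime(file, 5000);
```

The observed delta may differ slightly from the requested offset, depending on the timestamp granularity of the filesystem, so a test would typically compare against a threshold rather than an exact value.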

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant FileTime
  participant Filesystem
  participant Telemetry

  Caller->>FileTime: read(sessionID, filepath)
  FileTime->>Filesystem: stat(filepath)
  Filesystem-->>FileTime: { mtime }
  FileTime->>FileTime: compute driftMs = |now - mtime|
  alt driftMs > 10
    FileTime->>Telemetry: track(type: "filetime_drift", drift_ms, mtime_ahead, ...)
    Telemetry-->>FileTime: ack (swallowed errors)
  end
  FileTime-->>Caller: record last-read timestamp (mtime or wall-clock)
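
The drift branch in the diagram above could look something like the sketch below. The event fields `drift_ms` and `mtime_ahead` and the threshold of 10 are taken from the diagram; the function name `reportDrift`, the `Tracker` signature, and the constant name are assumptions made for illustration.

```typescript
// Hypothetical shapes; only the field names and the threshold come from the diagram.
type DriftEvent = { type: "filetime_drift"; drift_ms: number; mtime_ahead: boolean };
type Tracker = (event: DriftEvent) => void;

const DRIFT_THRESHOLD_MS = 10;

/** Compute |now - mtime| and emit a best-effort telemetry event when it exceeds the threshold. */
function reportDrift(nowMs: number, mtimeMs: number, track: Tracker): number {
  const driftMs = Math.abs(nowMs - mtimeMs);
  if (driftMs > DRIFT_THRESHOLD_MS) {
    try {
      track({ type: "filetime_drift", drift_ms: driftMs, mtime_ahead: mtimeMs > nowMs });
    } catch {
      // Telemetry is non-blocking: a tracking failure must not fail the read.
    }
  }
  return driftMs;
}
```

Swallowing tracker errors keeps the read path independent of telemetry health, which matches the "ack (swallowed errors)" step in the diagram.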

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • mdesmet

Poem

🐰 I watched the clocks and files collide,
mtime led the way, no longer to hide.
Drift numbers hum, assertions now speak,
file_stale whispers when timestamps are weak.
Hop, hop—telemetry keeps the logs neat! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: fixing a stale file race condition and error misclassification. It directly reflects the primary objectives. |
| Description check | ✅ Passed | The description covers all template sections with substantive content: What changed (3 issues fixed with specifics), Type of change (bug fix), Issues closed (#610), and verification steps. All major sections are properly completed. |
| Linked Issues check | ✅ Passed | The PR successfully addresses all three coding requirements from issue #610: uses filesystem mtime instead of wall-clock in FileTime.read() [1], reorders error patterns to classify HTTP 404 correctly [2], and introduces file_stale error class [3]. |
| Out of Scope Changes check | ✅ Passed | All changes remain scoped to addressing the three telemetry issues: FileTime read/assert logic, error pattern classification, and new telemetry event types. Test updates directly support these fixes without introducing unrelated modifications. |


anandgupta42 and others added 2 commits April 1, 2026 16:25
- Use file's actual `mtime` instead of `new Date()` in `FileTime.read()` to
  eliminate clock drift between Node.js and filesystem (WSL, networked drives)
- Increase staleness tolerance from 50ms to 2s (p50 gap was 651ms, caused
  782-retry loops on WSL)
- Split `file_stale` out of `validation` error class for cleaner triage
- Move `http_error` pattern before `validation` and add `HTTP 4xx` keywords
  to prevent WebFetch 404s from being misclassified as validation errors
- Update tests to use `fs.utimes()` for deterministic mtime manipulation

Closes #610

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@anandgupta42 anandgupta42 force-pushed the fix/stale-file-race-and-error-classification branch from 6281583 to d0094a1 Compare April 1, 2026 23:34
@anandgupta42 anandgupta42 merged commit 086d7b2 into main Apr 1, 2026
13 checks passed
anandgupta42 added a commit that referenced this pull request Apr 1, 2026
The keyword was removed in #611 to prevent HTTP 404 messages from being
misclassified as validation errors. However, the pattern reordering
(http_error now comes before validation) already handles that — HTTP 404
messages match `http_error` first via `"http 404"` keyword. Removing
`"does not exist"` from validation caused SQL errors like
`"column foo does not exist"` to fall through to `unknown`.

Restore the keyword and add a test for the SQL case.

Caught by multi-model code review (Claude + Gemini 3.1 Pro consensus).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
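
The ordering argument in the commit message above can be illustrated with a small first-match-wins classifier. This is a trimmed-down sketch: the keyword lists contain only phrases quoted in this PR, the position of `file_stale` in the list is an assumption, and `PATTERNS` and `classify` are hypothetical names rather than the project's `ERROR_PATTERNS` implementation.

```typescript
// Hypothetical, reduced pattern table; not the project's full ERROR_PATTERNS.
type ErrorClass = "http_error" | "file_stale" | "validation" | "unknown";

const PATTERNS: Array<{ errorClass: ErrorClass; keywords: string[] }> = [
  // http_error is listed first, so HTTP messages never reach the validation keywords.
  { errorClass: "http_error", keywords: ["http 404"] },
  { errorClass: "file_stale", keywords: ["must read file", "has been modified since", "before overwriting"] },
  { errorClass: "validation", keywords: ["does not exist"] },
];

/** Return the class of the first pattern whose keyword occurs in the message. */
function classify(message: string): ErrorClass {
  const lower = message.toLowerCase();
  for (const { errorClass, keywords } of PATTERNS) {
    if (keywords.some((k) => lower.includes(k))) return errorClass;
  }
  return "unknown";
}
```

With this ordering, a message containing both "HTTP 404" and "does not exist" is classified as `http_error`, while a SQL-style message such as "column foo does not exist" still reaches `validation` instead of falling through to `unknown`.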
anandgupta42 added a commit that referenced this pull request Apr 2, 2026
The keyword was removed in #611 to prevent HTTP 404 messages from being
misclassified as validation errors. However, the pattern reordering
(http_error now comes before validation) already handles that — HTTP 404
messages match `http_error` first via `"http 404"` keyword. Removing
`"does not exist"` from validation caused SQL errors like
`"column foo does not exist"` to fall through to `unknown`.

Restore the keyword and add a test for the SQL case.

Caught by multi-model code review (Claude + Gemini 3.1 Pro consensus).

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

Development

Successfully merging this pull request may close these issues.

fix: stale file race condition causes retry loops (782 retries), HTTP 404 misclassified as validation
