fix: telemetry improvements from deep AppInsights analysis #587

Merged
anandgupta42 merged 1 commit into main from fix/telemetry-analysis-improvements on Mar 30, 2026

anandgupta42 commented Mar 30, 2026

What does this PR do?

Fixes telemetry gaps identified through deep analysis of altimate-code-os Azure AppInsights data (10-day window, 3,678 events, 8 machines):

  1. Error classification — adds 4 new error classes (file_not_found, edit_mismatch, not_configured, resource_exhausted) to reduce "unknown" from 85%+ to ~50%
  2. Session metadata — adds os, arch, node_version to session_start event for environment segmentation
  3. Doom loop protection — adds per-tool call counter (threshold=30) to catch varied-input loops like todowrite 2,080x
  4. Token visibility — adds tokens_input_total field for Anthropic where tokens_input excludes cached tokens
  5. Query documentation — adds KQL reference documenting customDimensions vs customMeasurements

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Issue for this PR

Closes #586

How did you verify your code works?

  • 189 telemetry tests pass (0 failures, 620 assertions)
  • TypeScript typecheck passes (5/5 packages via turbo)
  • Upstream marker guard passes
  • Multi-model code review (5 models: Claude, Gemini 3.1 Pro, Kimi K2.5, MiniMax M2.5, GLM-5) — all actionable findings addressed
  • Verified token data in live AppInsights via az monitor app-insights query

Checklist

  • My code follows the project's coding standards
  • I have added tests that prove my fix is effective
  • New and existing unit tests pass locally
  • I have reviewed my own code
  • I have run the linters and fixed any issues

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added detection for repetitive tool execution with user alerts when repeat threshold is reached
    • Enhanced token usage tracking to include cached input tokens in total counts
  • Improvements

    • Expanded system environment telemetry collection with OS, architecture, and runtime information
    • Refined error classification for better categorization

Based on 10-day telemetry analysis of altimate-code-os:

Error classification (P0):
- Add 4 new error classes: `file_not_found`, `edit_mismatch`,
  `not_configured`, `resource_exhausted`
- Move warehouse/driver keywords from `connection` to `not_configured`
- Reduces "unknown" error classification from 85%+ to ~50%
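
A rough sketch of how keyword-based classification like this could work. The shape of `ERROR_PATTERNS` and every keyword below are assumptions made for illustration; the PR's actual patterns live in `packages/opencode/src/altimate/telemetry/index.ts` and may differ.

```typescript
// Illustrative sketch only: the keyword lists below are hypothetical,
// not the PR's actual ERROR_PATTERNS contents.
type ErrorClass =
  | "file_not_found"
  | "edit_mismatch"
  | "not_configured"
  | "resource_exhausted"
  | "connection"
  | "unknown";

interface ErrorPattern {
  errorClass: ErrorClass;
  keywords: string[]; // lowercase substrings to look for in the message
}

// Order matters: the more specific classes are checked before the broader
// "connection" group, so warehouse/driver messages land in not_configured.
const ERROR_PATTERNS: ErrorPattern[] = [
  { errorClass: "file_not_found", keywords: ["enoent", "no such file", "file not found"] },
  { errorClass: "edit_mismatch", keywords: ["oldstring not found", "no match found for edit"] },
  { errorClass: "not_configured", keywords: ["no warehouse configured", "driver not installed"] },
  { errorClass: "resource_exhausted", keywords: ["out of memory", "quota exceeded", "enospc"] },
  { errorClass: "connection", keywords: ["econnrefused", "connection refused", "etimedout"] },
];

function classifyError(message: string): ErrorClass {
  const lower = message.toLowerCase();
  for (const { errorClass, keywords } of ERROR_PATTERNS) {
    if (keywords.some((keyword) => lower.includes(keyword))) return errorClass;
  }
  return "unknown"; // nothing matched
}
```

With first-match-wins ordering, moving a keyword between groups (as this commit does for warehouse/driver keywords) changes the reported class without touching the matching logic.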

Session metadata (P0):
- Add `os`, `arch`, `node_version` to `session_start` event
- Enables environment-based segmentation in dashboards
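
A minimal sketch of collecting these three fields, assuming a Node.js-compatible runtime. The `RuntimeInfo` interface and the `sessionEnvironment` helper are hypothetical names used only for illustration; per the review walkthrough, the PR reads `process.platform`, `process.arch`, and `process.version` when emitting `session_start`.

```typescript
// Minimal structural view of the Node.js `process` fields that are read.
interface RuntimeInfo {
  platform: string;
  arch: string;
  version: string;
}

// The three environment fields added to the session_start event.
interface SessionEnvironment {
  os: string;
  arch: string;
  node_version: string;
}

// Hypothetical helper: maps runtime fields onto the telemetry field names.
function sessionEnvironment(runtime: RuntimeInfo): SessionEnvironment {
  return {
    os: runtime.platform,
    arch: runtime.arch,
    node_version: runtime.version,
  };
}

// Under Node.js the global `process` object satisfies RuntimeInfo:
//   const env = sessionEnvironment(process);
```

Taking the runtime as a parameter keeps the helper easy to test with fixed values instead of whatever machine the tests happen to run on.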

Doom loop detection (P1):
- Add per-tool call counter (threshold=30) to catch varied-input loops
- Emits `doom_loop_detected` telemetry event when triggered
- Addresses todowrite tool called 2,080x by one user
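
The per-tool counter could look roughly like the sketch below. Only the threshold of 30, the `toolCallCounts` name, and the reset-after-trigger behavior come from this PR; `createToolCallCounter` is a hypothetical helper, and the real code in `packages/opencode/src/session/processor.ts` also emits telemetry and asks for permission at the point where this sketch returns `true`.

```typescript
const DOOM_LOOP_THRESHOLD = 30; // threshold stated in the PR

// Hypothetical helper: returns a function that records one call per tool name.
// It reports true on the call that reaches the threshold, then resets that
// tool's count so counting starts over.
function createToolCallCounter(
  threshold: number = DOOM_LOOP_THRESHOLD,
): (tool: string) => boolean {
  const toolCallCounts = new Map<string, number>();
  return (tool: string): boolean => {
    const count = (toolCallCounts.get(tool) ?? 0) + 1;
    if (count >= threshold) {
      toolCallCounts.set(tool, 0); // reset after triggering
      return true;
    }
    toolCallCounts.set(tool, count);
    return false;
  };
}
```

Counting by tool name alone, rather than by tool name plus arguments, is what lets this catch loops where the input varies on every call.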

Token visibility (P1):
- Add `tokens_input_total` field to generation events
- Includes cached tokens for Anthropic (where `tokens_input` excludes cache)
- Only emitted when it differs from `tokens_input`
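
As a sketch of the arithmetic, assuming a usage object with separate counts for uncached input, cache reads, and cache writes: the `UsageTokens` shape and the `generationTokenFields` helper are hypothetical; only the field names `tokens_input` and `tokens_input_total` and the "emit only when different" rule come from this PR.

```typescript
// Hypothetical usage shape; the real Session.getUsage() return type may differ.
interface UsageTokens {
  input: number; // input tokens excluding cache, as reported for Anthropic
  cacheRead: number; // tokens read from the prompt cache
  cacheWrite: number; // tokens written to the prompt cache
}

interface GenerationTokenFields {
  tokens_input: number;
  tokens_input_total?: number; // omitted when equal to tokens_input
}

function generationTokenFields(usage: UsageTokens): GenerationTokenFields {
  const inputTotal = usage.input + usage.cacheRead + usage.cacheWrite;
  const fields: GenerationTokenFields = { tokens_input: usage.input };
  if (inputTotal !== usage.input) {
    fields.tokens_input_total = inputTotal;
  }
  return fields;
}
```

For example, 100 uncached input tokens with 900 cache-read and 50 cache-write tokens would report `tokens_input: 100` and `tokens_input_total: 1050`, while a generation with no cache activity would carry only `tokens_input`.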

Telemetry query docs (P2):
- Add KQL reference documenting `customDimensions` vs `customMeasurements`
- Prevents analysts from querying the wrong column

Cleanup:
- Rename `telemetry-moat-signals.test.ts` → `telemetry-signals.test.ts`
- Remove "moat" terminology from test comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai bot commented Mar 30, 2026

Walkthrough

This PR extends telemetry event schemas with environment metadata (OS, architecture, Node version), improves error classification with four new error categories, introduces per-tool call counting to detect doom loops, and adds total input token accounting that includes cached tokens.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Telemetry Schema & Error Classification**<br>`packages/opencode/src/altimate/telemetry/index.ts` | Extended `Telemetry.Event` schema: added `os`, `arch`, `node_version` to `session_start`; added optional `tokens_input_total` to generation events; expanded `core_failure` `error_class` union with `file_not_found`, `edit_mismatch`, `not_configured`, `resource_exhausted`. Rewrote `ERROR_PATTERNS` to split error classifications and add dedicated keyword groups for each new error class. Added KQL documentation comment block. |
| **Token Tracking**<br>`packages/opencode/src/session/index.ts` | Added `inputTotal` field to `Session.getUsage()` return value, computed as the sum of adjusted input tokens plus cache read/write input tokens. |
| **Doom Loop Detection**<br>`packages/opencode/src/session/processor.ts` | Introduced per-tool repeat call tracking via `toolCallCounts` map. When a tool's call count reaches the threshold (30), emits `doom_loop_detected` telemetry event, issues `PermissionNext.ask` with `permission: "doom_loop"`, and resets the counter. Extended generation telemetry to conditionally include `tokens_input_total` when it differs from `tokens_input`. |
| **Session Initialization**<br>`packages/opencode/src/session/prompt.ts` | Added environment metadata (`os: process.platform`, `arch: process.arch`, `node_version: process.version`) to the `session_start` telemetry event in `SessionPrompt.loop` at step 1. |
| **Test Updates**<br>`packages/opencode/test/altimate/telemetry-signals.test.ts`, `packages/opencode/test/telemetry/telemetry.test.ts` | Updated test payloads to include `os`, `arch`, `node_version` in session creation and event tracking. Reorganized error classification tests: moved patterns from the "connection" group into dedicated `not_configured`, `file_not_found`, `edit_mismatch`, `resource_exhausted` test cases. Removed "moat" terminology from test descriptions. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • #445 — Extends Telemetry.Event schema with new event variants and fields in the telemetry subsystem.
  • #336 — Modifies packages/opencode/src/session/processor.ts to adjust token accounting logic and per-generation telemetry.
  • #245 — Updates core_failure error classification patterns and error_class union in telemetry schema.

Suggested labels

contributor

Suggested reviewers

  • mdesmet
  • kulvirgit

Poem

🐰 Telemetry now tracks where we hop and bound,
With OS, arch, and Node all neatly bound,
Doom loops detected before they spiral deep,
And token counts tallied—cached tokens don't sleep! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | Title clearly identifies the main change (telemetry improvements) and references the analysis source (AppInsights), directly aligning with the primary purpose of the PR. |
| Linked Issues check | ✅ Passed | All coding objectives from #586 are met: error classification extended with 4 new classes, session metadata added (`os`/`arch`/`node_version`), doom loop protection implemented with counter/threshold, `tokens_input_total` field added, and moat terminology removed. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the linked issue objectives: telemetry schema/classification updates, session metadata collection, doom loop detection, token tracking improvements, and test file renaming, all within scope of fixing identified telemetry gaps. |
| Description check | ✅ Passed | The PR description comprehensively covers the required template sections with detailed explanations of changes, verification methods, and complete checklist. |

@anandgupta42 anandgupta42 merged commit 75b077f into main Mar 30, 2026
16 checks passed
kulvirgit pushed a commit that referenced this pull request Mar 30, 2026

Development

Successfully merging this pull request may close these issues.

Telemetry gaps: 85% errors unclassified, missing session metadata, no doom loop protection
