Skip to content

feat(telemetry): restore delta metrics, add delta_function_count & active-developer signal#1535

Closed
anthonyiscoding wants to merge 7 commits into
mainfrom
agent/telemetry-engineer/2f2d1c02
Closed

feat(telemetry): restore delta metrics, add delta_function_count & active-developer signal#1535
anthonyiscoding wants to merge 7 commits into
mainfrom
agent/telemetry-engineer/2f2d1c02

Conversation

@anthonyiscoding
Copy link
Copy Markdown
Contributor

@anthonyiscoding anthonyiscoding commented Apr 23, 2026

Summary

Fixes items 4 & 5 from III-3. Re-enables the delta metrics that were commented out in #1390 and extends them so Amplitude can cleanly separate active development from passive runtime (the Rohit-agentmemory problem).

Tracking-plan diff

Event type: heartbeat and engine_stopped (anonymous Amplitude stream).

Added

Property Type Description
delta_function_count int (signed) Change in registered user-function count since last heartbeat. Negative if functions were removed.
delta_trigger_count int (signed) Change in registered trigger count.
delta_worker_count int (signed) Change in connected worker count.
delta_function_registrations uint Cumulative-counter delta — catches hot-reload / edit cycles where the function id stays the same.
delta_trigger_registrations uint Same, for trigger registrations.
functions_added string[] Function ids that appeared since last heartbeat.
functions_removed string[] Function ids that disappeared.
workers_added string[] Worker descriptors (runtime:framework) that appeared.
workers_removed string[] Worker descriptors that disappeared.
is_active bool delta_invocations_total > 0 — runtime activity this period.
is_active_developer bool Any of: functions added/removed, re-registration, or signed-count change ≠ 0.

Restored (were commented out in #1390)

delta_invocations_total, delta_invocations_success, delta_invocations_error, delta_api_requests, delta_queue_emits, delta_queue_consumes, delta_pubsub_publishes, delta_pubsub_subscribes, delta_cron_executions.

Deprecated / Changed

None. All additions.

Cohort definition (Amplitude)

Active Developer cohort: users who sent a heartbeat where is_active_developer = true AND session_start = false in the last 28 days.

  • is_active_developer is always false on the session-start heartbeat by design (first snapshot has no prior state to diff against), so this filter keeps the cohort clean and excludes the inevitable boot-time registration noise.
  • Active Runner cohort (complement): is_active = true AND NOT in the Active Developer cohort. Captures the agentmemory-style population — people running iii as an embedded runtime without touching the code.

Design notes

  • DeltaAccumulator is now shared between the session-start spawn and the interval loop spawn via Arc<Mutex<_>>, so the loop's first delta is relative to the session-start snapshot rather than re-reading zero. The original commented-out code created two separate accumulators, which would have triple-counted activity across the session-start → first-loop-tick window.
  • State-change deltas (delta_function_count, functions_added, etc.) are deliberately zeroed on the first snapshot so session start is not misclassified as active development.
  • is_active_developer intentionally excludes pure worker churn (SDK reconnects with the same function set) — those don't indicate code edits.

Test plan

  • cargo test -p iii --lib workers::telemetry:: — 151 tests pass (7 new tests cover add/remove/stable/reregistration/worker-churn/first-snapshot/schema).
  • cargo fmt -p iii --check clean.
  • cargo clippy -p iii --lib — no new warnings in workers/telemetry.
  • Verify in Amplitude after rollout: heartbeat events include new properties; Active Developer cohort populates with expected internal users; Active Runner cohort stays stable across idle periods.

Related

Findings for items 1–3 (project_name reporting, container/user conflation, per-user worker listing) are in the comment on III-3. They surface follow-up fixes (container device-id instability, hardcoded Amplitude key, absence of a project_name directory fallback) that warrant separate PRs so reviewers can consider each independently.

Summary by CodeRabbit

  • Tests

    • Added comprehensive test coverage validating metrics accuracy, state change detection, activity pattern recognition, and system behavior verification.
  • Refactor

    • Enhanced telemetry infrastructure with improved delta metrics computation, refined state tracking capabilities, and more accurate heartbeat cycle data collection.

…active-developer signal

Re-enables the delta counters that were commented out in #1390 and extends
them with state-change deltas so Amplitude can distinguish active development
from passive runtime.

Changes to the heartbeat / engine_stopped event properties:

Added:
- delta_function_count (i64): change in registered user-function count since
  last heartbeat (can be negative)
- delta_trigger_count (i64): change in registered trigger count
- delta_worker_count (i64): change in connected worker count
- delta_function_registrations (u64): increments since last heartbeat on the
  cumulative function registration counter (catches hot-reload / edit cycles
  where the function id stays the same)
- delta_trigger_registrations (u64)
- functions_added (string[]): function ids that appeared since last heartbeat
- functions_removed (string[]): function ids that disappeared
- workers_added / workers_removed (string[])
- is_active (bool): delta_invocations_total > 0 — runtime activity
- is_active_developer (bool): functions_added/removed non-empty OR
  delta_function_registrations > 0 OR delta_function_count/delta_trigger_count
  != 0 — development activity

Restored (previously commented out since #1390):
- delta_invocations_total, delta_invocations_success, delta_invocations_error
- delta_api_requests, delta_queue_emits, delta_queue_consumes
- delta_pubsub_publishes, delta_pubsub_subscribes, delta_cron_executions

Cohort recipe for Amplitude: filter heartbeat events where
is_active_developer == true AND session_start == false to get a clean
active-developer cohort, which separates people iterating on iii projects
from people merely running a tool that happens to embed iii.

DeltaAccumulator is now shared between the session-start heartbeat task and
the interval loop via Arc<Mutex<_>>, so the loop's first delta is relative
to session_start rather than to zero. On the very first snapshot the
state-change deltas are zeroed to avoid flagging session start as active
development.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@vercel
Copy link
Copy Markdown
Contributor

vercel Bot commented Apr 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
iii-website Ready Ready Preview, Comment Apr 23, 2026 8:55pm

Request Review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 23, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 95734380-1672-4fed-8b84-25478e251abe

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The file replaces commented-out delta metrics code with a working DeltaAccumulator/DeltaSnapshot implementation that computes deltas per heartbeat by snapshotting cumulative metric counters. New derived boolean fields is_active and is_active_developer are added. The TelemetryWorker owns a shared accumulator and injects delta fields into heartbeat event payloads. Extensive tests validate the implementation.

Changes

Cohort / File(s) Summary
Delta Metrics Implementation
engine/src/workers/telemetry/mod.rs
Replaces commented-out code with functional DeltaAccumulator and DeltaSnapshot types. Implements delta computation per heartbeat using saturating subtraction, state diffing for function/worker ID sets, and explicit handling of first snapshot behavior to emit cumulative deltas while zeroing state changes.
Active Status Derivation
engine/src/workers/telemetry/mod.rs
Adds derived boolean fields is_active (based on invocations count) and is_active_developer (computed from function set changes, registration deltas, excluding worker churn) to track engagement metrics.
Telemetry Worker Integration
engine/src/workers/telemetry/mod.rs
Updates TelemetryWorker to own a shared, mutex-protected accumulator and injects computed delta fields into both boot and periodic heartbeat event property payloads.
Test Coverage
engine/src/workers/telemetry/mod.rs
Adds extensive test suite validating first-snapshot behavior, added/removed function detection, reregistration detection, invocation gating, worker churn behavior, and JSON key emission.

Sequence Diagram

sequenceDiagram
    participant HB as Heartbeat Event
    participant TW as TelemetryWorker
    participant DA as DeltaAccumulator
    participant EP as Event Payload
    
    HB->>TW: Trigger periodic heartbeat
    TW->>DA: Request snapshot of current state
    DA->>DA: Snapshot metric counters & engine state
    DA->>DA: Compute deltas using saturating_sub
    DA->>DA: Compute is_active & is_active_developer
    DA-->>TW: Return DeltaSnapshot
    TW->>EP: Inject delta fields into payload
    TW->>HB: Emit enriched heartbeat event
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • ytallo

Poem

🐰 Deltas dance through every beat,
Snapshots capture metrics sweet,
Accumulators hoard the state,
Active flags we calculate,
Telemetry logs all complete! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: restoring delta metrics and adding the delta_function_count and is_active_developer signal.
Description check ✅ Passed The description comprehensively covers the What, Why, and implementation details with extensive tracking-plan diffs, design notes, and test plan, though the required template sections are not explicitly labeled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch agent/telemetry-engineer/2f2d1c02

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
engine/src/workers/telemetry/mod.rs (1)

891-938: ⚠️ Potential issue | 🟡 Minor

Startup ordering: periodic task can beat the boot task to the first snapshot when heartbeat_interval_secs is small.

The shared Arc<Mutex<DeltaAccumulator>> means whichever task calls snapshot() first gets the "first snapshot" zeroing behavior. The boot task waits up to 120 s (or until first_user_invocation); the periodic task consumes an immediate first interval.tick().await at line 944 and then waits interval_secs before the next tick.

If interval_secs < ~120s and no user invocation arrives promptly, the periodic heartbeat (session_start=false) will take snapshot #1, and the boot heartbeat (session_start=true) will get snapshot #2 with non-zero state deltas relative to whatever the periodic task captured. In production this is unlikely (default heartbeat_interval_secs = 6 * 60 * 60), but it's reachable via TelemetryConfig.heartbeat_interval_secs and is exactly the configuration exercised by test_telemetry_module_background_tasks_and_destroy_run_without_network (interval = 1 s).

Combined with the function_registrations > 0 issue in the comment above, this means a short-interval configuration could emit a periodic heartbeat with session_start=false and is_active_developer=true purely from session start.

Options:

  • Prime the accumulator synchronously in TelemetryWorker::create (before either task spawns) so both tasks always see initialized=true.
  • Or have the periodic task await a signal from the boot task before taking its first snapshot.
  • Or, as suggested above, decouple is_active_developer from first-snapshot cumulative counters so ordering doesn't matter.

Also applies to: 940-1002

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@engine/src/workers/telemetry/mod.rs` around lines 891 - 938, The race happens
because both the periodic heartbeat task and the boot task call
DeltaAccumulator::snapshot (via delta_for_started.lock().snapshot(&snap)), so
whichever runs first gets the “first snapshot” zeroing; fix this by priming the
accumulator synchronously in TelemetryWorker::create before spawning either
background task: call collect_engine_snapshot once and invoke
delta_accumulator.lock().expect(...).snapshot(&initial_snap) (DeltaAccumulator)
so initialized=true holds for both the periodic (interval.tick) task and the
boot task; update references to collect_engine_snapshot,
TelemetryWorker::create, and DeltaAccumulator (delta_for_...) to reflect the new
pre-spawn initialization.
🧹 Nitpick comments (1)
engine/src/workers/telemetry/mod.rs (1)

2598-2806: Test coverage is thorough — consider one more case for startup ordering.

The seven new tests cover first-snapshot zeroing, added/removed diffing, stable sets, reregistration via track_function_registered(), is_active invocation gating, worker churn isolation from is_active_developer, and key emission. Nice.

One gap worth closing: none of the tests actually exercise the production path where function_registrations is non-zero before the first snapshot. Adding a test that pre-increments collector().function_registrations (or calls track_function_registered()) before the first acc.snapshot(&snap) would lock down the intended semantics for is_active_developer on the first snapshot (see the comment on lines 470–480 above). It would also make the test reflect real engine boot behavior rather than the artificial globals-at-zero scenario set up by reset_telemetry_globals().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@engine/src/workers/telemetry/mod.rs` around lines 2598 - 2806, Add a new
serial test that verifies function_registrations set before the first snapshot
still prevents the first-snapshot-from-looking-like-active-dev: call
reset_telemetry_globals(), then call collector::track_function_registered() (or
increment the global function_registrations) before creating
DeltaAccumulator::new() and invoking acc.snapshot(&snap) (use
make_engine_snapshot like other tests); then assert that the first returned
delta has function_registrations == 1 and that is_active_developer() is true.
Ensure the test name reflects startup_registration_before_first_snapshot (or
similar) and follows the same setup/teardown pattern as the other tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@engine/src/workers/telemetry/mod.rs`:
- Around line 470-480: is_active_developer() currently returns true if
function_registrations>0 which can make the very first snapshot report developer
activity; fix by making first snapshot ignore cumulative registration
counters—add a first_snapshot boolean (set from !self.initialized in the
DeltaAccumulator snapshot creation) to DeltaSnapshot and have
is_active_developer() return false when first_snapshot is true (or alternatively
zero function_registrations/trigger_registrations for that first snapshot);
update references to function_registrations and trigger_registrations in
is_active_developer and ensure DeltaAccumulator sets first_snapshot before
mutating initialized.
- Around line 482-560: The new telemetry properties added in insert_into
(delta_invocations_total, delta_invocations_success, delta_invocations_error,
delta_api_requests, delta_queue_emits, delta_queue_consumes,
delta_pubsub_publishes, delta_pubsub_subscribes, delta_cron_executions,
delta_function_registrations, delta_trigger_registrations, delta_function_count,
delta_trigger_count, delta_worker_count, functions_added, functions_removed,
workers_added, workers_removed, is_active, is_active_developer) must be
documented and consumers updated: add a telemetry event schema entry describing
each property's type, meaning and example usage in the project telemetry docs
(include the Active Developer cohort recipe: filter is_active_developer == true
AND session_start == false), update any sdk consumer code to accept the new
properties and confirm they expect workers_added/workers_removed format
"{runtime}:{framework}" or "{runtime}", and validate Amplitude ingestion rules
accept the specified integer types (i64 for deltas vs u64 for cumulative) before
rolling out.

---

Outside diff comments:
In `@engine/src/workers/telemetry/mod.rs`:
- Around line 891-938: The race happens because both the periodic heartbeat task
and the boot task call DeltaAccumulator::snapshot (via
delta_for_started.lock().snapshot(&snap)), so whichever runs first gets the
“first snapshot” zeroing; fix this by priming the accumulator synchronously in
TelemetryWorker::create before spawning either background task: call
collect_engine_snapshot once and invoke
delta_accumulator.lock().expect(...).snapshot(&initial_snap) (DeltaAccumulator)
so initialized=true holds for both the periodic (interval.tick) task and the
boot task; update references to collect_engine_snapshot,
TelemetryWorker::create, and DeltaAccumulator (delta_for_...) to reflect the new
pre-spawn initialization.

---

Nitpick comments:
In `@engine/src/workers/telemetry/mod.rs`:
- Around line 2598-2806: Add a new serial test that verifies
function_registrations set before the first snapshot still prevents the
first-snapshot-from-looking-like-active-dev: call reset_telemetry_globals(),
then call collector::track_function_registered() (or increment the global
function_registrations) before creating DeltaAccumulator::new() and invoking
acc.snapshot(&snap) (use make_engine_snapshot like other tests); then assert
that the first returned delta has function_registrations == 1 and that
is_active_developer() is true. Ensure the test name reflects
startup_registration_before_first_snapshot (or similar) and follows the same
setup/teardown pattern as the other tests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5713e935-d33d-473c-8b82-ccc3342a5b91

📥 Commits

Reviewing files that changed from the base of the PR and between cd7eb05 and 1c092d4.

📒 Files selected for processing (1)
  • engine/src/workers/telemetry/mod.rs

Comment thread engine/src/workers/telemetry/mod.rs
Comment thread engine/src/workers/telemetry/mod.rs
@anthonyiscoding anthonyiscoding marked this pull request as draft April 23, 2026 19:15
…out host identity report device_id=unknown

- project_name falls back to the project-root basename (or cwd basename)
  when neither .iii/project.ini nor SDK telemetry provides one, so every
  event has a human-readable project label.
- Containers that cannot recover a host identity (no mounted project.ini
  device_id and no III_HOST_USER_ID) now emit device_id="unknown" instead
  of a synthetic docker-hash-of-hostname id. This collapses unidentifiable
  container sessions into a single Amplitude user rather than minting a
  new phantom user per container restart.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
…container-<hash> device_ids; gate is_active_developer on first_snapshot

- docker-compose mounts $HOME/.iii (and docker-compose.prod.yml mounts a
  named volume) at /home/nonroot/.iii so ~/.iii/telemetry.yaml persists
  across container restarts — containers no longer mint a new device_id
  on every boot.
- generate_container_device_id now returns container-<sha256-hash>
  rather than either the previous "unknown" sentinel or the bare
  docker-<hash> form. The hash base prefers .iii/project.ini device_id +
  hostname, then $III_HOST_USER_ID + hostname, then a UUID (minted once
  and persisted via the mount). The container- prefix lets Amplitude
  separate container-origin ids from host-origin ones.
- DeltaSnapshot gains a first_snapshot flag set from !initialized in
  DeltaAccumulator::snapshot. is_active_developer short-circuits to
  false on the first snapshot even when cumulative function_registrations
  is non-zero, so session_start heartbeats never masquerade as
  code-edit cycles (closes CodeRabbit feedback on PR #1535). New test
  covers the boot-time-registrations scenario explicitly.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
…op broken container bind mount; pass host device_id via env

- Rename III_HOST_USER_ID → III_HOST_DEVICE_ID and host_user_id →
  host_device_id across the engine telemetry worker, scaffolder-core
  install telemetry, EnvironmentInfo struct, and all tests. The old
  name suggested a human identity; the value is always a device id.
  Amplitude now receives it as `host_device_id`, not `host_user_id`.

- Replace the broken ${HOME}/.iii bind mount (distroless nonroot uid
  65532 cannot write to host-uid-owned directories) with env-var
  propagation. docker-compose.yml and docker-compose.prod.yml now
  forward III_HOST_DEVICE_ID from the host shell; no mount required.

- generate_container_device_id now hashes only the host device_id
  (no hostname salting). Every container launched from the same host
  collapses to one container-<hash> device in Amplitude — matching the
  "one Amplitude device per host" model the CEO described. Without a
  host device_id the id falls back to a per-boot UUID.

- New EnvironmentInfo::collect, build_user_properties, and container
  device_id logic all go through find_host_device_id(), which reads
  III_HOST_DEVICE_ID then falls back to .iii/project.ini device_id.

- New tests cover env→ini fallback precedence, stability of container
  id under a fixed host_device_id, and the renamed Amplitude property.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
…n-container" / "unknown-host" buckets

Previously, a container without a resolvable host device_id minted a
fresh UUID every boot (hashed into a unique `container-<hash>`), and a
host without a machineid-rs result minted `fallback-<uuid>`. Both
churned the Amplitude device count on every restart when telemetry.yaml
wasn't persisted.

Collapse both into fixed sentinel strings so restarts without a
persisted identity share a single device bucket:
- container + no host id → "unknown-container"
- host + no machine id  → "unknown-host"

Extracts two pure helpers (`container_device_id_for`, `host_device_id_for`)
so the fallback rules are unit-testable without mocking system calls.
Prints the result of get_or_create_device_id() to stdout so the
docker-compose host->container device_id plumbing becomes a single
line on the host:

    export III_HOST_DEVICE_ID=$(iii telemetry device-id)
    docker compose up

Both "telemetry" and "device-id" are marked #[command(hide = true)]
so they don't clutter the public help output. The CLI is intended
purely as a scripting hook; normal users never see it.

Run on a host the command returns the persisted salted SHA256
machine id (same value the engine itself reports). Run inside a
container it returns the container-scoped id, which is documented
as the expected failure mode -- the env-var plumbing should run
on the host.
…ontainer device_id

Previously container device_ids were salted-sha256(host_id), which
collapsed every container on the same host into a single Amplitude
device. That masked per-container behavior.

Switch to a verbatim format:

    container-<host_id>-<hostname>

where host_id is the existing persisted host device_id (itself
already a salted SHA256 of machineid-rs output, so no re-hash is
needed) and hostname comes from gethostname() -- inside Docker
this is the container short id by default or whatever the user
sets via compose "hostname:".

Fallbacks:
- missing host_id  -> unknown-container (collapses misconfigured
  deployments into one bucket, unchanged from previous behavior)
- missing hostname -> container-<host_id>-unknown-hostname (still
  well-formed; same host will still collapse if the hostname
  lookup fails consistently)

Drops the now-unused salted_sha256 helper and sha2 import.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant