feat(telemetry): restore delta metrics, add delta_function_count & active-developer signal#1535
feat(telemetry): restore delta metrics, add delta_function_count & active-developer signal#1535anthonyiscoding wants to merge 7 commits into
Conversation
…active-developer signal Re-enables the delta counters that were commented out in #1390 and extends them with state-change deltas so Amplitude can distinguish active development from passive runtime. Changes to the heartbeat / engine_stopped event properties: Added: - delta_function_count (i64): change in registered user-function count since last heartbeat (can be negative) - delta_trigger_count (i64): change in registered trigger count - delta_worker_count (i64): change in connected worker count - delta_function_registrations (u64): increments since last heartbeat on the cumulative function registration counter (catches hot-reload / edit cycles where the function id stays the same) - delta_trigger_registrations (u64) - functions_added (string[]): function ids that appeared since last heartbeat - functions_removed (string[]): function ids that disappeared - workers_added / workers_removed (string[]) - is_active (bool): delta_invocations_total > 0 — runtime activity - is_active_developer (bool): functions_added/removed non-empty OR delta_function_registrations > 0 OR delta_function_count/delta_trigger_count != 0 — development activity Restored (previously commented out since #1390): - delta_invocations_total, delta_invocations_success, delta_invocations_error - delta_api_requests, delta_queue_emits, delta_queue_consumes - delta_pubsub_publishes, delta_pubsub_subscribes, delta_cron_executions Cohort recipe for Amplitude: filter heartbeat events where is_active_developer == true AND session_start == false to get a clean active-developer cohort, which separates people iterating on iii projects from people merely running a tool that happens to embed iii. DeltaAccumulator is now shared between the session-start heartbeat task and the interval loop via Arc<Mutex<_>>, so the loop's first delta is relative to session_start rather than to zero. On the very first snapshot the state-change deltas are zeroed to avoid flagging session start as active development. Co-Authored-By: Claude Opus 4.7 <[email protected]>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe file replaces commented-out delta metrics code with a working Changes
Sequence DiagramsequenceDiagram
participant HB as Heartbeat Event
participant TW as TelemetryWorker
participant DA as DeltaAccumulator
participant EP as Event Payload
HB->>TW: Trigger periodic heartbeat
TW->>DA: Request snapshot of current state
DA->>DA: Snapshot metric counters & engine state
DA->>DA: Compute deltas using saturating_sub
DA->>DA: Compute is_active & is_active_developer
DA-->>TW: Return DeltaSnapshot
TW->>EP: Inject delta fields into payload
TW->>HB: Emit enriched heartbeat event
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
engine/src/workers/telemetry/mod.rs (1)
891-938:⚠️ Potential issue | 🟡 MinorStartup ordering: periodic task can beat the boot task to the first snapshot when
heartbeat_interval_secsis small.The shared
Arc<Mutex<DeltaAccumulator>>means whichever task callssnapshot()first gets the "first snapshot" zeroing behavior. The boot task waits up to 120 s (or untilfirst_user_invocation); the periodic task consumes an immediate firstinterval.tick().awaitat line 944 and then waitsinterval_secsbefore the next tick.If
interval_secs < ~120sand no user invocation arrives promptly, the periodic heartbeat (session_start=false) will take snapshot#1, and the boot heartbeat (session_start=true) will get snapshot#2with non-zero state deltas relative to whatever the periodic task captured. In production this is unlikely (defaultheartbeat_interval_secs = 6 * 60 * 60), but it's reachable viaTelemetryConfig.heartbeat_interval_secsand is exactly the configuration exercised bytest_telemetry_module_background_tasks_and_destroy_run_without_network(interval = 1 s).Combined with the
function_registrations > 0issue in the comment above, this means a short-interval configuration could emit a periodic heartbeat withsession_start=falseandis_active_developer=truepurely from session start.Options:
- Prime the accumulator synchronously in
TelemetryWorker::create(before either task spawns) so both tasks always seeinitialized=true.- Or have the periodic task
awaita signal from the boot task before taking its first snapshot.- Or, as suggested above, decouple
is_active_developerfrom first-snapshot cumulative counters so ordering doesn't matter.Also applies to: 940-1002
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@engine/src/workers/telemetry/mod.rs` around lines 891 - 938, The race happens because both the periodic heartbeat task and the boot task call DeltaAccumulator::snapshot (via delta_for_started.lock().snapshot(&snap)), so whichever runs first gets the “first snapshot” zeroing; fix this by priming the accumulator synchronously in TelemetryWorker::create before spawning either background task: call collect_engine_snapshot once and invoke delta_accumulator.lock().expect(...).snapshot(&initial_snap) (DeltaAccumulator) so initialized=true holds for both the periodic (interval.tick) task and the boot task; update references to collect_engine_snapshot, TelemetryWorker::create, and DeltaAccumulator (delta_for_...) to reflect the new pre-spawn initialization.
🧹 Nitpick comments (1)
engine/src/workers/telemetry/mod.rs (1)
2598-2806: Test coverage is thorough — consider one more case for startup ordering.The seven new tests cover first-snapshot zeroing, added/removed diffing, stable sets, reregistration via
track_function_registered(),is_activeinvocation gating, worker churn isolation fromis_active_developer, and key emission. Nice.One gap worth closing: none of the tests actually exercise the production path where
function_registrationsis non-zero before the first snapshot. Adding a test that pre-incrementscollector().function_registrations(or callstrack_function_registered()) before the firstacc.snapshot(&snap)would lock down the intended semantics foris_active_developeron the first snapshot (see the comment on lines 470–480 above). It would also make the test reflect real engine boot behavior rather than the artificial globals-at-zero scenario set up byreset_telemetry_globals().🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@engine/src/workers/telemetry/mod.rs` around lines 2598 - 2806, Add a new serial test that verifies function_registrations set before the first snapshot still prevents the first-snapshot-from-looking-like-active-dev: call reset_telemetry_globals(), then call collector::track_function_registered() (or increment the global function_registrations) before creating DeltaAccumulator::new() and invoking acc.snapshot(&snap) (use make_engine_snapshot like other tests); then assert that the first returned delta has function_registrations == 1 and that is_active_developer() is true. Ensure the test name reflects startup_registration_before_first_snapshot (or similar) and follows the same setup/teardown pattern as the other tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@engine/src/workers/telemetry/mod.rs`:
- Around line 470-480: is_active_developer() currently returns true if
function_registrations>0 which can make the very first snapshot report developer
activity; fix by making first snapshot ignore cumulative registration
counters—add a first_snapshot boolean (set from !self.initialized in the
DeltaAccumulator snapshot creation) to DeltaSnapshot and have
is_active_developer() return false when first_snapshot is true (or alternatively
zero function_registrations/trigger_registrations for that first snapshot);
update references to function_registrations and trigger_registrations in
is_active_developer and ensure DeltaAccumulator sets first_snapshot before
mutating initialized.
- Around line 482-560: The new telemetry properties added in insert_into
(delta_invocations_total, delta_invocations_success, delta_invocations_error,
delta_api_requests, delta_queue_emits, delta_queue_consumes,
delta_pubsub_publishes, delta_pubsub_subscribes, delta_cron_executions,
delta_function_registrations, delta_trigger_registrations, delta_function_count,
delta_trigger_count, delta_worker_count, functions_added, functions_removed,
workers_added, workers_removed, is_active, is_active_developer) must be
documented and consumers updated: add a telemetry event schema entry describing
each property's type, meaning and example usage in the project telemetry docs
(include the Active Developer cohort recipe: filter is_active_developer == true
AND session_start == false), update any sdk consumer code to accept the new
properties and confirm they expect workers_added/workers_removed format
"{runtime}:{framework}" or "{runtime}", and validate Amplitude ingestion rules
accept the specified integer types (i64 for deltas vs u64 for cumulative) before
rolling out.
---
Outside diff comments:
In `@engine/src/workers/telemetry/mod.rs`:
- Around line 891-938: The race happens because both the periodic heartbeat task
and the boot task call DeltaAccumulator::snapshot (via
delta_for_started.lock().snapshot(&snap)), so whichever runs first gets the
“first snapshot” zeroing; fix this by priming the accumulator synchronously in
TelemetryWorker::create before spawning either background task: call
collect_engine_snapshot once and invoke
delta_accumulator.lock().expect(...).snapshot(&initial_snap) (DeltaAccumulator)
so initialized=true holds for both the periodic (interval.tick) task and the
boot task; update references to collect_engine_snapshot,
TelemetryWorker::create, and DeltaAccumulator (delta_for_...) to reflect the new
pre-spawn initialization.
---
Nitpick comments:
In `@engine/src/workers/telemetry/mod.rs`:
- Around line 2598-2806: Add a new serial test that verifies
function_registrations set before the first snapshot still prevents the
first-snapshot-from-looking-like-active-dev: call reset_telemetry_globals(),
then call collector::track_function_registered() (or increment the global
function_registrations) before creating DeltaAccumulator::new() and invoking
acc.snapshot(&snap) (use make_engine_snapshot like other tests); then assert
that the first returned delta has function_registrations == 1 and that
is_active_developer() is true. Ensure the test name reflects
startup_registration_before_first_snapshot (or similar) and follows the same
setup/teardown pattern as the other tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5713e935-d33d-473c-8b82-ccc3342a5b91
📒 Files selected for processing (1)
engine/src/workers/telemetry/mod.rs
…out host identity report device_id=unknown - project_name falls back to the project-root basename (or cwd basename) when neither .iii/project.ini nor SDK telemetry provides one, so every event has a human-readable project label. - Containers that cannot recover a host identity (no mounted project.ini device_id and no III_HOST_USER_ID) now emit device_id="unknown" instead of a synthetic docker-hash-of-hostname id. This collapses unidentifiable container sessions into a single Amplitude user rather than minting a new phantom user per container restart. Co-Authored-By: Claude Opus 4.7 <[email protected]>
…container-<hash> device_ids; gate is_active_developer on first_snapshot - docker-compose mounts $HOME/.iii (and docker-compose.prod.yml mounts a named volume) at /home/nonroot/.iii so ~/.iii/telemetry.yaml persists across container restarts — containers no longer mint a new device_id on every boot. - generate_container_device_id now returns container-<sha256-hash> rather than either the previous "unknown" sentinel or the bare docker-<hash> form. The hash base prefers .iii/project.ini device_id + hostname, then $III_HOST_USER_ID + hostname, then a UUID (minted once and persisted via the mount). The container- prefix lets Amplitude separate container-origin ids from host-origin ones. - DeltaSnapshot gains a first_snapshot flag set from !initialized in DeltaAccumulator::snapshot. is_active_developer short-circuits to false on the first snapshot even when cumulative function_registrations is non-zero, so session_start heartbeats never masquerade as code-edit cycles (closes CodeRabbit feedback on PR #1535). New test covers the boot-time-registrations scenario explicitly. Co-Authored-By: Claude Opus 4.7 <[email protected]>
…op broken container bind mount; pass host device_id via env
- Rename III_HOST_USER_ID → III_HOST_DEVICE_ID and host_user_id →
host_device_id across the engine telemetry worker, scaffolder-core
install telemetry, EnvironmentInfo struct, and all tests. The old
name suggested a human identity; the value is always a device id.
Amplitude now receives it as `host_device_id`, not `host_user_id`.
- Replace the broken ${HOME}/.iii bind mount (distroless nonroot uid
65532 cannot write to host-uid-owned directories) with env-var
propagation. docker-compose.yml and docker-compose.prod.yml now
forward III_HOST_DEVICE_ID from the host shell; no mount required.
- generate_container_device_id now hashes only the host device_id
(no hostname salting). Every container launched from the same host
collapses to one container-<hash> device in Amplitude — matching the
"one Amplitude device per host" model the CEO described. Without a
host device_id the id falls back to a per-boot UUID.
- New EnvironmentInfo::collect, build_user_properties, and container
device_id logic all go through find_host_device_id(), which reads
III_HOST_DEVICE_ID then falls back to .iii/project.ini device_id.
- New tests cover env→ini fallback precedence, stability of container
id under a fixed host_device_id, and the renamed Amplitude property.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
…n-container" / "unknown-host" buckets Previously, a container without a resolvable host device_id minted a fresh UUID every boot (hashed into a unique `container-<hash>`), and a host without a machineid-rs result minted `fallback-<uuid>`. Both churned the Amplitude device count on every restart when telemetry.yaml wasn't persisted. Collapse both into fixed sentinel strings so restarts without a persisted identity share a single device bucket: - container + no host id → "unknown-container" - host + no machine id → "unknown-host" Extracts two pure helpers (`container_device_id_for`, `host_device_id_for`) so the fallback rules are unit-testable without mocking system calls.
Prints the result of get_or_create_device_id() to stdout so the
docker-compose host->container device_id plumbing becomes a single
line on the host:
export III_HOST_DEVICE_ID=$(iii telemetry device-id)
docker compose up
Both "telemetry" and "device-id" are marked #[command(hide = true)]
so they don't clutter the public help output. The CLI is intended
purely as a scripting hook; normal users never see it.
Run on a host the command returns the persisted salted SHA256
machine id (same value the engine itself reports). Run inside a
container it returns the container-scoped id, which is documented
as the expected failure mode -- the env-var plumbing should run
on the host.
…ontainer device_id
Previously container device_ids were salted-sha256(host_id), which
collapsed every container on the same host into a single Amplitude
device. That masked per-container behavior.
Switch to a verbatim format:
container-<host_id>-<hostname>
where host_id is the existing persisted host device_id (itself
already a salted SHA256 of machineid-rs output, so no re-hash is
needed) and hostname comes from gethostname() -- inside Docker
this is the container short id by default or whatever the user
sets via compose "hostname:".
Fallbacks:
- missing host_id -> unknown-container (collapses misconfigured
deployments into one bucket, unchanged from previous behavior)
- missing hostname -> container-<host_id>-unknown-hostname (still
well-formed; same host will still collapse if the hostname
lookup fails consistently)
Drops the now-unused salted_sha256 helper and sha2 import.
Summary
Fixes items 4 & 5 from III-3. Re-enables the delta metrics that were commented out in #1390 and extends them so Amplitude can cleanly separate active development from passive runtime (the Rohit-agentmemory problem).
Tracking-plan diff
Event type:
heartbeatandengine_stopped(anonymous Amplitude stream).Added
delta_function_countdelta_trigger_countdelta_worker_countdelta_function_registrationsdelta_trigger_registrationsfunctions_addedfunctions_removedworkers_addedruntime:framework) that appeared.workers_removedis_activedelta_invocations_total > 0— runtime activity this period.is_active_developerRestored (were commented out in #1390)
delta_invocations_total,delta_invocations_success,delta_invocations_error,delta_api_requests,delta_queue_emits,delta_queue_consumes,delta_pubsub_publishes,delta_pubsub_subscribes,delta_cron_executions.Deprecated / Changed
None. All additions.
Cohort definition (Amplitude)
Active Developer cohort: users who sent a
heartbeatwhereis_active_developer = trueANDsession_start = falsein the last 28 days.is_active_developeris alwaysfalseon the session-start heartbeat by design (first snapshot has no prior state to diff against), so this filter keeps the cohort clean and excludes the inevitable boot-time registration noise.is_active = trueAND NOT in the Active Developer cohort. Captures the agentmemory-style population — people running iii as an embedded runtime without touching the code.Design notes
DeltaAccumulatoris now shared between the session-start spawn and the interval loop spawn viaArc<Mutex<_>>, so the loop's first delta is relative to the session-start snapshot rather than re-reading zero. The original commented-out code created two separate accumulators, which would have triple-counted activity across the session-start → first-loop-tick window.delta_function_count,functions_added, etc.) are deliberately zeroed on the first snapshot so session start is not misclassified as active development.is_active_developerintentionally excludes pure worker churn (SDK reconnects with the same function set) — those don't indicate code edits.Test plan
cargo test -p iii --lib workers::telemetry::— 151 tests pass (7 new tests cover add/remove/stable/reregistration/worker-churn/first-snapshot/schema).cargo fmt -p iii --checkclean.cargo clippy -p iii --lib— no new warnings inworkers/telemetry.Related
Findings for items 1–3 (project_name reporting, container/user conflation, per-user worker listing) are in the comment on III-3. They surface follow-up fixes (container device-id instability, hardcoded Amplitude key, absence of a project_name directory fallback) that warrant separate PRs so reviewers can consider each independently.
Summary by CodeRabbit
Tests
Refactor