diff --git a/CHANGELOG.md b/CHANGELOG.md index 2ad0148..028ef6c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,44 @@ # Changelog +## v2026.4.29 — 2026-04-29 + +### Fixed + +- **Cron table empty after OpenClaw v2026.4.20+ ([#25](https://github.com/mudrii/openclaw-dashboard/issues/25))** — OpenClaw v2026.4.20 split runtime cron state into a separate `~/.openclaw/cron/jobs-state.json` sidecar; the dashboard previously read only `jobs.json`, so the cron table rendered with blank `Last run` / `Next run` / `Last status` / `Last duration` columns. The refresh collector now merges the sidecar by `job.id`, with sidecar values winning wholesale and inline state preserved as the legacy fallback when the sidecar is absent. Dashboards now work against both pre- and post-v2026.4.20 OpenClaw installs. +- **`/api/system` cold-start latency and Gateway Runtime stuck on "Loading…" ([#26](https://github.com/mudrii/openclaw-dashboard/issues/26))** — cold collections (no warm cache) could run 10–12s when the gateway was slow because version probes ran serially before the parallel host-metrics group; the frontend Gateway Runtime card had no fetch timeout and never repainted on `r.ok===false` or thrown errors, so it stayed on `Loading…` indefinitely. + - **Backend**: introduced `system.coldPathTimeoutMs` (default 4000, validated [200, 15000]); `SystemService.refresh` now wraps the entire collection in `context.WithTimeout(ColdPathTimeoutMs)` and runs versions in parallel with runtime/host-metrics goroutines. Partial cold-path results return `degraded:true` rather than blocking on the slowest probe; the version cache is only updated on full success so a deadline-cancelled collection can never poison the cached version pair. + - **Frontend**: `Sys.fetch()` now uses `AbortController` with a 6000ms ceiling (4000ms cold-path budget + jitter); on `r.ok===false` or thrown exception the new `renderGatewayDegraded(reason)` helper repaints the card with `State=Unavailable` and an explicit reason instead of leaving the placeholder text in place. + - **Skills empty state**: `web/index.html` now falls back to a `No skills configured` empty-state element when `data.skills` is `null` or `[]`, matching the existing Git Log fallback pattern. +- **`system.gatewayPort` default masked `ai.gatewayPort` inheritance** — `appconfig.Default()` pre-filled `SystemConfig.GatewayPort` with `18789`, which defeated the `Load()` fallback that was supposed to inherit from `ai.gatewayPort` when `system.gatewayPort` was omitted. The default is now zero so the inheritance path activates as documented; user-supplied values (either side) still win. +- **systemd unit missing `Environment=`** — Linux `service install` generated a unit file with no `OPENCLAW_HOME` or `PATH`, so the daemonized binary could not locate the openclaw CLI or OpenClaw runtime on fresh machines. The unit template now emits both `Environment=` directives, computed from the install-time `OPENCLAW_HOME` env override (falling back to `~/.openclaw`) and a deduplicated `PATH` with system bins (`/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin`) appended. +- **systemd `service install` did not pick up changed flags on reinstall** — the install path called `systemctl --user start`, which is a no-op when the unit is already running. Switched to `systemctl --user restart` so reinstalls with changed `--bind` / `--port` / `Environment` actually apply; `restart` also starts a stopped unit so first-installs still work. +- **Latest-version fetcher races with test cleanup** — the `getLatestVersionCached` background goroutine read a package-level `fetchLatestVersion` var that tests overrode during cleanup, occasionally producing data races under `-race`. Replaced with a per-instance `SystemService.fetchLatest` field set in the constructor; tests now isolate fully without touching shared state. +- **Version banner double-`v` prefix** — when `BuildVersion` was injected via `-ldflags` (the `make build` path) and the `VERSION` file already started with `v`, the startup banner printed `vv2026.4.x`. Both `Main()` assignment sites now normalize via `strings.TrimPrefix(version, "v")` so the banner and `--version` flag agree on the rendered value. + +### Added + +- **`system.coldPathTimeoutMs` configuration** — overall budget for a cold `/api/system` collection; defaults to 4000ms, validated [200, 15000]ms. Documented in `README.md` and `docs/CONFIGURATION.md`. +- **Frontend `renderGatewayDegraded(reason)` helper** — paints the Gateway Runtime card with an explicit `State=Unavailable` plus reason on fetch timeout, network error, or `r.ok===false`, so the card always reaches a terminal state. +- **Skills empty-state fallback** in `web/index.html`, matching the Git Log pattern. +- **Cron sidecar regression coverage** — new `internal/apprefresh/cron_state_test.go` plus split-store and legacy-inline fixtures exercise sidecar-only, legacy-only, sidecar-missing-job-id, malformed-sidecar, both-present, and `lastRunStatus` fallback paths. +- **Cold-path regression coverage** — new `internal/appsystem/cold_path_test.go` asserts the deadline is honoured, `degraded:true` is set on partial collection, host metrics still ship when gateway probes hang, and a deadline-cancelled collection cannot poison the version cache. + +### Changed + +- **`/api/system` cold path is now fully parallel** — versions, gateway runtime, and host metrics goroutines all run inside the same bounded `context.WithTimeout`. Previously versions ran serially before the parallel block. +- **CORS loopback-reflection invariants are now documented in code** — `internal/appserver/server_routes.go` carries an inline doc above `setCORSHeaders` enumerating why arbitrary `localhost:*` / `127.0.0.1:*` / `[::1]:*` origins are reflected (loopback bind by default, no `Allow-Credentials`, server-side gateway token, rate-limited `/api/chat`). +- **Planning doc preserved** — the issue #25/#26 fix plan moved to `docs/plans/2026-04-29-issue-25-26-fix-plan.md` alongside other historical planning docs. + +### Security + +- **GitHub external link hardened** — added `rel="noopener noreferrer"` to the `target="_blank"` link in the dashboard header so the linked tab cannot reach back to `window.opener`. Browsers default this for `noopener` since 2021, but the explicit attribute satisfies static auditors and older browsers. + +### Documentation + +- **`README.md`** and **`docs/CONFIGURATION.md`** — added rows for `system.coldPathTimeoutMs` and `system.gatewayPort` (including the inheritance behaviour). `examples/config.full.json` now includes the full `system` block which was previously missing. + +--- + ## v2026.4.13 — 2026-04-13 ### Added diff --git a/README.md b/README.md index 9263f33..ed6b649 100644 --- a/README.md +++ b/README.md @@ -393,6 +393,8 @@ is usually the repo root. For `install.sh` installs it is | `system.metricsTtlSeconds` | `10` | Server-side metrics cache TTL (seconds) | | `system.versionsTtlSeconds` | `300` | Version/gateway probe cache TTL (seconds) | | `system.gatewayTimeoutMs` | `5000` | Timeout for gateway liveness probe (ms) | +| `system.coldPathTimeoutMs` | `4000` | Overall budget for a cold `/api/system` collection (ms) | +| `system.gatewayPort` | `18789` | Gateway port for health probes (defaults to `ai.gatewayPort`) | | `system.diskPath` | `"/"` | Filesystem path to report disk usage for | | `system.warnPercent` | `70` | Global warn threshold (% used) — overridden by per-metric values | | `system.criticalPercent` | `85` | Global critical threshold (% used) — overridden by per-metric values | diff --git a/TODO.md b/TODO.md index 41be926..aaa53c6 100644 --- a/TODO.md +++ b/TODO.md @@ -2,6 +2,8 @@ ## ✅ Released +- **v2026.4.29**: cron sidecar merge (#25), `/api/system` cold-path deadline + degraded fallback (#26), `system.gatewayPort` inheritance fix, systemd `Environment=` + `restart` on reinstall, per-instance latest-version fetcher +- **v2026.4.13**: diagnostics + log visibility (#14), release hardening pass, structured logging cleanup - Built-in service management (`install`/`uninstall`/`start`/`stop`/`restart`/`status`) via launchd (macOS) and systemd (Linux) - Security hardening (XSS, CORS, O(N²), shell safety, file handles) - Performance, dirty-checking & test suite (initial 44 ACs, rAF, scroll preserve, tab fix) diff --git a/VERSION b/VERSION index ff042d6..61b61d7 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -v2026.4.8 +v2026.4.29 diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md index 9c21e0b..b625439 100644 --- a/docs/CONFIGURATION.md +++ b/docs/CONFIGURATION.md @@ -49,6 +49,7 @@ the binary reports the installed release version correctly after upgrades. "metricsTtlSeconds": 10, "versionsTtlSeconds": 300, "gatewayTimeoutMs": 5000, + "coldPathTimeoutMs": 4000, "gatewayPort": 18789, "diskPath": "/", "warnPercent": 70, @@ -174,6 +175,7 @@ To change the OpenClaw data directory, set the `OPENCLAW_HOME` environment varia | `system.metricsTtlSeconds` | number | `10` | Server-side metrics cache TTL (2-60 seconds) | | `system.versionsTtlSeconds` | number | `300` | Version/gateway probe cache TTL (30-3600 seconds) | | `system.gatewayTimeoutMs` | number | `5000` | Timeout for gateway liveness probe (200-15000 ms) | +| `system.coldPathTimeoutMs` | number | `4000` | Overall budget for a cold `/api/system` collection — bounds total wall time when no warm cache is available (200-15000 ms) | | `system.gatewayPort` | number | `18789` | Gateway port for health probes (defaults to `ai.gatewayPort`) | | `system.diskPath` | string | `"/"` | Filesystem path to report disk usage for | | `system.warnPercent` | number | `70` | Global warn threshold (% used) — overridden by per-metric values | diff --git a/docs/plans/2026-04-29-issue-25-26-fix-plan.md b/docs/plans/2026-04-29-issue-25-26-fix-plan.md new file mode 100644 index 0000000..bf8c7c1 --- /dev/null +++ b/docs/plans/2026-04-29-issue-25-26-fix-plan.md @@ -0,0 +1,267 @@ +# Roadmap: Issues #25 + #26 — Validated Solution + +**Issues:** `mudrii/openclaw-dashboard#25`, `mudrii/openclaw-dashboard#26` +**Date:** `2026-04-29` +**Status:** Plan validated against `HEAD` of `chore/node24-actions` + +--- + +## 1. Scope and verdict per issue + +### Issue #25 — Cron table shows blank `Last run` / `Next run` + +| Claim | Verified against code | Verdict | +|---|---|---| +| Dashboard does not project nested `state.*` into flat fields | `internal/apprefresh/refresh.go:850-862` does read `jm["state"]` correctly | False root cause | +| Dashboard reads only `~/.openclaw/cron/jobs.json` and ignores `jobs-state.json` sidecar introduced in OpenClaw v2026.4.20 | `internal/apprefresh/refresh.go:144,800-886` reads only `cron/jobs.json`; live `~/.openclaw/cron/jobs-state.json` exists with populated `jobs..state.{nextRunAtMs,lastRunAtMs,lastStatus,lastDurationMs}` | **Confirmed bug** | + +**Sidecar schema (verified by reading the live file):** +```json +{ + "version": 1, + "jobs": { + "": { + "updatedAtMs": 1777389430815, + "scheduleIdentity": "...", + "state": { + "nextRunAtMs": 1777860000000, + "lastRunAtMs": 1777389230407, + "lastRunStatus": "ok", + "lastStatus": "ok", + "lastDurationMs": 200408, + "lastDeliveryStatus": "delivered", + "consecutiveErrors": 0 + } + } + } +} +``` + +`jobs.json` definition entries now carry `state: {}` (empty); state lives only in the sidecar, keyed by `job.id`. + +### Issue #26 — Multiple sections empty + +| Claim | Verified against code | Verdict | +|---|---|---| +| AVAILABLE MODELS empty | `web/index.html:1456` — `(D.availableModels||[]).map(...)` renders correctly | Not reproducible | +| GIT LOG empty | `web/index.html:1718` — falls back to `
No git history
` | Not reproducible | +| SUBAGENT LIMITS empty | `web/index.html:1687` — falls back to `Not configured` | Not reproducible | +| SKILLS empty when `null`/`[]` | `web/index.html:1709-1714` — `.map(...).join('')` produces empty string with no fallback | **Confirmed bug** | +| GATEWAY RUNTIME stuck on `Loading…` | `web/index.html:645` placeholder; `2119-2123` fetch has no `AbortController`, no timeout, no terminal-state path on failure (only flips two booleans). Card never repaints if `r.ok===false` or fetch throws | **Confirmed bug** | +| `/api/system` cold-start latency | `internal/appsystem/system_service.go:128` runs `getVersionsCached` **synchronously** before the parallel block at `:140-151`. Worst case ≈ `2 × GatewayTimeoutMs + DetectGatewayFallback(1500ms)` for cold versions, then `~GatewayTimeoutMs` for runtime probes. With default `GatewayTimeoutMs=5000`, **cold path can run ~10–12s** | **Confirmed contributing cause** | + +**Already-correct pieces we must not regress:** +- `/api/system` stale-cache path (system_service.go:87-108) — returns immediately and refreshes in background. +- Empty-state rendering for Git Log and Subagent Limits. +- Models render path. +- Per-probe timeouts in `probeOpenclawGatewayEndpoints` (system_service.go:421) and `runWithTimeout` (system_service.go:687). + +--- + +## 2. Constraints + +1. **Zero third-party dependencies** — `go.mod` has no `require` block. Stdlib only. +2. **Embedded frontend** — `web/index.html` is `//go:embed`'d; frontend changes require `make build`. +3. **Root-package facade** — logic lives in `internal/`; root files are thin wrappers (`refresh.go`, `system_service.go`). +4. **Backwards compat** — older OpenClaw versions still write inline `state` to `jobs.json`. Must work for both. +5. **Surgical changes** — every modified line must trace to a verified bug. +6. **TDD/ATDD** — write failing tests against fixtures **before** changing production code. + +--- + +## 3. Solution design + +### 3.1 Cron sidecar merge (#25) — `internal/apprefresh/refresh.go` + +**Strategy:** load both files; merge runtime state by `job.id` with sidecar winning. Preserve existing inline-state behavior when sidecar is absent. + +**Signature change (additive):** +```go +// CollectCrons reads ~/.openclaw/cron/jobs.json and merges runtime state from the +// optional sidecar at jobs-state.json (introduced in OpenClaw v2026.4.20). +// statePath may be empty; when empty, derived as filepath.Join(dir, "jobs-state.json"). +func CollectCrons(cronPath string, loc *time.Location) []map[string]any +``` + +Internal change: derive `statePath` from `cronPath` (sibling file). Read both. Merge. + +**Merge rule per job (in priority order):** +1. If sidecar has entry for `job.id` and sidecar `state` is non-empty → use sidecar `state`. +2. Else use inline `jm["state"]` (legacy path). +3. Else default zero state (current behavior). + +**Field mapping** (from sidecar `state.*` to flat fields, matching existing inline path at `refresh.go:850-862`): +- `nextRunAtMs` → `nextRun` (formatted via existing `time.UnixMilli(...).Format`) +- `lastRunAtMs` → `lastRun` +- `lastStatus` → `lastStatus` (sidecar also exposes `lastRunStatus`; keep `lastStatus` for compatibility, fall back to `lastRunStatus` only if `lastStatus` is missing) +- `lastDurationMs` → `lastDurationMs` + +**Edge cases to handle:** +- Sidecar file missing → fall back to inline state. +- Sidecar JSON malformed → log warning, fall back to inline state (don't fail the whole refresh). +- Sidecar has entry but `state` is `{}` → treat as no-state and fall back to inline. +- `jobs.json` job has no `id` field → can't merge; use inline state. + +### 3.2 Bound `/api/system` cold-path (#26) — `internal/appsystem/system_service.go` + +**Strategy:** introduce one explicit cold-path budget enforced via `context.WithTimeout`, parallelize versions with the existing wait group, return whatever completed. + +**Changes:** +1. Add `SystemConfig.ColdPathTimeoutMs` (default `4000`, validated in [200, 15000]). One field, one place to tune. +2. In `refresh(ctx)` (system_service.go:123), wrap the entire collection in `coldCtx, cancel := context.WithTimeout(ctx, coldPathTimeout)`. +3. Move `getVersionsCached` into the wait group so versions and runtime collection start in parallel. +4. After `wg.Wait()` (or `coldCtx.Done()`, whichever fires first), build the response from whatever each goroutine produced. Missing fields stay at zero values; `Degraded=true` and an error like `"cold path: deadline exceeded"` is appended to `Errors`. +5. **Crucial**: `getVersionsCached` cache is populated even on partial cold-path completion, so the next request hits the warm path. We must not poison the cache with a half-collected version. +6. Stale-cache path at lines 87-108 stays unchanged. + +**Why not a more radical refactor:** the existing cache + parallel-collector design is sound. The bug is one missing deadline + one ordering issue. Surgical fix. + +### 3.3 Frontend Gateway Runtime terminal state (#26) — `web/index.html` + +**Strategy:** add fetch timeout via `AbortController`, repaint card with degraded state on any non-success outcome. + +**Changes (lines 2119-2123):** +1. Use `AbortController` with `setTimeout(controller.abort, 6000)` — slightly longer than backend cold-path budget (4000) plus network jitter. +2. On `r.ok === false`, call a new helper `renderGatewayDegraded(reason)` that paints the card with `State=Unavailable`, `Source=cached fallback` (if `data.json` has gateway snapshot) or `Source=offline`. +3. On thrown exception (timeout / network), same helper with `reason='timeout'` or `'network error'`. +4. Keep the existing `window._sysBarActive=false` flag flips for downstream code that depends on them. + +**Source priority for fallback:** +1. Live `/api/system` runtime data (current behavior). +2. `data.json` gateway snapshot — show with explicit `Source: cached fallback` label. +3. Explicit unavailable state with reason. + +### 3.4 Skills empty state (#26) — `web/index.html` + +Single-line change at `web/index.html:1710-1713`: append `||'
No skills configured
'` to the `.map().join('')` output, matching the Git Log pattern at line 1718. + +### 3.5 Regression coverage + +**Backend (Go):** +- `internal/apprefresh/cron_state_test.go` (new file): + - `TestCollectCrons_SidecarOnly` — sidecar present, inline `state:{}`, expects merged values. + - `TestCollectCrons_LegacyInlineOnly` — no sidecar file, inline state populated (current behavior). + - `TestCollectCrons_SidecarMissingJobID` — sidecar exists but doesn't list this job ID; falls back to inline. + - `TestCollectCrons_SidecarMalformed` — invalid JSON sidecar; falls back to inline + logs warning. + - `TestCollectCrons_BothPresent_SidecarWins` — both populated; sidecar values used. + - `TestCollectCrons_SidecarLastRunStatusFallback` — sidecar has `lastRunStatus` but missing `lastStatus`; uses `lastRunStatus`. +- `internal/appsystem/cold_path_test.go` (new file): + - `TestRefresh_ColdPath_Deadline` — slow probes; assert response within `ColdPathTimeoutMs + 500ms`. + - `TestRefresh_ColdPath_DegradedFlag` — partial collection; assert `Degraded=true` and `Errors` includes deadline message. + - `TestRefresh_ColdPath_HostMetricsAlwaysShipped` — gateway probes hang; CPU/RAM/disk still in payload. + - `TestRefresh_ColdPath_PoisonsNoCache` — partial result must not be cached as fresh versions. + +**Backend tests are TDD-first:** write all of these as failing tests, then implement. + +**Frontend (manual + fixture):** +- Add `web/testdata/issue-26-fixture.json` with `skills:null`, `gitLog:null`, `agentConfig.subagentConfig:null`, populated `availableModels`, `gateway` block. +- Manual smoke test plan (documented in PR description, not in repo): serve fixture, kill backend, verify Gateway Runtime card transitions to "Unavailable" within ~6s. +- Optional: extend existing Playwright suite if it covers the affected panels — check first; do not add Playwright tests if they don't already exist for these panels. + +--- + +## 4. Implementation phases + +### Phase 0 — Branch + reproduction fixtures + +- [ ] Create branch `fix/cron-sidecar-and-system-coldpath` from `chore/node24-actions`. +- [ ] Add `internal/apprefresh/testdata/cron/jobs.split-store.json` (definitions with `state:{}`). +- [ ] Add `internal/apprefresh/testdata/cron/jobs-state.split-store.json` (sidecar with populated state). +- [ ] Add `internal/apprefresh/testdata/cron/jobs.legacy-inline.json` (single-file with inline state — already covered by `testdata/cron/jobs.every.json`, reuse). + +**Verify:** `git diff` shows only fixtures + branch checkout; no code changes yet. + +### Phase 1 — TDD red: cron sidecar tests + +- [ ] Write all six tests in §3.5 backend list for cron. +- [ ] Run `go test ./internal/apprefresh/... -run CollectCrons_Sidecar -count=1` — expect failures. + +**Verify:** Tests fail for the documented reason (sidecar not loaded), not for unrelated bugs. + +### Phase 2 — TDD green: implement cron sidecar merge + +- [ ] Implement merge logic in `internal/apprefresh/refresh.go` per §3.1. +- [ ] Run `go test ./internal/apprefresh/... -count=1 -race` — all green. +- [ ] Run `go vet ./...` — clean. +- [ ] Verify legacy tests (`TestCollectCrons_ParsesJobs`, `TestCollectCrons_ScheduleKinds`, `TestCollectCrons_MissingFile`, `TestCollectCrons_InvalidJSON`) still pass — backward compat preserved. + +**Verify:** Run `openclaw-dashboard --refresh` against real `~/.openclaw/cron/`; inspect resulting `~/.openclaw/dashboard/data.json` — `crons[].lastRun`, `nextRun`, `lastStatus`, `lastDurationMs` are populated. + +### Phase 3 — TDD red: /api/system cold-path tests + +- [ ] Write four tests in §3.5 backend list for `appsystem`. +- [ ] Use `httptest.NewServer` with `time.Sleep` to simulate slow gateway (pattern already in `system_service_test.go:160-184`). +- [ ] Run tests — expect failures. + +### Phase 4 — TDD green: bound cold-path latency + +- [ ] Add `ColdPathTimeoutMs` to `appconfig.SystemConfig` with default 4000 and validation. +- [ ] Refactor `SystemService.refresh` per §3.2: parallelize versions + runtime + host metrics under one bounded context. +- [ ] Ensure version cache only persists fully-collected results. +- [ ] Run tests + race detector + vet. + +**Verify:** With backend running and gateway killed, `time curl -s http://localhost:/api/system | jq .degraded` returns `true` and completes in under 5s. + +### Phase 5 — Frontend Gateway Runtime + Skills + +- [ ] Add `AbortController`-based timeout to `Sys.fetch()` at `web/index.html:2119`. +- [ ] Add `renderGatewayDegraded(reason)` helper that paints the card. +- [ ] Wire degraded paint into `r.ok===false` and `catch(e)` branches. +- [ ] Add Skills empty-state fallback at `web/index.html:1710-1713`. +- [ ] `make build` and serve locally. + +**Verify (manual):** +- Load dashboard with backend running normally — Gateway Runtime card paints with live data within poll interval. +- Kill the backend mid-poll — card transitions to "Unavailable" within ~6s, does not stay on "Loading…". +- Force `data.json` with `skills:null` — Skills panel shows "No skills configured". + +### Phase 6 — Final verification + +- [ ] `make check` (vet + lint + test -race) — clean. +- [ ] `go test -race -count=1 ./...` — clean. +- [ ] Manual end-to-end: refresh against real OpenClaw v2026.4.x install — cron table populated, Gateway Runtime card never stuck. +- [ ] `git log --oneline` review — every commit traces to a §3 line item, conventional-commit format. +- [ ] Update `TODO.md` Bugs section: mark cron sidecar item complete; remove obsolete entries. + +### Phase 7 — Docs / changelog + +- [ ] Append entry under next release in `CHANGELOG.md` (or equivalent — check repo for existing convention). +- [ ] Document `system.coldPathTimeoutMs` in any user-facing config docs that exist. +- [ ] Open PR referencing both #25 and #26. + +--- + +## 5. Goal-backward verification matrix + +Each phase ends only when its verify step passes. Final acceptance: + +| Symptom from issue | Verification | +|---|---| +| #25: cron `Last run` blank | After Phase 2, real `data.json` shows populated `lastRun`/`nextRun`/`lastStatus`/`lastDurationMs` for jobs that ran | +| #25: dashboard breaks against OpenClaw v2026.4.20+ | `TestCollectCrons_SidecarOnly` covers the split-store case | +| #25: legacy compat | `TestCollectCrons_LegacyInlineOnly` covers the single-file case | +| #26: Gateway Runtime stuck on `Loading…` | Phase 4 deadline test + Phase 5 manual kill-backend test | +| #26: Skills empty on `null` | Manual fixture test + Skills empty-state regression in `web/testdata/issue-26-fixture.json` | +| #26: false claims (Models, Git, Subagent) | Existing render code is correct; no change needed; documented in §1 | + +--- + +## 6. Risks and mitigations + +| Risk | Mitigation | +|---|---| +| `ColdPathTimeoutMs` too aggressive → false "Unavailable" on healthy-but-slow gateway | Default 4000ms; configurable; tests assert no degraded flag at 1.5× normal probe time | +| Sidecar merge picks wrong job when IDs collide across upgrades | Use `job.id` (UUID) as merge key; absent ID → no merge (legacy fallback) | +| Frontend timeout shorter than backend deadline → repeated false negatives | Frontend timeout 6000ms vs backend 4000ms — explicit headroom for network + JSON marshaling | +| Cache poisoning on cold partial fail | Phase 4 test `TestRefresh_ColdPath_PoisonsNoCache` asserts version cache is only updated when collection succeeds | +| Embedded frontend not rebuilt | `make build` step in Phase 5 verify; PR description includes `git diff --stat` showing `web/index.html` changed | + +--- + +## 7. Out of scope + +- Refactoring of unrelated `appsystem` paths beyond the cold-path budget. +- Adding Playwright tests where none exist for the affected panels (would expand test infra). +- Changes to `data.json` contract. +- Adding new external dependencies (project policy: zero deps). +- Fixes to issue #26 claims that did not reproduce (Models, Git Log, Subagent Limits). diff --git a/examples/config.full.json b/examples/config.full.json index 31b4d6a..e73548b 100644 --- a/examples/config.full.json +++ b/examples/config.full.json @@ -26,5 +26,21 @@ "model": "", "maxHistory": 6, "dotenvPath": "~/.openclaw/.env" + }, + "system": { + "enabled": true, + "pollSeconds": 10, + "metricsTtlSeconds": 10, + "versionsTtlSeconds": 300, + "gatewayTimeoutMs": 5000, + "coldPathTimeoutMs": 4000, + "gatewayPort": 18789, + "diskPath": "/", + "warnPercent": 70, + "criticalPercent": 85, + "cpu": { "warn": 80, "critical": 95 }, + "ram": { "warn": 80, "critical": 95 }, + "swap": { "warn": 80, "critical": 95 }, + "disk": { "warn": 80, "critical": 95 } } } diff --git a/internal/appconfig/config.go b/internal/appconfig/config.go index 9ea081f..04bc93c 100644 --- a/internal/appconfig/config.go +++ b/internal/appconfig/config.go @@ -62,19 +62,24 @@ type MetricThreshold struct { } type SystemConfig struct { - Enabled bool `json:"enabled"` - PollSeconds int `json:"pollSeconds"` - MetricsTTLSeconds int `json:"metricsTtlSeconds"` - VersionsTTLSeconds int `json:"versionsTtlSeconds"` - GatewayTimeoutMs int `json:"gatewayTimeoutMs"` - GatewayPort int `json:"gatewayPort"` - DiskPath string `json:"diskPath"` - WarnPercent float64 `json:"warnPercent"` - CriticalPercent float64 `json:"criticalPercent"` - CPU MetricThreshold `json:"cpu"` - RAM MetricThreshold `json:"ram"` - Swap MetricThreshold `json:"swap"` - Disk MetricThreshold `json:"disk"` + Enabled bool `json:"enabled"` + PollSeconds int `json:"pollSeconds"` + MetricsTTLSeconds int `json:"metricsTtlSeconds"` + VersionsTTLSeconds int `json:"versionsTtlSeconds"` + GatewayTimeoutMs int `json:"gatewayTimeoutMs"` + // ColdPathTimeoutMs bounds the worst-case wall time of a cold /api/system + // collection (no warm cache). Each subcollector still has its own per-probe + // timeout; this is the overall budget that prevents a slow gateway from + // dragging the whole refresh past the frontend fetch deadline. + ColdPathTimeoutMs int `json:"coldPathTimeoutMs"` + GatewayPort int `json:"gatewayPort"` + DiskPath string `json:"diskPath"` + WarnPercent float64 `json:"warnPercent"` + CriticalPercent float64 `json:"criticalPercent"` + CPU MetricThreshold `json:"cpu"` + RAM MetricThreshold `json:"ram"` + Swap MetricThreshold `json:"swap"` + Disk MetricThreshold `json:"disk"` } type Config struct { @@ -127,14 +132,17 @@ func Default() Config { MetricsTTLSeconds: 10, VersionsTTLSeconds: 300, GatewayTimeoutMs: 5000, - GatewayPort: 18789, - DiskPath: "/", - WarnPercent: 70, - CriticalPercent: 85, - CPU: MetricThreshold{Warn: 80, Critical: 95}, - RAM: MetricThreshold{Warn: 80, Critical: 95}, - Swap: MetricThreshold{Warn: 80, Critical: 95}, - Disk: MetricThreshold{Warn: 80, Critical: 95}, + ColdPathTimeoutMs: 4000, + // GatewayPort intentionally left zero so Load() can inherit from + // AI.GatewayPort when system.gatewayPort is omitted in config.json. + // Pre-filling here would mask user overrides on the AI side. + DiskPath: "/", + WarnPercent: 70, + CriticalPercent: 85, + CPU: MetricThreshold{Warn: 80, Critical: 95}, + RAM: MetricThreshold{Warn: 80, Critical: 95}, + Swap: MetricThreshold{Warn: 80, Critical: 95}, + Disk: MetricThreshold{Warn: 80, Critical: 95}, }, } } @@ -148,11 +156,11 @@ func Load(dir string) Config { } if err != nil { slog.Warn("[dashboard] config: no config.json found, using defaults") - return cfg - } - defer func() { _ = f.Close() }() - if err := json.NewDecoder(f).Decode(&cfg); err != nil { - slog.Warn("[dashboard] invalid config.json, using defaults for missing/invalid fields", "error", err) + } else { + defer func() { _ = f.Close() }() + if err := json.NewDecoder(f).Decode(&cfg); err != nil { + slog.Warn("[dashboard] invalid config.json, using defaults for missing/invalid fields", "error", err) + } } if len(cfg.Logs.Sources) == 0 && len(cfg.Logs.LogSources) > 0 { cfg.Logs.Sources = append([]string{}, cfg.Logs.LogSources...) @@ -208,6 +216,9 @@ func Load(dir string) Config { if cfg.System.GatewayTimeoutMs < 200 || cfg.System.GatewayTimeoutMs > 15000 { cfg.System.GatewayTimeoutMs = 5000 } + if cfg.System.ColdPathTimeoutMs < 200 || cfg.System.ColdPathTimeoutMs > 15000 { + cfg.System.ColdPathTimeoutMs = 4000 + } if cfg.System.GatewayPort <= 0 { cfg.System.GatewayPort = cfg.AI.GatewayPort } diff --git a/internal/appconfig/config_test.go b/internal/appconfig/config_test.go index fc05f0d..1a03423 100644 --- a/internal/appconfig/config_test.go +++ b/internal/appconfig/config_test.go @@ -103,6 +103,51 @@ func TestLoad_InvalidJSON(t *testing.T) { } } +// TestLoad_SystemGatewayPortInheritsFromAI verifies that omitting +// system.gatewayPort in config.json inherits the value from ai.gatewayPort, +// which is the documented behavior. Regression: previously Default() +// pre-populated System.GatewayPort=18789, masking any AI override. +func TestLoad_SystemGatewayPortInheritsFromAI(t *testing.T) { + dir := t.TempDir() + data := `{"ai":{"gatewayPort":12345}}` + if err := os.WriteFile(filepath.Join(dir, "config.json"), []byte(data), 0o644); err != nil { + t.Fatal(err) + } + cfg := Load(dir) + if cfg.AI.GatewayPort != 12345 { + t.Fatalf("AI.GatewayPort = %d, want 12345", cfg.AI.GatewayPort) + } + if cfg.System.GatewayPort != 12345 { + t.Fatalf("System.GatewayPort = %d, want inherited 12345 from ai.gatewayPort", + cfg.System.GatewayPort) + } +} + +// TestLoad_SystemGatewayPortExplicitOverridesAI verifies that an explicit +// system.gatewayPort in config.json wins over ai.gatewayPort. +func TestLoad_SystemGatewayPortExplicitOverridesAI(t *testing.T) { + dir := t.TempDir() + data := `{"ai":{"gatewayPort":12345},"system":{"enabled":true,"gatewayPort":54321}}` + if err := os.WriteFile(filepath.Join(dir, "config.json"), []byte(data), 0o644); err != nil { + t.Fatal(err) + } + cfg := Load(dir) + if cfg.System.GatewayPort != 54321 { + t.Fatalf("explicit System.GatewayPort lost: got %d, want 54321", cfg.System.GatewayPort) + } +} + +// TestLoad_SystemGatewayPortMissingFileFallsBackToDefault locks in the +// no-config-file path: System.GatewayPort still ends up at the default +// (inherited from AI.GatewayPort default 18789), not zero. +func TestLoad_SystemGatewayPortMissingFileFallsBackToDefault(t *testing.T) { + cfg := Load(t.TempDir()) + if cfg.System.GatewayPort != 18789 { + t.Fatalf("System.GatewayPort = %d, want 18789 (default via AI inheritance)", + cfg.System.GatewayPort) + } +} + func TestLoad_SystemThresholdClamping(t *testing.T) { dir := t.TempDir() // Set critical <= warn to trigger clamping diff --git a/internal/apprefresh/cron_state_test.go b/internal/apprefresh/cron_state_test.go new file mode 100644 index 0000000..2cd433f --- /dev/null +++ b/internal/apprefresh/cron_state_test.go @@ -0,0 +1,381 @@ +package apprefresh + +import ( + "encoding/json" + "os" + "path/filepath" + "testing" + "time" +) + +// copyTestdata copies testdata/cron/ into dir and returns the destination path. +func copyTestdata(t *testing.T, dir, name string) string { + t.Helper() + src := filepath.Join("testdata", "cron", name) + data, err := os.ReadFile(src) + if err != nil { + t.Fatalf("read fixture %s: %v", src, err) + } + dst := filepath.Join(dir, name) + if err := os.WriteFile(dst, data, 0o644); err != nil { + t.Fatalf("write fixture %s: %v", dst, err) + } + return dst +} + +// TestCollectCrons_SidecarOnly covers OpenClaw v2026.4.20+ split-store: jobs.json +// has empty inline state and jobs-state.json holds the runtime state. +func TestCollectCrons_SidecarOnly(t *testing.T) { + dir := t.TempDir() + cronPath := filepath.Join(dir, "jobs.json") + statePath := filepath.Join(dir, "jobs-state.json") + + jobsBytes, err := os.ReadFile(filepath.Join("testdata", "cron", "jobs.split-store.json")) + if err != nil { + t.Fatal(err) + } + stateBytes, err := os.ReadFile(filepath.Join("testdata", "cron", "jobs-state.split-store.json")) + if err != nil { + t.Fatal(err) + } + if err := os.WriteFile(cronPath, jobsBytes, 0o644); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(statePath, stateBytes, 0o644); err != nil { + t.Fatal(err) + } + + crons := CollectCrons(cronPath, time.UTC) + if len(crons) != 2 { + t.Fatalf("expected 2 crons, got %d", len(crons)) + } + first := crons[0] + if first["lastStatus"] != "ok" { + t.Errorf("lastStatus = %v, want ok", first["lastStatus"]) + } + if first["lastDurationMs"] != 23028 { + t.Errorf("lastDurationMs = %v, want 23028", first["lastDurationMs"]) + } + if first["lastRun"] == "" { + t.Errorf("lastRun should be populated, got empty string") + } + if first["nextRun"] == "" { + t.Errorf("nextRun should be populated, got empty string") + } +} + +// TestCollectCrons_LegacyInlineOnly covers pre-v2026.4.20 single-file behavior +// where state lives inline in jobs.json and there is no sidecar. +func TestCollectCrons_LegacyInlineOnly(t *testing.T) { + dir := t.TempDir() + cronPath := copyTestdata(t, dir, "jobs.legacy-inline.json") + // rename to jobs.json (CollectCrons takes the full path so this is just for clarity) + finalPath := filepath.Join(dir, "jobs.json") + if err := os.Rename(cronPath, finalPath); err != nil { + t.Fatal(err) + } + + crons := CollectCrons(finalPath, time.UTC) + if len(crons) != 1 { + t.Fatalf("expected 1 cron, got %d", len(crons)) + } + c := crons[0] + if c["lastStatus"] != "ok" { + t.Errorf("lastStatus = %v, want ok", c["lastStatus"]) + } + if c["lastDurationMs"] != 12345 { + t.Errorf("lastDurationMs = %v, want 12345", c["lastDurationMs"]) + } + if c["lastRun"] == "" { + t.Error("lastRun should be populated from inline state") + } + if c["nextRun"] == "" { + t.Error("nextRun should be populated from inline state") + } +} + +// TestCollectCrons_SidecarMissingJobID: sidecar exists but does not list this job's id. +// Should fall back to inline state for that job. +func TestCollectCrons_SidecarMissingJobID(t *testing.T) { + dir := t.TempDir() + cronPath := filepath.Join(dir, "jobs.json") + statePath := filepath.Join(dir, "jobs-state.json") + + // Job with inline state + jobs := map[string]any{ + "version": 1, + "jobs": []any{ + map[string]any{ + "id": "legacy-uuid-123", + "name": "orphan-job", + "enabled": true, + "schedule": map[string]any{"kind": "cron", "expr": "0 0 * * *"}, + "state": map[string]any{ + "lastStatus": "ok", + "lastRunAtMs": float64(1777025060681), + "nextRunAtMs": float64(1777046660670), + "lastDurationMs": float64(7777), + }, + }, + }, + } + jobsBytes, _ := json.Marshal(jobs) + if err := os.WriteFile(cronPath, jobsBytes, 0o644); err != nil { + t.Fatal(err) + } + + // Sidecar references a different job id + sidecar := map[string]any{ + "version": 1, + "jobs": map[string]any{ + "unrelated-uuid": map[string]any{ + "state": map[string]any{ + "lastStatus": "fail", + "lastDurationMs": float64(11111), + }, + }, + }, + } + stateBytes, _ := json.Marshal(sidecar) + if err := os.WriteFile(statePath, stateBytes, 0o644); err != nil { + t.Fatal(err) + } + + crons := CollectCrons(cronPath, time.UTC) + if len(crons) != 1 { + t.Fatalf("expected 1 cron, got %d", len(crons)) + } + if crons[0]["lastStatus"] != "ok" { + t.Errorf("lastStatus = %v, want ok (from inline)", crons[0]["lastStatus"]) + } + if crons[0]["lastDurationMs"] != 7777 { + t.Errorf("lastDurationMs = %v, want 7777 (from inline)", crons[0]["lastDurationMs"]) + } +} + +// TestCollectCrons_SidecarMalformed: sidecar JSON is invalid; must not break collection. +// Falls back to inline state. +func TestCollectCrons_SidecarMalformed(t *testing.T) { + dir := t.TempDir() + cronPath := filepath.Join(dir, "jobs.json") + statePath := filepath.Join(dir, "jobs-state.json") + + jobs := map[string]any{ + "jobs": []any{ + map[string]any{ + "id": "any-id", + "name": "x", + "enabled": true, + "schedule": map[string]any{"kind": "cron", "expr": "0 0 * * *"}, + "state": map[string]any{ + "lastStatus": "ok", + "lastRunAtMs": float64(1777025060681), + "nextRunAtMs": float64(1777046660670), + "lastDurationMs": float64(99), + }, + }, + }, + } + jobsBytes, _ := json.Marshal(jobs) + _ = os.WriteFile(cronPath, jobsBytes, 0o644) + _ = os.WriteFile(statePath, []byte("not json"), 0o644) + + crons := CollectCrons(cronPath, time.UTC) + if len(crons) != 1 { + t.Fatalf("expected 1 cron, got %d", len(crons)) + } + if crons[0]["lastStatus"] != "ok" { + t.Errorf("lastStatus = %v, want ok (inline fallback)", crons[0]["lastStatus"]) + } + if crons[0]["lastDurationMs"] != 99 { + t.Errorf("lastDurationMs = %v, want 99 (inline fallback)", crons[0]["lastDurationMs"]) + } +} + +// TestCollectCrons_BothPresent_SidecarWins: when inline state and sidecar both have +// values for the same job id, sidecar wins (it is the live runtime state). +func TestCollectCrons_BothPresent_SidecarWins(t *testing.T) { + dir := t.TempDir() + cronPath := filepath.Join(dir, "jobs.json") + statePath := filepath.Join(dir, "jobs-state.json") + + id := "shared-id" + jobs := map[string]any{ + "jobs": []any{ + map[string]any{ + "id": id, + "name": "x", + "enabled": true, + "schedule": map[string]any{"kind": "cron", "expr": "0 0 * * *"}, + "state": map[string]any{ + "lastStatus": "stale", + "lastDurationMs": float64(1), + }, + }, + }, + } + sidecar := map[string]any{ + "jobs": map[string]any{ + id: map[string]any{ + "state": map[string]any{ + "lastStatus": "ok", + "lastDurationMs": float64(42), + "lastRunAtMs": float64(1777025060681), + "nextRunAtMs": float64(1777046660670), + }, + }, + }, + } + jb, _ := json.Marshal(jobs) + sb, _ := json.Marshal(sidecar) + _ = os.WriteFile(cronPath, jb, 0o644) + _ = os.WriteFile(statePath, sb, 0o644) + + crons := CollectCrons(cronPath, time.UTC) + if crons[0]["lastStatus"] != "ok" { + t.Errorf("lastStatus = %v, want ok (sidecar wins)", crons[0]["lastStatus"]) + } + if crons[0]["lastDurationMs"] != 42 { + t.Errorf("lastDurationMs = %v, want 42 (sidecar wins)", crons[0]["lastDurationMs"]) + } +} + +// TestCollectCrons_SidecarEmptyStateFallsBackToInline: when the sidecar has an +// entry for this job id but its `state` field is `{}` (empty object — the OpenClaw +// gateway has registered the job but never populated runtime state, e.g. brand-new +// schedule before its first run), loadCronStateSidecar must skip the entry and +// CollectCrons must fall back to inline state if available. Otherwise the UI would +// show a job with empty lastStatus/lastRun even when a legitimate inline value +// exists (e.g. mid-migration repos where jobs.json still has stale-but-non-empty +// state and jobs-state.json has been initialized with empty objects). +func TestCollectCrons_SidecarEmptyStateFallsBackToInline(t *testing.T) { + dir := t.TempDir() + cronPath := filepath.Join(dir, "jobs.json") + statePath := filepath.Join(dir, "jobs-state.json") + + id := "empty-sidecar-id" + jobs := map[string]any{ + "jobs": []any{ + map[string]any{ + "id": id, + "name": "x", + "enabled": true, + "schedule": map[string]any{"kind": "cron", "expr": "0 0 * * *"}, + "state": map[string]any{ + "lastStatus": "ok", + "lastRunAtMs": float64(1777025060681), + "nextRunAtMs": float64(1777046660670), + "lastDurationMs": float64(555), + }, + }, + }, + } + sidecar := map[string]any{ + "jobs": map[string]any{ + id: map[string]any{"state": map[string]any{}}, // empty state + }, + } + jb, _ := json.Marshal(jobs) + sb, _ := json.Marshal(sidecar) + _ = os.WriteFile(cronPath, jb, 0o644) + _ = os.WriteFile(statePath, sb, 0o644) + + crons := CollectCrons(cronPath, time.UTC) + if len(crons) != 1 { + t.Fatalf("expected 1 cron, got %d", len(crons)) + } + if crons[0]["lastStatus"] != "ok" { + t.Errorf("lastStatus = %v, want ok (inline fallback when sidecar state is empty)", crons[0]["lastStatus"]) + } + if crons[0]["lastDurationMs"] != 555 { + t.Errorf("lastDurationMs = %v, want 555 (inline fallback)", crons[0]["lastDurationMs"]) + } +} + +// TestCollectCrons_SidecarPartialOverridesInlineFully: locks in the documented +// override contract — when sidecar has an entry with non-empty state, it fully +// replaces inline state even when the sidecar entry has fewer populated fields +// than inline. This prevents stale inline data from leaking through into the UI. +// The inline state has lastDurationMs=999; the sidecar has lastStatus only +// (no duration). Result must reflect the sidecar's missing duration as 0, +// not the inline 999. +func TestCollectCrons_SidecarPartialOverridesInlineFully(t *testing.T) { + dir := t.TempDir() + cronPath := filepath.Join(dir, "jobs.json") + statePath := filepath.Join(dir, "jobs-state.json") + + id := "partial-sidecar-id" + jobs := map[string]any{ + "jobs": []any{ + map[string]any{ + "id": id, + "name": "x", + "enabled": true, + "schedule": map[string]any{"kind": "cron", "expr": "0 0 * * *"}, + "state": map[string]any{ + "lastStatus": "stale-from-inline", + "lastDurationMs": float64(999), + }, + }, + }, + } + sidecar := map[string]any{ + "jobs": map[string]any{ + id: map[string]any{ + "state": map[string]any{ + "lastStatus": "ok-from-sidecar", + // no lastDurationMs; contract says inline 999 must NOT survive + }, + }, + }, + } + jb, _ := json.Marshal(jobs) + sb, _ := json.Marshal(sidecar) + _ = os.WriteFile(cronPath, jb, 0o644) + _ = os.WriteFile(statePath, sb, 0o644) + + crons := CollectCrons(cronPath, time.UTC) + if crons[0]["lastStatus"] != "ok-from-sidecar" { + t.Errorf("lastStatus = %v, want ok-from-sidecar", crons[0]["lastStatus"]) + } + if crons[0]["lastDurationMs"] != 0 { + t.Errorf("lastDurationMs = %v, want 0 — sidecar must fully replace inline, not merge", + crons[0]["lastDurationMs"]) + } +} + +// TestCollectCrons_SidecarLastRunStatusFallback: when sidecar omits lastStatus but +// has lastRunStatus, use lastRunStatus. +func TestCollectCrons_SidecarLastRunStatusFallback(t *testing.T) { + dir := t.TempDir() + cronPath := filepath.Join(dir, "jobs.json") + statePath := filepath.Join(dir, "jobs-state.json") + + id := "6f34de99-6a2d-4b20-89e4-65a5e4410c1b" + jobs := map[string]any{ + "jobs": []any{ + map[string]any{ + "id": id, + "name": "x", + "enabled": true, + "schedule": map[string]any{"kind": "cron", "expr": "0 0 * * *"}, + "state": map[string]any{}, + }, + }, + } + jb, _ := json.Marshal(jobs) + _ = os.WriteFile(cronPath, jb, 0o644) + stateBytes, err := os.ReadFile(filepath.Join("testdata", "cron", "jobs-state.lastRunStatusOnly.json")) + if err != nil { + t.Fatal(err) + } + _ = os.WriteFile(statePath, stateBytes, 0o644) + + crons := CollectCrons(cronPath, time.UTC) + if crons[0]["lastStatus"] != "fail" { + t.Errorf("lastStatus = %v, want fail (from lastRunStatus)", crons[0]["lastStatus"]) + } + if crons[0]["lastDurationMs"] != 99 { + t.Errorf("lastDurationMs = %v, want 99", crons[0]["lastDurationMs"]) + } +} diff --git a/internal/apprefresh/refresh.go b/internal/apprefresh/refresh.go index 9e0d640..24e1974 100644 --- a/internal/apprefresh/refresh.go +++ b/internal/apprefresh/refresh.go @@ -797,6 +797,37 @@ func parseOpenclawConfig(oc map[string]any, basePath string) ( return compactionMode, skills, availableModels, modelAliases, agentConfig } +// loadCronStateSidecar reads jobs-state.json (introduced in OpenClaw v2026.4.20) +// and returns a job-id → state map. Returns nil when the file is absent or invalid; +// callers fall back to inline state. +func loadCronStateSidecar(statePath string) map[string]map[string]any { + data, err := os.ReadFile(statePath) + if err != nil { + return nil + } + var sidecar map[string]any + if err := json.Unmarshal(data, &sidecar); err != nil { + return nil + } + jobs := asObj(sidecar["jobs"]) + if jobs == nil { + return nil + } + out := make(map[string]map[string]any, len(jobs)) + for id, entry := range jobs { + em := asObj(entry) + if em == nil { + continue + } + state := asObj(em["state"]) + if len(state) == 0 { + continue + } + out[id] = state + } + return out +} + func CollectCrons(cronPath string, loc *time.Location) []map[string]any { var crons []map[string]any data, err := os.ReadFile(cronPath) @@ -812,6 +843,11 @@ func CollectCrons(cronPath string, loc *time.Location) []map[string]any { return crons } + // OpenClaw v2026.4.20+ moved runtime state to a sidecar file. When present, its + // per-job entries take precedence over inline state (which is now {} in jobs.json). + sidecarPath := filepath.Join(filepath.Dir(cronPath), "jobs-state.json") + sidecarStates := loadCronStateSidecar(sidecarPath) + for _, job := range jobs { jm := asObj(job) if jm == nil { @@ -847,8 +883,28 @@ func CollectCrons(cronPath string, loc *time.Location) []map[string]any { schedStr = string(b) } + // Sidecar override contract (OpenClaw v2026.4.20+): + // When jobs-state.json contains an entry for this job id, it is the + // authoritative live state — we replace inline state wholesale rather + // than field-merging. Rationale: in v2026.4.20+, OpenClaw stops writing + // runtime state to jobs.json, so any inline fields present there are + // pre-migration leftovers, not authoritative. A field-level merge could + // surface stale lastRun/nextRun values from a long-superseded inline + // snapshot. Inline state remains the fallback only when the sidecar is + // absent entirely (loadCronStateSidecar returns nil) or has no entry + // for this id (legacy jobs.json layout, sidecar missing this job). state := asObj(jm["state"]) - lastStatus := jsonStrDefault(state, "lastStatus", "none") + if id := jsonStr(jm, "id"); id != "" { + if sc, ok := sidecarStates[id]; ok { + state = sc + } + } + // Sidecar exposes both lastStatus and lastRunStatus; prefer lastStatus for + // continuity with the legacy schema, fall back to lastRunStatus. + lastStatus := jsonStrDefault(state, "lastStatus", "") + if lastStatus == "" { + lastStatus = jsonStrDefault(state, "lastRunStatus", "none") + } lastRunMs, _ := state["lastRunAtMs"].(float64) nextRunMs, _ := state["nextRunAtMs"].(float64) durationMs, _ := state["lastDurationMs"].(float64) diff --git a/internal/apprefresh/testdata/cron/jobs-state.lastRunStatusOnly.json b/internal/apprefresh/testdata/cron/jobs-state.lastRunStatusOnly.json new file mode 100644 index 0000000..7891f3d --- /dev/null +++ b/internal/apprefresh/testdata/cron/jobs-state.lastRunStatusOnly.json @@ -0,0 +1,14 @@ +{ + "version": 1, + "jobs": { + "6f34de99-6a2d-4b20-89e4-65a5e4410c1b": { + "updatedAtMs": 1777389430815, + "state": { + "nextRunAtMs": 1777046660670, + "lastRunAtMs": 1777025060681, + "lastRunStatus": "fail", + "lastDurationMs": 99 + } + } + } +} diff --git a/internal/apprefresh/testdata/cron/jobs-state.split-store.json b/internal/apprefresh/testdata/cron/jobs-state.split-store.json new file mode 100644 index 0000000..1607c4d --- /dev/null +++ b/internal/apprefresh/testdata/cron/jobs-state.split-store.json @@ -0,0 +1,27 @@ +{ + "version": 1, + "jobs": { + "6f34de99-6a2d-4b20-89e4-65a5e4410c1b": { + "updatedAtMs": 1777389430815, + "state": { + "nextRunAtMs": 1777046660670, + "lastRunAtMs": 1777025060681, + "lastRunStatus": "ok", + "lastStatus": "ok", + "lastDurationMs": 23028, + "consecutiveErrors": 0 + } + }, + "91ba6af1-9ab2-4f65-b24f-7b7f037f3778": { + "updatedAtMs": 1777435374595, + "state": { + "nextRunAtMs": 1777456921384, + "lastRunAtMs": 1777435321398, + "lastRunStatus": "ok", + "lastStatus": "ok", + "lastDurationMs": 53197, + "consecutiveErrors": 0 + } + } + } +} diff --git a/internal/apprefresh/testdata/cron/jobs.legacy-inline.json b/internal/apprefresh/testdata/cron/jobs.legacy-inline.json new file mode 100644 index 0000000..4b157bb --- /dev/null +++ b/internal/apprefresh/testdata/cron/jobs.legacy-inline.json @@ -0,0 +1,23 @@ +{ + "version": 1, + "jobs": [ + { + "id": "legacy-job-1", + "name": "Legacy inline-state job", + "enabled": true, + "schedule": { + "kind": "cron", + "expr": "0 9 * * *" + }, + "payload": { + "model": "claude-sonnet" + }, + "state": { + "nextRunAtMs": 1777046660670, + "lastRunAtMs": 1777025060681, + "lastStatus": "ok", + "lastDurationMs": 12345 + } + } + ] +} diff --git a/internal/apprefresh/testdata/cron/jobs.split-store.json b/internal/apprefresh/testdata/cron/jobs.split-store.json new file mode 100644 index 0000000..6e047b5 --- /dev/null +++ b/internal/apprefresh/testdata/cron/jobs.split-store.json @@ -0,0 +1,34 @@ +{ + "version": 1, + "jobs": [ + { + "id": "6f34de99-6a2d-4b20-89e4-65a5e4410c1b", + "agentId": "main", + "name": "Backup workspace git push", + "enabled": true, + "schedule": { + "kind": "cron", + "expr": "0 */6 * * *" + }, + "payload": { + "kind": "agentTurn", + "model": "claude-sonnet" + }, + "state": {} + }, + { + "id": "91ba6af1-9ab2-4f65-b24f-7b7f037f3778", + "agentId": "main", + "name": "Vault Inbox Pipeline", + "enabled": true, + "schedule": { + "kind": "every", + "everyMs": 3600000 + }, + "payload": { + "model": "openai-codex/gpt-5.4" + }, + "state": {} + } + ] +} diff --git a/internal/appserver/server_routes.go b/internal/appserver/server_routes.go index ec204ab..980aa2e 100644 --- a/internal/appserver/server_routes.go +++ b/internal/appserver/server_routes.go @@ -85,6 +85,18 @@ func (s *Server) notFound(w http.ResponseWriter, r *http.Request) { http.NotFound(w, r) } +// setCORSHeaders reflects any loopback origin (any port) and defaults to the +// configured server origin otherwise. This is safe because: +// - The dashboard binds to 127.0.0.1 by default (Server.Host), so non-loopback +// origins cannot reach it over the network in the typical deployment. +// - No Access-Control-Allow-Credentials header is set, so a cross-origin +// request cannot carry cookies or HTTP auth. +// - The /api/chat gateway token is server-side (s.gatewayToken from .env), +// never client-supplied, and that endpoint is rate-limited to 10/min per IP. +// +// Loopback reflection exists so a developer running the SPA on a separate +// localhost port (e.g. Vite on :5173) can talk to the dashboard during +// development without disabling CORS. func (s *Server) setCORSHeaders(w http.ResponseWriter, r *http.Request) { vary := w.Header().Get("Vary") switch { diff --git a/internal/appservice/systemd.go b/internal/appservice/systemd.go index 9773219..1b2c1a2 100644 --- a/internal/appservice/systemd.go +++ b/internal/appservice/systemd.go @@ -26,6 +26,8 @@ After=network.target [Service] Type=simple WorkingDirectory={{systemdQuote .WorkDir}} +Environment={{systemdQuote (printf "OPENCLAW_HOME=%s" .OpenclawHome)}} +Environment={{systemdQuote (printf "PATH=%s" .PathEnv)}} ExecStart={{systemdQuote .BinPath}} --bind {{systemdQuote .Host}} --port {{.Port}} Restart=always RestartSec=5 @@ -35,10 +37,12 @@ WantedBy=default.target `)) type unitData struct { - BinPath string - Host string - Port int - WorkDir string + BinPath string + Host string + Port int + WorkDir string + OpenclawHome string + PathEnv string } var unitPortRe = regexp.MustCompile(`(?:^|\s)--port\s+"?([0-9]+)"?`) @@ -89,7 +93,15 @@ func (sb *systemdBackend) Install(cfg InstallConfig) error { return fmt.Errorf("create unit file: %w", err) } defer func() { _ = f.Close() }() - if err := unitTmpl.Execute(f, unitData{BinPath: cfg.BinPath, Host: cfg.Host, Port: cfg.Port, WorkDir: cfg.WorkDir}); err != nil { + data := unitData{ + BinPath: cfg.BinPath, + Host: cfg.Host, + Port: cfg.Port, + WorkDir: cfg.WorkDir, + OpenclawHome: systemdOpenclawHome(), + PathEnv: systemdPathEnv(), + } + if err := unitTmpl.Execute(f, data); err != nil { return fmt.Errorf("write unit file: %w", err) } if out, err := sb.ctl("daemon-reload"); err != nil { @@ -98,12 +110,52 @@ func (sb *systemdBackend) Install(cfg InstallConfig) error { if out, err := sb.ctl("enable", systemdUnitName); err != nil { return fmt.Errorf("enable: %s: %w", strings.TrimSpace(string(out)), err) } - if out, err := sb.ctl("start", systemdUnitName); err != nil { - return fmt.Errorf("start: %s: %w", strings.TrimSpace(string(out)), err) + // Use restart so a reinstall with changed --bind/--port/Env actually picks + // up the new unit content. systemctl restart starts the unit if it is not + // currently running, so this also works for first installs. + if out, err := sb.ctl("restart", systemdUnitName); err != nil { + return fmt.Errorf("restart: %s: %w", strings.TrimSpace(string(out)), err) } return nil } +func systemdOpenclawHome() string { + if path := strings.TrimSpace(os.Getenv("OPENCLAW_HOME")); path != "" { + return path + } + if home, err := os.UserHomeDir(); err == nil && home != "" { + return filepath.Join(home, ".openclaw") + } + return "" +} + +func systemdPathEnv() string { + seen := make(map[string]struct{}) + var paths []string + add := func(entries ...string) { + for _, entry := range entries { + entry = strings.TrimSpace(entry) + if entry == "" { + continue + } + if _, ok := seen[entry]; ok { + continue + } + seen[entry] = struct{}{} + paths = append(paths, entry) + } + } + add(strings.Split(os.Getenv("PATH"), ":")...) + add( + "/usr/local/bin", + "/usr/bin", + "/bin", + "/usr/sbin", + "/sbin", + ) + return strings.Join(paths, ":") +} + func (sb *systemdBackend) Uninstall() error { p := sb.unitPath() if _, err := os.Stat(p); errors.Is(err, os.ErrNotExist) { diff --git a/internal/appservice/systemd_test.go b/internal/appservice/systemd_test.go index af3d03f..3c9d80b 100644 --- a/internal/appservice/systemd_test.go +++ b/internal/appservice/systemd_test.go @@ -27,6 +27,9 @@ func newTestSystemd(t *testing.T) (*systemdBackend, string) { func TestSystemd_Install_writesUnitFile(t *testing.T) { sb, dir := newTestSystemd(t) + t.Setenv("HOME", "/home/user") + t.Setenv("PATH", "/usr/local/bin:/usr/bin:/bin") + t.Setenv("OPENCLAW_HOME", "/srv/openclaw") cfg := InstallConfig{ BinPath: "/usr/local/bin/openclaw-dashboard", WorkDir: "/home/user/.openclaw/dashboard", @@ -53,6 +56,8 @@ func TestSystemd_Install_writesUnitFile(t *testing.T) { "/home/user/.openclaw/dashboard", "Restart=always", "WantedBy=default.target", + `Environment="OPENCLAW_HOME=/srv/openclaw"`, + `Environment="PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"`, } { if !strings.Contains(content, want) { t.Errorf("unit file missing %q\ncontent:\n%s", want, content) @@ -101,7 +106,7 @@ func TestSystemd_Install_callsSystemctl(t *testing.T) { if err := sb.Install(cfg); err != nil { t.Fatalf("Install: %v", err) } - for _, w := range []string{"daemon-reload", "enable", "start"} { + for _, w := range []string{"daemon-reload", "enable", "restart"} { found := false for _, c := range calls { if strings.Contains(c, w) { diff --git a/internal/appsystem/bench_test.go b/internal/appsystem/bench_test.go index dce5b53..5910539 100644 --- a/internal/appsystem/bench_test.go +++ b/internal/appsystem/bench_test.go @@ -10,17 +10,17 @@ import ( func BenchmarkGetJSON_CacheHit(b *testing.B) { cfg := appconfig.SystemConfig{ - Enabled: true, - MetricsTTLSeconds: 3600, - PollSeconds: 10, - DiskPath: "/", - GatewayTimeoutMs: 100, - GatewayPort: 18789, + Enabled: true, + MetricsTTLSeconds: 3600, + PollSeconds: 10, + DiskPath: "/", + GatewayTimeoutMs: 100, + GatewayPort: 18789, VersionsTTLSeconds: 3600, - CPU: appconfig.MetricThreshold{Warn: 80, Critical: 95}, - RAM: appconfig.MetricThreshold{Warn: 80, Critical: 95}, - Swap: appconfig.MetricThreshold{Warn: 50, Critical: 80}, - Disk: appconfig.MetricThreshold{Warn: 80, Critical: 95}, + CPU: appconfig.MetricThreshold{Warn: 80, Critical: 95}, + RAM: appconfig.MetricThreshold{Warn: 80, Critical: 95}, + Swap: appconfig.MetricThreshold{Warn: 50, Critical: 80}, + Disk: appconfig.MetricThreshold{Warn: 80, Critical: 95}, } ctx, cancel := context.WithCancel(context.Background()) diff --git a/internal/appsystem/cold_path_test.go b/internal/appsystem/cold_path_test.go new file mode 100644 index 0000000..a72b609 --- /dev/null +++ b/internal/appsystem/cold_path_test.go @@ -0,0 +1,290 @@ +package appsystem + +import ( + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "strconv" + "strings" + "testing" + "time" + + appconfig "github.com/mudrii/openclaw-dashboard/internal/appconfig" +) + +// hangingGateway returns an httptest server that blocks until the client +// cancels its request (or `maxDelay` elapses, whichever comes first), then +// closes. The handler honours r.Context() so srv.Close() returns promptly +// once the cold-path deadline cancels the in-flight request. +func hangingGateway(t *testing.T, maxDelay time.Duration) (port int, close func()) { + t.Helper() + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + select { + case <-time.After(maxDelay): + case <-r.Context().Done(): + return + } + w.Header().Set("Content-Type", "application/json") + _, _ = w.Write([]byte(`{"ok":true,"ready":true,"uptimeMs":0}`)) + })) + parts := strings.Split(srv.URL, ":") + p, err := strconv.Atoi(parts[len(parts)-1]) + if err != nil { + srv.Close() + t.Fatalf("parse port from %q: %v", srv.URL, err) + } + return p, srv.Close +} + +func coldPathCfg(port, coldMs int) appconfig.SystemConfig { + return appconfig.SystemConfig{ + Enabled: true, + PollSeconds: 10, + MetricsTTLSeconds: 10, + VersionsTTLSeconds: 60, + // GatewayTimeoutMs is intentionally very generous; the cold-path + // budget is what should bound the wall time, not per-probe timeouts. + GatewayTimeoutMs: 10000, + GatewayPort: port, + ColdPathTimeoutMs: coldMs, + DiskPath: "/", + WarnPercent: 70, + CriticalPercent: 85, + CPU: appconfig.MetricThreshold{Warn: 80, Critical: 95}, + RAM: appconfig.MetricThreshold{Warn: 80, Critical: 95}, + Swap: appconfig.MetricThreshold{Warn: 80, Critical: 95}, + Disk: appconfig.MetricThreshold{Warn: 80, Critical: 95}, + } +} + +// newColdPathTestService returns a SystemService configured for cold-path +// tests with the openclaw binary lookup forced to a non-existent path so +// that runWithTimeout fails fast instead of probing a real installation. +// This guarantees gateway HTTP probes (against the test httptest server) are +// the only slow path. +// +// Also overrides svc.fetchLatest so the background goroutine in +// getLatestVersionCached does not leak real outbound HTTP requests to npm +// during tests. Per-instance override avoids the cross-test data race a +// package-level var creates with goroutines that outlive the test body. +func newColdPathTestService(t *testing.T, cfg appconfig.SystemConfig) *SystemService { + t.Helper() + svc := NewSystemService(cfg, "test", context.Background()) + svc.fetchLatest = func(_ context.Context, _ int) string { return "" } + svc.binPath = "/nonexistent-openclaw-binary-for-cold-path-test" + svc.binOnce.Do(func() {}) + return svc +} + +// TestRefresh_ColdPath_Deadline asserts that when subcollectors hang +// indefinitely, refresh() honours ColdPathTimeoutMs and returns within the +// budget plus a small grace window — never blocking the request thread for +// the multi-second worst case described in #26. +func TestRefresh_ColdPath_Deadline(t *testing.T) { + port, closeSrv := hangingGateway(t, 5*time.Second) + defer closeSrv() + + cfg := coldPathCfg(port, 500) + svc := newColdPathTestService(t, cfg) + + start := time.Now() + body, _ := svc.refresh(context.Background()) + elapsed := time.Since(start) + + if body == nil { + t.Fatal("refresh returned nil body even though host metrics are available") + } + // Budget 500ms + 1.5s slack for slow CI; the bug version takes ~10–12s. + if elapsed > 2*time.Second { + t.Fatalf("cold-path refresh took %v, want <= 2s (budget=500ms)", elapsed) + } +} + +// TestRefresh_ColdPath_DegradedFlag asserts that hitting the cold-path +// deadline produces a payload the frontend can recognise as degraded: +// Degraded=true and an error string mentioning the deadline. +func TestRefresh_ColdPath_DegradedFlag(t *testing.T) { + port, closeSrv := hangingGateway(t, 5*time.Second) + defer closeSrv() + + cfg := coldPathCfg(port, 500) + svc := newColdPathTestService(t, cfg) + + body, _ := svc.refresh(context.Background()) + var resp SystemResponse + if err := json.Unmarshal(body, &resp); err != nil { + t.Fatalf("unmarshal: %v", err) + } + if !resp.Degraded { + t.Errorf("expected Degraded=true after cold-path deadline; resp=%+v", resp) + } + hasDeadline := false + for _, e := range resp.Errors { + if strings.Contains(strings.ToLower(e), "cold path") || strings.Contains(strings.ToLower(e), "deadline") { + hasDeadline = true + break + } + } + if !hasDeadline { + t.Errorf("expected cold-path/deadline message in resp.Errors, got %v", resp.Errors) + } +} + +// TestRefresh_ColdPath_HostMetricsAlwaysShipped asserts that host metrics +// (disk in particular — collected via syscall.Statfs, no I/O) are always +// in the response even when gateway probes hang. This is the contract that +// keeps the system card useful while the gateway is offline. +func TestRefresh_ColdPath_HostMetricsAlwaysShipped(t *testing.T) { + port, closeSrv := hangingGateway(t, 5*time.Second) + defer closeSrv() + + cfg := coldPathCfg(port, 1500) + svc := newColdPathTestService(t, cfg) + + body, _ := svc.refresh(context.Background()) + var resp SystemResponse + if err := json.Unmarshal(body, &resp); err != nil { + t.Fatalf("unmarshal: %v", err) + } + if resp.Disk.Error != nil { + t.Errorf("disk error should be nil even when gateway hangs; got %q", *resp.Disk.Error) + } + if resp.Disk.TotalBytes <= 0 { + t.Errorf("expected disk totalBytes > 0; got %d", resp.Disk.TotalBytes) + } +} + +// TestRefresh_ColdPath_PoisonsNoCache asserts that when cold-path collection +// is cut short by the deadline, the version cache is NOT updated with the +// partial/empty result. Otherwise the next request would hit a poisoned +// "warm" cache and skip a real collection silently. +func TestRefresh_ColdPath_PoisonsNoCache(t *testing.T) { + port, closeSrv := hangingGateway(t, 5*time.Second) + defer closeSrv() + + cfg := coldPathCfg(port, 500) + svc := newColdPathTestService(t, cfg) + + _, _ = svc.refresh(context.Background()) + + svc.verMu.RLock() + cachedAt := svc.verAt + svc.verMu.RUnlock() + + if !cachedAt.IsZero() { + t.Errorf("version cache must not be populated after cold-path deadline; verAt=%v", cachedAt) + } +} + +// TestRefresh_ColdPath_WarmVersionsCacheStillBoundsDeadline asserts that a warm +// versions cache (verAt populated, fresh within TTL) does NOT eliminate the cold- +// path deadline budget — the openclaw runtime collector probes gateway HTTP +// independently, so a hung gateway can still drag the refresh past budget if +// the deadline is not honoured. This test pre-seeds verAt with a valid result +// then hangs the gateway and verifies refresh still returns within budget. +// Locks in the contract that ColdPathTimeoutMs bounds the *whole* refresh +// regardless of which subcollector is the bottleneck. +func TestRefresh_ColdPath_WarmVersionsCacheStillBoundsDeadline(t *testing.T) { + port, closeSrv := hangingGateway(t, 5*time.Second) + defer closeSrv() + + cfg := coldPathCfg(port, 500) + svc := newColdPathTestService(t, cfg) + + // Pre-seed verAt with a valid cached SystemVersions. getVersionsCached + // will return this without probing — but CollectOpenclawRuntime still + // hits the gateway, so the cold-path deadline must still kick in. + svc.verMu.Lock() + svc.verCached = SystemVersions{ + Dashboard: "test", + Openclaw: "2026.4.20", + Latest: "2026.4.21", + Gateway: SystemGateway{Status: "online"}, + } + svc.verAt = time.Now() + svc.verMu.Unlock() + + start := time.Now() + body, _ := svc.refresh(context.Background()) + elapsed := time.Since(start) + + if body == nil { + t.Fatal("refresh returned nil body even with warm versions cache") + } + if elapsed > 2*time.Second { + t.Fatalf("refresh with warm versions cache took %v, want <= 2s (budget=500ms)", elapsed) + } + + var resp SystemResponse + if err := json.Unmarshal(body, &resp); err != nil { + t.Fatalf("unmarshal: %v", err) + } + // Cached versions should survive into the response since getVersionsCached + // returns the warm cache without probing. + if resp.Versions.Openclaw != "2026.4.20" { + t.Errorf("expected cached Openclaw=2026.4.20, got %q", resp.Versions.Openclaw) + } + // But Degraded must be true because the openclaw runtime collector hung. + if !resp.Degraded { + t.Errorf("expected Degraded=true (openclaw runtime hung), got resp=%+v", resp) + } +} + +// TestRefresh_ColdPath_DegradedPayloadIsCached locks in the metrics-cache TTL +// behavior for degraded responses. Design choice (intentional, not accidental): +// a deadline-hit payload IS cached for MetricsTTLSeconds, so a brief gateway +// outage does not turn into a request storm against the same hung gateway — +// every dashboard refresh would otherwise trigger a fresh ColdPathTimeoutMs +// wait. The cost is a bounded staleness window when the gateway recovers +// (≤ MetricsTTLSeconds, default 10s — well under the user-perceptible +// "live data" expectation). This test pins that contract: if you decide to +// stop caching degraded responses, this test must be updated deliberately. +func TestRefresh_ColdPath_DegradedPayloadIsCached(t *testing.T) { + port, closeSrv := hangingGateway(t, 5*time.Second) + defer closeSrv() + + cfg := coldPathCfg(port, 500) + svc := newColdPathTestService(t, cfg) + + // First call hits the cold path and populates the metrics cache with a + // degraded payload (gateway hangs, deadline fires). + body1, _ := svc.refresh(context.Background()) + if body1 == nil { + t.Fatal("first refresh returned nil") + } + svc.metricsMu.RLock() + cachedAt := svc.metricsAt + cachedPayload := svc.metricsPayload + svc.metricsMu.RUnlock() + + if cachedAt.IsZero() { + t.Fatal("metrics cache should be populated even when payload is degraded — " + + "otherwise GetJSON repeatedly hits the cold path during gateway outages") + } + if cachedPayload == nil { + t.Fatal("metrics payload should be cached even when degraded") + } + + // Verify GetJSON returns the cached payload immediately (warm hit) without + // triggering another cold-path collection. Use a tight elapsed bound to + // confirm we hit cache, not a fresh collection. + start := time.Now() + status, body2 := svc.GetJSON(context.Background()) + elapsed := time.Since(start) + + if status != http.StatusOK { + t.Errorf("GetJSON status = %d, want 200 (warm cache hit)", status) + } + if elapsed > 50*time.Millisecond { + t.Errorf("GetJSON took %v on warm cache, want <50ms — possible re-collection", elapsed) + } + + var resp SystemResponse + if err := json.Unmarshal(body2, &resp); err != nil { + t.Fatalf("unmarshal: %v", err) + } + if !resp.Degraded { + t.Errorf("cached payload lost Degraded=true; got %+v", resp) + } +} diff --git a/internal/appsystem/latest_version_test.go b/internal/appsystem/latest_version_test.go index 7a697be..3862edb 100644 --- a/internal/appsystem/latest_version_test.go +++ b/internal/appsystem/latest_version_test.go @@ -20,16 +20,12 @@ func TestGetLatestVersionCached_ConcurrentCalls_NoRace(t *testing.T) { } var fetchCount atomic.Int32 - original := fetchLatestVersion - t.Cleanup(func() { fetchLatestVersion = original }) - - fetchLatestVersion = func(_ context.Context, _ int) string { + svc := NewSystemService(cfg, "test", context.Background()) + svc.fetchLatest = func(_ context.Context, _ int) string { fetchCount.Add(1) return "2026.4.11" } - svc := NewSystemService(cfg, "test", context.Background()) - const goroutines = 20 var wg sync.WaitGroup wg.Add(goroutines) @@ -79,17 +75,13 @@ func TestGetLatestVersionCached_ReturnsCachedValueWhileRefreshing(t *testing.T) PollSeconds: 10, } - original := fetchLatestVersion - t.Cleanup(func() { fetchLatestVersion = original }) - fetched := make(chan struct{}) - fetchLatestVersion = func(_ context.Context, _ int) string { + svc := NewSystemService(cfg, "test", context.Background()) + svc.fetchLatest = func(_ context.Context, _ int) string { <-fetched return "2026.4.11-new" } - svc := NewSystemService(cfg, "test", context.Background()) - // Pre-seed expired cache svc.latestMu.Lock() svc.latestVer = "2026.4.10-old" @@ -121,15 +113,11 @@ func TestGetLatestVersionCached_NegativeCaching(t *testing.T) { PollSeconds: 10, } - original := fetchLatestVersion - t.Cleanup(func() { fetchLatestVersion = original }) - - fetchLatestVersion = func(_ context.Context, _ int) string { + svc := NewSystemService(cfg, "test", context.Background()) + svc.fetchLatest = func(_ context.Context, _ int) string { return "" // simulate failure } - svc := NewSystemService(cfg, "test", context.Background()) - svc.getLatestVersionCached() waitForLatestRefreshDone(t, svc) diff --git a/internal/appsystem/system_collect_linux.go b/internal/appsystem/system_collect_linux.go index 30b2a51..e93faeb 100644 --- a/internal/appsystem/system_collect_linux.go +++ b/internal/appsystem/system_collect_linux.go @@ -62,36 +62,6 @@ func collectMeminfo() (map[string]uint64, error) { return parseProcMeminfo(string(content)) } -func collectRAM(ctx context.Context) SystemRAM { - select { - case <-ctx.Done(): - e := "ram collection cancelled" - return SystemRAM{Error: &e} - default: - } - info, err := collectMeminfo() - if err != nil { - e := err.Error() - return SystemRAM{Error: &e} - } - return ramFromMeminfo(info) -} - -func collectSwap(ctx context.Context) SystemSwap { - select { - case <-ctx.Done(): - e := "swap collection cancelled" - return SystemSwap{Error: &e} - default: - } - info, err := collectMeminfo() - if err != nil { - e := err.Error() - return SystemSwap{Error: &e} - } - return swapFromMeminfo(info) -} - // ramFromMeminfo builds SystemRAM from a pre-parsed /proc/meminfo map. func ramFromMeminfo(info map[string]uint64) SystemRAM { totalKb := info["MemTotal"] diff --git a/internal/appsystem/system_service.go b/internal/appsystem/system_service.go index be7892f..f30cc2b 100644 --- a/internal/appsystem/system_service.go +++ b/internal/appsystem/system_service.go @@ -28,6 +28,12 @@ type SystemService struct { dashVer string shutdownCtx context.Context // lifecycle context — cancelled on graceful shutdown; do NOT use for per-request ops + // fetchLatest probes the npm registry for the newest published openclaw + // version. Set per-instance so each test can override it without touching a + // shared package var (which previously raced with the goroutine in + // getLatestVersionCached during test cleanup). + fetchLatest func(ctx context.Context, timeoutMs int) string + metricsMu sync.RWMutex metricsPayload []byte metricsStalePayload []byte // pre-computed version with "stale":true @@ -48,11 +54,15 @@ type SystemService struct { binPath string } -var fetchLatestVersion = FetchLatestNpmVersion var sharedSystemHTTPClient = &http.Client{} func NewSystemService(cfg appconfig.SystemConfig, dashVer string, serverCtx context.Context) *SystemService { - return &SystemService{cfg: cfg, dashVer: dashVer, shutdownCtx: serverCtx} + return &SystemService{ + cfg: cfg, + dashVer: dashVer, + shutdownCtx: serverCtx, + fetchLatest: FetchLatestNpmVersion, + } } func (s *SystemService) SetMetricsTimestampForTest(ts time.Time) { @@ -120,17 +130,22 @@ func (s *SystemService) GetJSON(ctx context.Context) (int, []byte) { // refresh collects fresh metrics and returns (jsonBytes, isHardFail). // isHardFail=true when ALL core collectors failed (no useful data). +// +// Cold-path bounding: the entire collection runs under a single +// context.WithTimeout(ColdPathTimeoutMs) so a hung gateway can never drag the +// whole refresh past the frontend's fetch deadline. Versions and OpenClaw +// runtime are collected in parallel; on the cold path that halves the worst +// case from ~2 × GatewayTimeoutMs down to ~1 × GatewayTimeoutMs (or the cold +// budget, whichever fires first). func (s *SystemService) refresh(ctx context.Context) ([]byte, bool) { - // Collect versions first (heavily cached — 300s TTL, effectively free on hot path). - // This guarantees collectOpenclawRuntime receives real version data instead of an - // empty SystemVersions{}, eliminating the fragile post-hoc patching that previously - // existed (B1 fix). - ver := s.getVersionsCached(ctx) - if latest := s.getLatestVersionCached(); latest != "" { - ver.Latest = latest + coldPath := time.Duration(s.cfg.ColdPathTimeoutMs) * time.Millisecond + if coldPath <= 0 { + coldPath = 4 * time.Second } + coldCtx, cancel := context.WithTimeout(ctx, coldPath) + defer cancel() - // Run OpenClaw runtime + disk + CPU/RAM/Swap in parallel for minimum wall-clock time. + var ver SystemVersions var openclaw SystemOpenclaw var disk SystemDisk var cpu SystemCPU @@ -138,18 +153,38 @@ func (s *SystemService) refresh(ctx context.Context) ([]byte, bool) { var swap SystemSwap oclawBin := s.openclawBin() var wg sync.WaitGroup - wg.Add(3) + wg.Add(4) + go func() { + defer wg.Done() + ver = s.getVersionsCached(coldCtx) + if latest := s.getLatestVersionCached(); latest != "" { + ver.Latest = latest + } + }() go func() { defer wg.Done() - openclaw = CollectOpenclawRuntime(ctx, oclawBin, s.cfg.GatewayTimeoutMs, s.cfg.GatewayPort, ver) + // Run independently of the versions goroutine — both probe the gateway, + // so serializing them would double the cold-path wall time. We pass + // SystemVersions{} here and patch openclaw.Status.{Current,Latest}Version + // from `ver` after wg.Wait() once both goroutines have finished. + openclaw = CollectOpenclawRuntime(coldCtx, oclawBin, s.cfg.GatewayTimeoutMs, s.cfg.GatewayPort, SystemVersions{}) }() go func() { defer wg.Done(); disk = CollectDiskRoot(s.cfg.DiskPath) }() go func() { defer wg.Done() - cpu, ram, swap = collectCPURAMSwapParallel(ctx) + cpu, ram, swap = collectCPURAMSwapParallel(coldCtx) }() wg.Wait() + deadlineHit := coldCtx.Err() == context.DeadlineExceeded + + if openclaw.Status.CurrentVersion == "" { + openclaw.Status.CurrentVersion = ver.Openclaw + } + if openclaw.Status.LatestVersion == "" { + openclaw.Status.LatestVersion = ver.Latest + } + // Hard fail = all four core collectors failed allFailed := cpu.Error != nil && ram.Error != nil && swap.Error != nil && disk.Error != nil @@ -193,6 +228,10 @@ func (s *SystemService) refresh(ctx context.Context) ([]byte, bool) { resp.Errors = append(resp.Errors, "openclaw: "+e) } } + if deadlineHit { + resp.Degraded = true + resp.Errors = append(resp.Errors, "cold path: deadline exceeded") + } b, err := json.Marshal(resp) if err != nil { @@ -244,9 +283,15 @@ func (s *SystemService) getVersionsCached(ctx context.Context) SystemVersions { v := CollectVersionsLocal(ctx, s.dashVer, s.cfg.GatewayTimeoutMs, s.cfg.GatewayPort, s.openclawBin()) s.verMu.Lock() - s.verCached = v - s.verAt = time.Now() s.verRefresh = false + // Cache only when ctx finished cleanly. If a cold-path deadline cut us + // short, the result may be partial (e.g. empty Openclaw or unknown Gateway + // status); persisting it would poison the warm cache and skip the next + // real collection silently. + if ctx.Err() == nil { + s.verCached = v + s.verAt = time.Now() + } s.verMu.Unlock() return v } @@ -277,12 +322,21 @@ func (s *SystemService) getLatestVersionCached() string { s.latestMu.Unlock() go func() { - latest := fetchLatestVersion(s.shutdownCtx, s.cfg.GatewayTimeoutMs) + latest := s.fetchLatest(s.shutdownCtx, s.cfg.GatewayTimeoutMs) now := time.Now() s.latestMu.Lock() if latest != "" { s.latestVer = latest } + // Negative caching: stamp latestAt unconditionally so an empty result + // (npm down, decode failure, non-200) suppresses retries for the full + // VersionsTTLSeconds window. This protects the npm registry from + // request storms during outages — at the cost of a slight recovery + // delay (≤ TTL) when npm comes back. NOT analogous to the verAt cold- + // path poisoning fix: that case caches *partial* results from a + // deadline-cancelled in-progress collection; this case caches a + // completed-but-failed external probe. Different concerns, different + // answers. Locked in by TestGetLatestVersionCached_NegativeCaching. s.latestAt = now s.latestRefresh = false s.latestMu.Unlock() diff --git a/internal/appsystem/system_service_test.go b/internal/appsystem/system_service_test.go index b99f3c1..05d9ec1 100644 --- a/internal/appsystem/system_service_test.go +++ b/internal/appsystem/system_service_test.go @@ -118,20 +118,16 @@ func TestParseGatewayStatusJSON_Offline(t *testing.T) { } func TestGetLatestVersionCached_FailureIsNegativelyCached(t *testing.T) { - prev := fetchLatestVersion - defer func() { fetchLatestVersion = prev }() - var calls atomic.Int32 - fetchLatestVersion = func(ctx context.Context, timeoutMs int) string { - calls.Add(1) - return "" - } - svc := NewSystemService(appconfig.SystemConfig{ Enabled: true, VersionsTTLSeconds: 60, GatewayTimeoutMs: 100, }, "test", context.Background()) + svc.fetchLatest = func(ctx context.Context, timeoutMs int) string { + calls.Add(1) + return "" + } _ = svc.getLatestVersionCached() deadline := time.Now().Add(500 * time.Millisecond) diff --git a/main.go b/main.go index 1fb7fd4..3993d0a 100644 --- a/main.go +++ b/main.go @@ -58,6 +58,7 @@ func Main() int { if version == "" { version = detectVersion(cmdCtx, dir) } + version = strings.TrimPrefix(version, "v") cfg := loadConfig(dir) // env var overrides @@ -111,6 +112,7 @@ func Main() int { if version == "" { version = detectVersion(cmdCtx, dir) } + version = strings.TrimPrefix(version, "v") cfg := loadConfig(dir) // Env var defaults diff --git a/web/index.html b/web/index.html index 0a6fc4e..fed6333 100644 --- a/web/index.html +++ b/web/index.html @@ -283,7 +283,7 @@
- + __RUNTIME__ · v__VERSION__ @@ -1710,7 +1710,7 @@ $('skillsGrid').innerHTML=(D.skills||[]).map(s=>{ const c=s.active?'color:var(--green);border-color:rgba(74,222,128,.2);background:rgba(74,222,128,.05)':'color:var(--dim);border-color:var(--border)'; return `${esc(s.name)}`; - }).join(''); + }).join('')||'
No skills configured
'; } // Git @@ -1989,6 +1989,18 @@ if(im=2?2:1;if(im>lm)return 0; return(ld-id)>=2?2:1; }, + _kv(k,v,c='var(--text)'){return `
${esc(k)}${esc(String(v))}
`;}, + renderGatewayDegraded(reason){ + const el=$('gatewayRuntimePanelInner'); + if(el){ + el.innerHTML=[ + this._kv('State','Unavailable','var(--red)'), + this._kv('Source',reason,'var(--muted)'), + ].join(''); + } + window._sysBarActive=false; + window._gwOnlineConfirmed=false; + }, _gatewayState(d) { const gwRuntime=d.openclaw?.gateway||{}; const gwVersion=d.versions?.gateway||{}; @@ -2105,12 +2117,12 @@ const upTxt=gwState.source==='runtime'?this.fmtDurationMs(gwRuntime.uptimeMs)||gwVersion.uptime||'—':gwVersion.uptime||'—'; const probeSource=gwState.source==='runtime'?'live probes':'status fallback'; gwRtEl.innerHTML=[ - kv('State', stateLabel, stateColor), - kv('Version', gwVersion.version||'—'), - kv('PID', gwVersion.pid||'—'), - kv('Uptime', upTxt), - kv('Memory', gwVersion.memory||'—'), - kv('Source', probeSource, 'var(--muted)'), + this._kv('State', stateLabel, stateColor), + this._kv('Version', gwVersion.version||'—'), + this._kv('PID', gwVersion.pid||'—'), + this._kv('Uptime', upTxt), + this._kv('Memory', gwVersion.memory||'—'), + this._kv('Source', probeSource, 'var(--muted)'), ].join(''); } @@ -2118,8 +2130,18 @@ }, async fetch() { if(this._inflight)return; this._inflight=true; - try{const r=await fetch('/api/system');if(r.ok){this.render(await r.json());}else{window._sysBarActive=false;window._gwOnlineConfirmed=false;}} - catch(e){window._sysBarActive=false;window._gwOnlineConfirmed=false;}finally{this._inflight=false;} + // 6000ms client timeout; backend cold-path budget is ~4000ms, this allows + // a small margin for marshal + network jitter before declaring degraded. + const ctl=new AbortController(); + const t=setTimeout(()=>ctl.abort(),6000); + try{ + const r=await fetch('/api/system',{signal:ctl.signal}); + if(r.ok){this.render(await r.json());} + else{this.renderGatewayDegraded('http '+r.status);} + }catch(e){ + const reason=e&&e.name==='AbortError'?'timeout':'network error'; + this.renderGatewayDegraded(reason); + }finally{clearTimeout(t);this._inflight=false;} }, start(ms){ms=Math.max(ms||5000,2000);if(this._timer&&this._lastMs===ms)return;if(this._timer)clearInterval(this._timer);this._lastMs=ms;this.fetch();this._timer=setInterval(()=>this.fetch(),ms);}, _lastMs:0