Make batch workloads first-class by nadaverell · Pull Request #1098 · skyhook-io/radar

nadaverell · 2026-07-03T23:15:37Z

Summary

Jobs, CronJobs, Argo Workflows, and CronWorkflows are now treated as first-class batch workloads across Radar, not just as generic resources with pod logs bolted on.

This keeps the earlier run-aware logs work and adds the broader product surface:

Dedicated drawer + fullscreen execution overview for Job, CronJob, Workflow, and CronWorkflow resources.
CronJob/CronWorkflow run history that makes the parent -> run -> pod relationship explicit, including empty/no-retained-run and suspended states.
CronWorkflow resource-list/status/renderer support, read-only parity with Workflow before Argo write actions.
Applications batch signals for failed, suspended, and running batch work without noisy all-green success chips.
Topology support for CronWorkflow -> Workflow -> Pod/PodGroup and native CronJob/Job scale cases.
Aggregated run progress/count fields for Jobs and Workflows so the UI can scale from one pod to many pods/steps.
Logs viewer empty-state left padding fix from the visual pass.

Design Notes

Native and Argo batch resources ship together in this PR so that the shared execution UI and data model are shaped by both, rather than committing to a universal abstraction too early.
Argo write actions are intentionally not included; CronWorkflow gets a dedicated renderer/status path and read-only execution parity first.
Timeline gets no bespoke batch-only surface here. The useful integration is through the same resource/application grouping signals that Timeline should converge on.
OSS retention remains simple: Radar reads live Kubernetes objects and pod logs. If Jobs/Workflows/pods are garbage-collected, the UI says that plainly. Cloud can later differentiate with retained logs/history.

Testing

npx tsc --noEmit from web/
go test ./internal/server
(cd pkg && go test ./topology)
git diff --check
make build

Visual Testing

Ran against real clusters:

radar-test-nonprod, namespace radar-batch-visual: native Jobs/CronJobs including running, failed, completed, retained history, suspended/no-run schedule, Applications, and topology.
gke_koalabackend_us-east1-b_nonprod-cluster-us-east1: Argo Workflow list/fullscreen and empty CronWorkflow list.

Artifacts are under .playwright-mcp/visual-test/20260704-171002/, including the final post-fix captures:

job-running-fullscreen-1920-fixed.png
cronjob-runs-fullscreen-1920-fixed.png

Fresh console check after the final build: 0 errors, 0 warnings. Only debug/log messages remained.

Known Gap

No reachable cluster had a live CronWorkflow instance, so CronWorkflow instance rendering is implemented and type-checked but visually covered only at the empty-list level. I did not create a CronWorkflow in nonprod because that can trigger controller-created Workflows.

Note

Medium Risk
Touches log streaming, RBAC on new Argo/batch APIs, and applications/topology aggregation; behavior is mostly additive, but incorrect run or selector logic could surface the wrong pods or empty logs.

Overview
Jobs and Argo Workflows can use the same aggregated workload-log path as Deployments, with pod selectors resolved from Job specs or Argo’s workflows.argoproj.io/workflow label. MCP get_workload_logs and the docs accept job / workflow kinds. When pods are missing, responses include emptyReason, guidance, and kubectl/argo command hints, covering runs that have already finished and, for Workflows, awareness of archived logs.

Scheduled batch gets GET /workloads/.../runs for CronJobs (owned Jobs) and CronWorkflows (labeled Workflows), returning normalized WorkloadRun objects with phases, progress, and pod/step counts. The web app adds run pickers (ScheduledWorkloadLogsViewer, BatchExecutionView), shows batch signals on Applications, and wires logs/topology/resources for Workflow/CronWorkflow kinds.
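The parent-to-run lookup described above can be sketched as follows. This is a minimal illustration, not Radar's implementation: the `Resource` shape and both function names are hypothetical, and it assumes CronJob runs are found through `ownerReferences` and CronWorkflow runs through Argo's `workflows.argoproj.io/cron-workflow` label (the PR text only says "labeled Workflows").

```typescript
// Hypothetical, trimmed-down Kubernetes object shapes for illustration only.
interface OwnerReference {
  kind: string;
  name: string;
  uid: string;
}

interface Resource {
  metadata: {
    name: string;
    uid: string;
    labels?: Record<string, string>;
    ownerReferences?: OwnerReference[];
  };
}

// Assumption: the label Argo sets on Workflows created by a CronWorkflow.
const CRON_WORKFLOW_LABEL = "workflows.argoproj.io/cron-workflow";

// Jobs created by a CronJob carry an ownerReference pointing at its UID.
function jobsOwnedByCronJob(cronJob: Resource, jobs: Resource[]): Resource[] {
  return jobs.filter((job) =>
    (job.metadata.ownerReferences ?? []).some(
      (ref) => ref.kind === "CronJob" && ref.uid === cronJob.metadata.uid,
    ),
  );
}

// Workflows created by a CronWorkflow are matched by label value.
function workflowsForCronWorkflow(name: string, workflows: Resource[]): Resource[] {
  return workflows.filter((wf) => wf.metadata.labels?.[CRON_WORKFLOW_LABEL] === name);
}
```

Matching on the owner UID rather than the name avoids picking up Jobs left behind by a deleted CronJob that was later recreated under the same name.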

Applications API now ingests standalone Jobs, Workflows, and CronWorkflows with optional batch summaries rolled up from child runs; topology adds Workflow/CronWorkflow nodes and pod ownership edges.

Reviewed by Cursor Bugbot for commit c353074.

cursor · 2026-07-04T00:28:33Z

+		phase = "Succeeded"
+	case job.Status.Failed > 0:
+		phase = "Failed"
+	}


Job run phase misclassified

Medium Severity

jobRunInfo treats status.succeeded or status.failed pod counts as terminal success or failure even when JobComplete / JobFailed conditions are not true. Retrying or partially complete CronJob runs can show Failed/Succeeded and active: false while the Job controller is still running.
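One way to address this, sketched under assumptions: derive terminal phases only from the Job's `Complete` / `Failed` conditions and use pod counts solely to tell a started run from a pending one. The types and the `jobRunPhase` name are hypothetical and are not the PR's `jobRunInfo`; the phase strings mirror those in the diff.

```typescript
// Hypothetical shapes mirroring the Kubernetes Job status fields used here.
interface JobCondition {
  type: string; // e.g. "Complete" or "Failed"
  status: string; // "True", "False", or "Unknown"
}

interface JobStatus {
  active?: number;
  succeeded?: number;
  failed?: number;
  conditions?: JobCondition[];
}

type RunPhase = "Succeeded" | "Failed" | "Running" | "Pending";

function hasTrueCondition(status: JobStatus, type: string): boolean {
  return (status.conditions ?? []).some((c) => c.type === type && c.status === "True");
}

// Terminal phases come only from conditions; pod counts never imply completion.
function jobRunPhase(status: JobStatus): RunPhase {
  if (hasTrueCondition(status, "Complete")) return "Succeeded";
  if (hasTrueCondition(status, "Failed")) return "Failed";
  const started = (status.active ?? 0) + (status.succeeded ?? 0) + (status.failed ?? 0);
  return started > 0 ? "Running" : "Pending";
}
```

With this shape, a Job that has one failed pod and one retry in flight reports `Running` until the controller sets a terminal condition.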

Reviewed by Cursor Bugbot for commit 617df7c.

cursor · 2026-07-04T00:28:33Z

+				if !shouldWaitForPodsInLogStream(kind, metadata) {
+					sendSSEEvent(w, flusher, "end", workloadLogEndPayload(metadata))
+					return
+				}


SSE ends on empty pods

Medium Severity

During workload log streaming, rediscovery now ends the SSE stream whenever no pods match and shouldWaitForPodsInLogStream is false. Deployments and StatefulSets can briefly have zero matching pods during rollouts or scale events, so live tails stop instead of reconnecting when pods return.
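A sketch of the decision this comment argues for, under stated assumptions: only batch kinds whose run has already finished end the stream when no pods match, while long-running kinds keep waiting for pods to return. The `StreamState` shape and `onNoMatchingPods` are hypothetical and do not reflect the actual semantics of `shouldWaitForPodsInLogStream`.

```typescript
// Hypothetical kinds and state; Radar's real metadata type is not shown in the PR.
type WorkloadKind = "Deployment" | "StatefulSet" | "DaemonSet" | "Job" | "Workflow";

interface StreamState {
  kind: WorkloadKind;
  runFinished: boolean; // true once a batch run has reached a terminal phase
}

type NoPodsAction = "end" | "wait";

// Decide what to do when pod rediscovery finds no matching pods.
function onNoMatchingPods(state: StreamState): NoPodsAction {
  const isBatch = state.kind === "Job" || state.kind === "Workflow";
  // Long-running workloads can briefly have zero pods during rollouts, so keep tailing.
  return isBatch && state.runFinished ? "end" : "wait";
}
```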

Reviewed by Cursor Bugbot for commit 617df7c.

cursor · 2026-07-04T00:28:33Z

+		if job.Spec.Selector == nil {
+			return nil, fmt.Errorf("job %s/%s has no pod selector", namespace, name)
+		}
+		return job.Spec.Selector, nil


Job logs require pod selector

Medium Severity

Job workload logs resolve pods only via job.spec.selector and error when it is nil. Pods for a Job are routinely labeled with batch.kubernetes.io/job-name (as elsewhere in this repo for hook Jobs), so logs can fail while kubectl logs job/... still works.
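A possible fallback, as a minimal sketch: prefer the Job's own `spec.selector` labels and otherwise select pods by the `batch.kubernetes.io/job-name` label mentioned above. The `JobLike` shape and both helper names are hypothetical, and `matchExpressions` are deliberately ignored to keep the example short.

```typescript
// Hypothetical, trimmed-down Job shape for illustration.
interface LabelSelector {
  matchLabels?: Record<string, string>;
}

interface JobLike {
  metadata: { name: string; namespace: string };
  spec?: { selector?: LabelSelector };
}

const JOB_NAME_LABEL = "batch.kubernetes.io/job-name";

// Use the Job's selector when present; otherwise fall back to the job-name label.
function jobPodMatchLabels(job: JobLike): Record<string, string> {
  const labels = job.spec?.selector?.matchLabels;
  if (labels && Object.keys(labels).length > 0) return labels;
  return { [JOB_NAME_LABEL]: job.metadata.name };
}

// Render labels in the key=value,key=value form accepted by label-selector queries.
function toLabelSelectorString(labels: Record<string, string>): string {
  return Object.entries(labels)
    .map(([key, value]) => `${key}=${value}`)
    .join(",");
}
```

For a Job named `migrate` with no selector, the resulting selector string is `batch.kubernetes.io/job-name=migrate`.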

Reviewed by Cursor Bugbot for commit 617df7c.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

There are 5 total unresolved issues (including 3 from previous reviews).


Reviewed by Cursor Bugbot for commit c353074.

cursor · 2026-07-04T14:34:51Z

+
+export function pickDefaultRun(runs: WorkloadRun[]): WorkloadRun | undefined {
+  return runs.find((run) => run.active) ?? newestRun(runs)
+}


Default run ignores failures

Medium Severity

When no run is active, pickDefaultRun picks the newest run by timestamp only. That disagrees with /runs, which sorts failed/error runs ahead of newer successes, so scheduled log viewers and batch execution UI can open succeeded runs while a retained failure still exists.

Additional Locations (1)

web/src/components/logs/ScheduledWorkloadLogsViewer.tsx#L18-L28
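A sketch of a default-run picker that follows the active-then-failed-then-newest order described for /runs. The `WorkloadRun` fields shown here and the `Failed` / `Error` phase names are assumptions for illustration; the real type in the PR may differ.

```typescript
// Hypothetical subset of the WorkloadRun fields needed for selection.
interface WorkloadRun {
  name: string;
  active: boolean;
  phase: string;
  startedAt?: string; // RFC 3339 timestamp
}

// Assumption: these phase names mark a failed run.
const FAILED_PHASES = new Set(["Failed", "Error"]);

function startedMs(run: WorkloadRun): number {
  const ms = Date.parse(run.startedAt ?? "");
  return Number.isNaN(ms) ? 0 : ms;
}

// Returns the run with the latest start time, or undefined for an empty list.
function newestRun(runs: WorkloadRun[]): WorkloadRun | undefined {
  return runs.reduce<WorkloadRun | undefined>(
    (best, run) => (best === undefined || startedMs(run) > startedMs(best) ? run : best),
    undefined,
  );
}

// Prefer an active run, then a retained failure, then the newest run overall.
function pickDefaultRun(runs: WorkloadRun[]): WorkloadRun | undefined {
  return (
    newestRun(runs.filter((run) => run.active)) ??
    newestRun(runs.filter((run) => FAILED_PHASES.has(run.phase))) ??
    newestRun(runs)
  );
}
```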

Reviewed by Cursor Bugbot for commit c353074.

cursor · 2026-07-04T14:34:51Z

+			b.LastScheduledAt = run.ScheduledAt
+		}
+		b.Message = run.Message
+	}


Batch latest run by time

Medium Severity

applyRunToBatch sets latestRunPhase from the newest timestamp among retained runs rather than applying the same active-then-failed-then-newest policy as sortRuns. A newer success can hide an older retained failure in Applications health and batch chips.

Additional Locations (2)

internal/server/applications.go#L727-L743

packages/k8s-ui/src/utils/applications.ts#L766-L787
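To keep the Applications rollup aligned with the run ordering, the summary could pick its representative run with the same ranking. The sketch below is illustrative only: `RunInput`, `BatchSummary`, `runRank`, and `summarizeRuns` are hypothetical names, and only `latestRunPhase` comes from the comment above.

```typescript
// Hypothetical input and summary shapes for illustration.
interface RunInput {
  phase: string;
  active: boolean;
  startedAt: string; // RFC 3339 timestamp
}

interface BatchSummary {
  latestRunPhase?: string;
  failedRuns: number;
  activeRuns: number;
}

// Assumption: these phase names mark a failed run.
const isFailed = (run: RunInput): boolean => run.phase === "Failed" || run.phase === "Error";

// Lower rank sorts first: active, then failed, then everything else.
function runRank(run: RunInput): number {
  if (run.active) return 0;
  return isFailed(run) ? 1 : 2;
}

function summarizeRuns(runs: RunInput[]): BatchSummary {
  // Sort a copy by rank, breaking ties with the newest start time first.
  const ordered = [...runs].sort(
    (a, b) => runRank(a) - runRank(b) || Date.parse(b.startedAt) - Date.parse(a.startedAt),
  );
  return {
    latestRunPhase: ordered[0]?.phase,
    failedRuns: runs.filter(isFailed).length,
    activeRuns: runs.filter((run) => run.active).length,
  };
}
```

Sharing one ranking function between the list order and the summary means a retained failure stays visible in both places until it is garbage-collected or superseded by an active run.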

Reviewed by Cursor Bugbot for commit c353074.

nadaverell requested a review from hisco as a code owner July 3, 2026 23:15

cursor Bot reviewed Jul 3, 2026
