Make batch workloads first-class#1098

Open: nadaverell wants to merge 2 commits into main from execution-run-logs

@nadaverell nadaverell commented Jul 3, 2026

Summary

Jobs, CronJobs, Argo Workflows, and CronWorkflows are now treated as first-class batch workloads across Radar, not just as generic resources with pod logs bolted on.

This keeps the earlier run-aware logs work and adds the broader product surface:

  • Dedicated drawer + fullscreen execution overview for Job, CronJob, Workflow, and CronWorkflow resources.
  • CronJob/CronWorkflow run history that makes the parent -> run -> pod relationship explicit, including empty/no-retained-run and suspended states.
  • CronWorkflow resource-list, status, and renderer support, giving it read-only parity with Workflow ahead of any Argo write actions.
  • Applications batch signals for failed, suspended, and running batch work without noisy all-green success chips.
  • Topology support for CronWorkflow -> Workflow -> Pod/PodGroup and native CronJob/Job scale cases.
  • Aggregated run progress/count fields for Jobs and Workflows so the UI can scale from one pod to many pods/steps.
  • Logs viewer empty-state left padding fix from the visual pass.

Design Notes

  • Native and Argo batch resources ship together in this PR so that the shared execution UI and data model are shaped by both, rather than by a universal abstraction designed up front.
  • Argo write actions are intentionally not included; CronWorkflow gets a dedicated renderer/status path and read-only execution parity first.
  • Timeline gets no bespoke batch-only surface here. The useful integration is through the same resource/application grouping signals that Timeline should converge on.
  • OSS retention remains simple: Radar reads live Kubernetes objects and pod logs. If Jobs/Workflows/pods are garbage-collected, the UI says that plainly. Cloud can later differentiate with retained logs/history.

Testing

  • npx tsc --noEmit from web/
  • go test ./internal/server
  • (cd pkg && go test ./topology)
  • git diff --check
  • make build

Visual Testing

Ran against real clusters:

  • radar-test-nonprod, namespace radar-batch-visual: native Jobs/CronJobs including running, failed, completed, retained history, suspended/no-run schedule, Applications, and topology.
  • gke_koalabackend_us-east1-b_nonprod-cluster-us-east1: Argo Workflow list/fullscreen and empty CronWorkflow list.

Artifacts are under .playwright-mcp/visual-test/20260704-171002/, including the final post-fix captures:

  • job-running-fullscreen-1920-fixed.png
  • cronjob-runs-fullscreen-1920-fixed.png

Fresh console check after the final build: 0 errors, 0 warnings. Only debug/log messages remained.

Known Gap

No reachable cluster had a live CronWorkflow instance, so CronWorkflow instance rendering is implemented and type-checked but visually covered only at the empty-list level. I did not create a CronWorkflow in nonprod because that can trigger controller-created Workflows.


Note

Medium Risk
Touches log streaming, RBAC on new Argo/batch APIs, and applications/topology aggregation; behavior is mostly additive but incorrect run or selector logic could surface wrong pods or empty logs.

Overview
Jobs and Argo Workflows can use the same aggregated workload-log path as Deployments, with pod selectors resolved from Job specs or Argo’s workflows.argoproj.io/workflow label. MCP get_workload_logs and docs accept job / workflow kinds. When pods are missing, responses include emptyReason, guidance, and kubectl/argo command hints. This covers runs that have already finished, and for Workflows it accounts for archived logs.
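A minimal sketch of what an empty-logs payload along these lines could look like. Only the field names emptyReason and guidance and the idea of kubectl/argo command hints come from the description above. The type names, the function describeEmptyLogs, the reason strings, and the guidance wording are all assumptions for illustration, not the PR's actual code.

```typescript
// Hypothetical shape: emptyReason and guidance are named in the PR text;
// "commands" and every string value below are illustrative assumptions.
type BatchKind = "job" | "workflow";

interface EmptyLogsInfo {
  emptyReason: string;
  guidance: string;
  commands: string[];
}

function describeEmptyLogs(
  kind: BatchKind,
  namespace: string,
  name: string,
  finished: boolean,
): EmptyLogsInfo {
  if (finished) {
    // The run is over and its pods are gone, so point at the parent object.
    return {
      emptyReason: "run_finished_pods_gone",
      guidance:
        kind === "workflow"
          ? "The workflow finished and its pods are gone. Check whether archive logs are configured."
          : "The job finished and its pods are gone. Only the Job object remains.",
      commands: [
        kind === "workflow"
          ? `argo get ${name} -n ${namespace}`
          : `kubectl describe job/${name} -n ${namespace}`,
      ],
    };
  }
  // The run may still be scheduling, so a log command is worth retrying.
  return {
    emptyReason: "no_pods_yet",
    guidance: "No pods match this run yet. It may still be scheduling.",
    commands: [
      kind === "workflow"
        ? `argo logs ${name} -n ${namespace}`
        : `kubectl logs job/${name} -n ${namespace}`,
    ],
  };
}
```

Keeping the reason machine-readable and the guidance human-readable lets the same payload serve both the web UI and MCP callers.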

Scheduled batch gets GET /workloads/.../runs for CronJobs (owned Jobs) and CronWorkflows (labeled Workflows), returning normalized WorkloadRun objects with phases, progress, and pod/step counts. The web app adds run pickers (ScheduledWorkloadLogsViewer, BatchExecutionView), shows batch signals on Applications, and wires logs/topology/resources for Workflow/CronWorkflow kinds.
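To make "normalized WorkloadRun objects with phases, progress, and pod/step counts" concrete, here is a rough sketch of how such an object might aggregate counts into a progress value. The field names total, succeeded, and failed and the helper runProgress are assumptions. The real WorkloadRun type in this PR may look different.

```typescript
// Assumed shape: one "unit" is a pod for a Job or a step for a Workflow.
interface WorkloadRunSketch {
  name: string;
  phase: "Pending" | "Running" | "Succeeded" | "Failed" | "Error";
  active: boolean;
  total: number; // units expected for this run
  succeeded: number;
  failed: number;
}

interface RunProgress {
  done: number; // units that reached a terminal state
  total: number;
  fraction: number; // 0..1, and 0 when total is unknown
}

function runProgress(run: WorkloadRunSketch): RunProgress {
  const done = run.succeeded + run.failed;
  const fraction = run.total > 0 ? Math.min(done / run.total, 1) : 0;
  return { done, total: run.total, fraction };
}
```

Because the counts are aggregated on the run, the UI can render the same progress bar for a single-pod Job and a many-step Workflow.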

Applications API now ingests standalone Jobs, Workflows, and CronWorkflows with optional batch summaries rolled up from child runs; topology adds Workflow/CronWorkflow nodes and pod ownership edges.

Reviewed by Cursor Bugbot for commit c353074. Bugbot is set up for automated code reviews on this repo.

@nadaverell nadaverell requested a review from hisco as a code owner July 3, 2026 23:15
Comment thread internal/server/workload_logs.go Outdated
@nadaverell nadaverell force-pushed the execution-run-logs branch from 1b07139 to fc263f3 Compare July 3, 2026 23:20
Comment thread internal/server/workload_logs.go Outdated
@nadaverell nadaverell force-pushed the execution-run-logs branch 2 times, most recently from 2c036f6 to 0521535 Compare July 3, 2026 23:27
Comment thread internal/server/workload_logs.go Outdated
@nadaverell nadaverell force-pushed the execution-run-logs branch from 0521535 to 0cdd8c3 Compare July 3, 2026 23:44
Comment thread internal/server/workload_logs.go Outdated
@nadaverell nadaverell force-pushed the execution-run-logs branch from 0cdd8c3 to 3e3d80f Compare July 3, 2026 23:56
Comment thread internal/server/workload_logs.go Outdated
@nadaverell nadaverell force-pushed the execution-run-logs branch from 3e3d80f to 2e130c8 Compare July 4, 2026 00:11
Comment thread internal/server/workload_logs.go Outdated
Comment thread internal/mcp/tools_workloads.go
@nadaverell nadaverell force-pushed the execution-run-logs branch from 2e130c8 to 617df7c Compare July 4, 2026 00:27
        phase = "Succeeded"
    case job.Status.Failed > 0:
        phase = "Failed"
    }

Job run phase misclassified

Medium Severity

jobRunInfo treats status.succeeded or status.failed pod counts as terminal success or failure even when JobComplete / JobFailed conditions are not true. Retrying or partially complete CronJob runs can show Failed/Succeeded and active: false while the Job controller is still running.
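One possible shape for the fix, sketched in TypeScript against plain objects that mirror the batch/v1 JobStatus JSON (active, succeeded, failed, conditions). The function name jobRunPhase and the returned shape are assumptions. The point is only that terminal phases come from the Complete / Failed conditions, not from pod counts.

```typescript
interface JobCondition {
  type: string; // e.g. "Complete" or "Failed"
  status: "True" | "False" | "Unknown";
}

interface JobStatusLike {
  active?: number;
  succeeded?: number;
  failed?: number;
  conditions?: JobCondition[];
}

function hasTrueCondition(status: JobStatusLike, type: string): boolean {
  return (status.conditions ?? []).some(
    (c) => c.type === type && c.status === "True",
  );
}

// Hypothetical helper: not the PR's jobRunInfo, just the same decision.
function jobRunPhase(status: JobStatusLike): { phase: string; active: boolean } {
  // Only the controller's conditions mark a Job as terminal.
  if (hasTrueCondition(status, "Complete")) {
    return { phase: "Succeeded", active: false };
  }
  if (hasTrueCondition(status, "Failed")) {
    return { phase: "Failed", active: false };
  }
  // Non-zero counts without a terminal condition mean the Job is still
  // retrying or only partially complete, so it stays active.
  const touched =
    (status.active ?? 0) + (status.succeeded ?? 0) + (status.failed ?? 0);
  return { phase: touched > 0 ? "Running" : "Pending", active: true };
}
```

A Job with one failed pod and a retry in flight then reports Running and active: true until the controller sets a terminal condition. Suspended Jobs are not handled in this sketch.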

Reviewed by Cursor Bugbot for commit 617df7c.

    if !shouldWaitForPodsInLogStream(kind, metadata) {
        sendSSEEvent(w, flusher, "end", workloadLogEndPayload(metadata))
        return
    }

SSE ends on empty pods

Medium Severity

During workload log streaming, rediscovery now ends the SSE stream whenever no pods match and shouldWaitForPodsInLogStream is false. Deployments and StatefulSets can briefly have zero matching pods during rollouts or scale events, so live tails stop instead of reconnecting when pods return.
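The distinction the comment asks for can be sketched as a small decision function. This is not the PR's shouldWaitForPodsInLogStream. The name keepStreamOpenWithoutPods, the kind list, and the runFinished flag are assumptions chosen to illustrate one way of separating long-running workloads from batch runs.

```typescript
type StreamKind =
  | "deployment"
  | "statefulset"
  | "daemonset"
  | "job"
  | "workflow";

// Decide whether a live tail should stay open when rediscovery finds no pods.
function keepStreamOpenWithoutPods(
  kind: StreamKind,
  runFinished: boolean,
): boolean {
  switch (kind) {
    case "job":
    case "workflow":
      // A batch run that has reached a terminal state will not create new
      // pods, so the stream can end. An unfinished run may still schedule.
      return !runFinished;
    default:
      // Long-running workloads can briefly have zero pods during rollouts
      // or scale events, so keep waiting for pods to return.
      return true;
  }
}
```

With this split, an "end" event would only be sent for batch kinds whose run is known to be finished.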

Reviewed by Cursor Bugbot for commit 617df7c.

Comment thread internal/k8s/workload.go
    if job.Spec.Selector == nil {
        return nil, fmt.Errorf("job %s/%s has no pod selector", namespace, name)
    }
    return job.Spec.Selector, nil

Job logs require pod selector

Medium Severity

Job workload logs resolve pods only via job.spec.selector and error when it is nil. Pods for a Job are routinely labeled with batch.kubernetes.io/job-name (as elsewhere in this repo for hook Jobs), so logs can fail while kubectl logs job/... still works.
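A sketch of the fallback described here, using plain objects in place of the Kubernetes types. The helper names jobPodSelector and toSelectorString are hypothetical. Only the batch.kubernetes.io/job-name label comes from the comment above.

```typescript
interface LabelSelectorLike {
  matchLabels?: Record<string, string>;
}

interface JobLike {
  metadata: { name: string };
  spec: { selector?: LabelSelectorLike };
}

// Prefer the Job's own selector, and fall back to the job-name label
// instead of returning an error when the selector is missing.
function jobPodSelector(job: JobLike): LabelSelectorLike {
  return (
    job.spec.selector ?? {
      matchLabels: { "batch.kubernetes.io/job-name": job.metadata.name },
    }
  );
}

// Render matchLabels as a comma-separated label selector string.
// matchExpressions are ignored in this sketch.
function toSelectorString(selector: LabelSelectorLike): string {
  const labels = selector.matchLabels ?? {};
  return Object.keys(labels)
    .map((key) => `${key}=${labels[key]}`)
    .join(",");
}
```

For a Job named nightly with no selector, toSelectorString(jobPodSelector(job)) yields batch.kubernetes.io/job-name=nightly. Clusters that predate this label would need the legacy job-name label instead, which the sketch does not cover.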

Reviewed by Cursor Bugbot for commit 617df7c.

@nadaverell nadaverell changed the title Add run-aware logs for Jobs and Argo Workflows Make batch workloads first-class Jul 4, 2026

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

There are 5 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit c353074.


export function pickDefaultRun(runs: WorkloadRun[]): WorkloadRun | undefined {
  return runs.find((run) => run.active) ?? newestRun(runs)
}

Default run ignores failures

Medium Severity

When no run is active, pickDefaultRun picks the newest run by timestamp only. That disagrees with /runs, which sorts failed/error runs ahead of newer successes, so scheduled log viewers and batch execution UI can open succeeded runs while a retained failure still exists.
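One way to line the default selection up with the ordering described for /runs is to try active runs first, then the newest failed or errored run, then the newest run overall. The sketch below assumes a reduced WorkloadRun shape with phase, active, and startedAt, and the names pickDefaultRunAligned, newestOf, and isFailedRun are hypothetical. Whether a retained failure should really win over a newer success is a product decision that this sketch does not settle.

```typescript
interface RunLike {
  name: string;
  phase: string; // assumed values include "Succeeded", "Failed", "Error"
  active: boolean;
  startedAt: string; // ISO 8601 timestamp
}

function isFailedRun(run: RunLike): boolean {
  return run.phase === "Failed" || run.phase === "Error";
}

// Newest run by start time, or undefined for an empty list.
function newestOf(runs: RunLike[]): RunLike | undefined {
  return runs.reduce<RunLike | undefined>(
    (best, run) =>
      !best || Date.parse(run.startedAt) > Date.parse(best.startedAt)
        ? run
        : best,
    undefined,
  );
}

// Active first, then the newest retained failure, then the newest run.
function pickDefaultRunAligned(runs: RunLike[]): RunLike | undefined {
  return (
    runs.find((run) => run.active) ??
    newestOf(runs.filter(isFailedRun)) ??
    newestOf(runs)
  );
}
```

Sharing one ordering helper between the server sort and the client default would keep the two from drifting apart again.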

Additional Locations (1)
Reviewed by Cursor Bugbot for commit c353074.

            b.LastScheduledAt = run.ScheduledAt
        }
        b.Message = run.Message
    }

Batch latest run by time

Medium Severity

applyRunToBatch sets latestRunPhase from the newest timestamp among retained runs, not using the same active-then-failed-then-newest policy as sortRuns. A newer success can hide an older retained failure in Applications health and batch chips.
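The roll-up can keep two things separate: the most recent schedule time, which really is "newest by timestamp", and the run whose phase and message headline the batch summary, which could follow the active-then-failed-then-newest policy. The sketch below is an assumption-heavy illustration. The field names mirror the ones visible in the excerpt (latestRunPhase, LastScheduledAt, Message), but the types and the functions runRank and summarizeRuns are hypothetical.

```typescript
interface RunSummaryInput {
  phase: string; // assumed values include "Succeeded", "Failed", "Error"
  active: boolean;
  scheduledAt: string; // ISO 8601 timestamp
  message?: string;
}

interface BatchSummarySketch {
  latestRunPhase?: string;
  lastScheduledAt?: string;
  message?: string;
}

// Lower rank is more important: active, then failed/error, then the rest.
function runRank(run: RunSummaryInput): number {
  if (run.active) return 0;
  if (run.phase === "Failed" || run.phase === "Error") return 1;
  return 2;
}

function summarizeRuns(runs: RunSummaryInput[]): BatchSummarySketch {
  let headline: RunSummaryInput | undefined;
  let lastScheduledAt: string | undefined;
  for (const run of runs) {
    const at = Date.parse(run.scheduledAt);
    // The schedule time always tracks the newest run.
    if (!lastScheduledAt || at > Date.parse(lastScheduledAt)) {
      lastScheduledAt = run.scheduledAt;
    }
    // The headline run follows the priority policy, with newest as tiebreak.
    if (
      !headline ||
      runRank(run) < runRank(headline) ||
      (runRank(run) === runRank(headline) &&
        at > Date.parse(headline.scheduledAt))
    ) {
      headline = run;
    }
  }
  return {
    latestRunPhase: headline?.phase,
    lastScheduledAt,
    message: headline?.message,
  };
}
```

Under this policy a newer success no longer hides an older retained failure in the summary, while the "last scheduled" time still reflects the newest run.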

Additional Locations (2)
Reviewed by Cursor Bugbot for commit c353074.
