
Refined Kubernetes scheduler implementation #1316

Open

senseipri wants to merge 2 commits into areal-project:main from senseipri:feature/kubernetes-scheduler

Conversation


@senseipri senseipri commented May 8, 2026

Description

Adds a Kubernetes-backed scheduler implementation for AReaL using StatefulSet-based worker orchestration.

This PR:

  • implements KubernetesScheduler
  • reuses existing HTTP guard APIs
  • adds Kubernetes Python client integration
  • adds pod health diagnostics and rollback handling
  • mirrors local/Slurm scheduler semantics
  • adds Kubernetes documentation (EN + ZH)
  • adds unit and optional integration tests

Related Issue

#Native Kubernetes (K8S) scheduler [CC]

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

Design highlights:

  • Uses StatefulSets for stable worker identity and ordinal mapping (sketched below)
  • Uses Kubernetes Python client instead of kubectl subprocesses
  • Includes scoped pod selectors, diagnostics, and fork rollback semantics
  • Explicitly rejects unsupported Slurm-specific scheduling semantics
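
As a quick illustration of the first point, here is a minimal sketch (not the actual AReaL code; the helper and role names are made up) of how a StatefulSet plus a headless Service yields stable, ordinal-based worker addresses:

def worker_dns(statefulset_name: str, ordinal: int, service_name: str, namespace: str) -> str:
    # Pods of a StatefulSet are named "<statefulset>-<ordinal>", and a headless
    # Service exposes each one at "<pod>.<service>.<namespace>.svc.cluster.local",
    # which gives every worker a deterministic identity across restarts.
    return f"{statefulset_name}-{ordinal}.{service_name}.{namespace}.svc.cluster.local"

# worker_dns("areal-actor", 3, "areal-actor", "default")
# -> "areal-actor-3.areal-actor.default.svc.cluster.local"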

Need help? Check the Contributing Guide or ask in GitHub Discussions!

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the KubernetesScheduler, enabling AReaL to deploy and manage worker groups on Kubernetes using StatefulSets and headless Services. The implementation includes integration with existing trainers, comprehensive documentation in English and Chinese, and new unit and integration tests. The review feedback focuses on improving the robustness of the scheduler by using more specific exception handling for Kubernetes and network requests, and simplifying the code by removing redundant attribute fallbacks.


try:
config.load_incluster_config()
except Exception:
Contributor


medium

The except Exception: is too broad. It's better to catch the specific exception thrown by load_incluster_config(), which is kubernetes.config.ConfigException. This makes the code's intent clearer and avoids catching unexpected errors.

Suggested change
except Exception:
except config.ConfigException:
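
For context, the broader pattern this suggestion fits into is the usual in-cluster/kubeconfig fallback; a minimal sketch, assuming the scheduler is meant to fall back to a local kubeconfig when it is not running inside a pod:

from kubernetes import config

try:
    # Service-account credentials, available when running inside a pod.
    config.load_incluster_config()
except config.ConfigException:
    # Fall back to the local ~/.kube/config when running outside the cluster.
    config.load_kube_config()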

tail_lines=tail_lines,
timestamps=True,
)
except Exception as e:
Contributor


medium

Using except Exception: is too broad. It's better to catch the specific ApiException from the Kubernetes client to avoid masking other unexpected errors.

Suggested change
except Exception as e:
except self._api_exception as e:
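
Assuming the wrapped call is CoreV1Api.read_namespaced_pod_log (the excerpt above only shows its keyword arguments), a standalone sketch of the narrower handler could look like this; the function name and fallback message are illustrative:

from kubernetes import client
from kubernetes.client import ApiException

def tail_pod_logs(core_v1: client.CoreV1Api, pod_name: str, namespace: str, tail_lines: int = 100) -> str:
    try:
        return core_v1.read_namespaced_pod_log(
            name=pod_name,
            namespace=namespace,
            tail_lines=tail_lines,
            timestamps=True,
        )
    except ApiException as e:
        # Only the Kubernetes API error is swallowed; anything else still surfaces.
        return f"<failed to fetch logs for {pod_name}: {e.reason}>"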

namespace=namespace,
field_selector=f"involvedObject.name={pod_name}",
).items
except Exception as e:
Contributor


medium

Using except Exception: is too broad. It's better to catch the specific ApiException from the Kubernetes client to avoid masking other unexpected errors.

Suggested change
except Exception as e:
except self._api_exception as e:

Comment on lines +569 to +571
container_statuses = _obj_get(pod, "status.container_statuses", None)
if container_statuses is None:
container_statuses = _obj_get(pod, "status.containerStatuses", []) or []
Contributor


medium

The official Kubernetes Python client converts API fields from camelCase to snake_case for object attributes. The fallback from container_statuses to containerStatuses is likely unnecessary and can be removed for clarity. A similar simplification can be applied to exit_code vs exitCode on lines 588-602.

Suggested change
container_statuses = _obj_get(pod, "status.container_statuses", None)
if container_statuses is None:
container_statuses = _obj_get(pod, "status.containerStatuses", []) or []
container_statuses = _obj_get(pod, "status.container_statuses", []) or []
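
For reference, with the typed V1Pod objects returned by the official client the access path is snake_case all the way down; a small sketch (helper name is illustrative) of reading a terminated container's exit code without the dual lookup:

def terminated_exit_code(pod) -> int | None:
    # pod is a kubernetes.client.V1Pod; attributes follow snake_case naming.
    for cs in pod.status.container_statuses or []:
        terminated = cs.state.terminated if cs.state else None
        if terminated is not None:
            return terminated.exit_code
    return None

The camelCase keys would only appear if _obj_get were fed raw dicts (for example, manifests parsed from JSON output), which may be why the fallback exists.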

url = f"http://{format_hostport(wi.worker.ip, port)}/health"
try:
return requests.get(url, timeout=2.0).status_code == 200
except Exception:
Contributor


medium

The except Exception: is too broad. It's better to catch requests.RequestException to handle network-related errors specifically, without masking other potential issues.

Suggested change
except Exception:
except requests.RequestException:
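
Pulled together, the narrower handler could look like the sketch below; the helper name is illustrative and format_hostport is replaced by a plain f-string to keep the example self-contained:

import requests

def worker_is_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    url = f"http://{host}:{port}/health"
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        # Covers Timeout, ConnectionError, and other request-level failures
        # without hiding unrelated programming errors.
        return False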

Comment on lines +1192 to +1196
except requests.exceptions.Timeout as e:
last_error = f"Timeout: {e}"
except requests.exceptions.ConnectionError as e:
self._check_pods_health(health_role)
last_error = f"Connection error: {e}"
Contributor


medium

The exception handling for requests errors can be simplified. requests.exceptions.Timeout and requests.exceptions.ConnectionError are both subclasses of requests.exceptions.RequestException. You can catch the base class to cover all request-related errors.

Suggested change
except requests.exceptions.Timeout as e:
last_error = f"Timeout: {e}"
except requests.exceptions.ConnectionError as e:
self._check_pods_health(health_role)
last_error = f"Connection error: {e}"
except requests.exceptions.RequestException as e:
self._check_pods_health(health_role)
last_error = f"Request error: {e}"

Comment on lines +86 to +87
from kubernetes import client, config
from kubernetes.client import ApiException
Collaborator


We could make it a required dependency in pyproject.toml instead of guarding the import.

spec: SchedulingSpec | None = None


class KubernetesClient(Protocol):
Collaborator


I don't think this Protocol is mandatory, since we only have one concrete implementation.

Author


The protocol is mainly for testing and dependency injection. Even though we currently have one implementation, using an interface makes it easier to plug in fake/mock Kubernetes clients in tests without depending on the real client.
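
To make the testing/DI argument concrete, a minimal sketch of how a structural Protocol lets unit tests inject an in-memory fake (method names here are illustrative, not the exact interface in kubernetes.py):

from typing import Any, Protocol

class KubernetesClient(Protocol):
    def create_stateful_set(self, namespace: str, body: dict[str, Any]) -> None: ...
    def list_pods(self, namespace: str, label_selector: str) -> list[Any]: ...

class FakeKubernetesClient:
    """In-memory stand-in for unit tests; no cluster or kubernetes package needed."""

    def __init__(self) -> None:
        self.stateful_sets: list[dict[str, Any]] = []

    def create_stateful_set(self, namespace: str, body: dict[str, Any]) -> None:
        self.stateful_sets.append({"namespace": namespace, **body})

    def list_pods(self, namespace: str, label_selector: str) -> list[Any]:
        return []

Because Protocol uses structural typing, the fake satisfies KubernetesClient without inheriting from it.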

)

service = {
"apiVersion": "v1",
Collaborator


I'm not familiar with k8s, but I usually see API version v2. Why is it hard-coded here? Can we configure it?

Author


Actually, the stable Kubernetes API versions for these resources are still v1 (Service is in the core v1 group and StatefulSet in apps/v1). I hard-coded them intentionally since these resources are part of the long-term stable Kubernetes APIs.
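
For the record, the stable groups look like this in manifest form; the names, image placeholder, and replica count below are illustrative, and neither resource has a v2:

headless_service = {
    "apiVersion": "v1",  # core API group; Services have never graduated past v1
    "kind": "Service",
    "metadata": {"name": "areal-actor"},
    "spec": {"clusterIP": "None", "selector": {"app": "areal-actor"}},
}

stateful_set = {
    "apiVersion": "apps/v1",  # StatefulSets are stable in the apps/v1 group
    "kind": "StatefulSet",
    "metadata": {"name": "areal-actor"},
    "spec": {
        "serviceName": "areal-actor",
        "replicas": 4,
        "selector": {"matchLabels": {"app": "areal-actor"}},
        "template": {
            "metadata": {"labels": {"app": "areal-actor"}},
            "spec": {"containers": [{"name": "worker", "image": "ghcr.io/<org>/<image>:<tag>"}]},
        },
    },
}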

Comment on lines +73 to +74
actor.scheduling_spec[0].image=ghcr.io/<org>/<image>:<tag> \
rollout.scheduling_spec[0].image=ghcr.io/<org>/<image>:<tag>
Collaborator


Why is there a [0] index here?

Author


Currently the Kubernetes scheduler supports one SchedulingSpec per role, so scheduling_spec[0] is used as the shared pod template for all replicas.
The indexing is kept because it matches the existing AReaL config structure.

Collaborator


This test is not very helpful: it only tests initialization and a few method calls against mocks. We'd be better off running integration tests in a real k8s environment and skipping them otherwise.
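
One possible way to express that gating, sketched with a placeholder environment variable (AREAL_TEST_K8S is not an existing project setting):

import os

import pytest

requires_k8s = pytest.mark.skipif(
    not os.environ.get("AREAL_TEST_K8S"),
    reason="integration test; requires a real Kubernetes cluster (set AREAL_TEST_K8S=1)",
)

@requires_k8s
def test_scheduler_creates_statefulset():
    ...  # exercise KubernetesScheduler against the real cluster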

Collaborator

@garrett4wade garrett4wade left a comment


Review Findings

1. Unused health_role in create_engine (MEDIUM)

In areal/infra/scheduler/kubernetes.py, create_engine assigns health_role = self._colocated_roles.get(wi.role, wi.role) but never uses it. Both call_engine and async_call_engine use health_role with self._check_pods_health(health_role) during retry loops, so this looks like a copy-paste leftover where the health check was forgotten.

Suggestion: Either remove the dead assignment, or add a self._check_pods_health(health_role) call before the HTTP request for consistency with call_engine.
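
A sketch of the second option, written as a free-standing helper so it is runnable in isolation (attribute names follow the finding above; the real method signatures in kubernetes.py may differ):

import requests

def post_with_health_check(scheduler, wi, url: str, payload: dict, timeout: float = 30.0):
    # Resolve the colocated role and surface pod failures before issuing the
    # HTTP call, mirroring what call_engine / async_call_engine already do.
    health_role = scheduler._colocated_roles.get(wi.role, wi.role)
    scheduler._check_pods_health(health_role)
    return requests.post(url, json=payload, timeout=timeout)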

2. Redundant exit code extraction in _check_pods_health (MEDIUM)

The exit code is extracted twice with different defaults:

exit_code = int(_obj_get(terminated, "exit_code", _obj_get(terminated, "exitCode", 0)))
if terminated and exit_code != 0:
    exit_code = int(_obj_get(terminated, "exit_code", _obj_get(terminated, "exitCode", -1)))
    raise WorkerFailedError(...)

The second extraction is dead code — if the first returned non-zero, the second returns the identical value. If the first returned 0 (field missing), the branch is skipped. Additionally, a terminated container with no exitCode field silently defaults to 0 (success), which could mask edge cases.

Suggestion: Simplify to a single extraction, and consider defaulting to -1 for terminated containers with missing exit codes:

if terminated:
    exit_code = int(_obj_get(terminated, "exit_code", _obj_get(terminated, "exitCode", -1)))
    if exit_code != 0:
        raise WorkerFailedError(...)

3. PR title typo (LOW)

"schedular" → "scheduler"

4. Residual risk: No RBAC guidance in docs

The docs mention service account permissions but don't provide a sample ClusterRole/RoleBinding manifest. This is the most common setup friction for K8s integrations — users will need to figure out the exact verbs and resources themselves. Consider adding a minimal RBAC example covering Services, StatefulSets, Pods, pod logs, and pod events.
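
An illustrative starting point, expressed in the same dict-manifest style the scheduler already uses; the verbs and resources below are a reasonable guess and should be checked against the scheduler's actual API calls:

rbac_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "areal-scheduler", "namespace": "default"},
    "rules": [
        {   # core API group: Services, Pods, pod logs, and events
            "apiGroups": [""],
            "resources": ["services", "pods", "pods/log", "events"],
            "verbs": ["get", "list", "watch", "create", "delete"],
        },
        {   # apps group: the StatefulSets that back each worker role
            "apiGroups": ["apps"],
            "resources": ["statefulsets"],
            "verbs": ["get", "list", "watch", "create", "patch", "delete"],
        },
    ],
}

rbac_role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "areal-scheduler", "namespace": "default"},
    "subjects": [{"kind": "ServiceAccount", "name": "areal-scheduler", "namespace": "default"}],
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": "areal-scheduler"},
}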

@senseipri
Author

Thanks for the detailed review and suggestions, I will address the requested fixes and push the updates. Really appreciate the feedback on both the Kubernetes integration and the scheduler semantics consistency.

@senseipri senseipri changed the title Refined Kubernetes schedular implementation Refined Kubernetes scheduler implementation May 12, 2026
@senseipri senseipri force-pushed the feature/kubernetes-scheduler branch from a369188 to c8e7d99 Compare May 12, 2026 18:38