
Refined Kubernetes scheduler implementation #1316

Open

senseipri wants to merge 2 commits into areal-project:main from senseipri:feature/kubernetes-scheduler

Conversation


@senseipri senseipri commented May 8, 2026

Description

Adds a Kubernetes-backed scheduler implementation for AReaL using StatefulSet-based worker orchestration.

This PR:

  • implements KubernetesScheduler
  • reuses existing HTTP guard APIs
  • adds Kubernetes Python client integration
  • adds pod health diagnostics and rollback handling
  • mirrors local/Slurm scheduler semantics
  • adds Kubernetes documentation (EN + ZH)
  • adds unit and optional integration tests

Related Issue

#Native Kubernetes (K8S) scheduler [CC]

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

Design highlights:

  • Uses StatefulSets for stable worker identity and ordinal mapping (sketched below)
  • Uses Kubernetes Python client instead of kubectl subprocesses
  • Includes scoped pod selectors, diagnostics, and fork rollback semantics
  • Explicitly rejects unsupported Slurm-specific scheduling semantics
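
As a quick illustration of the first point, here is a minimal sketch (not the actual AReaL code; the helper and role names are made up) of how a StatefulSet plus a headless Service yields stable, ordinal-based worker addresses:

def worker_dns(statefulset_name: str, ordinal: int, service_name: str, namespace: str) -> str:
    # Pods of a StatefulSet are named "<statefulset>-<ordinal>", and a headless
    # Service exposes each one at "<pod>.<service>.<namespace>.svc.cluster.local",
    # which gives every worker a deterministic identity across restarts.
    return f"{statefulset_name}-{ordinal}.{service_name}.{namespace}.svc.cluster.local"

# worker_dns("areal-actor", 3, "areal-actor", "default")
# -> "areal-actor-3.areal-actor.default.svc.cluster.local"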

Need help? Check the Contributing Guide or ask in GitHub Discussions!

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the KubernetesScheduler, enabling AReaL to deploy and manage worker groups on Kubernetes using StatefulSets and headless Services. The implementation includes integration with existing trainers, comprehensive documentation in English and Chinese, and new unit and integration tests. The review feedback focuses on improving the robustness of the scheduler by using more specific exception handling for Kubernetes and network requests, and simplifying the code by removing redundant attribute fallbacks.


try:
config.load_incluster_config()
except Exception:
Contributor


medium

The except Exception: is too broad. It's better to catch the specific exception thrown by load_incluster_config(), which is kubernetes.config.ConfigException. This makes the code's intent clearer and avoids catching unexpected errors.

Suggested change
except Exception:
except config.ConfigException:
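
For context, the broader pattern this suggestion fits into is the usual in-cluster/kubeconfig fallback; a minimal sketch, assuming the scheduler is meant to fall back to a local kubeconfig when it is not running inside a pod:

from kubernetes import config

try:
    # Service-account credentials, available when running inside a pod.
    config.load_incluster_config()
except config.ConfigException:
    # Fall back to the local ~/.kube/config when running outside the cluster.
    config.load_kube_config()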

tail_lines=tail_lines,
timestamps=True,
)
except Exception as e:
Contributor


medium

Using except Exception: is too broad. It's better to catch the specific ApiException from the Kubernetes client to avoid masking other unexpected errors.

Suggested change
except Exception as e:
except self._api_exception as e:
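
Assuming the wrapped call is CoreV1Api.read_namespaced_pod_log (the excerpt above only shows its keyword arguments), a standalone sketch of the narrower handler could look like this; the function name and fallback message are illustrative:

from kubernetes import client
from kubernetes.client import ApiException

def tail_pod_logs(core_v1: client.CoreV1Api, pod_name: str, namespace: str, tail_lines: int = 100) -> str:
    try:
        return core_v1.read_namespaced_pod_log(
            name=pod_name,
            namespace=namespace,
            tail_lines=tail_lines,
            timestamps=True,
        )
    except ApiException as e:
        # Only the Kubernetes API error is swallowed; anything else still surfaces.
        return f"<failed to fetch logs for {pod_name}: {e.reason}>"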

namespace=namespace,
field_selector=f"involvedObject.name={pod_name}",
).items
except Exception as e:
Contributor


medium

Using except Exception: is too broad. It's better to catch the specific ApiException from the Kubernetes client to avoid masking other unexpected errors.

Suggested change
except Exception as e:
except self._api_exception as e:

Comment on lines +569 to +571
container_statuses = _obj_get(pod, "status.container_statuses", None)
if container_statuses is None:
container_statuses = _obj_get(pod, "status.containerStatuses", []) or []
Contributor


medium

The official Kubernetes Python client converts API fields from camelCase to snake_case for object attributes. The fallback from container_statuses to containerStatuses is likely unnecessary and can be removed for clarity. A similar simplification can be applied to exit_code vs exitCode on lines 588-602.

Suggested change
container_statuses = _obj_get(pod, "status.container_statuses", None)
if container_statuses is None:
container_statuses = _obj_get(pod, "status.containerStatuses", []) or []
container_statuses = _obj_get(pod, "status.container_statuses", []) or []
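
For reference, with the typed V1Pod objects returned by the official client the access path is snake_case all the way down; a small sketch (helper name is illustrative) of reading a terminated container's exit code without the dual lookup:

def terminated_exit_code(pod) -> int | None:
    # pod is a kubernetes.client.V1Pod; attributes follow snake_case naming.
    for cs in pod.status.container_statuses or []:
        terminated = cs.state.terminated if cs.state else None
        if terminated is not None:
            return terminated.exit_code
    return None

The camelCase keys would only appear if _obj_get were fed raw dicts (for example, manifests parsed from JSON output), which may be why the fallback exists.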

url = f"http://{format_hostport(wi.worker.ip, port)}/health"
try:
return requests.get(url, timeout=2.0).status_code == 200
except Exception:
Contributor


medium

The except Exception: is too broad. It's better to catch requests.RequestException to handle network-related errors specifically, without masking other potential issues.

Suggested change
except Exception:
except requests.RequestException:
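
Pulled together, the narrower handler could look like the sketch below; the helper name is illustrative and format_hostport is replaced by a plain f-string to keep the example self-contained:

import requests

def worker_is_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    url = f"http://{host}:{port}/health"
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        # Covers Timeout, ConnectionError, and other request-level failures
        # without hiding unrelated programming errors.
        return False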

Comment on lines +1192 to +1196
except requests.exceptions.Timeout as e:
last_error = f"Timeout: {e}"
except requests.exceptions.ConnectionError as e:
self._check_pods_health(health_role)
last_error = f"Connection error: {e}"
Contributor


medium

The exception handling for requests errors can be simplified. requests.exceptions.Timeout and requests.exceptions.ConnectionError are both subclasses of requests.exceptions.RequestException. You can catch the base class to cover all request-related errors.

Suggested change
except requests.exceptions.Timeout as e:
last_error = f"Timeout: {e}"
except requests.exceptions.ConnectionError as e:
self._check_pods_health(health_role)
last_error = f"Connection error: {e}"
except requests.exceptions.RequestException as e:
self._check_pods_health(health_role)
last_error = f"Request error: {e}"

Comment on lines +86 to +87
from kubernetes import client, config
from kubernetes.client import ApiException
Collaborator


We could make it a required dependency in pyproject.toml instead of guarding the import.

spec: SchedulingSpec | None = None


class KubernetesClient(Protocol):
Collaborator


I don't think this Protocol is mandatory, since we only have one concrete implementation.

Author


The protocol is mainly for testing and dependency injection. Even though we currently have one implementation, using an interface makes it easier to plug in fake/mock Kubernetes clients in tests without depending on the real client.
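
To make the testing/DI argument concrete, a minimal sketch of how a structural Protocol lets unit tests inject an in-memory fake (method names here are illustrative, not the exact interface in kubernetes.py):

from typing import Any, Protocol

class KubernetesClient(Protocol):
    def create_stateful_set(self, namespace: str, body: dict[str, Any]) -> None: ...
    def list_pods(self, namespace: str, label_selector: str) -> list[Any]: ...

class FakeKubernetesClient:
    """In-memory stand-in for unit tests; no cluster or kubernetes package needed."""

    def __init__(self) -> None:
        self.stateful_sets: list[dict[str, Any]] = []

    def create_stateful_set(self, namespace: str, body: dict[str, Any]) -> None:
        self.stateful_sets.append({"namespace": namespace, **body})

    def list_pods(self, namespace: str, label_selector: str) -> list[Any]:
        return []

Because Protocol uses structural typing, the fake satisfies KubernetesClient without inheriting from it.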

)

service = {
"apiVersion": "v1",
Collaborator


I'm not familiar with k8s, but I usually see API version v2. Why is it hard-coded here? Can we configure it?

Author


Actually, the stable Kubernetes API versions for these resources are still v1 (Service is in the core v1 group and StatefulSet in apps/v1). I hard-coded them intentionally since these resources are part of the long-term stable Kubernetes APIs.
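
For the record, the stable groups look like this in manifest form; the names, image placeholder, and replica count below are illustrative, and neither resource has a v2:

headless_service = {
    "apiVersion": "v1",  # core API group; Services have never graduated past v1
    "kind": "Service",
    "metadata": {"name": "areal-actor"},
    "spec": {"clusterIP": "None", "selector": {"app": "areal-actor"}},
}

stateful_set = {
    "apiVersion": "apps/v1",  # StatefulSets are stable in the apps/v1 group
    "kind": "StatefulSet",
    "metadata": {"name": "areal-actor"},
    "spec": {
        "serviceName": "areal-actor",
        "replicas": 4,
        "selector": {"matchLabels": {"app": "areal-actor"}},
        "template": {
            "metadata": {"labels": {"app": "areal-actor"}},
            "spec": {"containers": [{"name": "worker", "image": "ghcr.io/<org>/<image>:<tag>"}]},
        },
    },
}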

Comment on lines +73 to +74
actor.scheduling_spec[0].image=ghcr.io/<org>/<image>:<tag> \
rollout.scheduling_spec[0].image=ghcr.io/<org>/<image>:<tag>
Collaborator


Why is there a [0] index here?

Author


Currently the Kubernetes scheduler supports one SchedulingSpec per role, so scheduling_spec[0] is used as the shared pod template for all replicas.
The indexing is kept because it matches the existing AReaL config structure.

Collaborator


This test is not very helpful: it only tests initialization and a few method calls against mocks. We'd be better off running integration tests in a real k8s environment and skipping them otherwise.
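
One possible way to express that gating, sketched with a placeholder environment variable (AREAL_TEST_K8S is not an existing project setting):

import os

import pytest

requires_k8s = pytest.mark.skipif(
    not os.environ.get("AREAL_TEST_K8S"),
    reason="integration test; requires a real Kubernetes cluster (set AREAL_TEST_K8S=1)",
)

@requires_k8s
def test_scheduler_creates_statefulset():
    ...  # exercise KubernetesScheduler against the real cluster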

Collaborator

@garrett4wade garrett4wade left a comment


Review Findings

1. Unused health_role in create_engine (MEDIUM)

In areal/infra/scheduler/kubernetes.py, create_engine assigns health_role = self._colocated_roles.get(wi.role, wi.role) but never uses it. Both call_engine and async_call_engine use health_role with self._check_pods_health(health_role) during retry loops, so this looks like a copy-paste leftover where the health check was forgotten.

Suggestion: Either remove the dead assignment, or add a self._check_pods_health(health_role) call before the HTTP request for consistency with call_engine.
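
A sketch of the second option, written as a free-standing helper so it is runnable in isolation (attribute names follow the finding above; the real method signatures in kubernetes.py may differ):

import requests

def post_with_health_check(scheduler, wi, url: str, payload: dict, timeout: float = 30.0):
    # Resolve the colocated role and surface pod failures before issuing the
    # HTTP call, mirroring what call_engine / async_call_engine already do.
    health_role = scheduler._colocated_roles.get(wi.role, wi.role)
    scheduler._check_pods_health(health_role)
    return requests.post(url, json=payload, timeout=timeout)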

2. Redundant exit code extraction in _check_pods_health (MEDIUM)

The exit code is extracted twice with different defaults:

exit_code = int(_obj_get(terminated, "exit_code", _obj_get(terminated, "exitCode", 0)))
if terminated and exit_code != 0:
    exit_code = int(_obj_get(terminated, "exit_code", _obj_get(terminated, "exitCode", -1)))
    raise WorkerFailedError(...)

The second extraction is dead code — if the first returned non-zero, the second returns the identical value. If the first returned 0 (field missing), the branch is skipped. Additionally, a terminated container with no exitCode field silently defaults to 0 (success), which could mask edge cases.

Suggestion: Simplify to a single extraction, and consider defaulting to -1 for terminated containers with missing exit codes:

if terminated:
    exit_code = int(_obj_get(terminated, "exit_code", _obj_get(terminated, "exitCode", -1)))
    if exit_code != 0:
        raise WorkerFailedError(...)

3. PR title typo (LOW)

"schedular" → "scheduler"

4. Residual risk: No RBAC guidance in docs

The docs mention service account permissions but don't provide a sample ClusterRole/RoleBinding manifest. This is the most common setup friction for K8s integrations — users will need to figure out the exact verbs and resources themselves. Consider adding a minimal RBAC example covering Services, StatefulSets, Pods, pod logs, and pod events.
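
An illustrative starting point, expressed in the same dict-manifest style the scheduler already uses; the verbs and resources below are a reasonable guess and should be checked against the scheduler's actual API calls:

rbac_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "areal-scheduler", "namespace": "default"},
    "rules": [
        {   # core API group: Services, Pods, pod logs, and events
            "apiGroups": [""],
            "resources": ["services", "pods", "pods/log", "events"],
            "verbs": ["get", "list", "watch", "create", "delete"],
        },
        {   # apps group: the StatefulSets that back each worker role
            "apiGroups": ["apps"],
            "resources": ["statefulsets"],
            "verbs": ["get", "list", "watch", "create", "patch", "delete"],
        },
    ],
}

rbac_role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "areal-scheduler", "namespace": "default"},
    "subjects": [{"kind": "ServiceAccount", "name": "areal-scheduler", "namespace": "default"}],
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": "areal-scheduler"},
}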

@senseipri
Author

Thanks for the detailed review and suggestions, I will address the requested fixes and push the updates. Really appreciate the feedback on both the Kubernetes integration and the scheduler semantics consistency.

@senseipri senseipri changed the title Refined Kubernetes schedular implementation Refined Kubernetes scheduler implementation May 12, 2026
@senseipri senseipri force-pushed the feature/kubernetes-scheduler branch from a369188 to c8e7d99 Compare May 12, 2026 18:38