Support custom resource configuration for the submit pod #3690

Open · wants to merge 1 commit into base: master
Conversation

xianlubird
Contributor
Why are these changes needed?

This PR introduces support for customizing resource requests and limits for the submit pod.

  1. In CPU-constrained clusters, users may wish to reduce the submit pod's CPU request (e.g., to 400m) to avoid scheduling delays or resource exhaustion. Previously, adjusting submit pod resources required modifying the codebase and rebuilding the binary, which was inconvenient and error-prone.

  2. For large-scale clusters with high task throughput, the ability to configure submit pod resources dynamically is essential for stability and operational flexibility.

  3. The community-provided default submitterTemplate requires users to fully override the entire pod spec — including image name, startup arguments, and other fields — even when they only want to change the resource settings. This is unnecessarily complex for most use cases where only minor adjustments to resource limits are needed.

This enhancement provides a cleaner and more configurable approach to submit pod resource tuning.

A RayJob using the new field looks like this:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  submitterConfig:
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  rayClusterSpec:
    rayVersion: '2.41.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: harbor.weizhipin.com/arsenal-oceanus/ray:2.41.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower-case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: harbor.weizhipin.com/arsenal-oceanus/ray:2.41.0
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"

###################### Ray code sample #################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"

Checks

  - [x] I've made sure the tests are passing.

@xianlubird
Contributor Author
@kevin85421 PTAL

@kevin85421
Member
Can you use submitterPodTemplate instead?

@xianlubird
Contributor Author
@kevin85421 Yes, I can.

However, there is a problem with using submitterPodTemplate: it requires writing out the entire pod configuration, including the command, image, resource limits, and so on. In practice, most users do not want to change the command or image; they only want to adjust the resource configuration.

Please help evaluate whether this requirement is reasonable.
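The difference between the two approaches can be sketched as a merge rule: the controller keeps its built-in submitter container defaults and overlays only the resources block from submitterConfig, leaving the image and command untouched. The snippet below is a minimal Python illustration of that rule, not KubeRay's actual implementation; DEFAULT_SUBMITTER_CONTAINER, its default values, and apply_submitter_resources are hypothetical names for this sketch.

```python
import copy

# Hypothetical stand-in for the controller's built-in submitter container.
DEFAULT_SUBMITTER_CONTAINER = {
    "name": "ray-job-submitter",
    "image": "rayproject/ray:2.41.0",
    "command": ["ray", "job", "submit"],
    "resources": {
        "requests": {"cpu": "500m", "memory": "200Mi"},
        "limits": {"cpu": "1", "memory": "1Gi"},
    },
}

def apply_submitter_resources(submitter_config):
    """Overlay only the resources block from submitterConfig onto the
    default submitter container; image and command are never touched."""
    container = copy.deepcopy(DEFAULT_SUBMITTER_CONTAINER)
    resources = (submitter_config or {}).get("resources")
    if resources:
        container["resources"] = copy.deepcopy(resources)
    return container

# With the submitterConfig from the example RayJob above:
container = apply_submitter_resources({
    "resources": {
        "requests": {"cpu": "250m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"},
    }
})
```

By contrast, submitterPodTemplate replaces the whole pod spec, so every field shown in the default (image, command, and so on) must be restated even when only the resources change.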
