[draft pr][RayJob] Use timeout to prevent RayCluster leak #4090
Draft
Future-Outlier wants to merge 2 commits into ray-project:master from Future-Outlier:rayjob-raycluster-leak-idea-2-timeout
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
cc @machichima

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-external-redis-2-49-0-2
spec:
  # submissionMode specifies how RayJob submits the Ray job to the RayCluster.
  # The default value is "K8sJobMode", meaning RayJob will submit the Ray job via a submitter Kubernetes Job.
  # The alternative value is "HTTPMode", indicating that KubeRay will submit the Ray job by sending an HTTP request to the RayCluster.
  # submissionMode: "K8sJobMode"
  entrypoint: python -c "import os, time; print(os.environ.get('HOSTNAME')); [time.sleep(1) or print(i) for i in range(1000)]"
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false
  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10
  # activeDeadlineSeconds is the duration in seconds that the RayJob may be active before
  # KubeRay actively tries to terminate the RayJob; value must be positive integer.
  # activeDeadlineSeconds: 120
  # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string.
  # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details.
  # (New in KubeRay version 1.0.)
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  # suspend: false
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.46.0' # should match the Ray version in the image of the containers
    gcsFaultToleranceOptions:
      # In most cases, you don't need to set `externalStorageNamespace` because KubeRay will
      # automatically set it to the UID of RayCluster. Only modify this annotation if you fully understand
      # the behaviors of the Ray GCS FT and RayService to avoid misconfiguration.
      # [Example]:
      # externalStorageNamespace: "my-raycluster-storage"
      redisAddress: "redis:6379"
      redisPassword:
        valueFrom:
          secretKeyRef:
            name: redis-password-secret
            key: password
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          # terminationGracePeriodSeconds: 1203
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
              lifecycle:
                preStop:
                  exec:
                    command: ["python -c 'import ray; ray.shutdown()'"]
              ports:
                - containerPort: 6379
                  name: redis
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
                - mountPath: /home/ray/samples
                  name: ray-example-configmap
          volumes:
            - name: ray-logs
              emptyDir: {}
            - name: ray-example-configmap
              configMap:
                name: ray-example
                defaultMode: 0777
                items:
                  - key: detached_actor.py
                    path: detached_actor.py
                  - key: increment_counter.py
                    path: increment_counter.py
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 10
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0
                volumeMounts:
                  - mountPath: /tmp/ray
                    name: ray-logs
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "1"
            volumes:
              - name: ray-logs
                emptyDir: {}
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-config
  labels:
    app: redis
data:
  redis.conf: |-
    dir /data
    port 6379
    bind 0.0.0.0
    appendonly yes
    protected-mode no
    requirepass 5241590000000000
    pidfile /data/redis-6379.pid
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7.4.0
          command:
            - "sh"
            - "-c"
            - "redis-server /usr/local/etc/redis/redis.conf"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: config
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
      volumes:
        - name: config
          configMap:
            name: redis-config
---
# Redis password
apiVersion: v1
kind: Secret
metadata:
  name: redis-password-secret
type: Opaque
data:
  # echo -n "5241590000000000" | base64
  password: NTI0MTU5MDAwMDAwMDAwMA==
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray
    @ray.remote(num_cpus=1)
    class Counter:
        def __init__(self):
            self.value = 0
        def increment(self):
            self.value += 1
            return self.value
    ray.init(namespace="default_namespace")
    Counter.options(name="counter_actor", lifetime="detached").remote()
  increment_counter.py: |
    import ray
    ray.init(namespace="default_namespace")
    counter = ray.get_actor("counter_actor")
    print(ray.get(counter.increment.remote()))
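
When reusing this manifest, the `password` value in `redis-password-secret` has to stay in sync with the `requirepass` line in `redis-config`, because the Secret stores it base64-encoded. A minimal sketch of that check using only the Python standard library follows; the two constants are copied from the manifest, and the helper name `decode_secret` is hypothetical, not part of KubeRay or Kubernetes.

```python
import base64

# Values copied from the manifest in this comment.
REDIS_REQUIREPASS = "5241590000000000"            # redis.conf: requirepass
SECRET_PASSWORD_B64 = "NTI0MTU5MDAwMDAwMDAwMA=="  # Secret data.password


def decode_secret(value_b64: str) -> str:
    """Decode a base64 value as stored under `data` in a Kubernetes Secret."""
    return base64.b64decode(value_b64).decode("utf-8")


decoded = decode_secret(SECRET_PASSWORD_B64)
if decoded == REDIS_REQUIREPASS:
    print("Secret matches redis.conf requirepass")
else:
    print("Mismatch between Secret and redis.conf requirepass")
```

If the two values drift apart, the head pod would be handed a password that Redis rejects, so it is worth re-running this after changing either one.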
Why are these changes needed?
Related issue number
#3860 (comment)
Checks