docs/sphinx/user-docs/cluster-configuration.rst
@@ -98,6 +98,48 @@ Custom Volumes/Volume Mounts
| For more information on creating Volumes and Volume Mounts with Python, see the Python Kubernetes docs (`Volumes <https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1Volume.md>`__, `Volume Mounts <https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1VolumeMount.md>`__).
| You can also find further information on Volumes and Volume Mounts in the Kubernetes `documentation <https://kubernetes.io/docs/concepts/storage/volumes/>`__.

GCS Fault Tolerance
-------------------
By default, the state of the Ray cluster lives only in the head Pod and is therefore transient. Anything that restarts the head Pod loses that state, including the Ray cluster history. To make the Ray cluster state persistent, you can enable Global Control Service (GCS) fault tolerance backed by an external Redis store.

To configure GCS fault tolerance, set the following parameters:

.. list-table::
   :header-rows: 1
   :widths: auto

   * - Parameter
     - Description
   * - ``enable_gcs_ft``
     - Boolean that enables GCS fault tolerance
   * - ``redis_address``
     - Address of the external Redis service, for example ``"redis:6379"``
   * - ``redis_password_secret``
     - Dictionary with ``name`` and ``key`` fields identifying the Kubernetes Secret that holds the Redis password
   * - ``external_storage_namespace``
     - Custom storage namespace for GCS fault tolerance (by default, KubeRay sets it to the RayCluster's UID)

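
The way these parameters depend on each other can be sketched as a small check. This is only an illustration: ``check_gcs_ft_settings`` is a hypothetical helper, not part of ``codeflare_sdk``, and it assumes that ``redis_address`` is required whenever ``enable_gcs_ft`` is ``True`` and that ``redis_password_secret`` is optional.

.. code:: python

   from typing import Optional


   def check_gcs_ft_settings(
       enable_gcs_ft: bool,
       redis_address: Optional[str] = None,
       redis_password_secret: Optional[dict] = None,
   ) -> list:
       """Return a list of problems; an empty list means the settings look consistent."""
       problems = []
       if not enable_gcs_ft:
           # Nothing to check when the feature is switched off.
           return problems
       if not redis_address:
           # Assumption: an external Redis address is mandatory once the feature is on.
           problems.append("redis_address is required when enable_gcs_ft is True")
       if redis_password_secret is not None:
           # The secret reference needs both the Secret name and the key inside it.
           missing = [f for f in ("name", "key") if f not in redis_password_secret]
           if missing:
               problems.append("redis_password_secret is missing: " + ", ".join(missing))
       return problems

Running such a check before creating the cluster surfaces a missing address or an incomplete secret reference early, rather than after the head Pod has been scheduled.
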
Example configuration:

.. code:: python

   from codeflare_sdk import Cluster, ClusterConfiguration

   cluster = Cluster(ClusterConfiguration(
       name="ray-cluster-with-persistence",
       num_workers=2,
       enable_gcs_ft=True,
       redis_address="redis:6379",
       redis_password_secret={
           "name": "redis-password-secret",
           "key": "password"
       },
       # Optional: custom namespace for GCS data in Redis
       # external_storage_namespace="my-custom-namespace",
   ))

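
To make the role of the ``name`` and ``key`` fields more concrete, the sketch below builds a dictionary shaped like the ``gcsFaultToleranceOptions`` block of a KubeRay RayCluster spec. It rests on assumptions: ``gcs_ft_options`` is a hypothetical helper, the field names follow recent KubeRay releases and should be checked against your KubeRay version, and the SDK's actual translation of the configuration may differ.

.. code:: python

   def gcs_ft_options(redis_address, redis_password_secret=None, external_storage_namespace=None):
       """Build a dict shaped like KubeRay's gcsFaultToleranceOptions (assumed field names)."""
       options = {"redisAddress": redis_address}
       if redis_password_secret:
           # The password is not inlined; it is referenced from a Kubernetes Secret.
           options["redisPassword"] = {
               "valueFrom": {
                   "secretKeyRef": {
                       "name": redis_password_secret["name"],
                       "key": redis_password_secret["key"],
                   }
               }
           }
       if external_storage_namespace:
           # Left out by default so that KubeRay can fall back to the RayCluster's UID.
           options["externalStorageNamespace"] = external_storage_namespace
       return options


   options = gcs_ft_options(
       "redis:6379",
       {"name": "redis-password-secret", "key": "password"},
   )
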
.. note::
   You need to have a Redis instance deployed in your Kubernetes cluster before using this feature.
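
The Secret referenced by ``redis_password_secret`` also has to exist in the cluster. As a minimal sketch, assuming an ``Opaque`` Secret in the same namespace as the Ray cluster, the manifest could be built as shown below. ``redis_password_secret_manifest``, the ``default`` namespace, and the example password are illustrative placeholders, not values from the SDK.

.. code:: python

   import base64


   def redis_password_secret_manifest(name, key, password, namespace="default"):
       """Build a Kubernetes Secret manifest holding the Redis password under ``key``."""
       # Values under "data" in a Secret must be base64-encoded strings.
       encoded = base64.b64encode(password.encode("utf-8")).decode("ascii")
       return {
           "apiVersion": "v1",
           "kind": "Secret",
           "metadata": {"name": name, "namespace": namespace},
           "type": "Opaque",
           "data": {key: encoded},
       }


   manifest = redis_password_secret_manifest(
       "redis-password-secret", "password", "example-password"
   )

The resulting dictionary can be serialized to YAML or JSON and applied with your usual tooling. The ``name`` and ``key`` used here must match the values passed in ``redis_password_secret``.
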