@chipspeak commented Aug 25, 2025

Why are these changes needed?

Pods within a namespace can communicate freely by default, so exposing the Ray head via an Ingress or Route without a NetworkPolicy allows any pod in the cluster to access the Ray Dashboard and API. Managing NetworkPolicy objects manually is cumbersome and error-prone. This additional controller manages the lifecycle of a NetworkPolicy per RayCluster, ensuring that only the relevant pods can communicate with it. It is behind a feature flag in this PR to ensure no disruption to existing users.

This is WIP to serve as a demonstration of how I might approach this.
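
To give a sense of what the controller produces, below is an illustrative sketch of the head policy it manages for the sample cluster used in the verification steps. This is a simplification: only the policy name, namespace, and the ray.io/cluster label are taken from the steps below, the ray.io/node-type selector is an assumption, and the actual ingress rules are described under Rule Verification.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: raycluster-kuberay-head        # one policy for the head, one for the workers
  namespace: default
spec:
  podSelector:
    matchLabels:
      ray.io/cluster: raycluster-kuberay   # selects this cluster's pods
      ray.io/node-type: head               # assumed label for the head pod; may differ
  policyTypes:
  - Ingress
  ingress: []   # populated by the controller with the rules described under Rule Verification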

Verification Steps

Setup

  • First, build this branch using the below command from within the ray-operator directory.
make docker-build IMG=network-policy-test:odh
  • Now we can load that image into kind using:
kind load docker-image network-policy-test:odh --name network-policy-test
  • We need to update ray-operator/config/manager/manager.yaml to point to our new image and to enable the feature flag. You can do this by adjusting the --feature-gates argument on line 32 to include the network policy flag, as below:
- --feature-gates=RayClusterStatusConditions=true,RayClusterNetworkPolicy=true
  • We’ll also need to adjust the image field to point to our new local image and set the imagePullPolicy to Never (this ensures Kubernetes uses the locally loaded image rather than attempting to pull from a registry). You can do this by making the following adjustments to lines 34 and 35 (see the abridged manager.yaml excerpt at the end of this section):
image: network-policy-test:odh
imagePullPolicy: Never
  • We can then install the operator using the below command:
kubectl apply --server-side -k config/default
  • Verify that the operator is in a running state by using the below command:
kubectl get pods | grep kuberay-operator
  • We can now create a RayCluster using the samples by running the below command:
kubectl apply -f config/samples/ray-cluster.sample.yaml
  • Now that we have a RayCluster, let’s verify that a matching NetworkPolicy has been created by running the below command:
kubectl get networkpolicies -n default
  • This should return raycluster-kuberay-head and raycluster-kuberay-worker. You can view the NetworkPolicy by running the below command:
kubectl describe networkpolicy <network_policy_name>
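
For reference, once the Setup edits above are made, the relevant portion of ray-operator/config/manager/manager.yaml should look roughly like the abridged excerpt below (other args and fields omitted; the exact line numbers may drift between releases).

      containers:
      - args:
        - --feature-gates=RayClusterStatusConditions=true,RayClusterNetworkPolicy=true
        image: network-policy-test:odh
        imagePullPolicy: Never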

Rule Verification

Rule 1

  • A pod with a label matching the newly created RayCluster can access all ports. We can verify this by running the below command:
kubectl run test-rule1-intra --image=busybox --rm -i --restart=Never -n default --labels="ray.io/cluster=raycluster-kuberay" -- timeout 5 nc -zv raycluster-kuberay-head-svc 6379
  • This should succeed, as we’ve applied the required label to the pod, granting it access to the restricted ports. Let’s verify this further by attempting to access another restricted port from a pod with the same label:
kubectl run test-rule1-metrics --image=busybox --rm -i --restart=Never -n default --labels="ray.io/cluster=raycluster-kuberay" -- timeout 5 nc -zv raycluster-kuberay-head-svc 8080
  • This should work once again. We can verify the expected failure when the label is missing by attempting to access GCS (port 6379) again, this time from a pod without the label:
kubectl run test-rule1-no-label --image=busybox --rm -i --restart=Never -n default -- timeout 5 nc -zv raycluster-kuberay-head-svc 6379
  • This should fail, as the pod does not have permission to access that port without the label.
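
As a sketch, Rule 1 corresponds to an ingress rule in the generated policy along the lines of the fragment below (illustrative only; the exact field layout is an assumption, but the ray.io/cluster label matches the one used in the commands above).

# Rule 1 (sketch): pods carrying the cluster's label may reach every port on the head
- from:
  - podSelector:
      matchLabels:
        ray.io/cluster: raycluster-kuberay
  # no "ports" stanza, so all ports are open to these peers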

Rule 2

  • We’re restricting communication with the Ray dashboard and client to pods within the same namespace. We can verify this by running the following to create a pod and query the dashboard:
kubectl run test-same-namespace --image=busybox --rm -i --restart=Never -- \
  timeout 1 wget -qO- raycluster-kuberay-head-svc:8265
  • This should succeed and return the HTML response from the dashboard as seen below.
<!doctype html><html lang="en"><head><meta charset="utf-8"/><link rel="shortcut icon" href="./favicon.ico"/><meta name="viewport" content="width=device-width,initial-scale=1"/><title>Ray Dashboard</title><script defer="defer" src="./static/js/main.04e1bfe3.js"></script><link href="./static/css/main.388a904b.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>
  • We can further verify this by doing the same, but with a pod from a different namespace. Run the below commands to achieve this:
kubectl create namespace test-namespace

kubectl run test-diff-namespace --image=busybox --rm -i --restart=Never -n test-namespace -- timeout 2 wget -qO- raycluster-kuberay-head-svc.default:8265
  • This time, the request should time out, as this pod is in a different namespace and is not permitted to communicate with the Ray head’s dashboard.
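
As a sketch, Rule 2 corresponds to an ingress rule along the lines of the fragment below (illustrative only; 8265 is the dashboard port exercised above, and 10001 is assumed to be the client port covered by the same rule).

# Rule 2 (sketch): any pod in the RayCluster's own namespace may reach the
# dashboard and client ports; an empty podSelector matches all pods in the
# policy's namespace and nothing outside it
- from:
  - podSelector: {}
  ports:
  - protocol: TCP
    port: 8265    # dashboard
  - protocol: TCP
    port: 10001   # client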

Rule 3

  • The KubeRay operator should also have appropriate permissions to communicate with the RayCluster pods. This requires that the operator is in the expected namespace (dynamically retrieved by the controller) and has the app.kubernetes.io/name=kuberay label. We can verify this by running the below command:
kubectl run test-rule3-operator --image=busybox --rm -i --restart=Never -n default --labels="app.kubernetes.io/name=kuberay" -- timeout 5 nc -zv raycluster-kuberay-head-svc 8265
  • This will pass regardless of the label, as the test pod is in the same namespace as the RayCluster we created, so Rule 2 already permits the traffic. To exercise Rule 3 properly, we’ll need to create a RayCluster in the test-namespace we created earlier by running:
kubectl apply -f config/samples/ray-cluster.sample.yaml -n test-namespace
  • Now we can create a pod with the KubeRay operator’s label in our default namespace and attempt to access the new cluster’s head:
kubectl run test-cross-ns-operator --image=busybox --rm -i --restart=Never -n default --labels="app.kubernetes.io/name=kuberay" -- timeout 5 nc -zv raycluster-kuberay-head-svc.test-namespace 8265
  • This should succeed due to Rule 3. Let’s try the same, but without the KubeRay operator label:
kubectl run test-cross-ns-no-label --image=busybox --rm -i --restart=Never -n default -- timeout 5 nc -zv raycluster-kuberay-head-svc.test-namespace 8265
  • This should fail, as the label is required for cross-namespace communication. This is effectively what Rule 3 adds on top of the Rule 2 logic.
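
As a sketch, Rule 3 corresponds to an ingress rule along the lines of the fragment below (illustrative only; the operator namespace is resolved dynamically by the controller, so the kubernetes.io/metadata.name value here is just an example, and the real port list may be broader).

# Rule 3 (sketch): the KubeRay operator may reach the head from its own namespace
- from:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: default   # wherever the operator is deployed
    podSelector:
      matchLabels:
        app.kubernetes.io/name: kuberay
  ports:
  - protocol: TCP
    port: 8265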

Rule 4

  • We permit access to the monitoring port (8080) from two common monitoring namespaces: openshift-monitoring and prometheus. To verify this access we first need to create these namespaces by running:
kubectl create namespace openshift-monitoring && kubectl create namespace prometheus
  • Now we can test the openshift-monitoring namespace’s access by running:
kubectl run test-rule4-openshift --image=busybox --rm -i --restart=Never -n openshift-monitoring -- timeout 5 nc -zv raycluster-kuberay-head-svc.default 8080
  • This should succeed. We can also verify the prometheus namespace’s access by running:
kubectl run test-rule4-prometheus --image=busybox --rm -i --restart=Never -n prometheus -- timeout 5 nc -zv raycluster-kuberay-head-svc.default 8080
  • This should once again succeed. We can also confirm that this access is limited to the above by attempting to access port 8080 from a different, non-monitoring namespace by running:
kubectl run test-rule4-blocked --image=busybox --rm -i --restart=Never -n test-namespace -- timeout 5 nc -zv raycluster-kuberay-head-svc.default 8080
  • This should fail, as this pod is from a non-monitoring namespace (and is not covered by Rule 2, as it is not in the same namespace as the RayCluster). We can verify that the rule limits the monitoring namespaces’ access appropriately by running:
kubectl run test-rule4-wrong-port --image=busybox --rm -i --restart=Never -n openshift-monitoring -- timeout 5 nc -zv raycluster-kuberay-head-svc.default 6379
  • This should fail as these monitoring namespaces are only permitted to access port 8080.
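
As a sketch, Rule 4 corresponds to an ingress rule along the lines of the fragment below (illustrative only; the namespace names and the 8080 metrics port come from the steps above, while the selector shape is an assumption).

# Rule 4 (sketch): well-known monitoring namespaces may reach the metrics port only
- from:
  - namespaceSelector:
      matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
        - openshift-monitoring
        - prometheus
  ports:
  - protocol: TCP
    port: 8080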

Rule 5

  • We permit access to port 8443 from any pod, but Ray won’t be listening on this port unless mTLS has been configured for this particular RayCluster. We can verify this by running:
kubectl run test-rule5-non-mtls --image=busybox --rm -i --restart=Never -n test-namespace -- timeout 2 nc -zv raycluster-kuberay-head-svc.default 8443
  • This should time out, as this RayCluster is not mTLS-enabled and is therefore not listening on port 8443. We can further illustrate this by checking the head pod’s NetworkPolicy:
kubectl describe networkpolicy raycluster-kuberay-head
  • In the output of the above, you should be able to observe that in the final rule, only port 8443 is exposed as seen below.
    To Port: 8443/TCP
    From: <any> (traffic not restricted by source)
  • We can demonstrate how the NetworkPolicy for an mTLS-enabled RayCluster exposes this port by applying a CR with mTLS enabled. In this instance, we don’t have an mTLS controller to automatically add a container port of 8443, so we’ll add it manually by running:
kubectl apply -f - <<EOF
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay-mtls
spec:
  rayVersion: '2.46.0'
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.46.0
          env:
          - name: RAY_USE_TLS
            value: "1"
          - name: RAY_TLS_SERVER_CERT
            value: "/etc/tls/server.crt"
          - name: RAY_TLS_SERVER_KEY
            value: "/etc/tls/server.key"
          resources:
            limits:
              cpu: 1
              memory: 2G
            requests:
              cpu: 1
              memory: 2G
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8443
            name: mtls
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 5
    groupName: workergroup
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.46.0
          env:
          - name: RAY_USE_TLS
            value: "1"
          - name: RAY_TLS_SERVER_CERT
            value: "/etc/tls/server.crt"
          - name: RAY_TLS_SERVER_KEY
            value: "/etc/tls/server.key"
          resources:
            limits:
              cpu: 1
              memory: 1G
            requests:
              cpu: 1
              memory: 1G
EOF
  • NOTE: this is only verifiable from the NetworkPolicy and port exposure perspective. This hinges on mTLS being configured as part of KubeRay in the future. While this will succeed in creating an appropriate network policy, the pods will fail as there will be no certificates available.
  • We can verify that the ports are exposed correctly by running:
kubectl describe networkpolicy raycluster-kuberay-mtls-head
  • In the output, you should be able to observe in the final rule that port 10001 is now also exposed, as seen below. This is expected when mTLS is configured.
    To Port: 8443/TCP
    To Port: 10001/TCP
    From: <any> (traffic not restricted by source)
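
Expressed as an ingress rule, that final ruleset corresponds roughly to the fragment below (illustrative only; a rule with ports but no from entry allows those ports from any source, which is what the “traffic not restricted by source” output reflects).

# Rule 5 (sketch): the mTLS port is open to any source; when mTLS is enabled for
# the RayCluster, the client port is exposed here as well
- ports:
  - protocol: TCP
    port: 8443
  - protocol: TCP
    port: 10001   # only present when mTLS is enabled for the cluster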

Related issue number

Closes #3987

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
