@chipspeak commented Aug 25, 2025

Why are these changes needed?

Pods within a namespace can communicate freely by default, so exposing the Ray head via an Ingress or Route without a NetworkPolicy allows any pod in the cluster to access the Ray Dashboard and API. Managing NetworkPolicy objects manually is cumbersome and error-prone. This additional controller manages the lifecycle of a NetworkPolicy per RayCluster, ensuring that only the relevant pods can communicate with it. It is behind a feature flag in this PR to ensure no disruption to existing users.

This is WIP to serve as a demonstration of how I might approach this.
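
To give a sense of what the controller produces, below is an illustrative sketch of the head policy it manages for the sample cluster used in the verification steps. This is a simplification: only the policy name, namespace, and the ray.io/cluster label are taken from the steps below, the ray.io/node-type selector is an assumption, and the actual ingress rules are described under Rule Verification.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: raycluster-kuberay-head        # one policy for the head, one for the workers
  namespace: default
spec:
  podSelector:
    matchLabels:
      ray.io/cluster: raycluster-kuberay   # selects this cluster's pods
      ray.io/node-type: head               # assumed label for the head pod; may differ
  policyTypes:
  - Ingress
  ingress: []   # populated by the controller with the rules described under Rule Verification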

Verification Steps

Setup

  • First, build this branch using the below command from within the ray-operator directory.
make docker-build IMG=network-policy-test:odh
  • Now we can load that image into kind using:
kind load docker-image network-policy-test:odh --name network-policy-test
  • We need to update ray-operator/config/manager/manager.yaml to point to our new image and to enable the feature flag. You can do this by adjusting the --feature-gates argument on line 32 to include the network policy flag, as below:
- --feature-gates=RayClusterStatusConditions=true,RayClusterNetworkPolicy=true
  • We’ll also need to adjust the image field to point to our new local image and set the imagePullPolicy to Never (this ensures Kubernetes uses the locally loaded image rather than attempting to pull from a registry). You can do this by making the following adjustments to lines 34 and 35 (see the abridged manager.yaml excerpt at the end of this section):
image: network-policy-test:odh
imagePullPolicy: Never
  • We can then install the operator using the below command:
kubectl apply --server-side -k config/default
  • Verify that the operator is in a running state by using the below command:
kubectl get pods | grep kuberay-operator
  • We can now create a RayCluster using the samples by running the below command:
kubectl apply -f config/samples/ray-cluster.sample.yaml
  • Now that we have a RayCluster, let’s verify that a matching NetworkPolicy has been created by running the below command:
kubectl get networkpolicies -n default
  • This should return raycluster-kuberay-head and raycluster-kuberay-worker. You can view the NetworkPolicy by running the below command:
kubectl describe networkpolicy <network_policy_name>
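
For reference, once the Setup edits above are made, the relevant portion of ray-operator/config/manager/manager.yaml should look roughly like the abridged excerpt below (other args and fields omitted; the exact line numbers may drift between releases).

      containers:
      - args:
        - --feature-gates=RayClusterStatusConditions=true,RayClusterNetworkPolicy=true
        image: network-policy-test:odh
        imagePullPolicy: Never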

Rule Verification

Rule 1

  • A pod with a label matching the newly created RayCluster can access all ports. We can verify this by running the below command:
kubectl run test-rule1-intra --image=busybox --rm -i --restart=Never -n default --labels="ray.io/cluster=raycluster-kuberay" -- timeout 5 nc -zv raycluster-kuberay-head-svc 6379
  • This should succeed, as we’ve applied the required label to the pod, granting it access to the restricted ports. Let’s verify this further by attempting to access another restricted port from a pod with the same label:
kubectl run test-rule1-metrics --image=busybox --rm -i --restart=Never -n default --labels="ray.io/cluster=raycluster-kuberay" -- timeout 5 nc -zv raycluster-kuberay-head-svc 8080
  • This should work once again. We can verify the expected failure when the label is missing by attempting to access GCS (port 6379) again, this time from a pod without the label:
kubectl run test-rule1-no-label --image=busybox --rm -i --restart=Never -n default -- timeout 5 nc -zv raycluster-kuberay-head-svc 6379
  • This should fail, as the pod does not have permission to access that port without the label.
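
As a sketch, Rule 1 corresponds to an ingress rule in the generated policy along the lines of the fragment below (illustrative only; the exact field layout is an assumption, but the ray.io/cluster label matches the one used in the commands above).

# Rule 1 (sketch): pods carrying the cluster's label may reach every port on the head
- from:
  - podSelector:
      matchLabels:
        ray.io/cluster: raycluster-kuberay
  # no "ports" stanza, so all ports are open to these peers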

Rule 2

  • We’re restricting communication with the Ray dashboard and client to pods within the same namespace. We can verify this by running the following to create a pod and query the dashboard:
kubectl run test-same-namespace --image=busybox --rm -i --restart=Never -- \
  timeout 1 wget -qO- raycluster-kuberay-head-svc:8265
  • This should succeed and return the HTML response from the dashboard as seen below.
<!doctype html><html lang="en"><head><meta charset="utf-8"/><link rel="shortcut icon" href="./favicon.ico"/><meta name="viewport" content="width=device-width,initial-scale=1"/><title>Ray Dashboard</title><script defer="defer" src="./static/js/main.04e1bfe3.js"></script><link href="./static/css/main.388a904b.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>
  • We can further verify this by doing the same, but with a pod from a different namespace. Run the below commands to achieve this:
kubectl create namespace test-namespace

kubectl run test-diff-namespace --image=busybox --rm -i --restart=Never -n test-namespace -- timeout 2 wget -qO- raycluster-kuberay-head-svc.default:8265
  • This time, the request should time out, as this pod is in a different namespace and is not permitted to communicate with the Ray head’s dashboard.
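
As a sketch, Rule 2 corresponds to an ingress rule along the lines of the fragment below (illustrative only; 8265 is the dashboard port exercised above, and 10001 is assumed to be the client port covered by the same rule).

# Rule 2 (sketch): any pod in the RayCluster's own namespace may reach the
# dashboard and client ports; an empty podSelector matches all pods in the
# policy's namespace and nothing outside it
- from:
  - podSelector: {}
  ports:
  - protocol: TCP
    port: 8265    # dashboard
  - protocol: TCP
    port: 10001   # client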

Rule 3

  • The KubeRay operator should also have appropriate permissions to communicate with the RayCluster pods. This requires that the operator is in the expected namespace (dynamically retrieved by the controller) and has the app.kubernetes.io/name=kuberay label. We can verify this by running the below command:
kubectl run test-rule3-operator --image=busybox --rm -i --restart=Never -n default --labels="app.kubernetes.io/name=kuberay" -- timeout 5 nc -zv raycluster-kuberay-head-svc 8265
  • This will pass regardless of the label, as the test pod is in the same namespace as the RayCluster we created, so Rule 2 already permits the traffic. To exercise Rule 3 properly, we’ll need to create a RayCluster in the test-namespace we created earlier by running:
kubectl apply -f config/samples/ray-cluster.sample.yaml -n test-namespace
  • Now we can create a pod with the KubeRay operator’s label in our default namespace and attempt to access the new cluster’s head:
kubectl run test-cross-ns-operator --image=busybox --rm -i --restart=Never -n default --labels="app.kubernetes.io/name=kuberay" -- timeout 5 nc -zv raycluster-kuberay-head-svc.test-namespace 8265
  • This should succeed due to Rule 3. Let’s try the same, but without the KubeRay operator label:
kubectl run test-cross-ns-no-label --image=busybox --rm -i --restart=Never -n default -- timeout 5 nc -zv raycluster-kuberay-head-svc.test-namespace 8265
  • This should fail, as the label is required for cross-namespace communication. This is effectively what Rule 3 adds on top of the Rule 2 logic.
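
As a sketch, Rule 3 corresponds to an ingress rule along the lines of the fragment below (illustrative only; the operator namespace is resolved dynamically by the controller, so the kubernetes.io/metadata.name value here is just an example, and the real port list may be broader).

# Rule 3 (sketch): the KubeRay operator may reach the head from its own namespace
- from:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: default   # wherever the operator is deployed
    podSelector:
      matchLabels:
        app.kubernetes.io/name: kuberay
  ports:
  - protocol: TCP
    port: 8265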

Rule 4

  • We permit access to the monitoring port (8080) from two common monitoring namespaces: openshift-monitoring and prometheus. To verify this access we first need to create these namespaces by running:
kubectl create namespace openshift-monitoring && kubectl create namespace prometheus
  • Now we can test the openshift-monitoring namespace’s access by running:
kubectl run test-rule4-openshift --image=busybox --rm -i --restart=Never -n openshift-monitoring -- timeout 5 nc -zv raycluster-kuberay-head-svc.default 8080
  • This should succeed. We can also verify the prometheus namespace’s access by running:
kubectl run test-rule4-prometheus --image=busybox --rm -i --restart=Never -n prometheus -- timeout 5 nc -zv raycluster-kuberay-head-svc.default 8080
  • This should once again succeed. We can also confirm that this access is limited to the above by attempting to access port 8080 from a different, non-monitoring namespace by running:
kubectl run test-rule4-blocked --image=busybox --rm -i --restart=Never -n test-namespace -- timeout 5 nc -zv raycluster-kuberay-head-svc.default 8080
  • This should fail, as this pod is from a non-monitoring namespace (and is not covered by Rule 2, as it is not in the same namespace as the RayCluster). We can verify that the rule limits the monitoring namespaces’ access appropriately by running:
kubectl run test-rule4-wrong-port --image=busybox --rm -i --restart=Never -n openshift-monitoring -- timeout 5 nc -zv raycluster-kuberay-head-svc.default 6379
  • This should fail as these monitoring namespaces are only permitted to access port 8080.
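
As a sketch, Rule 4 corresponds to an ingress rule along the lines of the fragment below (illustrative only; the namespace names and the 8080 metrics port come from the steps above, while the selector shape is an assumption).

# Rule 4 (sketch): well-known monitoring namespaces may reach the metrics port only
- from:
  - namespaceSelector:
      matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
        - openshift-monitoring
        - prometheus
  ports:
  - protocol: TCP
    port: 8080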

Rule 5

  • We permit access to port 8443 from any pod, but Ray won’t be listening on this port unless mTLS has been configured for this particular RayCluster. We can verify this by running:
kubectl run test-rule5-non-mtls --image=busybox --rm -i --restart=Never -n test-namespace -- timeout 2 nc -zv raycluster-kuberay-head-svc.default 8443
  • This should time out, as this RayCluster is not mTLS-enabled and is therefore not listening on port 8443. We can further illustrate this by checking the head pod’s NetworkPolicy:
kubectl describe networkpolicy raycluster-kuberay-head
  • In the output of the above, you should be able to observe that in the final rule, only port 8443 is exposed as seen below.
    To Port: 8443/TCP
    From: <any> (traffic not restricted by source)
  • We can demonstrate how the NetworkPolicy for an mTLS-enabled RayCluster exposes this port by applying a CR with mTLS enabled. In this instance, we don’t have an mTLS controller to automatically add a container port of 8443, so we’ll add it manually by running:
kubectl apply -f - <<EOF
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay-mtls
spec:
  rayVersion: '2.46.0'
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.46.0
          env:
          - name: RAY_USE_TLS
            value: "1"
          - name: RAY_TLS_SERVER_CERT
            value: "/etc/tls/server.crt"
          - name: RAY_TLS_SERVER_KEY
            value: "/etc/tls/server.key"
          resources:
            limits:
              cpu: 1
              memory: 2G
            requests:
              cpu: 1
              memory: 2G
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8443
            name: mtls
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 5
    groupName: workergroup
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.46.0
          env:
          - name: RAY_USE_TLS
            value: "1"
          - name: RAY_TLS_SERVER_CERT
            value: "/etc/tls/server.crt"
          - name: RAY_TLS_SERVER_KEY
            value: "/etc/tls/server.key"
          resources:
            limits:
              cpu: 1
              memory: 1G
            requests:
              cpu: 1
              memory: 1G
EOF
  • NOTE: this is only verifiable from the NetworkPolicy and port exposure perspective. This hinges on mTLS being configured as part of KubeRay in the future. While this will succeed in creating an appropriate network policy, the pods will fail as there will be no certificates available.
  • We can verify that the ports are exposed correctly by running:
kubectl describe networkpolicy raycluster-kuberay-mtls-head
  • In the output, you should be able to observe in the final rule that port 10001 is now also exposed, as seen below. This is expected when mTLS is configured.
    To Port: 8443/TCP
    To Port: 10001/TCP
    From: <any> (traffic not restricted by source)
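
Expressed as an ingress rule, that final ruleset corresponds roughly to the fragment below (illustrative only; a rule with ports but no from entry allows those ports from any source, which is what the “traffic not restricted by source” output reflects).

# Rule 5 (sketch): the mTLS port is open to any source; when mTLS is enabled for
# the RayCluster, the client port is exposed here as well
- ports:
  - protocol: TCP
    port: 8443
  - protocol: TCP
    port: 10001   # only present when mTLS is enabled for the cluster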

Related issue number

Closes #3987

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
