
Network policy incorrectly applied due to mismatch between pod selector and agent pod labels #1490

Open
bogatuadrian opened this issue Oct 28, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@bogatuadrian

Context

We're using the Datadog Operator to deploy a Datadog agent with Kubernetes network policies enabled, but the operator doesn't use the correct pod selectors when creating the NetworkPolicy resources to target the deployed agent and cluster agent pods.

Problem

This can lead either to a false sense of security if your network plugin allows traffic by default, or to Datadog simply not working if your network plugin denies traffic by default. We fall into the second category: because traffic is denied by default and the network policies set up by the operator do not work, we hit multiple network issues in both the agent and the cluster agent, and we lose observability data as a result.

Setup

We configure our Datadog agent roughly like this:

kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  global:
    networkPolicy:
      create: true
      flavor: kubernetes

This leads to the creation of agents and cluster agents with the following spec (some labels omitted for brevity):

apiVersion: v1
kind: Pod
metadata:
  generateName: datadog-agent-
  labels:
    agent.datadoghq.com/component: agent
    agent.datadoghq.com/name: datadog
    agent.datadoghq.com/provider: ""
    app.kubernetes.io/component: agent
    app.kubernetes.io/instance: datadog-agent # Notice how this is prefixed with "datadog" (likely from the `DatadogAgent` resource)
    app.kubernetes.io/managed-by: datadog-operator
    app.kubernetes.io/name: datadog-agent-deployment
    app.kubernetes.io/part-of: datadog-datadog

However, the network policies created by the operator use the following pod selectors:

➜ kubectl get networkpolicies.networking.k8s.io                                                 
NAME                            POD-SELECTOR                                                                                 AGE
datadog-agent                   app.kubernetes.io/instance=agent,app.kubernetes.io/part-of=datadog-datadog                   77m
datadog-cluster-agent           app.kubernetes.io/instance=cluster-agent,app.kubernetes.io/part-of=datadog-datadog           77m
datadog-cluster-checks-runner   app.kubernetes.io/instance=cluster-checks-runner,app.kubernetes.io/part-of=datadog-datadog   77m

Notice how the pods use datadog-agent while the policies use agent as the value of the app.kubernetes.io/instance label.
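Assuming the fix is simply to make the selector match the actual pod labels, the agent policy's selector would need to look something like this (label values taken from the pod spec above; this is a sketch, not the operator's actual output):

```yaml
# Hypothetical corrected selector for the datadog-agent NetworkPolicy;
# the instance value must include the DatadogAgent name prefix.
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: datadog-agent
      app.kubernetes.io/part-of: datadog-datadog
```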

Code breadcrumbs

I took a look at the code and may have found where the discrepancy occurs.

On the one hand, the network policy is created with the hard-coded agent value:

suffix = apicommon.DefaultAgentResourceSuffix

DefaultAgentResourceSuffix = "agent"

On the other hand, the app.kubernetes.io/instance label seems to be set here:

labels[kubernetes.AppKubernetesInstanceLabelKey] = instanceName

with the value from here
// GetAgentName return the Agent name based on the DatadogAgent info
func GetAgentName(dda metav1.Object) string {
    return fmt.Sprintf("%s-%s", dda.GetName(), apicommon.DefaultAgentResourceSuffix)
}

It looks like in our case the label value would be datadog-agent, while the network policy hardcodes agent, hence the pod selector mismatch and the network policies not targeting the correct pods.
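A minimal sketch of the mismatch, reproducing the operator's naming logic from the snippets above (the constant and the Sprintf pattern are copied from the source; the function and variable names here are mine):

```go
package main

import "fmt"

// DefaultAgentResourceSuffix mirrors the operator constant quoted above.
const DefaultAgentResourceSuffix = "agent"

// getAgentInstanceLabel reproduces GetAgentName: the
// app.kubernetes.io/instance value on agent pods is "<dda-name>-agent".
func getAgentInstanceLabel(ddaName string) string {
	return fmt.Sprintf("%s-%s", ddaName, DefaultAgentResourceSuffix)
}

func main() {
	// With a DatadogAgent named "datadog", pods carry
	// app.kubernetes.io/instance=datadog-agent, while the network
	// policy selector uses just the suffix "agent" — hence the mismatch.
	podLabel := getAgentInstanceLabel("datadog")
	policySelector := DefaultAgentResourceSuffix
	fmt.Println(podLabel, policySelector, podLabel == policySelector)
	// → datadog-agent agent false
}
```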

Potential solution

I'm guessing this discrepancy appeared when the Datadog Operator added support for running multiple Datadog agents at the same time and migrated to using <agent-name>-agent, <agent-name>-cluster-agent, etc. as the app.kubernetes.io/instance value, without updating the network policies accordingly.

The fix would be to set the pod selectors of the created network policies to match the pod labels of each Datadog agent deployed by the operator.

Environment

We are using the latest versions for both the operator and the agent, available at the time of writing.

Operator helm chart version: 2.1.0
Operator version: 1.9.0
Agent version: 7.58.1
Kubernetes Distribution: EKS

@bogatuadrian
Author

One more note: not only are the pod selectors used to attach the network policies incorrect, the pod selectors used in the from and to clauses of the ingress/egress rules are incorrect as well, e.g.:

    - from:
      - podSelector:
          matchLabels:
            app.kubernetes.io/instance: agent # no agent pod exists with this label
            app.kubernetes.io/part-of: datadog-datadog

@khewonc
Contributor

khewonc commented Oct 30, 2024

@bogatuadrian Thank you for the detailed report 🙇 Indeed, we're using different labels for the network policy selectors than we use for the agent pods. We will work on a fix for this.

@khewonc khewonc added the bug Something isn't working label Oct 30, 2024
@bogatuadrian
Author

Thanks for the reply, @khewonc.

I want to mention one more issue that I found after more investigation. While trying to work around the label issue by deploying our own custom NetworkPolicies, I noticed that the policies created by the operator do not allow ingress traffic to the admission controller target port (8000). This leads to pods not being created when they have admission.datadoghq.com/enabled: "true" set, because the Kube API server cannot reach the admission controller.
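For anyone hitting the same thing, a custom policy along these lines should let the API server reach the admission controller while waiting for a fix. The port (8000) comes from this thread; the selector labels are an assumption based on the naming scheme discussed above and may need adjusting for your cluster:

```yaml
# Hypothetical workaround policy (not created by the operator):
# allow ingress to the cluster agent's admission controller port.
# An empty "from" is used because the Kube API server is typically
# not selectable via pod labels.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: datadog-cluster-agent-admission
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: datadog-cluster-agent
      app.kubernetes.io/part-of: datadog-datadog
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - port: 8000
          protocol: TCP
```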

@khewonc
Contributor

khewonc commented Nov 4, 2024

@bogatuadrian Thanks for letting us know. I'll add a card in our backlog to add network policies to the admission controller feature.
