Performance issue with K8S Secret Provider #3442

Closed
barkbay opened this issue Sep 20, 2023 · 10 comments
Labels
bug Something isn't working Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team

Comments

@barkbay
Contributor

barkbay commented Sep 20, 2023

Hi 👋 ,

We have an Agent configuration that references 2 Secrets through the K8S Secret Provider, for example:

          streams:
            - id: elasticsearch/metrics-elasticsearch.stack_monitoring.ccr
              data_stream:
                dataset: elasticsearch.stack_monitoring.ccr
                type: metrics
              metricsets:
                - ccr
              hosts:
                - 'https://${kubernetes.pod.ip}:9200'
              headers:
                Authorization: Bearer ${kubernetes_secrets.elastic-agent.somesecret1.value}
                AnotherHeader: SharedSecret ${kubernetes_secrets.elastic-agent.somesecret2.value}
              scope: node
              period: 10s

While investigating a global performance issue in our K8S cluster we observed a fair number of requests to our K8S API server, with a constant rate of around 4 to 8 GET requests per second on these 2 Secrets. The audit trace suggests that those requests are made by Agents managed by a DaemonSet on our clusters:

Audit Event Sample
{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Metadata",
    "auditID": "4a1822f1-6509-45ad-a767-4174f0beded",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/elastic-agent/secrets/somesecret1",
    "verb": "get",
    "user": {
        "username": "system:serviceaccount:elastic-agent:elastic-agent",
        "uid": "7322c0d3-f498-4dae-a7e5-699d899fdfdf",
        "groups": [
            "system:serviceaccounts",
            "system:serviceaccounts:elastic-agent",
            "system:authenticated"
        ],
        "extra": {
            "authentication.kubernetes.io/pod-name": [
                "elastic-agent-b5hff"
            ],
            "authentication.kubernetes.io/pod-uid": [
                "64cb7be8-3d16-470f-afe6-c77fd092ffff"
            ]
        }
    },
    "sourceIPs": [
        "192.168.92.12"
    ],
    "userAgent": "Go-http-client/2.0",
    "objectRef": {
        "resource": "secrets",
        "namespace": "elastic-agent",
        "name": "somesecret1",
        "apiVersion": "v1"
    },
    "responseStatus": {
        "metadata": {},
        "code": 200
    },
    "requestReceivedTimestamp": "2023-09-20T14:55:44.652043Z",
    "stageTimestamp": "2023-09-20T14:55:44.655909Z",
    "annotations": {
        "authorization.k8s.io/decision": "allow",
        "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"elastic-agent\" of ClusterRole \"elastic-agent\" to ServiceAccount \"elastic-agent/elastic-agent\""
    }
}

I think these Secrets should be cached, as is usually the case when using the K8S client from controller-runtime, for example. It would definitely help in our case.

We had a look at the code and were wondering whether the client calls are coming from here.
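
For comparison, here is a minimal sketch of the informer-backed read path that controller-runtime-style caching provides, written with plain client-go. This is illustrative only, not the Agent's code; the namespace, secret name, and the "value" data key are taken from the example above, everything else is an assumption.

    package main

    import (
    	"context"
    	"fmt"
    	"time"

    	"k8s.io/client-go/informers"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    )

    func main() {
    	// In-cluster configuration, as an Agent Pod in the DaemonSet would use.
    	cfg, err := rest.InClusterConfig()
    	if err != nil {
    		panic(err)
    	}
    	clientset, err := kubernetes.NewForConfig(cfg)
    	if err != nil {
    		panic(err)
    	}

    	ctx := context.Background()

    	// A shared informer keeps the namespace's Secrets in a local store and
    	// keeps that store up to date through a single WATCH, so the read below
    	// is served from memory instead of issuing a GET against the API server.
    	factory := informers.NewSharedInformerFactoryWithOptions(
    		clientset, 10*time.Minute, informers.WithNamespace("elastic-agent"))
    	secretLister := factory.Core().V1().Secrets().Lister()

    	factory.Start(ctx.Done())
    	factory.WaitForCacheSync(ctx.Done())

    	// Served from the informer cache, not from the API server.
    	sec, err := secretLister.Secrets("elastic-agent").Get("somesecret1")
    	if err != nil {
    		panic(err)
    	}
    	fmt.Println(string(sec.Data["value"]))
    }

Note that an informer needs list and watch permissions on Secrets, whereas the current per-call GET only needs get, so the RBAC granted to the elastic-agent ServiceAccount might have to be adjusted for such an approach.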

Thanks!

For confirmed bugs, please report:

  • Version: docker.elastic.co/beats/elastic-agent:8.8.2
  • Operating System: K8S 1.25 on EKS
@barkbay barkbay added the bug Something isn't working label Sep 20, 2023
@jlind23 jlind23 added the Team:Elastic-Agent Label for the Agent team label Sep 20, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Member

cmacknz commented Sep 20, 2023

I believe this is happening in this function, which indeed has no cache:

func (p *contextProviderK8sSecrets) Fetch(key string) (string, bool) {

@cmacknz
Member

cmacknz commented Sep 20, 2023

The fastest way to get this fixed will be to open a PR adding the cache yourself; the changes needed should be contained to a single source file.
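
To give a rough idea of the scope, a single-file change might look like a simple TTL cache wrapped around the existing lookup. The sketch below is an assumption about the shape of the fix, not the actual provider code; cacheEntry, secretCache, and fetchFromAPI are hypothetical names.

    package kubernetessecrets

    import (
    	"sync"
    	"time"
    )

    // cacheEntry is a remembered result of one secret lookup.
    type cacheEntry struct {
    	value     string
    	found     bool
    	fetchedAt time.Time
    }

    // secretCache remembers fetched secret values for a bounded time so that
    // repeated Fetch calls for the same key don't each hit the API server.
    type secretCache struct {
    	mu  sync.Mutex
    	ttl time.Duration
    	m   map[string]cacheEntry
    }

    // get returns a cached value while it is still fresh; otherwise it calls
    // fetchFromAPI (standing in for the existing API-server lookup) and
    // stores the result for the next ttl window.
    func (c *secretCache) get(key string, fetchFromAPI func(string) (string, bool)) (string, bool) {
    	c.mu.Lock()
    	defer c.mu.Unlock()
    	if e, ok := c.m[key]; ok && time.Since(e.fetchedAt) < c.ttl {
    		return e.value, e.found
    	}
    	v, found := fetchFromAPI(key)
    	c.m[key] = cacheEntry{value: v, found: found, fetchedAt: time.Now()}
    	return v, found
    }

A TTL keeps the change contained, at the cost of serving values up to one TTL stale; the invalidation question is discussed in the next comment.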

@barkbay barkbay self-assigned this Sep 21, 2023
@blakerouse
Contributor

Adding caching always raises the question of what invalidates the cache, which I think needs to be discussed before a cache is added.

I think the provider could be updated to be both more efficient and more responsive to changes.

The following order of operations, I believe, could fix this issue as well as make the provider more responsive.

First loop

For each secret the provider fetches, store the key and the fetched value, and subscribe to Kubernetes events for changes to that secret's value.

Sequential loop

Mark all current secret keys as old. Then do the same as in the first loop, except don't fetch the value unless it was marked dirty by a secret change event. Finally, any secrets still marked old can be removed from the cache and their event subscriptions dropped.
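
A rough sketch of that order of operations, to make it concrete. This is only an illustration of the proposal above; entry, providerCache, markDirty, and fetchFromAPI are hypothetical names rather than the provider's real identifiers, and the watch wiring that would call markDirty is omitted.

    package kubernetessecrets

    import "sync"

    // entry is one cached secret value plus the bookkeeping flags from the
    // proposal: old means "not referenced in the current pass", dirty means
    // "a watch event reported a change since the last fetch".
    type entry struct {
    	value string
    	old   bool
    	dirty bool
    }

    type providerCache struct {
    	mu      sync.Mutex
    	entries map[string]entry
    }

    // markDirty would be called from a Kubernetes watch/event handler whenever
    // the backing Secret changes.
    func (c *providerCache) markDirty(key string) {
    	c.mu.Lock()
    	defer c.mu.Unlock()
    	if e, ok := c.entries[key]; ok {
    		e.dirty = true
    		c.entries[key] = e
    	}
    }

    // beginPass marks every cached key as old before a configuration pass.
    func (c *providerCache) beginPass() {
    	c.mu.Lock()
    	defer c.mu.Unlock()
    	for k, e := range c.entries {
    		e.old = true
    		c.entries[k] = e
    	}
    }

    // fetch returns the cached value unless the key is new or dirty, in which
    // case fetchFromAPI (standing in for the existing API-server call) is used.
    func (c *providerCache) fetch(key string, fetchFromAPI func(string) (string, bool)) (string, bool) {
    	c.mu.Lock()
    	defer c.mu.Unlock()
    	e, found := c.entries[key]
    	if !found || e.dirty {
    		v, ok := fetchFromAPI(key)
    		if !ok {
    			return "", false
    		}
    		e = entry{value: v} // fresh value, clears the dirty flag
    	}
    	e.old = false // still referenced in this pass, keep it
    	c.entries[key] = e
    	return e.value, true
    }

    // endPass drops every key still marked old; their watch subscriptions
    // would be removed here as well.
    func (c *providerCache) endPass() {
    	c.mu.Lock()
    	defer c.mu.Unlock()
    	for k, e := range c.entries {
    		if e.old {
    			delete(c.entries, k)
    		}
    	}
    }

Each pass would then be beginPass, fetch for every referenced key, endPass; values only hit the API server when a key is new or its Secret actually changed.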

@barkbay
Contributor Author

barkbay commented Sep 22, 2023

@barkbay barkbay removed their assignment Sep 26, 2023
@jlind23
Contributor

jlind23 commented Oct 9, 2023

@bturquet after discussing with @michalpristas and team, it looks like this rather falls into your area.
Would you mind finding someone to take a stab at it? This is severely impacting serverless deployments.
cc @pierrehilbert @cmacknz

@bturquet

bturquet commented Oct 9, 2023

@constanca-m as you are on rotation this week, could you have a look, please?

@constanca-m
Contributor

Hi @barkbay, I noticed you opened this PR and closed it: #3464. Why was that? Is the intention of this issue to continue the work on that PR or to create a new PR to fix the performance issue?

@barkbay
Contributor Author

barkbay commented Oct 9, 2023

Why was that?

I wrongly assumed that my experience building K8s controllers might help; however, to solve this issue I think I would have to invest more time in understanding the Agent architecture, which is not possible given my current priorities.

Is the intention of this issue to continue the work on that PR or to create a new PR to fix the performance issue?

I think you are in a better position to answer this question 😄

@constanca-m
Contributor

I created a separate issue for this, #3594, and added it to our boards.

@jlind23 jlind23 added Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team and removed Team:Elastic-Agent Label for the Agent team labels Oct 12, 2023