
Default resources specified in Helm chart are insufficient for vulnerability processing #25566

Open · noahtalerman opened this issue Jan 17, 2025 · 8 comments
Labels
~customer request (A prioritized, customer feature request. Has ≥ 1 customer codename label(s)), prospect-interkosmos

Comments

@noahtalerman
Member

noahtalerman commented Jan 17, 2025

  • prospect-interkosmos: Opened a PR here: Helm Chart: Move vulnerability processing to be a cronjob by default #25488
  • @noahtalerman: User made this change because the documented default resources don't work. They're proposing to change the default so that the vulnerability processing job runs less frequently and therefore fits under the 1Gi memory ceiling.
    • @noahtalerman: Can we bump the default resources instead? Sounds like we have to find a balance between the job's frequency and its resource footprint.
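
For reference, a minimal sketch of what bumping the defaults could look like in the chart's values (the key names here are illustrative and may not match the chart's actual values.yaml):

```yaml
# Illustrative values.yaml override: raise the memory ceiling so the
# in-process vulnerability cron has headroom, instead of running it less often.
resources:
  limits:
    cpu: 1
    memory: 4Gi    # vuln processing reportedly needs ~4Gi depending on host count
  requests:
    cpu: 100m
    memory: 512Mi
```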

User stories

noahtalerman added the :product (Product Design department; shows up on 🦢 Drafting board) and ~feature fest (Will be reviewed at next Feature Fest) labels and removed the :product label Jan 17, 2025
noahtalerman self-assigned this Jan 17, 2025
noahtalerman added the :product label and removed the ~feature fest label Jan 17, 2025
@iansltx
Member

iansltx commented Jan 17, 2025

So, the issue here is that the vulnerabilities job has massively different compute requirements than normal API operation, no matter how often you run it. The Fleet app in general, sans vuln processing, runs comfortably in 1GB of RAM (there are caveats for some endpoints when you pull large result sets, but those can be worked around via pagination), but whenever vuln processing hits, it's both CPU-intensive (there are steps in the process that are CPU-bound) and RAM-intensive (4GB or more depending on host counts) relative to other use cases.

This is why our cloud environments turn the vuln processing cron off entirely for containers exposed to HTTP traffic and have a separate single-container tier where the vuln processing cron is kept on, though that architecture is as much about keeping vuln processing from being a noisy neighbor to web traffic as anything else, as IIRC the ECS tasks aren't sized differently (/cc @rfairburn in case I'm wrong here).

Now, you can just run your web tier at the same provisioning level as your vuln processing tier in a setup like the above, but that gets you HTTP-serving containers that are either overprovisioned (for lower host counts) or too coarsely provisioned (in a larger/autoscaled setup), when for everything other than vulns a smaller container would run Fleet just fine.
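
To make that tier split concrete, here's a rough sketch of how the web tier can opt out of vuln processing, assuming Fleet's vulnerabilities.disable_schedule setting (surfaced as the FLEET_VULNERABILITIES_DISABLE_SCHEDULE environment variable) behaves as described here; treat the exact spec as illustrative rather than what our IaC actually templates:

```yaml
# Illustrative web-tier container spec: vuln processing is disabled on this
# tier so it can keep the smaller sizing; a separate tier (or scheduled job)
# runs vuln processing with more memory.
containers:
  - name: fleet
    image: fleetdm/fleet:latest   # illustrative tag
    env:
      - name: FLEET_VULNERABILITIES_DISABLE_SCHEDULE
        value: "true"
    resources:
      limits:
        memory: 1Gi
```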

We're periodically doing work on making vuln processing less heavy, but it doesn't make sense to make folks wait for that.

The attached PR uses an explicit command line run rather than Fleet's internal background cron for vulnerability processing. While this absolutely works (I use it more often than not for refreshing vulns when debugging etc.), it doesn't have any concurrent execution prevention built in like the background "cron" does. This is why we don't use it in our cloud-hosted environments, and why @rfairburn mentioned concurrency checks in the PR.

@noahtalerman you mentioned grabbing some time to talk this through with Robert next week. Might be useful for either @mostlikelee or me to be on that call, since some issues may be easier to solve if we can touch app code rather than just IaC.

@noahtalerman
Member Author

@iansltx our best practice Helm chart spins up Fleet with 1GB of RAM, but Fleet with vuln processing calls for 4GB. Is that right?

@iansltx
Member

iansltx commented Jan 20, 2025

Yep, that matches what I'm seeing.

@rfairburn
Contributor

@noahtalerman since it looks like your calendar is pretty loaded up and you are out of the office for the second half of the week, I'll target Monday or Tuesday of next week for a 15-minute window and include @iansltx and @mostlikelee as optional attendees.

My thought is that we should include this but not have it enabled by default, especially since we have something similar for our Terraform AWS stuff (although implemented slightly differently).

Vuln processing is still a Fleet Premium-specific feature, right?

I'm happy to talk about all of this during a quick call though.

@pboushy
Contributor

pboushy commented Jan 23, 2025

I'd like to clarify that I didn't make the change in order to make it run less frequently. I thought the standard schedule was daily and set it to that, not realizing it was hourly. I've remedied that.

It's true that the default memory for Fleet in Kubernetes is insufficient for running vulnerability processing.
But making every Fleet container allow 4Gi is inefficient.

Because Kubernetes has built-in functionality to run a task on a set schedule, it made more sense IMO to set up a separate CronJob in Kubernetes that uses 4Gi of memory for a short period, while leaving the other pods at 1Gi unless people need to bump those higher.

As an example, the change I made spins up a pod that runs for ~2 minutes and eats up 4Gi of memory, so I now use ~6Gi of memory at peak to run my Fleet (two 1Gi web pods plus the transient 4Gi job) vs. ~8Gi if I bump the allowed memory on both web containers.

> This is why our cloud environments turn the vuln processing cron off entirely for containers exposed to HTTP traffic and have a separate single-container tier where the vuln processing cron is kept on, though that architecture is as much about keeping vuln processing from being a noisy neighbor to web traffic as anything else, as IIRC the ECS tasks aren't sized differently (/cc @rfairburn in case I'm wrong here).

That is precisely what my PR does in the helm chart.

> While this absolutely works (I use it more often than not for refreshing vulns when debugging etc.), it doesn't have any concurrent execution prevention built in like the background "cron" does. This is why we don't use it in our cloud-hosted environments, and why @rfairburn mentioned concurrency checks in the PR.

That should be fixed now.
NOTE: even without the concurrency guard, when the CronJob kicks off a job, the job does appear to see the lock in the database and just immediately errors out (I ran into this today and had to hunt down the MySQL command to see what was happening).
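
For anyone following along, a minimal sketch of the shape my CronJob change takes (simplified from what the chart templates; the image tag, and whether the subcommand is spelled exactly `fleet vuln_processing`, are assumptions here, so double-check against the PR):

```yaml
# Illustrative Kubernetes CronJob: runs Fleet's vulnerability processing
# hourly in its own short-lived pod with a 4Gi ceiling, while the web pods
# stay at 1Gi.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: fleet-vuln-processing
spec:
  schedule: "0 * * * *"        # hourly, matching Fleet's internal default
  concurrencyPolicy: Forbid    # don't start a new run while one is in flight
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: fleet-vuln-processing
              image: fleetdm/fleet:latest   # illustrative tag
              command: ["fleet", "vuln_processing"]   # assumed subcommand
              resources:
                limits:
                  memory: 4Gi
```

concurrencyPolicy: Forbid handles overlap at the Kubernetes level, and, per the note above, Fleet's database lock appears to act as a second backstop if a run does sneak in.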

> My thought is that we should include this but not have it enabled by default.

If this is the only remaining item to get the PR approved, I'm glad to change that. I was worried that being a "breaking change" might prevent this one from being approved.

@mostlikelee
Contributor

> Vuln processing is still a Fleet Premium-specific feature, right?

Vuln processing is a free feature. Additional vuln metadata (EPSS scores, etc.) is premium.

@noahtalerman
Member Author

> I'd like to clarify that I didn't make the change in order to make it run less frequently. I thought the standard schedule was daily and set it to that, not realizing it was hourly. I've remedied that.

> If this is the only remaining item to get the PR approved, I'm glad to change that. I was worried that being a "breaking change" might prevent this one from being approved.

Thanks for making the changes and the clarification @pboushy!

@lukeheath @rfairburn and I are taking a look at your PR later today.

@noahtalerman
Member Author

@rfairburn here's the user story tracking the work to help get @pboushy's PR merged.

FYI @zayhanlon: because this is an infra change, I added this user story to the customer request board and assigned Robert.

Sorry for the delay, @pboushy. Thanks for bearing with us!

@noahtalerman noahtalerman removed their assignment Jan 28, 2025
noahtalerman added the ~customer request and prospect-interkosmos labels and removed the :product label Jan 28, 2025