
Default resources specified in Helm chart are insufficient for vulnerability processing #25566

Open · noahtalerman opened this issue Jan 17, 2025 · 8 comments
Labels
~customer request (A prioritized, customer feature request. Has ≥ 1 customer codename label(s)), prospect-interkosmos

Comments

@noahtalerman
Member

noahtalerman commented Jan 17, 2025

  • prospect-interkosmos: Opened a PR here: Helm Chart: Move vulnerability processing to be a cronjob by default #25488
  • @noahtalerman: User made this change because the documented default resources don't work. They're proposing to change the default so that the vulnerability processing job runs less frequently and therefore fits under the 1Gi memory ceiling.
    • @noahtalerman: Can we bump the default resources instead? Sounds like we have to find a balance between the job's frequency and its resource footprint.
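
For reference, a minimal sketch of what bumping the defaults could look like in the chart's values (the key names here are illustrative and may not match the chart's actual values.yaml):

```yaml
# Illustrative values.yaml override: raise the memory ceiling so the
# in-process vulnerability cron has headroom, instead of running it less often.
resources:
  limits:
    cpu: 1
    memory: 4Gi    # vuln processing reportedly needs ~4Gi depending on host count
  requests:
    cpu: 100m
    memory: 512Mi
```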

User stories

noahtalerman added the :product (Product Design department; shows up on 🦢 Drafting board) and ~feature fest (Will be reviewed at next Feature Fest) labels and removed the :product label Jan 17, 2025
noahtalerman self-assigned this Jan 17, 2025
noahtalerman added the :product label and removed the ~feature fest label Jan 17, 2025
@iansltx
Member

iansltx commented Jan 17, 2025

So, the issue here is that the vulnerabilities job has massively different compute requirements than normal API operation, no matter how often you run it. The Fleet app in general, sans vuln processing, runs comfortably in 1GB of RAM (there are caveats for some endpoints when you pull large result sets, but those can be worked around via pagination), but whenever vuln processing hits, it's both CPU-intensive (there are steps in the process that are CPU-bound) and RAM-intensive (4GB or more depending on host counts) relative to other use cases.

This is why our cloud environments turn the vuln processing cron off entirely for containers exposed to HTTP traffic and have a separate single-container tier where the vuln processing cron is kept on, though that architecture is as much about keeping vuln processing from being a noisy neighbor to web traffic as anything else, as IIRC the ECS tasks aren't sized differently (/cc @rfairburn in case I'm wrong here).

Now, you can just run your web tier at the same provisioning level as your vuln processing tier in a setup like the above, but that gets you HTTP-serving containers that are either overprovisioned (for lower host counts) or too coarsely provisioned (in a larger/autoscaled setup), when for everything other than vulns a smaller container would run Fleet just fine.
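
To make that tier split concrete, here's a rough sketch of how the web tier can opt out of vuln processing, assuming Fleet's vulnerabilities.disable_schedule setting (surfaced as the FLEET_VULNERABILITIES_DISABLE_SCHEDULE environment variable) behaves as described here; treat the exact spec as illustrative rather than what our IaC actually templates:

```yaml
# Illustrative web-tier container spec: vuln processing is disabled on this
# tier so it can keep the smaller sizing; a separate tier (or scheduled job)
# runs vuln processing with more memory.
containers:
  - name: fleet
    image: fleetdm/fleet:latest   # illustrative tag
    env:
      - name: FLEET_VULNERABILITIES_DISABLE_SCHEDULE
        value: "true"
    resources:
      limits:
        memory: 1Gi
```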

We're periodically doing work on making vuln processing less heavy, but it doesn't make sense to make folks wait for that.

The attached PR uses an explicit command line run rather than Fleet's internal background cron for vulnerability processing. While this absolutely works (I use it more often than not for refreshing vulns when debugging etc.), it doesn't have any concurrent execution prevention built in like the background "cron" does. This is why we don't use it in our cloud-hosted environments, and why @rfairburn mentioned concurrency checks in the PR.

@noahtalerman you mentioned grabbing some time to talk this through with Robert next week. Might be useful for either @mostlikelee or me to be on that call, since some issues may be easier to solve if we can touch app code rather than just IaC.

@noahtalerman
Member Author

@iansltx our best practice Helm chart spins up Fleet with 1GB of RAM, but Fleet with vuln processing calls for 4GB. Is that right?

@iansltx
Member

iansltx commented Jan 20, 2025

Yep, that matches what I'm seeing.

@rfairburn
Contributor

@noahtalerman since it looks like your calendar is pretty loaded up and you are out of the office for the second half of the week, I'll target Monday or Tuesday of next week for a 15-minute window and include @iansltx and @mostlikelee as optional attendees.

My thought is that we should include this but not have it enabled by default, especially since we have something similar for our Terraform AWS stuff (although implemented slightly differently).

Vuln processing is still a Fleet Premium-specific feature, right?

I'm happy to talk about all of this during a quick call though.

@pboushy
Contributor

pboushy commented Jan 23, 2025

I'd like to clarify that I didn't make the change in order to make it run less frequently. I thought the standard schedule was daily and set it to that, not realizing it was hourly. I've remedied that.

It's true that the default memory for Fleet in Kubernetes is insufficient for running vulnerability processing.
But making every Fleet container allow 4Gi is inefficient.

Because Kubernetes has built-in functionality to run a task on a set schedule, it made more sense IMO to set up a separate CronJob in Kubernetes that uses 4Gi of memory for a short period, while leaving the other pods at 1Gi unless people need to bump those higher.

As an example, the change I made spins up a pod that runs for ~2 minutes and eats up 4Gi of memory, so I now use ~6Gi of memory at peak to run my Fleet (two 1Gi web pods plus the transient 4Gi job) vs. ~8Gi if I bump the allowed memory on both web containers.

> This is why our cloud environments turn the vuln processing cron off entirely for containers exposed to HTTP traffic and have a separate single-container tier where the vuln processing cron is kept on, though that architecture is as much about keeping vuln processing from being a noisy neighbor to web traffic as anything else, as IIRC the ECS tasks aren't sized differently (/cc @rfairburn in case I'm wrong here).

That is precisely what my PR does in the helm chart.

> While this absolutely works (I use it more often than not for refreshing vulns when debugging etc.), it doesn't have any concurrent execution prevention built in like the background "cron" does. This is why we don't use it in our cloud-hosted environments, and why @rfairburn mentioned concurrency checks in the PR.

That should be fixed now.
NOTE: even without the concurrency guard, when the CronJob kicks off a job, the job does appear to see the lock in the database and just immediately errors out (I ran into this today and had to hunt down the MySQL command to see what was happening).
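
For anyone following along, a minimal sketch of the shape my CronJob change takes (simplified from what the chart templates; the image tag, and whether the subcommand is spelled exactly `fleet vuln_processing`, are assumptions here, so double-check against the PR):

```yaml
# Illustrative Kubernetes CronJob: runs Fleet's vulnerability processing
# hourly in its own short-lived pod with a 4Gi ceiling, while the web pods
# stay at 1Gi.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: fleet-vuln-processing
spec:
  schedule: "0 * * * *"        # hourly, matching Fleet's internal default
  concurrencyPolicy: Forbid    # don't start a new run while one is in flight
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: fleet-vuln-processing
              image: fleetdm/fleet:latest   # illustrative tag
              command: ["fleet", "vuln_processing"]   # assumed subcommand
              resources:
                limits:
                  memory: 4Gi
```

concurrencyPolicy: Forbid handles overlap at the Kubernetes level, and, per the note above, Fleet's database lock appears to act as a second backstop if a run does sneak in.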

> My thought is that we should include this but not have it enabled by default.

If this is the only remaining item to get the PR approved, I'm glad to change that. I was worried that being a "breaking change" might prevent this one from being approved.

@mostlikelee
Contributor

> Vuln processing is still a Fleet Premium-specific feature, right?

Vuln processing is a free feature. Additional vuln metadata (EPSS scores, etc.) is premium.

@noahtalerman
Member Author

> I'd like to clarify that I didn't make the change in order to make it run less frequently. I thought the standard schedule was daily and set it to that, not realizing it was hourly. I've remedied that.

> If this is the only remaining item to get the PR approved, I'm glad to change that. I was worried that being a "breaking change" might prevent this one from being approved.

Thanks for making the changes and the clarification @pboushy!

@lukeheath @rfairburn and I are taking a look at your PR later today.

@noahtalerman
Member Author

@rfairburn here's the user story tracking the work to help get @pboushy's PR merged.

FYI @zayhanlon: because this is an infra change, I added this user story to the customer request board and assigned Robert.

Sorry for the delay, @pboushy. Thanks for bearing with us!

@noahtalerman noahtalerman removed their assignment Jan 28, 2025
noahtalerman added the ~customer request and prospect-interkosmos labels and removed the :product label Jan 28, 2025