-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Default resources specified in Helm chart are insufficient for vulnerability processing #25566
Comments
So, the issue here is that the vulnerabilities job has massively different compute requirements than the normal API operation, no matter how often you run it. The Fleet app in general, sans vuln processing, runs comfortably in 1GB of RAM (there are caveats for some endpoints when you pull large result sets, but those can be worked around via pagination), but whenever vuln processing hits it's both CPU-intensive (there are steps in the process that are CPU-bound) and RAM-intensive (4GB or more depending on host counts) relative to other use cases. This is why our cloud environments turn the vuln processing cron off entirely for containers exposed to HTTP traffic, and have a separate single-container tier where the vuln processing cron is kept on, though that architecture is as much for keeping vuln processing being a noisy neighbor vs. web traffic as anything else as IIRC the ECS tasks aren't sized differently (/cc @rfairburn in case I'm wrong here). Now, you can just run your web tier at the same provisioning level as your vuln processing tier in a setup like the above, but that gets you HTTP-serving containers that are either overprovisioned (for lower host counts) or too coarsely provisioned (in a larger/autoscaled setup), when for everything other than vulns a smaller container would run Fleet just fine. We're periodically doing work on making vuln processing less heavy, but it doesn't make sense to make folks wait for that. The attached PR uses an explicit command line run rather than Fleet's internal background cron for vulnerability processing. While this absolutely works (I use it more often than not for refreshing vulns when debugging etc.), it doesn't have any concurrent execution prevention built in like the background "cron" does. This is why we don't use it in our cloud-hosted environments, and why @rfairburn mentioned concurrency checks in the PR. @noahtalerman you mentioned grabbing some time to talk this through with Robert next week. Might be useful for either @mostlikelee or I to be on that call since some issues may be easier to solve if we can touch app code rather than just IaC. |
@iansltx our best practice Helm chart spins up Fleet with 1GB RAM but Fleet with vuln processing calls for 4GB. Is that right? |
Yep, that matches what I'm seeing. |
@noahtalerman since it looks like your calendar is pretty loaded up and you are out of the office for the 2nd half of the week, I'll target Monday or Tuesday of next week for a 15-minute windows and include @iansltx and @mostlikelee as optional people. My thoughts are that we should included this but not have it enabled by default. Especially since we have something similar created for our terraform aws stuff (although implemented slightly differently). Vuln processing is still a Fleet Premium specific feature right? I'm happy to talk about all of this during a quick call though. |
I'd like to clarify that I didn't make the change to make it run less frequently. I thought the standard process was daily and set it to that not realizing it was hourly. I've remedied that. It's true, the default memory for fleet in kubernetes is insufficient for running vulnerability processing. Because kubernetes has functionality to run a task on a set schedule it made more sense IMO to setup a separate cronjob in kubernetes and uses 4Gi of memory for a short period while leaving the other pods to 1Gi unless people need to bump those higher. As an example, the change I made makes a pod run for ~2 minutes and eats up 4Gi of memory, so I now use ~6Gi of Memory to run my Fleet VS ~8Gi if I bump the allowed memory for both web containers.
That is precisely what my PR does in the helm chart.
That should be fixed now.
If this is the only remaining item to get the PR approved, I'm glad to change that. I was worried that being a "breaking change" might prevent this one being approved. |
Vuln processing is a free feature. Additional vuln metadata (EPSS scores, etc) are premium. |
Thanks for making the changes and the clarification @pboushy! @lukeheath @rfairburn and I are taking a look at your PR later today. |
@rfairburn here's the user story tracking the work to help get @pboushy's PR merged. FYI @zayhanlon because this is an infra change, I added this user story to the customer request board and assigned Robert. Sorry for the delay @@pboushy thanks for bearing with us! |
prospect-interkosmos
: Opened a PR here: Helm Chart: Move vulnerability processing to be a cronjob by default #25488User stories
The text was updated successfully, but these errors were encountered: