Add resource requests and limits management #127
base: main
Conversation
Thank you Julien, my points at the end of my last comment in issue #102 about Java and memory limits still apply and need to be addressed before this PR can be merged. There is a real need to investigate: the last thing I want is apps that restart every hour because the app doesn't limit its own memory and gets OOMKilled. The tests are probably failing because you need to use `ct lint`.
Not sure I understand how to test this. Does it imply modifying the Docker images used? As explained in the README update, the default values are provided for testing purposes; for production, they have to be tuned depending on platform usage and traffic.
I tried `helm lint` but nothing is reported as wrong.
You have multiple ways to test that. The main thing to check is whether Java is going to limit its memory usage. So you can either set a low memory limit and/or stress-test the application so that it uses as much memory as possible, then see if it reacts correctly without any OOMKill.
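For instance, a low limit can be forced through a values override. Here is a hypothetical sketch, assuming the PR exposes a per-app `resources` block (the exact key names may differ from the chart):

```yaml
# stress-values.yaml -- hypothetical override for testing memory limiting;
# the geoserver.resources path is an assumption about this PR's values layout.
geoserver:
  resources:
    requests:
      memory: "512Mi"
    limits:
      # deliberately low, to check that the JVM stays under the cgroup
      # limit under load instead of getting OOMKilled
      memory: "512Mi"
# apply with something like:
#   helm upgrade georchestra . -f values.yaml -f stress-values.yaml
# then stress the app and watch for restarts with: kubectl get pods -w
```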
Ha, yes indeed, it's not `helm lint` but `ct lint`. You have the output here: https://github.com/georchestra/helm-georchestra/actions/runs/12882804969/job/35915756058?pr=127
I investigated JVM parameters to use with containers a little; here is what I found.
Sources:
Summary: we should remove `-Xmx` and `-Xms` and rely on a container-aware setting such as `-XX:MaxRAMPercentage` instead.
Thank you for the details @jusabatier, I wasn't aware of that. Actually, we have an open issue for removing `-Xmx` and `-Xms` here: georchestra/georchestra#4354
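As a rough sketch of that direction (values illustrative, not chart defaults): instead of fixed heap sizes baked into each image, the JVM can derive its heap from the cgroup memory limit, so tuning the Kubernetes limit also tunes the heap.

```yaml
# Hypothetical container env for one of the Java apps. JAVA_TOOL_OPTIONS is
# honored by any recent JVM; 75.0 is an assumed value, not a chart default
# (the JVM's own default for MaxRAMPercentage is 25.0).
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"
```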
Yes, very good idea, this will be a requirement for merging the PR. We can continue this discussion at georchestra/georchestra#4354. Is that the default?
I don't think so, but if for whatever reason the default changes, it's safer to set it explicitly. In the article, they also suggest it can be beneficial to set `-XX:ActiveProcessorCount`, depending on how the app uses threading.
I read an article @pmauduit shared with me on IRC, and they suggest it's better to set CPU requests but no CPU limits.
Source: https://home.robusta.dev/blog/stop-using-cpu-limits
Do you think we should remove the CPU limits?
If it says that it's the default, we shouldn't override the default with the same value; that would be confusing.
After reading paketo-buildpacks/libjvm#136 (comment), we can add `ActiveProcessorCount`, but only for the apps still on Java 8/11.
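For those apps, the wiring could look like the following sketch (the flag is a standard HotSpot option; the value and the env-based delivery are assumptions):

```yaml
# Hypothetical env entry for apps still on Java 8/11, where the JVM may
# derive its processor count from CPU shares; pinning it keeps thread-pool
# sizing predictable. The value 2 is illustrative.
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:ActiveProcessorCount=2"
```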
As of right now, we recommend deploying the Helm chart only on a single node in a Kubernetes cluster. Requests are useful when you have multiple nodes in the cluster, so that Kubernetes understands how to spread the different apps across the nodes. So in our case, requests are useless (even though we are still required to set them for limiting the resources). I'm still in favor of having limits because some apps like geoserver are very resource hungry, and in all of our clients' infrastructures, if these apps consume too many resources it leads to a slow-down of the other apps, which is not ideal. So for users running the default settings, we establish the good practice of restricting the apps from consuming all the resources and causing a general slow-down of the geOrchestra platform.
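In other words, on a single node the limits are the guardrail while the requests stay low. A hypothetical sketch (names and numbers illustrative, not the chart's defaults):

```yaml
geoserver:
  resources:
    requests:   # scheduling barely matters on a single node
      cpu: "100m"
      memory: "1Gi"
    limits:     # the actual guardrail against starving the other apps
      cpu: "4"
      memory: "4Gi"
```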
No idea whether we should add this one or not. Any ideas @pmauduit?
I didn't know about that. Is there a specific reason that forces deployment on a single node? Also, I wonder if the current values aren't too high for testing purposes?
Well, if someone is using the Helm chart in Kubernetes, it's to deploy geOrchestra in production. For testing geOrchestra in a test environment, we have the Docker composition: https://github.com/georchestra/docker. So I would be inclined to keep building this Helm chart for production purposes; at least, that's why Camptocamp develops it. In my opinion, the default values should reflect what an average production environment might be, but we can for sure guide people on adapting these limits.
We just haven't tested it. At Camptocamp we rely on the fact that it's a single node because it makes managing the data much easier: we have a central SFTP server to access all the various PVCs created by the modules. Usually, in a multi-node scenario, you need NFS so that the data is accessible from any server in the cluster, and NFS is slower than accessing the data from the local drive because it goes over the network. While you can use Longhorn to move the data from one server to another, there is still a performance hit. Since the modules weren't designed with Kubernetes or multi-server setups in mind, it's possible you would get worse performance than with a single server. Finally, the majority of the modules can't be highly available, so you won't benefit from multiple servers compared to a single server.
This PR adds the ability to set CPU and RAM requests and limits for each app.
Related issue: #102
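For context, a minimal sketch of how such per-app values are commonly wired into a deployment template in Helm (names are illustrative, not necessarily this PR's exact layout):

```yaml
# In an app's deployment template: renders the per-app resources block
# from values.yaml into the container spec whenever it is set.
containers:
  - name: geoserver
    image: "{{ .Values.geoserver.image }}"
    {{- with .Values.geoserver.resources }}
    resources:
      {{- toYaml . | nindent 6 }}
    {{- end }}
```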