CGroup Handling Enhancements #507
Conversation
@baude If you get a chance, can you validate that this doesn't break our stats code? I'm pretty sure it's still working, but you've worked with it more than I have.
```go
case SystemdCgroupsManager:
    if ctr.config.CgroupParent == "" {
        ctr.config.CgroupParent = SystemdDefaultCgroupParent
    } else if len(ctr.config.CgroupParent) < 6 || !strings.HasSuffix(path.Base(ctr.config.CgroupParent), ".slice") {
```
I know just enough about cgroups to be dangerous. Should ".slice" here be "system.slice"? Couldn't a string like cgroups.slice satisfy this check? I'm not sure you'd want that. @rhatdan?
I'm pretty sure that anything ending in .slice is systemd-managed, which is why we want to avoid such paths when using cgroupfs - but I'm not 100% sure. For what it's worth, this code is basically identical to CRI-O's validation.
Yes, .slice means systemd.
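For reference, a minimal self-contained sketch of the two-manager validation being discussed, extrapolated from the diff above and the CRI-O checks it mirrors; the cgroupfs constants and the error wording here are assumptions for illustration, not the merged libpod code:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

const (
	SystemdCgroupsManager  = "systemd"
	CgroupfsCgroupsManager = "cgroupfs"

	SystemdDefaultCgroupParent  = "system.slice"
	CgroupfsDefaultCgroupParent = "/libpod_parent" // assumed default for the cgroupfs manager
)

// validateCgroupParent applies the CRI-O-style rules discussed above:
// systemd parents must be slice units, cgroupfs parents must not be.
func validateCgroupParent(manager, parent string) (string, error) {
	switch manager {
	case SystemdCgroupsManager:
		if parent == "" {
			return SystemdDefaultCgroupParent, nil
		}
		// ".slice" is 6 characters, hence the minimum-length check
		if len(parent) < 6 || !strings.HasSuffix(path.Base(parent), ".slice") {
			return "", fmt.Errorf("systemd cgroup manager requires a parent ending in .slice, got %q", parent)
		}
	case CgroupfsCgroupsManager:
		if parent == "" {
			return CgroupfsDefaultCgroupParent, nil
		}
		// a .slice parent is systemd-managed and must not be used with cgroupfs
		if strings.HasSuffix(path.Base(parent), ".slice") {
			return "", fmt.Errorf("cgroupfs manager cannot use systemd slice %q as a parent", parent)
		}
	default:
		return "", fmt.Errorf("unsupported cgroup manager %q", manager)
	}
	return parent, nil
}

func main() {
	fmt.Println(validateCgroupParent(SystemdCgroupsManager, ""))               // system.slice <nil>
	fmt.Println(validateCgroupParent(CgroupfsCgroupsManager, "machine.slice")) // error
}
```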
Seeing some issues with cgroup parent, and a 'podman run' flag. Chasing them down.
Fixed CGroup parent. Unclear if the 'podman run' issue is actually a bug or a flake.
bot, retest this please
2 similar comments
bot, retest this please
bot, retest this please
is this going to break running a container under podman from a systemd unit like the one below, or would we just need to delegate differently?
i've been using it, and it's working pretty well so far:

i'm using
@nathwill I've been digging around under the hood, and I'm fairly certain our cgroup handling for systemd-managed cgroups is very wrong right now - we're inserting ourselves into their hierarchy without telling systemd. It seems to be working, but it is definitely not the right way of doing things and is very fragile. This patch as written would break using the systemd cgroup hierarchy in an effort to fix cgroupfs, but I think we can get both working with a bit more code. I'll mark this WIP and take a stab at it on Monday. Be aware that you'll probably need to modify the configuration file or set a flag to use the systemd cgroup manager after these changes land, though. The config change should be a one-liner, but we should call it out in the manpage/readme.
An update on this: I've enabled proper CGroup handling via the runc systemd cgroup manager, based on the CRI-O code for the same. However, CRI-O places Conmon and the container's own processes in two separate CGroups. This doesn't seem like a particularly nice way of doing things (we get everything under the same CGroup with cgroupfs), so I'm going to talk to the CRI-O devs to see if there is a reason things are done that way.
bot, retest this please
1 similar comment
bot, retest this please
fwiw, i tested this out, and things seem to work as intended 👏. i was also able to get it working from a systemd service unit, but wanted to share a couple of observations on the experience.

cgroupfs: this PR doesn't seem to have broken the method i had been using above (i.e. using cgroupfs as the cgroup-manager and specifying the cgroup-parent as e.g.

systemd: when specifying systemd as the cgroup-manager, i'm not able to specify the systemd-generated service cgroup (e.g. when doing so, podman creates
as a result, the conmon/container processes are untracked by the service unit, and in order for the service to successfully start, you have to do one of:

which is more like the

in short, my preference given the above would be to continue using cgroupfs. if the sole concern with cgroupfs is:

would specifying

thanks in advance for your time and consideration! 🙏
@nathwill I was talking with some of the CRI-O team, and it appears we're missing some CGroup generation logic. We should be grouping the two scopes (conmon and the rest of the container) under a service we create under the parent (individual .service for containers outside a pod, or a shared one for containers in a pod). We're missing a step in the current implementation I pushed (creating that service). Would this work better for what you're doing? You'd lose the ability to name the .service you drop in
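To make the grouping described here concrete, a rough sketch of the hierarchy in question; the unit names are hypothetical placeholders, not names libpod actually generates:

```
<cgroup parent, e.g. machine.slice>
└── libpod-<pod or container ID>.service    (service created by libpod)
    ├── libpod-conmon-<container ID>.scope  (conmon)
    └── libpod-<container ID>.scope         (the container's own processes)
```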
Also, this still has some test failures related to kernel memory limits, specifically on CentOS. Need to look into that.
That would definitely break how I'm using it, as the service cgroup name is prescribed by systemd from the unit name and is relied on heavily by systemd for tracking the service's processes. i totally understand the need for grouping in the pod use case, and wouldn't expect this to work for grouping unit-file-launched containers into the same pod, but it'd be super helpful to have an option to specify the service cgroup name for non-pod containers, in order to preserve the ability to run containers from service units and have all the resulting container processes tracked by systemd. if the plan is to put all containers into pods by default, could the service cgroup be more directly based on the pod name when provided (e.g.
```go
// SystemdDefaultCgroupParent is the cgroup parent for the systemd cgroup
// manager in libpod
const SystemdDefaultCgroupParent = "system.slice"
```
any particular reason for using system.slice here? my understanding from the reading i've done was that machine.slice is the preferred target for containers ("machines" being interpreted here as referring to processes launched from rootfs tarballs like OCI or nspawn supported formats, instead of normal "contained" system/user processes).
my worry with a system.slice default is namespace conflicts with system services running under system.slice. e.g. if my pod scheduler is running etcd.service on the host, and receives an etcd pod as part of an application deployment, do they both end up under /system.slice/etcd.service? since machine.slice is unpopulated by default and intended as a home for containers, it seems like a preferable default cgroup parent, unless i'm missing something?
edit: i should add, the above is just a matter of curiosity, as i've seen several container runtimes do this, and not understood the reasoning for ignoring machine.slice.
Hm. We can probably wire up an option to allow setting the service name (or potentially detect if there is a
@mrunalp PTAL
☔ The latest upstream changes (presumably 3f5da4d) made this pull request unmergeable. Please resolve the merge conflicts.
fwiw, in the latest podman, using --conmon-pidfile when running podman containers as systemd services is working quite well, even without specifying --cgroup-parent. hopefully that simplifies things a bit :)
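For anyone finding this later, a minimal sketch of that pattern; the unit name, container name, image, and paths are made up for illustration, and only the --conmon-pidfile flag comes from the comment above - the Type=forking/PIDFile pairing is one common way to wire that pidfile into systemd's process tracking:

```ini
# /etc/systemd/system/mycontainer.service -- illustrative sketch only
[Unit]
Description=Example container run via podman
After=network.target

[Service]
Type=forking
PIDFile=/run/mycontainer-conmon.pid
ExecStartPre=-/usr/bin/podman rm -f mycontainer
ExecStart=/usr/bin/podman run -d --name mycontainer \
    --conmon-pidfile /run/mycontainer-conmon.pid \
    fedora:28 sleep infinity
ExecStop=/usr/bin/podman stop mycontainer

[Install]
WantedBy=multi-user.target
```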
☔ The latest upstream changes (presumably 39a7a77) made this pull request unmergeable. Please resolve the merge conflicts.
@mheon needs a rebase.
Alright. This should fix cgroupfs CGroup handling, which is increasingly becoming a problem. Systemd-based cgroups are still broken, but can be fixed later. I'm going to try and push this through today.
Cleanup code implemented. Assuming tests are green, this is finally good for review and merge. Systemd cgroup manager code is next.
Signed-off-by: Matthew Heon <[email protected]>
Alright, this is finally ready. @rhatdan PTAL
LGTM, but would like @rhatdan to take a look see.
@@ -23,6 +23,9 @@ has the capability to debug pods/images created by crio.
**--help, -h**
Print usage statement

**--cgroup-manager**
A blog showing how this works would probably be nice once this hits the streets.
I am still seeing doubled/tripled/quadrupled cgroup scopes with a container + bind mounts. By the second or third start/stop cycle there are 86(!) mounts of this form, but only 4 are distinct mounts. mount reports 86 cgroup mounts on
Of these 86 mounts only 4 are unique.
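As a quick way to reproduce that count, a small stand-alone sketch (written for this thread, not part of podman) that tallies total versus distinct cgroup mounts from /proc/mounts:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	total := 0
	unique := map[string]struct{}{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts fields: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 3 && fields[2] == "cgroup" {
			total++
			unique[fields[1]] = struct{}{}
		}
	}
	fmt.Printf("cgroup mounts: %d total, %d distinct mount points\n", total, len(unique))
}
```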
@aalba6675 I strongly suspect the bug there (and the required fix) are in
@aalba6675 could you try without using a container that runs init, and see if the issue goes away. I believe that there is a bug in OCI hooks in runc that is not calling oci-systemd-hook in the posthook phase. This is why the cgroup hooks are being left behind.
@rhatdan - yes, it goes away: running a /bin/bash container with the fedora:28 image, there is no cgroup cruft left behind. BTW - sorry for spamming the

There are also no mounts of type cgroup on /sys/fs/cgroup/systemd/libpod_parent/.... When the container is stopped, all the

Running lscgroup during
The lack of mounts is expected - we need those for containers running

This seems to confirm it's a hook, though. The theory from @rhatdan that this is a postrun hook failing to fire and clean things up makes more sense than my initial theory of a bug in the hook itself, but we can't say for sure without looking into it more.
LGTM
📌 Commit eb4c75a has been approved by
Signed-off-by: Matthew Heon <[email protected]> Closes: #507 Approved by: baude
Until we get Systemd cgroup manager working, this will cause a validation error. Signed-off-by: Matthew Heon <[email protected]> Closes: #507 Approved by: baude
☀️ Test successful - status-papr
Add validation to CGroup parents passed into libpod. If we are using the systemd CGroup manager, the basename must end in `.slice`; if we are using cgroupfs, the path cannot contain a `.slice` cgroup. This is based on the CRI-O validation code located at https://github.com/kubernetes-incubator/cri-o/blob/master/server/sandbox_run.go#L379-L400

Also, add our CGroup path to the OCI spec, passing it into `runc`. When the container is deleted in `runc`, it will clean up configured CGroups, solving our problem of leaving CGroup scopes around after containers exit.

Closes: #497
Closes: #496
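As an illustration of that second part, a small sketch of the two cgroupsPath formats that could be handed to runc through the OCI spec's linux.cgroupsPath field; the "libpod" prefix and the cgroupfs layout are assumptions for illustration, not the merged code (in practice libpod would presumably set the value on its generated spec, e.g. via the runtime-tools generator):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// cgroupsPath builds the value to place in the OCI spec so that runc creates
// the container's cgroup there and removes it again when the container is
// deleted.
func cgroupsPath(manager, parent, ctrID string) string {
	if manager == "systemd" {
		// runc's systemd cgroup driver expects "slice:prefix:name".
		return fmt.Sprintf("%s:libpod:%s", parent, ctrID)
	}
	// cgroupfs: a plain path under the configured cgroup parent.
	return filepath.Join(parent, ctrID)
}

func main() {
	fmt.Println(cgroupsPath("systemd", "machine.slice", "deadbeef"))   // machine.slice:libpod:deadbeef
	fmt.Println(cgroupsPath("cgroupfs", "/libpod_parent", "deadbeef")) // /libpod_parent/deadbeef
}
```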