ci: Attempt to fix build (update py-gi-docgen) #651

Merged
merged 2 commits into from
Aug 18, 2024

Conversation

jayaddison
Contributor

Attempts to resolve Cirrus continuous integration failures observed for recent pull requests / commits.

@jayaddison
Contributor Author

There is/was something else going on with daps on Debian testing for other recent commits, but since tests have passed here, I'll move this out of draft status and into ready-for-review.

@jayaddison jayaddison marked this pull request as ready for review August 15, 2024 18:57
@ximion
Owner

ximion commented Aug 17, 2024

I'll merge this, but is there any way we can prevent this from breaking again with the next Python update?
Is there some metapackage maybe that just says "give me the latest Python"?

@jayaddison
Contributor Author

I'm not super familiar with ports packages, so I'm not completely sure - but it does seem from the description on FreshPorts that textproc/py-gi-docgen could be that (I think the prefix textproc/ part is required).

@ximion
Owner

ximion commented Aug 18, 2024

Can you test/add that? I don't have much knowledge about FreeBSD in particular, but it would be nice to not run into this issue every so often.

You can ignore the failure on Debian Testing, I'll be looking into that later (it's a weird one, and definitely caused by some package update in Testing).

@jayaddison
Contributor Author

> Can you test/add that? I don't have much knowledge about FreeBSD in particular, but it would be nice to not run into this issue every so often.

Sure - just so that you're aware: I'm relying on the GitHub Actions continuous integration results to test the change; I'll push a commit to attempt the generic versionless package name in a few moments.

@jayaddison
Contributor Author

> > Can you test/add that? I don't have much knowledge about FreeBSD in particular, but it would be nice to not run into this issue every so often.
>
> Sure - just so that you're aware: I'm relying on the GitHub Actions continuous integration results to test the change; I'll push a commit to attempt the generic versionless package name in a few moments.

Ah, oops: Cirrus CI in this case (nitpicking my description; it's a GitHub CI check.. but run via Cirrus CI).

@ximion
Owner

ximion commented Aug 18, 2024

This looks good! Thank you! :-)

@ximion ximion merged commit 9501ce0 into ximion:main Aug 18, 2024
9 of 10 checks passed
@jayaddison
Contributor Author

You're welcome - thanks!

@jayaddison jayaddison deleted the ci/20240815-dependency-fixup branch August 18, 2024 17:42
@jayaddison
Contributor Author

> You can ignore the failure on Debian Testing, I'll be looking into that later (it's a weird one, and definitely caused by some package update in Testing).

The change to disable seccomp seems OK - I think I might spend some time to try to figure out exactly why the tar / untar process encountered a permissions problem there (no guarantee I'll find anything, but I'll let you know if I do).

@ximion
Owner

ximion commented Aug 19, 2024

It's caused by stricter seccomp rules in newer Podman versions. I haven't tested which ones cause this issue though, and since this is not security relevant at all for how we use Podman, just disabling seccomp was the easy and quick way out.

@jayaddison
Contributor Author

Yep, I just think it might be useful to understand with a bit more precision.

There's a possibility that it's related to fchmodat2, a syscall added in Linux 6.6 -- that wouldn't be used by software in Debian stable, because the kernel there is 6.1 -- and it could affect testing, where a 6.10 kernel is in use.

The golang containers-common library didn't add seccomp support for fchmodat2 until containers/common#1773 was merged (and landed in v0.58), and that hasn't reached Ubuntu yet.. so the podman that runs in GitHub Actions doesn't have the updated rule.

Since the rule fix was to add fchmodat2 to an allow-list, the yet-unknown syscall would be default-denied otherwise, I expect.

Again - not 100% sure about this, but I think that might fit.
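
To make the default-deny reasoning concrete, here is a minimal Python sketch of how an allow-list policy treats a syscall it has never heard of. The `POLICY` dict is a hypothetical, heavily trimmed fragment in the general shape of a containers `seccomp.json` (a `defaultAction` plus `syscalls` rules with `names` and `action`); real policies carry additional fields such as argument filters and capability conditions, which this sketch ignores.

```python
# Hypothetical, trimmed policy: the syscall names listed here are illustrative only.
POLICY = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["fchmod", "fchmodat", "mkdirat"], "action": "SCMP_ACT_ALLOW"},
    ],
}


def action_for(policy, syscall):
    """Return the action of the first rule naming the syscall, else the default."""
    for rule in policy["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"]
    # No rule mentions the syscall, so the policy-wide default applies.
    return policy["defaultAction"]


print(action_for(POLICY, "fchmodat"))
print(action_for(POLICY, "fchmodat2"))
```

Under these assumptions the older `fchmodat` is allowed while `fchmodat2` falls through to the error default, which would fit the behaviour described above.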

@ximion
Owner

ximion commented Aug 19, 2024

That actually sounds like an incredibly reasonable explanation! Not sure how many people this is hitting (clearly not a critical mass of people...), but it could be worth fixing globally.

@jayaddison
Contributor Author

I guess the set intersection of folks using containers-common prior to v0.58 in combination with userland utilities built for Linux 6.6+ may be relatively small?

Fortunately I think the issue is already fixed, so what's left is mostly a case of waiting until updated dependencies land in Ubuntu -- and then we should be able to revert bf6a2a9.

To confirm the theory I was considering generating a JSON seccomp policy from the current (ubuntu-latest podman, pre-0.58 containers-common) default, then adding/removing fchmodat2 to the policy to prove that it's an atomic change that affects build success/failure -- I might still do so, but it could take me a while to get around to it.
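
A rough sketch of that A/B experiment in Python, assuming the policy has the usual `syscalls` list of rules with `names` and `action` keys. The `base` policy below is a made-up stand-in for the real default, and `with_syscall` is a hypothetical helper name; nothing here has been run against the actual CI.

```python
import copy
import json


def with_syscall(policy, name, present):
    """Return a copy of policy where `name` is explicitly allowed or fully absent."""
    result = copy.deepcopy(policy)
    # Remove any existing mention first, so both variants start from the same state.
    for rule in result["syscalls"]:
        rule["names"] = [n for n in rule["names"] if n != name]
    if present:
        result["syscalls"].append({"names": [name], "action": "SCMP_ACT_ALLOW"})
    return result


# Made-up stand-in for the default policy shipped with the runner.
base = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["fchmod", "fchmodat"], "action": "SCMP_ACT_ALLOW"}],
}

allowed = with_syscall(base, "fchmodat2", True)
denied = with_syscall(base, "fchmodat2", False)
print(json.dumps(allowed, indent=2))
```

Each variant could then be written to a file and passed to podman with `--security-opt seccomp=<path>`, so that the only difference between the two builds is the single syscall entry.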

@jayaddison
Contributor Author

> I guess the set intersection of folks using containers-common prior to v0.58 in combination with userland utilities built for Linux 6.6+ may be relatively small?

In hindsight that doesn't seem likely to be true - so perhaps there is something more specific about the way that daps is using tar that causes this. I notice from the description of fchmodat2 that it was introduced to handle situations relating specifically to symlink resolution (but I don't understand the details).

@jayaddison
Contributor Author

I think the bug involves the intersection of all of:

  • A daps HTML documentation build with static content enabled with some symlink content (as appstream has).
  • ...running within a containerized environment with seccomp rules that have not added an allow entry for fchmodat2...
  • ...and with an inter-filesystem write to one of a subset of affected filesystem types (because the fchmodat2 is only required for certain filesystems)
  • ...and on a recent-enough kernel (because otherwise fchmodat2 wouldn't be available; a different syscall would be attempted and would fail with a different error code).

I expected that adding the following command before podman is used in the CI Build step would fix the problem:

sed -i -e 's/^\(.*\)"fchmodat"\(.*\)$/\1"fchmodat"\2\n\1"fchmodat2"\2/g' /usr/share/containers/seccomp.json

...however, based on testing on my fork: it doesn't seem to. I'm now 99% confident that fchmodat2 is indeed the problem -- but I don't understand why the sed command above doesn't resolve it.
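
For anyone checking what the sed expression actually does to the file, here is a Python `re` equivalent applied to a made-up fragment (the `SAMPLE` text is an assumption, not the real `seccomp.json`). It only illustrates the text transformation; it says nothing about why the patched policy did not help.

```python
import json
import re

# Made-up fragment; the real file has many more entries and fields.
SAMPLE = """{
  "names": [
    "fchmod",
    "fchmodat",
    "fchown"
  ]
}"""


def duplicate_fchmodat(text):
    """Mirror the sed expression: copy each "fchmodat" line as a "fchmodat2" line."""
    return re.sub(
        r'^(.*)"fchmodat"(.*)$',
        r'\1"fchmodat"\2\n\1"fchmodat2"\2',
        text,
        flags=re.MULTILINE,
    )


patched = duplicate_fchmodat(SAMPLE)
names = json.loads(patched)["names"]
print(names)
```

On this sample the result is still valid JSON, with `fchmodat2` inserted directly after `fchmodat`. Note that the substitution is not idempotent: applying it a second time adds another `fchmodat2` line.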

@jayaddison
Contributor Author

I attempted to replicate the chmod permission problem using an Ubuntu 24.04 live USB image booted from a laptop here - however, the build succeeded there, meaning that the bug did not reappear. If the theory holds, then I think this implies that the affected filesystem type was not in use in the local environment (all other controllable factors remained unchanged). This seems to be a tricky problem to narrow in on.

(I realize this is getting a bit off-topic and that it probably doesn't really matter what the exact cause is, since we have a workaround -- but I have some time to investigate and would like to confirm what's going on here if feasible)

@jayaddison
Contributor Author

Ok, here's what I find to be fairly strong evidence that fchmodat2 is indeed the cause, as discovered by comparing strace results from the tar cph ... tar xpv subprocess step used by daps during a comparative build:

Debian Stable

2024-08-22T10:54:13.7293186Z strace: Process 1833 attached
[pid  1833] ioctl(2, TIOCGPGRP, 0x7fffb1f2a8c4) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  1833] newfstatat(AT_FDCWD, "/usr/local/sbin/tar", 0x7fffb1f2a3d0, 0) = -1 ENOENT (No such file or directory)
[pid  1833] newfstatat(AT_FDCWD, "/usr/local/bin/tar", 0x7fffb1f2a3d0, 0) = -1 ENOENT (No such file or directory) 
[pid  1833] newfstatat(AT_FDCWD, "/usr/sbin/tar", 0x7fffb1f2a3d0, 0) = -1 ENOENT (No such file or directory)
[pid  1833] access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) 
[pid  1833] statfs("/sys/fs/selinux", 0x7ffdb31dfc30) = -1 ENOENT (No such file or directory) 
[pid  1833] statfs("/selinux", 0x7ffdb31dfc30) = -1 ENOENT (No such file or directory) 
[pid  1833] access("/etc/selinux/config", F_OK) = -1 ENOENT (No such file or directory)
[pid  1833] openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  1833] openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  1833] openat(AT_FDCWD, "/usr/lib/locale/C.UTF-8/LC_CTYPE", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  1833] ioctl(0, TCGETS, 0x7ffdb31df990) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  1833] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[pid  1833] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[pid  1833] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[pid  1833] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[pid  1833] ioctl(1, TCGETS, 0x7ffdb31df730) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  1833] mkdirat(AT_FDCWD, "static", 0700) = -1 EEXIST (File exists)
[pid  1833] +++ exited with 0 +++

Debian Testing

2024-08-22T10:53:56.8848043Z strace: Process 1609 attached
[pid  1609] ioctl(2, TIOCGPGRP, 0x7ffe65ad4c24) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  1609] newfstatat(AT_FDCWD, "/usr/local/sbin/tar", 0x7ffe65ad4730, 0) = -1 ENOENT (No such file or directory)
[pid  1609] newfstatat(AT_FDCWD, "/usr/local/bin/tar", 0x7ffe65ad4730, 0) = -1 ENOENT (No such file or directory) 
[pid  1609] newfstatat(AT_FDCWD, "/usr/sbin/tar", 0x7ffe65ad4730, 0) = -1 ENOENT (No such file or directory)
[pid  1609] access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) 
[pid  1609] statfs("/sys/fs/selinux", 0x7fff194f9a70) = -1 ENOENT (No such file or directory) 
[pid  1609] statfs("/selinux", 0x7fff194f9a70) = -1 ENOENT (No such file or directory) 
[pid  1609] access("/etc/selinux/config", F_OK) = -1 ENOENT (No such file or directory)
[pid  1609] openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  1609] openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  1609] openat(AT_FDCWD, "/usr/lib/locale/C.UTF-8/LC_CTYPE", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  1609] ioctl(0, TCGETS, 0x7fff194f97c0) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  1609] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[pid  1609] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[pid  1609] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[pid  1609] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[pid  1609] ioctl(1, TCGETS, 0x7fff194f9560) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  1609] mkdirat(AT_FDCWD, "static", 0700) = -1 EEXIST (File exists)
[pid  1609] fchmodat2(AT_FDCWD, "static/js", 0755, AT_SYMLINK_NOFOLLOW) = -1 EPERM (Operation not permitted)
[pid  1609] fchmodat2(AT_FDCWD, "static/images", 0755, AT_SYMLINK_NOFOLLOW) = -1 EPERM (Operation not permitted)
[pid  1609] fchmodat2(AT_FDCWD, "static/css", 0755, AT_SYMLINK_NOFOLLOW) = -1 EPERM (Operation not permitted)
[pid  1609] +++ exited with 2 +++

(log format narrowed to remove per-line timestamps and unrelated process ID (pid) activity)

This does not comprehensively answer why adding fchmodat2 to a job-local seccomp.json policy file and requesting use of that does not resolve the problem. But it does indicate fairly strongly (not absolutely -- there could still be another explanation within the tar code, but I think not here) that that is the relevant syscall.
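
The comparison above was done by eye; a small Python sketch of the same idea follows, assuming log lines in the narrowed `[pid N] syscall(...) = -1 ERRNO (...)` form shown. The regular expression and the `failed_calls` helper are illustrative, and the two input lists are short excerpts of the logs above rather than the full output.

```python
import re

# Matches e.g.: [pid  1609] fchmodat2(AT_FDCWD, "static/js", ...) = -1 EPERM (...)
LINE_RE = re.compile(r"^\[pid\s+(\d+)\]\s+(\w+)\(.*\)\s+=\s+(-?\d+)(?:\s+([A-Z]+))?")


def failed_calls(lines):
    """Return the set of (syscall, errno_name) pairs for calls that returned -1."""
    found = set()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(3) == "-1":
            found.add((match.group(2), match.group(4)))
    return found


stable = [
    '[pid  1833] mkdirat(AT_FDCWD, "static", 0700) = -1 EEXIST (File exists)',
    "[pid  1833] +++ exited with 0 +++",
]
testing = [
    '[pid  1609] mkdirat(AT_FDCWD, "static", 0700) = -1 EEXIST (File exists)',
    '[pid  1609] fchmodat2(AT_FDCWD, "static/js", 0755, AT_SYMLINK_NOFOLLOW) = -1 EPERM (Operation not permitted)',
    "[pid  1609] +++ exited with 2 +++",
]

only_in_testing = failed_calls(testing) - failed_calls(stable)
print(only_in_testing)
```

Taking the set difference filters out the failures common to both runs (the ENOENT and ENOTTY noise in the full logs), leaving only the calls that fail on Debian Testing.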

@jayaddison
Contributor Author

Ah, I think I understand: in order for containers-common to accept an additional syscall entry for fchmodat2, libseccomp2 has to be updated to a version (specifically, v2.5.5 or greater) that is aware of that kernel syscall. This is described in some detail here: https://bugs.launchpad.net/ubuntu/+source/tar/+bug/2059734/

The output of another build indicates that the GitHub Actions continuous integration ubuntu-latest image contains version 2.5.3-2ubuntu2 of libseccomp2 at build-time. This would prevent resolution of the syscall.

Perhaps it may be possible to update libseccomp2 in the GitHub Actions workflow runner -- if so, I think that may allow the Debian Testing job to succeed without running with seccomp in unconfined mode.
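
A minimal sketch of the version check this implies, in Python. The 2.5.5 threshold comes from the comment above; `upstream_version` and `knows_fchmodat2` are hypothetical helper names, and the parsing only handles the leading dotted numeric part of a Debian-style version string. A distribution may backport syscall table updates into an older version number, so this is a first approximation at best.

```python
import re

# Threshold taken from the discussion above (libseccomp2 v2.5.5 or greater).
MIN_LIBSECCOMP = (2, 5, 5)


def upstream_version(package_version):
    """Extract the leading dotted numeric part, skipping an optional epoch."""
    match = re.match(r"(?:\d+:)?(\d+(?:\.\d+)*)", package_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {package_version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def knows_fchmodat2(package_version):
    """Guess whether this libseccomp2 build can resolve the fchmodat2 name."""
    return upstream_version(package_version) >= MIN_LIBSECCOMP


print(knows_fchmodat2("2.5.3-2ubuntu2"))
print(knows_fchmodat2("2.5.5-1"))
```

In a workflow step, the installed version string could presumably be obtained with something like `dpkg-query -W -f='${Version}' libseccomp2` and fed into this check before deciding whether to fall back to unconfined mode.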

@jayaddison
Contributor Author

> I attempted to replicate the chmod permission problem using an Ubuntu 24.04 live USB image booted from a laptop here - however, the build succeeded there, meaning that the bug did not reappear. If the theory holds, then I think this implies that the affected filesystem type was not in use in the local environment (all other controllable factors remained unchanged). This seems to be a tricky problem to narrow in on.

Note: this was the incorrect version of Ubuntu to attempt to replicate the problem on; the release version number of ubuntu-latest in GitHub Actions at the time-of-writing is 22.04 -- that may explain why the issue did not appear.
