Reuse the same rootfs for the inner and outer container #1048

apyrgio · 2025-01-09T18:09:16Z

When building the Dangerzone container image, we use multi-stage builds, where we create the inner container first, and then copy it to the outer container, under /home/dangerzone/dangerzone-image/rootfs:

dangerzone/Dockerfile

Line 94 in 1298e9c

COPY --from=dangerzone-image / /home/dangerzone/dangerzone-image/rootfs

This was the most clean way to use nested containers, and was implemented during #590. Now that we plan to use Debian Stable for the base image of Dangerzone (see #1046), it would be nice to merge these two container images into one, to avoid the double overhead of the Debian container image.

I have experimented with a way we can achieve this, and the following seem to work:

Remove the /home/dangerzone/dangerzone-image/ dir, and write instead the OCI config to /config.json, i.e., at the root of the container image.
Change the OCI config to use the container root (/) as the image bundle, instead of /home/dangerzone/dangerzone-image/rootfs.

These changes are reflected in the following diff:

--- a/dangerzone/gvisor_wrapper/entrypoint.py
+++ b/dangerzone/gvisor_wrapper/entrypoint.py
@@ -56,7 +56,7 @@ oci_config: dict[str, typing.Any] = {
             {"type": "RLIMIT_NOFILE", "hard": 4096, "soft": 4096},
         ],
     },
-    "root": {"path": "rootfs", "readonly": True},
+    "root": {"path": "/", "readonly": True},
     "hostname": "dangerzone",
     "mounts": [
         {
@@ -133,7 +142,7 @@ if os.environ.get("RUNSC_DEBUG"):
     json.dump(oci_config, sys.stderr, indent=2, sort_keys=True)
     # json.dump doesn't print a trailing newline, so print one here:
     log("")
-with open("/home/dangerzone/dangerzone-image/config.json", "w") as oci_config_out:
+with open("/config.json", "w") as oci_config_out:
     json.dump(oci_config, oci_config_out, indent=2, sort_keys=True)
 
 # Run gVisor.
@@ -150,7 +159,7 @@ if os.environ.get("RUNSC_DEBUG"):
     runsc_argv += ["--debug=true", "--alsologtostderr=true"]
 if os.environ.get("RUNSC_FLAGS"):
     runsc_argv += [x for x in shlex.split(os.environ.get("RUNSC_FLAGS", "")) if x]
-runsc_argv += ["run", "--bundle=/home/dangerzone/dangerzone-image", "dangerzone"]
+runsc_argv += ["run", "--bundle=/", "dangerzone"]
 log(
     "Running gVisor with command line: {}", " ".join(shlex.quote(s) for s in runsc_argv)
 )

My question is, do we risk something security-wise by doing this? Note that we instruct GVisor to treat the root filesystem as read-only (not that GVisor would write to these files in any case). Also, we mask sensitive system dirs (/proc, /tmp, /dev, sys, /home/dangerzone). There are some extra files that are mounted to the outer container on /etc/ and /run that we can mask, although they don't seem to give any sensitive info:

$ mount
[...]
tmpfs on /run/.containerenv type tmpfs (rw,nosuid,nodev,relatime,size=3258360k,nr_inodes=814590,mode=700,uid=1000,gid=958,inode64)
tmpfs on /etc/hostname type tmpfs (rw,nosuid,nodev,relatime,size=3258360k,nr_inodes=814590,mode=700,uid=1000,gid=958,inode64)
tmpfs on /etc/hosts type tmpfs (rw,nosuid,nodev,relatime,size=3258360k,nr_inodes=814590,mode=700,uid=1000,gid=958,inode64)
[...]
$ cat /etc/hostname
cec0ac615cbe
$ cat /etc/hosts
127.0.0.1       localhost
::1     localhost
192.168.1.23    host.containers.internal host.docker.internal
127.0.0.1       cec0ac615cbe dangerzone-doc-to-pixels-KwY1lA
$ cat /run/.containerenv

Overall, I'm inclined to go forward with this, but I'd like a second set of 👀 to validate my assumptions.

CCing @EtiennePerot for obvious reasons.

The text was updated successfully, but these errors were encountered:

Remove the need to copy the Dangerzone container image (used by the inner container) within a wrapper gVisor image (used by the outer container). Instead, use the root of the container filesystem for both containers. We can do this safely because we don't mount any secrets to the container, and because gVisor offers a read-only view of the underlying filesystem Fixes #1048

EtiennePerot · 2025-01-17T20:27:42Z

There's no reason why this wouldn't work in general. The risks are those that you highlighted: the sandbox exposes more files than it previously did, but if you trust gVisor to do its job then that should not be an issue.

There is one additional important risk from exposure of the Unix Domain Socket that is used to control a running sandbox. gVisor stores per-sandbox UDSs inside the OCI root directory (passed to as runsc --root=/some/path, defaulting to $XDG_RUNTIME_DIR/runsc or fallback to /var/run/runsc if $XDG_RUNTIME_DIR is unset in the environment). gVisor sandboxes run a server on this UDS in order to respond to requests from the OCI runtime (this is how commands like podman exec work). So write access to this UDS allows an equivalent amount of control over that sandbox as the host would have. This means the ability to start new containers inside the same sandbox, stop/checkpoint the sandbox, etc. The most harmful things I can think of using that interface are either a host DoS, or the ability to write to arbitrary files in the out-of-sandbox filesystem (by triggering a checkpoint request), though the contents being written would be only semi-attacker-controller. In any case, it's in general not meant to be a security-hardened interface, and a sandbox certainly has no business having access to its own control mechanism.

To prevent this, gVisor has a --host-uds flag which defaults to none, which makes gVisor refuse to forward connection attempts from the sandbox to host UDSs. Given that Dangerzone doesn't set this flag, it currently has the --host-uds=none behavior, so it should already be safe. But I would definitely recommend explicitly setting this flag because now it would become load-bearing. To add another layer of access prevention to this UDS, I would also recommend masking the OCI root directory with a tmpfs mount, and explicitly setting runsc's --root flag for the same reason (it would now become load-bearing, and the default value is environment-variable-dependent).

In my opinion, sticking with the two-stage structure would be safer just on the basis of making fewer security assumptions. It may be possible to avoid the double-overhead of the Debian image by sharing the large-but-static subtrees of it. Deleting /usr/{bin,lib,sbin,share} and creating symlinks from /usr/{bin,lib,sbin,share} -> /dangerzone-image/usr/{bin,lib,sbin,share} might be all it takes (this covers 121 MiB out of the 127 MiB of the debian image on my machine), and it should work if the two images are of the same Debian base image. The runsc binary (currently at /usr/bin/runsc) can be moved to somewhere else such as at /runsc. It is a static binary so it has no .so dependencies.

apyrgio · 2025-01-17T21:44:09Z

Truth is, I originally tried the symlink approach, but I did it the other way around, i.e., /dangerzone-image/usr/{bin,lib,sbin,share} -> /usr/{bin,lib,sbin,share} and I encountered some errors (don't have them handy I'm afraid). I may give it a shot again and report here.

Also, good to know about the UDS and other potentially sensitive files that gVisor may create. I had a concern like that in mind, and turns out we already use runsc ... --root /home/dangerzone/.containers, and /home/dangerzone is masked, so hopefully all gVisor-related files will be masked:

dangerzone/dangerzone/gvisor_wrapper/entrypoint.py

Lines 140 to 144 in 7f41811

    
           runsc_argv = [ 
        
               "/usr/bin/runsc", 
        
               "--rootless=true", 
        
               "--network=none", 
        
               "--root=/home/dangerzone/.containers",

dangerzone/dangerzone/gvisor_wrapper/entrypoint.py

Lines 87 to 92 in 7f41811

    
           { 
        
               "destination": "/home/dangerzone", 
        
               "type": "tmpfs", 
        
               "source": "tmpfs", 
        
               "options": ["nosuid", "noexec", "nodev"], 
        
           },

What bothers me is that if an assumption of ours change, e.g., Docker decides to mount a sensitive file somewhere, then we may not detect it. I'll think more about this and report back.

EtiennePerot · 2025-01-18T02:16:01Z

Truth is, I originally tried the symlink approach, but I did it the other way around, i.e., /dangerzone-image/usr/{bin,lib,sbin,share} -> /usr/{bin,lib,sbin,share} and I encountered some errors (don't have them handy I'm afraid). I may give it a shot again and report here.

Right, it won't work in that direction because they would appear as symlinks that lead nowhere when trying to resolve them with the view of the filesystem that the inner container has. gVisor doesn't resolve these symlinks on the host when the sandbox attempts to read them (nor could it anyway, because it isolates its own view of the out-of-sandbox filesystem such that it can no longer access files outside of the rootfs directory and explicitly-configured mountpoints).

we already use runsc ... --root /home/dangerzone/.containers, and /home/dangerzone is masked, so hopefully all gVisor-related files will be masked

Should work.

Remove the need to copy the Dangerzone container image (used by the inner container) within a wrapper gVisor image (used by the outer container). Instead, use the root of the container filesystem for both containers. We can do this safely because we don't mount any secrets to the container, and because gVisor offers a read-only view of the underlying filesystem Fixes #1048

Mask some paths of the outer container in the OCI config of the inner container. This is done to avoid leaking any sensitive information from Podman / Docker / gVisor, since we reuse the same rootfs Refs #1048

apyrgio · 2025-01-21T22:39:58Z

Etienne's assumption that only /usr is required for Dangerzone to run inside gVisor is pretty much spot on. Based on my tests, I'd add /etc/ to that, for LibreOffice settings and font config.

Since it's just two dirs, and we can mask all the rest, well, I did just that. I went ahead and created a mount entry in our OCI config file for every root dir that is listed in the Filesystem Hierarchy Standard (FHS), except for:

/usr
/lib (due to /usr/lib -> /lib),
/bin (due to /usr/bin -> /bin), and
/etc

See 005d742 (unmerged commit, hash may chnage) for details.

I think I prefer that approach over the symlinks one for simplicity reasons. I'm aware it relies on gVisor properly implementing masking, and respecting container boundaries, but that's something we just have to rely on at the end of the day (and a separate container probably wouldn't help there).

apyrgio added the enhancement New feature or request label Jan 9, 2025

github-project-automation bot added this to Dangerzone ✨ Jan 9, 2025

github-project-automation bot moved this to Todo in Dangerzone ✨ Jan 9, 2025

apyrgio self-assigned this Jan 9, 2025

apyrgio added the container label Jan 9, 2025

apyrgio added this to the 0.9.0 milestone Jan 9, 2025

apyrgio moved this from Todo to In Progress in Dangerzone ✨ Jan 9, 2025

apyrgio linked a pull request Jan 14, 2025 that will close this issue

Make our Dangerzone image reproducible #1049

Open

apyrgio moved this from In Progress to PR Review in Dangerzone ✨ Jan 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reuse the same rootfs for the inner and outer container #1048

Reuse the same rootfs for the inner and outer container #1048

apyrgio commented Jan 9, 2025

EtiennePerot commented Jan 17, 2025 •

edited

Loading

apyrgio commented Jan 17, 2025

EtiennePerot commented Jan 18, 2025 •

edited

Loading

apyrgio commented Jan 21, 2025

Reuse the same rootfs for the inner and outer container #1048

Reuse the same rootfs for the inner and outer container #1048

Comments

apyrgio commented Jan 9, 2025

EtiennePerot commented Jan 17, 2025 • edited Loading

apyrgio commented Jan 17, 2025

EtiennePerot commented Jan 18, 2025 • edited Loading

apyrgio commented Jan 21, 2025

EtiennePerot commented Jan 17, 2025 •

edited

Loading

EtiennePerot commented Jan 18, 2025 •

edited

Loading