Merge various fixes for C-LCOW since conf-aci/0.2.5 #2559

Open: micromaomao wants to merge 74 commits into microsoft:main from micromaomao:tingmao_github/merge-msrc-to-main
Conversation
Move inject and load fragment into the securitypolicy pkg

Signed-off-by: Mahati Chamarthy <[email protected]>
A helper to gate changes behind confidential containers only.

Signed-off-by: Tingmao Wang <[email protected]>
It seems to me that for 9p mounts from the host into a UVM, we are only supposed
to mount to exactly

^/run/gcs/c/<containerID>/mounts/m[0-9]+$

i.e. /run/gcs/c/.../m<number> is an exact target, not just a prefix. The only
place which uses this prefix is allocateLinuxResources, and it doesn't try to
mount to anything under the m<number> directory. Hence we should make the regex
match there a full-string match too.

Combined with the fact that 9p mountpoints are already checked for duplicates,
this prevents any "mounting on top of symlinks" tricks from the host.
Another example: for the mount source, the policy usually has:

"mounts": [
    {
        "destination": "/etc/resolv.conf",
        "options": [
            "rbind",
            "rshared",
            "rw"
        ],
        "source": "sandbox:///tmp/atlas/resolvconf/.+",
        "type": "bind"
    }
],
and is intended to enforce that when starting the container, the source for the
/etc/resolv.conf mount must come from /tmp/atlas/resolvconf/ within the
sandboxMounts. Similar policies will be generated for file mounts:
{
    "destination": "/mnt/volume",
    "options": [
        "rbind",
        "rshared",
        "rw"
    ],
    "source": "sandbox:///tmp/atlas/azureFileVolume/.+",
    "type": "bind"
},
This commit changes these cases so that we use the anchored pattern,
effectively enforcing a full match (see the sketch after this commit message).
Fixes: https://msazure.visualstudio.com/One/_workitems/edit/33064760
Signed-off-by: Tingmao Wang <[email protected]>
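
A minimal standalone sketch of the anchored-vs-prefix distinction described above (illustrative Go, not the actual hcsshim or Rego code; the container ID here is a dummy value):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	id := strings.Repeat("a", 64) // dummy 64-hex-char container ID for the example

	// Unanchored pattern: MatchString succeeds if the pattern matches
	// anywhere in the input, so paths nested under m0 (or merely containing
	// the expected path as a substring) are accepted.
	prefix := regexp.MustCompile(`/run/gcs/c/[0-9a-f]{64}/mounts/m[0-9]+`)

	// Anchored pattern: the entire input must match.
	full := regexp.MustCompile(`^/run/gcs/c/[0-9a-f]{64}/mounts/m[0-9]+$`)

	evil := "/run/gcs/c/" + id + "/mounts/m0/nested/target"
	fmt.Println(prefix.MatchString(evil)) // true: the prefix check is fooled
	fmt.Println(full.MatchString(evil))   // false: the full match rejects it
}
```
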
Since these IDs are used to construct various paths (mount dir, path for resolv.conf, scratch path (containerScratchPathInUVM in lcow.go)), there is potential for a path traversal attack here. We check that the ID can't be something weird for confidential containers (while still allowing anything the host passes if we're not in confidential mode, so as not to accidentally break other dependencies).

Aside from CreateContainer, we also check for this in modify*Settings in case the host crafts a request with a malformed ID later on. Since the functional tests use names like TestContainerExecLCOW-1df570f2-container, we can't enforce that this must strictly be a hex string or a UUID.

Signed-off-by: Tingmao Wang <[email protected]>
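
A sketch of the kind of check described (the actual validation in the commit may differ; the character set below is an assumption, chosen loose enough to admit test names like TestContainerExecLCOW-1df570f2-container):

```go
package main

import (
	"fmt"
	"regexp"
)

// validContainerID is deliberately looser than hex/UUID so that
// functional-test names still pass, but it excludes '/', '\', '.' and
// whitespace, which is what blocks path traversal via the ID.
var validContainerID = regexp.MustCompile(`^[a-zA-Z0-9_-]+$`)

func checkContainerID(id string) error {
	if !validContainerID.MatchString(id) {
		return fmt.Errorf("invalid container ID %q", id)
	}
	return nil
}

func main() {
	fmt.Println(checkContainerID("TestContainerExecLCOW-1df570f2-container")) // <nil>
	fmt.Println(checkContainerID("../../etc"))                               // error
}
```
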
The main purpose of this is to prevent mounting host-controlled, non-encrypted
filesystems. Combined with the ability to mount to anywhere, this results in
code execution from the host into the guest. Unencrypted and
non-integrity-checked disks on their own also run the risk of hitting kernel
filesystem bugs.

While this commit does not yet prevent mounting to arbitrary paths, it makes
exploiting this much more difficult, as all the host can now do is mount an
empty directory on top of things.
Using the reproducer in the bug report, we get the following deny message:
{"decision":"deny","input":{"encrypted":false,"ensureFilesystem":false,"filesystem":"","readonly":false,"rule":"mount_device","target":"/bin/"},"reason":{"errors":["ensureFilesystem must be set on rw device mounts","rw device mounts uses a filesystem that is not allowed","unencrypted scratch not allowed, device to be mounted must be encrypted"]}}
Fixes: https://msazure.visualstudio.com/One/_workitems/edit/33144273
Signed-off-by: Tingmao Wang <[email protected]>
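
The deny reasons in the log above imply three conditions on read-write device mounts. A hypothetical Go rendering of that logic (the real enforcement lives in the Rego framework, not in code like this; the filesystem allowlist here is an assumption):

```go
package main

import (
	"errors"
	"fmt"
)

type rwMountRequest struct {
	Encrypted        bool
	EnsureFilesystem bool
	Filesystem       string
}

// allowedRWFilesystems is an assumption for this sketch; the actual
// allowlist is defined by the policy.
var allowedRWFilesystems = map[string]bool{"ext4": true, "xfs": true}

// checkRWDeviceMount collects all violations, mirroring how the policy
// reports multiple deny reasons at once.
func checkRWDeviceMount(r rwMountRequest) error {
	var errs []error
	if !r.EnsureFilesystem {
		errs = append(errs, errors.New("ensureFilesystem must be set on rw device mounts"))
	}
	if !allowedRWFilesystems[r.Filesystem] {
		errs = append(errs, errors.New("rw device mounts uses a filesystem that is not allowed"))
	}
	if !r.Encrypted {
		errs = append(errs, errors.New("unencrypted scratch not allowed, device to be mounted must be encrypted"))
	}
	return errors.Join(errs...)
}

func main() {
	// Reproduces all three deny reasons from the log line above.
	fmt.Println(checkRWDeviceMount(rwMountRequest{}))
}
```
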
If the host can control the rootfs path, then by using otherwise legitimate 9p mounts, it can take over the container. Similar exploits might be possible by controlling ScratchDirPath or OCIBundlePath in this request as well, so we enforce that those paths are as expected too.

Since a previous commit already ensures that 9pfs can only be mounted to /run/gcs/c/.../mounts, it is no longer possible to use 9p mounts to exploit this, nor should it be possible to use e.g. something in sandbox://. A later commit should also ensure that disks cannot be mounted to arbitrary paths.

Error message example:

time="2025-06-09T15:34:38Z" level=fatal msg="starting the container \"fdb12ddbbfdeb1ee990abe892a06ac80304e5298c48d7eae2845c201e72efd5b\": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: failed to create container fdb12ddbbfdeb1ee990abe892a06ac80304e5298c48d7eae2845c201e72efd5b: guest RPC failure: OCISpecification.Root.Path \"/run/gcs/c/f3c2e64041edd6aa1c7f20c15c2bed0d2afffabb3b22aceac391d8f43d9fc567/mounts/m0\" must equal expected \"/run/gcs/c/fdb12ddbbfdeb1ee990abe892a06ac80304e5298c48d7eae2845c201e72efd5b/rootfs\": unknown"

Fixes: https://msazure.visualstudio.com/One/_workitems/edit/33205622

Signed-off-by: Tingmao Wang <[email protected]>
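
A sketch of the equality check the error message implies, with the expected path derived from the (already validated) container ID rather than trusted from the request (function names hypothetical):

```go
package main

import "fmt"

// expectedRootfsPath derives the only acceptable rootfs location from the
// container ID, so the host cannot point Root.Path at a 9p mount or any
// other host-influenced location.
func expectedRootfsPath(containerID string) string {
	return fmt.Sprintf("/run/gcs/c/%s/rootfs", containerID)
}

func checkRootPath(containerID, rootPath string) error {
	if expected := expectedRootfsPath(containerID); rootPath != expected {
		return fmt.Errorf("OCISpecification.Root.Path %q must equal expected %q", rootPath, expected)
	}
	return nil
}

func main() {
	fmt.Println(checkRootPath("abc123", "/run/gcs/c/abc123/rootfs"))   // <nil>
	fmt.Println(checkRootPath("abc123", "/run/gcs/c/other/mounts/m0")) // error
}
```
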
Need to fix tests

Signed-off-by: Tingmao Wang <[email protected]>
This replaces the currently unused* LCOWGlobalMountPrefixFmt and WCOWGlobalScsiMountPrefixFmt, and allows these format strings to be reused in a later commit for policy enforcement.

*: I searched in hcsshim and azcri with no results.

Signed-off-by: Tingmao Wang <[email protected]>
This commit makes sure that we only accept mount requests with mountpoints that
we expect:

- Read-only scsi disks: These are container layers, and can only be mounted
under /run/mounts/scsi/m[0-9]+
- Read-write scsi disks: These are scratch disks, and should appear only at
/run/gcs/c/<container-id>, where <container-id> might also be the sandbox ID if
shared scratch is used.
- Overlay mounts (LCOWCombinedLayers): They should be at exactly
/run/gcs/c/<container-id>/rootfs, and we make sure that the container ID matches
the one passed in the request, as this container ID is passed to rego.

We check the overlay mountpoints in Go code, as we're already checking rootfs in
Go unconditionally, regardless of the policy. We also check that the scratch dir
passed in is correct.
Error message examples:
{"decision":"deny","input":{"deviceHash":"16b514057a06ad665f92c02863aca074fd5976c755d26bff16365299169e8415","mountPathRegex":"/run/mounts/scsi/m[0-9]+","readonly":true,"rule":"mount_device","target":"/tmp/scsi_m0"},"reason":{"errors":["mountpoint invalid"]}}
{"decision":"deny","input":{"encrypted":true,"ensureFilesystem":true,"filesystem":"xfs","mountPathRegex":"/run/gcs/c/[0-9a-fA-F]{64}","readonly":false,"rule":"mount_device","target":"/tmp/weird_root/8609646aeeafa903d6f15bb2d220e25c71d3b0596c6cfa0645230482975e4fe5"},"reason":{"errors":["mountpoint invalid"]}}
time="2025-06-10T17:36:55Z" level=fatal msg="run pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: failed to mount container storage: guest modify: guest RPC failure: scratch path \"/tmp/weird_scratch/76250e7bb19ce9b0b3476451efe67ac4a3bd4ffe8dd51e639a203ec8fc813599\" must match regex \"^/run/gcs/c/[0-9a-fA-F]{64}/scratch/76250e7bb19ce9b0b3476451efe67ac4a3bd4ffe8dd51e639a203ec8fc813599$\": unknown"
time="2025-06-10T17:39:02Z" level=fatal msg="run pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: failed to mount container storage: guest modify: guest RPC failure: combined layers target \"/\" does not match expected path \"/run/gcs/c/808fe5a536a33faa6c8a66a8af7b952ffd208bc7d2e96b85fb8ae353a575bc48/rootfs\": unknown"
Signed-off-by: Tingmao Wang <[email protected]>
This matters the most for device_mount. Some of this is not strictly necessary (such as using the correct overlay target), as we do the rootfs checks in Go code rather than rego, but we still try to make the test correct here.

Signed-off-by: Tingmao Wang <[email protected]>
- Added test for path traversal fix.
- Updated some error checking to require specific messages.
- Updated assertDecisionJSONContains to print out the actual error message if there is no match.

Signed-off-by: Tingmao Wang <[email protected]>
Check that invalid targets get denied, and that for rw mounts, ensureFilesystem
and encrypted must be set correctly. Make sure we can both mount and unmount the
layers and the scratch disk, in any order. Also check that unmount is denied
for targets that have not been mounted, for both ro and rw mounts.
Also rename Test_Rego_EnforceDeviceUmountPolicy_Removes_Device_Entries to
Test_Rego_EnforceDeviceUnmountPolicy_Removes_Device_Entries for consistency
("unmount" instead of "umount").
Signed-off-by: Tingmao Wang <[email protected]>
Signed-off-by: Tingmao Wang <[email protected]>
This test is already broken, due to VPMem multimapping not being disabled, and also due to missing required environment variables. In addition to fixing that, this commit also makes it work with the latest changes, by using a proper containerID, and by disabling VPMem altogether (as it would fail the mount target check).

Signed-off-by: Tingmao Wang <[email protected]>
…dential

At least in confidential mode, read-only disks are only supposed to be ext4, and in fact if a disk isn't, we will fail to read the verity info and error out earlier. However, currently the host can specify a filesystem even when we're mounting a dm-verity protected volume. This presents a risk for exploits (but is not currently exploitable): for example, the host could specify virtiofs, prepare the correct vhost socket with tags like /dev/mapper/dm-verity-..., and be able to override a container layer.

This is not currently exploitable thanks to the fact that we try to mount with the noload option first, and mounting does not continue if that fails. This option will prevent mounting virtiofs, 9p, overlay, etc. Note that the host can also specify its own options, but those options are only applied if we get through the first "mount with noload" stage.

This commit prevents the host from specifying anything other than ext4, thus eliminating this risk.

Signed-off-by: Tingmao Wang <[email protected]>
Suggested by Ken in PR review

Signed-off-by: Tingmao Wang <[email protected]>
Signed-off-by: Tingmao Wang <[email protected]>
Since that function creates the overlay mount, it is reasonable for it to also try to mount the read-write scratch disk first, for realism.

Signed-off-by: Tingmao Wang <[email protected]>
Mahati suggested below adding a new enforcement point, rw_mount_device, instead of adding an input.readonly to the existing mount_device. This commit does that; it also bumps the API version and sets introducedVersion for the new rule, so that old policies fall back to the "default" for this enforcement point, which in this commit is defined to allow.

This also has the benefit that handwritten policies that do slightly different things in mount_device will not break. Since previously written policies would not have handled the scratch mount in mount_device, using a separate rule for read-write mounts is less likely to break those.

However, doing it this way means that existing policies do not get this rw mountpoint protection. To remedy that, the next commit will add a "use_framework" mechanism to pass through new enforcement points to the framework by default, and use it for rw_mount_device instead of defaulting to allow.

Signed-off-by: Tingmao Wang <[email protected]>
…ment points

This allows us to introduce the new rw_mount_device enforcement point while also letting it work even if we have an old policy.

Signed-off-by: Tingmao Wang <[email protected]>
Signed-off-by: Tingmao Wang <[email protected]>
We do this since the read-only unmount and read-write unmount may also use different logic (even though they don't do so right now), and if a customer has overridden unmount_device, using it also for the scratch disk would not be backwards compatible.

Signed-off-by: Tingmao Wang <[email protected]>
This allows us to reuse the HasSecurityPolicy method, and to reuse more things in Host to do checks in later commits.

Signed-off-by: Tingmao Wang <[email protected]>
…up / mount error
Previously, if the dm-crypt setup failed, the mountpoint was not deleted, due to
(accidental?) shadowing of the `err` variable. Also, on mount failures, the
dm-crypt device was not cleaned up. This leaks the dm-crypt device (and prevents
another mount of a scratch disk with the same scsi controller and lun from
succeeding), and also leaves the device in an in-use state, which is undesirable
given that the host will immediately try to detach it.

While those failures should be rare (and in practice, since confidential
containers use shared scratch, azcri will normally never try another one if the
scratch mount for the pause container fails, as that would be a UVM startup
failure), the repro case for 6b5b89346 ("hcsv2/uvm: Rollback policy state on
mount errors") has shown that leaving things lying around can make exploits
easier (e.g. in that particular case, if gcs did not clean up the mountpoint, we
would have been able to mount the overlay and start a container with missing
layers).
This also makes the policy rollback we're about to do on mount failures more
correct.
Signed-off-by: Tingmao Wang <[email protected]>
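
The shadowing bug described is a classic Go pitfall: a deferred cleanup inspects a named return value, but an inner `:=` creates a new `err`, so the defer never sees the failure. A distilled illustration (not the actual gcs code; function names are placeholders):

```go
package main

import (
	"errors"
	"fmt"
)

func mountScratch() (err error) {
	fmt.Println("mkdir mountpoint")
	defer func() {
		if err != nil {
			fmt.Println("cleanup: rmdir mountpoint") // runs only if the failure is visible via err
		}
	}()

	// BUG variant: writing `if err := setupDMCrypt(); ...` would shadow the
	// named return, so the deferred cleanup would see err == nil and the
	// mountpoint (and dm-crypt device) would leak.
	if err = setupDMCrypt(); err != nil { // assign, don't shadow
		return fmt.Errorf("dm-crypt setup: %w", err)
	}
	defer func() {
		if err != nil {
			fmt.Println("cleanup: close dm-crypt device") // also undo dm-crypt on mount failure
		}
	}()
	if err = mountDevice(); err != nil {
		return fmt.Errorf("mount: %w", err)
	}
	return nil
}

func setupDMCrypt() error { return nil }
func mountDevice() error  { return errors.New("input/output error") }

func main() { fmt.Println(mountScratch()) }
```
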
This commit uses the newly added revertable section mechanism to roll back the policy metadata when mounting a disk/9p/overlay fails, thus preventing inconsistent state between Rego and GCS/Linux.

Best-effort testing instruction: We can construct a VHD that will reliably trigger a mount error. It needs to have a valid ext4 superblock and also have the original dm-verity hash, to get past the verityinfo check and mount policy enforcement, but we can corrupt the actual content of the filesystem such that it won't mount (due to dm-verity failure => can't read). First figure out where the layer.vhd for your image is (by looking at the runhcs logs while starting the container), copy it off, then do:

dd if=/dev/zero of=layer.vhd conv=notrunc bs=4096 count=1 seek=1

And replace the vhd. Then we need to patch the host-side shim to ignore mount errors and proceed to try to mount the overlay and start the container:

```diff
diff --git a/internal/uvm/scsi/mount.go b/internal/uvm/scsi/mount.go
index 63ac6a5..dd5f2758 100644
--- a/internal/uvm/scsi/mount.go
+++ b/internal/uvm/scsi/mount.go
@@ -7,6 +7,7 @@ import (
 	"fmt"
 	"reflect"
 	"sort"
+	"strings"
 	"sync"
 )
@@ -81,6 +82,10 @@ func (mm *mountManager) mount(ctx context.Context, controller, lun uint, path st
 	}()
 	if err := mm.mounter.mount(ctx, controller, lun, mount.path, c); err != nil {
+		if strings.Contains(err.Error(), "input/output error") {
+			// hack
+			return mount.path, nil
+		}
 		return "", fmt.Errorf("mount scsi controller %d lun %d at %s: %w", controller, lun, mount.path, err)
 	}
 	return mount.path, nil
```

Then try to start the container:

PS C:\lcow_attest> .\runp.ps1
PS C:\lcow_attest> crictl start (crictl create --no-pull (crictl pods --name lcow_attest_attestation -q) ./container_attestation_attestation.json ./pod_attestation.json)

Before:

... guest RPC failure: failed to mount overlayfs at /run/gcs/c/4754a20651ed647ae38bbd8f242d1e9f6510d171e0ca5824818b01ebed37bfd1/rootfs: no such file or directory: unknown"

(The overlay mount still fails, but not because of policy rejection; rather, it's because GCS happens to do a rmdir on the target on mount failure.)

After:

... guest RPC failure: overlay creation denied by policy: policyDecision< {"decision":"deny","input":{"containerID":"30c9f8572fe9264deddd24180c1603a5acdd950b3f05b9c869a73240bb0fb34e","layerPaths":["/run/mounts/scsi/m2","/run/mounts/scsi/m3","/run/mounts/scsi/m4","/run/mounts/scsi/m5","/run/mounts/scsi/m6","/run/mounts/scsi/m7"],"rule":"mount_overlay","target":"/run/gcs/c/30c9f8572fe9264deddd24180c1603a5acdd950b3f05b9c869a73240bb0fb34e/rootfs"},"reason":{"errors":["no matching containers for overlay"]}} >policyDecision: unknown"

Fixes: https://msazure.visualstudio.com/One/_workitems/edit/33232631
# partially - this commit fixes the most significant issue, but we still
# need to think about what happens when unmount fails.

Signed-off-by: Tingmao Wang <[email protected]>
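
The revertable section mechanism itself isn't shown on this page. As a rough sketch of the pattern (all names hypothetical; the real mechanism operates on Rego metadata inside GCS):

```go
package main

import "fmt"

// revertableSection snapshots some policy metadata at begin(); end()
// restores the snapshot unless the caller has committed.
type revertableSection struct {
	snapshot  map[string]string
	state     map[string]string
	committed bool
}

func begin(state map[string]string) *revertableSection {
	snap := make(map[string]string, len(state))
	for k, v := range state {
		snap[k] = v
	}
	return &revertableSection{snapshot: snap, state: state}
}

func (s *revertableSection) commit() { s.committed = true }

func (s *revertableSection) end() {
	if s.committed {
		return
	}
	// Roll back: restore the snapshot taken at begin().
	for k := range s.state {
		delete(s.state, k)
	}
	for k, v := range s.snapshot {
		s.state[k] = v
	}
}

func main() {
	policyMeta := map[string]string{}
	sec := begin(policyMeta)
	defer sec.end()

	policyMeta["devices:/run/mounts/scsi/m0"] = "mounted" // policy said yes
	mountErr := fmt.Errorf("dm-verity: input/output error") // simulated mount failure
	if mountErr != nil {
		fmt.Println("mount failed; metadata will be reverted:", mountErr)
		return // end() reverts, since commit() was never called
	}
	sec.commit() // success path: keep the new metadata
}
```
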
The MountLayer function does not in fact do any policy enforcement.

Signed-off-by: Tingmao Wang <[email protected]>
…Container if any unmount fails

Unlike mounts, if an unmount fails (either because the unmount itself fails, or, in the case of scratch disks, the unmount succeeds but dm-crypt fails to close the device), we cannot safely revert the Rego metadata, as we will not be able to properly "undo" the unmount if it has actually happened. Even if the unmount itself fails, this is still a highly unexpected state, and we might not want to continue using the device as usual.

One possible solution here is just to not revert the Rego metadata, and let rego pretend that the unmount succeeded. Doing so for SCSI disks and overlay mounts is probably OK, but not for 9p, as we might inadvertently be allowing the host to trick us into mounting on top of an existing host-controlled 9p filesystem (which might then lead to possible symlink-related exploits?).

However, since we really do not expect unmounts to fail (aside from policy enforcement failure, or if the host passes an invalid mount point), if one does fail, we protect ourselves by bailing out of all further mounts and unmounts (including overlay). We prevent ourselves from starting new containers as well, to protect against using a "broken" overlay. Note that if the policy denies such an unmount, we won't end up in this broken state, and will just return an error as usual (see the sketch after this commit message).

Fixes: https://msazure.visualstudio.com/One/_workitems/edit/33232631

Signed-off-by: Tingmao Wang <[email protected]>
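
A sketch of the bail-out behaviour described (struct, field and method names hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// host is a stand-in for the GCS host state; `broken` latches once any
// unexpected unmount failure occurs, after which all mount/unmount and
// container create/delete requests are refused.
type host struct {
	mu     sync.Mutex
	broken bool
}

func (h *host) checkNotBroken() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.broken {
		return errors.New("a previous unmount failed; refusing further mount/unmount/container operations")
	}
	return nil
}

func (h *host) unmount(target string) error {
	if err := h.checkNotBroken(); err != nil {
		return err
	}
	if err := doUnmount(target); err != nil {
		// We cannot tell how far the unmount got, so we cannot safely
		// revert the policy metadata either; latch into the broken state.
		h.mu.Lock()
		h.broken = true
		h.mu.Unlock()
		return fmt.Errorf("unmount %s: %w", target, err)
	}
	return nil
}

func doUnmount(target string) error { return errors.New("device or resource busy") } // simulated failure

func main() {
	h := &host{}
	fmt.Println(h.unmount("/run/mounts/scsi/m0")) // fails, latches broken
	fmt.Println(h.checkNotBroken())               // all further ops refused
}
```
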
…exist

Since this is a non-erroring no-op, we log a warning to ease debugging should something weird happen.

Signed-off-by: Tingmao Wang <[email protected]>
[cherry-picked from 0ca40bb4f130b3508f4a130011463070328d40d0]

- rego: Fix missing error reason when mounting a rw device to an existing mount point. This fixes a missing error message introduced in the last round of security fixes. It's not hugely important, but eases debugging if we get policy denials on mounting the scratch, for whatever reason. Also adds a test for it.
- Remove a no-op from rego. Checked with @<Matthew Johnson (AR)> earlier that this basically does nothing and is just something left over. However I will not actually add a remove op for `metadata.started` for now.

This PR is targeting the conf-aci branch on ADO because the commit being fixed is not on main yet. This should be backported to main together with the fixes from last month.
[cherry-picked from 421b12249544a334e36df33dc4846673b2a88279]

This set of changes fixes the [Metadata Desync with UVM State](https://msazure.visualstudio.com/One/_workitems/edit/33232631/) bug, by reverting the Rego policy state on mount and some types of unmount failures.

For mounts, minor cleanup code is added to ensure we close down the dm-crypt device if we fail to mount it. Aside from this, it is relatively straightforward - if we get a failure, the clean up functions will remove the directory, remove any dm-devices, and we will revert the Rego metadata.

For unmounts, careful consideration needs to be taken, since if the directory has been unmounted successfully (or even partially successfully?) and we get an error, we cannot just revert the policy state, as this may allow the host to use a broken / empty mount as one of the image layers. See 615c9a0bdf's commit message for more detailed thoughts. The solution I opted for is: for non-trivial unmount failure cases (i.e. not policy denial, not an invalid mountpoint), if the unmount fails, we block all further mount, unmount, container creation and deletion attempts. I think this is OK since we really do not expect unmounts to fail - this is especially the case for us since the only writable disk mount we have is the shared scratch disk, which we do not unmount at all unless we're about to kill the UVM.

Testing
-------

The "Rollback policy state on mount errors" commit message has some instructions for making a deliberately corrupted VHD (one that still passes the verityinfo extraction) that will cause a mount error.

The other way we could make mount / unmount fail, and thus test this change, is by mounting a tmpfs or an ro bind in relevant places.

To make unmount fail:

mkdir /run/gcs/c/.../rootfs/a && mount -t tmpfs none /run/gcs/c/.../rootfs/a

or

mkdir /run/gcs/mounts/scsi/m1/a && mount -t tmpfs none /run/gcs/mounts/scsi/m1/a

To make mount fail:

mount -o ro --bind /run/mounts/scsi /run/mounts/scsi

or

mount --bind -o ro /run/gcs/c /run/gcs/c

Once failure is triggered, one can make them work again by `umount`ing the tmpfs or ro bind.

What about other operations?
----------------------------

This PR covers mount and unmount of disks, overlays and 9p. Aside from setting `metadata.matches` as part of the narrowing scheme, and except for `metadata.started` to prevent re-using a container ID, Rego does not use persistent state for any other operations. Since it's not clear whether reverting the state would be semantically correct (we would need to carefully consider exactly what the side effects are of, say, failing to execute a process, start a container, or send a signal, etc.), and adding the revert to those operations does not currently affect much behaviour, I've opted not to apply the metadata revert to those for now.

Signed-off-by: Tingmao Wang <[email protected]>
This fixes a vulnerability (and reduces the surface for other similar potential vulnerabilities) in confidential containers where, if the host sends a mount/unmount modification request concurrently with an ongoing CreateContainer request, the host could re-order or skip image layers for the container to be started.

While this could be fixed by adding mutex lock/unlock around the individual modifyMappedVirtualDisk/modifyCombinedLayers/CreateContainer functions, we decided that in order to prevent any more of this class of issues, the UVM, when running in confidential mode, should simply not allow concurrent requests (with an exception for any actually long-running requests, which for now is just waitProcess).

Fixes: https://msazure.visualstudio.com/One/_workitems/edit/33357501

Signed-off-by: Tingmao Wang <[email protected]>
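
In spirit, the serialization described amounts to dispatching each bridge message synchronously instead of in a new goroutine, with long-running RPCs opted out. A simplified sketch (not the actual bridge code; all names are placeholders):

```go
package main

import (
	"fmt"
	"time"
)

type request struct {
	rpcName string
	work    func()
}

// isLongRunning marks the RPCs that may block for a long time and are
// therefore still handled concurrently (per the commit, just waitProcess).
func isLongRunning(rpcName string) bool { return rpcName == "waitProcess" }

func serveSequential(requests <-chan request, confidential bool) {
	for req := range requests {
		if !confidential || isLongRunning(req.rpcName) {
			go req.work() // concurrent handling, as before
			continue
		}
		req.work() // sequential: the next request is not read until this returns
	}
}

func main() {
	ch := make(chan request)
	go serveSequential(ch, true)
	ch <- request{"modifySettings", func() { fmt.Println("mount handled") }}
	ch <- request{"createContainer", func() { fmt.Println("create handled after mount") }}
	close(ch)
	time.Sleep(100 * time.Millisecond) // let the prints flush before exit
}
```
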
This doesn't change any behavior, just slightly moves the code around.

Signed-off-by: Tingmao Wang <[email protected]>
…evices

Refactor out the code that matches a path to the device containing it into findDeviceContainingPath, plus other small renames.

Signed-off-by: Tingmao Wang <[email protected]>
… of in-use layers/scratch

Signed-off-by: Tingmao Wang <[email protected]>
[cherry-picked from d1dbec46c86d08d9babf5fcd1b0d8445e7d878e4; note: since SetConfidentialUVMOptions is now refactored into securityOptions.SetConfidentialOptions, that function no longer has access to host. Thus the original change to it, which initializes newHostMounts, is now moved directly to modifyHostSettings.]

Since we're placing additional restrictions on when unmounts are allowed, we ensure that the impact of this change is scoped to confidential containers only.

Signed-off-by: Tingmao Wang <[email protected]>
This traces out whether hostMounts is set, and is probably useful for general debugging.

Signed-off-by: Tingmao Wang <[email protected]>
While mount changes from gcs, which lives in the init namespace, wouldn't actually affect any container that has already started (the container being in a separate mount namespace), the host trying to unmount an overlay that is in use is still incorrect, and we harden ourselves by preventing that.

Together with hostMounts, this also prevents us from trying to close down a dm device that is in use by a scratch / layer (even though Linux would EBUSY that anyway). If we allowed unmounting the in-use overlay, the host could do that first, then go on to ask us to unmount the scratch / layer and close the dm.

Signed-off-by: Tingmao Wang <[email protected]>
This will make it easier to debug should the gcs hang on a request. Example output:

...,LogrusEntry,1,10384,bridge: request processing thread in sequential mode blocked on the current request for more than 5 seconds,...,27,ComputeSystemModifySettingsV1,...

Signed-off-by: Tingmao Wang <[email protected]>
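
A sketch of that kind of slow-request warning using time.AfterFunc (the actual log plumbing and fields differ; names here are placeholders):

```go
package main

import (
	"fmt"
	"time"
)

// logIfBlocked returns a stop function; if the request is still being
// processed after the threshold, a warning naming the RPC is emitted.
func logIfBlocked(rpcName string, threshold time.Duration) (stop func()) {
	t := time.AfterFunc(threshold, func() {
		fmt.Printf("bridge: request processing thread in sequential mode blocked on the current request for more than %v (rpc=%s)\n", threshold, rpcName)
	})
	return func() { t.Stop() }
}

func main() {
	stop := logIfBlocked("ComputeSystemModifySettingsV1", 50*time.Millisecond)
	defer stop()
	time.Sleep(100 * time.Millisecond) // simulate a slow request handler
}
```
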
…ks on confidential

This fixes a vulnerability where the host can arbitrarily delete the rootfs (or potentially other mounts) of a container (without killing it), leaving it in a bad state, which can lead to potential exploits. We also require that the overlay is unmounted before deleting the container state, since otherwise we would be deleting the scratch directory of a mounted overlay.

Testing code:

```diff
diff --git a/internal/hcsoci/create.go b/internal/hcsoci/create.go
index 774449b..b3a4732f5 100644
--- a/internal/hcsoci/create.go
+++ b/internal/hcsoci/create.go
@@ -10,6 +10,7 @@ import (
 	"os"
 	"path/filepath"
 	"strconv"
+	"time"

 	"github.com/Microsoft/go-winio/pkg/guid"
 	"github.com/Microsoft/hcsshim/internal/cow"
@@ -278,6 +279,24 @@ func CreateContainer(ctx context.Context, createOptions *CreateOptions) (_ cow.C
 		if err != nil {
 			return nil, r, err
 		}
+
+		go func() {
+			ctx := context.Background()
+			log.G(ctx).WithField("containerID", coi.ID).Info("Scheduling deleteContainerState after 5 seconds")
+			defer func() {
+				if r := recover(); r != nil {
+					log.G(ctx).Errorf("recovered from panic: %v", r)
+				}
+			}()
+			time.Sleep(5 * time.Second)
+			log.G(ctx).WithField("containerID", coi.ID).Info("calling deleteContainerState")
+			if err := coi.HostingSystem.DeleteContainerState(ctx, coi.ID); err != nil {
+				log.G(ctx).WithError(err).Error("deleteContainerState failed")
+			} else {
+				log.G(ctx).WithField("containerID", coi.ID).Info("deleteContainerState completed successfully")
+			}
+		}()
+
 		return c, r, nil
 	}
@@ -285,6 +304,7 @@ func CreateContainer(ctx context.Context, createOptions *CreateOptions) (_ cow.C
 	if err != nil {
 		return nil, r, err
 	}
+
 	return system, r, nil
 }
```

On a system without this fix, one can observe that the container's rootfs is cleared (attempts to nsenter or exec in the container might still succeed, likely due to overlayfs quirks when something keeps /usr/bin open). `ls /` will show nothing.

Signed-off-by: Tingmao Wang <[email protected]>
…fidential containers

[cherry-picked from f81b450894206a79fff4d63182ff034ba503ebdb]

This PR contains 2 commits. The first one is the fix:

**bridge: Force sequential message handling for confidential containers**

This fixes a vulnerability (and reduces the surface for other similar potential vulnerabilities) in confidential containers where, if the host sends a mount/unmount modification request concurrently with an ongoing CreateContainer request, the host could re-order or skip image layers for the container to be started. While this could be fixed by adding mutex lock/unlock around the individual modifyMappedVirtualDisk/modifyCombinedLayers/CreateContainer functions, we decided that in order to prevent any more of this class of issues, the UVM, when running in confidential mode, should simply not allow concurrent requests (with an exception for any actually long-running requests, which for now is just waitProcess).

The second one adds a log entry for when the processing thread blocks. This will make it easier to debug should the gcs hang on a request.

This PR is created on ADO targeting the conf branch as this security vulnerability is not public yet. This fix should be backported to main once deployed.

Related work items: #33357501, #34327300

Signed-off-by: Tingmao Wang <[email protected]>
Currently the host can pass in a share name with an injected option in it, e.g.
"123,cache=loose". While this is currently probably harmless, it's still a
risk, and so we should block these kinds of mount option injections.

In Linux, this is parsed by v9fs_parse_options. It basically scans until the
next ',', and it doesn't matter whether we add quotes. In hcsshim, all plan9
mounts go through AddPlan9 on the host side, and that function uses a number as
the share name. Therefore, we will simply restrict the share name to be digits
only.
Test:
```diff
+++ b/internal/uvm/plan9.go
@@ -90,7 +90,7 @@ func (uvm *UtilityVM) AddPlan9(ctx context.Context, hostPath string, uvmPath str
 			RequestType: guestrequest.RequestTypeAdd,
 			Settings: guestresource.LCOWMappedDirectory{
 				MountPath: uvmPath,
-				ShareName: name,
+				ShareName: name + ",cache=loose",
 				Port:      plan9Port,
 				ReadOnly:  readOnly,
 			},
```
Output:
failed to share directory C:\lcow_info\ into UVM: rpc error: code = Unknown desc = guest modify: guest RPC failure: invalid plan9 share name "1,cache=loose": must match regex "^[0-9]+$"
Closes: https://msazure.visualstudio.com/One/_workitems/edit/34370380
Signed-off-by: Tingmao Wang <[email protected]>
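
The check implied by the error message above, as a standalone sketch (the regex is the one quoted in the output; the surrounding function is illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

// plan9ShareNamePattern: AddPlan9 on the host always uses a number as the
// share name, so anything else (in particular a ',' which would start a
// new 9p mount option) can be rejected outright.
var plan9ShareNamePattern = regexp.MustCompile(`^[0-9]+$`)

func checkPlan9ShareName(name string) error {
	if !plan9ShareNamePattern.MatchString(name) {
		return fmt.Errorf("invalid plan9 share name %q: must match regex %q", name, plan9ShareNamePattern.String())
	}
	return nil
}

func main() {
	fmt.Println(checkPlan9ShareName("1"))             // <nil>
	fmt.Println(checkPlan9ShareName("1,cache=loose")) // rejected
}
```
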
Since the expected usage of this struct requires the caller to undo operations in case of failure, it makes more sense to expect that the caller holds the lock throughout, until it has either committed the operation or undone it. This prevents accidental misuse, although in practice this struct is unlikely to be called from different threads anyway, due to sequential bridge message processing in confidential containers.

Locking hostMounts this way effectively means that mount/unmount operations are always single-threaded. However, this is the case in confidential containers anyway due to the sequential message processing, and on non-confidential containers hostMounts isn't used for now.

Signed-off-by: Tingmao Wang <[email protected]>
[cherry-picked from 055ee5eb4a802cb407575fb6cc1e9b07069d3319]

guest/network: Restrict hostname to valid characters

Because we write this hostname to /etc/hosts, without proper validation the host can trick us into writing arbitrary data to /etc/hosts, which can, for example, redirect things like ip6-localhost (but likely not localhost itself) to an attacker-controlled IP address.

We implement a check here that the host-provided DNS name in the OCI spec is valid. ACI actually restricts this to 5-63 characters of a-zA-Z0-9 and '-', where the first and last characters cannot be '-'. This aligns with the Kubernetes restriction. cf. IsValidDnsLabel in Compute-ACI.

However, there is no consistent official agreement on what a valid hostname can contain. RFC 952 says that a "Domain name" can be up to 24 characters of A-Z0-9, '.' and '-'; RFC 1123 expands this to 255 characters; but RFC 1035 claims that domain names can in fact contain anything if quoted (as long as the length is within 255 characters), and this is confirmed again in RFC 2181. In practice we see names with underscores, most commonly _dmarc.example.com. curl allows 0-9a-zA-Z and -._|~ and any other codepoints from \u0001-\u001f and above \u007f: https://github.com/curl/curl/blob/master/lib/urlapi.c#L527-L545

With the above in mind, this commit allows up to 255 characters of a-zA-Z0-9 and '_', '-' and '.'. This change is applied to all LCOW use cases.

This fix can be tested with the below code to bypass any host-side checks:

```diff
+++ b/internal/hcsoci/hcsdoc_lcow.go
@@ -52,6 +52,10 @@ func createLCOWSpec(ctx context.Context, coi *createOptionsInternal) (*specs.Spe
 		spec.Linux.Seccomp = nil
 	}

+	if spec.Annotations[annotations.KubernetesContainerType] == "sandbox" {
+		spec.Hostname = "invalid-hostname\n1.1.1.1 ip6-localhost ip6-loopback localhost"
+	}
+
 	return spec, nil
 }
```

Output:

time="2025-10-01T15:13:41Z" level=fatal msg="run pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: failed to create container f2209bb2960d5162fc9937d3362e1e2cf1724c56d1296ba2551ce510cb2bcd43: guest RPC failure: hostname \"invalid-hostname\\n1.1.1.1 ip6-localhost ip6-loopback localhost\" invalid: must match ^[a-zA-Z0-9_\\-\\.]{0,999}$: unknown"

Related work items: #34370598

Closes: https://msazure.visualstudio.com/One/_workitems/edit/34370598

Signed-off-by: Tingmao Wang <[email protected]>
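
A sketch of the described validation. Note the commit text says 255 characters while the quoted error message shows a {0,999} bound; this sketch uses 255, so treat the exact bound as an assumption:

```go
package main

import (
	"fmt"
	"regexp"
)

// The commit allows a-zA-Z0-9, '_', '-' and '.'; the length bound of 255
// follows the commit text (the quoted error message shows a different
// bound, so the exact number here is an assumption).
var validHostname = regexp.MustCompile(`^[a-zA-Z0-9_\-.]{0,255}$`)

func checkHostname(hostname string) error {
	if !validHostname.MatchString(hostname) {
		return fmt.Errorf("hostname %q invalid: must match %s", hostname, validHostname)
	}
	return nil
}

func main() {
	fmt.Println(checkHostname("pod-1234")) // <nil>
	// The embedded newline (and spaces) fall outside the character class,
	// so the /etc/hosts injection payload is rejected.
	fmt.Println(checkHostname("bad\n1.1.1.1 ip6-localhost evil"))
}
```
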
In C-LCOW, we do not want the host to be able to arbitrarily control mount options. Currently there are two possible ways mount options might be specified by the host:

1. For read-only mounts (image layers), the option "ro" is specified (see addLCOWLayer).
2. If the OCI spec passed by containerd contains physical/virtual disk mounts, it might contain mount options, and hcsshim would pass these through to GCS (see allocateLinuxResources).

We can allow 1 (and in fact require it to be consistent with the readOnly field in the request); since C-LCOW does not currently support external disk mounts, we can reject any other mount options passed via route 2.

Signed-off-by: Tingmao Wang <[email protected]>
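
A sketch of the option check as described (function shape hypothetical; it allows only "ro", and requires it to be consistent with the request's readOnly field):

```go
package main

import (
	"errors"
	"fmt"
)

// checkSCSIMountOptions allows only "ro", and only when the request is
// actually read-only; everything else (e.g. host-supplied options from OCI
// spec disk mounts) is rejected.
func checkSCSIMountOptions(options []string, readOnly bool) error {
	for _, opt := range options {
		if opt == "ro" && readOnly {
			continue
		}
		return fmt.Errorf("mount option %q not allowed in confidential mode", opt)
	}
	if readOnly {
		// Require "ro" to be present, consistent with the readOnly field.
		for _, opt := range options {
			if opt == "ro" {
				return nil
			}
		}
		return errors.New(`read-only mount must carry the "ro" option`)
	}
	return nil
}

func main() {
	fmt.Println(checkSCSIMountOptions([]string{"ro"}, true))          // <nil>
	fmt.Println(checkSCSIMountOptions([]string{"ro"}, false))         // rejected: inconsistent
	fmt.Println(checkSCSIMountOptions([]string{"cache=loose"}, true)) // rejected: not allowed
}
```
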
…nts, and prevent unmounting or deleting in-use things

[cherry-picked from d0334883cd43eecbb401a6ded3e0317179a3e54b]

This set of changes adds some checks (when running with a confidential policy) to prevent the host from trying to clean up mounts, overlays, or the container states dir while the container is running (or while the overlay has not been unmounted yet). This is done by enhancing the existing `hostMounts` utility, as well as adding a `terminated` flag to the Container struct.

The correct order of operations should always be:

- mount read-only layers and scratch (in any order; individual containers (not the sandbox) might not have their own scratch)
- mount the overlay
- start the container
- container terminates
- unmount overlay
- unmount read-only layers and scratch

The starting-up order is implied, and we now explicitly deny e.g. unmounting a layer/scratch before unmounting the overlay, or unmounting the overlay while the container has not terminated. We also deny deleteContainerState requests when the container is running or the overlay is mounted. Doing so while a container is running can result in unexpectedly deleting its files, which breaks it in unpredictable ways.

Signed-off-by: Tingmao Wang <[email protected]>
[cherry-picked from 1dd0b7ea0b0f91d3698f6008fb0bd5b0de777da6]

Blocks mount option passing for 9p (which was accidental) and SCSI disks.

- guest: Restrict plan9 share names to digits only in confidential mode
- hcsv2/uvm: Restrict SCSI mount options in confidential mode (the only one we allow is `ro`)

Related work items: #34370380
Warning: This PR is currently stacked on top of #2544. Once that is merged, this branch can be rebased to drop those commits.

This PR merges a set of changes applied to the C-ACI release, all of which (except one) were already reviewed by the ContainerPlatform team. Use the links below to review the changes from each cherry-picked PR, or see the original PR.
(Generated by
git log --reverse --first-parent --format='- [**%s**](https://github.com/microsoft/hcsshim/pull/2559/commits/%H) ([original ADO PR](https://dev.azure.com/msazure/ContainerPlatform/_git/Microsoft.hcsshim/pullrequest/{{%s}}))' mahati/refactoring-confidential..tingmao_github/merge-msrc-to-main | sed 's/{{Merged PR \([0-9]\+\).\+}}/\1/')