Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add encrypted scratch space for container sandbox #2212

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mkulke
Copy link
Collaborator

@mkulke mkulke commented Dec 16, 2024

fixes #1921

This is a PoC to address the problem of secure scratch space in which we can download and unpack images to the sandbox, write files at runtime etc.

Currently this is either backed by storage that is not protected by the TEE (packer) or as tmpfs mounts that will consume ram during runtime (mkosi). For larger images like pytorch w/ cuda this implies in theory that we have to set aside 8GB of ram just to be able to download and unpack the image. tmpfs mounts are by default limited to a ratio of the ram, so we might end up having to overprovision the machine to be able to run a workload.

The proposes change introduces a CAA-wide parameter that allows to specify the disk size. A cloud provider can use this as an indicator of the size of the to-be-provisioned podvm disk. A zero value means that we stick to the default size of the image.

if a size has been given (the mkosi image is ~0.5gb) it will extend the image to that size. At launch the podvm will claim that space and create an ad-hoc encrypted volume on it. Depending on the presence of a marker file in user-data the agent unit will mount that to /run/kata-containers prior to launching kata-agent.

I'm not super happy about the API (I would prefer to specify it per pod) and the implementation (the marker file) but I wanted to push early to get feedback. (also, it's only implemented for azure at the moment)

FWIW it does what it's supposed to do, see some numbers on #1958

@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch 3 times, most recently from a6c6fe4 to c19898a Compare December 16, 2024 17:58
@mkulke mkulke added test_e2e_libvirt Run Libvirt e2e tests and removed test_e2e_libvirt Run Libvirt e2e tests labels Dec 17, 2024
@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch 11 times, most recently from 5fb77d7 to ab7fccf Compare December 17, 2024 16:41
@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch from ab7fccf to 0e961cf Compare December 17, 2024 16:44
@bpradipt
Copy link
Member

if a size has been given (the mkosi image is ~0.5gb) it will extend the image to that size. At launch the podvm will claim that space and create an ad-hoc encrypted volume on it. Depending on the presence of a marker file in user-data the agent unit will mount that to /run/kata-containers prior to launching kata-agent.

Would it also make sense to extend this approach to use add-on disk like available with some instance types ? Not needed for initial iteration, but is this something we should consider and lay the foundation now?

I'm not super happy about the API (I would prefer to specify it per pod) and the implementation (the marker file) but I wanted to push early to get feedback. (also, it's only implemented for azure at the moment)

I think it would be better to also add an annotation (eg io.katacontainers.config.hypervisor.root_disk_size) in kata for remote hypervisor to use for per-pod scratch space. Similar to what we use for other such annotations.

@mkulke
Copy link
Collaborator Author

mkulke commented Dec 17, 2024

Would it also make sense to extend this approach to use add-on disk like available with some instance types ? Not needed for initial iteration, but is this something we should consider and lay the foundation now?

for future extensibility, I think we can freely iterate. We don't currently have a version contract between PodVM and CAA. We would have to look at the user-facing api maybe to make this possible.

I think it would be better to also add an annotation (eg io.katacontainers.config.hypervisor.root_disk_size) in kata for remote hypervisor to use for per-pod scratch space. Similar to what we use for other such annotations.

I agree, I didn't spot any existing fitting annotation in kata at first glance, so we might have to add it.

@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch from 0e961cf to 6655aa3 Compare December 17, 2024 18:04
@mkulke mkulke added the test_e2e_libvirt Run Libvirt e2e tests label Dec 17, 2024
@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch from 6655aa3 to 34a553c Compare December 17, 2024 18:13
@bpradipt
Copy link
Member

I think it would be better to also add an annotation (eg io.katacontainers.config.hypervisor.root_disk_size) in kata for remote hypervisor to use for per-pod scratch space. Similar to what we use for other such annotations.

I agree, I didn't spot any existing fitting annotation in kata at first glance, so we might have to add it.

Yes, a new annotation will need to be added.

@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch 4 times, most recently from ea95283 to a097ec7 Compare December 18, 2024 13:59
@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch from a097ec7 to ba8cc4c Compare December 18, 2024 14:04
@mkulke mkulke added test_e2e_libvirt Run Libvirt e2e tests and removed test_e2e_libvirt Run Libvirt e2e tests labels Dec 18, 2024
@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch from 4b05766 to c0c9f4c Compare December 20, 2024 09:36
@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch 3 times, most recently from 45bdba1 to 6b24c6c Compare January 7, 2025 11:03
@mkulke mkulke changed the title RFC: add encrypted scratch space for sandbox Add encrypted scratch space for container sandbox Jan 8, 2025
@mkulke mkulke marked this pull request as ready for review January 8, 2025 14:29
@mkulke mkulke requested a review from a team as a code owner January 8, 2025 14:29
@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch from 6b24c6c to a402bae Compare January 9, 2025 08:26
@mkulke
Copy link
Collaborator Author

mkulke commented Jan 9, 2025

Yes, a new annotation will need to be added.

We discussed it in the last community call and concluded to start with a caa-wide option. once this is merged, we can start add something to kata-runtime to make it tweakable per-pod

@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch from a402bae to 3521c2d Compare January 13, 2025 09:21
@mkulke mkulke requested a review from bpradipt January 13, 2025 09:32
Copy link
Member

@stevenhorsman stevenhorsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think commits 1 & 2 need some tidying as there is new code added in 1 that is changed in 2 which was a bit confusing. Additionally it would have been nicer (for me to review) to split the resource refactor out from commit 1, into a previous commit. It's not a blocker, maybe adding a sentence to explain that in the commit message #1 would be helpful for future people looking back at this?

src/cloud-api-adaptor/pkg/adaptor/cloud/cloud.go Outdated Show resolved Hide resolved
src/cloud-api-adaptor/pkg/util/cloud.go Show resolved Hide resolved
src/cloud-api-adaptor/pkg/adaptor/cloud/cloud.go Outdated Show resolved Hide resolved
src/cloud-api-adaptor/pkg/adaptor/cloud/cloud.go Outdated Show resolved Hide resolved
@@ -76,6 +76,7 @@ else
touch resources/buildBootableImage
nix develop ..#podvm-mkosi --command mkosi --environment=VARIANT_ID=production
qemu-img convert -f raw -O qcow2 build/system.raw build/podvm-$(PODVM_DISTRO)-$(ARCH).qcow2
qemu-img resize build/podvm-$(PODVM_DISTRO)-$(ARCH).qcow2 +100M
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any significance to +100M, or is it just a starter amount of space?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it needs to have a bit of size, so repart will not stumble over it at launch, attempting to create a partition without space. not sure if needs to be +100m, could just also work with +10m, have to test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I don't have an issue with the 100M, just wanted to check if there was special logic behind it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only being done on the x86 path here as only Azure is supporting this scratch space at the moment I assume? Do you see any obstacles in the code for s390x support if/when needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about partitioning schemes on s390x. if systemd-repart works in the same way, it should work ootb. The cloud-provider code would need to consider the .storage property when provisioning a VM and then repart would grab whatever it available to allocate it to the scratch space.

@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch 2 times, most recently from e3fbec7 to 0b40295 Compare January 13, 2025 12:42
Copy link
Member

@stevenhorsman stevenhorsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good starting point to me. Thanks @mkulke!

Copy link
Member

@bpradipt bpradipt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

The CI wasn't building debug images, due string/bool mismatch.

Signed-off-by: Magnus Kulke <[email protected]>
@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch 2 times, most recently from 0edb5be to 2c43186 Compare January 17, 2025 10:19
This adds the configuration for an encrypted scratch space on an mkosi
image. At bootup a /dev/sda4 partition will be created and encrypted
with LUKS using an ephemeral key.

The partition will use the space available on the image volume. By
default the qcow2 image has 100mb allocated for this space. This amount
of space will only work for very small images, hence we do not mount the
scratch space to `/run/kata-container` by default.

If the kata-agent service units encounters a `/run/peerpod/mount-scratch`
file it will mount the encrypted partition `/dev/sda4` to
`/run/kata-containers`.

This file is provisioned by `process-user-data`, configured by the CAA
daemonset.

A new CAA parameter has been introduced that allows to specify the disk
size. If we have a disk size that is larger than 0 a cloud provider can
attempt to consider this size to create space for an encrypted
scratch partition that can be mounted to /run/container.

Signed-off-by: Magnus Kulke <[email protected]>
@mkulke mkulke force-pushed the mkulke/add-scratch-space-for-sandbox branch from 2c43186 to c091e0c Compare January 17, 2025 10:27
@mkulke mkulke removed the test_e2e_libvirt Run Libvirt e2e tests label Jan 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

podvm-mkosi: need allocate more space for "/run" to store large image data
3 participants