fangge1212

@fangge1212 fangge1212 commented Jul 28, 2025

AMD SEV-SNP is a confidential computing technology. This commit adds support for AMD SEV-SNP on AWS so that users can run cluster nodes with confidential computing enabled.

Upstream CAPA PR: kubernetes-sigs/cluster-api-provider-aws#5605

Contributor

openshift-ci bot commented Jul 28, 2025

Hello @fangge1212! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 28, 2025
@openshift-ci openshift-ci bot requested review from everettraven and mandre July 28, 2025 08:53
@fangge1212 fangge1212 force-pushed the aws_amd_sev_snp branch 2 times, most recently from 1271931 to a6478c1 Compare July 29, 2025 08:18
@fangge1212 fangge1212 force-pushed the aws_amd_sev_snp branch 3 times, most recently from 82e877d to 1df992a Compare August 6, 2025 22:44
// instanceType is the type of instance to create. Example: m4.xlarge
InstanceType string `json:"instanceType"`
// cpuOptions is the set of cpu options for the instance.
// +optional
Contributor

What happens if this field is not specified by a user?

Author

If unset, no CPU options are passed to the AWS platform and AWS default values are used.

Contributor

Do we know what the default values are currently on AWS?

Contributor

Even if we don't have concrete defaults we can point to, it would be nice to include guidance on how an end-user could identify what the defaults for their configuration would be.

Author

The CpuOptions in AWS consists of three fields:

In this PR, only amdSevSnp is exposed to users. I'm not entirely sure how best to describe this in the API documentation — should we include a link to the AWS documentation?

Author

Added the link to the AWS website.

@fangge1212 fangge1212 force-pushed the aws_amd_sev_snp branch 5 times, most recently from bbef962 to e48669f Compare August 8, 2025 06:42
@fangge1212
Author

/retest-required

@fangge1212
Author

/retest

@fangge1212 fangge1212 force-pushed the aws_amd_sev_snp branch 3 times, most recently from f28b17a to 520141d Compare August 12, 2025 11:09
Comment on lines 20 to 21
// cpuOptions defines CPU-related settings for the instance, including the confidential computing policy.
// If unset, no CPU options will be passed to the AWS platform and AWS default CPU options will be applied.
Contributor

How do I know what the "AWS default CPU options" that will be applied are? Are these literally defaults AWS imposes on requests that don't specify these options, or are these defaulted elsewhere?

Author

Thanks for the comment! To clarify: OpenShift does not set defaults for cpuOptions. If the field is unset, the RunInstances request is sent without a CpuOptions block, and AWS applies its own defaults for the chosen instance type.
I’ll update the field description to make that clear.
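
As a rough illustration of the behaviour described above, the sketch below only attaches a CPU options block to the request when the user set one. The types `runInstancesInput` and `cpuOptionsRequest` and the helper `buildInput` are hypothetical stand-ins (they are not the AWS SDK types), and the instance type and `enabled` value are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cpuOptionsRequest and runInstancesInput are hypothetical stand-ins for the
// request shape; they are NOT the AWS SDK structs.
type cpuOptionsRequest struct {
	AmdSevSnp string `json:"AmdSevSnp"`
}

type runInstancesInput struct {
	InstanceType string             `json:"InstanceType"`
	CpuOptions   *cpuOptionsRequest `json:"CpuOptions,omitempty"`
}

// buildInput (assumed helper) leaves CpuOptions nil when the user expressed
// no opinion, so the serialized request carries no CpuOptions block at all.
func buildInput(instanceType string, amdSevSnp *string) runInstancesInput {
	in := runInstancesInput{InstanceType: instanceType}
	if amdSevSnp != nil {
		in.CpuOptions = &cpuOptionsRequest{AmdSevSnp: *amdSevSnp}
	}
	return in
}

// render serializes the request so the presence or absence of the block is visible.
func render(in runInstancesInput) string {
	b, err := json.Marshal(in)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	enabled := "enabled"
	fmt.Println(render(buildInput("m6a.large", nil)))      // {"InstanceType":"m6a.large"}
	fmt.Println(render(buildInput("m6a.large", &enabled))) // {"InstanceType":"m6a.large","CpuOptions":{"AmdSevSnp":"enabled"}}
}
```

With no block in the request, whatever defaults apply are chosen on the AWS side for the selected instance type.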

Author

Updated the description, please take a look again.

Author

Updated the description again to make it more concise:
When omitted, this means no opinion and the AWS platform is left to choose a reasonable default.

// cpuOptions defines CPU-related settings for the instance, including the confidential computing policy.
// If unset, no cpuOptions will be included in the API request to AWS, and the instance will use the default CPU options
// applied by AWS for the selected instance type.
// +kubebuilder:validation:MinProperties=1
Contributor

Pop this one on the struct rather than the field please

Author

Updated.

// More details can be checked at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sev-snp.html
// When omitted, this means no opinion and the AWS platform is left to choose a reasonable default,
// which is subject to change without notice. The current default is Disabled.
// +kubebuilder:validation:Enum=Disabled;AMDEncryptedVirtualizationNestedPaging
Contributor

Any reason to put this here, vs on the type definition on L119?

Author

The placement of this line differs between AWSNetworkInterfaceType (within the struct) and MarketType (on the type definition). I was unsure which pattern to follow, so I selected one approach arbitrarily.
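
For reference, a minimal sketch of the "marker on the type definition" placement the reviewer asks about. The type name `AWSConfidentialComputePolicy` and the two enum values come from this PR; the constant names and the reduced `CPUOptions` struct are assumptions made for illustration.

```go
package main

import "fmt"

// AWSConfidentialComputePolicy carries the Enum marker on the type itself, so
// every field of this type shares the same allowed values.
// +kubebuilder:validation:Enum=Disabled;AMDEncryptedVirtualizationNestedPaging
type AWSConfidentialComputePolicy string

const (
	// Constant names below are assumed for illustration only.
	ConfidentialComputeDisabled AWSConfidentialComputePolicy = "Disabled"
	ConfidentialComputeSEVSNP   AWSConfidentialComputePolicy = "AMDEncryptedVirtualizationNestedPaging"
)

// CPUOptions (reduced) then needs no Enum marker on the field.
type CPUOptions struct {
	// +optional
	ConfidentialCompute AWSConfidentialComputePolicy `json:"confidentialCompute,omitempty"`
}

func main() {
	opts := CPUOptions{ConfidentialCompute: ConfidentialComputeSEVSNP}
	fmt.Println(opts.ConfidentialCompute)
}
```

Either placement is a documentation and schema-generation choice; it does not change the Go type.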

@fangge1212 fangge1212 force-pushed the aws_amd_sev_snp branch 2 times, most recently from 89d92a1 to f5ba092 Compare September 3, 2025 08:44
@fangge1212 fangge1212 requested a review from JoelSpeed September 4, 2025 02:01
@JoelSpeed
Contributor

/hold until upstream PRs have been merged

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 8, 2025
@damdo
Member

damdo commented Sep 8, 2025

The upstream PR is now unblocked, and reviews/merging can proceed
kubernetes-sigs/cluster-api-provider-aws#5605 (comment)

@fangge1212
Author

/unhold
The upstream PR is merged.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2025
type CPUOptions struct {
// confidentialCompute specifies whether confidential computing should be enabled for the instance,
// and, if so, which confidential computing technology to use.
// Valid values are: Disabled, AMDEncryptedVirtualizationNestedPaging
Contributor

Nit: For optional enums, we generally also mention that omitting the field is allowed.

In practice, omitting this field is an invalid configuration for now, but if new fields are added in the future, omitting this field and specifying another one would be valid.

Author

Thanks, updated. Also updated the description for CPUOptions to make it clear:
+ // If provided, it must not be empty — at least one field must be set.

@everettraven
Contributor

@fangge1212 Could you include a link to the associated PR that implements the validations outlined here in the validating webhook this API runs through?

@fangge1212
Author

> Could you include a link to the associated PR that implements the validations outlined here in the validating webhook this API runs through?

Previously, I added instance type validation for the specified confidential computing technology (currently only AMD SEV-SNP). However, the instance type list is hard-coded and needs to be updated manually, so I removed the validation. As a result, there is now no webhook validation for this API change.

@everettraven
Contributor

> Previously, I added instance type validation for the specified confidential computing technology (currently only AMD SEV-SNP). However, the instance type list is hard-coded and needs to be updated manually, so I removed the validation. As a result, there is now no webhook validation for this API change.

AFAIK none of the validations that exist on the markers in the API here are actually enforced unless done through the webhook. I would expect us to have changes in the webhook to reject invalid configurations.

@fangge1212
Author

> > Previously, I added instance type validation for the specified confidential computing technology (currently only AMD SEV-SNP). However, the instance type list is hard-coded and needs to be updated manually, so I removed the validation. As a result, there is now no webhook validation for this API change.
>
> AFAIK none of the validations that exist on the markers in the API here are actually enforced unless done through the webhook. I would expect us to have changes in the webhook to reject invalid configurations.

Ah, I didn't know this. I will add validation in webhook

@fangge1212
Author

> > > Previously, I added instance type validation for the specified confidential computing technology (currently only AMD SEV-SNP). However, the instance type list is hard-coded and needs to be updated manually, so I removed the validation. As a result, there is now no webhook validation for this API change.
> >
> > AFAIK none of the validations that exist on the markers in the API here are actually enforced unless done through the webhook. I would expect us to have changes in the webhook to reject invalid configurations.
>
> Ah, I didn't know this. I will add validation in webhook

When I added +kubebuilder:validation:MinProperties=1 earlier, I got this error. Doesn't that suggest the validation on the marker is working?

    --- FAIL: TestSSHKeyName/SSH_key_name_is_nil_is_valid (0.01s)
        sshkeyname_test.go:89: ValidateCreate() error = AWSMachine.infrastructure.cluster.x-k8
       "machine-9zsqb" is invalid: spec.cpuOptions: Invalid value: 0: spec.cpuOptions in body should have at least 1 properties, wantErr false

@everettraven
Contributor

everettraven commented Sep 19, 2025

That looks like it is an upstream test. @JoelSpeed knows the nuances here more than I do, but my understanding is that machine configurations for OpenShift are embedded as raw JSON in the Machine API and as such they must be programmatically evaluated

Contributor

openshift-ci bot commented Sep 19, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JoelSpeed
Contributor

Upstream, the types are real CRDs and so all of the validation markers apply. But, the AWSProviderConfig (and likewise for other platforms) in the Machine V1beta1 group downstream are not real CRDs, and are served via a RawExtension within the Machine CRD.

This means that they have no validation, and the only validation we can do is implemented as a Go ValidatingWebhookConfiguration which lives in the machine-api-operator repository.

This means the markers (while helpful and I appreciate being added even if they do nothing), will need to also be implemented manually in the webhook
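
To make the point above concrete, here is a minimal sketch of re-implementing the Enum marker by hand on a provider config that arrives as raw JSON. The reduced `providerConfig` and `cpuOptions` types and the `validateRaw` helper are hypothetical; this is not the machine-api-operator webhook code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cpuOptions and providerConfig are reduced, hypothetical stand-ins for the
// provider config that is embedded as raw JSON in the Machine object.
type cpuOptions struct {
	ConfidentialCompute string `json:"confidentialCompute,omitempty"`
}

type providerConfig struct {
	InstanceType string     `json:"instanceType"`
	CPUOptions   cpuOptions `json:"cpuOptions,omitempty"`
}

// validateRaw (assumed helper) decodes the raw bytes and enforces the enum in
// Go, because no CRD schema validation runs on the embedded JSON.
func validateRaw(raw []byte) error {
	var cfg providerConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return fmt.Errorf("decoding provider config: %w", err)
	}
	switch cfg.CPUOptions.ConfidentialCompute {
	case "", "Disabled", "AMDEncryptedVirtualizationNestedPaging":
		return nil // omitted, or one of the allowed enum values
	default:
		return fmt.Errorf("cpuOptions.confidentialCompute: unsupported value %q",
			cfg.CPUOptions.ConfidentialCompute)
	}
}

func main() {
	for _, raw := range []string{
		`{"instanceType":"m6a.large"}`,
		`{"instanceType":"m6a.large","cpuOptions":{"confidentialCompute":"Disabled"}}`,
		`{"instanceType":"m6a.large","cpuOptions":{"confidentialCompute":"Bogus"}}`,
	} {
		fmt.Printf("%s -> %v\n", raw, validateRaw([]byte(raw)))
	}
}
```

Note that this value-typed sketch treats an empty string the same as an omitted field, so it cannot enforce a "must not be empty" rule on its own.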

@fangge1212 fangge1212 force-pushed the aws_amd_sev_snp branch 2 times, most recently from c4d560a to 85a7d3f Compare September 23, 2025 03:33
@fangge1212
Author

> Upstream, the types are real CRDs and so all of the validation markers apply. But, the AWSProviderConfig (and likewise for other platforms) in the Machine V1beta1 group downstream are not real CRDs, and are served via a RawExtension within the Machine CRD.
>
> This means that they have no validation, and the only validation we can do is implemented as a Go ValidatingWebhookConfiguration which lives in the machine-api-operator repository.
>
> This means the markers (while helpful and I appreciate being added even if they do nothing), will need to also be implemented manually in the webhook

This is the draft PR (https://github.com/openshift/machine-api-operator/pull/1420/files) that adds webhook validation for the new cpuOptions parameter.
The only problem is that I can't distinguish between cpuOptions not being provided and being provided but empty (cpuOptions: {}), because both decode to:

CPUOptions{
    ConfidentialCompute: "",
}

Should we change CPUOptions to pointer type in AWSMachineProviderConfig? Then if it is not provided, it will be nil.
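
A minimal sketch of the ambiguity and of what the pointer buys, using reduced, hypothetical wrapper types (`byValue`, `byPointer`) rather than the real AWSMachineProviderConfig:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CPUOptions is reduced to the single field discussed in this PR.
type CPUOptions struct {
	ConfidentialCompute string `json:"confidentialCompute,omitempty"`
}

// byValue and byPointer are hypothetical reduced provider configs that differ
// only in whether CPUOptions is embedded by value or by pointer.
type byValue struct {
	CPUOptions CPUOptions `json:"cpuOptions,omitempty"`
}

type byPointer struct {
	CPUOptions *CPUOptions `json:"cpuOptions,omitempty"`
}

func decodeValue(raw string) (byValue, error) {
	var c byValue
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

func decodePointer(raw string) (byPointer, error) {
	var c byPointer
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

func main() {
	for _, raw := range []string{`{}`, `{"cpuOptions":{}}`} {
		v, err := decodeValue(raw)
		if err != nil {
			panic(err)
		}
		p, err := decodePointer(raw)
		if err != nil {
			panic(err)
		}
		// The value form decodes identically for both inputs; only the
		// pointer form records whether the key was present.
		fmt.Printf("%-18s value=%+v pointerIsNil=%t\n", raw, v.CPUOptions, p.CPUOptions == nil)
	}
}
```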

@everettraven
Contributor

> Should we change CPUOptions to pointer type in AWSMachineProviderConfig? Then if it is not provided, it will be nil.

Yes, this was a nuance of webhook based validation that I missed and is not something the linter is aware of. We can override the lint check if it fails on this.

@JoelSpeed and I have already started discussing how we can improve this linting behavior for APIs that would need the same validation behaviors.

@fangge1212
Author

> > Should we change CPUOptions to pointer type in AWSMachineProviderConfig? Then if it is not provided, it will be nil.
>
> Yes, this was a nuance of webhook based validation that I missed and is not something the linter is aware of. We can override the lint check if it fails on this.
>
> @JoelSpeed and I have already started discussing how we can improve this linting behavior for APIs that would need the same validation behaviors.

After changing CPUOptions to a pointer type, linting fails:

# make lint
hack/golangci-lint.sh run --new-from-rev=master 
make[1]: Entering directory '/root/go/src/github.com/openshift/api/tools'
make[1]: Nothing to be done for 'kube-api-linter'.
make[1]: Leaving directory '/root/go/src/github.com/openshift/api/tools'
machine/v1beta1/types_awsprovider.go:25:2: optionalfields: field CPUOptions does not allow the zero value. The field does not need to be a pointer. (kubeapilinter)
	CPUOptions *CPUOptions `json:"cpuOptions,omitempty,omitzero"`
	^
1 issues:
* kubeapilinter: 1
make: *** [Makefile:47: lint] Error 1

Remove omitzero:

[root@dell-per740-78 api]# make lint
hack/golangci-lint.sh run --new-from-rev=master 
make[1]: Entering directory '/root/go/src/github.com/openshift/api/tools'
make[1]: Nothing to be done for 'kube-api-linter'.
make[1]: Leaving directory '/root/go/src/github.com/openshift/api/tools'
machine/v1beta1/types_awsprovider.go:25:2: optionalfields: field CPUOptions does not allow the zero value. It must have the omitzero tag. (kubeapilinter)
	CPUOptions *CPUOptions `json:"cpuOptions,omitempty"`
	^
1 issues:
* kubeapilinter: 1
make: *** [Makefile:47: lint] Error 1

How about I remove the marker "// +kubebuilder:validation:MinProperties=1" to make the linting pass?

@everettraven
Contributor

everettraven commented Sep 24, 2025

> How about I remove the marker "// +kubebuilder:validation:MinProperties=1" to make the linting pass?

I don't think we want to remove that. We don't want to allow an empty struct ({}). I can override the linter failure if it is related to a nuance the linter isn't aware of, like this needing to be a pointer.

@everettraven
Contributor

everettraven commented Sep 24, 2025

> Remove omitzero:

I might be wrong here, but I think we also want to keep the omitzero so that the zero value of the struct is omitted during serialization as well.

I was wrong here, omitzero will still serialize as {} when put on a pointer.
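
A small sketch of the serialization behaviour in question, using a reduced, hypothetical `config` type. The `omitzero` option is recognized by `encoding/json` from Go 1.24; for a pointer field the zero value is nil, so a non-nil pointer to an empty struct is still emitted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CPUOptions is reduced to the single field discussed in this PR.
type CPUOptions struct {
	ConfidentialCompute string `json:"confidentialCompute,omitempty"`
}

// config is a hypothetical reduced provider config with a pointer field that
// carries both omitempty and omitzero, as in the first lint attempt.
type config struct {
	CPUOptions *CPUOptions `json:"cpuOptions,omitempty,omitzero"`
}

func marshal(c config) string {
	b, err := json.Marshal(c)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(config{}))                          // {}
	fmt.Println(marshal(config{CPUOptions: &CPUOptions{}})) // {"cpuOptions":{}}
}
```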

@fangge1212
Author

fangge1212 commented Sep 25, 2025

When CPUOptions is a pointer type, I can't distinguish between the following configurations in webhook validation:

  • provided but empty(CPUOptions:{})
  • provided with zero value(CPUOptions:{ConfidentialCompute:""})
    Both of them will be CPUOptions:{ConfidentialCompute:""}

When CPUOptions is a struct type, I can't distinguish between the following configurations in webhook validation:

  • not provided
  • provided but empty(CPUOptions:{})
  • provided with zero value(CPUOptions:{ConfidentialCompute:""})
    All of them will be CPUOptions:{ConfidentialCompute:""}

So there is no way to validate minProperties=1 for CPUOptions in webhook validation.

Updates:
I set both CPUOptions and ConfidentialCompute to pointer type, now I can distinguish between the three configurations. But then I got a new kubeapilinter error:

machine/v1beta1/types_awsprovider.go:147:2: optionalfields: field ConfidentialCompute does not allow the zero value. The field does not need to be a pointer. (kubeapilinter)
	ConfidentialCompute *AWSConfidentialComputePolicy `json:"confidentialCompute,omitempty"`
	^

@everettraven
Contributor

> Updates:
> I set both CPUOptions and ConfidentialCompute to pointer type, now I can distinguish between the three configurations. But then I got a new kubeapilinter error:

That seems like a reasonable error to override as well if you can't properly distinguish between the necessary states otherwise.

AMD SEV-SNP is one of the confidential computing technologies.
This commit adds support for AMD SEV-SNP on AWS, so users can
utilize the confidential computing on the cluster nodes.

Signed-off-by: Fangge Jin <[email protected]>
@fangge1212
Author

fangge1212 commented Sep 26, 2025

> > Updates:
> > I set both CPUOptions and ConfidentialCompute to pointer type, now I can distinguish between the three configurations. But then I got a new kubeapilinter error:
>
> That seems like a reasonable error to override as well if you can't properly distinguish between the necessary states otherwise.

@everettraven
I have updated the PR with both CPUOptions and ConfidentialCompute set to pointer types. Now I just need to wait for you to override the lint failure, right?

Contributor

openshift-ci bot commented Sep 26, 2025

@fangge1212: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/lint | 5f65881 | link | true | /test lint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
5 participants