fix!: replace CreateInstanceFailed Launched reason with classified reasons (check dependent automations)#1728
fix!: replace CreateInstanceFailed Launched reason with classified reasons (check dependent automations)#1728ravishen wants to merge 4 commits into
Conversation
When error handlers classify an instance creation failure (e.g. SKUNotAvailable, ZonalAllocationFailure), preserve that classification by wrapping errors as CreateError with the specific reason code, allowing consumers to branch on the cause instead of treating all failures as generic CreateInstanceFailed.
|
@microsoft-github-policy-service agree |
| // it falls back to the generic CreateInstanceFailed reason and error text, preserving prior | ||
| // behavior. | ||
| func toCreateError(err error, wrapMsg string) error { | ||
| reason, message := CreateInstanceFailedReason, err.Error() |
There was a problem hiding this comment.
In this project, we don't generally use compound assignments like this unless capturing multiple return values from a method/function call. #minor #readability
| reason, message := CreateInstanceFailedReason, err.Error() | |
| reason := CreateInstanceFailedReason | |
| , message := err.Error() |
theunrepentantgeek
left a comment
There was a problem hiding this comment.
Looks good, a few minor quibbles but nothing serious in the code.
Does this qualify as a breaking change? We're changing the Reason published for the error, I wonder if it's likely people have built automations/monitoring based on the existing values, and whether this change will need to be called out in the release notes.
| var classified *cloudprovider.CreateError | ||
| if stderrors.As(err, &classified) { |
There was a problem hiding this comment.
The new errors.AsType method in the stdlib can make this cleaner: #minor #moderncode
| var classified *cloudprovider.CreateError | |
| if stderrors.As(err, &classified) { | |
| if cerr, ok := stderrors.AsType[*cloudprovider.CreateError](err, &classified); ok { |
| // reason on the Launched condition, with the friendly message preserved. When no | ||
| // reason is expected (e.g. InsufficientCapacityError passthrough), the error must | ||
| // not carry a Launched reason and its message must match. | ||
| func assertHandledError(g Gomega, actual, expected error, expectedReason string) { |
There was a problem hiding this comment.
Prefer explicit argument types #minor #readability
| func assertHandledError(g Gomega, actual, expected error, expectedReason string) { | |
| func assertHandledError(g Gomega, actual error, expected error, expectedReason string) { |
I'm not sure if its breaking in strict sense since I don't know if these reasons were ever documented as a stable contract; But agreed that anyone exact-matching |
Including this in the release notes is a good idea - at the very least it means someone can discover why things broke (if they did). |
matthchr
left a comment
There was a problem hiding this comment.
Generally LGTM, had a few small comments. PTAL and let me know how you want to address them.
| } | ||
| return fmt.Errorf("subscription level %s vCPU quota for %s has been reached (may try provision an alternative instance type)", capacityType, instanceType.Name) | ||
| err := fmt.Errorf("subscription level %s vCPU quota for %s has been reached (may try provision an alternative instance type)", capacityType, instanceType.Name) | ||
| return corecloudprovider.NewCreateError(err, SubscriptionQuotaReachedReason, err.Error()) |
There was a problem hiding this comment.
Can you move these reason const definitions from this package to pkg/consts/condition_reasons.go (or similarly named), along with the ones at the top of pkg/cloudprovider/cloudprovider.go, so we have all the condition reasons together?
If you'd rather we do that as a follow-up PR, I'm fine with that too, to not bloat this PR.
There was a problem hiding this comment.
I'll do that as a follow-up PR to keep this one focused.
| // reason on the Launched condition, with the friendly message preserved. When no | ||
| // reason is expected (e.g. InsufficientCapacityError passthrough), the error must | ||
| // not carry a Launched reason and its message must match. | ||
| func assertHandledError(g Gomega, actual error, expected error, expectedReason string) { |
There was a problem hiding this comment.
minor: Move to pkg/test/expectations.go and export it as ExpectHandledError?
There was a problem hiding this comment.
minor: Move to
pkg/test/expectations.goand export it asExpectHandledError?
Moving it to pkg/test/expectations creates an import cycle
go list -deps ./pkg/test/expectations | grep -x \ github.com/Azure/karpenter-provider-azure/pkg/providers/instance/offerings => github.com/Azure/karpenter-provider-azure/pkg/providers/instance/offerings .
Also mark as
GingkoHelper()?
These are stdlib table tests (t.Run + NewWithT), not Ginkgo specs, so GinkgoHelper() wouldn't take effect here; applied the stdlib equivalent instead; the helper now takes *testing.T and calls t.Helper(), matching assertOfferingsState.
Fixes #1727
What this does
When instance creation fails, the offerings error handlers already know precisely why — capacity, quota, or allocation — and each maps the failure to a specific reason constant (
SKUNotAvailable,ZonalAllocationFailure, …). But before this change that reason was only used as a log label when marking the offering unavailable in the cache; it was dropped before reaching the user. Create failures surfaced on theLaunchedcondition with the same genericreason: CreateInstanceFailed, with the actual cause only in the free-formmessage— leaving dashboards/alerts/automation with nothing stable and machine-readable to branch on.This threads the reason the handler already knows out to the
Launchedcondition, matching the parity behaviorkarpenter-provider-awsReason mapping
LaunchedreasonhandleSKUNotAvailableError/handleSKUNotAvailableForSubscriptionErrorSKUNotAvailablehandleZonalAllocationFailureErrorZonalAllocationFailurehandleAllocationFailureErrorAllocationFailurehandleOverconstrainedZonalAllocationFailureErrorOverconstrainedZonalAllocationFailurehandleOverconstrainedAllocationFailureErrorOverconstrainedAllocationFailurehandleSKUFamilyQuotaErrorSubscriptionQuotaReachedhandleLowPriorityQuotaErrorSubscriptionQuotaReachedhandleRegionalQuotaErrorInsufficientCapacityError)Before / after
Does this change impact docs?