fix!: replace CreateInstanceFailed Launched reason with classified reasons (check dependent automations)#1728

Open
ravishen wants to merge 4 commits into
Azure:mainfrom
ravishen:ravishen/fix/launchErrors

@ravishen

Fixes #1727

What this does

When instance creation fails, the offerings error handlers already know precisely why — capacity, quota, or allocation — and each maps the failure to a specific reason constant (SKUNotAvailable, ZonalAllocationFailure, …). But before this change that reason was only used as a log label when marking the offering unavailable in the cache; it was dropped before reaching the user. Create failures surfaced on the Launched condition with the same generic reason: CreateInstanceFailed, with the actual cause only in the free-form message — leaving dashboards/alerts/automation with nothing stable and machine-readable to branch on.

This threads the reason the handler already knows out to the Launched condition, for parity with the behavior of karpenter-provider-aws.

Reason mapping

| Handler | Launched reason |
| --- | --- |
| handleSKUNotAvailableError / handleSKUNotAvailableForSubscriptionError | SKUNotAvailable |
| handleZonalAllocationFailureError | ZonalAllocationFailure |
| handleAllocationFailureError | AllocationFailure |
| handleOverconstrainedZonalAllocationFailureError | OverconstrainedZonalAllocationFailure |
| handleOverconstrainedAllocationFailureError | OverconstrainedAllocationFailure |
| handleSKUFamilyQuotaError | SubscriptionQuotaReached |
| handleLowPriorityQuotaError | SubscriptionQuotaReached |
| handleRegionalQuotaError | (unchanged, still InsufficientCapacityError) |

Before / after

# before
- type: Launched
  status: "Unknown"
  reason: CreateInstanceFailed     # identical for capacity, quota, allocation
  message: "the requested SKU is unavailable for instance type Standard_D4s_v5 ..."

# after
- type: Launched
  status: "Unknown"
  reason: SKUNotAvailable          # actionable, stable, machine-readable
  message: "the requested SKU is unavailable for instance type Standard_D4s_v5 ..."
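
To illustrate what a stable, machine-readable reason makes possible, here is a minimal sketch of a consumer that buckets the Launched reason for alert routing. The reason strings come from the mapping table in this description; the function name `category` and the bucket names are hypothetical, and the generic `CreateInstanceFailed` case is kept so that unclassified failures still land somewhere.

```go
package main

import "fmt"

// category groups a Launched condition reason into a coarse bucket.
// The bucket names are assumptions made for this sketch; only the
// reason strings are taken from the PR description.
func category(reason string) string {
	switch reason {
	case "SKUNotAvailable":
		return "capacity"
	case "ZonalAllocationFailure",
		"AllocationFailure",
		"OverconstrainedZonalAllocationFailure",
		"OverconstrainedAllocationFailure":
		return "allocation"
	case "SubscriptionQuotaReached":
		return "quota"
	case "CreateInstanceFailed":
		// Generic reason: still used when a failure is not classified.
		return "unclassified"
	default:
		return "other"
	}
}

func main() {
	for _, reason := range []string{
		"SKUNotAvailable",
		"ZonalAllocationFailure",
		"SubscriptionQuotaReached",
		"CreateInstanceFailed",
	} {
		fmt.Printf("%s -> %s\n", reason, category(reason))
	}
}
```

An automation that previously exact-matched `CreateInstanceFailed` would need a branch like this to keep seeing the classified cases.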

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

When error handlers classify an instance creation failure (e.g. SKUNotAvailable, ZonalAllocationFailure), preserve that classification by wrapping errors as CreateError with the specific reason code, allowing consumers to branch on the cause instead of treating all failures as generic CreateInstanceFailed.
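
The wrapping described above can be sketched roughly as follows. This is not the PR's code: `createError` is a hypothetical local stand-in for the CreateError type, and `classify`, `wrap`, and `launchedReason` are names invented for the illustration. It only shows the shape of the idea, where a handler attaches a reason and the caller recovers it or falls back to the generic one.

```go
package main

import (
	"errors"
	"fmt"
)

// Generic reason used when no handler classified the failure.
const createInstanceFailedReason = "CreateInstanceFailed"

// createError is a hypothetical stand-in for a CreateError-style type:
// it carries the underlying error plus a condition reason and message.
type createError struct {
	err     error
	reason  string
	message string
}

func (e *createError) Error() string { return e.err.Error() }
func (e *createError) Unwrap() error { return e.err }

// classify attaches a specific reason to err, as a handler that
// recognized the failure might do.
func classify(err error, reason string) error {
	return &createError{err: err, reason: reason, message: err.Error()}
}

// wrap adds call-site context while keeping the chain intact via %w.
func wrap(err error) error {
	return fmt.Errorf("creating instance: %w", err)
}

// launchedReason returns the reason and message to publish. If nothing in
// the chain is a *createError, it falls back to the generic reason and
// the error text.
func launchedReason(err error) (string, string) {
	var ce *createError
	if errors.As(err, &ce) {
		return ce.reason, ce.message
	}
	return createInstanceFailedReason, err.Error()
}

// errSKU is sample data for the demonstration below.
var errSKU = errors.New("the requested SKU is unavailable")

func main() {
	reason, message := launchedReason(wrap(classify(errSKU, "SKUNotAvailable")))
	fmt.Printf("classified:   reason=%s message=%q\n", reason, message)

	reason, message = launchedReason(errSKU)
	fmt.Printf("unclassified: reason=%s message=%q\n", reason, message)
}
```

Because the lookup uses `errors.As`, the classification survives any number of intermediate `%w` wraps between the handler and the code that sets the condition.
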
@ravishen

Author

@microsoft-github-policy-service agree

Comment thread pkg/cloudprovider/cloudprovider.go Outdated
// it falls back to the generic CreateInstanceFailed reason and error text, preserving prior
// behavior.
func toCreateError(err error, wrapMsg string) error {
reason, message := CreateInstanceFailedReason, err.Error()

@theunrepentantgeek theunrepentantgeek Jun 30, 2026

Member

In this project, we don't generally use compound assignments like this unless capturing multiple return values from a method/function call. #minor #readability

Suggested change
- reason, message := CreateInstanceFailedReason, err.Error()
+ reason := CreateInstanceFailedReason
+ message := err.Error()

Author

updated.

@theunrepentantgeek theunrepentantgeek left a comment

Member

Looks good, a few minor quibbles but nothing serious in the code.

Does this qualify as a breaking change? We're changing the Reason published for the error, I wonder if it's likely people have built automations/monitoring based on the existing values, and whether this change will need to be called out in the release notes.

Comment thread pkg/cloudprovider/cloudprovider.go Outdated
Comment on lines +717 to +718
var classified *cloudprovider.CreateError
if stderrors.As(err, &classified) {

Member

The new errors.AsType function in the stdlib can make this cleaner: #minor #moderncode

Suggested change
- var classified *cloudprovider.CreateError
- if stderrors.As(err, &classified) {
+ if cerr, ok := stderrors.AsType[*cloudprovider.CreateError](err); ok {

Author

Done, thanks!
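
For readers comparing the two styles discussed in this thread, here is a small sketch. To keep it buildable on toolchains that predate `errors.AsType`, it uses a local generic helper named `asType` (hypothetical, written on top of `errors.As`) rather than the stdlib function; `quotaError` and the sample family name are likewise invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// quotaError is a hypothetical error type used only in this sketch.
type quotaError struct{ family string }

func (e *quotaError) Error() string { return "quota reached for " + e.family }

// asType returns the first error in err's chain that has type E.
// It is a local helper for this sketch, built on errors.As.
func asType[E error](err error) (E, bool) {
	var target E
	ok := errors.As(err, &target)
	return target, ok
}

// wrapped returns a quotaError hidden behind one layer of context.
func wrapped() error {
	return fmt.Errorf("create failed: %w", &quotaError{family: "exampleFamily"})
}

// plain returns an error with no quotaError in its chain.
func plain() error { return errors.New("plain failure") }

func main() {
	// Two-step form: declare the target, then pass its address.
	var qe *quotaError
	if errors.As(wrapped(), &qe) {
		fmt.Println("errors.As found:", qe.family)
	}

	// Single-expression form: the type parameter names the target type,
	// and the variable is scoped to the if statement.
	if found, ok := asType[*quotaError](wrapped()); ok {
		fmt.Println("asType found:", found.family)
	}

	if _, ok := asType[*quotaError](plain()); !ok {
		fmt.Println("no quotaError in a plain error")
	}
}
```

The generic form removes the separate variable declaration and keeps the matched value out of the enclosing scope.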

// reason on the Launched condition, with the friendly message preserved. When no
// reason is expected (e.g. InsufficientCapacityError passthrough), the error must
// not carry a Launched reason and its message must match.
func assertHandledError(g Gomega, actual, expected error, expectedReason string) {

Member

Prefer explicit argument types #minor #readability

Suggested change
- func assertHandledError(g Gomega, actual, expected error, expectedReason string) {
+ func assertHandledError(g Gomega, actual error, expected error, expectedReason string) {

Author

done.

@ravishen

ravishen commented Jul 2, 2026

Author

> Looks good, a few minor quibbles but nothing serious in the code.
>
> Does this qualify as a breaking change? We're changing the Reason published for the error, I wonder if it's likely people have built automations/monitoring based on the existing values, and whether this change will need to be called out in the release notes.

I'm not sure it's breaking in the strict sense, since I don't know whether these reasons were ever documented as a stable contract. But agreed that anyone exact-matching reason: CreateInstanceFailed would silently stop matching for the classified cases, so a release-notes callout is cheap insurance. WDYT?

@theunrepentantgeek

Member

> I'm not sure it's breaking in the strict sense, since I don't know whether these reasons were ever documented as a stable contract. But agreed that anyone exact-matching reason: CreateInstanceFailed would silently stop matching for the classified cases, so a release-notes callout is cheap insurance. WDYT?

Including this in the release notes is a good idea - at the very least it means someone can discover why things broke (if they did).

@ravishen ravishen changed the title from "fix: thread classified error reasons through instance creation failures" to "fix!: replace CreateInstanceFailed Launched reason with classified reasons (check dependent automations)" Jul 3, 2026
matthchr previously approved these changes Jul 3, 2026

@matthchr matthchr left a comment

Member

Generally LGTM, had a few small comments. PTAL and let me know how you want to address them.

}
- return fmt.Errorf("subscription level %s vCPU quota for %s has been reached (may try provision an alternative instance type)", capacityType, instanceType.Name)
+ err := fmt.Errorf("subscription level %s vCPU quota for %s has been reached (may try provision an alternative instance type)", capacityType, instanceType.Name)
+ return corecloudprovider.NewCreateError(err, SubscriptionQuotaReachedReason, err.Error())

Member

Can you move these reason const definitions from this package to pkg/consts/condition_reasons.go (or similarly named), along with the ones at the top of pkg/cloudprovider/cloudprovider.go, so we have all the condition reasons together?

If you'd rather we do that as a follow-up PR, I'm fine with that too, to not bloat this PR.

Author

I'll do that as a follow-up PR to keep this one focused.

// reason on the Launched condition, with the friendly message preserved. When no
// reason is expected (e.g. InsufficientCapacityError passthrough), the error must
// not carry a Launched reason and its message must match.
func assertHandledError(g Gomega, actual error, expected error, expectedReason string) {

Member

minor: Move to pkg/test/expectations.go and export it as ExpectHandledError?

Member

Also mark as GinkgoHelper()?

Author

> minor: Move to pkg/test/expectations.go and export it as ExpectHandledError?

Moving it to pkg/test/expectations creates an import cycle, because that package already depends on the offerings package:
go list -deps ./pkg/test/expectations | grep -x github.com/Azure/karpenter-provider-azure/pkg/providers/instance/offerings
=> github.com/Azure/karpenter-provider-azure/pkg/providers/instance/offerings

> Also mark as GinkgoHelper()?

These are stdlib table tests (t.Run + NewWithT), not Ginkgo specs, so GinkgoHelper() wouldn't take effect here. I applied the stdlib equivalent instead: the helper now takes *testing.T and calls t.Helper(), matching assertOfferingsState.
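
As a rough illustration of the stdlib pattern mentioned here, the sketch below shows an assertion helper that calls `Helper()` first so failures are attributed to the calling test line. It is not the PR's `assertHandledError`: `reporter`, `reasonedError`, `assertReason`, `newReasoned`, and `recorder` are hypothetical names, and `reporter` is a small interface (which `*testing.T` satisfies) so the helper can be exercised outside `go test`.

```go
package main

import (
	"errors"
	"fmt"
)

// reporter is the subset of *testing.T this sketch needs.
type reporter interface {
	Helper()
	Errorf(format string, args ...any)
}

// reasonedError is a hypothetical error carrying a condition reason.
type reasonedError struct {
	err    error
	reason string
}

func (e *reasonedError) Error() string { return e.err.Error() }
func (e *reasonedError) Unwrap() error { return e.err }

// newReasoned builds a sample error with the given message and reason.
func newReasoned(message string, reason string) error {
	return &reasonedError{err: errors.New(message), reason: reason}
}

// assertReason checks that actual (assumed non-nil) carries wantReason and
// has wantMessage as its text. An empty wantReason means no reason is expected.
func assertReason(t reporter, actual error, wantReason string, wantMessage string) {
	t.Helper() // with *testing.T, report failures at the caller's line

	gotReason := ""
	var re *reasonedError
	if errors.As(actual, &re) {
		gotReason = re.reason
	}
	if gotReason != wantReason {
		t.Errorf("reason = %q, want %q", gotReason, wantReason)
	}
	if actual.Error() != wantMessage {
		t.Errorf("message = %q, want %q", actual.Error(), wantMessage)
	}
}

// recorder is a fake reporter that collects failures instead of failing a test.
type recorder struct{ failures []string }

func (r *recorder) Helper() {}
func (r *recorder) Errorf(format string, args ...any) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

func main() {
	rec := &recorder{}
	err := newReasoned("sku unavailable", "SKUNotAvailable")

	assertReason(rec, err, "SKUNotAvailable", "sku unavailable")   // matches
	assertReason(rec, err, "AllocationFailure", "sku unavailable") // reason mismatch

	fmt.Println("recorded failures:", len(rec.failures))
}
```

In a real table test the helper would be called as `assertReason(t, err, ...)` inside `t.Run`, with `t` being the subtest's `*testing.T`.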

Development

Successfully merging this pull request may close these issues.

Surface specific failure reason on the NodeClaim Launched condition (capacity vs quota vs allocation)

3 participants