Expose shoot last operation errors #653

Open
ebensom opened this issue Feb 9, 2025 · 0 comments
Labels
area/control-plane Related to all activities around Kyma Control Plane kind/feature Categorizes issue or PR as related to a new feature.

Comments

Member

ebensom commented Feb 9, 2025

Description

In the FSM steps that wait for shoot creation and reconciliation, whenever the shoot lastOperation.state is Processing, Error, or Failed, KIM should always log the lastErrors.

Furthermore, the lastErrors should also be conveyed and exposed in the Runtime CR status block, so that KEB and SRE tooling can parse and process them further.

An example of shoot lastErrors that KIM should capture and expose:

status:
  lastOperation:
    description: Waiting until Kubernetes API server rolled out
    lastUpdateTime: '2025-02-07T13:50:42Z'
    progress: 25
    state: Processing
    type: Create
  lastErrors:
    - description: "task \"Waiting until shoot worker nodes have been reconciled\" failed: Error while waiting for Worker shoot--kyma--f6c50cd/f6c50cd to become ready: error during reconciliation: Error reconciling Worker: failed while waiting for all machine deployments to be ready: machine(s) failed: 1 error occurred: \"shoot--kyma--f6c50cd-cpu-worker-0-z3-98968-l5nt9\": Cloud provider message - machine codes error: code = [ResourceExhausted] message = [InsufficientInstanceCapacity: We currently do not have sufficient m6i.16xlarge capacity in the Availability Zone you requested (eu-central-1a). Our system will be working on provisioning additional capacity. You can currently get m6i.16xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1b, eu-central-1c.\n\tstatus code: 500, request id: 2ba83cd3-4a2c-42c0-a00e-142bee1f9316]"
      taskID: Waiting until shoot worker nodes have been reconciled
      codes:
        - ERR_INFRA_RESOURCES_DEPLETED
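
The behaviour requested above could look roughly like the following Go sketch. It is only an illustration under stated assumptions: the types are simplified local stand-ins for the shoot status fields shown in the YAML example, not the Gardener API types, and `RuntimeStatus`, `ShootLastErrors`, `shouldReport`, and `captureLastErrors` are hypothetical names that do not exist in KIM today. The state value `Processing` follows the YAML example.

```go
package main

import (
	"fmt"
	"log"
)

// LastOperation and LastError are simplified stand-ins for the shoot status
// fields in the YAML example. They are NOT the Gardener API types.
type LastOperation struct {
	Type  string
	State string
}

type LastError struct {
	Description string
	TaskID      string
	Codes       []string
}

type ShootStatus struct {
	LastOperation *LastOperation
	LastErrors    []LastError
}

// RuntimeStatus is a hypothetical fragment of the Runtime CR status block.
// The field name ShootLastErrors is an assumption made for this sketch.
type RuntimeStatus struct {
	ShootLastErrors []LastError
}

// shouldReport returns true for the lastOperation states named in the issue.
func shouldReport(op *LastOperation) bool {
	if op == nil {
		return false
	}
	switch op.State {
	case "Processing", "Error", "Failed":
		return true
	}
	return false
}

// captureLastErrors logs every shoot lastError and copies it into the Runtime
// status. It returns the number of captured errors. For any other state it
// leaves the Runtime status untouched and returns 0.
func captureLastErrors(shoot ShootStatus, rt *RuntimeStatus) int {
	if !shouldReport(shoot.LastOperation) {
		return 0
	}
	rt.ShootLastErrors = make([]LastError, 0, len(shoot.LastErrors))
	for _, e := range shoot.LastErrors {
		log.Printf("shoot lastError: taskID=%q codes=%v description=%q",
			e.TaskID, e.Codes, e.Description)
		rt.ShootLastErrors = append(rt.ShootLastErrors, LastError{
			Description: e.Description,
			TaskID:      e.TaskID,
			Codes:       append([]string(nil), e.Codes...), // copy, do not alias
		})
	}
	return len(rt.ShootLastErrors)
}

func main() {
	shoot := ShootStatus{
		LastOperation: &LastOperation{Type: "Create", State: "Processing"},
		LastErrors: []LastError{{
			Description: "task failed (shortened for this sketch)",
			TaskID:      "Waiting until shoot worker nodes have been reconciled",
			Codes:       []string{"ERR_INFRA_RESOURCES_DEPLETED"},
		}},
	}
	rt := &RuntimeStatus{}
	n := captureLastErrors(shoot, rt)
	fmt.Printf("captured %d lastError(s), codes=%v\n", n, rt.ShootLastErrors[0].Codes)
}
```

Whether the Runtime status should be cleared once the shoot reaches a successful state is left open here; the sketch simply leaves the previously captured errors in place.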

Reasons

Gap in post-mortem troubleshooting. Once the shoot cluster is deleted, the only way to perform a forensic analysis of why Gardener could not reconcile the shoot is to ask Gardener DoD for the shoot-controller-manager logs. This is a manual, tedious, and time-consuming process.

The enhancement should be treated with high priority, as this gap was identified as a post-mortem action item: https://wiki.one.int.sap/wiki/pages/viewpage.action?pageId=5035150267

Attachments

@ebensom ebensom added area/control-plane Related to all activities around Kyma Control Plane kind/feature Categorizes issue or PR as related to a new feature. labels Feb 9, 2025