
Handle Out of host capacity scenario in OCI nodepools #8315

Open · wants to merge 1 commit into master
9 changes: 7 additions & 2 deletions cluster-autoscaler/cloudprovider/oci/nodepools/oci_manager.go
@@ -525,6 +525,7 @@ func (m *ociManagerImpl) GetNodePoolNodes(np NodePool) ([]cloudprovider.Instance

nodePool, err := m.nodePoolCache.get(np.Id())
if err != nil {
klog.Error(err, "error while performing GetNodePoolNodes call")
return nil, err
}

@@ -540,10 +541,14 @@ func (m *ociManagerImpl) GetNodePoolNodes(np NodePool) ([]cloudprovider.Instance

if node.NodeError != nil {

// We should move away from the approach of classifying a node error as an Out of host
// capacity error through string comparison. An error code specifically for Out of host
// capacity must be set and returned in the API response.
errorClass := cloudprovider.OtherErrorClass
if *node.NodeError.Code == "LimitExceeded" ||
(*node.NodeError.Code == "InternalServerError" &&
strings.Contains(*node.NodeError.Message, "quota")) {
*node.NodeError.Code == "QuotaExceeded" ||
(*node.NodeError.Code == "InternalError" &&
strings.Contains(*node.NodeError.Message, "Out of host capacity")) {
Contributor:
I don't like string matching as a "contract". Is there no better way to have a hard error code that denotes out of host capacity?

Contributor Author:
As of today we do not have a better approach. Have added a comment to move away from this approach once we have an errorCode for OOHC in the API response.

errorClass = cloudprovider.OutOfResourcesErrorClass
}

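The reviewer's concern about string matching as a "contract" suggests isolating the classification in one place so it can be swapped out once the API returns a dedicated error code. A minimal sketch of such a helper, using the same conditions as the diff above (the helper name `classifyNodeError` and the simplified `ErrorClass` type are illustrative, not part of the PR):

```go
package main

import (
	"fmt"
	"strings"
)

// ErrorClass is a simplified stand-in for the cluster-autoscaler
// cloudprovider error classes.
type ErrorClass string

const (
	OtherErrorClass          ErrorClass = "Other"
	OutOfResourcesErrorClass ErrorClass = "OutOfResources"
)

// classifyNodeError concentrates the string-matching logic in a single
// helper so it can be replaced once OKE returns a dedicated error code
// for out-of-host-capacity instead of a generic InternalError.
func classifyNodeError(code, message string) ErrorClass {
	if code == "LimitExceeded" ||
		code == "QuotaExceeded" ||
		(code == "InternalError" && strings.Contains(message, "Out of host capacity")) {
		return OutOfResourcesErrorClass
	}
	return OtherErrorClass
}

func main() {
	fmt.Println(classifyNodeError("InternalError", "blah blah Out of host capacity blah blah")) // OutOfResources
	fmt.Println(classifyNodeError("InternalError", "unexpected failure"))                       // Other
}
```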
@@ -120,8 +120,8 @@ func TestGetNodePoolNodes(t *testing.T) {
{
Id: common.String("node8"),
NodeError: &oke.NodeError{
Code: common.String("InternalServerError"),
Message: common.String("blah blah quota exceeded blah blah"),
Code: common.String("InternalError"),
Message: common.String("blah blah Out of host capacity blah blah"),
},
},
{
@@ -186,8 +186,8 @@ func TestGetNodePoolNodes(t *testing.T) {
State: cloudprovider.InstanceCreating,
ErrorInfo: &cloudprovider.InstanceErrorInfo{
ErrorClass: cloudprovider.OutOfResourcesErrorClass,
ErrorCode: "InternalServerError",
ErrorMessage: "blah blah quota exceeded blah blah",
ErrorCode: "InternalError",
ErrorMessage: "blah blah Out of host capacity blah blah",
},
},
},
27 changes: 21 additions & 6 deletions cluster-autoscaler/cloudprovider/oci/nodepools/oci_node_pool.go
@@ -214,6 +214,27 @@ func (np *nodePool) DecreaseTargetSize(delta int) error {
}
}
klog.V(4).Infof("DECREASE_TARGET_CHECK_VIA_COMPUTE: %v", decreaseTargetCheckViaComputeBool)
np.manager.InvalidateAndRefreshCache()
nodes, err := np.manager.GetNodePoolNodes(np)
if err != nil {
klog.V(4).Error(err, "error while performing GetNodePoolNodes call")
Contributor:
Do we already log an error somewhere (i.e. is this an extraneous log)?

If we get an error while scaling down, we shouldn't hide it behind v==4.

Contributor Author:
Done, added an error log with default verbosity in the GetNodePoolNodes function.

return err
}
// We do not have an OCI API that allows us to delete a node without a compute instance. So we rely on
// the below approach to determine the number of running instances in a nodepool from the compute API and
// update the size of the nodepool accordingly. We should move away from this approach once we have an API
// to delete a specific node without a compute instance.
if !decreaseTargetCheckViaComputeBool {
Contributor:
We talked offline about how this isn't ideal and the ideal case is that Delete Node endpoint would be able to handle "deleting" these "ghost" instances.

We should leave a comment explaining why we are doing this way instead

Contributor Author:
Done, added a comment explaining this.

for _, node := range nodes {
if node.Status != nil && node.Status.ErrorInfo != nil {
if node.Status.ErrorInfo.ErrorClass == cloudprovider.OutOfResourcesErrorClass {
klog.Infof("Using Compute to calculate nodepool size as nodepool may contain nodes without a compute instance.")
decreaseTargetCheckViaComputeBool = true
break
}
}
}
}
var nodesLen int
if decreaseTargetCheckViaComputeBool {
nodesLen, err = np.manager.GetExistingNodePoolSizeViaCompute(np)
@@ -222,12 +243,6 @@ func (np *nodePool) DecreaseTargetSize(delta int) error {
return err
}
} else {
np.manager.InvalidateAndRefreshCache()
nodes, err := np.manager.GetNodePoolNodes(np)
if err != nil {
klog.V(4).Error(err, "error while performing GetNodePoolNodes call")
return err
}
nodesLen = len(nodes)
}
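The fallback added in DecreaseTargetSize can be summarized: if any node in the pool reports an out-of-resources error, the pool size is taken from the Compute API rather than the node list, because such "ghost" nodes never received a backing compute instance and cannot be deleted individually. A sketch of that decision under simplified stand-in types (`Instance` and `shouldCountViaCompute` here are illustrative, not the PR's actual types):

```go
package main

import "fmt"

// ErrorClass is a simplified stand-in for the cloudprovider error classes.
type ErrorClass int

const (
	OtherErrorClass ErrorClass = iota
	OutOfResourcesErrorClass
)

// Instance is a stand-in for a nodepool node: ErrorClass is nil when the
// node has no provisioning error.
type Instance struct {
	ErrorClass *ErrorClass
}

// shouldCountViaCompute reports whether the nodepool size should be read
// from the Compute API: true when any node failed with an out-of-resources
// error, since such nodes have no backing compute instance and would
// otherwise be miscounted by the node list.
func shouldCountViaCompute(nodes []Instance) bool {
	for _, n := range nodes {
		if n.ErrorClass != nil && *n.ErrorClass == OutOfResourcesErrorClass {
			return true
		}
	}
	return false
}

func main() {
	oohc := OutOfResourcesErrorClass
	fmt.Println(shouldCountViaCompute([]Instance{{}, {ErrorClass: &oohc}})) // true
	fmt.Println(shouldCountViaCompute([]Instance{{}, {}}))                 // false
}
```

The design mirrors the PR: the cheaper node-list count remains the default, and the Compute-based count is only used when the pool may contain capacity-failed nodes.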
