Handle Out of host capacity scenario in OCI nodepools #8315
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -214,6 +214,27 @@ func (np *nodePool) DecreaseTargetSize(delta int) error { | |
} | ||
} | ||
klog.V(4).Infof("DECREASE_TARGET_CHECK_VIA_COMPUTE: %v", decreaseTargetCheckViaComputeBool) | ||
np.manager.InvalidateAndRefreshCache() | ||
nodes, err := np.manager.GetNodePoolNodes(np) | ||
if err != nil { | ||
klog.V(4).Error(err, "error while performing GetNodePoolNodes call") | ||
Do we already log an error somewhere (i.e., is this an extraneous log)? If we get an error while scaling down, we shouldn't hide it behind v==4.
Done, added an error log with default verbosity in the GetNodePoolNodes function. |
||
return err | ||
} | ||
// We do not have an OCI API that allows us to delete a node without a compute instance. So we rely on | ||
// the below approach to determine the number of running instances in a nodepool from the compute API and | ||
// update the size of the nodepool accordingly. We should move away from this approach once we have an API | ||
// to delete a specific node without a compute instance. | ||
if !decreaseTargetCheckViaComputeBool { | ||
We talked offline about how this isn't ideal; ideally, the Delete Node endpoint would be able to handle "deleting" these "ghost" instances. We should leave a comment explaining why we are doing it this way instead.
Done, added a comment explaining this. |
||
for _, node := range nodes { | ||
if node.Status != nil && node.Status.ErrorInfo != nil { | ||
if node.Status.ErrorInfo.ErrorClass == cloudprovider.OutOfResourcesErrorClass { | ||
klog.Infof("Using Compute to calculate nodepool size as nodepool may contain nodes without a compute instance.") | ||
decreaseTargetCheckViaComputeBool = true | ||
break | ||
} | ||
} | ||
} | ||
} | ||
var nodesLen int | ||
if decreaseTargetCheckViaComputeBool { | ||
nodesLen, err = np.manager.GetExistingNodePoolSizeViaCompute(np) | ||
|
@@ -222,12 +243,6 @@ func (np *nodePool) DecreaseTargetSize(delta int) error { | |
return err | ||
} | ||
} else { | ||
np.manager.InvalidateAndRefreshCache() | ||
nodes, err := np.manager.GetNodePoolNodes(np) | ||
if err != nil { | ||
klog.V(4).Error(err, "error while performing GetNodePoolNodes call") | ||
return err | ||
} | ||
nodesLen = len(nodes) | ||
} | ||
|
||
|
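The size calculation that this hunk reorders can be sketched on its own as follows. This is a minimal illustration under stated assumptions, not the PR's code: `ErrorClass`, `Instance`, `hasOutOfCapacityNode`, and `currentSize` are simplified, hypothetical stand-ins for the `cloudprovider` types and the `nodePool` manager calls seen in the diff, and the compute-backed count is passed in as a plain function.

```go
package main

import "fmt"

// ErrorClass is a hypothetical stand-in for the autoscaler's instance error class.
type ErrorClass int

const (
	OtherErrorClass ErrorClass = iota
	OutOfResourcesErrorClass
)

// ErrorInfo, Status, and Instance mirror only the fields the diff reads.
type ErrorInfo struct{ ErrorClass ErrorClass }
type Status struct{ ErrorInfo *ErrorInfo }
type Instance struct {
	ID     string
	Status *Status
}

// hasOutOfCapacityNode reports whether any node carries the out-of-resources
// error class, i.e. the pool may contain nodes without a compute instance.
func hasOutOfCapacityNode(nodes []Instance) bool {
	for _, n := range nodes {
		if n.Status != nil && n.Status.ErrorInfo != nil &&
			n.Status.ErrorInfo.ErrorClass == OutOfResourcesErrorClass {
			return true
		}
	}
	return false
}

// currentSize picks the source of truth for the pool size: the compute-backed
// count when it is forced or when a "ghost" node is present, otherwise the
// length of the node list.
func currentSize(nodes []Instance, forceCompute bool, countViaCompute func() (int, error)) (int, error) {
	if forceCompute || hasOutOfCapacityNode(nodes) {
		return countViaCompute()
	}
	return len(nodes), nil
}

func main() {
	nodes := []Instance{
		{ID: "a"},
		{ID: "b", Status: &Status{ErrorInfo: &ErrorInfo{ErrorClass: OutOfResourcesErrorClass}}},
	}
	// The stub pretends the compute API sees a single running instance.
	size, err := currentSize(nodes, false, func() (int, error) { return 1, nil })
	fmt.Println(size, err)
}
```

Fetching the node list before the branch, as the diff does, is what lets the out-of-capacity check run even when the compute-based count was not requested up front.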
I don't like string matching as a "contract". Is there no better way to have a hard error code that denotes out of host capacity?
As of today, we do not have a better approach. I have added a comment to move away from this approach once we have an errorCode for OOHC in the API response.
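One way to limit the string-matching "contract" until an error code exists is to isolate it behind a single classification function that prefers a code when one is present. The sketch below is only an illustration: `NodeError`, `oohcCode`, `oohcMarker`, and `classify` are hypothetical names, and neither the marker text nor the code value is taken from the OCI API.

```go
package main

import (
	"fmt"
	"strings"
)

// ErrorClass is a hypothetical stand-in for the autoscaler's error class.
type ErrorClass int

const (
	Other ErrorClass = iota
	OutOfResources
)

// NodeError is a hypothetical view of a node error: Code is assumed to be
// empty today and populated once the API exposes a hard error code.
type NodeError struct {
	Code    string
	Message string
}

const (
	// oohcCode is an assumed future error code, not a real OCI value.
	oohcCode = "OutOfHostCapacity"
	// oohcMarker is an assumed message substring, not confirmed API text.
	oohcMarker = "out of host capacity"
)

// classify trusts the error code when it is present and falls back to
// case-insensitive substring matching only when no code is available.
func classify(e NodeError) ErrorClass {
	if e.Code != "" {
		if e.Code == oohcCode {
			return OutOfResources
		}
		return Other
	}
	if strings.Contains(strings.ToLower(e.Message), oohcMarker) {
		return OutOfResources
	}
	return Other
}

func main() {
	fmt.Println(classify(NodeError{Message: "Out of host capacity."}) == OutOfResources)
	fmt.Println(classify(NodeError{Code: oohcCode}) == OutOfResources)
}
```

With this shape, removing the fallback later is a change to one function rather than to every call site that inspects the message.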