Fail query runner when nodes do not come up #25184
Thanks @tdcmeehan for this code.
        (server.isResourceManager() && activeNodeCount != expectedActiveNodesForRm)) {
    return false;
if (!allNodes.getInactiveNodes().isEmpty()) {
    if (nanosSince(startTimeInMs).compareTo(timeout) >= 0) {
Nit: maybe abstract the checked condition, together with its custom exception message, into a lambda?
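One way the suggestion above could look is sketched here. This is only an illustration: `TimeoutChecks` and `checkOrTimeout` are hypothetical names, not part of Presto or of this PR, and the elapsed-time check uses plain `System.nanoTime()` rather than the `nanosSince` helper from the diff.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical helper (an assumption for illustration, not Presto code).
final class TimeoutChecks
{
    private TimeoutChecks() {}

    /**
     * Returns true when the condition holds. Otherwise throws a
     * TimeoutException carrying the supplied message if the timeout has
     * elapsed, or returns false so that the caller can sleep and retry.
     */
    static boolean checkOrTimeout(
            BooleanSupplier condition,
            Supplier<String> failureMessage,
            long startNanos,
            Duration timeout)
            throws TimeoutException
    {
        if (condition.getAsBoolean()) {
            return true;
        }
        if (System.nanoTime() - startNanos >= timeout.toNanos()) {
            // The message is built lazily, only when the timeout is reached.
            throw new TimeoutException(failureMessage.get());
        }
        return false;
    }
}
```

With a helper of this shape, each branch in the loop would reduce to one call that passes the condition and its message as lambdas, for example a condition on the inactive node list paired with the "Inactive nodes" message.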
while (true) {
    for (TestingPrestoServer server : servers) {
        AllNodes allNodes = server.refreshNodes();
        int activeNodeCount = allNodes.getActiveNodes().size();

        if (!allNodes.getInactiveNodes().isEmpty() ||
                (server.isCoordinator() && activeNodeCount != expectedActiveNodesForCoordinator) ||
                (server.isResourceManager() && activeNodeCount != expectedActiveNodesForRm)) {
            return false;
        if (!allNodes.getInactiveNodes().isEmpty()) {
            if (nanosSince(startTimeInMs).compareTo(timeout) >= 0) {
                throw new TimeoutException(format("Timed out waiting for all nodes to be globally visible. Inactive nodes: %s", allNodes.getInactiveNodes()));
            }
            break;
        }
        else if ((server.isCoordinator() || server.isResourceManager()) && activeNodeCount != expectedActiveNodes) {
            if (nanosSince(startTimeInMs).compareTo(timeout) >= 0) {
                throw new TimeoutException(format(
                        "Timed out waiting for all nodes to be globally visible. Node count: %s, expected: %s",
                        activeNodeCount, expectedActiveNodes));
            }
            break;
        }
        return;
    }
    MILLISECONDS.sleep(10);
It seems that in the new logic, if the first TestingPrestoServer satisfies the check, we return immediately. Should we make sure that all the servers satisfy the check before returning?
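A minimal sketch of a loop that returns only after every server passes is shown below. It is not the PR's code: `ClusterWait`, `Snapshot`, and `waitForAllNodes` are hypothetical stand-ins for `TestingPrestoServer` and `AllNodes`, and for brevity the node-count check is applied to every server rather than only to coordinators and resource managers.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch (an assumption for illustration, not Presto code).
final class ClusterWait
{
    private ClusterWait() {}

    /** Simplified stand-in for one server's current view of the cluster. */
    static final class Snapshot
    {
        final int activeNodes;
        final int inactiveNodes;

        Snapshot(int activeNodes, int inactiveNodes)
        {
            this.activeNodes = activeNodes;
            this.inactiveNodes = inactiveNodes;
        }
    }

    static void waitForAllNodes(List<Supplier<Snapshot>> servers, int expectedActiveNodes, Duration timeout)
            throws InterruptedException, TimeoutException
    {
        long startNanos = System.nanoTime();
        while (true) {
            // Description of the first failing check in this pass, or null if none failed.
            String problem = null;
            for (Supplier<Snapshot> server : servers) {
                Snapshot view = server.get();
                if (view.inactiveNodes > 0) {
                    problem = "Inactive nodes: " + view.inactiveNodes;
                    break;
                }
                if (view.activeNodes != expectedActiveNodes) {
                    problem = "Node count: " + view.activeNodes + ", expected: " + expectedActiveNodes;
                    break;
                }
            }
            if (problem == null) {
                // Reached only when the loop visited every server without a failure.
                return;
            }
            if (System.nanoTime() - startNanos >= timeout.toNanos()) {
                throw new TimeoutException("Timed out waiting for all nodes to be globally visible. " + problem);
            }
            Thread.sleep(10);
        }
    }
}
```

The key difference from the hunk above is that the successful return sits after the `for` loop instead of inside it, so a first server that already sees the full cluster cannot end the wait while a later server is still lagging.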
Description
When a query runner fails to launch a cluster, it logs the failure but does not itself fail. This can result in timeouts later on. It is better to fail early when the cluster does not come up.
Motivation and Context
Cluster startup failures surface as explicit errors rather than as later timeouts.
Impact
See above
Test Plan
Query runners are used throughout our testing suite, so the existing tests should exercise this change thoroughly.
Contributor checklist
Release Notes