Skip to content

Fail query runner when nodes do not come up #25184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tdcmeehan
Copy link
Contributor

Description

When query runners fail to launch a cluster, a log is printed that shows this, but it doesn't fail the query runner. This could result in timeouts. It is better to fail early when the cluster does not launch.

Motivation and Context

More obvious errors.

Impact

See above

Test Plan

Query runners are used throughout our testing suite, so this should be thoroughly tested with existing tests.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@tdcmeehan tdcmeehan requested a review from hantangwangd May 23, 2025 16:25
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label May 23, 2025
@prestodb-ci prestodb-ci requested review from a team, BryanCutler and anandamideShakyan and removed request for a team May 23, 2025 16:25
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tdcmeehan for this code.

(server.isResourceManager() && activeNodeCount != expectedActiveNodesForRm)) {
return false;
if (!allNodes.getInactiveNodes().isEmpty()) {
if (nanosSince(startTimeInMs).compareTo(timeout) >= 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : Abstract the checked condition with custom exception message as a lambda maybe ?

Comment on lines +530 to +551
while (true) {
for (TestingPrestoServer server : servers) {
AllNodes allNodes = server.refreshNodes();
int activeNodeCount = allNodes.getActiveNodes().size();

if (!allNodes.getInactiveNodes().isEmpty() ||
(server.isCoordinator() && activeNodeCount != expectedActiveNodesForCoordinator) ||
(server.isResourceManager() && activeNodeCount != expectedActiveNodesForRm)) {
return false;
if (!allNodes.getInactiveNodes().isEmpty()) {
if (nanosSince(startTimeInMs).compareTo(timeout) >= 0) {
throw new TimeoutException(format("Timed out waiting for all nodes to be globally visible. Inactive nodes: %s", allNodes.getInactiveNodes()));
}
break;
}
else if ((server.isCoordinator() || server.isResourceManager()) && activeNodeCount != expectedActiveNodes) {
if (nanosSince(startTimeInMs).compareTo(timeout) >= 0) {
throw new TimeoutException(format(
"Timed out waiting for all nodes to be globally visible. Node count: %s, expected: %s",
activeNodeCount, expectedActiveNodes));
}
break;
}
return;
}
MILLISECONDS.sleep(10);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems in the new logic, if the first TestingPrestoServer server satisfies the check, we will return immediately. Should we make sure that all the servers satisfy the check before returning?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:IBM PR from IBM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants