Fail query runner when nodes do not come up #25184

tdcmeehan · 2025-05-23T16:25:45Z

Description

When a query runner fails to launch its cluster, it logs the failure but does not itself fail. The problem can then surface much later as a timeout. It is better to fail early when the cluster does not launch.

Motivation and Context

More obvious errors.

Impact

See above

Test Plan

Query runners are used throughout our testing suite, so the existing tests should exercise this change thoroughly.

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

== NO RELEASE NOTE ==

aditi-pandit

Thanks @tdcmeehan for this code.

aditi-pandit · 2025-05-24T00:11:51Z

presto-tests/src/main/java/com/facebook/presto/tests/DistributedQueryRunner.java

-                    (server.isResourceManager() && activeNodeCount != expectedActiveNodesForRm)) {
-                return false;
+                if (!allNodes.getInactiveNodes().isEmpty()) {
+                    if (nanosSince(startTimeInMs).compareTo(timeout) >= 0) {


Nit: Maybe abstract the checked condition, together with its custom exception message, into a lambda?
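One possible reading of this nit, as a minimal sketch rather than the code in the PR: the condition and its failure message are passed as lambdas to a single helper, so the elapsed-time check and the `throw` are written once. The class `TimedCheck`, the method `holdsOrTimesOut`, and its parameters are hypothetical names chosen for illustration; they are not part of Presto.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical helper, not part of Presto.
final class TimedCheck
{
    private TimedCheck() {}

    // Returns true if the condition holds. If it does not hold and the timeout
    // has elapsed since startNanos, throws with the lazily built message.
    // Otherwise returns false so the caller can sleep and retry.
    static boolean holdsOrTimesOut(
            BooleanSupplier condition,
            Supplier<String> failureMessage,
            long startNanos,
            Duration timeout)
            throws TimeoutException
    {
        if (condition.getAsBoolean()) {
            return true;
        }
        if (System.nanoTime() - startNanos >= timeout.toNanos()) {
            // The message is only built when it is actually needed
            throw new TimeoutException(failureMessage.get());
        }
        return false;
    }
}
```

With a helper of this shape, each branch in the polling loop would reduce to one call that supplies its own condition and message, for example a lambda comparing the active node count with the expected count.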

hantangwangd · 2025-05-24T05:35:38Z

presto-tests/src/main/java/com/facebook/presto/tests/DistributedQueryRunner.java

+        while (true) {
+            for (TestingPrestoServer server : servers) {
+                AllNodes allNodes = server.refreshNodes();
+                int activeNodeCount = allNodes.getActiveNodes().size();

-            if (!allNodes.getInactiveNodes().isEmpty() ||
-                    (server.isCoordinator() && activeNodeCount != expectedActiveNodesForCoordinator) ||
-                    (server.isResourceManager() && activeNodeCount != expectedActiveNodesForRm)) {
-                return false;
+                if (!allNodes.getInactiveNodes().isEmpty()) {
+                    if (nanosSince(startTimeInMs).compareTo(timeout) >= 0) {
+                        throw new TimeoutException(format("Timed out waiting for all nodes to be globally visible. Inactive nodes: %s", allNodes.getInactiveNodes()));
+                    }
+                    break;
+                }
+                else if ((server.isCoordinator() || server.isResourceManager()) && activeNodeCount != expectedActiveNodes) {
+                    if (nanosSince(startTimeInMs).compareTo(timeout) >= 0) {
+                        throw new TimeoutException(format(
+                                "Timed out waiting for all nodes to be globally visible. Node count: %s, expected: %s",
+                                activeNodeCount, expectedActiveNodes));
+                    }
+                    break;
+                }
+                return;
            }
+            MILLISECONDS.sleep(10);


It seems that in the new logic, if the first TestingPrestoServer satisfies the check, we return immediately without looking at the remaining servers. Should we make sure that all the servers satisfy the check before returning?
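To illustrate the concern, here is a minimal sketch of a polling loop that returns only after every server passes the check in the same pass, and fails once the timeout elapses. It uses a generic predicate in place of the Presto node checks; the class `WaitForAllServers`, the method `waitUntilAll`, and its parameters are hypothetical and are not taken from the PR.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

// Hypothetical helper, not part of Presto.
final class WaitForAllServers
{
    private WaitForAllServers() {}

    // Polls until every server passes the readiness check in a single pass,
    // or throws TimeoutException once the timeout has elapsed.
    static <T> void waitUntilAll(
            List<T> servers,
            Predicate<? super T> isReady,
            Duration timeout,
            Duration pollInterval)
            throws TimeoutException, InterruptedException
    {
        long startNanos = System.nanoTime();
        while (true) {
            // allMatch stops at the first failing server and is true only
            // when no server fails, so one ready server is never enough
            if (servers.stream().allMatch(isReady)) {
                return;
            }
            if (System.nanoTime() - startNanos >= timeout.toNanos()) {
                throw new TimeoutException("Timed out waiting for all servers to pass the readiness check");
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }
}
```

The difference from the loop in the diff is where the `return` sits: here it is reached only after the whole list has been checked, instead of inside the per-server iteration.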

Fail query runner when nodes do not come up

140dcc1

tdcmeehan requested review from jaystarshot, feilong-liu, elharo, ClarenceThreepwood and a team as code owners May 23, 2025 16:25

tdcmeehan requested a review from hantangwangd May 23, 2025 16:25

prestodb-ci added the from:IBM label May 23, 2025

prestodb-ci requested review from a team, BryanCutler and anandamideShakyan and removed request for a team May 23, 2025 16:25

aditi-pandit reviewed May 24, 2025

hantangwangd reviewed May 24, 2025
