[SPARK-51667][SS][PYTHON] Disable Nagle's algorithm (via TCP_NODELAY = true) in TWS + PySpark for python <-> state server #50460
Conversation
Is this not a problem for two back-to-back calls to
I think we do not design for any "asynchronous" or "concurrent" execution of state interactions. Every request to the state server is "blocking". As an implementation detail, we limit the amount of data per call for iterators on state interactions, and another state interaction can be made while the consumption of the iterator has not finished, but this does not mean we support "concurrent" state interactions.
cc. @HyukjinKwon @cloud-fan PTAL, thanks!
cc. @bogao007 since he authored the code.
Thanks! Merging to master/4.0 (since the delay is not trivial).
Closes #50460 from HeartSaVioR/SPARK-51667. Authored-by: Jungtaek Lim <[email protected]> Signed-off-by: Jungtaek Lim <[email protected]> (cherry picked from commit a760df7) Signed-off-by: Jungtaek Lim <[email protected]>
What changes were proposed in this pull request?
This PR proposes to disable Nagle's algorithm (TCP_NODELAY = true) for the connection between Python worker and state server, in TWS + PySpark.
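As a rough sketch of what "TCP_NODELAY = true" means on the Python worker side (the socket name and setup here are illustrative, not the actual PySpark code), the option is set via `setsockopt` on the connection to the state server:

```python
import socket

# Illustrative sketch: after the Python worker creates its socket to the
# state server, disable Nagle's algorithm so each flushed write is sent
# immediately instead of being held back while prior data is unacknowledged.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect (nonzero means enabled).
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(nodelay != 0)
sock.close()
```

The JVM side would set the equivalent option via `Socket.setTcpNoDelay(true)`.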
Why are the changes needed?
We have observed a consistent latency increase of slightly more than 40 ms from specific state interactions, e.g. ListState.put() / ListState.get() / ListState.appendList().
The root cause turned out to be the bad combination of Nagle's algorithm and delayed ACK. The sequence is as follows:

1. Python worker sends the proto message to JVM, and flushes the socket.
2. Additionally, Python worker sends the follow-up data to JVM, and flushes the socket.
3. JVM reads the proto message, and realizes there is follow-up data.
4. JVM reads the follow-up data.
5. JVM processes the request, and sends the response back to Python worker.
Due to delayed ACK, even after step 3, the ACK is not sent back from the JVM to the Python worker. The JVM's TCP stack is waiting either for outgoing data to piggyback the ACK on, or for multiple ACKs to accumulate, but the JVM is not going to send any data during that phase.
Due to Nagle's algorithm, the message from step 2 is not sent to the JVM since there is no ACK for the message from step 1 (there is an in-flight unacknowledged message).
This deadlock is resolved only after the delayed-ACK timeout, which is 40 ms (the minimum duration) on Linux. After the timeout, the ACK is sent back from the JVM to the Python worker, and Nagle's algorithm finally allows the message from step 2 to be sent to the JVM.
The direction can also be flipped depending on the command: the same thing can happen in the opposite direction of communication, from the JVM to the Python worker.
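The write pattern above can be sketched as a small self-contained loopback example. The server below is a hypothetical stand-in for the JVM state server (a 4-byte length header standing in for the proto message, then the payload as follow-up data); it is not the actual protocol. The two separate `sendall` calls on the client mirror steps 1 and 2, which is exactly where Nagle's algorithm would hold back the second write if TCP_NODELAY were off and the ACK for the first write were delayed:

```python
import socket
import threading

def recv_exact(conn, n):
    """Read exactly n bytes from the connection."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    return buf

def run_server(port_holder, ready):
    # Stand-in for the JVM state server: read the header (step 3),
    # read the follow-up data (step 4), then respond (step 5).
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    port_holder.append(srv.getsockname()[1])
    srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    length = int.from_bytes(recv_exact(conn, 4), "big")
    payload = recv_exact(conn, length)
    conn.sendall(payload)  # echo back as the "response"
    conn.close()
    srv.close()

port_holder, ready = [], threading.Event()
t = threading.Thread(target=run_server, args=(port_holder, ready))
t.start()
ready.wait()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port_holder[0]))
# With TCP_NODELAY set, the second sendall leaves immediately instead of
# waiting (up to the ~40 ms delayed-ACK timeout) for the first to be ACKed.
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
payload = b"list-state-entries"
cli.sendall(len(payload).to_bytes(4, "big"))  # step 1: header, flushed
cli.sendall(payload)                          # step 2: follow-up, flushed
echoed = recv_exact(cli, len(payload))
print(echoed == payload)
cli.close()
t.join()
```

Note that on loopback the delayed-ACK stall is usually not observable, so this sketch only demonstrates where the option applies, not the 40 ms latency itself.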
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manually tested (by adding a debug log to measure the time spent on each state interaction).
Beyond that, this should pass the existing tests, which will be verified by CI.
Was this patch authored or co-authored using generative AI tooling?
No.