12 changes: 9 additions & 3 deletions skyrl/backends/ray_jax.py
@@ -10,9 +10,15 @@


 def _get_random_port() -> int:
-    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", 0))
-        return s.getsockname()[1]
+    # try a few different ports in case another process is using randomly assigned port
+    for _ in range(10):
+        try:
+            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+                s.bind(("", 0))
+                return s.getsockname()[1]
+        except OSError:
+            continue
+    raise RuntimeError("Could not allocate a free port")
Comment on lines +13 to +21
Contributor


high

The retry logic added here is unlikely to resolve the common "port in use" issues encountered during JAX initialization, and the accompanying comment is somewhat misleading.

  1. Misleading Comment: socket.bind(("", 0)) requests any available port from the OS. It will not return a port that is already in use. An OSError here typically indicates system-wide ephemeral port exhaustion or a network configuration issue, not a collision with a specific port.
  2. Race Condition: The port is released as soon as the with block ends (line 18), so another process can bind it before JAX does in the setup method, which runs much later in the orchestration. Retrying inside _get_random_port does nothing to close this window.
  3. Tight Loop: The loop retries immediately without any backoff or delay. If the system is indeed out of ports, a tight loop is inefficient and unlikely to succeed.
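
The race described in point 2 can be reproduced in a single process. This is a minimal sketch, not code from the PR: `get_random_port` mirrors the original helper, and the `squatter` socket is a hypothetical stand-in for another process that binds the port after the helper has released it.

```python
import socket


def get_random_port() -> int:
    # Same approach as the diff: bind to port 0, read the port, then close.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


port = get_random_port()

# The helper's socket is closed, so nothing reserves the port any more.
# A second socket stands in for "another process" that grabs it first.
squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
squatter.bind(("", port))
squatter.listen(1)

# The intended consumer arrives late and can no longer bind the port.
late = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    late.bind(("", port))
    collided = False
except OSError:
    collided = True
finally:
    late.close()
    squatter.close()

print(f"port {port} collided: {collided}")
```

The helper itself never raised, so a retry loop inside it would not have been triggered.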

To effectively handle port collisions, the retry logic should ideally encompass the jax.distributed.initialize call or the orchestration step where the port is actually consumed.
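
One way to read this suggestion is a retry that wraps the step consuming the port, with a backoff between attempts. The sketch below is illustrative only: `start_with_fresh_port` and `open_listener` are hypothetical names, and catching `OSError` is an assumption, since the exception raised when JAX fails to bind may differ. In the PR's setting the `start` callable would wrap the `jax.distributed.initialize` call or the surrounding orchestration step, and every worker would need to learn the new coordinator address on each retry.

```python
import socket
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def get_random_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def start_with_fresh_port(
    start: Callable[[int], T],
    attempts: int = 5,
    base_delay: float = 0.1,
) -> T:
    """Hypothetical helper: pick a port, hand it to `start`, retry on failure."""
    last_error: Optional[OSError] = None
    for attempt in range(attempts):
        port = get_random_port()
        try:
            # The retry covers the point where the port is actually consumed.
            return start(port)
        except OSError as err:  # assumption: the consumer surfaces an OSError
            last_error = err
            if attempt < attempts - 1:
                # Exponential backoff instead of a tight loop.
                time.sleep(base_delay * 2**attempt)
    raise RuntimeError(
        f"Could not start on a free port after {attempts} attempts"
    ) from last_error


def open_listener(port: int) -> socket.socket:
    # Stand-in consumer for the sketch: binds and listens on the given port.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        srv.bind(("", port))
        srv.listen(1)
    except OSError:
        srv.close()
        raise
    return srv


listener = start_with_fresh_port(open_listener)
bound_port = listener.getsockname()[1]
listener.close()
```

A collision now triggers a new port and a new attempt at the consuming step, rather than a repeat of a bind that had already succeeded.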



@ray.remote