[deployment,docs] fix: select local OCI runtime endpoint by network path by C23333 · Pull Request #44 · verl-project/uni-agent

C23333 · 2026-05-27T07:04:27Z

What does this PR do?

This PR fixes how LocalDeployment builds the RemoteRuntimeConfig endpoint for Docker/Podman sandboxes.

For Docker/Podman, the sandbox server listens on runtime_port inside the container. When Uni-Agent connects from the host through port publishing, it must connect to published_port.

Example:

docker run -p 4567:8000 ...

In this case:

runtime_port = 8000: the port where swerex.server listens inside the sandbox container
published_port = 4567: the host-side port that Uni-Agent connects to

The previous implementation always used runtime_port for the runtime client. That breaks the default host-side Docker/Podman path because Uni-Agent would try 127.0.0.1:8000 instead of 127.0.0.1:4567.

This PR resolves the runtime endpoint from the actual Docker/Podman access path:

host-side port publishing uses 127.0.0.1:published_port
Uni-Agent running inside a container can use the sandbox container IP with runtime_port
host networking uses runtime_port and skips port publishing
explicit host follows the same port rule based on whether Uni-Agent is running on the host or inside a container

It also detects the Podman container marker at /run/.containerenv when checking whether Uni-Agent itself is running inside a container.

Why is this needed?

Local Docker/Podman deployment has more than one valid access path.

Host-side port publishing:

Uni-Agent on host -> http://127.0.0.1:4567 -> sandbox:8000

Container-to-container access:

Uni-Agent container -> http://172.18.0.9:8000

Host networking:

Uni-Agent -> http://127.0.0.1:8000

Using a single port for all paths makes one of these cases wrong. This PR keeps those paths separate in LocalDeployment and adds regression coverage for the Docker/Podman endpoint matrix.

Checklist Before Starting

PR title follows [{modules}] {type}: {description}
Related issue / PR: N/A

Test

python -m pytest tests/deployment/test_local_deployment.py -q

22 passed

python -m ruff check uni_agent/deployment/local/deployment.py tests/deployment/test_local_deployment.py

All checks passed!

python -m ruff format --check uni_agent/deployment/local/deployment.py tests/deployment/test_local_deployment.py

2 files already formatted

git diff --check

passed

gemini-code-assist

Code Review

This pull request introduces the runtime_port configuration and refactors how the OCI container runtime endpoint is resolved based on the network configuration. The review feedback correctly identifies two issues in the _get_runtime_endpoint resolution logic: a bug when connecting from within a container using a custom network, and a cross-platform compatibility issue when running on a host with a custom network (such as on macOS/Windows where container IPs are not routable). A code suggestion is provided to ensure that container IP and runtime_port are only used when actually running inside a container or using host networking.

gemini-code-assist · 2026-05-27T07:06:24Z

+    def _get_runtime_endpoint(self, container_name: str, published_port: int, network: str | None) -> tuple[str, int]:
        if self._config.host:
-            return self._config.host
+            port = self._config.runtime_port if network == _HOST_NETWORK else published_port
+            return self._config.host, port
+
+        if network == _HOST_NETWORK:
+            return "http://127.0.0.1", self._config.runtime_port

-        if _is_running_in_container() or self._config.network:
+        if network or _is_running_in_container():
            container_ip = self._get_container_ip(container_name)
            if container_ip:
-                return f"http://{container_ip}"
+                return f"http://{container_ip}", self._config.runtime_port

-        return "http://127.0.0.1"
+        return "http://127.0.0.1", published_port


There are two issues in the port and host resolution logic within _get_runtime_endpoint:

Bug when self._config.host is set inside a container:
If Uni-Agent is running inside a container (_is_running_in_container() is True) and network is not "host" (e.g., it is None or a custom bridge network), network == _HOST_NETWORK evaluates to False. This causes the code to select published_port. However, container-to-container communication bypasses port publishing and must connect directly to the target container's runtime_port. Attempting to connect to published_port will fail.

Cross-platform compatibility issue when running on the host with a custom network:
If Uni-Agent is running on the host (_is_running_in_container() is False) but a custom network is specified, network or _is_running_in_container() evaluates to True. This forces the use of container_ip and runtime_port. On macOS and Windows (Docker Desktop), container IPs are not routable from the host, causing the connection to fail. Since the port is published to the host's localhost anyway, we should connect via 127.0.0.1:published_port when running on the host.

Suggested Fix

We can resolve both issues by ensuring we only use runtime_port and container_ip when Uni-Agent is actually running inside a container (or when using host networking).

Note: If you apply this change, you will need to update the following tests in tests/deployment/test_local_deployment.py to mock _is_running_in_container to return True (since they simulate container-to-container network paths):

test_oci_runtime_uses_container_port_when_connecting_over_container_network

test_oci_runtime_falls_back_to_published_port_when_container_ip_is_unavailable

Suggested change

def _get_runtime_endpoint(self, container_name: str, published_port: int, network: str | None) -> tuple[str, int]:

if self._config.host:

return self._config.host

port = self._config.runtime_port if network == _HOST_NETWORK else published_port

return self._config.host, port

if network == _HOST_NETWORK:

return "http://127.0.0.1", self._config.runtime_port

if _is_running_in_container() or self._config.network:

if network or _is_running_in_container():

container_ip = self._get_container_ip(container_name)

if container_ip:

return f"http://{container_ip}"

return f"http://{container_ip}", self._config.runtime_port

return "http://127.0.0.1"

return "http://127.0.0.1", published_port

def _get_runtime_endpoint(self, container_name: str, published_port: int, network: str | None) -> tuple[str, int]:

if self._config.host:

port = self._config.runtime_port if (network == _HOST_NETWORK or _is_running_in_container()) else published_port

return self._config.host, port

if network == _HOST_NETWORK:

return "http://127.0.0.1", self._config.runtime_port

if _is_running_in_container():

container_ip = self._get_container_ip(container_name)

if container_ip:

return f"http://{container_ip}", self._config.runtime_port

return "http://127.0.0.1", published_port

C23333 · 2026-05-28T03:21:03Z

/gemini review

gemini-code-assist

Code Review

This pull request updates the local deployment logic to support dynamic runtime port selection depending on whether the agent is running inside a container, on host networking, or over a custom OCI network. It also adds support for detecting Podman containers via /run/.containerenv and includes extensive unit tests. The reviewer feedback points out that when running inside a container, if the host is explicitly set to a loopback address (such as host.docker.internal or localhost), the connection back to the host should use the published port instead of the internal runtime port. The reviewer provides a code suggestion to fix this and requests an additional test case to verify the behavior.

gemini-code-assist · 2026-05-28T03:23:00Z

        if self._config.host:
-            return self._config.host
+            port = (
+                self._config.runtime_port if network == _HOST_NETWORK or _is_running_in_container() else published_port
+            )
+            return self._config.host, port


When _is_running_in_container() is True and self._config.host is explicitly set to a host-loopback address (such as host.docker.internal, localhost, or 127.0.0.1), the connection from within the container back to the host must go through the host's published port rather than the container's internal runtime_port. Currently, the code incorrectly selects runtime_port for all cases where _is_running_in_container() is True.

Suggested change

if self._config.host:

return self._config.host

port = (

self._config.runtime_port if network == _HOST_NETWORK or _is_running_in_container() else published_port

)

return self._config.host, port

if self._config.host:

is_host_loopback = any(h in self._config.host for h in ("localhost", "127.0.0.1", "0.0.0.0", "host.docker.internal"))

port = (

self._config.runtime_port

if network == _HOST_NETWORK or (_is_running_in_container() and not is_host_loopback)

else published_port

)

return self._config.host, port

gemini-code-assist · 2026-05-28T03:23:00Z

+def test_oci_runtime_uses_container_port_with_explicit_host_from_container(monkeypatch):
+    deployment = LocalDeployment(
+        run_id="test",
+        type="local",
+        container_runtime="docker",
+        image="python:3.12",
+        host="http://sandbox",
+        runtime_port=8000,
+        network="agent-net",
+    )
+    deployment._runtime_exec = lambda args, check=True: _completed_container()
+    monkeypatch.setattr(local_deployment, "_is_running_in_container", lambda: True)
+    monkeypatch.setattr(local_deployment, "LocalRuntime", _CapturingRuntime)
+    _CapturingRuntime.configs = []
+
+    asyncio.run(deployment._start_oci_container(token="secret", container_name="sandbox-name", published_port=4567))
+
+    runtime_config, run_id = _CapturingRuntime.configs[-1]
+    assert run_id == "test"
+    assert runtime_config.host == "http://sandbox"
+    assert runtime_config.port == 8000


Add a test case to verify that when connecting from within a container back to the host via a host-loopback address (like host.docker.internal), the runtime correctly uses the published port instead of the container's internal runtime_port.

def test_oci_runtime_uses_container_port_with_explicit_host_from_container(monkeypatch): deployment = LocalDeployment( run_id="test", type="local", container_runtime="docker", image="python:3.12", host="http://sandbox", runtime_port=8000, network="agent-net", ) deployment._runtime_exec = lambda args, check=True: _completed_container() monkeypatch.setattr(local_deployment, "_is_running_in_container", lambda: True) monkeypatch.setattr(local_deployment, "LocalRuntime", _CapturingRuntime) _CapturingRuntime.configs = [] asyncio.run(deployment._start_oci_container(token="secret", container_name="sandbox-name", published_port=4567)) runtime_config, run_id = _CapturingRuntime.configs[-1] assert run_id == "test" assert runtime_config.host == "http://sandbox" assert runtime_config.port == 8000 def test_oci_runtime_uses_published_port_with_host_loopback_from_container(monkeypatch): deployment = LocalDeployment( run_id="test", type="local", container_runtime="docker", image="python:3.12", host="http://host.docker.internal", runtime_port=8000, network="agent-net", ) deployment._runtime_exec = lambda args, check=True: _completed_container() monkeypatch.setattr(local_deployment, "_is_running_in_container", lambda: True) monkeypatch.setattr(local_deployment, "LocalRuntime", _CapturingRuntime) _CapturingRuntime.configs = [] asyncio.run(deployment._start_oci_container(token="secret", container_name="sandbox-name", published_port=4567)) runtime_config, run_id = _CapturingRuntime.configs[-1] assert run_id == "test" assert runtime_config.host == "http://host.docker.internal" assert runtime_config.port == 4567

C23333 · 2026-05-28T05:05:55Z

/gemini review

gemini-code-assist

Code Review

This pull request updates the local deployment logic to correctly handle port and host resolution when running Uni-Agent inside or outside of OCI containers (Docker/Podman), particularly with custom or host networks. It introduces a new runtime_port configuration option and refactors the runtime endpoint selection. The review feedback correctly points out that loopback addresses (such as localhost and 127.0.0.1) should be treated as host gateway names when Uni-Agent runs inside a container to prevent incorrect port selection and connection failures, and suggests corresponding test updates.

gemini-code-assist · 2026-05-28T05:07:31Z

 _APPTAINER_RUNTIMES = {"apptainer", "singularity"}
 _CONTAINER_RUNTIME_ENV_VARS = ("UNI_AGENT_CONTAINER_RUNTIME", "LOCAL_CONTAINER_RUNTIME")
 _DEFAULT_CONTAINER_RUNTIME_CANDIDATES = ("apptainer", "singularity", "docker", "podman")
+_HOST_GATEWAY_NAMES = {"host.docker.internal", "host.containers.internal"}


Loopback addresses like localhost, 127.0.0.1, 0.0.0.0, and ::1 should be treated as host gateway names when Uni-Agent is running inside a container (e.g., with host networking). If they are not included in _HOST_GATEWAY_NAMES, the runtime endpoint will incorrectly select runtime_port instead of published_port when connecting to the host via loopback, causing connection failures.

_HOST_GATEWAY_NAMES = { "host.docker.internal", "host.containers.internal", "localhost", "127.0.0.1", "0.0.0.0", "::1", }

gemini-code-assist · 2026-05-28T05:07:31Z

+
+
+def test_oci_runtime_uses_published_port_with_host_gateway_from_container(monkeypatch):
+    for host in ("http://host.docker.internal", "http://host.containers.internal"):


Update the test to also verify that loopback addresses (localhost, 127.0.0.1, etc.) are correctly identified as host gateway names and use the published port when connecting from a container.

Suggested change

for host in ("http://host.docker.internal", "http://host.containers.internal"):

for host in ("http://host.docker.internal", "http://host.containers.internal", "http://localhost", "http://127.0.0.1"):

host.docker.internal and host.containers.internal are host-gateway names, so they should use the published port.

localhost, 127.0.0.1, 0.0.0.0, and ::1 are different. In a normal bridge network, loopback points to the current container, not the host, and 0.0.0.0 is mainly a bind address rather than a client endpoint. So they are intentionally not treated as host-gateway names here.

yyDing1 · 2026-05-29T07:23:44Z

Could you help review this PR? @Jianbing-D

@aoshen02

…47) ### What does this PR do? Three interlocked changes to `ModalDeployment` to survive multi-day RL training at `agent_concurrency >= 128`. All three respond to distinct failure modes hit on a Qwen3-235B SWE-bench campaign across ~1.4M trajectories on Modal. **1) Cap concurrent sandboxes in the STARTING state.** Dominant failure at high concurrency is `Runtime did not start within 299s` — too many sandboxes simultaneously in Modal's runtime-start pipeline, not the sandbox CREATE rate (Modal Support confirmed 30/s + 2k burst, far above our ~1-2/s). Adds an `asyncio.Semaphore` keyed on a fleet-wide `MODAL_MAX_STARTING` (default 128), divided by `UNIAGENT_NUM_WORKERS` to derive each worker's share — same pattern already used for the per-worker `agent_loop` semaphore. The permit is held from `sandbox.create` through `runtime.create_session`, then released so the long tool-call body does not occupy a permit. **2) Bound a single trajectory's init wall-clock.** Pre-patch: `max_retries=5 × startup_timeout=300s = up to 25 min` of one trajectory hogging a STARTING permit. Reduce `max_retries` to 2 and add `MODAL_INIT_WALL_BUDGET` (default 900s = 15 min) as a hard cap. `asyncio.wait_for` around each `_start()` prevents a hung `modal.Sandbox.create` from blocking past the deadline. When the budget is exhausted the trajectory raises; the outer `agent_loop` converts it into a `reward=0` masked sample. **3) Guarantee Modal sandbox termination on teardown.** Observed (round 12, 2026-05-18): `self._runtime.close()` raised `aiohttp.ServerDisconnectedError` when the in-sandbox agent server had already torn down its socket. Without `try/except`, `stop()` returned early and `self._sandbox.terminate.aio()` never ran. After thousands of trajectories ~847 sandboxes were leaked, hitting the account's concurrent-sandbox cap and 100% failing new creates on subsequent runs. Wraps each step; guarantees terminate is attempted (and retried once) even when `runtime.close` fails. ### Checklist Before Starting - [x] Search for similar PRs/issues: - `gh pr list --repo verl-project/uni-agent --state open --search "modal"` → none touching `modal/deployment.py` (PR #44 touches `local/deployment.py` only) - `gh issue list --search "modal sandbox"` → only RFC-level (#31 OpenYuanrong, #43 local SWE-Bench), nothing on STARTING-state limiter - [x] Format the PR title as `[deployment] fix: ...` (single module, single type) ### Test Unit tests added in `tests/deployment/test_modal_starting_limiter.py`. ```bash pytest tests/deployment/test_modal_starting_limiter.py -v # 12 passed in 64.76s ``` Test coverage: - Per-worker share math (`MAX_STARTING / NUM_WORKERS`, incl. clamp-to-1) - Singleton + lazy init of `_STARTING_SEMA` - `max_retries=2` (proves regression from old `=5`) - Wall-budget short-circuits subsequent attempts when exhausted - `asyncio.wait_for` cancels a hung `_start` - Serialization across concurrent `_start` callers (peak ≤ permits) - Permit released on failure path (no leak → no deadlock) **Production validation**: 6+ consecutive 8-hour rollouter sessions without the "299s runtime not start" cluster-collapse we hit pre-patch; sandbox leak count returned to ~0 (was steady at 50-100/hr). ### API and Usage Example Three new env-var knobs (all optional; defaults pre-tuned for 8 trays × `agent_concurrency=360`): ```bash export UNIAGENT_NUM_WORKERS=8 # rollouter worker count export MODAL_MAX_STARTING=128 # fleet-wide STARTING cap export MODAL_INIT_WALL_BUDGET=900 # per-trajectory init seconds ``` No public API changes. Existing callers of `ModalDeployment.start()` / `.stop()` work unchanged. ### Design & Code Changes - `_get_starting_semaphore()` — lazy singleton, constructed inside the event loop on first use; per-worker share computed as `max(1, MAX_STARTING // NUM_WORKERS)`. - `_start()` now wraps the entire create-tunnel-runtime block in `async with _get_starting_semaphore()`, releasing immediately after `runtime.create_session` returns. - `start()` retry loop: `max_retries=5→2`, exp-backoff cap `30s→10s`, outer `MODAL_INIT_WALL_BUDGET` deadline, per-attempt `asyncio.wait_for(_start(), timeout=max(60, remaining))`. - `stop()`: each cleanup step is `try/except`-wrapped so a transient failure in `runtime.close` does not skip `sandbox.terminate`; `terminate.aio()` retried once if first attempt fails. ### Checklist Before Submitting - [x] Read the Contribute Guide - [x] `pre-commit run --files uni_agent/deployment/modal/deployment.py tests/deployment/test_modal_starting_limiter.py` — all hooks passed (ruff, ruff format, mypy, compile-all) - [x] Tests added (`tests/deployment/test_modal_starting_limiter.py`, 12 tests, runtime-agnostic — no Modal network calls) - [x] AI assistance was used (Claude Code); the submitting human (@aoshen02) reviewed every changed line and ran the test suite. - [x] Not duplicating any open PR (PR #44 changes `local/deployment.py`, not `modal/`) --------- Signed-off-by: aoshen02 <aoshen@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: yyDing1 <yyding.me@gmail.com>

[deployment,docs] fix local OCI published port

4e6c99d

gemini-code-assist Bot reviewed May 27, 2026

View reviewed changes

C23333 marked this pull request as draft May 27, 2026 07:47

[deployment] fix OCI endpoint for container network paths

d4cd231

C23333 changed the title ~~[deployment,docs] fix: select local OCI runtime port by host~~ [deployment,docs] fix: select local OCI runtime endpoint by network path May 28, 2026

C23333 marked this pull request as ready for review May 28, 2026 03:16

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

C23333 marked this pull request as draft May 28, 2026 03:30

[deployment] handle host-gateway OCI endpoints

432bbbd

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

C23333 marked this pull request as ready for review May 28, 2026 05:11

aoshen02 mentioned this pull request May 28, 2026

[deployment] fix: harden Modal sandbox lifecycle at high concurrency #47

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[deployment,docs] fix: select local OCI runtime endpoint by network path#44

[deployment,docs] fix: select local OCI runtime endpoint by network path#44
C23333 wants to merge 3 commits into
verl-project:mainfrom
C23333:codex/fix-local-oci-published-port

C23333 commented May 27, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 27, 2026

Uh oh!

C23333 commented May 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

C23333 commented May 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

C23333 May 28, 2026

Uh oh!

yyDing1 commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants



		def test_oci_runtime_uses_published_port_with_host_gateway_from_container(monkeypatch):
		for host in ("http://host.docker.internal", "http://host.containers.internal"):

Conversation

C23333 commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Why is this needed?

Checklist Before Starting

Test

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Suggested Fix

Uh oh!

C23333 commented May 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

C23333 commented May 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

C23333 May 28, 2026

Choose a reason for hiding this comment

Uh oh!

yyDing1 commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

C23333 commented May 27, 2026 •

edited

Loading